aboutsummaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2014-06-18netfilter: Fix potential use after free in ip6_route_me_harder()Sergey Popovich
commit a8951d5814e1373807a94f79f7ccec7041325470 upstream. Dst is released one line before we access it again with dst->error. Fixes: 58e35d147128 netfilter: ipv6: propagate routing errors from ip6_route_me_harder() Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ipv6: gro: fix CHECKSUM_COMPLETE supportEric Dumazet
[ Upstream commit 4de462ab63e23953fd05da511aeb460ae10cc726 ] When GRE support was added in linux-3.14, CHECKSUM_COMPLETE handling broke on GRE+IPv6 because we did not update/use the appropriate csum : GRO layer is supposed to use/update NAPI_GRO_CB(skb)->csum instead of skb->csum Tested using a GRE tunnel and IPv6 traffic. GRO aggregation now happens at the first level (ethernet device) instead of being done in gre tunnel. Native IPv6+TCP is still properly aggregated. Fixes: bf5a755f5e918 ("net-gre-gro: Add GRE support to the GRO stack") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ipv6: fix calculation of option len in ip6_append_dataHannes Frederic Sowa
[ Upstream commit 3a1cebe7e05027a1c96f2fc1a8eddf5f19b78f42 ] tot_len does specify the size of struct ipv6_txoptions. We need opt_flen + opt_nflen to calculate the overall length of additional ipv6 extensions. I found this while auditing the ipv6 output path for a memory corruption reported by Alexey Preobrazhensky while he fuzzed an instrumented AddressSanitizer kernel with trinity. This may or may not be the cause of the original bug. Fixes: 4df98e76cde7c6 ("ipv6: pmtudisc setting not respected with UFO/CORK") Reported-by: Alexey Preobrazhensky <preobr@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ip6_tunnel: fix potential NULL pointer dereferenceSusant Sahani
[ Upstream commit c8965932a2e3b70197ec02c6741c29460279e2a8 ] The function ip6_tnl_validate assumes that the rtnl attribute IFLA_IPTUN_PROTO always be filled . If this attribute is not filled by the userspace application kernel get crashed with NULL pointer dereference. This patch fixes the potential kernel crash when IFLA_IPTUN_PROTO is missing . Signed-off-by: Susant Sahani <susant@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18net: ipv6: send pkttoobig immediately if orig frag size > mtuFlorian Westphal
[ Upstream commit 418a31561d594a2b636c1e2fa94ecd9e1245abb1 ] If conntrack defragments incoming ipv6 frags it stores largest original frag size in ip6cb and sets ->local_df. We must thus first test the largest original frag size vs. mtu, and not vice versa. Without this patch PKTTOOBIG is still generated in ip6_fragment() later in the stack, but 1) IPSTATS_MIB_INTOOBIGERRORS won't increment 2) packet did (needlessly) traverse netfilter postrouting hook. Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ipv6: fib: fix fib dump restartKumar Sundararajan
[ Upstream commit 1c2658545816088477e91860c3a645053719cb54 ] When the ipv6 fib changes during a table dump, the walk is restarted and the number of nodes dumped are skipped. But the existing code doesn't advance to the next node after a node is skipped. This can cause the dump to loop or produce lots of duplicates when the fib is modified during the dump. This change advances the walk to the next node if the current node is skipped after a restart. Signed-off-by: Kumar Sundararajan <kumar@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ip6_gre: don't allow to remove the fb_tunnel_devNicolas Dichtel
[ Upstream commit 54d63f787b652755e66eb4dd8892ee6d3f5197fc ] It's possible to remove the FB tunnel with the command 'ip link del ip6gre0' but this is unsafe, the module always supposes that this device exists. For example, ip6gre_tunnel_lookup() may use it unconditionally. Let's add a rtnl handler for dellink, which will never remove the FB tunnel (we let ip6gre_destroy_tunnels() do the job). Introduced by commit c12b395a4664 ("gre: Support GRE over IPv6"). CC: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ipv6: Limit mtu to 65575 bytesEric Dumazet
[ Upstream commit 30f78d8ebf7f514801e71b88a10c948275168518 ] Francois reported that setting big mtu on loopback device could prevent tcp sessions making progress. We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets. We must limit the IPv6 MTU to (65535 + 40) bytes in theory. Tested: ifconfig lo mtu 70000 netperf -H ::1 Before patch : Throughput : 0.05 Mbits After patch : Throughput : 35484 Mbits Reported-by: Francois WELLENREITER <f.wellenreiter@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18netfilter: Can't fail and free after table replacementThomas Graf
commit c58dd2dd443c26d856a168db108a0cd11c285bf3 upstream. All xtables variants suffer from the defect that the copy_to_user() to copy the counters to user memory may fail after the table has already been exchanged and thus exposed. Return an error at this point will result in freeing the already exposed table. Any subsequent packet processing will result in a kernel panic. We can't copy the counters before exposing the new tables as we want provide the counter state after the old table has been unhooked. Therefore convert this into a silent error. Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18ipv6: some ipv6 statistic counters failed to disable bhHannes Frederic Sowa
[ Upstream commit 43a43b6040165f7b40b5b489fe61a4cb7f8c4980 ] After commit c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue") some counters are now updated in process context and thus need to disable bh before doing so, otherwise deadlocks can happen on 32-bit archs. Fabio Estevam noticed this while while mounting a NFS volume on an ARM board. As a compensation for missing this I looked after the other *_STATS_BH and found three other calls which need updating: 1) icmp6_send: ip6_fragment -> icmpv6_send -> icmp6_send (error handling) 2) ip6_push_pending_frames: rawv6_sendmsg -> rawv6_push_pending_frames -> ... (only in case of icmp protocol with raw sockets in error handling) 3) ping6_v6_sendmsg (error handling) Fixes: c15b1ccadb323ea ("ipv6: move DAD and addrconf_verify processing to workqueue") Reported-by: Fabio Estevam <festevam@gmail.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-15Automatically merging tracking-linaro-android-3.14 into ↵Andrey Konovalov
merge-linux-linaro-core-tracking Conflicting files:
2014-03-28ipv6: move DAD and addrconf_verify processing to workqueueHannes Frederic Sowa
addrconf_join_solict and addrconf_join_anycast may cause actions which need rtnl locked, especially on first address creation. A new DAD state is introduced which defers processing of the initial DAD processing into a workqueue. To get rtnl lock we need to push the code paths which depend on those calls up to workqueues, specifically addrconf_verify and the DAD processing. (v2) addrconf_dad_failure needs to be queued up to the workqueue, too. This patch introduces a new DAD state and stop the DAD processing in the workqueue (this is because of the possible ipv6_del_addr processing which removes the solicited multicast address from the device). addrconf_verify_lock is removed, too. After the transition it is not needed any more. As we are not processing in bottom half anymore we need to be a bit more careful about disabling bottom half out when we lock spin_locks which are also used in bh. Relevant backtrace: [ 541.030090] RTNL: assertion failed at net/core/dev.c (4496) [ 541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.10.33-1-amd64-vyatta #1 [ 541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 541.031146] ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8 [ 541.031148] 0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18 [ 541.031150] 0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000 [ 541.031152] Call Trace: [ 541.031153] <IRQ> [<ffffffff8148a9f0>] ? dump_stack+0xd/0x17 [ 541.031180] [<ffffffff813c98c1>] ? __dev_set_promiscuity+0x101/0x180 [ 541.031183] [<ffffffff813d3540>] ? __hw_addr_create_ex+0x60/0xc0 [ 541.031185] [<ffffffff813cfe1a>] ? __dev_set_rx_mode+0xaa/0xc0 [ 541.031189] [<ffffffff813d3a81>] ? __dev_mc_add+0x61/0x90 [ 541.031198] [<ffffffffa01dcf9c>] ? igmp6_group_added+0xfc/0x1a0 [ipv6] [ 541.031208] [<ffffffff8111237b>] ? kmem_cache_alloc+0xcb/0xd0 [ 541.031212] [<ffffffffa01ddcd7>] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6] [ 541.031216] [<ffffffffa01c2fae>] ? addrconf_join_solict+0x2e/0x40 [ipv6] [ 541.031219] [<ffffffffa01ba2e9>] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6] [ 541.031223] [<ffffffffa01c0772>] ? addrconf_join_anycast+0x92/0xa0 [ipv6] [ 541.031226] [<ffffffffa01c311e>] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6] [ 541.031229] [<ffffffffa01c3213>] ? ipv6_ifa_notify+0x33/0x50 [ipv6] [ 541.031233] [<ffffffffa01c36c8>] ? addrconf_dad_completed+0x28/0x100 [ipv6] [ 541.031241] [<ffffffff81075c1d>] ? task_cputime+0x2d/0x50 [ 541.031244] [<ffffffffa01c38d6>] ? addrconf_dad_timer+0x136/0x150 [ipv6] [ 541.031247] [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6] [ 541.031255] [<ffffffff8105313a>] ? call_timer_fn.isra.22+0x2a/0x90 [ 541.031258] [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6] Hunks and backtrace stolen from a patch by Stephen Hemminger. Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20ip6mr: fix mfc notification flagsNicolas Dichtel
Commit 812e44dd1829 ("ip6mr: advertise new mfc entries via rtnl") reuses the function ip6mr_fill_mroute() to notify mfc events. But this function was used only for dump and thus was always setting the flag NLM_F_MULTI, which is wrong in case of a single notification. Libraries like libnl will wait forever for NLMSG_DONE. CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-19netfilter: ipv6: fix crash caused by ipv6_find_hdr()JP Abgrall
When calling: ipv6_find_hdr(skb, &thoff, -1, NULL) on a fragmented packet, thoff would be left with a random value causing callers to read random memory offsets with: skb_header_pointer(skb, thoff, ...) Now we force ipv6_find_hdr() to return a failure in this case. Calling: ipv6_find_hdr(skb, &thoff, -1, &fragoff) will set fragoff as expected, and not return a failure. Change-Id: Ib474e8a4267dd2b300feca325811330329684a88 Signed-off-by: JP Abgrall <jpa@google.com>
2014-03-19net: Replace AID_NET_RAW checks with capable(CAP_NET_RAW).Chia-chi Yeh
Signed-off-by: Chia-chi Yeh <chiachi@android.com>
2014-03-19net: socket ioctl to reset connections matching local addressRobert Love
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets bound to the same local address. This is useful in situations with dynamic IPs, to kill stuck connections. Signed-off-by: Brian Swetland <swetland@google.com> net: fix tcp_v4_nuke_addr Signed-off-by: Dima Zavin <dima@android.com> net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke. We can't hold the lock while calling to tcp_done(), so we drop it before calling. We then have to start at the top of the chain again. Signed-off-by: Dima Zavin <dima@android.com> net: ipv4: Fix race in tcp_v4_nuke_addr(). To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock across tcp_done() calls. This fixed the deadlock, but introduced a race where the socket could die or change state. Fix: Before unlocking the hash table, we grab a reference to the socket. We can then unlock the hash table without risk of the socket going away. We then lock the socket, which is safe because it is pinned. We can then call tcp_done() without recursive deadlock and without race. Upon return, we unlock the socket and then unpin it, killing it. Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045 Signed-off-by: Robert Love <rlove@google.com> Acked-by: Dima Zavin <dima@android.com> ipv4: disable bottom halves around call to tcp_done(). Signed-off-by: Robert Love <rlove@google.com> Signed-off-by: Colin Cross <ccross@android.com> ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr When sk_error_report is called, it wakes up the user-space thread, which then calls tcp_close. When the tcp_close is interrupted by the tcp_v4_nuke_addr ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON. This patch moves the call to sk_error_report inside the bh_lock_sock, which matches the locking used in tcp_v4_err. Signed-off-by: Colin Cross <ccross@android.com>
2014-03-19Paranoid network.Robert Love
With CONFIG_ANDROID_PARANOID_NETWORK, require specific uids/gids to instantiate network sockets. Signed-off-by: Robert Love <rlove@google.com> paranoid networking: Use in_egroup_p() to check group membership The previous group_search() caused trouble for partners with module builds. in_egroup_p() is also cleaner. Signed-off-by: Nick Pelly <npelly@google.com> Fix 2.6.29 build. Signed-off-by: Arve Hjønnevåg <arve@android.com> net: Fix compilation of the IPv6 module Fix compilation of the IPv6 module -- current->euid does not exist anymore, current_euid() is what needs to be used. Signed-off-by: Steinar H. Gunderson <sesse@google.com> net: bluetooth: Remove the AID_NET_BT* gid numbers Removed bluetooth checks for AID_NET_BT and AID_NET_BT_ADMIN which are not useful anymore. This is in preparation for getting rid of all the AID_* gids. Signed-off-by: JP Abgrall <jpa@google.com>
2014-03-18ipv6: ip6_append_data_mtu do not handle the mtu of the second fragment properlylucien
In ip6_append_data_mtu(), when the xfrm mode is not tunnel(such as transport),the ipsec header need to be added in the first fragment, so the mtu will decrease to reserve space for it, then the second fragment come, the mtu should be turn back, as the commit 0c1833797a5a6ec23ea9261d979aa18078720b74 said. however, in the commit a493e60ac4bbe2e977e7129d6d8cbb0dd236be, it use *mtu = min(*mtu, ...) to change the mtu, which lead to the new mtu is alway equal with the first fragment's. and cannot turn back. when I test through ping6 -c1 -s5000 $ip (mtu=1280): ...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232 ...frag (1232|1216) ...frag (2448|1216) ...frag (3664|1216) ...frag (4880|164) which should be: ...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232 ...frag (1232|1232) ...frag (2464|1232) ...frag (3696|1232) ...frag (4928|116) so delete the min() when change back the mtu. Signed-off-by: Xin Long <lucien.xin@gmail.com> Fixes: 75a493e60ac4bb ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13ipv6: Avoid unnecessary temporary addresses being generatedHeiner Kallweit
tmp_prefered_lft is an offset to ifp->tstamp, not now. Therefore age needs to be added to the condition. Age calculation in ipv6_create_tempaddr is different from the one in addrconf_verify and doesn't consider ADDRCONF_TIMER_FUZZ_MINUS. This can cause age in ipv6_create_tempaddr to be less than the one in addrconf_verify and therefore unnecessary temporary address to be generated. Use age calculation as in addrconf_modify to avoid this. Signed-off-by: Heiner Kallweit <heiner.kallweit@web.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06ipv6: don't set DST_NOCOUNT for remotely added routesSabrina Dubroca
DST_NOCOUNT should only be used if an authorized user adds routes locally. In case of routes which are added on behalf of router advertisments this flag must not get used as it allows an unlimited number of routes getting added remotely. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06ipv6: Fix exthdrs offload registration.Anton Nayshtut
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds). This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS header. The issue detected and the fix verified by running MS HCK Offload LSO test on top of QEMU Windows guests, as this test sends IPv6 packets with IPPROTO_DSTOPTS. Signed-off-by: Anton Nayshtut <anton@swortex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27ipv6: ipv6_find_hdr restore prev functionalityHans Schillstrom
The commit 9195bb8e381d81d5a315f911904cdf0cfcc919b8 ("ipv6: improve ipv6_find_hdr() to skip empty routing headers") broke ipv6_find_hdr(). When a target is specified like IPPROTO_ICMPV6 ipv6_find_hdr() returns -ENOENT when it's found, not the header as expected. A part of IPVS is broken and possible also nft_exthdr_eval(). When target is -1 which it is most cases, it works. This patch exits the do while loop if the specific header is found so the nexthdr could be returned as expected. Reported-by: Art -kwaak- van Breemen <ard@telegraafnet.nl> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> CC:Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Build fix for ip_vti when NET_IP_TUNNEL is not set. We need this set to have ip_tunnel_get_stats64() available. 2) Fix a NULL pointer dereference on sub policy usage. We try to access a xfrm_state from the wrong array. 3) Take xfrm_state_lock in xfrm_migrate_state_find(), we need it to traverse through the state lists. 4) Clone states properly on migration, otherwise we crash when we migrate a state with aead algorithm attached. 5) Fix unlink race when between thread context and timer when policies are deleted. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-27net: ipv6: ping: Use socket mark in routing lookupLorenzo Colitti
Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-25ipv4: ipv6: better estimate tunnel header cut for correct ufo handlingHannes Frederic Sowa
Currently the UFO fragmentation process does not correctly handle inner UDP frames. (The following tcpdumps are captured on the parent interface with ufo disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280, both sit device): IPv6: 16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300) 192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000 16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844) 192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471 We can see that fragmentation header offset is not correctly updated. (fragmentation id handling is corrected by 916e4cf46d0204 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")). IPv4: 16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296) 192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276) 192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000 16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792) 192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772) 192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109 In this case fragmentation id is incremented and offset is not updated. First, I aligned inet_gso_segment and ipv6_gso_segment: * align naming of flags * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we always ensure that the state of this flag is left untouched when returning from upper gso segmenation function * ipv6_gso_segment: move skb_reset_inner_headers below updating the fragmentation header data, we don't care for updating fragmentation header data * remove currently unneeded comment indicating skb->encapsulation might get changed by upper gso_segment callback (gre and udp-tunnel reset encapsulation after segmentation on each fragment) If we encounter an IPIP or SIT gso skb we now check for the protocol == IPPROTO_UDP and that we at least have already traversed another ip(6) protocol header. The reason why we have to special case GSO_IPIP and GSO_SIT is that we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner protocol of GSO_UDP_TUNNEL or GSO_GRE packets. Reported-by: Wolfgang Walter <linux@stwm.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-22ipv6: reuse ip6_frag_id from ip6_ufo_append_dataHannes Frederic Sowa
Currently we generate a new fragmentation id on UFO segmentation. It is pretty hairy to identify the correct net namespace and dst there. Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst available at all. This causes unreliable or very predictable ipv6 fragmentation id generation while segmentation. Luckily we already have pregenerated the ip6_frag_id in ip6_ufo_append_data and can use it here. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20sit: fix panic with route cache in ip tunnelsNicolas Dichtel
Bug introduced by commit 7d442fab0a67 ("ipv4: Cache dst in tunnels"). Because sit code does not call ip_tunnel_init(), the dst_cache was not initialized. CC: Tom Herbert <therbert@google.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20ip6_vti: Fix build when NET_IP_TUNNEL is not set.Steffen Klassert
Since commit 469bdcefdc47a ip6_vti uses ip_tunnel_get_stats64(), so we need to select NET_IP_TUNNEL to have this function available. Fixes: 469bdcefdc ("ipv6: fix the use of pcpu_tstats in ip6_vti.c") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-02-19Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: * Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal. * Don't use the fast payload operation in nf_tables if the length is not power of 2 or it is not aligned, from Nikolay Aleksandrov. * Fix missing break statement the inet flavour of nft_reject, which results in evaluating IPv4 packets with the IPv6 evaluation routine, from Patrick McHardy. * Fix wrong kconfig symbol in nft_meta to match the routing realm, from Paul Bolle. * Allocate the NAT null binding when creating new conntracks via ctnetlink to avoid that several packets race at initializing the the conntrack NAT extension, original patch from Florian Westphal, revisited version from me. * Fix DNAT handling in the snmp NAT helper, the same handling was being done for SNAT and DNAT and 2.4 already contains that fix, from Francois-Xavier Le Bail. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17gre: add link local route when local addr is anyNicolas Dichtel
This bug was reported by Steinar H. Gunderson and was introduced by commit f7cb8886335d ("sit/gre6: don't try to add the same route two times"). root@morgental:~# ip tunnel add foo mode gre remote 1.2.3.4 ttl 64 root@morgental:~# ip link set foo up mtu 1468 root@morgental:~# ip -6 route show dev foo fe80::/64 proto kernel metric 256 but after the above commit, no such route shows up. There is no link local route because dev->dev_addr is 0 (because local ipv4 address is 0), hence no link local address is configured. In this scenario, the link local address is added manually: 'ip -6 addr add fe80::1 dev foo' and because prefix is /128, no link local route is added by the kernel. Even if the right things to do is to add the link local address with a /64 prefix, we need to restore the previous behavior to avoid breaking userpace. Reported-by: Steinar H. Gunderson <sesse@samfundet.no> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=nFlorian Westphal
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get lots of "TRACE: filter:output:policy:1 IN=..." warnings as several places will leave skb->nf_trace uninitialised. Unlike iptables tracing functionality is not conditional in nftables, so always copy/zero nf_trace setting when nftables is enabled. Move this into __nf_copy() helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-13net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal
Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given: Host <mtu1500> R1 <mtu1200> R2 Host sends tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid sofware segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses sofware segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. Its not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also its not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-09ipv6: icmp6_send: fix Oops when pinging a not set up IPv6 peer on a sit tunnelFX Le Bail
The patch 446fab59333dea91e54688f033dd8d788d0486fb ("ipv6: enable anycast addresses as source addresses in ICMPv6 error messages") causes an Oops when pinging a not set up IPv6 peer on a sit tunnel. The problem is that ipv6_anycast_destination() uses unconditionally skb_dst(skb), which is NULL in this case. The solution is to use instead the ipv6_chk_acast_addr_src() function. Here are the steps to reproduce it: modprobe sit ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249 ip l s sit1 up ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121 ping6 2001:1234::121 Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-06netfilter: nf_tables: add reject module for NFPROTO_INETPatrick McHardy
Add a reject module for NFPROTO_INET. It does nothing but dispatch to the AF-specific modules based on the hook family. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-06netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc partsPatrick McHardy
Currently the nft_reject module depends on symbols from ipv6. This is wrong since no generic module should force IPv6 support to be loaded. Split up the module into AF-specific and a generic part. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-27net: Fix memory leak if TPROXY used with TCP early demuxHolger Eitzenberger
I see a memory leak when using a transparent HTTP proxy using TPROXY together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable): unreferenced object 0xffff88008cba4a40 (size 1696): comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s) hex dump (first 32 bytes): 0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2..... 02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9 [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5 [<ffffffff812702cf>] sk_clone_lock+0x14/0x283 [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b [<ffffffff8129a893>] netlink_broadcast+0x14/0x16 [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3 [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0 [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55 [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725 [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154 [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514 [<ffffffff8127aa77>] process_backlog+0xee/0x1c5 [<ffffffff8127c949>] net_rx_action+0xa7/0x200 [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157 But there are many more, resulting in the machine going OOM after some days. From looking at the TPROXY code, and with help from Florian, I see that the memory leak is introduced in tcp_v4_early_demux(): void tcp_v4_early_demux(struct sk_buff *skb) { /* ... */ iph = ip_hdr(skb); th = tcp_hdr(skb); if (th->doff < sizeof(struct tcphdr) / 4) return; sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo, iph->saddr, th->source, iph->daddr, ntohs(th->dest), skb->skb_iif); if (sk) { skb->sk = sk; where the socket is assigned unconditionally to skb->sk, also bumping the refcnt on it. This is problematic, because in our case the skb has already a socket assigned in the TPROXY target. This then results in the leak I see. The very same issue seems to be with IPv6, but haven't tested. Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-24ipv6: reallocate addrconf router for ipv6 address when lo device upGao feng
commit 25fb6ca4ed9cad72f14f61629b68dc03c0d9713f "net IPv6 : Fix broken IPv6 routing table after loopback down-up" allocates addrconf router for ipv6 address when lo device up. but commit a881ae1f625c599b460cc8f8a7fcb1c438f699ad "ipv6:don't call addrconf_dst_alloc again when enable lo" breaks this behavior. Since the addrconf router is moved to the garbage list when lo device down, we should release this router and rellocate a new one for ipv6 address when lo device up. This patch solves bug 67951 on bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=67951 change from v1: use ip6_rt_put to repleace ip6_del_rt, thanks Hannes! change code style, suggested by Sergei. CC: Sabrina Dubroca <sd@queasysnail.net> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> Reported-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22ipv6: enable anycast addresses as source addresses for datagramsFX Le Bail
This change allows to consider an anycast address valid as source address when given via an IPV6_PKTINFO or IPV6_2292PKTINFO ancillary data item. So, when sending a datagram with ancillary data, the unicast and anycast addresses are handled in the same way. - Adds ipv6_chk_acast_addr_src() to check if an anycast address is link-local on given interface or is global. - Uses it in ip6_datagram_send_ctl(). Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21ipv6: protect protocols not handling ipv4 from v4 connection/bind attemptsHannes Frederic Sowa
Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow connecting and binding to them. sendmsg logic does already check msg->name for this but must trust already connected sockets which could be set up for connection to ipv4 address family. Per-socket flag ipv6only is of no use here, as it is under users control by setsockopt. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21ipv6: enable anycast addresses as source addresses in ICMPv6 error messagesFX Le Bail
- Uses ipv6_anycast_destination() in icmp6_send(). Suggested-by: Bill Fink <billfink@mindspring.com> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21tcp: delete redundant calls of tcp_mtup_init()Peter Pan(潘卫平)
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen sock, we can delete the redundant calls of tcp_mtup_init() in tcp_{v4,v6}_syn_recv_sock(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: optimize link local address searchHannes Frederic Sowa
ipv6_link_dev_addr sorts newly added addresses by scope in ifp->addr_list. Smaller scope addresses are added to the tail of the list. Use this fact to iterate in reverse over addr_list and break out as soon as a higher scoped one showes up, so we can spare some cycles on machines with lot's of addresses. The ordering of the addresses is not relevant and we are more likely to get the eui64 generated address with this change anyway. Suggested-by: Brian Haley <brian.haley@hp.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: make IPV6_RECVPKTINFO work for ipv4 datagramsHannes Frederic Sowa
We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data for IPv4 datagrams on IPv6 sockets. This patch splits the ip6_datagram_recv_ctl into two functions, one which handles both protocol families, AF_INET and AF_INET6, while the ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data. ip6_datagram_recv_*_ctl never reported back any errors, so we can make them return void. Also provide a helper for protocols which don't offer dual personality to further use ip6_datagram_recv_ctl, which is exported to modules. I needed to shuffle the code for ping around a bit to make it easier to implement dual personality for ping ipv6 sockets in future. Reported-by: Gert Doering <gert@space.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: add flowlabel_consistency sysctlFlorent Fourcot
With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of flow label unicity. This patch introduces a new sysctl to protect the old behaviour, enable by default. Changelog of V3: * rename ip6_flowlabel_consistency to flowlabel_consistency * use net_info_ratelimited() * checkpatch cleanups Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: add a flag to get the flow label used remotlyFlorent Fourcot
This information is already available via IPV6_FLOWINFO of IPV6_2292PKTOPTIONS, and them a filtering to get the flow label information. But it is probably logical and easier for users to add this here, and to control both sent/received flow label values with the IPV6_FLOWLABEL_MGR option. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GETFlorent Fourcot
With this option, the socket will reply with the flow label value read on received packets. The goal is to have a connection with the same flow label in both direction of the communication. Changelog of V4: * Do not erase the flow label on the listening socket. Use pktopts to store the received value Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18net: add build-time checks for msg->msg_name sizeSteffen Hurrle
This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg handler msg_name and msg_namelen logic"). DECLARE_SOCKADDR validates that the structure we use for writing the name information to is not larger than the buffer which is reserved for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR consistently in sendmsg code paths. Signed-off-by: Steffen Hurrle <steffen@hurrle.net> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c net/ipv4/tcp_metrics.c Overlapping changes between the "don't create two tcp metrics objects with the same key" race fix in net and the addition of the destination address in the lookup key in net-next. Minor overlapping changes in bnx2x driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv6: send Change Status Report after DAD is completedFlavio Leitner
The RFC 3810 defines two type of messages for multicast listeners. The "Current State Report" message, as the name implies, refreshes the *current* state to the querier. Since the querier sends Query messages periodically, there is no need to retransmit the report. On the other hand, any change should be reported immediately using "State Change Report" messages. Since it's an event triggered by a change and that it can be affected by packet loss, the rfc states it should be retransmitted [RobVar] times to make sure routers will receive timely. Currently, we are sending "Current State Reports" after DAD is completed. Before that, we send messages using unspecified address (::) which should be silently discarded by routers. This patch changes to send "State Change Report" messages after DAD is completed fixing the behavior to be RFC compliant and also to pass TAHI IPv6 testsuite. Signed-off-by: Flavio Leitner <fbl@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv6: simplify detection of first operational link-local address on interfaceHannes Frederic Sowa
In commit 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses") I build the detection of the first operational link-local address much to complex. Additionally this code now has a race condition. Replace it with a much simpler variant, which just scans the address list when duplicate address detection completes, to check if this is the first valid link local address and send RS and MLD reports then. Fixes: 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses") Reported-by: Jiri Pirko <jiri@resnulli.us> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Flavio Leitner <fbl@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>