aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2014-06-17Merge tag 'v3.10.41' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.41 stable release Conflicts: include/linux/list.h
2014-05-30ipv4: initialise the itag variable in __mkroute_inputLi RongQing
[ Upstream commit fbdc0ad095c0a299e9abf5d8ac8f58374951149a ] the value of itag is a random value from stack, and may not be initiated by fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID is not set This will make the cached dst uncertainty Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30ipv4: fib_semantics: increment fib_info_cnt after fib_info allocationSergey Popovich
[ Upstream commit aeefa1ecfc799b0ea2c4979617f14cecd5cccbfd ] Increment fib_info_cnt in fib_create_info() right after successfuly alllocating fib_info structure, overwise fib_metrics allocation failure leads to fib_info_cnt incorrectly decremented in free_fib_info(), called on error path from fib_create_info(). Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30net: ipv4: ip_forward: fix inverted local_df testFlorian Westphal
[ Upstream commit ca6c5d4ad216d5942ae544bbf02503041bd802aa ] local_df means 'ignore DF bit if set', so if its set we're allowed to perform ip fragmentation. This wasn't noticed earlier because the output path also drops such skbs (and emits needed icmp error) and because netfilter ip defrag did not set local_df until couple of days ago. Only difference is that DF-packets-larger-than MTU now discarded earlier (f.e. we avoid pointless netfilter postrouting trip). While at it, drop the repeated test ip_exceeds_mtu, checking it once is enough... Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30tcp_cubic: fix the range of delayed_ackLiu Yu
[ Upstream commit 0cda345d1b2201dd15591b163e3c92bad5191745 ] commit b9f47a3aaeab (tcp_cubic: limit delayed_ack ratio to prevent divide error) try to prevent divide error, but there is still a little chance that delayed_ack can reach zero. In case the param cnt get negative value, then ratio+cnt would overflow and may happen to be zero. As a result, min(ratio, ACK_RATIO_LIMIT) will calculate to be zero. In some old kernels, such as 2.6.32, there is a bug that would pass negative param, which then ultimately leads to this divide error. commit 5b35e1e6e9c (tcp: fix tcp_trim_head() to adjust segment count with skb MSS) fixed the negative param issue. However, it's safe that we fix the range of delayed_ack as well, to make sure we do not hit a divide by zero. CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Liu Yu <allanyuliu@tencent.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30ipv4: return valid RTA_IIF on ip route getJulian Anastasov
[ Upstream commit 91146153da2feab18efab2e13b0945b6bb704ded ] Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37 ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0 which is displayed as 'iif *'. inet_iif is not appropriate to use because skb_iif is not set. Use the skb->dev->ifindex instead. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30net: ipv4: current group_info should be put after using.Wang, Xiaoming
[ Upstream commit b04c46190219a4f845e46a459e3102137b7f6cac ] Plug a group_info refcount leak in ping_init. group_info is only needed during initialization and the code failed to release the reference on exit. While here move grabbing the reference to a place where it is actually needed. Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by: Zhang Dongxing <dongxing.zhang@intel.com> Signed-off-by: xiaoming wang <xiaoming.wang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30vti: don't allow to add the same tunnel twiceNicolas Dichtel
[ Upstream commit 8d89dcdf80d88007647945a753821a06eb6cc5a5 ] Before the patch, it was possible to add two times the same tunnel: ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41 ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41 It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was set only later (when calling ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before newlink handler. Introduced by commit b9959fd3b0fa ("vti: switch to new ip tunnel code"). CC: Cong Wang <amwang@redhat.com> CC: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30gre: don't allow to add the same tunnel twiceNicolas Dichtel
[ Upstream commit 5a4552752d8f7f4cef1d98775ece7adb7616fde2 ] Before the patch, it was possible to add two times the same tunnel: ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249 ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249 It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was set only later (when calling ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before newlink handler. Introduced by commit c54419321455 ("GRE: Refactor GRE tunneling code."). CC: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30netfilter: Can't fail and free after table replacementThomas Graf
commit c58dd2dd443c26d856a168db108a0cd11c285bf3 upstream. All xtables variants suffer from the defect that the copy_to_user() to copy the counters to user memory may fail after the table has already been exchanged and thus exposed. Return an error at this point will result in freeing the already exposed table. Any subsequent packet processing will result in a kernel panic. We can't copy the counters before exposing the new tables as we want provide the counter state after the old table has been unhooked. Therefore convert this into a silent error. Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-24Merge tag 'v3.10.37' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.37 stable release
2014-04-14ipmr: fix mfc notification flagsNicolas Dichtel
[ Upstream commit 65886f439ab0fdc2dff20d1fa87afb98c6717472 ] Commit 8cd3ac9f9b7b ("ipmr: advertise new mfc entries via rtnl") reuses the function ipmr_fill_mroute() to notify mfc events. But this function was used only for dump and thus was always setting the flag NLM_F_MULTI, which is wrong in case of a single notification. Libraries like libnl will wait forever for NLMSG_DONE. CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14tcp: tcp_release_cb() should release socket ownershipEric Dumazet
[ Upstream commit c3f9b01849ef3bc69024990092b9f42e20df7797 ] Lars Persson reported following deadlock : -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0 -002 |ip_local_deliver_finish(skb = 0x8BD527A0) -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?) -004 |netif_receive_skb(skb = 0x8BD527A0) -005 |elk_poll(napi = 0x8C770500, budget = 64) -006 |net_rx_action(?) -007 |__do_softirq() -008 |do_softirq() -009 |local_bh_enable() -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?) -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0) -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0) -013 |tcp_release_cb(sk = 0x8BE6B2A0) -014 |release_sock(sk = 0x8BE6B2A0) -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?) -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096) -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096) -018 |smb_send_kvec() -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0) -020 |cifs_call_async() -021 |cifs_async_writev(wdata = 0x87FD6580) -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88) -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88) -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88) -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88) -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88) -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0) -028 |bdi_writeback_workfn(work = 0x87E4A9CC) -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC) -030 |worker_thread(__worker = 0x8B045880) -031 |kthread(_create = 0x87CADD90) -032 |ret_from_kernel_thread(asm) Bug occurs because __tcp_checksum_complete_user() enables BH, assuming it is running from softirq context. Lars trace involved a NIC without RX checksum support but other points are problematic as well, like the prequeue stuff. Problem is triggered by a timer, that found socket being owned by user. tcp_release_cb() should call tcp_write_timer_handler() or tcp_delack_timer_handler() in the appropriate context : BH disabled and socket lock held, but 'owned' field cleared, as if they were running from timer handlers. Fixes: 6f458dfb4092 ("tcp: improve latencies of timer triggered events") Reported-by: Lars Persson <lars.persson@axis.com> Tested-by: Lars Persson <lars.persson@axis.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14inet: frag: make sure forced eviction removes all fragsFlorian Westphal
[ Upstream commit e588e2f286ed7da011ed357c24c5b9a554e26595 ] Quoting Alexander Aring: While fragmentation and unloading of 6lowpan module I got this kernel Oops after few seconds: BUG: unable to handle kernel paging request at f88bbc30 [..] Modules linked in: ipv6 [last unloaded: 6lowpan] Call Trace: [<c012af4c>] ? call_timer_fn+0x54/0xb3 [<c012aef8>] ? process_timeout+0xa/0xa [<c012b66b>] run_timer_softirq+0x140/0x15f Problem is that incomplete frags are still around after unload; when their frag expire timer fires, we get crash. When a netns is removed (also done when unloading module), inet_frag calls the evictor with 'force' argument to purge remaining frags. The evictor loop terminates when accounted memory ('work') drops to 0 or the lru-list becomes empty. However, the mem accounting is done via percpu counters and may not be accurate, i.e. loop may terminate prematurely. Alter evictor to only stop once the lru list is empty when force is requested. Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Reported-by: Alexander Aring <alex.aring@gmail.com> Tested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14net: fix for a race condition in the inet frag codeNikolay Aleksandrov
[ Upstream commit 24b9bf43e93e0edd89072da51cf1fab95fc69dec ] I stumbled upon this very serious bug while hunting for another one, it's a very subtle race condition between inet_frag_evictor, inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically the users of inet_frag_kill/inet_frag_put). What happens is that after a fragment has been added to the hash chain but before it's been added to the lru_list (inet_frag_lru_add) in inet_frag_intern, it may get deleted (either by an expired timer if the system load is high or the timer sufficiently low, or by the fraq_queue function for different reasons) before it's added to the lru_list, then after it gets added it's a matter of time for the evictor to get to a piece of memory which has been freed leading to a number of different bugs depending on what's left there. I've been able to trigger this on both IPv4 and IPv6 (which is normal as the frag code is the same), but it's been much more difficult to trigger on IPv4 due to the protocol differences about how fragments are treated. The setup I used to reproduce this is: 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be seen on multiple cards at the same time. Then I used multiple instances of ping/ping6 to generate fragmented packets and flood the machines with them while running other processes to load the attacked machine. *It is very important to have the _same flow_ coming in on multiple CPUs concurrently. Usually the attacked machine would die in less than 30 minutes, if configured properly to have many evictor calls and timeouts it could happen in 10 minutes or so. An important point to make is that any caller (frag_queue or timer) of inet_frag_kill will remove both the timer refcount and the original/guarding refcount thus removing everything that's keeping the frag from being freed at the next inet_frag_put. All of this could happen before the frag was ever added to the LRU list, then it gets added and the evictor uses a freed fragment. An example for IPv6 would be if a fragment is being added and is at the stage of being inserted in the hash after the hash lock is released, but before inet_frag_lru_add executes (or is able to obtain the lru lock) another overlapping fragment for the same flow arrives at a different CPU which finds it in the hash, but since it's overlapping it drops it invoking inet_frag_kill and thus removing all guarding refcounts, and afterwards freeing it by invoking inet_frag_put which removes the last refcount added previously by inet_frag_find, then inet_frag_lru_add gets executed by inet_frag_intern and we have a freed fragment in the lru_list. The fix is simple, just move the lru_add under the hash chain locked region so when a removing function is called it'll have to wait for the fragment to be added to the lru_list, and then it'll remove it (it works because the hash chain removal is done before the lru_list one and there's no window between the two list adds when the frag can get dropped). With this fix applied I couldn't kill the same machine in 24 hours with the same setup. Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock") CC: Florian Westphal <fw@strlen.de> CC: Jesper Dangaard Brouer <brouer@redhat.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-24Merge tag 'v3.10.34' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.34 stable release
2014-03-23net-tcp: fastopen: fix high order allocationsEric Dumazet
[ Upstream commit f5ddcbbb40aa0ba7fbfe22355d287603dbeeaaac ] This patch fixes two bugs in fastopen : 1) The tcp_sendmsg(..., @size) argument was ignored. Code was relying on user not fooling the kernel with iovec mismatches 2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5 allocations, which are likely to fail when memory gets fragmented. Fixes: 783237e8daf13 ("net-tcp: Fast Open client - sending SYN-data") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-10Merge tag 'v3.10.33' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.33 stable release
2014-03-06net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal
commit fe6cc55f3a9a053482a76f5a6b2257cee51b4663 upstream. Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given: Host <mtu1500> R1 <mtu1200> R2 Host sends tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid sofware segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses sofware segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. Its not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also its not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06ipv4: fix counter in_slow_totDuan Jiong
[ Upstream commit a6254864c08109c66a194612585afc0439005286 ] since commit 89aef8921bf("ipv4: Delete routing cache."), the counter in_slow_tot can't work correctly. The counter in_slow_tot increase by one when fib_lookup() return successfully in ip_route_input_slow(), but actually the dst struct maybe not be created and cached, so we can increase in_slow_tot after the dst struct is created. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06tcp: tsq: fix nonagle handlingJohn Ogness
[ Upstream commit bf06200e732de613a1277984bf34d1a21c2de03d ] Commit 46d3ceabd8d9 ("tcp: TCP Small Queues") introduced a possible regression for applications using TCP_NODELAY. If TCP session is throttled because of tsq, we should consult tp->nonagle when TX completion is done and allow us to send additional segment, especially if this segment is not a full MSS. Otherwise this segment is sent after an RTO. [edumazet] : Cooked the changelog, added another fix about testing sk_wmem_alloc twice because TX completion can happen right before setting TSQ_THROTTLED bit. This problem is particularly visible with recent auto corking, but might also be triggered with low tcp_limit_output_bytes values or NIC drivers delaying TX completion by hundred of usec, and very low rtt. Thomas Glanzmann for example reported an iscsi regression, caused by tcp auto corking making this bug quite visible. Fixes: 46d3ceabd8d9 ("tcp: TCP Small Queues") Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06ipv4: Fix runtime WARNING in rtmsg_ifa()Geert Uytterhoeven
[ Upstream commit 63b5f152eb4a5bb79b9caf7ec37b4201d12f6e66 ] On m68k/ARAnyM: WARNING: CPU: 0 PID: 407 at net/ipv4/devinet.c:1599 0x316a99() Modules linked in: CPU: 0 PID: 407 Comm: ifconfig Not tainted 3.13.0-atari-09263-g0c71d68014d1 #1378 Stack from 10c4fdf0: 10c4fdf0 002ffabb 000243e8 00000000 008ced6c 00024416 00316a99 0000063f 00316a99 00000009 00000000 002501b4 00316a99 0000063f c0a86117 00000080 c0a86117 00ad0c90 00250a5a 00000014 00ad0c90 00000000 00000000 00000001 00b02dd0 00356594 00000000 00356594 c0a86117 eff6c9e4 008ced6c 00000002 008ced60 0024f9b4 00250b52 00ad0c90 00000000 00000000 00252390 00ad0c90 eff6c9e4 0000004f 00000000 00000000 eff6c9e4 8000e25c eff6c9e4 80001020 Call Trace: [<000243e8>] warn_slowpath_common+0x52/0x6c [<00024416>] warn_slowpath_null+0x14/0x1a [<002501b4>] rtmsg_ifa+0xdc/0xf0 [<00250a5a>] __inet_insert_ifa+0xd6/0x1c2 [<0024f9b4>] inet_abc_len+0x0/0x42 [<00250b52>] inet_insert_ifa+0xc/0x12 [<00252390>] devinet_ioctl+0x2ae/0x5d6 Adding some debugging code reveals that net_fill_ifaddr() fails in put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp, preferred, valid)) nla_put complains: lib/nlattr.c:454: skb_tailroom(skb) = 12, nla_total_size(attrlen) = 20 Apparently commit 5c766d642bcaffd0c2a5b354db2068515b3846cf ("ipv4: introduce address lifetime") forgot to take into account the addition of struct ifa_cacheinfo in inet_nlmsg_size(). Hence add it, like is already done for ipv6. Suggested-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-28net: ip_send_unicast_reply: add missing local serializationNicholas Mc Guire
in response to the oops in ip_output.c:ip_send_unicast_reply under high network load with CONFIG_PREEMPT_RT_FULL=y, reported by Sami Pietikainen <Sami.Pietikainen@wapice.com>, this patch adds local serialization in ip_send_unicast_reply. from ip_output.c: /* * Generic function to send a packet as reply to another packet. * Used to send some TCP resets/acks so far. * * Use a fake percpu inet socket to avoid false sharing and contention. */ static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = { ... which was added in commit be9f4a44 in linux-stable. The git log, wich introduced the PER_CPU unicast_sock, states: <snip> commit be9f4a44e7d41cee50ddb5f038fc2391cbbb4046 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jul 19 07:34:03 2012 +0000 ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) <snip> The per-cpu here thus is assuming exclusivity serializing per cpu - so the use of get_cpu_ligh introduced in net-use-cpu-light-in-ip-send-unicast-reply.patch, which droped the preempt_disable in favor of a migrate_disable is probably wrong as this only handles the referencial consistency but not the serialization. To evade a preempt_disable here a local lock would be needed. Therapie: * add local lock: * and re-introduce local serialization: Tested on x86 with high network load using the testcase from Sami Pietikainen while : ; do wget -O - ftp://LOCAL_SERVER/empty_file > /dev/null 2>&1; done Link: http://www.spinics.net/lists/linux-rt-users/msg11007.html Cc: stable-rt@vger.kernel.org Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-02-21Merge tag 'v3.10.29' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.29 stable release
2014-02-06net: Fix memory leak if TPROXY used with TCP early demuxHolger Eitzenberger
[ Upstream commit a452ce345d63ddf92cd101e4196569f8718ad319 ] I see a memory leak when using a transparent HTTP proxy using TPROXY together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable): unreferenced object 0xffff88008cba4a40 (size 1696): comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s) hex dump (first 32 bytes): 0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2..... 02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9 [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5 [<ffffffff812702cf>] sk_clone_lock+0x14/0x283 [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b [<ffffffff8129a893>] netlink_broadcast+0x14/0x16 [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3 [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0 [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55 [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725 [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154 [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514 [<ffffffff8127aa77>] process_backlog+0xee/0x1c5 [<ffffffff8127c949>] net_rx_action+0xa7/0x200 [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157 But there are many more, resulting in the machine going OOM after some days. From looking at the TPROXY code, and with help from Florian, I see that the memory leak is introduced in tcp_v4_early_demux(): void tcp_v4_early_demux(struct sk_buff *skb) { /* ... */ iph = ip_hdr(skb); th = tcp_hdr(skb); if (th->doff < sizeof(struct tcphdr) / 4) return; sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo, iph->saddr, th->source, iph->daddr, ntohs(th->dest), skb->skb_iif); if (sk) { skb->sk = sk; where the socket is assigned unconditionally to skb->sk, also bumping the refcnt on it. This is problematic, because in our case the skb has already a socket assigned in the TPROXY target. This then results in the leak I see. The very same issue seems to be with IPv6, but haven't tested. Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06fib_frontend: fix possible NULL pointer dereferenceOliver Hartkopp
[ Upstream commit a0065f266a9b5d51575535a25c15ccbeed9a9966 ] The two commits 0115e8e30d (net: remove delay at device dismantle) and 748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently removed a NULL pointer check for in_dev since Linux 3.7. This patch re-introduces this check as it causes crashing the kernel when setting small mtu values on non-ip capable netdevices. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is calledDuan Jiong
[ Upstream commit 11c21a307d79ea5f6b6fc0d3dfdeda271e5e65f6 ] commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach") clear IPCB in ip_tunnel_xmit() , or else skb->cb[] may contain garbage from GSO segmentation layer. But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes, and it clear IPCB behind the dst_link_failure(). So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach"). Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06tcp: metrics: Avoid duplicate entries with the same destination-IPChristoph Paasch
[ Upstream commit 77f99ad16a07aa062c2d30fae57b1fee456f6ef6 ] Because the tcp-metrics is an RCU-list, it may be that two soft-interrupts are inside __tcp_get_metrics() for the same destination-IP at the same time. If this destination-IP is not yet part of the tcp-metrics, both soft-interrupts will end up in tcpm_new and create a new entry for this IP. So, we will have two tcp-metrics with the same destination-IP in the list. This patch checks twice __tcp_get_metrics(). First without holding the lock, then while holding the lock. The second one is there to confirm that the entry has not been added by another soft-irq while waiting for the spin-lock. Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.) Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06net: avoid reference counter overflows on fib_rules in multicast forwardingHannes Frederic Sowa
[ Upstream commit 95f4a45de1a0f172b35451fc52283290adb21f6e ] Bob Falken reported that after 4G packets, multicast forwarding stopped working. This was because of a rule reference counter overflow which freed the rule as soon as the overflow happend. This patch solves this by adding the FIB_LOOKUP_NOREF flag to fib_rules_lookup calls. This is safe even from non-rcu locked sections as in this case the flag only implies not taking a reference to the rule, which we don't need at all. Rules only hold references to the namespace, which are guaranteed to be available during the call of the non-rcu protected function reg_vif_xmit because of the interface reference which itself holds a reference to the net namespace. Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables") Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables") Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Julian Anastasov <ja@ssi.bg> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06inet_diag: fix inet_diag_dump_icsk() timewait socket state logicNeal Cardwell
[ Based upon upstream commit 70315d22d3c7383f9a508d0aab21e2eb35b2303a ] Fix inet_diag_dump_icsk() to reflect the fact that both TIME_WAIT and FIN_WAIT2 connections are represented by inet_timewait_sock (not just TIME_WAIT). Thus: (a) We need to iterate through the time_wait buckets if the user wants either TIME_WAIT or FIN_WAIT2. (Before fixing this, "ss -nemoi state fin-wait-2" would not return any sockets, even if there were some in FIN_WAIT2.) (b) We need to check tw_substate to see if the user wants to dump sockets in the particular substate (TIME_WAIT or FIN_WAIT2) that a given connection is in. (Before fixing this, "ss -nemoi state time-wait" would actually return sockets in state FIN_WAIT2.) An analogous fix is in v3.13: 70315d22d3c7383f9a508d0aab21e2eb35b2303a ("inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets") but that patch is quite different because 3.13 code is very different in this area due to the unification of TCP hash tables in 05dbc7b ("tcp/dccp: remove twchain") in v3.13-rc1. I tested that this applies cleanly between v3.3 and v3.12, and tested that it works in both 3.3 and 3.12. It does not apply cleanly to 3.2 and earlier (though it makes semantic sense), and semantically is not the right fix for 3.13 and beyond (as mentioned above). Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-18Merge tag 'v3.10.27' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.27 stable release
2014-01-15net: inet_diag: zero out uninitialized idiag_{src,dst} fieldsDaniel Borkmann
[ Upstream commit b1aac815c0891fe4a55a6b0b715910142227700f ] Jakub reported while working with nlmon netlink sniffer that parts of the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6. That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3]. In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab] memory through this. At least, in udp_dump_one(), we allocate a skb in ... rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL); ... and then pass that to inet_sk_diag_fill() that puts the whole struct inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0], r->id.idiag_dst[0] and leave the rest untouched: r->id.idiag_src[0] = inet->inet_rcv_saddr; r->id.idiag_dst[0] = inet->inet_daddr; struct inet_diag_msg embeds struct inet_diag_sockid that is correctly / fully filled out in IPv6 case, but for IPv4 not. So just zero them out by using plain memset (for this little amount of bytes it's probably not worth the extra check for idiag_family == AF_INET). Similarly, fix also other places where we fill that out. Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15ip_gre: fix msg_name parsing for recvfrom/recvmsgTimo Teräs
[ Upstream commit 0e3da5bb8da45890b1dc413404e0f978ab71173e ] ipgre_header_parse() needs to parse the tunnel's ip header and it uses mac_header to locate the iphdr. This got broken when gre tunneling was refactored as mac_header is no longer updated to point to iphdr. Introduce skb_pop_mac_header() helper to do the mac_header assignment and use it in ipgre_rcv() to fix msg_name parsing. Bug introduced in commit c54419321455 (GRE: Refactor GRE tunneling code.) Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-02Merge tag 'v3.10.25' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.25 stable release
2013-12-20Revert "net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLAST"Greg Kroah-Hartman
It turns out that commit: d3f7d56a7a4671d395e8af87071068a195257bf6 was applied to the tree twice, which didn't hurt anything, but it's good to fix this up. Reported-by: Veaceslav Falico <veaceslav@falico.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Shawn Landden <shawnlandden@gmail.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17Merge tag 'v3.10.24' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.24 stable release
2013-12-17Merge tag 'v3.10.23' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.23 stable release
2013-12-11net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLASTShawn Landden
commit d3f7d56a7a4671d395e8af87071068a195257bf6 upstream. Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once) added an internal flag MSG_SENDPAGE_NOTLAST, similar to MSG_MORE. algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages() and need to see the new flag as identical to MSG_MORE. This fixes sendfile() on AF_ALG. v3: also fix udp Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com> Original-patch: Richard Weinberger <richard@nod.at> Signed-off-by: Shawn Landden <shawn@churchofgit.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08tcp: gso: fix truesize trackingEric Dumazet
[ Upstream commit 0d08c42cf9a71530fef5ebcfe368f38f2dd0476f ] commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets") had an heuristic that can trigger a warning in skb_try_coalesce(), because skb->truesize of the gso segments were exactly set to mss. This breaks the requirement that skb->truesize >= skb->len + truesizeof(struct sk_buff); It can trivially be reproduced by : ifconfig lo mtu 1500 ethtool -K lo tso off netperf As the skbs are looped into the TCP networking stack, skb_try_coalesce() warns us of these skb under-estimating their truesize. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08inet: fix possible seqlock deadlocksEric Dumazet
[ Upstream commit f1d8cba61c3c4b1eb88e507249c4cb8d635d9a76 ] In commit c9e9042994d3 ("ipv4: fix possible seqlock deadlock") I left another places where IP_INC_STATS_BH() were improperly used. udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from process context, not from softirq context. This was detected by lockdep seqlock support. Reported-by: jongman heo <jongman.heo@samsung.com> Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLASTShawn Landden
[ Upstream commit d3f7d56a7a4671d395e8af87071068a195257bf6 ] Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once) added an internal flag MSG_SENDPAGE_NOTLAST, similar to MSG_MORE. algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages() and need to see the new flag as identical to MSG_MORE. This fixes sendfile() on AF_ALG. v3: also fix udp Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Original-patch: Richard Weinberger <richard@nod.at> Signed-off-by: Shawn Landden <shawn@churchofgit.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08ipv4: fix race in concurrent ip_route_input_slow()Alexei Starovoitov
[ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ] CPUs can ask for local route via ip_route_input_noref() concurrently. if nh_rth_input is not cached yet, CPUs will proceed to allocate equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input via rt_cache_route() Most of the time they succeed, but on occasion the following two lines: orig = *p; prev = cmpxchg(p, orig, rt); in rt_cache_route() do race and one of the cpus fails to complete cmpxchg. But ip_route_input_slow() doesn't check the return code of rt_cache_route(), so dst is leaking. dst_destroy() is never called and 'lo' device refcnt doesn't go to zero, which can be seen in the logs as: unregister_netdevice: waiting for lo to become free. Usage count = 1 Adding mdelay() between above two lines makes it easily reproducible. Fix it similar to nh_pcpu_rth_output case. Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08tcp: don't update snd_nxt, when a socket is switched from repair modeAndrey Vagin
[ Upstream commit dbde497966804e63a38fdedc1e3815e77097efc2 ] snd_nxt must be updated synchronously with sk_send_head. Otherwise tp->packets_out may be updated incorrectly, what may bring a kernel panic. Here is a kernel panic from my host. [ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150 ... [ 146.301158] Call Trace: [ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0 Before this panic a tcp socket was restored. This socket had sent and unsent data in the write queue. Sent data was restored in repair mode, then the socket was switched from reapair mode and unsent data was restored. After that the socket was switched back into repair mode. In that moment we had a socket where write queue looks like this: snd_una snd_nxt write_seq |_________|________| | sk_send_head After a second switching from repair mode the state of socket was changed: snd_una snd_nxt, write_seq |_________ ________| | sk_send_head This state is inconsistent, because snd_nxt and sk_send_head are not synchronized. Bellow you can find a call trace, how packets_out can be incremented twice for one skb, if snd_nxt and sk_send_head are not synchronized. In this case packets_out will be always positive, even when sk_write_queue is empty. tcp_write_wakeup skb = tcp_send_head(sk); tcp_fragment if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq)) tcp_adjust_pcount(sk, skb, diff); tcp_event_new_data_sent tp->packets_out += tcp_skb_pcount(skb); I think update of snd_nxt isn't required, when a socket is switched from repair mode. Because it's initialized in tcp_connect_init. Then when a write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent, so it's always is in consistent state. I have checked, that the bug is not reproduced with this patch and all tests about restoring tcp connections work fine. Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08xfrm: Release dst if this dst is improper for vti tunnelfan.du
[ Upstream commit 236c9f84868534c718b6889aa624de64763281f9 ] After searching rt by the vti tunnel dst/src parameter, if this rt has neither attached to any transformation nor the transformation is not tunnel oriented, this rt should be released back to ip layer. otherwise causing dst memory leakage. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu ↵Hannes Frederic Sowa
functions [ Upstream commit 85fbaa75037d0b6b786ff18658ddf0b4014ce2a4 ] Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") conditionally updated addr_len if the msg_name is written to. The recv_error and rxpmtu functions relied on the recvmsg functions to set up addr_len before. As this does not happen any more we have to pass addr_len to those functions as well and set it to the size of the corresponding sockaddr length. This broke traceroute and such. Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") Reported-by: Brad Spengler <spender@grsecurity.net> Reported-by: Tom Labanowski Cc: mpb <mpb.mail@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08inet: prevent leakage of uninitialized memory to user in recv syscallsHannes Frederic Sowa
[ Upstream commit bceaa90240b6019ed73b49965eac7d167610be69 ] Only update *addr_len when we actually fill in sockaddr, otherwise we can return uninitialized memory from the stack to the caller in the recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL) checks because we only get called with a valid addr_len pointer either from sock_common_recvmsg or inet_recvmsg. If a blocking read waits on a socket which is concurrently shut down we now return zero and set msg_msgnamelen to 0. Reported-by: mpb <mpb.mail@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08ipv4: fix possible seqlock deadlockEric Dumazet
[ Upstream commit c9e9042994d37cbc1ee538c500e9da1bb9d1bcdf ] ip4_datagram_connect() being called from process context, it should use IP_INC_STATS() instead of IP_INC_STATS_BH() otherwise we can deadlock on 32bit arches, or get corruptions of SNMP counters. Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08net-tcp: fix panic in tcp_fastopen_cache_set()Eric Dumazet
[ Upstream commit dccf76ca6b626c0c4a4e09bb221adee3270ab0ef ] We had some reports of crashes using TCP fastopen, and Dave Jones gave a nice stack trace pointing to the error. Issue is that tcp_get_metrics() should not be called with a NULL dst Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08tcp: tsq: restore minimal amount of queueingEric Dumazet
[ Upstream commit 98e09386c0ef4dfd48af7ba60ff908f0d525cdee ] After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-21Merge tag 'v3.10.20' into v3.10-rtSteven Rostedt (Red Hat)
This is the 3.10.20 stable release