aboutsummaryrefslogtreecommitdiff
path: root/include/linux/netdevice.h
AgeCommit message (Collapse)Author
2014-06-18net: Allow for more then a single subclass for netif_addr_lockVlad Yasevich
[ Upstream commit 25175ba5c9bff9aaf0229df34bb5d54c81633ec3 ] Currently netif_addr_lock_nested assumes that there can be only a single nesting level between 2 devices. However, if we have multiple devices of the same type stacked, this fails. For example: eth0 <-- vlan0.10 <-- vlan0.10.20 A more complicated configuration may stack more then one type of device in different order. Ex: eth0 <-- vlan0.10 <-- macvlan0 <-- vlan1.10.20 <-- macvlan1 This patch adds an ndo_* function that allows each stackable device to report its nesting level. If the device doesn't provide this function default subclass of 1 is used. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-18net: Find the nesting level of a given device by type.Vlad Yasevich
[ Upstream commit 4085ebe8c31face855fd01ee40372cb4aab1df3a ] Multiple devices in the kernel can be stacked/nested and they need to know their nesting level for the purposes of lockdep. This patch provides a generic function that determines a nesting level of a particular device by its type (ex: vlan, macvlan, etc). We only care about nesting of the same type of devices. For example: eth0 <- vlan0.10 <- macvlan0 <- vlan1.20 The nesting level of vlan1.20 would be 1, since there is another vlan in the stack under it. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-28net: Account for all vlan headers in skb_mac_gso_segmentVlad Yasevich
skb_network_protocol() already accounts for multiple vlan headers that may be present in the skb. However, skb_mac_gso_segment() doesn't know anything about it and assumes that skb->mac_len is set correctly to skip all mac headers. That may not always be the case. If we are simply forwarding the packet (via bridge or macvtap), all vlan headers may not be accounted for. A simple solution is to allow skb_network_protocol to return the vlan depth it has calculated. This way skb_mac_gso_segment will correctly skip all mac headers. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17netdevice: move netdev_cap_txqueue for shared usage to headerDaniel Borkmann
In order to allow users to invoke netdev_cap_txqueue, it needs to be moved into netdevice.h header file. While at it, also add kernel doc header to document the API. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17netdevice: add queue selection fallback handler for ndo_select_queueDaniel Borkmann
Add a new argument for ndo_select_queue() callback that passes a fallback handler. This gets invoked through netdev_pick_tx(); fallback handler is currently __netdev_pick_tx() as most drivers invoke this function within their customized implementation in case for skbs that don't need any special handling. This fallback handler can then be replaced on other call-sites with different queue selection methods (e.g. in packet sockets, pktgen etc). This also has the nice side-effect that __netdev_pick_tx() is then only invoked from netdev_pick_tx() and export of that function to modules can be undone. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13net: core: introduce netif_skb_dev_featuresFlorian Westphal
Will be used by upcoming ipv4 forward path change that needs to determine feature mask using skb->dst->dev instead of skb->dev. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22rtnetlink: remove ndo_get_slaveJiri Pirko
No longer used API bond-specific can be removed now. This is now handled in a generic way in rtnl_link_ops. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21net: Add GRO support for UDP encapsulating protocolsOr Gerlitz
Add GRO handlers for protocols that do UDP encapsulation, with the intent of being able to coalesce packets which encapsulate packets belonging to the same TCP session. For GRO purposes, the destination UDP port takes the role of the ether type field in the ethernet header or the next protocol in the IP header. The UDP GRO handler will only attempt to coalesce packets whose destination port is registered to have gro handler. Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive code twice on a packet. This solves the problem of udp encapsulated packets whose inner VM packet is udp and happen to carry a port which has registered offloads. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17bonding: add netlink attributes to slave link devsfeldma@cumulusnetworks.com
If link is IFF_SLAVE, extend link dev netlink attributes to include slave attributes with new IFLA_SLAVE nest. Add netlink notification (RTM_NEWLINK) when slave status changes from backup to active, or visa-versa. Adds new ndo_get_slave op to net_device_ops to fill skb with IFLA_SLAVE attributes. Currently only used by bonding driver, but could be used by other aggregating devices with slaves. Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net-sysfs: add support for device-specific rx queue sysfs attributesMichael Dalton
Extend existing support for netdevice receive queue sysfs attributes to permit a device-specific attribute group. Initial use case for this support will be to allow the virtio-net device to export per-receive queue mergeable receive buffer size. Signed-off-by: Michael Dalton <mwdalton@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net: add NETDEV_PRECHANGEMTU to notify before mtu change happensVeaceslav Falico
Currently, if a device changes its mtu, first the change happens (invloving all the side effects), and after that the NETDEV_CHANGEMTU is sent so that other devices can catch up with the new mtu. However, if they return NOTIFY_BAD, then the change is reverted and error returned. This is a really long and costy operation (sometimes). To fix this, add NETDEV_PRECHANGEMTU notification which is called prior to any change actually happening, and if any callee returns NOTIFY_BAD - the change is aborted. This way we're skipping all the playing with apply/revert the mtu. CC: "David S. Miller" <davem@davemloft.net> CC: Jiri Pirko <jiri@resnulli.us> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-15net: rename sysfs symlinks on device name changeVeaceslav Falico
Currently, we don't rename the upper/lower_ifc symlinks in /sys/class/net/*/ , which might result stale/duplicate links/names. Fix this by adding netdev_adjacent_rename_links(dev, oldname) which renames all the upper/lower interface's links to dev from the upper/lower_oldname to the new name. We don't need a rollback because only we control these symlinks and if we fail to rename them - sysfs will anyway complain. Reported-by: Ding Tianhong <dingtianhong@huawei.com> CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-01-10net: core: explicitly select a txq before doing l2 forwardingJason Wang
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The will cause several issues: - NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan instead of lower device which misses the necessary txq synchronization for lower device such as txq stopping or frozen required by dev watchdog or control path. - dev_hard_start_xmit() was called with NULL txq which bypasses the net device watchdog. - dev_hard_start_xmit() does not check txq everywhere which will lead a crash when tso is disabled for lower device. Fix this by explicitly introducing a new param for .ndo_select_queue() for just selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also extended to accept this parameter and dev_queue_xmit_accel() was used to do l2 forwarding transmission. With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of dev_queue_xmit() to do the transmission. In the future, it was also required for macvtap l2 forwarding support since it provides a necessary synchronization method. Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: e1000-devel@lists.sourceforge.net Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net-gre-gro: Add GRE support to the GRO stackJerry Chu
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: * Add full port randomization support. Some crazy researchers found a way to reconstruct the secure ephemeral ports that are allocated in random mode by sending off-path bursts of UDP packets to overrun the socket buffer of the DNS resolver to trigger retransmissions, then if the timing for the DNS resolution done by a client is larger than usual, then they conclude that the port that received the burst of UDP packets is the one that was opened. It seems a bit aggressive method to me but it seems to work for them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a new NAT mode to fully randomize ports using prandom. * Add a new classifier to x_tables based on the socket net_cls set via cgroups. These includes two patches to prepare the field as requested by Zefan Li. Also from Daniel Borkmann. * Use prandom instead of get_random_bytes in several locations of the netfilter code, from Florian Westphal. * Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack mark, also from Florian Westphal. * Fix compilation warning due to unused variable in IPVS, from Geert Uytterhoeven. * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti. * Add IPComp extension to x_tables, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04net: unify the pcpu_tstats and br_cpu_netstats as oneLi RongQing
They are same, so unify them as one, pcpu_sw_netstats. Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats from if_tunnel and remove br_cpu_netstats from br_private.h Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03net: netprio: rename config to be more consistent with cgroup configsDaniel Borkmann
While we're at it and introduced CGROUP_NET_CLASSID, lets also make NETPRIO_CGROUP more consistent with the rest of cgroups and rename it into CONFIG_CGROUP_NET_PRIO so that for networking, we now have CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG option consistent among networking cgroups, but also among cgroups CONFIG conventions in general as the vast majority has a prefix of CONFIG_CGROUP_<SUBSYS>. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Zefan Li <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-02ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NICWei-Chun Chao
VM to VM GSO traffic is broken if it goes through VXLAN or GRE tunnel and the physical NIC on the host supports hardware VXLAN/GRE GSO offload (e.g. bnx2x and next-gen mlx4). Two issues - (VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header integrity check fails in udp4_ufo_fragment if inner protocol is TCP. Also gso_segs is calculated incorrectly using skb->len that includes tunnel header. Fix: robust check should only be applied to the inner packet. (VXLAN & GRE) Once GSO header integrity check passes, NULL segs is returned and the original skb is sent to hardware. However the tunnel header is already pulled. Fix: tunnel header needs to be restored so that hardware can perform GSO properly on the original packet. Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-01net: core functions cleanupstephen hemminger
The following functions are not used outside of net/core/dev.c and should be declared static. call_netdevice_notifiers_info __dev_remove_offload netdev_has_any_upper_dev __netdev_adjacent_dev_remove __netdev_adjacent_dev_link_lists __netdev_adjacent_dev_unlink_lists __netdev_adjacent_dev_unlink __netdev_adjacent_dev_link_neighbour __netdev_adjacent_dev_unlink_neighbour And the following are never used and should be deleted netdev_lower_dev_get_private_rcu __netdev_find_adj_rcu Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31vlan: Fix header ops passthru when doing TX VLAN offload.David S. Miller
When the vlan code detects that the real device can do TX VLAN offloads in hardware, it tries to arrange for the real device's header_ops to be invoked directly. But it does so illegally, by simply hooking the real device's header_ops up to the VLAN device. This doesn't work because we will end up invoking a set of header_ops routines which expect a device type which matches the real device, but will see a VLAN device instead. Fix this by providing a pass-thru set of header_ops which will arrange to pass the proper real device instead. To facilitate this add a dev_rebuild_header(). There are implementations which provide a ->cache and ->create but not a ->rebuild (f.e. PLIP). So we need a helper function just like dev_hard_header() to avoid crashes. Use this helper in the one existing place where the header_ops->rebuild was being invoked, the neighbour code. With lots of help from Florian Westphal. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/i40e/i40e_main.c drivers/net/macvtap.c Both minor merge hassles, simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: remove dead code for add/del multiplestephen hemminger
These function to manipulate multiple addresses are not used anywhere in current net-next tree. Some out of tree code maybe using these but too bad; they should submit their code upstream.. Also, make __hw_addr_flush local since only used by dev_addr_lists.c Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14bonding: create bond_first_slave_rcu()dingtianhong
The bond_first_slave_rcu() will be used to instead of bond_first_slave() in rcu_read_lock(). According to the Jay Vosburgh's suggestion, the struct netdev_adjacent should hide from users who wanted to use it directly. so I package a new function to get the first slave of the bond. Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com> Suggested-by: Jay Vosburgh <fubar@us.ibm.com> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12net: make neigh_priv_len in struct net_device 16bit instead of 8bitSebastian Siewior
neigh_priv_len is defined as u8. With all debug enabled struct ipoib_neigh has 200 bytes. The largest part is sk_buff_head with 96 bytes and here the spinlock with 72 bytes. The size value still fits in this u8 leaving some room for more. On -RT struct ipoib_neigh put on weight and has 392 bytes. The main reason is sk_buff_head with 288 and the fatty here is spinlock with 192 bytes. This does no longer fit into into neigh_priv_len and gcc complains. This patch changes neigh_priv_len from being 8bit to 16bit. Since the following element (dev_id) is 16bit followed by a spinlock which is aligned, the struct remains with a total size of 3200 (allmodconfig) / 2048 (with as much debug off as possible) bytes on x86-64. On x86-32 the struct is 1856 (allmodconfig) / 1216 (with as much debug off as possible) bytes long. The numbers were gained with and without the patch to prove that this change does not increase the size of the struct. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12net-gro: Prepare GRO stack for the upcoming tunneling supportJerry Chu
This patch modifies the GRO stack to avoid the use of "network_header" and associated macros like ip_hdr() and ipv6_hdr() in order to allow an arbitary number of IP hdrs (v4 or v6) to be used in the encapsulation chain. This lays the foundation for various IP tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later. With this patch, the GRO stack traversing now is mostly based on skb_gro_offset rather than special hdr offsets saved in skb (e.g., skb->network_header). As a result all but the top layer (i.e., the the transport layer) must have hdrs of the same length in order for a pkt to be considered for aggregation. Therefore when adding a new encap layer (e.g., for tunneling), one must check and skip flows (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a different hdr length. Note that unlike the network header, the transport header can and will continue to be set by the GRO code since there will be at most one "transport layer" in the encap chain. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11tipc: remove TIPC usage of field af_packet_priv in struct net_deviceYing Xue
TIPC is currently using the field 'af_packet_priv' in struct net_device as a handle to find the bearer instance associated to the given network device. But, by doing so it is blocking other networking cleanups, such as the one discussed here: http://patchwork.ozlabs.org/patch/178044/ This commit removes this usage from TIPC. Instead, we introduce a new field, 'tipc_ptr', to the net_device structure, to serve this purpose. When TIPC bearer is enabled, the bearer object is associated to 'tipc_ptr'. When a TIPC packet arrives in the recv_msg() upcall from a networking device, the bearer object can now be obtained from 'tipc_ptr'. When a bearer is disabled, the bearer object is detached from its underlying network device by setting 'tipc_ptr' to NULL. Additionally, an RCU lock is used to protect the new pointer. Henceforth, the existing tipc_net_lock is used in write mode to serialize write accesses to this pointer, while the new RCU lock is applied on the read side to ensure that the pointer is 100% valid within its wrapped area for all readers. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06net: introduce dev_consume_skb_any()Eric Dumazet
Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq() helpers to free skbs, both for dropped packets and TX completed ones. We need to separate the two causes to get better diagnostics given by dropwatch or "perf record -e skb:kfree_skb" This patch provides two new helpers, dev_consume_skb_any() and dev_consume_skb_irq() to be used for consumed skbs. __dev_kfree_skb_irq() is slightly optimized to remove one atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual earth-shaking, news-breaking, rocket science pile from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: usb: Fix typo in Documentation/usb/gadget_configs.txt doc: add missing files to timers/00-INDEX timekeeping: Fix some trivial typos in comments mm: Fix some trivial typos in comments irq: Fix some trivial typos in comments NUMA: fix typos in Kconfig help text mm: update 00-INDEX doc: Documentation/DMA-attributes.txt fix typo DRM: comment: `halve' -> `half' Docs: Kconfig: `devlopers' -> `developers' doc: typo on word accounting in kprobes.c in mutliple architectures treewide: fix "usefull" typo treewide: fix "distingush" typo mm/Kconfig: Grammar s/an/a/ kexec: Typo s/the/then/ Documentation/kvm: Update cpuid documentation for steal time and pv eoi treewide: Fix common typo in "identify" __page_to_pfn: Fix typo in comment Correct some typos for word frequency clk: fixed-factor: Fix a trivial typo ...
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-07net: Add layer 2 hardware acceleration operations for macvlan devicesJohn Fastabend
Add a operations structure that allows a network interface to export the fact that it supports package forwarding in hardware between physical interfaces and other mac layer devices assigned to it (such as macvlans). This operaions structure can be used by virtual mac devices to bypass software switching so that forwarding can be done in hardware more efficiently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: make ndev->irq signed for error handlingDan Carpenter
There is a bug in cpsw_probe() where we do: ndev->irq = platform_get_irq(pdev, 0); if (ndev->irq < 0) { The problem is that "ndev->irq" is unsigned so the error handling doesn't work. I have changed it to a regular int. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07Merge tag 'driver-core-3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core / sysfs update for 3.13-rc1. There's lots of dev_groups updates for different subsystems, as they all get slowly migrated over to the safe versions of the attribute groups (removing userspace races with the creation of the sysfs files.) Also in here are some kobject updates, devres expansions, and the first round of Tejun's sysfs reworking to enable it to be used by other subsystems as a backend for an in-kernel filesystem. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits) sysfs: rename sysfs_assoc_lock and explain what it's about sysfs: use generic_file_llseek() for sysfs_file_operations sysfs: return correct error code on unimplemented mmap() mdio_bus: convert bus code to use dev_groups device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name sysfs: separate out dup filename warning into a separate function sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c sysfs: remove unused sysfs_get_dentry() prototype sysfs: honor bin_attr.attr.ignore_lockdep sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr devres: restore zeroing behavior of devres_alloc() sysfs: fix sysfs_write_file for bin file input: gameport: convert bus code to use dev_groups input: serio: remove bus usage of dev_attrs input: serio: use DEVICE_ATTR_RO() i2o: convert bus code to use dev_groups memstick: convert bus code to use dev_groups tifm: convert bus code to use dev_groups virtio: convert bus code to use dev_groups ipack: convert bus code to use dev_groups ...
2013-11-03net: extend net_device allocation to vmalloc()Eric Dumazet
Joby Poriyath provided a xen-netback patch to reduce the size of xenvif structure as some netdev allocation could fail under memory pressure/fragmentation. This patch is handling the problem at the core level, allowing any netdev structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Joby Poriyath <joby.poriyath@citrix.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29net: add might_sleep() call to napi_disableJacob Keller
napi_disable uses an msleep() call to wait for outstanding napi work to be finished after setting the disable bit. It does not always sleep incase there was no outstanding work. This resulted in a rare bug in ixgbe_down operation where a napi_disable call took place inside of a local_bh_disable()d context. In order to enable easier detection of future sleep while atomic BUGs, this patch adds a might_sleep() call, so that every use of napi_disable during atomic context will be visible. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: Alexander Duyck <alexander.duyck@intel.com> Cc: Hyong-Youb Kim <hykim@myri.com> Cc: Amir Vadai <amirv@mellanox.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-18net: Delete trailing semi-colon from definition of netdev_WARN()Ben Hutchings
Macro definitions should not normally end with a semi-colon, as this makes it dangerous to use them an if...else statement. Happily this has not happened yet. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-14treewide: Fix common typo in "identify"Maxime Jayat
Correct common misspelling of "identify" as "indentify" throughout the kernel Signed-off-by: Maxime Jayat <maxime@artisandeveloppeur.fr> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: include/linux/netdevice.h net/core/sock.c Trivial merge issues. Removal of "extern" for functions declaration in netdevice.h at the same time "const" was added to an argument. Two parallel line additions in net/core/sock.c Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07net: Separate the close_list and the unreg_list v2Eric W. Biederman
Separate the unreg_list and the close_list in dev_close_many preventing dev_close_many from permuting the unreg_list. The permutations of the unreg_list have resulted in cases where the loopback device is accessed it has been freed in code such as dst_ifdown. Resulting in subtle memory corruption. This is the second bug from sharing the storage between the close_list and the unreg_list. The issues that crop up with sharing are apparently too subtle to show up in normal testing or usage, so let's forget about being clever and use two separate lists. v2: Make all callers pass in a close_list to dev_close_many Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07netif_set_xps_queue: make cpu mask constMichael S. Tsirkin
virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30Merge branch '20130926_include_linux_networking_externs' of ↵David S. Miller
git://repo.or.cz/linux-2.6/trivial-mods Conflicts: include/linux/netdevice.h More extern removals from Joe Perches. Minor conflict with the dev_notify_flags changes which added a new argument to __dev_notify_flags(). Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30dev: update __dev_notify_flags() to send rtnl msgNicolas Dichtel
This patch only prepares the next one, there is no functional change. Now, __dev_notify_flags() can also be used to notify flags changes via rtnetlink. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26[networking]device.h: Remove extern from function prototypesJoe Perches
There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com>
2013-09-26sysfs: make attr namespace interface less convolutedTejun Heo
sysfs ns (namespace) implementation became more convoluted than necessary while trying to hide ns information from visible interface. The relatively recent attr ns support is a good example. * attr ns tag is determined by sysfs_ops->namespace() callback while dir tag is determined by kobj_type->namespace(). The placement is arbitrary. * Instead of performing operations with explicit ns tag, the namespace callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(), class_attr_namespace(), class_attr->namespace(). It's not simpler in any sense. The only thing this convolution does is traversing the whole stack backwards. The namespace callbacks are unncessary because the operations involved are inherently synchronous. The information can be provided in in straight-forward top-down direction and reversing that direction is unnecessary and against basic design principles. This backward interface is unnecessarily convoluted and hinders properly separating out sysfs from driver model / kobject for proper layering. This patch updates attr ns support such that * sysfs_ops->namespace() and class_attr->namespace() are dropped. * sysfs_{create|remove}_file_ns(), which take explicit @ns param, are added and sysfs_{create|remove}_file() are now simple wrappers around the ns aware functions. * ns handling is dropped from sysfs_chmod_file(). Nobody uses it at this point. sysfs_chmod_file_ns() can be added later if necessary. * Explicit @ns is propagated through class_{create|remove}_file_ns() and netdev_class_{create|remove}_file_ns(). * driver/net/bonding which is currently the only user of attr namespace is updated to use netdev_class_{create|remove}_file_ns() with @bh->net as the ns tag instead of using the namespace callback. This patch should be an equivalent conversion without any functional difference. It makes the code easier to follow, reduces lines of code a bit and helps proper separation and layering. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26net: add a possibility to get private from netdev_adjacent->listVeaceslav Falico
It will be useful to get first/last element. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add for_each iterators through neighbour lower link's privateVeaceslav Falico
Add a possibility to iterate through netdev_adjacent's private, currently only for lower neighbours. Add both RCU and RTNL/other locking variants of iterators, and make the non-rcu variant to be safe from removal. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add netdev_adjacent->private and allow to use itVeaceslav Falico
Currently, even though we can access any linked device, we can't attach anything to it, which is vital to properly manage them. To fix this, add a new void *private to netdev_adjacent and functions setting/getting it (per link), so that we can save, per example, bonding's slave structures there, per slave device. netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to upper dev and populates the neighbour link only with private. netdev_lower_dev_get_private{,_rcu}() returns the private, if found. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add adj_list to save only neighboursVeaceslav Falico
Currently, we distinguish neighbours (first-level linked devices) from non-neighbours by the neighbour bool in the netdev_adjacent. This could be quite time-consuming in case we would like to traverse *only* through neighbours - cause we'd have to traverse through all devices and check for this flag, and in a (quite common) scenario where we have lots of vlans on top of bridge, which is on top of a bond - the bonding would have to go through all those vlans to get its upper neighbour linked devices. This situation is really unpleasant, cause there are already a lot of cases when a device with slaves needs to go through them in hot path. To fix this, introduce a new upper/lower device lists structure - adj_list, which contains only the neighbours. It works always in pair with the all_adj_list structure (renamed from upper/lower_dev_list), i.e. both of them contain the same links, only that all_adj_list contains also non-neighbour device links. It's really a small change visible, currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't change the main linked logic at all. Also, add some comments a fix a name collision in netdev_for_each_upper_dev_rcu() and rework the naming by the following rules: netdev_(all_)(upper|lower)_* If "all_" is present, then we work with the whole list of upper/lower devices, otherwise - only with direct neighbours. Uninline functions - to get better stack traces. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-15vxlan: Fix sparse warningsJoseph Gasparakis
This patch fixes sparse warnings when incorrectly handling the port number and using int instead of unsigned int iterating through &vn->sock_list[]. Keeping the port as __be16 also makes things clearer wrt endianess. Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks which got removed. Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>