path: root/kernel
2017-03-23tracing: move trace_handle_return() out of lineSteven Rostedt
Currently trace_handle_return() looks like this:

  static inline enum print_line_t trace_handle_return(struct trace_seq *s)
  {
          return trace_seq_has_overflowed(s) ?
                  TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
  }

Where trace_seq_has_overflowed(s) is:

  static inline bool trace_seq_has_overflowed(struct trace_seq *s)
  {
          return s->full || seq_buf_has_overflowed(&s->seq);
  }

And seq_buf_has_overflowed(&s->seq) is:

  static inline bool seq_buf_has_overflowed(struct seq_buf *s)
  {
          return s->len > s->size;
  }

Making trace_handle_return() into:

  return (s->full || (s->seq->len > s->seq->size)) ?
          TRACE_TYPE_PARTIAL_LINE :
          TRACE_TYPE_HANDLED;

One would think this is not an issue to keep as an inline. But because this is used in the TRACE_EVENT() macro, it is expanded for every tracepoint in the system. Take a look at a single tracepoint, x86_irq_vector (the first one I chose at random). As trace_handle_return() is used in the TRACE_EVENT() macro of trace_raw_output_##call(), we disassemble trace_raw_output_x86_irq_vector and do a diff. I removed lines that differed only due to different addresses. The original has 22 bytes of text more than the out-of-line version. As this happens for every TRACE_EVENT() defined in the system, this can become quite large.

     text    data     bss      dec     hex filename
  8690305 5450490 1298432 15439227  eb957b vmlinux-orig
  8681725 5450490 1298432 15430647  eb73f7 vmlinux-handle

This change saves a total of 8580 bytes.

  $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
  324

That's 324 tracepoints. But this does not include modules (which contain many more tracepoints). For an allyesconfig build:

  $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
  1401

That's 1401 tracepoints, giving us:

       text      data      bss       dec      hex filename
  137827709 140221067 53264384 331313160 13bf7008 vmlinux-allyes-handle
  137920629 140221067 53264384 331406080 13c0db00 vmlinux-allyes-orig

92920 bytes in savings!!!

Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
Link: http://lkml.kernel.org/r/20170316113459.2366588b@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
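A hedged sketch of what "out of line" means here: the header keeps only a declaration, and a single shared definition lives in the trace core (file placement assumed from the description above):

  /* include/linux/trace_events.h: declaration only, no per-tracepoint copy */
  enum print_line_t trace_handle_return(struct trace_seq *s);

  /* kernel/trace/trace.c: the one shared implementation */
  enum print_line_t trace_handle_return(struct trace_seq *s)
  {
          return trace_seq_has_overflowed(s) ?
                  TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
  }

Each trace_raw_output_*() function then emits a short call instruction instead of its own expanded copy of the overflow test.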
2017-03-23kernel/sched/fair.c: uninline __update_load_avg()Andi Kleen
This is a very complex function, which is called in multiple places. It is unlikely that inlining or not inlining it makes any difference for its run time. This saves around 13k text in my kernel text data bss dec hex filename 9083992 5367600 11116544 25568136 1862388 vmlinux-before-load-avg 9070166 5367600 11116544 25554310 185ed86 vmlinux-load-avg Link: http://lkml.kernel.org/r/20170315021431.13107-4-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
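The patch itself is tiny; it only drops the inline hint so the compiler emits one shared copy (a sketch with the parameter list abridged; the real function takes more arguments):

  /* before: every call site may get its own expanded copy */
  static __always_inline int __update_load_avg(u64 now, int cpu,
                                               struct sched_avg *sa);

  /* after: one out-of-line definition shared by all call sites */
  static int __update_load_avg(u64 now, int cpu, struct sched_avg *sa);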
2017-03-23kernel/power/snapshot.c: use set_memory.h headerLaura Abbott
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-13-git-send-email-labbott@redhat.com Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
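The pattern is the same in each converted file: pull in the dedicated header instead of relying on indirect includes (a sketch, assuming set_memory_ro()/set_memory_rw() call sites like the hibernation page-protection code):

  #include <linux/set_memory.h>

  /* callers now get their prototypes from the explicit header */
  static void protect_page_sketch(void *addr, int numpages)
  {
          set_memory_ro((unsigned long)addr, numpages);
  }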
2017-03-23kernel/module.c: use set_memory.h headerLaura Abbott
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-12-git-send-email-labbott@redhat.com Signed-off-by: Laura Abbott <labbott@redhat.com> Acked-by: Jessica Yu <jeyu@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2017-03-23mm, vmalloc: use __GFP_HIGHMEM implicitlyMichal Hocko
__vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular:

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with the other flags, because there is no reason to consume precious low memory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of the users don't add this flag, though, which signals that the API is unnecessarily complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
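A hypothetical before/after of a typical call site (the buffer name is illustrative; the three-argument __vmalloc() form is the one this series touches):

  /* before: every caller had to remember the highmem flag */
  buf = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);

  /* after: __GFP_HIGHMEM is applied internally when pages are allocated */
  buf = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);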
2017-03-23Merge branch 'akpm-current/current'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'livepatching/for-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'cgroup/for-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'workqueues/for-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'rcu/rcu/next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'tip/auto-latest'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'audit/next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'security/next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'kgdb/kgdb-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'drm/drm-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'net-next/master'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'pm/linux-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'vfs/for-next'Stephen Rothwell
2017-03-23Merge remote-tracking branch 'net/master'Stephen Rothwell
2017-03-22srcu: Merge ->srcu_state into ->srcu_gp_seqPaul E. McKenney
Updating ->srcu_state and ->srcu_gp_seq will lead to extremely complex race conditions given multiple callback queues, so this commit takes advantage of the two-bit state now available in rcu_seq counters to store the state in the bottom two bits of ->srcu_gp_seq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
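A sketch of how the merged encoding is read back, using the rcu_seq_state() accessor from the rcu_seq abstraction entries below (the wrapper name is illustrative):

  /* sketch: the GP state now rides in the low-order bits of ->srcu_gp_seq */
  static int srcu_gp_state_sketch(struct srcu_struct *sp)
  {
          return rcu_seq_state(READ_ONCE(sp->srcu_gp_seq));
  }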
2017-03-22bpf: fix hashmap extra_elems logicAlexei Starovoitov
In both kmalloc and prealloc modes, bpf_map_update_elem() uses per-cpu extra_elems to do an atomic update when the map is full. There are two issues with it. The logic can be misused, since it allows max_entries+num_cpus elements to be present in the map. And alloc_extra_elems() at map creation time can fail a percpu allocation for large map values with a warning:

  WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
  illegal size (32824) or align (8) for percpu allocation

The fixes for these two issues differ between kmalloc and prealloc modes. For prealloc mode, allocate extra num_possible_cpus elements and store pointers to them in the extra_elems array instead of actual elements. Hence we can use these hidden (spare) elements not only when the map is full, but also during bpf_map_update_elem() calls that replace an existing element. That also improves performance, since pcpu_freelist_pop/push is avoided. Unfortunately this approach cannot be used for kmalloc mode, which needs to kfree elements after an rcu grace period. Therefore switch it back to normal kmalloc even when the map is full and an old element exists, like it was prior to commit 6c9059817432 ("bpf: pre-allocate hash map elements").

Add tests to check for going over max_entries and for large map values.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branches 'pm-cpufreq-fixes', 'pm-cpufreq-sched-fixes', 'intel_pstate-fixes' and 'pm-cpuidle-fixes' into linux-nextRafael J. Wysocki
* pm-cpufreq-fixes:
  cpufreq: Restore policy min/max limits on CPU online

* pm-cpufreq-sched-fixes:
  cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()

* intel_pstate-fixes:
  cpufreq: intel_pstate: Fix policy data management in passive mode

* pm-cpuidle-fixes:
  cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
2017-03-21srcu: Allow a second bit in rcu_seq for SRCU statePaul E. McKenney
This commit increases the number of reserved bits at the bottom of an rcu_seq grace-period counter from one to two, as will be needed to accommodate SRCU's three-state grace periods. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-03-21srcu: Improve rcu_seq grace-period-counter abstractionPaul E. McKenney
The expedited grace-period code contains several open-coded shifts that know the format of an rcu_seq grace-period counter, which is not particularly good style. This commit therefore creates a new rcu_seq_ctr() function that extracts the counter portion of the grace-period sequence number, and an rcu_seq_state() function that extracts the low-order state bit. This commit prepares for SRCU callback parallelization, which will require two state bits.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
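A minimal sketch of the two accessors, assuming the two reserved state bits at the bottom of the counter arranged by the commit above:

  #define RCU_SEQ_CTR_SHIFT  2
  #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)

  /* the grace-period count, with the state bits stripped */
  static inline unsigned long rcu_seq_ctr(unsigned long s)
  {
          return s >> RCU_SEQ_CTR_SHIFT;
  }

  /* the low-order state bits */
  static inline int rcu_seq_state(unsigned long s)
  {
          return s & RCU_SEQ_STATE_MASK;
  }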
2017-03-21Merge branch 'timers/core'Ingo Molnar
2017-03-21Merge branch 'sched/core'Ingo Molnar
2017-03-21Merge branch 'perf/core'Ingo Molnar
2017-03-21Merge branch 'locking/core'Ingo Molnar
2017-03-21Merge branch 'irq/core'Ingo Molnar
2017-03-21cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()Rafael J. Wysocki
sugov_start() only initializes struct sugov_cpu per-CPU structures for shared policies, but it should do that for single-CPU policies too. That in particular makes the IO-wait boost mechanism work in the cases when cpufreq policies correspond to individual CPUs. Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
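A sketch of the corrected initialization, looping over every CPU of the policy whether or not it is shared (details of the zeroing and field setup are assumptions based on the description above):

  /* in sugov_start(): set up each per-CPU structure, shared or not */
  for_each_cpu(cpu, policy->cpus) {
          struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

          memset(sg_cpu, 0, sizeof(*sg_cpu));
          sg_cpu->sg_policy = sg_policy;
  }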
2017-03-20Merge branches 'pm-cpufreq', 'pm-cpufreq-sched' and 'intel_pstate' into linux-nextRafael J. Wysocki
* pm-cpufreq:
  cpufreq: dbx500: Manage cooling device from cpufreq driver
  MAINTAINERS: Add file patterns for cpufreq device tree bindings
  cpufreq: qoriq: enhance bus frequency calculation
  cpufreq: mediatek: Add support for MT8176 and MT817x
  cpufreq: mt8173: Mark mt8173_cpufreq_driver_init as __init

* pm-cpufreq-sched:
  cpufreq: schedutil: Refactor sugov_next_freq_shared()
  cpufreq: schedutil: Redefine the rate_limit_us tunable

* intel_pstate:
  cpufreq: intel_pstate: Drop redundant wrapper function
2017-03-19taskstats-add-e-u-stime-for-tgid-command-fix-fixAndrew Morton
include linux/sched/cputime.h for task_cputime() Cc: Balbir Singh <bsingharora@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Zhang Xiao <xiao.zhang@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2017-03-19taskstats-add-e-u-stime-for-tgid-command-fixAndrew Morton
run ktime_get_ns() a single time Cc: Balbir Singh <bsingharora@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Zhang Xiao <xiao.zhang@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2017-03-19taskstats: add e/u/stime for TGID commandZhang Xiao
The elapsed time, user CPU time and system CPU time for the thread group status request are presently left at zero. Fill these in. Link: http://lkml.kernel.org/r/1488508424-12322-1-git-send-email-xiao.zhang@windriver.com Signed-off-by: Zhang Xiao <xiao.zhang@windriver.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
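A rough sketch of the fill, folding in the two follow-up fixes above (read the clock once; include <linux/sched/cputime.h> for task_cputime()). The unit conversions and the use of tsk->start_time are assumptions, not the literal patch:

  #include <linux/sched/cputime.h>

  static void fill_tgid_times_sketch(struct task_struct *tsk,
                                     struct taskstats *stats)
  {
          u64 now = ktime_get_ns();   /* read the clock a single time */
          u64 utime, stime;

          task_cputime(tsk, &utime, &stime);
          stats->ac_etime = div_u64(now - tsk->start_time, NSEC_PER_USEC);
          stats->ac_utime = div_u64(utime, NSEC_PER_USEC);
          stats->ac_stime = div_u64(stime, NSEC_PER_USEC);
  }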
2017-03-19kernel/hung_task.c: defer showing held locksTetsuo Handa
When I was running my testcase, which may block hundreds of threads on fs locks, I got a lockup due to output from debug_show_all_locks() added by commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").

For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state and 500 out of those 1000 threads hold some lock, debug_show_all_locks() called from the for_each_process_thread() loop will report the locks held by 500 threads, 1000 times over. This is far too much noise.

In order to make sure rcu_lock_break() is called frequently, we should avoid calling debug_show_all_locks() from the for_each_process_thread() loop, because debug_show_all_locks() effectively runs a for_each_process_thread() loop of its own. Let's defer calling debug_show_all_locks() until just before panic(), or until after leaving the for_each_process_thread() loop.

Link: http://lkml.kernel.org/r/1489296834-60436-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
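A minimal sketch of the deferral (the flag name follows the description; the surrounding check and report functions are illustrative):

  static bool hung_task_show_lock;

  /* inside the for_each_process_thread() loop: only note the problem */
  static void check_hung_task_sketch(struct task_struct *t)
  {
          if (t->state == TASK_UNINTERRUPTIBLE)
                  hung_task_show_lock = true;   /* defer the lock dump */
  }

  /* after the loop ends (and just before any panic()): report once */
  static void report_hung_tasks_sketch(void)
  {
          if (hung_task_show_lock)
                  debug_show_all_locks();
  }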
2017-03-19mm: introduce memalloc_nofs_{save,restore} APIMichal Hocko
GFP_NOFS context is currently used for five reasons:

 - to prevent deadlocks when a lock held by the allocation context would be needed during memory reclaim
 - to prevent stack overflows during reclaim, because the allocation is performed from an already-deep context
 - to prevent lockups when the allocation context depends on other reclaimers to make forward progress indirectly
 - just in case, because this would be safe from the fs POV
 - to silence lockdep false positives

Unfortunately, overuse of this allocation context brings some problems to the MM. Memory reclaim is much weaker (especially during heavy FS metadata workloads), and the OOM killer cannot be invoked because the MM layer doesn't have enough information about how much memory is freeable by the FS layer. In many cases it is far from clear why the weaker context is even used, so it might be used unnecessarily. We would like to get rid of those cases as much as possible. One way to do that is to use the flag in scopes rather than in isolated cases. Such a scope is declared when really necessary, tracked per task, and all the allocation requests from within the context simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain (there are far fewer problematic contexts than specific allocation requests), it also helps code paths where the FS layer interacts with other layers (e.g. crypto, security modules, MM etc.) and there is no easy way to convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope of the GFP_NOFS allocation context; see the sketch below. This basically copies the memalloc_noio_{save,restore} API we have for the other restricted allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists as an alias for PF_FSTRANS, which had been xfs-specific until recently. There are no PF_FSTRANS users anymore, so just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags is renamed to current_gfp_context because it now cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.

Xfs code paths preserve their semantics. kmem_flags_convert() doesn't need to evaluate the flag anymore. This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS) usage as much as possible and only use properly documented memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
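A minimal sketch of the new pair, mirroring memalloc_noio_{save,restore} (PF_MEMALLOC_NOFS is the renamed PF_FSTRANS flag described above):

  static inline unsigned int memalloc_nofs_save(void)
  {
          unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

          current->flags |= PF_MEMALLOC_NOFS;
          return flags;
  }

  static inline void memalloc_nofs_restore(unsigned int flags)
  {
          current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
  }

A filesystem would then bracket a reclaim-unsafe scope instead of tagging each allocation (the helper in the middle is hypothetical):

  unsigned int nofs_flags = memalloc_nofs_save();
  /* every allocation in here implicitly loses __GFP_FS */
  error = do_transaction_work();
  memalloc_nofs_restore(nofs_flags);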
2017-03-19lockdep: allow to disable reclaim lockup detectionMichal Hocko
The current implementation of the reclaim lockup detection can lead to false positives, and those do happen; they usually lead to tweaking the code to silence lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

  =================================
  [ INFO: inconsistent lock state ]
  4.5.0-rc2+ #4 Tainted: G           O
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
  {RECLAIM_FS-ON-R} state was registered at:
    [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
    [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
    [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
    [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
    [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
    [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
    [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
    [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
    [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
    [<ffffffff81244459>] SyS_ioctl+0x79/0x90
    [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

         CPU0
         ----
    lock(&xfs_nondir_ilock_class);
    <Interrupt>
      lock(&xfs_nondir_ilock_class);

   *** DEADLOCK ***

  3 locks held by kswapd0/543:

  stack backtrace:
  CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
  Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
   ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
   ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
   ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
  Call Trace:
    [<ffffffff813a14f9>] dump_stack+0x4b/0x72
    [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
    [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
    [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
    [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
    [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
    [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
    [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
    [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
    [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
    [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
    [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
    [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
    [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
    [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
    [<ffffffff8124d7d5>] evict+0xc5/0x190
    [<ffffffff8124d8d9>] dispose_list+0x39/0x60
    [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
    [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
    [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
    [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
    [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
    [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
    [<ffffffff810e6882>] kthread+0xf2/0x110
    [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
    [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
    [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:

 "Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode.

 So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock..."

This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special usecase IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add an "I know what I am doing" flag and mark problematic places. This would prevent abuse of the GFP_NOFS flag, which has a runtime effect even on configurations which have lockdep disabled.

Introduce a __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request; a sketch of its use follows this entry. While we are at it, also make sure that the radix tree doesn't accidentally override tags stored in the upper part of the gfp_mask.

Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
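A sketch of how a problematic place would be marked with the new flag (the cache name and call site are illustrative, not from the patch):

  /*
   * Lockdep has seen this allocation inside a transaction (GFP_NOFS)
   * and outside of one (GFP_KERNEL) and cannot tell the cases apart.
   * We know the GFP_KERNEL case is safe, so opt this request out of
   * the reclaim tracking instead of degrading it to GFP_NOFS.
   */
  cur = kmem_cache_alloc(btree_cur_cache, GFP_KERNEL | __GFP_NOLOCKDEP);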
2017-03-19lockdep: teach lockdep about memalloc_noio_saveNikolay Borisov
Patch series "scope GFP_NOFS api", v5. This patch (of 7): Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O during memory allocation") added the memalloc_noio_(save|restore) functions to enable people to modify the MM behavior by disabling I/O during memory allocation. This was further extended in Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"). memalloc_noio_* functions prevent allocation paths recursing back into the filesystem without explicitly changing the flags for every allocation site. However, lockdep hasn't been keeping up with the changes and it entirely misses handling the memalloc_noio adjustments. Instead, it is left to the callers of __lockdep_trace_alloc to call the function after they have shaven the respective GFP flags which can lead to false positives: [ 644.173373] ================================= [ 644.174012] [ INFO: inconsistent lock state ] [ 644.174012] 4.10.0-nbor #134 Not tainted [ 644.174012] --------------------------------- [ 644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. [ 644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 644.174012] (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffff8136f231>] xfs_ilock+0x141/0x230 [ 644.174012] {IN-RECLAIM_FS-W} state was registered at: [ 644.174012] __lock_acquire+0x62a/0x17c0 [ 644.174012] lock_acquire+0xc5/0x220 [ 644.174012] down_write_nested+0x4f/0x90 [ 644.174012] xfs_ilock+0x141/0x230 [ 644.174012] xfs_reclaim_inode+0x12a/0x320 [ 644.174012] xfs_reclaim_inodes_ag+0x2c8/0x4e0 [ 644.174012] xfs_reclaim_inodes_nr+0x33/0x40 [ 644.174012] xfs_fs_free_cached_objects+0x19/0x20 [ 644.174012] super_cache_scan+0x191/0x1a0 [ 644.174012] shrink_slab+0x26f/0x5f0 [ 644.174012] shrink_node+0xf9/0x2f0 [ 644.174012] kswapd+0x356/0x920 [ 644.174012] kthread+0x10c/0x140 [ 644.174012] ret_from_fork+0x31/0x40 [ 644.174012] irq event stamp: 173777 [ 644.174012] hardirqs last enabled at (173777): [<ffffffff8105b440>] __local_bh_enable_ip+0x70/0xc0 [ 644.174012] hardirqs last disabled at (173775): [<ffffffff8105b407>] __local_bh_enable_ip+0x37/0xc0 [ 644.174012] softirqs last enabled at (173776): [<ffffffff81357e2a>] _xfs_buf_find+0x67a/0xb70 [ 644.174012] softirqs last disabled at (173774): [<ffffffff81357d8b>] _xfs_buf_find+0x5db/0xb70 [ 644.174012] [ 644.174012] other info that might help us debug this: [ 644.174012] Possible unsafe locking scenario: [ 644.174012] [ 644.174012] CPU0 [ 644.174012] ---- [ 644.174012] lock(&xfs_nondir_ilock_class); [ 644.174012] <Interrupt> [ 644.174012] lock(&xfs_nondir_ilock_class); [ 644.174012] [ 644.174012] *** DEADLOCK *** [ 644.174012] [ 644.174012] 4 locks held by fsstress/3365: [ 644.174012] #0: (sb_writers#10){++++++}, at: [<ffffffff81208d04>] mnt_want_write+0x24/0x50 [ 644.174012] #1: (&sb->s_type->i_mutex_key#12){++++++}, at: [<ffffffff8120ea2f>] vfs_setxattr+0x6f/0xb0 [ 644.174012] #2: (sb_internal#2){++++++}, at: [<ffffffff8138185c>] xfs_trans_alloc+0xfc/0x140 [ 644.174012] #3: (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffff8136f231>] xfs_ilock+0x141/0x230 [ 644.174012] [ 644.174012] stack backtrace: [ 644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134 [ 644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 644.174012] Call Trace: [ 644.174012] dump_stack+0x85/0xc9 [ 644.174012] print_usage_bug.part.37+0x284/0x293 [ 644.174012] ? 
print_shortest_lock_dependencies+0x1b0/0x1b0 [ 644.174012] mark_lock+0x27e/0x660 [ 644.174012] mark_held_locks+0x66/0x90 [ 644.174012] lockdep_trace_alloc+0x6f/0xd0 [ 644.174012] kmem_cache_alloc_node_trace+0x3a/0x2c0 [ 644.174012] ? vm_map_ram+0x2a1/0x510 [ 644.174012] vm_map_ram+0x2a1/0x510 [ 644.174012] ? vm_map_ram+0x46/0x510 [ 644.174012] _xfs_buf_map_pages+0x77/0x140 [ 644.174012] xfs_buf_get_map+0x185/0x2a0 [ 644.174012] xfs_attr_rmtval_set+0x233/0x430 [ 644.174012] xfs_attr_leaf_addname+0x2d2/0x500 [ 644.174012] xfs_attr_set+0x214/0x420 [ 644.174012] xfs_xattr_set+0x59/0xb0 [ 644.174012] __vfs_setxattr+0x76/0xa0 [ 644.174012] __vfs_setxattr_noperm+0x5e/0xf0 [ 644.174012] vfs_setxattr+0xae/0xb0 [ 644.174012] ? __might_fault+0x43/0xa0 [ 644.174012] setxattr+0x15e/0x1a0 [ 644.174012] ? __lock_is_held+0x53/0x90 [ 644.174012] ? rcu_read_lock_sched_held+0x93/0xa0 [ 644.174012] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 644.174012] ? __sb_start_write+0x130/0x1d0 [ 644.174012] ? mnt_want_write+0x24/0x50 [ 644.174012] path_setxattr+0x8f/0xc0 [ 644.174012] SyS_lsetxattr+0x11/0x20 [ 644.174012] entry_SYSCALL_64_fastpath+0x23/0xc6 Let's fix this by making lockdep explicitly do the shaving of respective GFP flags. Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set") Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
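A sketch of the fix's effect: the flag shaving happens inside lockdep's own hook rather than in every caller. The helper name below is hypothetical; the masking mirrors what memalloc_noio_flags() does after commit 934f3072c17c (clear both __GFP_IO and __GFP_FS under PF_MEMALLOC_NOIO):

  /* sketch: applied before lockdep classifies the allocation */
  static gfp_t lockdep_effective_gfp(gfp_t gfp_mask)
  {
          if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                  gfp_mask &= ~(__GFP_IO | __GFP_FS);
          return gfp_mask;
  }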
2017-03-19mm: update callers to use HASH_ZERO flagPavel Tatashin
Update dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove the initialization after the allocations. In places where HASH_EARLY was used, such as in __pv_init_lock_hash, the zeroed hash table was already assumed, because memblock zeroes the memory.

CPU: SPARC M6, Memory: 7T
Before fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 11.798s

After fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.245s

After fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.244s

Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
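The shape of the change at a typical call site (a sketch based on the dcache table; the numeric scale argument is illustrative):

  /* before: allocate, then walk every bucket to zero it */
  dentry_hashtable =
          alloc_large_system_hash("Dentry cache",
                                  sizeof(struct hlist_bl_head),
                                  dhash_entries, 13, 0,
                                  &d_hash_shift, &d_hash_mask, 0, 0);
  for (loop = 0; loop < (1U << d_hash_shift); loop++)
          INIT_HLIST_BL_HEAD(dentry_hashtable + loop);

  /* after: request pre-zeroed memory and drop the loop entirely */
  dentry_hashtable =
          alloc_large_system_hash("Dentry cache",
                                  sizeof(struct hlist_bl_head),
                                  dhash_entries, 13, HASH_ZERO,
                                  &d_hash_shift, &d_hash_mask, 0, 0);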
2017-03-18Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull CPU hotplug fix from Thomas Gleixner:
 "A single fix preventing the concurrent execution of the CPU hotplug callback install/invocation machinery. Long standing bug caused by a massive brain slip of that Gleixner dude, which went unnoticed for almost a year"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Serialize callback invocations proper
2017-03-17Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull perf fixes from Thomas Gleixner:
 "A set of perf related fixes:

  - fix a CR4.PCE propagation issue caused by usage of mm instead of active_mm, which therefore propagated the wrong value

  - perf core fixes, which plug a use-after-free issue and make the event inheritance on fork more robust

  - a tooling fix for symbol handling"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf symbols: Fix symbols__fixup_end heuristic for corner cases
  x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
  x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
  perf/core: Better explain the inherit magic
  perf/core: Simplify perf_event_free_task()
  perf/core: Fix event inheritance on fork()
  perf/core: Fix use-after-free in perf_release()
2017-03-17Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull scheduler fixes from Thomas Gleixner:
 "From the scheduler department:

  - a bunch of sched deadline related fixes which deal with various buglets and corner cases

  - two fixes for the loadavg spikes which are caused by the delayed NOHZ accounting"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Use deadline instead of period when calculating overflow
  sched/deadline: Throttle a constrained deadline task activated after the deadline
  sched/deadline: Make sure the replenishment timer fires in the next period
  sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
  sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
  sched/deadline: Add missing update_rq_clock() in dl_task_timer()
2017-03-17Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull locking fixes from Thomas Gleixner:
 "Three fixes related to locking:

  - fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed for the XCHGADD variant already

  - plug a potential use after free in the futex code

  - prevent leaking a held spinlock in a futex error handling code path"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
  futex: Add missing error handling to FUTEX_REQUEUE_PI
  futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
2017-03-17hrtimer: Remove hrtimer_peek_ahead_timers() leftoversStephen Boyd
This function was removed in commit c6eb3f70d448 (hrtimer: Get rid of hrtimer softirq, 2015-04-14) but the prototype wasn't ever deleted. Delete it now. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Link: http://lkml.kernel.org/r/20170317010814.2591-1-sboyd@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-17Merge branch 'for-4.11-fixes' into for-nextTejun Heo
2017-03-17cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroupsTejun Heo
Creation of a kthread goes through a couple of interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator can then further configure the kthread and let it start doing its job by waking it up again.

In this configuration-by-creator stage, the creator is the only one that can wake it up, but the kthread is already visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups, as that can affect the CPU affinity and many other things.

Unfortunately, the cgroup side of the protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrect CPU affinity, breaking concurrency management and sometimes stalling workqueue execution.

This patch adds task->no_cgroup_migration, which disallows the task to be migrated by userland. kthreadd starts with the flag set, making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization, by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing, but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from the waiting side. Even if userland depends on changing the cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
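A sketch of the v2 implementation described above, a task_struct bit field consulted on the cgroup migration path (the gate helper is illustrative; only the member name comes from the description):

  struct task_struct {
          /* ... existing members ... */
  #ifdef CONFIG_CGROUPS
          /* disallow userland-initiated cgroup migration */
          unsigned                no_cgroup_migration:1;
  #endif
  };

  /* illustrative gate in the cgroup attach path */
  static int cgroup_attach_permitted_sketch(struct task_struct *tsk)
  {
          return tsk->no_cgroup_migration ? -EINVAL : 0;
  }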
2017-03-16bpf: inline htab_map_lookup_elem()Alexei Starovoitov
Optimize:

  bpf_call bpf_map_lookup_elem
    map->ops->map_lookup_elem
      htab_map_lookup_elem
        __htab_map_lookup_elem

into:

  bpf_call __htab_map_lookup_elem

to improve performance of JITed programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16bpf: add helper inlining infra and optimize map_array lookupAlexei Starovoitov
Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem() into a sequence of bpf instructions. When the JIT is on, that sequence of bpf instructions becomes a sequence of native CPU instructions, with significantly better performance than an indirect call plus two functions' prologue/epilogue.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
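The reason a short instruction sequence suffices: the helper is only a bounds check plus pointer arithmetic. A C-level sketch of what the emitted bpf instructions compute (the function name is illustrative):

  static void *array_map_lookup_sketch(struct bpf_array *array, u32 index)
  {
          if (index >= array->map.max_entries)
                  return NULL;
          return array->value + array->elem_size * index;
  }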
2017-03-16bpf: adjust insn_aux_data when patching insnsAlexei Starovoitov
convert_ctx_accesses() replaces a single bpf instruction with a set of instructions. Adjust the corresponding insn_aux_data while patching. This is needed to make sure subsequent 'for (all insn)' loops have matching insn and insn_aux_data.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16bpf: refactor fixup_bpf_calls()Alexei Starovoitov
Reduce the indentation and make fixup_bpf_calls() iterate over instructions the same way convert_ctx_accesses() does. Also convert a hard BUG_ON into a soft verifier error.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>