Age | Commit message (Collapse) | Author |
|
Currently trace_handle_return() looks like this:
static inline enum print_line_t trace_handle_return(struct trace_seq *s)
{
return trace_seq_has_overflowed(s) ?
TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
}
Where trace_seq_overflowed(s) is:
static inline bool trace_seq_has_overflowed(struct trace_seq *s)
{
return s->full || seq_buf_has_overflowed(&s->seq);
}
And seq_buf_has_overflowed(&s->seq) is:
static inline bool
seq_buf_has_overflowed(struct seq_buf *s)
{
return s->len > s->size;
}
Making trace_handle_return() into:
return (s->full || (s->seq->len > s->seq->size)) ?
TRACE_TYPE_PARTIAL_LINE :
TRACE_TYPE_HANDLED;
One would think this is not an issue to keep as an inline. But because
this is used in the TRACE_EVENT() macro, it is extended for every
tracepoint in the system. Taking a look at a single tracepoint
x86_irq_vector (was the first one I randomly chosen). As
trace_handle_return is used in the TRACE_EVENT() macro of
trace_raw_output_##call() we disassemble trace_raw_output_x86_irq_vector
and do a diff. I removed identical lines that were different just due to
different addresses.
The original has 22 bytes of text more than the out of line version. As
this is for every TRACE_EVENT() defined in the system, this can become
quite large.
text data bss dec hex filename
8690305 5450490 1298432 15439227 eb957b vmlinux-orig
8681725 5450490 1298432 15430647 eb73f7 vmlinux-handle
This change has a total of 8580 bytes in savings.
$ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
324
That's 324 tracepoints. But this does not include modules (which contain
many more tracepoints). For an allyesconfig build:
$ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
1401
That's 1401 tracepoints giving us:
text data bss dec hex filename
137827709 140221067 53264384 331313160 13bf7008 vmlinux-allyes-handle
137920629 140221067 53264384 331406080 13c0db00 vmlinux-allyes-orig
92920 bytes in savings!!!
Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
Link: http://lkml.kernel.org/r/20170316113459.2366588b@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is a very complex function, which is called in multiple places. It
is unlikely that inlining or not inlining it makes any difference for its
run time.
This saves around 13k text in my kernel
text data bss dec hex filename
9083992 5367600 11116544 25568136 1862388 vmlinux-before-load-avg
9070166 5367600 11116544 25554310 185ed86 vmlinux-load-avg
Link: http://lkml.kernel.org/r/20170315021431.13107-4-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
set_memory_* functions have moved to set_memory.h. Switch to this
explicitly.
Link: http://lkml.kernel.org/r/1488920133-27229-13-git-send-email-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
set_memory_* functions have moved to set_memory.h. Switch to this
explicitly.
Link: http://lkml.kernel.org/r/1488920133-27229-12-git-send-email-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__vmalloc* allows users to provide gfp flags for the underlying
allocation. This API is quite popular
$ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
77
The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space. About half of users don't
use this flag, though. This signals that we make the API unnecessarily
too complex.
This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are
simplified and drop the flag.
Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Updating ->srcu_state and ->srcu_gp_seq will lead to extremely complex
race conditions given multiple callback queues, so this commit takes
advantage of the two-bit state now available in rcu_seq counters to
store the state in the bottom two bits of ->srcu_gp_seq.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
In both kmalloc and prealloc mode the bpf_map_update_elem() is using
per-cpu extra_elems to do atomic update when the map is full.
There are two issues with it. The logic can be misused, since it allows
max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
at map creation time can fail percpu alloc for large map values with a warn:
WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
illegal size (32824) or align (8) for percpu allocation
The fixes for both of these issues are different for kmalloc and prealloc modes.
For prealloc mode allocate extra num_possible_cpus elements and store
their pointers into extra_elems array instead of actual elements.
Hence we can use these hidden(spare) elements not only when the map is full
but during bpf_map_update_elem() that replaces existing element too.
That also improves performance, since pcpu_freelist_pop/push is avoided.
Unfortunately this approach cannot be used for kmalloc mode which needs
to kfree elements after rcu grace period. Therefore switch it back to normal
kmalloc even when full and old element exists like it was prior to
commit 6c9059817432 ("bpf: pre-allocate hash map elements").
Add tests to check for over max_entries and large map values.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
'intel_pstate-fixes' and 'pm-cpuidle-fixes' into linux-next
* pm-cpufreq-fixes:
cpufreq: Restore policy min/max limits on CPU online
* pm-cpufreq-sched-fixes:
cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
* intel_pstate-fixes:
cpufreq: intel_pstate: Fix policy data management in passive mode
* pm-cpuidle-fixes:
cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
|
|
This commit increases the number of reserved bits at the bottom of an
rcu_seq grace-period counter from one to two, as will be needed to
accommodate SRCU's three-state grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
The expedited grace-period code contains several open-coded shifts
know the format of an rcu_seq grace-period counter, which is not
particularly good style. This commit therefore creates a new
rcu_seq_ctr() function that extracts the counter portion of the
counter, and an rcu_seq_state() function that extracts the low-order
state bit. This commit prepares for SRCU callback parallelization,
which will require two state bits.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.
That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.
Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
|
|
linux-next
* pm-cpufreq:
cpufreq: dbx500: Manage cooling device from cpufreq driver
MAINTAINERS: Add file patterns for cpufreq device tree bindings
cpufreq: qoriq: enhance bus frequency calculation
cpufreq: mediatek: Add support for MT8176 and MT817x
cpufreq: mt8173: Mark mt8173_cpufreq_driver_init as __init
* pm-cpufreq-sched:
cpufreq: schedutil: Refactor sugov_next_freq_shared()
cpufreq: schedutil: Redefine the rate_limit_us tunable
* intel_pstate:
cpufreq: intel_pstate: Drop redundant wrapper function
|
|
include linux/sched/cputime.h for task_cputime()
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Zhang Xiao <xiao.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
run ktime_get_ns() a single time
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Zhang Xiao <xiao.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The elapsed time, user CPU time and system CPU time for the thread group
status request are presently left at zero. Fill these in.
Link: http://lkml.kernel.org/r/1488508424-12322-1-git-send-email-xiao.zhang@windriver.com
Signed-off-by: Zhang Xiao <xiao.zhang@windriver.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When I was running my testcase which may block hundreds of threads on fs
locks, I got lockup due to output from debug_show_all_locks() added by
commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").
For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state
and 500 out of 1000 threads hold some lock, debug_show_all_locks() from
for_each_process_thread() loop will report locks held by 500 threads for
1000 times. This is a too much noise.
In order to make sure rcu_lock_break() is called frequently, we should
avoid calling debug_show_all_locks() from for_each_process_thread() loop
because debug_show_all_locks() effectively calls for_each_process_thread()
loop. Let's defer calling debug_show_all_locks() till before panic() or
leaving for_each_process_thread() loop.
Link: http://lkml.kernel.org/r/1489296834-60436-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
GFP_NOFS context is used for the following 5 reasons currently
- to prevent from deadlocks when the lock held by the allocation
context would be needed during the memory reclaim
- to prevent from stack overflows during the reclaim because
the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
other reclaimers to make a forward progress indirectly
- just in case because this would be safe from the fs POV
- silence lockdep false positives
Unfortunately overuse of this allocation context brings some problems to
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.
In many cases it is far from clear why the weaker context is even used and
so it might be used unnecessarily. We would like to get rid of those as
much as possible. One way to do that is to use the flag in scopes rather
than isolated cases. Such a scope is declared when really necessary,
tracked per task and all the allocation requests from within the context
will simply inherit the GFP_NOFS semantic.
Not only this is easier to understand and maintain because there are much
less problematic contexts than specific allocation requests, this also
helps code paths where FS layer interacts with other layers (e.g. crypto,
security modules, MM etc...) and there is no easy way to convey the
allocation context between the layers.
Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is just
an alias for PF_FSTRANS which has been xfs specific until recently. There
are no more PF_FSTRANS users anymore so let's just drop it.
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.
This patch shouldn't introduce any functional changes.
Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
{RECLAIM_FS-ON-R} state was registered at:
[<ffffffff8110f369>] mark_held_locks+0x79/0xa0
[<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
[<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
[<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
[<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
[<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
[<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
[<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
[<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
[<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
[<ffffffff81244459>] SyS_ioctl+0x79/0x90
[<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
CPU0
----
lock(&xfs_nondir_ilock_class);
<Interrupt>
lock(&xfs_nondir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/543:
stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
[<ffffffff813a14f9>] dump_stack+0x4b/0x72
[<ffffffff8110ec65>] print_usage_bug+0x215/0x240
[<ffffffff8110ee85>] mark_lock+0x1f5/0x660
[<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
[<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
[<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
[<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
[<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
[<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
[<ffffffff8110aace>] down_write_nested+0x5e/0xc0
[<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
[<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
[<ffffffff8124d7d5>] evict+0xc5/0x190
[<ffffffff8124d8d9>] dispose_list+0x39/0x60
[<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
[<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
[<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
[<ffffffff811e50ee>] shrink_zone+0x15e/0x170
[<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
[<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
[<ffffffff810e6882>] kthread+0xf2/0x110
[<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
[<ffffffff8184803f>] ret_from_fork+0x3f/0x70
[<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.
So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"
This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.
Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.
While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.
Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "scope GFP_NOFS api", v5.
This patch (of 7):
Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore)
functions to enable people to modify the MM behavior by disabling I/O
during memory allocation. This was further extended in Fixes:
934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set").
memalloc_noio_* functions prevent allocation paths recursing back into the
filesystem without explicitly changing the flags for every allocation
site. However, lockdep hasn't been keeping up with the changes and it
entirely misses handling the memalloc_noio adjustments. Instead, it is
left to the callers of __lockdep_trace_alloc to call the function after
they have shaven the respective GFP flags which can lead to false
positives:
[ 644.173373] =================================
[ 644.174012] [ INFO: inconsistent lock state ]
[ 644.174012] 4.10.0-nbor #134 Not tainted
[ 644.174012] ---------------------------------
[ 644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[ 644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 644.174012] (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffff8136f231>] xfs_ilock+0x141/0x230
[ 644.174012] {IN-RECLAIM_FS-W} state was registered at:
[ 644.174012] __lock_acquire+0x62a/0x17c0
[ 644.174012] lock_acquire+0xc5/0x220
[ 644.174012] down_write_nested+0x4f/0x90
[ 644.174012] xfs_ilock+0x141/0x230
[ 644.174012] xfs_reclaim_inode+0x12a/0x320
[ 644.174012] xfs_reclaim_inodes_ag+0x2c8/0x4e0
[ 644.174012] xfs_reclaim_inodes_nr+0x33/0x40
[ 644.174012] xfs_fs_free_cached_objects+0x19/0x20
[ 644.174012] super_cache_scan+0x191/0x1a0
[ 644.174012] shrink_slab+0x26f/0x5f0
[ 644.174012] shrink_node+0xf9/0x2f0
[ 644.174012] kswapd+0x356/0x920
[ 644.174012] kthread+0x10c/0x140
[ 644.174012] ret_from_fork+0x31/0x40
[ 644.174012] irq event stamp: 173777
[ 644.174012] hardirqs last enabled at (173777): [<ffffffff8105b440>] __local_bh_enable_ip+0x70/0xc0
[ 644.174012] hardirqs last disabled at (173775): [<ffffffff8105b407>] __local_bh_enable_ip+0x37/0xc0
[ 644.174012] softirqs last enabled at (173776): [<ffffffff81357e2a>] _xfs_buf_find+0x67a/0xb70
[ 644.174012] softirqs last disabled at (173774): [<ffffffff81357d8b>] _xfs_buf_find+0x5db/0xb70
[ 644.174012]
[ 644.174012] other info that might help us debug this:
[ 644.174012] Possible unsafe locking scenario:
[ 644.174012]
[ 644.174012] CPU0
[ 644.174012] ----
[ 644.174012] lock(&xfs_nondir_ilock_class);
[ 644.174012] <Interrupt>
[ 644.174012] lock(&xfs_nondir_ilock_class);
[ 644.174012]
[ 644.174012] *** DEADLOCK ***
[ 644.174012]
[ 644.174012] 4 locks held by fsstress/3365:
[ 644.174012] #0: (sb_writers#10){++++++}, at: [<ffffffff81208d04>] mnt_want_write+0x24/0x50
[ 644.174012] #1: (&sb->s_type->i_mutex_key#12){++++++}, at: [<ffffffff8120ea2f>] vfs_setxattr+0x6f/0xb0
[ 644.174012] #2: (sb_internal#2){++++++}, at: [<ffffffff8138185c>] xfs_trans_alloc+0xfc/0x140
[ 644.174012] #3: (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffff8136f231>] xfs_ilock+0x141/0x230
[ 644.174012]
[ 644.174012] stack backtrace:
[ 644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
[ 644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 644.174012] Call Trace:
[ 644.174012] dump_stack+0x85/0xc9
[ 644.174012] print_usage_bug.part.37+0x284/0x293
[ 644.174012] ? print_shortest_lock_dependencies+0x1b0/0x1b0
[ 644.174012] mark_lock+0x27e/0x660
[ 644.174012] mark_held_locks+0x66/0x90
[ 644.174012] lockdep_trace_alloc+0x6f/0xd0
[ 644.174012] kmem_cache_alloc_node_trace+0x3a/0x2c0
[ 644.174012] ? vm_map_ram+0x2a1/0x510
[ 644.174012] vm_map_ram+0x2a1/0x510
[ 644.174012] ? vm_map_ram+0x46/0x510
[ 644.174012] _xfs_buf_map_pages+0x77/0x140
[ 644.174012] xfs_buf_get_map+0x185/0x2a0
[ 644.174012] xfs_attr_rmtval_set+0x233/0x430
[ 644.174012] xfs_attr_leaf_addname+0x2d2/0x500
[ 644.174012] xfs_attr_set+0x214/0x420
[ 644.174012] xfs_xattr_set+0x59/0xb0
[ 644.174012] __vfs_setxattr+0x76/0xa0
[ 644.174012] __vfs_setxattr_noperm+0x5e/0xf0
[ 644.174012] vfs_setxattr+0xae/0xb0
[ 644.174012] ? __might_fault+0x43/0xa0
[ 644.174012] setxattr+0x15e/0x1a0
[ 644.174012] ? __lock_is_held+0x53/0x90
[ 644.174012] ? rcu_read_lock_sched_held+0x93/0xa0
[ 644.174012] ? rcu_sync_lockdep_assert+0x2f/0x60
[ 644.174012] ? __sb_start_write+0x130/0x1d0
[ 644.174012] ? mnt_want_write+0x24/0x50
[ 644.174012] path_setxattr+0x8f/0xc0
[ 644.174012] SyS_lsetxattr+0x11/0x20
[ 644.174012] entry_SYSCALL_64_fastpath+0x23/0xc6
Let's fix this by making lockdep explicitly do the shaving of respective
GFP flags.
Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove initialization after allocations. In case of places
where HASH_EARLY was used such as in __pv_init_lock_hash the zeroed hash
table was already assumed, because memblock zeroes the memory.
CPU: SPARC M6, Memory: 7T
Before fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 11.798s
After fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 3.198s
CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.245s
After fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.244s
Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Thomas Gleixner:
"A single fix preventing the concurrent execution of the CPU hotplug
callback install/invocation machinery. Long standing bug caused by a
massive brain slip of that Gleixner dude, which went unnoticed for
almost a year"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Serialize callback invocations proper
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of perf related fixes:
- fix a CR4.PCE propagation issue caused by usage of mm instead of
active_mm and therefore propagated the wrong value.
- perf core fixes, which plug a use-after-free issue and make the
event inheritance on fork more robust.
- a tooling fix for symbol handling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf symbols: Fix symbols__fixup_end heuristic for corner cases
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
perf/core: Better explain the inherit magic
perf/core: Simplify perf_event_free_task()
perf/core: Fix event inheritance on fork()
perf/core: Fix use-after-free in perf_release()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"From the scheduler departement:
- a bunch of sched deadline related fixes which deal with various
buglets and corner cases.
- two fixes for the loadavg spikes which are caused by the delayed
NOHZ accounting"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use deadline instead of period when calculating overflow
sched/deadline: Throttle a constrained deadline task activated after the deadline
sched/deadline: Make sure the replenishment timer fires in the next period
sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
sched/deadline: Add missing update_rq_clock() in dl_task_timer()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Three fixes related to locking:
- fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
for the XCHGADD variant already
- plug a potential use after free in the futex code
- prevent leaking a held spinlock in an futex error handling code
path"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
futex: Add missing error handling to FUTEX_REQUEUE_PI
futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
|
|
This function was removed in commit c6eb3f70d448 (hrtimer: Get rid of
hrtimer softirq, 2015-04-14) but the prototype wasn't ever deleted.
Delete it now.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/20170317010814.2591-1-sboyd@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
|
|
non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Optimize:
bpf_call
bpf_map_lookup_elem
map->ops->map_lookup_elem
htab_map_lookup_elem
__htab_map_lookup_elem
into:
bpf_call
__htab_map_lookup_elem
to improve performance of JITed programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
into a sequence of bpf instructions.
When JIT is on the sequence of bpf instructions is the sequence
of native cpu instructions with significantly faster performance
than indirect call and two function's prologue/epilogue.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
convert_ctx_accesses() replaces single bpf instruction with a set of
instructions. Adjust corresponding insn_aux_data while patching.
It's needed to make sure subsequent 'for(all insn)' loops
have matching insn and insn_aux_data.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
reduce indent and make it iterate over instructions similar to
convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|