aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2012-10-09mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely ↵Sagi Grimberg
schedule With an RCU based mmu_notifier implementation, any callout to mmu_notifier_invalidate_range_{start,end}() or mmu_notifier_invalidate_page() would not be allowed to call schedule() as that could potentially allow a modification to the mmu_notifier structure while it is currently being used. Since srcu allocs 4 machine words per instance per cpu, we may end up with memory exhaustion if we use srcu per mm. So all mms share a global srcu. Note that during large mmu_notifier activity exit & unregister paths might hang for longer periods, but it is tolerable for current mmu_notifier clients. Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Haggai Eran <haggaie@mellanox.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()Mel Gorman
Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") introduced a potential memory corruption. shmem_alloc_page() uses a pseudo vma and it has one significant unique combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED. get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL and mpol_cond_put() DOES decrease a policy ref when a policy has MPOL_F_SHARED. Therefore, when a cpuset update race occurs, alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the reference count and frees the policy prematurely. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mempolicy: fix refcount leak in mpol_set_shared_policy()KOSAKI Motohiro
When shared_policy_replace() fails to allocate new->policy is not freed correctly by mpol_set_shared_policy(). The problem is that shared mempolicy code directly call kmem_cache_free() in multiple places where it is easy to make a mistake. This patch creates an sp_free wrapper function and uses it. The bug was introduced pre-git age (IOW, before 2.6.12-rc2). [mgorman@suse.de: Editted changelog] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mempolicy: fix a race in shared_policy_replace()Mel Gorman
shared_policy_replace() use of sp_alloc() is unsafe. 1) sp_node cannot be dereferenced if sp->lock is not held and 2) another thread can modify sp_node between spin_unlock for allocating a new sp node and next spin_lock. The bug was introduced before 2.6.12-rc2. Kosaki's original patch for this problem was to allocate an sp node and policy within shared_policy_replace and initialise it when the lock is reacquired. I was not keen on this approach because it partially duplicates sp_alloc(). As the paths were sp->lock is taken are not that performance critical this patch converts sp->lock to sp->mutex so it can sleep when calling sp_alloc(). [kosaki.motohiro@jp.fujitsu.com: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mempolicy: remove mempolicy sharingKOSAKI Motohiro
Dave Jones' system call fuzz testing tool "trinity" triggered the following bug error with slab debugging enabled ============================================================================= BUG numa_policy (Not tainted): Poison overwritten ----------------------------------------------------------------------------- INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154 __slab_alloc+0x3d3/0x445 kmem_cache_alloc+0x29d/0x2b0 mpol_new+0xa3/0x140 sys_mbind+0x142/0x620 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154 __slab_free+0x2e/0x1de kmem_cache_free+0x25a/0x260 __mpol_put+0x27/0x30 remove_vma+0x68/0x90 exit_mmap+0x118/0x140 mmput+0x73/0x110 exit_mm+0x108/0x130 do_exit+0x162/0xb90 do_group_exit+0x4f/0xc0 sys_exit_group+0x17/0x20 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080 INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0 The problem is that the structure is being prematurely freed due to a reference count imbalance. In the following case mbind(addr, len) should replace the memory policies of both vma1 and vma2 and thus they will become to share the same mempolicy and the new mempolicy will have the MPOL_F_SHARED flag. +-------------------+-------------------+ | vma1 | vma2(shmem) | +-------------------+-------------------+ | | addr addr+len alloc_pages_vma() uses get_vma_policy() and mpol_cond_put() pair for maintaining the mempolicy reference count. The current rule is that get_vma_policy() only increments refcount for shmem VMA and mpol_conf_put() only decrements refcount if the policy has MPOL_F_SHARED. In above case, vma1 is not shmem vma and vma->policy has MPOL_F_SHARED! The reference count will be decreased even though was not increased whenever alloc_page_vma() is called. This has been broken since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting] in 2008. There is another serious bug with the sharing of memory policies. Currently, mempolicy rebind logic (it is called from cpuset rebinding) ignores a refcount of mempolicy and override it forcibly. Thus, any mempolicy sharing may cause mempolicy corruption. The bug was introduced by commit [68860ec1: cpusets: automatic numa mempolicy rebinding]. Ideally, the shared policy handling would be rewritten to either properly handle COW of the policy structures or at least reference count MPOL_F_SHARED based exclusively on information within the policy. However, this patch takes the easier approach of disabling any policy sharing between VMAs. Each new range allocated with sp_alloc will allocate a new policy, set the reference count to 1 and drop the reference count of the old policy. This increases the memory footprint but is not expected to be a major problem as mbind() is unlikely to be used for fine-grained ranges. It is also inefficient because it means we allocate a new policy even in cases where mbind_range() could use the new_policy passed to it. However, it is more straight-forward and the change should be invisible to the user. [mgorman@suse.de: Edited changelog] Reported-by: Dave Jones <davej@redhat.com>, Cc: Christoph Lameter <cl@linux.com>, Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09revert "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy ↵KOSAKI Motohiro
linkages" Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages") removed vma->vm_policy updates code but it is the purpose of mbind_range(). Now, mbind_range() is virtually a no-op and while it does not allow memory corruption it is not the right fix. This patch is a revert. [mgorman@suse.de: Edited changelog] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: capture a suitable high-order page immediately when it is ↵Mel Gorman
made available While compaction is migrating pages to free up large contiguous blocks for allocation it races with other allocation requests that may steal these blocks or break them up. This patch alters direct compaction to capture a suitable free page as soon as it becomes available to reduce this race. It uses similar logic to split_free_page() to ensure that watermarks are still obeyed. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on ↵Mel Gorman
failures If allocation fails after compaction then compaction may be deferred for a number of allocation attempts. If there are subsequent failures, compact_defer_shift is increased to defer for longer periods. This patch uses that information to scale the number of pages reclaimed with compact_defer_shift until allocations succeed again. The rationale is that reclaiming the normal number of pages still allowed compaction to fail and its success depends on the number of pages. If it's failing, reclaim more pages until it succeeds again. Note that this is not implying that VM reclaim is not reclaiming enough pages or that its logic is broken. try_to_free_pages() always asks for SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is what it does. Direct reclaim stops normally with this check. if (sc->nr_reclaimed >= sc->nr_to_reclaim) goto out; should_continue_reclaim delays when that check is made until a minimum number of pages for reclaim/compaction are reclaimed. It is possible that this patch could instead set nr_to_reclaim in try_to_free_pages() and drive it from there but that's behaves differently and not necessarily for the better. If driven from do_try_to_free_pages(), it is also possible that priorities will rise. When they reach DEF_PRIORITY-2, it will also start stalling and setting pages for immediate reclaim which is more disruptive than not desirable in this case. That is a more wide-reaching change that could cause another regression related to THP requests causing interactive jitter. [akpm@linux-foundation.org: fix build] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: compaction: update comment in try_to_compact_pagesMel Gorman
Allocation success rates have been far lower since 3.4 due to commit fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled"). This commit was introduced for good reasons and it was known in advance that the success rates would suffer but it was justified on the grounds that the high allocation success rates were achieved by aggressive reclaim. Success rates are expected to suffer even more in 3.6 due to commit 7db8889ab05b ("mm: have order > 0 compaction start off where it left") which testing has shown to severely reduce allocation success rates under load - to 0% in one case. This series aims to improve the allocation success rates without regressing the benefits of commit fe2c2a106663. The series is based on latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going away. Patch 1 updates a stale comment seeing as I was in the general area. Patch 2 updates reclaim/compaction to reclaim pages scaled on the number of recent failures. Patch 3 captures suitable high-order pages freed by compaction to reduce races with parallel allocation requests. Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction start off where it left] to enable compaction again Patch 5 identifies when compacion is taking too long due to contention and aborts. STRESS-HIGHALLOC 3.6-rc1-akpm full-series Pass 1 36.00 ( 0.00%) 51.00 (15.00%) Pass 2 42.00 ( 0.00%) 63.00 (21.00%) while Rested 86.00 ( 0.00%) 86.00 ( 0.00%) From http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html I know that the allocation success rates in 3.3.6 was 78% in comparison to 36% in in the current akpm tree. With the full series applied, the success rates are up to around 51% with some variability in the results. This is not as high a success rate but it does not reclaim excessively which is a key point. MMTests Statistics: vmstat Page Ins 3050912 3078892 Page Outs 8033528 8039096 Swap Ins 0 0 Swap Outs 0 0 Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates there were 71881 pages swapped out. Direct pages scanned 70942 122976 Kswapd pages scanned 1366300 1520122 Kswapd pages reclaimed 1366214 1484629 Direct pages reclaimed 70936 105716 Kswapd efficiency 99% 97% Kswapd velocity 1072.550 1182.615 Direct efficiency 99% 85% Direct velocity 55.690 95.672 The kswapd velocity changes very little as expected. kswapd velocity is around the 1000 pages/sec mark where as in kernel 3.3.6 with the high allocation success rates it was 8140 pages/second. Direct velocity is higher as a result of patch 2 of the series but this is expected and is acceptable. The direct reclaim and kswapd velocities change very little. If these get accepted for merging then there is a difficulty in how they should be handled. 7db8889a ("mm: have order > 0 compaction start off where it left") is broken but it is already in 3.6-rc1 and needs to be fixed. However, if just patch 4 from this series is applied then Jim Schutt's workload is known to break again as his workload also requires patch 5. While it would be preferred to have all these patches in 3.6 to improve compaction in general, it would at least be acceptable if just patches 4 and 5 were merged to 3.6 to fix a known problem without breaking compaction completely. On the face of it, that would force __GFP_NO_KSWAPD patches to be merged at the same time but I can do a version of this series with __GFP_NO_KSWAPD change reverted and then rebase it on top of this series. That might be best overall because I note that the __GFP_NO_KSWAPD patch should have removed deferred_compaction from page_alloc.c but it didn't but fixing that causes collisions with this series. This patch: The comment about order applied when the check was order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp: use compaction for all allocation orders"). Fixing the comment while I'm in the general area. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm/mmap.c: replace find_vma_prepare() with clearer find_vma_links()Hugh Dickins
People get confused by find_vma_prepare(), because it doesn't care about what it returns in its output args, when its callers won't be interested. Clarify by passing in end-of-range address too, and returning failure if any existing vma overlaps the new range: instead of returning an ambiguous vma which most callers then must check. find_vma_links() is a clearer name. This does revert 2.6.27's dfe195fb79e88 ("mm: fix uninitialized variables for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of those releases too eager to shout about uninitialized variables: only copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences. [hughd@google.com: fix warning, remove BUG()] Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Benny Halevy <bhalevy@tonian.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: fix nonuniform page status when writing new file with small bufferRobin Dong
When writing a new file with 2048 bytes buffer, such as write(fd, buffer, 2048), it will call generic_perform_write() twice for every page: write_begin mark_page_accessed(page) write_end write_begin mark_page_accessed(page) write_end Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be added to active_list even they have be accessed twice because they are not PageLRU(page). But when page 14th comes, all pages in lru-pvecs will be moved to inactive_list (by __lru_cache_add() ) in first write_begin(), now page 14th *is* PageLRU(page). And after second write_end() only page 14th will be in active_list. In Hadoop environment, we do comes to this situation: after writing a file, we find out that only 14th, 28th, 42th... page are in active_list and others in inactive_list. Now kswapd works, shrinks the inactive_list, the file only have 14th, 28th...pages in memory, the readahead request size will be broken to only 52k (13*4k), system's performance falls dramatically. This problem can also replay by below steps (the machine has 8G memory): 1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576 2. cat another 7.5G file to /dev/null 3. vmtouch -m 1G -v /test/file.out, it will show: /test/file.out [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144 the 'o' means same pages are in memory but same are not. The solution for this problem is simple: the 14th page should be added to lru_add_pvecs before mark_page_accessed() just as other pages. [akpm@linux-foundation.org: tweak comment] [akpm@linux-foundation.org: grab better comment from the v3 patch] Signed-off-by: Robin Dong <sanbai@taobao.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_RESERVED and mm->reserved_vm counterKonstantin Khlebnikov
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: prepare VM_DONTDUMP for using in driversKonstantin Khlebnikov
Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags: VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for sys_madvise. The next patch will use it for replacing the outdated flag VM_RESERVED. Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP) Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmasKonstantin Khlebnikov
Currently the kernel sets mm->exe_file during sys_execve() and then tracks number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon as this counter drops to zero kernel resets mm->exe_file to NULL. Plus it resets mm->exe_file at last mmput() when mm->mm_users drops to zero. VMA with VM_EXECUTABLE flag appears after mapping file with flag MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma splitting, because sys_mmap ignores this flag. Usually binfmt module sets mm->exe_file and mmaps executable vmas with this file, they hold mm->exe_file while task is running. comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), where all this stuff was introduced: > The kernel implements readlink of /proc/pid/exe by getting the file from > the first executable VMA. Then the path to the file is reconstructed and > reported as the result. > > Because of the VMA walk the code is slightly different on nommu systems. > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of > walking the VMAs to find the first executable file-backed VMA we store a > reference to the exec'd file in the mm_struct. > > That reference would prevent the filesystem holding the executable file > from being unmounted even after unmapping the VMAs. So we track the number > of VM_EXECUTABLE VMAs and drop the new reference when the last one is > unmapped. This avoids pinning the mounted filesystem. exe_file's vma accounting is hooked into every file mmap/unmmap and vma split/merge just to fix some hypothetical pinning fs from umounting by mm, which already unmapped all its executable files, but still alive. Seems like currently nobody depends on this behaviour. We can try to remove this logic and keep mm->exe_file until final mmput(). mm->exe_file is still protected with mm->mmap_sem, because we want to change it via new sys_prctl(PR_SET_MM_EXE_FILE). Also via this syscall task can change its mm->exe_file and unpin mountpoint explicitly. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_CAN_NONLINEARKonstantin Khlebnikov
Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_INSERTPAGEKonstantin Khlebnikov
Merge VM_INSERTPAGE into VM_MIXEDMAP. VM_MIXEDMAP VMA can mix pure-pfn ptes, special ptes and normal ptes. Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like VM_PFNMAP. If driver populates whole VMA at mmap() it probably not expects page-faults. This patch removes special check from vma_wants_writenotify() which disables pages write tracking for VMA populated via vm_instert_page(). BDI below mapped file should not use dirty-accounting, moreover do_wp_page() can handle this. vm_insert_page() still marks vma after first usage. Usually it is called from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to change vma->vm_flags. Caller must set VM_MIXEDMAP at mmap time if it wants to call this function from other places, for example from page-fault handler. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: introduce arch-specific vma flag VM_ARCH_1Konstantin Khlebnikov
Combine several arch-specific vma flags into one. before patch: 0x00000200 0x01000000 0x20000000 0x40000000 x86 VM_NOHUGEPAGE VM_HUGEPAGE - VM_PAT powerpc - - VM_SAO - parisc VM_GROWSUP - - - ia64 VM_GROWSUP - - - nommu - VM_MAPPED_COPY - - others - - - - after patch: 0x00000200 0x01000000 0x20000000 0x40000000 x86 - VM_PAT VM_HUGEPAGE VM_NOHUGEPAGE powerpc - VM_SAO - - parisc - VM_GROWSUP - - ia64 - VM_GROWSUP - - nommu - VM_MAPPED_COPY - - others - VM_ARCH_1 - - And voila! One completely free bit. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, x86, pat: rework linear pfn-mmap trackingKonstantin Khlebnikov
Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT. We can toss mapping address from remap_pfn_range() into track_pfn_vma_new(), and collect all PAT-related logic together in arch/x86/. This patch also restores orignal frustration-free is_cow_mapping() check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73 ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3") is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c, because it already handled by VM_PFNMAP in VM_NO_THP bit-mask. [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09x86, pat: separate the pfn attribute tracking for remap_pfn_range and ↵Suresh Siddha
vm_insert_pfn With PAT enabled, vm_insert_pfn() looks up the existing pfn memory attribute and uses it. Expectation is that the driver reserves the memory attributes for the pfn before calling vm_insert_pfn(). remap_pfn_range() (when called for the whole vma) will setup a new attribute (based on the prot argument) for the specified pfn range. This addresses the legacy usage which typically calls remap_pfn_range() with a desired memory attribute. For ranges smaller than the vma size (which is typically not the case), remap_pfn_range() will use the existing memory attribute for the pfn range. Expose two different API's for these different behaviors. track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn() and track_pfn_remap() for the remap_pfn_range(). This cleanup also prepares the ground for the track/untrack pfn vma routines to take over the ownership of setting PAT specific vm_flag in the 'vma'. [khlebnikov@openvz.org: Clear checks in track_pfn_remap()] [akpm@linux-foundation.org: tweak a few comments] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Venkatesh Pallipadi <venki@google.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove __GFP_NO_KSWAPDRik van Riel
When transparent huge pages were introduced, memory compaction and swap storms were an issue, and the kernel had to be careful to not make THP allocations cause pageout or compaction. Now that we have working compaction deferral, kswapd is smart enough to invoke compaction and the quadratic behaviour around isolate_free_pages has been fixed, it should be safe to remove __GFP_NO_KSWAPD. [minchan@kernel.org: Comment fix] [mgorman@suse.de: Avoid direct reclaim for deferred compaction] Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-07Merge branch 'slab/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "New and noteworthy: * More SLAB allocator unification patches from Christoph Lameter and others. This paves the way for slab memcg patches that hopefully will land in v3.8. * SLAB tracing improvements from Ezequiel Garcia. * Kernel tainting upon SLAB corruption from Dave Jones. * Miscellanous SLAB allocator bug fixes and improvements from various people." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (43 commits) slab: Fix build failure in __kmem_cache_create() slub: init_kmem_cache_cpus() and put_cpu_partial() can be static mm/slab: Fix kmem_cache_alloc_node_trace() declaration Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration" mm, slob: fix build breakage in __kmalloc_node_track_caller mm/slab: Fix kmem_cache_alloc_node_trace() declaration mm/slab: Fix typo _RET_IP -> _RET_IP_ mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB mm, slab: Rename __cache_alloc() -> slab_alloc() mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype mm, slab: Replace 'caller' type, void* -> unsigned long mm, slob: Add support for kmalloc_track_caller() mm, slab: Remove silly function slab_buffer_size() mm, slob: Use NUMA_NO_NODE instead of -1 mm, sl[au]b: Taint kernel when we detect a corrupted slab slab: Only define slab_error for DEBUG slab: fix the DEADLOCK issue on l3 alien lock slub: Zero initial memory segment for kmem_cache and kmem_cache_node Revert "mm/sl[aou]b: Move sysfs_slab_add to common" mm/sl[aou]b: Move kmem_cache refcounting to common code ...
2012-10-06sections: fix section conflicts in mm/percpu.cAndi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-03Merge branch 'slab/tracing' into slab/for-linusPekka Enberg
2012-10-03Merge branch 'slab/common-for-cgroups' into slab/for-linusPekka Enberg
Fix up a trivial conflict with NUMA_NO_NODE cleanups. Conflicts: mm/slob.c Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-03Merge branch 'slab/next' into slab/for-linusPekka Enberg
2012-10-03slab: Fix build failure in __kmem_cache_create()Tetsuo Handa
Fix build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused by commit 8a13a4cc "mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists". mm/slab.c: In function '__kmem_cache_create': mm/slab.c:2474: error: 'align' undeclared (first use in this function) mm/slab.c:2474: error: (Each undeclared identifier is reported only once mm/slab.c:2474: error: for each function it appears in.) make[1]: *** [mm/slab.o] Error 1 make: *** [mm] Error 2 Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-03slub: init_kmem_cache_cpus() and put_cpu_partial() can be staticFengguang Wu
Acked-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-02Merge tag 'stable/for-linus-3.7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm Pull frontswap update from Konrad Rzeszutek Wilk: "Features: - Support exlusive get if backend is capable. Bug-fixes: - Fix compile warnings - Add comments/cleanup doc - Fix wrong if condition" * tag 'stable/for-linus-3.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: frontswap: support exclusive gets if tmem backend is capable mm: frontswap: fix a wrong if condition in frontswap_shrink mm/frontswap: fix uninit'ed variable warning mm/frontswap: cleanup doc and comment error mm: frontswap: remove unneeded headers
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-10-02Merge branch 'for-3.7-hierarchy' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup hierarchy update from Tejun Heo: "Currently, different cgroup subsystems handle nested cgroups completely differently. There's no consistency among subsystems and the behaviors often are outright broken. People at least seem to agree that the broken hierarhcy behaviors need to be weeded out if any progress is gonna be made on this front and that the fallouts from deprecating the broken behaviors should be acceptable especially given that the current behaviors don't make much sense when nested. This patch makes cgroup emit warning messages if cgroups for subsystems with broken hierarchy behavior are nested to prepare for fixing them in the future. This was put in a separate branch because more related changes were expected (didn't make it this round) and the memory cgroup wanted to pull in this and make changes on top." * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02Merge branch 'for-3.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - xattr support added. The implementation is shared with tmpfs. The usage is restricted and intended to be used to manage per-cgroup metadata by system software. tmpfs changes are routed through this branch with Hugh's permission. - cgroup subsystem ID handling simplified. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Define CGROUP_SUBSYS_COUNT according the configuration cgroup: Assign subsystem IDs during compile time cgroup: Do not depend on a given order when populating the subsys array cgroup: Wrap subsystem selection macro cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT cgroup: net_prio: Do not define task_netpioidx() when not selected cgroup: net_cls: Do not define task_cls_classid() when not selected cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h cgroup: trivial fixes for Documentation/cgroups/cgroups.txt xattr: mark variable as uninitialized to make both gcc and smatch happy fs: add missing documentation to simple_xattr functions cgroup: add documentation on extended attributes usage cgroup: rename subsys_bits to subsys_mask cgroup: add xattr support cgroup: revise how we re-populate root directory xattr: extract simple_xattr code from tmpfs
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-10-01Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: 0. 'idle RCU': Adds RCU APIs that allow non-idle tasks to enter RCU idle mode and provides x86 code to make use of them, allowing RCU to treat user-mode execution as an extended quiescent state when the new RCU_USER_QS kernel configuration parameter is specified. (Work is in progress to port this to a few other architectures, but is not part of this series.) 1. A fix for a latent bug that has been in RCU ever since the addition of CPU stall warnings. This bug results in false-positive stall warnings, but thus far only on embedded systems with severely cut-down userspace configurations. 2. Further reductions in latency spikes for huge systems, along with additional boot-time adaptation to the actual hardware. This is a large change, as it moves RCU grace-period initialization and cleanup, along with quiescent-state forcing, from softirq to a kthread. However, it appears to be in quite good shape (famous last words). 3. Updates to documentation and rcutorture, the latter category including keeping statistics on CPU-hotplug latencies and fixing some initialization-time races. 4. CPU-hotplug fixes and improvements. 5. Idle-loop fixes that were omitted on an earlier submission. 6. Miscellaneous fixes and improvements In certain RCU configurations new kernel threads will show up (rcu_bh, rcu_sched), showing RCU processing overhead. * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits) rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling rcu: Userspace RCU extended QS selftest x86: Exit RCU extended QS on notify resume x86: Use the new schedule_user API on userspace preemption rcu: Exit RCU extended QS on user preemption rcu: Exit RCU extended QS on kernel preemption after irq/exception x86: Exception hooks for userspace RCU extended QS x86: Unspaghettize do_general_protection() x86: Syscall hooks for userspace RCU extended QS rcu: Switch task's syscall hooks on context switch rcu: Ignore userspace extended quiescent state by default rcu: Allow rcu_user_enter()/exit() to nest rcu: Settle config for userspace extended quiescent state rcu: Make RCU_FAST_NO_HZ handle adaptive ticks rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs rcu: New rcu_user_enter() and rcu_user_exit() APIs ia64: Add missing RCU idle APIs on idle loop xtensa: Add missing RCU idle APIs on idle loop score: Add missing RCU idle APIs on idle loop parisc: Add missing RCU idle APIs on idle loop ...
2012-10-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull the trivial tree from Jiri Kosina: "Tiny usual fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) doc: fix old config name of kprobetrace fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc btrfs: fix the commment for the action flags in delayed-ref.h btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID vfs: fix kerneldoc for generic_fh_to_parent() treewide: fix comment/printk/variable typos ipr: fix small coding style issues doc: fix broken utf8 encoding nfs: comment fix platform/x86: fix asus_laptop.wled_type module parameter mfd: printk/comment fixes doc: getdelays.c: remember to close() socket on error in create_nl_socket() doc: aliasing-test: close fd on write error mmc: fix comment typos dma: fix comments spi: fix comment/printk typos in spi Coccinelle: fix typo in memdup_user.cocci tmiofb: missing NULL pointer checks tools: perf: Fix typo in tools/perf tools/testing: fix comment / output typos ...
2012-09-29Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration"Pekka Enberg
This reverts commit 1e5965bf1f018cc30a4659fa3f1a40146e4276f6. Ezequiel Garcia has a better fix.
2012-09-28thp: avoid VM_BUG_ON page_count(page) false positives in ↵Andrea Arcangeli
__collapse_huge_page_copy Speculative cache pagecache lookups can elevate the refcount from under us, so avoid the false positive. If the refcount is < 2 we'll be notified by a VM_BUG_ON in put_page_testzero as there are two put_page(src_page) in a row before returning from this function. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-28CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often ↵Srivatsa S. Bhat
during hotplug The CPU hotplug callback related to writeback calls writeback_set_ratelimit() during every state change in the hotplug sequence. This is unnecessary since num_online_cpus() changes only once during the entire hotplug operation. So invoke the function only once per hotplug, thereby avoiding the unnecessary repetition of those costly calculations. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2012-09-26switch simple cases of fget_light to fdgetAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26make get_file() return its argumentAl Viro
simplifies a bunch of callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26switch readahead(2) to fget_light()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26switch fadvise(2) to fget_light()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26mm, slob: fix build breakage in __kmalloc_node_track_callerDavid Rientjes
On Sat, 8 Sep 2012, Ezequiel Garcia wrote: > @@ -454,15 +455,35 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node) > gfp |= __GFP_COMP; > ret = slob_new_pages(gfp, order, node); > > - trace_kmalloc_node(_RET_IP_, ret, > + trace_kmalloc_node(caller, ret, > size, PAGE_SIZE << order, gfp, node); > } > > kmemleak_alloc(ret, size, 1, gfp); > return ret; > } > + > +void *__kmalloc_node(size_t size, gfp_t gfp, int node) > +{ > + return __do_kmalloc_node(size, gfp, node, _RET_IP_); > +} > EXPORT_SYMBOL(__kmalloc_node); > > +#ifdef CONFIG_TRACING > +void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller) > +{ > + return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller); > +} > + > +#ifdef CONFIG_NUMA > +void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, > + int node, unsigned long caller) > +{ > + return __do_kmalloc_node(size, gfp, node, caller); > +} > +#endif This breaks Pekka's slab/next tree with this: mm/slob.c: In function '__kmalloc_node_track_caller': mm/slob.c:488: error: 'gfp' undeclared (first use in this function) mm/slob.c:488: error: (Each undeclared identifier is reported only once mm/slob.c:488: error: for each function it appears in.) mm, slob: fix build breakage in __kmalloc_node_track_caller "mm, slob: Add support for kmalloc_track_caller()" breaks the build because gfp is undeclared. Fix it. Acked-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm/slab: Fix kmem_cache_alloc_node_trace() declarationEzequiel Garcia
The bug was introduced in commit 4052147c0afa ("mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm/slab: Fix typo _RET_IP -> _RET_IP_Ezequiel Garcia
The bug was introduced by commit 7c0cb9c64f83 ("mm, slab: Replace 'caller' type, void* -> unsigned long"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25Merge remote-tracking branch 'tip/core/rcu' into next.2012.09.25bPaul E. McKenney
Resolved conflict in kernel/sched/core.c using Peter Zijlstra's approach from https://lkml.org/lkml/2012/9/5/585.
2012-09-25mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLABEzequiel Garcia
This patch does not fix anything, and its only goal is to enable us to obtain some common code between SLAB and SLUB. Neither behavior nor produced code is affected. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Rename __cache_alloc() -> slab_alloc()Ezequiel Garcia
This patch does not fix anything and its only goal is to produce common code between SLAB and SLUB. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototypeEzequiel Garcia
This long (seemingly unnecessary) patch does not fix anything and its only goal is to produce common code between SLAB and SLUB. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Replace 'caller' type, void* -> unsigned longEzequiel Garcia
This allows to use _RET_IP_ instead of builtin_address(0), thus achiveing implementation consistency in all three allocators. Though maybe a nitpick, the real goal behind this patch is to be able to obtain common code between SLAB and SLUB. Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slob: Add support for kmalloc_track_caller()Ezequiel Garcia
Currently slob falls back to regular kmalloc for this case. With this patch kmalloc_track_caller() is correctly implemented, thus tracing the specified caller. This is important to trace accurately allocations performed by krealloc, kstrdup, kmemdup, etc. Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>