aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2011-10-17UBUNTU: SAUCE: (no-up) add tracing for user initiated readahead requestsAndy Whitcroft
Track pages which undergo readahead and for each record which were actually consumed, via either read or faulted into a map. This allows userspace readahead applications (such as ureadahead) to track which pages in core at the end of a boot are actually required and generate an optimal readahead pack. It also allows pack adjustment and optimisation in parallel with readahead, allowing the pack to evolve to be accurate as userspace paths change. The status of the pages are reported back via the mincore() call using a newly allocated bit. Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-10-17UBUNTU: ubuntu: nx-emu - i386: mmap randomization for executable mappingsIngo Molnar
This code is originally from Ingo Molnar, with some later rebasing and fixes to respect all the randomization-disabling knobs. It provides address randomization algorithm when NX emulation is in use in 32-bit processes. Kees Cook pushed the brk area further away in the case of PIE binaries landing their brk inside the CS limit. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-10-17UBUNTU: ubuntu: nx-emu - i386: NX emulationIngo Molnar
This is old code with some cruft, all originally by Ingo Molnar with much later rebasing by Fedora folks and at least one arcane fix by Roland McGrath a few years ago. No longer uses exec-shield sysctl, merged with disable_nx. Kees Cook fixed boottime NX reporting for various corner cases. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
2011-10-17UBUNTU: SAUCE: (no-up) swap: Add notify_swap_entry_free callback for compcacheTim Gardner
Code is required for ubuntu/compcache Signed-off-by: Ben Collins <ben.collins@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
2011-10-03memcg: fix vmscan count in small memcgsKAMEZAWA Hiroyuki
commit 4508378b9523e22a2a0175d8bf64d932fb10a67d upstream. Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets") fixes the memcg/kswapd behavior against small targets and prevent vmscan priority too high. But the implementation is too naive and adds another problem to small memcg. It always force scan to 32 pages of file/anon and doesn't handle swappiness and other rotate_info. It makes vmscan to scan anon LRU regardless of swappiness and make reclaim bad. This patch fixes it by adjusting scanning count with regard to swappiness at el. At a test "cat 1G file under 300M limit." (swappiness=20) before patch scanned_pages_by_limit 360919 scanned_anon_pages_by_limit 180469 scanned_file_pages_by_limit 180450 rotated_pages_by_limit 31 rotated_anon_pages_by_limit 25 rotated_file_pages_by_limit 6 freed_pages_by_limit 180458 freed_anon_pages_by_limit 19 freed_file_pages_by_limit 180439 elapsed_ns_by_limit 429758872 after patch scanned_pages_by_limit 180674 scanned_anon_pages_by_limit 24 scanned_file_pages_by_limit 180650 rotated_pages_by_limit 35 rotated_anon_pages_by_limit 24 rotated_file_pages_by_limit 11 freed_pages_by_limit 180634 freed_anon_pages_by_limit 0 freed_file_pages_by_limit 180634 elapsed_ns_by_limit 367119089 scanned_pages_by_system 0 the numbers of scanning anon are decreased(as expected), and elapsed time reduced. By this patch, small memcgs will work better. (*) Because the amount of file-cache is much bigger than anon, recalaim_stat's rotate-scan counter make scanning files more. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stageWu Fengguang
commit 6e6938b6d3130305a5960c86b1a9b21e58cf6144 upstream. sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit f446daaea9, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03mm: sync vmalloc address space page tables in alloc_vm_area()David Vrabel
commit 461ae488ecb125b140d7ea29ceeedbcce9327003 upstream. Xen backend drivers (e.g., blkback and netback) would sometimes fail to map grant pages into the vmalloc address space allocated with alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could not find the page (in the L2 table) containing the PTEs it needed to update. (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000 netback and blkback were making the hypercall from a kernel thread where task->active_mm != &init_mm and alloc_vm_area() was only updating the page tables for init_mm. The usual method of deferring the update to the page tables of other processes (i.e., after taking a fault) doesn't work as a fault cannot occur during the hypercall. This would work on some systems depending on what else was using vmalloc. Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all() from alloc_vm_area()") and add a comment to explain why it's needed. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Keir Fraser <keir.xen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03mm: page allocator: reconsider zones for allocation after direct reclaimMel Gorman
commit 76d3fbf8fbf6cc78ceb63549e0e0c5bc8a88f838 upstream. With zone_reclaim_mode enabled, it's possible for zones to be considered full in the zonelist_cache so they are skipped in the future. If the process enters direct reclaim, the ZLC may still consider zones to be full even after reclaiming pages. Reconsider all zones for allocation if direct reclaim returns successfully. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03mm: page allocator: initialise ZLC for first zone eligible for zone_reclaimMel Gorman
commit cd38b115d5ad79b0100ac6daa103c4fe2c50a913 upstream. There have been a small number of complaints about significant stalls while copying large amounts of data on NUMA machines reported on a distribution bugzilla. In these cases, zone_reclaim was enabled by default due to large NUMA distances. In general, the complaints have not been about the workload itself unless it was a file server (in which case the recommendation was disable zone_reclaim). The stalls are mostly due to significant amounts of time spent scanning the preferred zone for pages to free. After a failure, it might fallback to another node (as zonelists are often node-ordered rather than zone-ordered) but stall quickly again when the next allocation attempt occurs. In bad cases, each page allocated results in a full scan of the preferred zone. Patch 1 checks the preferred zone for recent allocation failure which is particularly important if zone_reclaim has failed recently. This avoids rescanning the zone in the near future and instead falling back to another node. This may hurt node locality in some cases but a failure to zone_reclaim is more expensive than a remote access. Patch 2 clears the zlc information after direct reclaim. Otherwise, zone_reclaim can mark zones full, direct reclaim can reclaim enough pages but the zone is still not considered for allocation. This was tested on a 24-thread 2-node x86_64 machine. The tests were focused on large amounts of IO. All tests were bound to the CPUs on node-0 to avoid disturbances due to processes being scheduled on different nodes. The kernels tested are 3.0-rc6-vanilla Vanilla 3.0-rc6 zlcfirst Patch 1 applied zlcreconsider Patches 1+2 applied FS-Mark ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288 fsmark-3.0-rc6 3.0-rc6 3.0-rc6 vanilla zlcfirs zlcreconsider Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%) Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%) Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%) Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%) Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%) Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%) Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%) Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 501.49 493.91 499.93 Total Elapsed Time (seconds) 2451.57 2257.48 2215.92 MMTests Statistics: vmstat Page Ins 46268 63840 66008 Page Outs 90821596 90671128 88043732 Swap Ins 0 0 0 Swap Outs 0 0 0 Direct pages scanned 13091697 8966863 8971790 Kswapd pages scanned 0 1830011 1831116 Kswapd pages reclaimed 0 1829068 1829930 Direct pages reclaimed 13037777 8956828 8648314 Kswapd efficiency 100% 99% 99% Kswapd velocity 0.000 810.643 826.346 Direct efficiency 99% 99% 96% Direct velocity 5340.128 3972.068 4048.788 Percentage direct scans 100% 83% 83% Page writes by reclaim 0 3 0 Slabs scanned 796672 720640 720256 Direct inode steals 7422667 7160012 7088638 Kswapd inode steals 0 1736840 2021238 Test completes far faster with a large increase in the number of files created per second. Standard deviation is high as a small number of iterations were much higher than the mean. The number of pages scanned by zone_reclaim is reduced and kswapd is used for more work. LARGE DD 3.0-rc6 3.0-rc6 3.0-rc6 vanilla zlcfirst zlcreconsider download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%) dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%) delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 125.03 118.98 122.01 Total Elapsed Time (seconds) 624.56 375.02 398.06 MMTests Statistics: vmstat Page Ins 3594216 439368 407032 Page Outs 23380832 23380488 23377444 Swap Ins 0 0 0 Swap Outs 0 436 287 Direct pages scanned 17482342 69315973 82864918 Kswapd pages scanned 0 519123 575425 Kswapd pages reclaimed 0 466501 522487 Direct pages reclaimed 5858054 2732949 2712547 Kswapd efficiency 100% 89% 90% Kswapd velocity 0.000 1384.254 1445.574 Direct efficiency 33% 3% 3% Direct velocity 27991.453 184832.737 208171.929 Percentage direct scans 100% 99% 99% Page writes by reclaim 0 5082 13917 Slabs scanned 17280 29952 35328 Direct inode steals 115257 1431122 332201 Kswapd inode steals 0 0 979532 This test downloads a large tarfile and copies it with dd a number of times - similar to the most recent bug report I've dealt with. Time to completion is reduced. The number of pages scanned directly is still disturbingly high with a low efficiency but this is likely due to the number of dirty pages encountered. The figures could probably be improved with more work around how kswapd is used and how dirty pages are handled but that is separate work and this result is significant on its own. Streaming Mapped Writer MMTests Statistics: duration User/Sys Time Running Test (seconds) 124.47 111.67 112.64 Total Elapsed Time (seconds) 2138.14 1816.30 1867.56 MMTests Statistics: vmstat Page Ins 90760 89124 89516 Page Outs 121028340 120199524 120736696 Swap Ins 0 86 55 Swap Outs 0 0 0 Direct pages scanned 114989363 96461439 96330619 Kswapd pages scanned 56430948 56965763 57075875 Kswapd pages reclaimed 27743219 27752044 27766606 Direct pages reclaimed 49777 46884 36655 Kswapd efficiency 49% 48% 48% Kswapd velocity 26392.541 31363.631 30561.736 Direct efficiency 0% 0% 0% Direct velocity 53780.091 53108.759 51581.004 Percentage direct scans 67% 62% 62% Page writes by reclaim 385 122 1513 Slabs scanned 43008 39040 42112 Direct inode steals 0 10 8 Kswapd inode steals 733 534 477 This test just creates a large file mapping and writes to it linearly. Time to completion is again reduced. The gains are mostly down to two things. In many cases, there is less scanning as zone_reclaim simply gives up faster due to recent failures. The second reason is that memory is used more efficiently. Instead of scanning the preferred zone every time, the allocator falls back to another zone and uses it instead improving overall memory utilisation. This patch: initialise ZLC for first zone eligible for zone_reclaim. The zonelist cache (ZLC) is used among other things to record if zone_reclaim() failed for a particular zone recently. The intention is to avoid a high cost scanning extremely long zonelists or scanning within the zone uselessly. Currently the zonelist cache is setup only after the first zone has been considered and zone_reclaim() has been called. The objective was to avoid a costly setup but zone_reclaim is itself quite expensive. If it is failing regularly such as the first eligible zone having mostly mapped pages, the cost in scanning and allocation stalls is far higher than the ZLC initialisation step. This patch initialises ZLC before the first eligible zone calls zone_reclaim(). Once initialised, it is checked whether the zone failed zone_reclaim recently. If it has, the zone is skipped. As the first zone is now being checked, additional care has to be taken about zones marked full. A zone can be marked "full" because it should not have enough unmapped pages for zone_reclaim but this is excessive as direct reclaim or kswapd may succeed where zone_reclaim fails. Only mark zones "full" after zone_reclaim fails if it failed to reclaim enough pages after scanning. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-17mm: fix wrong vmap address calculations with odd NR_CPUS valuesClemens Ladisch
commit f982f91516fa4cfd9d20518833cd04ad714585be upstream. Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does address calculations under the assumption that VMAP_BLOCK_SIZE is a power of two. However, this might not be true if CONFIG_NR_CPUS is not set to a power of two. Wrong vmap_block index/offset values could lead to memory corruption. However, this has never been observed in practice (or never been diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that checks for inconsistent vmap_block indices. To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572 Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz> Reported-by: Matias A. Fonzo <selk@dragora.org> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04oom: task->mm == NULL doesn't mean the memory was freedOleg Nesterov
commit c027a474a68065391c8773f6e83ed5412657e369 upstream. exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04memcg: fix behavior of mem_cgroup_resize_limit()Daisuke Nishimura
commit 108b6a78463bb8c7163e4f9779f36ad8bbade334 upstream. Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to memsw.limit") introduced "memsw_is_minimum" flag, which becomes true when mem_limit == memsw_limit. The flag is checked at the beginning of reclaim, and "noswap" is set if the flag is true, because using swap is meaningless in this case. This works well in most cases, but when we try to shrink mem_limit, which is the same as memsw_limit now, we might fail to shrink mem_limit because swap doesn't used. This patch fixes this behavior by: - check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim - If it is set, don't set "noswap" flag even if memsw_is_minimum is true. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04mm/backing-dev.c: reset bdi min_ratio in bdi_unregister()Peter Zijlstra
commit ccb6108f5b0b541d3eb332c3a73e645c0f84278e upstream. Vito said: : The system has many usb disks coming and going day to day, with their : respective bdi's having min_ratio set to 1 when inserted. It works for : some time until eventually min_ratio can no longer be set, even when the : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to : anywhere near 100. : : This then leads to an unrelated starvation problem caused by write-heavy : fuse mounts being used atop the usb disks, a problem the min_ratio setting : at the underlying devices bdi effectively prevents. Fix this leakage by resetting the bdi min_ratio when unregistering the BDI. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Vito Caputo <lkml@pengaru.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04mm/futex: fix futex writes on archs with SW tracking of dirty & youngBenjamin Herrenschmidt
commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream. I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting succcess from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architecture are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by: Shan Hai <haishan.bai@gmail.com> Tested-by: Shan Hai <haishan.bai@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <darren.hart@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-19vmscan: fix a livelock in kswapdShaohua Li
I'm running a workload which triggers a lot of swap in a machine with 4 nodes. After I kill the workload, I found a kswapd livelock. Sometimes kswapd3 or kswapd2 are keeping running and I can't access filesystem, but most memory is free. This looks like a regression since commit 08951e545918c159 ("mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely"). Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0 for classzone_idx. The reason is end_zone in balance_pgdat() is 0 by default, if all zones have watermark ok, end_zone will keep 0. Later sleeping_prematurely() always returns true. Because this is an order 3 wakeup, and if classzone_idx is 0, both balanced_pages and present_pages in pgdat_balanced() are 0. We add a special case here. If a zone has no page, we think it's balanced. This fixes the livelock. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm/nommu.c: fix remap_pfn_range()Bob Liu
remap_pfn_range() means map physical address pfn<<PAGE_SHIFT to user addr. For nommu arch it's implemented by vma->vm_start = pfn << PAGE_SHIFT which is wrong acroding the original meaning of this function. And some driver developer using remap_pfn_range() with correct parameter will get unexpected result because vm_start is changed. It should be implementd like addr = pfn << PAGE_SHIFT but which is meanless on nommu arch, this patch just make it simply return. Parameter name and setting of vma->vm_flags also be fixed. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Bob Liu <lliubbo@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08memcg: fix numa scan information update to be triggered by memory eventKAMEZAWA Hiroyuki
commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin order") adds an numa node round-robin for memcg. But the information is updated once per 10sec. This patch changes the update trigger from jiffies to memcg's event count. After this patch, numa scan information will be updated when we see 1024 events of pagein/pageout under a memcg. [akpm@linux-foundation.org: attempt to repair code layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08memcg: fix reclaimable lru check in memcgKAMEZAWA Hiroyuki
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is used for checking whether the memcg contains reclaimable pages or not. If no pages in it, the routine skips it. But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle "noswap" condition correctly. This doesn't work on a swapless system. This patch adds test_mem_cgroup_reclaimable() and replaces mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() see LRU counter and returns correct answer to the caller. And this new function has "noswap" argument and can see only FILE LRU if necessary. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix kerneldoc layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: __tlb_remove_page() check the correct batchShaohua Li
__tlb_remove_page() switches to a new batch page, but still checks space in the old batch. This check always fails, and causes a forced tlb flush. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfullyMel Gorman
During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs. When balance_pgdat() returns, it may be at a lower classzone_idx than it started because the highest zone was unreclaimable. Before checking if it should go to sleep though, it checks pgdat->classzone_idx which when there is no other activity will be MAX_NR_ZONES-1. It interprets this as it has been woken up while reclaiming, skips scheduling and reclaims again. As there is no useful reclaim work to do, it enters into a loop of shrinking slab consuming loads of CPU until the highest zone becomes reclaimable for a long period of time. There are two problems here. 1) If the returned classzone or order is lower, it'll continue reclaiming without scheduling. 2) if the highest zone was marked unreclaimable but balance_pgdat() returns immediately at DEF_PRIORITY, the new lower classzone is not communicated back to kswapd() for sleeping. This patch does two things that are related. If the end_zone is unreclaimable, this information is communicated back. Second, if the classzone or order was reduced due to failing to reclaim, new information is not read from pgdat and instead an attempt is made to go to sleep. Due to this, it is also necessary that pgdat->classzone_idx be initialised each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as wakeups. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: evaluate the watermarks against the correct classzoneMel Gorman
When deciding if kswapd is sleeping prematurely, the classzone is taken into account but this is different to what balance_pgdat() and the allocator are doing. Specifically, the DMA zone will be checked based on the classzone used when waking kswapd which could be for a GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in error. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: do not apply pressure to slab if we are not applying pressure to ↵Mel Gorman
zone During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. When kswapd applies pressure to zones during node balancing, it checks if the zone is above a high+balance_gap threshold. If it is, it does not apply pressure but it unconditionally shrinks slab on a global basis which is excessive. In the event kswapd is being kept awake due to a high small unreclaimable zone, it skips zone shrinking but still calls shrink_slab(). Once pressure has been applied, the check for zone being unreclaimable is being made before the check is made if all_unreclaimable should be set. This miss of unreclaimable can cause has_under_min_watermark_zone to be set due to an unreclaimable zone preventing kswapd backing off on congestion_wait(). Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: correct check for kswapd sleeping in sleeping_prematurelyMel Gorman
During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs. This seems to happen most with recent sandybridge laptops but it's probably a co-incidence as some of these laptops just happen to have a small Normal zone. The reproduction case is almost always during copying large files that kswapd pegs at 100% CPU until the file is deleted or cache is dropped. The problem is mostly down to sleeping_prematurely() keeping kswapd awake when the highest zone is small and unreclaimable and compounded by the fact we shrink slabs even when not shrinking zones causing a lot of time to be spent in shrinkers and a lot of memory to be reclaimed. Patch 1 corrects sleeping_prematurely to check the zones matching the classzone_idx instead of all zones. Patch 2 avoids shrinking slab when we are not shrinking a zone. Patch 3 notes that sleeping_prematurely is checking lower zones against a high classzone which is not what allocators or balance_pgdat() is doing leading to an artifical belief that kswapd should be still awake. Patch 4 notes that when balance_pgdat() gives up on a high zone that the decision is not communicated to sleeping_prematurely() This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable to be picked up by distros and this series is against 3.0-rc4. I've cc'd people that reported similar problems recently to see if they still suffer from the problem and if this fixes it. This patch: correct the check for kswapd sleeping in sleeping_prematurely() During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. A problem occurs if the highest zone is small. balance_pgdat() only considers unreclaimable zones when priority is DEF_PRIORITY but sleeping_prematurely considers all zones. It's possible for this sequence to occur 1. kswapd wakes up and enters balance_pgdat() 2. At DEF_PRIORITY, marks highest zone unreclaimable 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from highest zone, clearing all_unreclaimable. Highest zone is still unbalanced 5. kswapd returns and calls sleeping_prematurely 6. sleeping_prematurely looks at *all* zones, not just the ones being considered by balance_pgdat. The highest small zone has all_unreclaimable cleared but the zone is not balanced. all_zones_ok is false so kswapd stays awake This patch corrects the behaviour of sleeping_prematurely to check the zones balance_pgdat() checked. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27memcg: fix direct softlimit reclaim to be called in limit pathKAMEZAWA Hiroyuki
Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct reclaim") adds a softlimit hook to shrink_zones(). By this, soft limit is called as try_to_free_pages() do_try_to_free_pages() shrink_zones() mem_cgroup_soft_limit_reclaim() Then, direct reclaim is memcg softlimit hint aware, now. But, the memory cgroup's "limit" path can call softlimit shrinker. try_to_free_mem_cgroup_pages() do_try_to_free_pages() shrink_zones() mem_cgroup_soft_limit_reclaim() This will cause a global reclaim when a memcg hits limit. This is bug. soft_limit_reclaim() should be called when scanning_global_lru(sc) == true. And the commit adds a variable "total_scanned" for counting softlimit scanned pages....it's not "total". This patch removes the variable and update sc->nr_scanned instead of it. This will affect shrink_slab()'s scan condition but, global LRU is scanned by softlimit and I think this change makes sense. TODO: avoid too much scanning of a zone when softlimit did enough work. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Ying Han <yinghan@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: fix assertion mapping->nrpages == 0 in end_writeback()Jan Kara
Under heavy memory and filesystem load, users observe the assertion mapping->nrpages == 0 in end_writeback() trigger. This can be caused by page reclaim reclaiming the last page from a mapping in the following race: CPU0 CPU1 ... shrink_page_list() __remove_mapping() __delete_from_page_cache() radix_tree_delete() evict_inode() truncate_inode_pages() truncate_inode_pages_range() pagevec_lookup() - finds nothing end_writeback() mapping->nrpages != 0 -> BUG page->mapping = NULL mapping->nrpages-- Fix the problem by doing a reliable check of mapping->nrpages under mapping->tree_lock in end_writeback(). Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out by Miklos Szeredi <mszeredi@suse.de>. Cc: Jay <jinshan.xiong@whamcloud.com> Cc: Miklos Szeredi <mszeredi@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm/memory-failure.c: fix spinlock vs mutex orderPeter Zijlstra
We cannot take a mutex while holding a spinlock, so flip the order and fix the locking documentation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27tmpfs: add shmem_read_mapping_page_gfpHugh Dickins
Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp() is unsuited to tmpfs, because it inserts a page into pagecache before calling the filesystem's ->readpage: tmpfs may have pages in swapcache which only it knows how to locate and switch to filecache. At present tmpfs provides a ->readpage method, and copes with this by copying pages; but soon we can simplify it by removing its ->readpage. Provide shmem_read_mapping_page_gfp() now, ready for that transition, Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h, with shmem_read_mapping_page() inline for the common mapping_gfp case. (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the read_mapping_page functions use the mapping's ->readpage, and the read_cache_page functions use the supplied filler, so I think read_cache_page_gfp was slightly misnamed.) Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27tmpfs: take control of its truncate_rangeHugh Dickins
2.6.35's new truncate convention gave tmpfs the opportunity to control its file truncation, no longer enforced from outside by vmtruncate(). We shall want to build upon that, to handle pagecache and swap together. Slightly redefine the ->truncate_range interface: let it now be called between the unmap_mapping_range()s, with the filesystem responsible for doing the truncate_inode_pages_range() from it - just as the filesystem is nowadays responsible for doing that from its ->setattr. Let's rename shmem_notify_change() to shmem_setattr(). Instead of calling the generic truncate_setsize(), bring that code in so we can call shmem_truncate_range() - which will later be updated to perform its own variant of truncate_inode_pages_range(). Remove the punch_hole unmap_mapping_range() from shmem_truncate_range(): now that the COW's unmap_mapping_range() comes after ->truncate_range, there is no need to call it a third time. Export shmem_truncate_range() and add it to the list in shmem_fs.h, so that i915_gem_object_truncate() can call it explicitly in future; get this patch in first, then update drm/i915 once this is available (until then, i915 will just be doing the truncate_inode_pages() twice). Though introduced five years ago, no other filesystem is implementing ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly, whereupon ->truncate_range can be removed from inode_operations - shmem_truncate_range() will help i915 across that transition too. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: move shmem prototypes to shmem_fs.hHugh Dickins
Before adding any more global entry points into shmem.c, gather such prototypes into shmem_fs.h. Remove mm's own declarations from swap.h, but for now leave the ones in mm.h: because shmem_file_setup() and shmem_zero_setup() are called from various places, and we should not force other subsystems to update immediately. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: move vmtruncate_range to truncate.cHugh Dickins
You would expect to find vmtruncate_range() next to vmtruncate() in mm/truncate.c: move it there. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-22mm, hotplug: protect zonelist building with zonelists_mutexDavid Rientjes
Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug zonelist") does not protect the build_all_zonelists() call with zonelists_mutex as needed. This can lead to races in constructing zonelist ordering if a concurrent build is underway. Protecting this with lock_memory_hotplug() is insufficient since zonelists can be rebuild though sysfs as well. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-22mm, hotplug: fix error handling in mem_online_node()David Rientjes
The error handling in mem_online_node() is incorrect: hotadd_new_pgdat() returns NULL if the new pgdat could not have been allocated and a pointer to it otherwise. mem_online_node() should fail if hotadd_new_pgdat() fails, not the inverse. This fixes an issue when memoryless nodes are not onlined and their sysfs interface is not registered when their first cpu is brought up. The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local memory online") iow v2.6.35. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17mm: avoid anon_vma_chain allocation under anon_vma lockLinus Torvalds
Hugh Dickins points out that lockdep (correctly) spots a potential deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation of anon_vma_chain while doing anon_vma_clone(). The problem is that page reclaim will want to take the anon_vma lock of any anonymous pages that it will try to reclaim. So re-organize the code in anon_vma_clone() slightly: first do just a GFP_NOWAIT allocation, which will usually work fine. But if that fails, let's just drop the lock and re-do the allocation, now with GFP_KERNEL. End result: not only do we avoid the locking problem, this also ends up getting better concurrency in case the allocation does need to block. Tim Chen reports that with all these anon_vma locking tweaks, we're now almost back up to the spinlock performance. Reported-and-tested-by: Hugh Dickins <hughd@google.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas()Peter Zijlstra
This matches the anon_vma_clone() case, and uses the same lock helper functions. Because of the need to potentially release the anon_vma's, it's a bit more complex, though. We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets the anon_vma lock (with the helper function that only takes the lock once for the whole loop), and removes any entries that don't need any more processing. The second phase just traverses the remaining list entries (without holding the anon_vma lock), and does any actual freeing of the anon_vma's that is required. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Hugh Dickins <hughd@google.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone()Linus Torvalds
In anon_vma_clone() we traverse the vma->anon_vma_chain of the source vma, locking the anon_vma for each entry. But they are all going to have the same root entry, which means that we're locking and unlocking the same lock over and over again. Which is expensive in locked operations, but can get _really_ expensive when that root entry sees any kind of lock contention. In fact, Tim Chen reports a big performance regression due to this: when we switched to use a mutex instead of a spinlock, the contention case gets much worse. So to alleviate this all, this commit creates a small helper function (lock_anon_vma_root()) that can be used to take the lock just once rather than taking and releasing it over and over again. We still have the same "take the lock and release" it behavior in the exit path (in unlink_anon_vmas()), but that one is a bit harder to fix since we're actually freeing the anon_vma entries as we go, and that will touch the lock too. Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-16migrate: don't account swapcache as shmemAndrea Arcangeli
swapcache will reach the below code path in migrate_page_move_mapping, and swapcache is accounted as NR_FILE_PAGES but it's not accounted as NR_SHMEM. Hugh pointed out we must use PageSwapCache instead of comparing mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-16mm: get rid of the most spurious find_vma_prev() usersLinus Torvalds
We have some users of this function that date back to before the vma list was doubly linked, and just are silly. These days, you can find the previous vma by just following the vma->vm_prev pointer. In some cases you don't need any find_vma() lookup at all, and in other cases you're better off with the regular "find_vma()" that uses the vma cache front-end lookup. Some "find_vma_prev()" users are still valid, though. For example, in the case of a stack that grows up, it can be the case that we don't find any 'vma' at all (because we're looking up an address that is past the last vma), and that the stack that we want to grow is the 'prev' vma. But that kind of special case aside, we generally should prefer to use 'find_vma()'. Noticed due to a totally unrelated POWER memory corruption bug that just happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we using that function here?". Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15ksm: fix NULL pointer dereference in scan_get_next_rmap_item()Hugh Dickins
Andrea Righi reported a case where an exiting task can race against ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily triggering a NULL pointer dereference in ksmd. ksm_scan.mm_slot == &ksm_mm_head with only one registered mm CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item) list_empty() is false lock slot == &ksm_mm_head list_del(slot->mm_list) (list now empty) unlock lock slot = list_entry(slot->mm_list.next) (list is empty, so slot is still ksm_mm_head) unlock slot->mm == NULL ... Oops Close this race by revalidating that the new slot is not simply the list head again. Andrea's test case: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #define BUFSIZE getpagesize() int main(int argc, char **argv) { void *ptr; if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) { perror("posix_memalign"); exit(1); } if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) { perror("madvise"); exit(1); } *(char *)NULL = 0; return 0; } Reported-by: Andrea Righi <andrea@betterlinux.com> Tested-by: Andrea Righi <andrea@betterlinux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: compaction: abort compaction if too many pages are isolated and caller ↵Mel Gorman
is asynchronous V2 Asynchronous compaction is used when promoting to huge pages. This is all very nice but if there are a number of processes in compacting memory, a large number of pages can be isolated. An "asynchronous" process can stall for long periods of time as a result with a user reporting that firefox can stall for 10s of seconds. This patch aborts asynchronous compaction if too many pages are isolated as it's better to fail a hugepage promotion than stall a process. [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort] Reported-and-tested-by: Ury Stankevich <urykhy@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: vmscan: do not use page_count without a page pinAndrea Arcangeli
It is unsafe to run page_count during the physical pfn scan because compound_head could trip on a dangling pointer when reading page->first_page if the compound page is being freed by another CPU. [mgorman@suse.de: split out patch] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: compaction: ensure that the compaction free scanner does not move to the ↵Mel Gorman
next zone Compaction works with two scanners, a migration and a free scanner. When the scanners crossover, migration within the zone is complete. The location of the scanner is recorded on each cycle to avoid excesive scanning. When a zone is small and mostly reserved, it's very easy for the migration scanner to be close to the end of the zone. Then the following situation can occurs o migration scanner isolates some pages near the end of the zone o free scanner starts at the end of the zone but finds that the migration scanner is already there o free scanner gets reinitialised for the next cycle as cc->migrate_pfn + pageblock_nr_pages moving the free scanner into the next zone o migration scanner moves into the next zone When this happens, NR_ISOLATED accounting goes haywire because some of the accounting happens against the wrong zone. One zones counter remains positive while the other goes negative even though the overall global count is accurate. This was reported on X86-32 with !SMP because !SMP allows the negative counters to be visible. The fact that it is the bug should theoritically be possible there. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15compaction: checks correct fragmentation indexShaohua Li
fragmentation_index() returns -1000 when the allocation might succeed This doesn't match the comment and code in compaction_suitable(). I thought compaction_suitable should return COMPACT_PARTIAL in -1000 case, because in this case allocation could succeed depending on watermarks. The impact of this is that compaction starts and compact_finished() is called which rechecks the watermarks and the free lists. It should have the same result in that compaction should not start but is more expensive. Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm/memory-failure.c: fix page isolated count mismatchMinchan Kim
Pages isolated for migration are accounted with the vmstat counters NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to increment these counters when pages are isolated from the LRU. Once the pages have been migrated, they are put back on the LRU or freed and the isolated count is decremented. Memory failure is not properly accounting for pages it isolates causing the NR_ISOLATED counters to be negative. On SMP builds, this goes unnoticed as negative counters are treated as 0 due to expected per-cpu drift. On UP builds, the counter is treated by too_many_isolated() as a large value causing processes to enter D state during page reclaim or compaction. This patch accounts for pages isolated by memory failure correctly. [mel@csn.ul.ie: rewrote changelog] Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15memcg: avoid percpu cached charge draining at softlimitKAMEZAWA Hiroyuki
Based on Michal Hocko's comment. We are not draining per cpu cached charges during soft limit reclaim because background reclaim doesn't care about charges. It tries to free some memory and charges will not give any. Cached charges might influence only selection of the biggest soft limit offender but as the call is done only after the selection has been already done it makes no change. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15memcg: fix percpu cached charge draining frequencyKAMEZAWA Hiroyuki
For performance, memory cgroup caches some "charge" from res_counter into per cpu cache. This works well but because it's cache, it needs to be flushed in some cases. Typical cases are 1. when someone hit limit. 2. when rmdir() is called and need to charges to be 0. But "1" has problem. Recently, with large SMP machines, we see many kworker runs because of flushing memcg's cache. Bad things in implementation are that even if a cpu contains a cache for memcg not related to a memcg which hits limit, drain code is called. This patch does A) check percpu cache contains a useful data or not. B) check other asynchronous percpu draining doesn't run. C) don't call local cpu callback. (*)This patch avoid changing the calling condition with hard-limit. When I run "cat 1Gfile > /dev/null" under 300M limit memcg, [Before] 13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat 58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1 60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1 4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0 57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1 61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1 62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1 63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1 [After] 2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat 2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top 1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0 4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0 [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Michal Hocko <mhocko@suse.cz> Tested-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15memcg: fix wrong check of noswap with softlimitKAMEZAWA Hiroyuki
Hierarchical reclaim doesn't swap out if memsw and resource limits are thye same (memsw_is_minimum == true) because we would hit mem+swap limit anyway (during hard limit reclaim). If it comes to the soft limit we shouldn't consider memsw_is_minimum at all because it doesn't make much sense. Either the soft limit is bellow the hard limit and then we cannot hit mem+swap limit or the direct reclaim takes a precedence. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15memcg: fix init_page_cgroup nid with sparsememKAMEZAWA Hiroyuki
Commit 21a3c9646873 ("memcg: allocate memory cgroup structures in local nodes") makes page_cgroup allocation as NUMA aware. But that caused a problem https://bugzilla.kernel.org/show_bug.cgi?id=36192. The problem was getting a NID from invalid struct pages, which was not initialized because it was out-of-node, out of [node_start_pfn, node_end_pfn) Now, with sparsemem, page_cgroup_init scans pfn from 0 to max_pfn. But this may scan a pfn which is not on any node and can access memmap which is not initialized. This makes page_cgroup_init() for SPARSEMEM node aware and remove a code to get nid from page->flags. (Then, we'll use valid NID always.) [akpm@linux-foundation.org: try to fix up comments] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: memory.numa_stat: fix file permissionKAMEZAWA Hiroyuki
Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa statistics") adds memory.numa_stat file for memory cgroup. But the file permissions are wrong. [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat This patch fixes the permission as [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: fix negative commitlimit when gigantic hugepages are allocatedRafael Aquini
When 1GB hugepages are allocated on a system, free(1) reports less available memory than what really is installed in the box. Also, if the total size of hugepages allocated on a system is over half of the total memory size, CommitLimit becomes a negative number. The problem is that gigantic hugepages (order > MAX_ORDER) can only be allocated at boot with bootmem, thus its frames are not accounted to 'totalram_pages'. However, they are accounted to hugetlb_total_pages() What happens to turn CommitLimit into a negative number is this calculation, in fs/proc/meminfo.c: allowed = ((totalram_pages - hugetlb_total_pages()) * sysctl_overcommit_ratio / 100) + total_swap_pages; A similar calculation occurs in __vm_enough_memory() in mm/mmap.c. Also, every vm statistic which depends on 'totalram_pages' will render confusing values, as if system were 'missing' some part of its memory. Impact of this bug: When gigantic hugepages are allocated and sysctl_overcommit_memory == OVERCOMMIT_NEVER. In a such situation, __vm_enough_memory() goes through the mentioned 'allowed' calculation and might end up mistakenly returning -ENOMEM, thus forcing the system to start reclaiming pages earlier than it would be ususal, and this could cause detrimental impact to overall system's performance, depending on the workload. Besides the aforementioned scenario, I can only think of this causing annoyances with memory reports from /proc/meminfo and free(1). [akpm@linux-foundation.org: standardize comment layout] Reported-by: Russ Anderson <rja@sgi.com> Signed-off-by: Rafael Aquini <aquini@linux.com> Acked-by: Russ Anderson <rja@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm/memory_hotplug.c: fix building of node hotplug zonelistKAMEZAWA Hiroyuki
During memory hotplug we refresh zonelists when we online a page in a new zone. It means that the node's zonelist is not initialized until pages are onlined. So for example, "nid" passed by MEM_GOING_ONLINE notifier will point to NODE_DATA(nid) which has no zone fallback list. Moreover, if we hot-add cpu-only nodes, alloc_pages() will do no fallback. This patch makes a zonelist when a new pgdata is available. Note: in production, at fujitsu, memory should be onlined before cpu and our server didn't have any memory-less nodes and had no problems. But recent changes in MEM_GOING_ONLINE+page_cgroup will access not initialized zonelist of node. Anyway, there are memory-less node and we need some care. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>