aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2008-04-30revert "memory hotplug: allocate usemap on the section with pgdat"Andrew Morton
This: commit 86f6dae1377523689bd8468fed2f2dd180fc0560 Author: Yasunori Goto <y-goto@jp.fujitsu.com> Date: Mon Apr 28 02:13:33 2008 -0700 memory hotplug: allocate usemap on the section with pgdat Usemaps are allocated on the section which has pgdat by this. Because usemap size is very small, many other sections usemaps are allocated on only one page. If a section has usemap, it can't be removed until removing other sections. This dependency is not desirable for memory removing. Pgdat has similar feature. When a section has pgdat area, it must be the last section for removing on the node. So, if section A has pgdat and section B has usemap for section A, Both sections can't be removed due to dependency each other. To solve this issue, this patch collects usemap on same section with pgdat. If other sections doesn't have any dependency, this section will be able to be removed finally. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> broke davem's sparc64 bootup. Revert it while we work out what went wrong. Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: fix warning on memory offlineNick Piggin
KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that is coming from page migration caller. WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360() Call Trace: [<a000000100015220>] show_stack+0x80/0xa0 [<a000000100015270>] dump_stack+0x30/0x60 [<a000000100089ed0>] warn_on_slowpath+0x90/0xe0 [<a0000001001f8b10>] __set_page_dirty+0x330/0x360 [<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280 [<a00000010012fec0>] set_page_dirty+0xc0/0x260 [<a000000100195670>] migrate_page_copy+0x5d0/0x5e0 [<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0 [<a000000100195eb0>] migrate_pages+0x770/0xe00 What was happening is that migrate_page_copy wants to transfer the PG_dirty bit from old page to new page, so what it would do is set_page_dirty(newpage). However set_page_dirty() is used to set the entire page dirty, wheras in this case, only part of the page was dirty, and it also was not uptodate. Marking the whole page dirty with set_page_dirty would lead to corruption or unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate buffers. Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage); however in the interests of keeping the change minimal... Signed-off-by: Nick Piggin <npiggin@suse.de> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: remove remaining __FUNCTION__ occurrencesHarvey Harrison
__FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30infrastructure to debug (dynamic) objectsThomas Gleixner
We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: Add NR_WRITEBACK_TEMP counterMiklos Szeredi
Fuse will use temporary buffers to write back dirty data from memory mappings (normal writes are done synchronously). This is needed, because there cannot be any guarantee about the time in which a write will complete. By using temporary buffers, from the MM's point if view the page is written back immediately. If the writeout was due to memory pressure, this effectively migrates data from a full zone to a less full zone. This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used as temporary buffers. [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: export bdi_writeout_inc()Miklos Szeredi
Fuse needs this for writable mmap support. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: add separate writeback accounting capabilityMiklos Szeredi
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is set, then don't update the per-bdi writeback stats from test_set_page_writeback() and test_clear_page_writeback(). Misc cleanups: - convert bdi_cap_writeback_dirty() and friends to static inline functions - create a flag that includes all three dirty/writeback related flags, since almst all users will want to have them toghether Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: move statistics to debugfsMiklos Szeredi
Move BDI statistics to debugfs: /sys/kernel/debug/bdi/<bdi>/stats Use postcore_initcall() to initialize the sysfs class and debugfs, because debugfs is initialized in core_initcall(). Update descriptions in ABI documentation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: allow setting a maximum for the bdi dirty limitPeter Zijlstra
Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in max_ratio_store(). - export bdi_set_max_ratio() to modules - limit bdi_dirty with bdi->max_ratio - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: allow setting a minimum for the bdi dirty limitPeter Zijlstra
Under normal circumstances each device is given a part of the total write-back cache that relates to its current avg writeout speed in relation to the other devices. min_ratio - allows one to assign a minimum portion of the write-back cache to a particular device. This is useful in situations where you might want to provide a minimum QoS. (One request for this feature came from flash based storage people who wanted to avoid writing out at all costs - they of course needed some pdflush hacks as well) max_ratio - allows one to assign a maximum portion of the dirty limit to a particular device. This is useful in situations where you want to avoid one device taking all or most of the write-back cache. Eg. an NFS mount that is prone to get stuck, or a FUSE mount which you don't trust to play fair. Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in min_ratio_store() - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: bdi: export BDI attributes in sysfsPeter Zijlstra
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object. This allows us to see and set the various BDI specific variables. In particular this properly exposes the read-ahead window for all relevant users and /sys/block/<block>/queue/read_ahead_kb should be deprecated. With patient help from Kay Sievers and Greg KH [mszeredi@suse.cz] - split off NFS and FUSE changes into separate patches - document new sysfs attributes under Documentation/ABI - do bdi_class_init as a core_initcall, otherwise the "default" BDI won't be initialized - remove bdi_init_fmt macro, it's not used very much [akpm@linux-foundation.org: fix ia64 warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Greg KH <greg@kroah.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30/proc/pagetypeinfo: fix output for memoryless nodesKOSAKI Motohiro
on memoryless node, /proc/pagetypeinfo is displayed slightly funny output. this patch fix it. output example (header is outputed, but no data is outputed) -------------------------------------------------------------- Page block order: 14 Pages per block: 16384 Free pages count per migrate type at order 0 1 2 3 4 5 \ 6 7 8 9 10 11 12 13 14 15 16 Number of blocks type Unmovable Reclaimable Movable Reserve Isolate Page block order: 14 Pages per block: 16384 Free pages count per migrate type at order 0 1 2 3 4 5 \ 6 7 8 9 10 11 12 13 14 15 16 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29mm: use non-racy method for /proc/swaps creationDenis V. Lunev
Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29procfs task exe symlinkMatt Helsley
The kernel implements readlink of /proc/pid/exe by getting the file from the first executable VMA. Then the path to the file is reconstructed and reported as the result. Because of the VMA walk the code is slightly different on nommu systems. This patch avoids separate /proc/pid/exe code on nommu systems. Instead of walking the VMAs to find the first executable file-backed VMA we store a reference to the exec'd file in the mm_struct. That reference would prevent the filesystem holding the executable file from being unmounted even after unmapping the VMAs. So we track the number of VM_EXECUTABLE VMAs and drop the new reference when the last one is unmapped. This avoids pinning the mounted filesystem. [akpm@linux-foundation.org: improve comments] [yamamoto@valinux.co.jp: fix dup_mmap] Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com> Cc:"Eric W. Biederman" <ebiederm@xmission.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29ipc: define the slab_memory_callback priority as a constantNadia Derbey
This is a trivial patch that defines the priority of slab_memory_callback in the callback chain as a constant. This is to prepare for next patch in the series. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcg: remove redundant initialization in mem_cgroup_create()Li Zefan
*mem has been zeroed, that means mem->info has already been filled with 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: use vmalloc for mem_cgroup allocationKAMEZAWA Hiroyuki
On ia64, this kmalloc() requires order-4 pages. But this is not necessary to be physically contiguous. For big mem_cgroup, vmalloc is better. For small ones, kmalloc is used. [akpm@linux-foundation.org: simplification] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: make the memory controller more desktop responsiveBalbir Singh
This patch makes the memory controller more responsive on my desktop. 1. Set all cached pages as inactive. We were by default marking all pages as active, thus forcing us to go through two passes for reclaiming pages 2. Remove congestion_wait(), since we already have that logic in do_try_to_free_pages() Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcg: remove redundant function callsKAMEZAWA Hiroyuki
remove_list/add_list uses page_cgroup_zoneinfo() in it. So, it's called twice before and after lock. mz = page_cgroup_zoneinfo(); lock(); mz = page_cgroup_zoneinfo(); .... unlock(); And address of mz never changes. This is not good. This patch fixes this behavior. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: implement failcounter resetPavel Emelyanov
This is a very common requirement from people using the resource accounting facilities (not only memcgroup but also OpenVZ beancounters). They want to put the cgroup in an initial state without re-creating it. For example after re-configuring a group people want to observe how this new configuration fits the group needs without saving the previous failcnt value. Merge two resets into one mem_cgroup_reset() function to demonstrate how multiplexing work. Besides, I have plans to move the files, that correspond to res_counter to the res_counter.c file and somehow "import" them into controller. I don't know how to make it gracefully yet, but merging resets of max_usage and failcnt in one function will be there for sure. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: use triggers in force_empty and max_usage filesPavel Emelyanov
These two files are essentially event callbacks. They do not care about the contents of the string, but only about the fact of the write itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: move memory controller allocations to their own slabsBalbir Singh
Move the memory controller data structure page_cgroup to its own slab cache. It saves space on the system, allocations are not necessarily pushed to order of 2 and should provide performance benefits. Users who disable the memory controller can also double check that the memory controller is not allocating page_cgroup's. NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup as __GFP_MOVABLE or __GFP_RECLAIMABLE. I don't think there is an easy answer at the moment. page_cgroup's are associated with user pages, they can be reclaimed once the user page has been reclaimed, so it might make sense to mark them as __GFP_RECLAIMABLE. For now, I am leaving the marking to default values that the slab allocator uses. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29memcgroup: add the max_usage member on the res_counterPavel Emelyanov
This field is the maximal value of the usage one since the counter creation (or since the latest reset). To reset this to the usage value simply write anything to the appropriate cgroup file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29cgroups: add an owner to the mm_structBalbir Singh
Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29CGroup API files: drop mem_cgroup_force_empty()Paul Menage
This function isn't needed - a NULL pointer in the cftype read function will result in the same EINVAL response to userspace. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29CGroup API files: use cgroup map for memcontrol stats filePaul Menage
Remove the seq_file boilerplate used to construct the memcontrol stats map, and instead use the new map representation for cgroup control files Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29CGroup API files: use read_u64 in memory controllerPaul Menage
Update the memory controller to use read_u64 for its limit/usage/failcnt control files, calling the new res_counter_read_u64() function. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29page allocator: explicitly retry hugepage allocationsNishanth Aravamudan
Add __GFP_REPEAT to hugepage allocations. Do so to not necessitate userspace putting pressure on the VM by repeated echo's into /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch to allow for large-order __GFP_REPEAT attempts to loop for a bit (as opposed to indefinitely), this increases the likelihood of getting hugepages when the system experiences (or recently experienced) load. Mel tested the patchset on an x86_32 laptop. With the patches, it was easier to use the proc interface to grow the hugepage pool. The following is the output of a script that grows the pool as much as possible running on 2.6.25-rc9. Allocating hugepages test ------------------------- Disabling OOM Killer for current test process Starting page count: 0 Attempt 1: 57 pages Progress made with 57 pages Attempt 2: 73 pages Progress made with 16 pages Attempt 3: 74 pages Progress made with 1 pages Attempt 4: 75 pages Progress made with 1 pages Attempt 5: 77 pages Progress made with 2 pages 77 pages was the most it allocated but it took 5 attempts from userspace to get it. With the 3 patches in this series applied, Allocating hugepages test ------------------------- Disabling OOM Killer for current test process Starting page count: 0 Attempt 1: 75 pages Progress made with 75 pages Attempt 2: 76 pages Progress made with 1 pages Attempt 3: 79 pages Progress made with 3 pages And 79 pages was the most it got. Your patches were able to allocate the bulk of possible pages on the first attempt. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29page allocator: smarter retry of costly-order allocationsNishanth Aravamudan
Because of page order checks in __alloc_pages(), hugepage (and similarly large order) allocations will not retry unless explicitly marked __GFP_REPEAT. However, the current retry logic is nearly an infinite loop (or until reclaim does no progress whatsoever). For these costly allocations, that seems like overkill and could potentially never terminate. Mel observed that allowing current __GFP_REPEAT semantics for hugepage allocations essentially killed the system. I believe this is because we may continue to reclaim small orders of pages all over, but never have enough to satisfy the hugepage allocation request. This is clearly only a problem for large order allocations, of which hugepages are the most obvious (to me). Modify try_to_free_pages() to indicate how many pages were reclaimed. Use that information in __alloc_pages() to eventually fail a large __GFP_REPEAT allocation when we've reclaimed an order of pages equal to or greater than the allocation's order. This relies on lumpy reclaim functioning as advertised. Due to fragmentation, lumpy reclaim may not be able to free up the order needed in one invocation, so multiple iterations may be requred. In other words, the more fragmented memory is, the more retry attempts __GFP_REPEAT will make (particularly for higher order allocations). This changes the semantics of __GFP_REPEAT subtly, but *only* for allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size allocations, we will try up to some point (at least 1<<order reclaimed pages), rather than forever (which is the case for allocations <= PAGE_ALLOC_COSTLY_ORDER). This change improves the /proc/sys/vm/nr_hugepages interface with a follow-on patch that makes pool allocations use __GFP_REPEAT. Rather than administrators repeatedly echo'ing a particular value into the sysctl, and forcing reclaim into action manually, this change allows for the sysctl to attempt a reasonable effort itself. Similarly, dynamic pool growth should be more successful under load, as lumpy reclaim can try to free up pages, rather than failing right away. Choosing to reclaim only up to the order of the requested allocation strikes a balance between not failing hugepage allocations and returning to the caller when it's unlikely to every succeed. Because of lumpy reclaim, if we have freed the order requested, hopefully it has been in big chunks and those chunks will allow our allocation to succeed. If that isn't the case after freeing up the current order, I don't think it is likely to succeed in the future, although it is possible given a particular fragmentation pattern. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29mm: fix misleading __GFP_REPEAT related commentsNishanth Aravamudan
The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the core VM have somewhat differing comments as to their actual semantics. Annoyingly, the flags definition has inline and header comments, which might be interpreted as not being equivalent. Just add references to the header comments in the inline ones so they don't go out of sync in the future. In their use in __alloc_pages() clarify that the current implementation treats low-order allocations and __GFP_REPEAT allocations as distinct cases. To clarify, the flags' semantics are: __GFP_NORETRY means try no harder than one run through __alloc_pages __GFP_REPEAT means __GFP_NOFAIL __GFP_NOFAIL means repeat forever order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29mm: fix usemap initializationKAMEZAWA Hiroyuki
usemap must be initialized only when pfn is within zone. If not, it corrupts memory. And this patch also reduces the number of calls to set_pageblock_migratetype() from (pfn & (pageblock_nr_pages -1) to !(pfn & (pageblock_nr_pages-1) it should be called once per pageblock. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm: fix integer as NULL pointer warningsHarvey Harrison
mm/hugetlb.c:207:11: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: pack objects denser slub: Calculate min_objects based on number of processors. slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS slub: Simplify any_slab_object checks slub: Make the order configurable for each slab cache slub: Drop fallback to page allocator method slub: Fallback to minimal order during slab page allocation slub: Update statistics handling for variable order slabs slub: Add kmem_cache_order_objects struct slub: for_each_object must be passed the number of objects in a slab slub: Store max number of objects in the page struct. slub: Dump list of objects not freed on kmem_cache_close() slub: free_list() cleanup slub: improve kmem_cache_destroy() error message slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment.
2008-04-28mm/memory_hotplug.c must #include "internal.h"Adrian Bunk
This patch fixes the following compile error caused by commit 04753278769f3b6c3b79a080edb52f21d83bf6e2 ("memory hotplug: register section/node id to free"): CC mm/memory_hotplug.o /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c: In function ‘put_page_bootmem’: /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c:82: error: implicit declaration of function ‘__free_pages_bootmem’ /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c: At top level: /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/memory_hotplug.c:87: warning: no previous prototype for ‘register_page_bootmem_info_section’ make[2]: *** [mm/memory_hotplug.o] Error 1 [ Andrew: "Argh. The -mm-only memory-hotplug-add-removable-to-sysfs- to-show-memblock-removability.patch debugging patch adds that include so nobody hit this before. ] Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm/nommu.c: return 0 from kobjsize with invalid objectsMichael Hennerich
Don't perform kobjsize operations on objects the kernel doesn't manage. On Blackfin, drivers can get dma coherent memory by calling a function dma_alloc_coherent(). We do this in nommu by configuring a chunk of uncached memory at the top of memory. Since we don't want the kernel to use the uncached memory, we lie to the kernel, and tell it that it's max memory is between 0, and the start of the uncached dma coherent section. this all works well, until this memory gets exposed into userspace (with a frame buffer), when you look at the process's maps, it shows the framebuf: root:/proc> cat maps [snip] 03f0ef00-03f34700 rw-p 00000000 1f:00 192 /dev/fb0 root:/proc> This is outside the "normal" range for the kernel. When the kernel tries to find the size of this object (when you run ps), it dies in nommu.c in kobjsize. BUG_ON(page->index >= MAX_ORDER); since the page we are referring to is outside what the kernel thinks is it's max valid memory. root:~> while [ 1 ]; ps > /dev/null; done kernel BUG at mm/nommu.c:119! Kernel panic - not syncing: BUG! We fixed this by adding a check to reject out of range object pointers as it already does that for NULL pointers. Signed-off-by: Michael Hennerich <Michael.Hennerich@analog.com> Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28vmstats: add cond_resched() to refresh_cpu_vm_stats()Dimitri Sivanich
We've found that it can take quite a bit of time (100's of usec) to get through the zone loop in refresh_cpu_vm_stats(). Adding a cond_resched() to allow other threads to run in the non-preemptive case. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mm/page_alloc.c: remove hand-coded get_order()Pavel Machek
Remove hand-coded get_order() from page_alloc.c. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28oom_kill: remove unused parameter in badness()Li Zefan
In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the memcontroller-related code from badness() to select_bad_process(), so the parameter 'mem' in badness() is unused now. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: free memmaps allocated by bootmemYasunori Goto
This patch is to free memmaps which is allocated by bootmem. Freeing usemap is not necessary. The pages of usemap may be necessary for other sections. If removing section is last section on the node, its section is the final user of usemap page. (usemaps are allocated on its section by previous patch.) But it shouldn't be freed too, because the section must be logical offline state which all pages are isolated against page allocater. If it is freed, page alloctor may use it which will be removed physically soon. It will be disaster. So, this patch keeps it as it is. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: allocate usemap on the section with pgdatYasunori Goto
Usemaps are allocated on the section which has pgdat by this. Because usemap size is very small, many other sections usemaps are allocated on only one page. If a section has usemap, it can't be removed until removing other sections. This dependency is not desirable for memory removing. Pgdat has similar feature. When a section has pgdat area, it must be the last section for removing on the node. So, if section A has pgdat and section B has usemap for section A, Both sections can't be removed due to dependency each other. To solve this issue, this patch collects usemap on same section with pgdat. If other sections doesn't have any dependency, this section will be able to be removed finally. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: make alloc_bootmem_section()Yasunori Goto
alloc_bootmem_section() can allocate specified section's area. This is used for usemap to keep same section with pgdat by later patch. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: align memmap to page sizeYasunori Goto
To free memmap easier, this patch aligns it to page size. Bootmem allocater may mix some objects in one pages. It's not good for freeing memmap of memory hot-remove. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: register section/node id to freeYasunori Goto
This patch set is to free pages which is allocated by bootmem for memory-hotremove. Some structures of memory management are allocated by bootmem. ex) memmap, etc. To remove memory physically, some of them must be freed according to circumstance. This patch set makes basis to free those pages, and free memmaps. Basic my idea is using remain members of struct page to remember information of users of bootmem (section number or node id). When the section is removing, kernel can confirm it. By this information, some issues can be solved. 1) When the memmap of removing section is allocated on other section by bootmem, it should/can be free. 2) When the memmap of removing section is allocated on the same section, it shouldn't be freed. Because the section has to be logical memory offlined already and all pages must be isolated against page allocater. If it is freed, page allocator may use it which will be removed physically soon. 3) When removing section has other section's memmap, kernel will be able to show easily which section should be removed before it for user. (Not implemented yet) 4) When the above case 2), the page isolation will be able to check and skip memmap's page when logical memory offline (offline_pages()). Current page isolation code fails in this case because this page is just reserved page and it can't distinguish this pages can be removed or not. But, it will be able to do by this patch. (Not implemented yet.) 5) The node information like pgdat has similar issues. But, this will be able to be solved too by this. (Not implemented yet, but, remembering node id in the pages.) Fortunately, current bootmem allocator just keeps PageReserved flags, and doesn't use any other members of page struct. The users of bootmem doesn't use them too. This patch: This is to register information which is node or section's id. Kernel can distinguish which node/section uses the pages allcated by bootmem. This is basis for hot-remove sections or nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28hugetlbfs: common code update for s390Gerald Schaefer
Huge ptes have a special type on s390 and cannot be handled with the standard pte functions in certain cases, e.g. because of a different location of the invalid bit. This patch adds some new architecture- specific functions to hugetlb common code, as a prerequisite for the s390 large page support. This won't affect other architectures in functionality, but I need to add some new dummy inline functions to the headers. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28hugetlbfs: add missing TLB flush to hugetlb_cow()Gerald Schaefer
A cow break on a hugetlbfs page with page_count > 1 will set a new pte with set_huge_pte_at(), w/o any tlb flush operation. The old pte will remain in the tlb and subsequent write access to the page will result in a page fault loop, for as long as it may take until the tlb is flushed from somewhere else. This patch introduces an architecture-specific huge_ptep_clear_flush() function, which is called before the the set_huge_pte_at() in hugetlb_cow(). ATTENTION: This is just a nop on all architectures for now, the s390 implementation will come with our large page patch later. Other architectures should define their own huge_ptep_clear_flush() if needed. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: use struct mempolicy pointer in shmem_sb_infoLee Schermerhorn
This patch replaces the mempolicy mode, mode_flags, and nodemask in the shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL. This removes dependency on the details of mempolicy from shmem.c and hugetlbfs inode.c and simplifies the interfaces. mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context' argument that causes the input nodemask to be stored in the w.user_nodemask of the created mempolicy for use when the mempolicy is installed in a tmpfs inode shared policy tree. At that time, any cpuset contextualization is applied to the original input nodemask. This preserves the previous behavior where the input nodemask was stored in the superblock. We can think of the returned mempolicy as "context free". Because mpol_parse_str() is now calling mpol_new(), we can remove from mpol_to_str() the semantic checks that mpol_new() already performs. Add 'no_context' parameter to mpol_to_str() to specify that it should format the nodemask in w.user_nodemask for 'bind' and 'interleave' policies. Change mpol_shared_policy_init() to take a pointer to a "context free" struct mempolicy and to create a new, "contextualized" mempolicy using the mode, mode_flags and user_nodemask from the input mempolicy. Note: we know that the mempolicy passed to mpol_to_str() or mpol_shared_policy_init() from a tmpfs superblock is "context free". This is currently the only instance thereof. However, if we found more uses for this concept, and introduced any ambiguity as to whether a mempolicy was context free or not, we could add another internal mode flag to identify context free mempolicies. Then, we could remove the 'no_context' argument from mpol_to_str(). Added shmem_get_sbmpol() to return a reference counted superblock mempolicy, if one exists, to pass to mpol_shared_policy_init(). We must add the reference under the sb stat_lock to prevent races with replacement of the mpol by remount. This reference is removed in mpol_shared_policy_init(). [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: another build fix] [akpm@linux-foundation.org: yet another build fix] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: support mpol=local tmpfs mount optionLee Schermerhorn
For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent to "local allocation". Because shared policies are at the same "scope" level [see Documentation/vm/numa_memory_policy.txt], as vma policies MPOL_DEFAULT means "fall back to current task policy". This patch extends the memory policy string parsing function to display "local" for MPOL_PREFERRED + MPOL_F_LOCAL. This allows one to specify local allocation as the default policy for shared memory areas via the tmpfs mpol mount option, regardless of the current task's policy. Also, "local" is now displayed for this policy. This patch allows us to accept the same input format as the display. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rework shmem mpol parsing and displayLee Schermerhorn
mm/shmem.c currently contains functions to parse and display memory policy strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with the rest of the mempolicy support. With subsequent patches, we'll be able to remove knowledge of the details [mode, flags, policy, ...] completely from shmem.c 1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in mm/mempolicy.c. Rework to use the policy_types[] array [used by mpol_to_str()] to look up mode by name. 2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str() expects a pointer to a struct mempolicy, so temporarily construct one. This will be replaced with a reference to a struct mempolicy in the tmpfs superblock in a subsequent patch. NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal sign '=' as the nodemask delimiter to match mpol_parse_str() and the tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This is a user visible change to numa_maps, but then the addition of the mode flags already changed the display. It makes sense to me to have the mounts and numa_maps display the policy in the same format. However, if anyone objects strongly, I can pass the desired nodemask delimeter as an arg to mpol_to_str(). Note 2: Like show_numa_map(), I don't check the return code from mpol_to_str(). I do use a longer buffer than the one provided by show_numa_map(), which seems to have sufficed so far. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: clean-up mpol-to-str() mempolicy formattingLee Schermerhorn
mpol-to-str() formats memory policies into printable strings. Currently this is only used to display "numa_maps". A subsequent patch will use mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to remove essentially duplicate code in mm/shmem.c. This patch cleans up mpol_to_str() generally and in preparation for that patch. 1) show_numa_maps() is not checking the return code from mpol_to_str(). There's not a lot we can do in this context if mpol_to_str() did return the error [insufficient space in buffer]. Proposed "solution": just check, under DEBUG_VM, that callers are providing sufficient buffer space for the policy, flags, and a few nodes. This way, we'll get some display. show_numa_maps() is providing a 50-byte buffer, so it won't trip this check. 50-bytes should be sufficient unless one has a large number of nodes in a very sparse nodemask. 2) The display of the new mode flags ["static" & "relative"] was set up to display multiple flags, separated by a "bar" '|'. However, this support is incomplete--e.g., need_bar was never incremented; and currently, these two flags are mutually exclusive. So remove the "bar" support, for now, and only display one flag. 3) Use snprint() to format flags, so as not to overflow the buffer. Not that it's ever happed, AFAIK. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local PolicyLee Schermerhorn
Now that we're using "preferred local" policy for system default, we need to make this as fast as possible. Because of the variable size of the mempolicy structure [based on size of nodemasks], the preferred_node may be in a different cacheline from the mode. This can result in accessing an extra cacheline in the normal case of system default policy. Suspect this is the cause of an observed 2-3% slowdown in page fault testing relative to kernel without this patch series. To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy flags member which is guaranteed [?] to be in the same cacheline as the mode itself. Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1 for both anon and shmem segments with system default and vma [preferred local] policy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>