linaro-lsk.git - [no description]

Age	Commit message (Collapse)	Author
2008-02-05	radix-tree: avoid atomic allocations for preloaded insertions	Nick Piggin
	Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	make __vmalloc_area_node() static	Adrian Bunk
	__vmalloc_area_node() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	Remove unused code from mm/tiny-shmem.c	Balbir Singh
	This code in mm/tiny-shmem.c is under #if 0 - remove it. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	mm/page-writeback.c: make a function static	Adrian Bunk
	task_dirty_limit() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	maps4: make page monitoring /proc file optional	Matt Mackall
	Make /proc/ page monitoring configurable This puts the following files under an embedded config option: /proc/pid/clear_refs /proc/pid/smaps /proc/pid/pagemap /proc/kpagecount /proc/kpageflags [akpm@linux-foundation.org: Kconfig fix] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	maps4: introduce a generic page walker	Matt Mackall
	Introduce a general page table walker Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	maps4: move is_swap_pte	Matt Mackall
	Move is_swap_pte helper function to swapops.h for use by pagemap code Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	clean up vmtruncate	Christoph Hellwig
	vmtruncate is a twisted maze of gotos, this patch cleans it up to have a proper if else for the two major cases of extending and truncating truncate and thus makes it a lot more readable while keeping exactly the same functinality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: fix shmem_swaplist races	Hugh Dickins
	Intensive swapoff testing shows shmem_unuse spinning on an entry in shmem_swaplist pointing to itself: how does that come about? Days pass... First guess is this: shmem_delete_inode tests list_empty without taking the global mutex (so the swapping case doesn't slow down the common case); but there's an instant in shmem_unuse_inode's list_move_tail when the list entry may appear empty (a rare case, because it's actually moving the head not the the list member). So there's a danger of leaving the inode on the swaplist when it's freed, then reinitialized to point to itself when reused. Fix that by skipping the list_move_tail when it's a no-op, which happens to plug this. But this same spinning then surfaces on another machine. Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we still hold page lock, which would hold off inode deletion if the page were in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the the inode on swaplist earlier, but then shmem_unuse_inode could never prune unswapped inodes. Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode; though I am a little uneasy about the iput which has to follow - it works, and I see nothing wrong with it, but it is surprising that shmem inode deletion may now occur below shmem_writepage. Revisit this fix later? And while we're looking at these races: the way shmem_unuse tests swapped without holding info->lock looks unsafe, if we've more than one swap area: a racing shmem_writepage on another page of the same inode could be putting it in swapcache, just as we're deciding to remove the inode from swaplist - there's a danger of going on swap without being listed, so a later swapoff would hang, being unable to locate the entry. Move that test and removal down into shmem_unuse_inode, once info->lock is held. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: radix_tree_preloading	Hugh Dickins
	Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache or swap cache, without any radix tree preload: so tending to deplete emergency reserves of memory. GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's being called under memory pressure, so must not wait for more memory to become available. But shmem_unuse_inode now has a window in which it can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its add_to_page_cache. shmem_getpage is not so straightforward: its filepage/swappage integrity relies upon exchanging between caches under spinlock, and it would need a lot of restructuring to place the preloads correctly. Instead, follow its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in add_to_page_cache, and begin each circuit of the repeat loop with a sleeping radix_tree_preload, followed immediately by radix_tree_preload_end - that won't guarantee success in the next add_to_page_cache, but doesn't need to. And we can then remove that bothersome congestion_wait: when needed, it'll automatically get done in the course of the radix_tree_preload. Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: open a window in shmem_unuse_inode	Hugh Dickins
	There are a couple of reasons (patches follow) why it would be good to open a window for sleep in shmem_unuse_inode, between its search for a matching swap entry, and its handling of the entry found. shmem_unuse_inode must then use igrab to hold the inode against deletion in that window, and its corresponding iput might result in deletion: so it had better unlock_page before the iput, and might as well release the page too. Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse into shmem_unuse_inode, in the case when it finds a match. Let try_to_unuse break on error in the shmem_unuse case, as it does in the unuse_mm case: though at this point in the series, no error to break on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: make shmem_unuse more preemptible	Hugh Dickins
	shmem_unuse is at present an unbroken search through every swap vector page of every tmpfs file which might be swapped, all under shmem_swaplist_lock. This dates from long ago, when the caller held mmlist_lock over it all too: long gone, but there's never been much pressure for preemptible swapoff. Make it a little more preemptible, replacing shmem_swaplist_lock by shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a cond_resched_lock (on info->lock) at one convenient point in the shmem_unuse_inode loop, where it has no outstanding kmap_atomic. If we're serious about preemptible swapoff, there's much further to go e.g. I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM case dictate preemptiblility for other configs. But as in the earlier patch to make swapoff scan ptes preemptibly, my hidden agenda is really towards making memcgroups work, hardly about preemptibility at all. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: allocate on read when stacked	Hugh Dickins
	tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or size=0). But if a stacked filesystem such as unionfs gets pages from a sparse tmpfs file by reading holes, and then writes to them, it can easily exceed any such limit at present. So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed, pessimistically mark such pages as dirty, so they cannot get reclaimed and unaccounted by mistake. The venerable shmem_recalc_inode code (originally to account for the reclaim of clean pages) suffices to get the accounting right when swappages are dropped in favour of more uptodate filepages. This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage, caused by unionfs writing to a very sparse tmpfs file: to minimize memory allocation in swapout, tmpfs requires the swap vector be allocated upfront, which wasn't always happening in this stacked case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: allow filepage alongside swappage	Hugh Dickins
	tmpfs has long allowed for a fresh filepage to be created in pagecache, just before shmem_getpage gets the chance to match it up with the swappage which already belongs to that offset. But unionfs_writepage now does a find_or_create_page, divorced from shmem_getpage, which leaves conflicting filepage and swappage outstanding indefinitely, when unionfs is over tmpfs. Therefore shmem_writepage (where a page is swizzled from file to swap) must now be on the lookout for existing swap, ready to free it in favour of the more uptodate filepage, instead of BUGging on that clash. And when the add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails in shmem_getpage, it should retry in the same way it already does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: move swap swizzling into shmem	Hugh Dickins
	move_to_swap_cache and move_from_swap_cache functions (which swizzle a page between tmpfs page cache and swap cache, to avoid page copying) are only used by shmem.c; and our subsequent fix for unionfs needs different treatments in the two instances of move_from_swap_cache. Move them from swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making add_to_swap_cache externally visible. shmem.c likes to say set_page_dirty where swap_state.c liked to say SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback makes moot (and implies we should lose that "shift page from clean_pages to dirty_pages list" comment: it's on neither). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: shuffle add_to_swap_caches	Hugh Dickins
	add_to_swap_cache doesn't amount to much: merge it into its sole caller read_swap_cache_async. But we'll be needing to call __add_to_swap_cache from shmem.c, so promote it to the new add_to_swap_cache. Both were static, so there's no interface confusion to worry about. And lose that inappropriate "Anon pages are already on the LRU" comment in the merging: they're not already on the LRU, as Nick Piggin noticed. Signed-off-by: Hugh Dickins <hugh@veritas.com> No-problems-with: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: move swap_state stats update	Hugh Dickins
	Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help fix, it's best to move the swap swizzling functions from swap_state.c to shmem.c. As a preliminary to that, move swap stats updating down into __add_to_swap_cache, which will remain internal to swap_state.c. Well, actually, just move down the incrementation of add_total: remove noent_race and exist_race completely, they are relics of my 2.4.11 testing. Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	tmpfs: fix mounts when size is less than the page size	Michael Marineau
	When tmpfs is mounted with a size less than one page, the number of blocks is set to 0 which makes the tmpfs mount unlimited. This can lead to a quick and surprising death if someone typos a tmpfs mount command and writes too much. tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0, as Documentation/filesystems/tmpfs.txt says. Hugh: do this by rounding size up instead of down in all cases: which slightly expands other odd-sized tmpfs mounts, but in a consistent way. Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	shmem: factor out sbi->free_inodes manipulations	Pavel Emelyanov
	The shmem_sb_info structure has a number of free_inodes. This value is altered in appropriate places under spinlock and with the sbi->max_inodes != 0 check. Consolidate these manipulations into two helpers. This is minus 42 bytes of shmem.o and minus 4 :) lines of code. [akpm@linux-foundation.org: fix error return values] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	swapoff: scan ptes preemptibly	Hugh Dickins
	Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency in swapoff by scanning the page table preemptibly: so long as unuse_pte is careful to recheck that entry under pte lock. (To tell the truth, this patch was not inspired by any cries for lower latency here: rather, this restructuring permits a future memory controller patch to allocate with GFP_KERNEL in unuse_pte, where before it could not. But it would be wrong to tuck this change away inside a memcgroup patch.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	swapin: fix valid_swaphandles defect	Hugh Dickins
	valid_swaphandles is supposed to do a quick pass over the swap map entries neigbouring the entry which swapin_readahead is targetting, to determine for it a range worth reading all together. But since it always starts its search from the beginning of the swap "cluster", a reject (free entry) there immediately curtails the readaround, and every swapin_readahead from that cluster is for just a single page. Instead scan forwards and backwards around the target entry. Use better names for some variables: a swap_info pointer is usually called "si" not "swapdev". And at the end, if only the target page should be read, return count of 0 to disable readaround, to avoid the unnecessarily repeated call to read_swap_cache_async. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	shmem_file_write is redundant	Hugh Dickins
	With the old aops, writing to a tmpfs file had to use its own special method: the generic method would pass in a fresh page to prepare_write when the right page was there in swapcache - which was inefficient to handle, even once we'd concocted the code to handle it. With the new aops, the generic method uses shmem_write_end, which lets shmem_getpage find the right page: so now abandon shmem_file_write in favour of the generic method. Yes, that does do several things that tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited, which ramfs also calls); but more use of common code is preferable. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	shmem_getpage return page locked	Hugh Dickins
	In the new aops, write_begin is supposed to return the page locked: though I've seen no ill effects, that's been overlooked in the case of shmem_write_begin, and should be fixed. Then shmem_write_end must unlock the page: do so _after_ updating i_size, as we found to be important in other filesystems (though since shmem pages don't go the usual writeback route, they never suffered from that corruption). For shmem_write_begin to return the page locked, we need shmem_getpage to return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's simplify the interface and return it locked even when SGP_READ. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	shmem: SGP_QUICK and SGP_FAULT redundant	Hugh Dickins
	Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no users now. Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and shmem_getpage is about to return with page always locked). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	swapin needs gfp_mask for loop on tmpfs	Hugh Dickins
	Building in a filesystem on a loop device on a tmpfs file can hang when swapping, the loop thread caught in that infamous throttle_vm_writeout. In theory this is a long standing problem, which I've either never seen in practice, or long ago suppressed the recollection, after discounting my load and my tmpfs size as unrealistically high. But now, with the new aops, it has become easy to hang on one machine. Loop used to grab_cache_page before the old prepare_write to tmpfs, which seems to have been enough to free up some memory for any swapin needed; but the new write_begin lets tmpfs find or allocate the page (much nicer, since grab_cache_page missed tmpfs pages in swapcache). When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which has __GFP_IO\|__GFP_FS stripped off, and throttle_vm_writeout is designed to break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the mapping_gfp_mask - hence the hang. So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to swapin_readahead to read_swap_cache_async to add_to_swap_cache. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	swapin_readahead: move and rearrange args	Hugh Dickins
	swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c beside its kindred read_swap_cache_async. Why were its args in a different order? rearrange them. And since it was always followed by a read_swap_cache_async of the target page, fold that in and return struct page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and read_swap_cache_async stubs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	swapin_readahead: excise NUMA bogosity	Hugh Dickins
	For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA code, advancing addr, and stepping on to the next vma at the boundary, to line up the mempolicy for each page allocation. It _might_ be a good idea to allocate swap more according to vma layout; but the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is allocated as needed for pages as they sink to the bottom of the inactive LRUs. Sometimes that may match vma layout, but not so often that it's worth going to these misleading vma->vm_next lengths: rip all that out. Originally I intended to retain the incrementation of addr, but correct its initial value: valid_swaphandles generally supplies an offset below the target addr (this is readaround rather than readahead), but addr has not been adjusted accordingly, so in the interleave case it has usually been allocating the target page from the "wrong" node (though that may not matter very much). But look at the equivalent shmem_swapin code: either by oversight or by design, though it has all the apparatus for choosing a new mempolicy per page, it uses the same idx throughout, choosing the same mempolicy and interleave node for each page of the cluster. Which is actually a much better strategy: each node has its own LRUs and its own kswapd, so if you're betting on any particular relationship between swap and node, the best bet is that nearby swap entries belong to pages from the same node - even when the mempolicy of the target page is to interleave. And examining a map of nodes corresponding to swap entries on a numa=fake system bears this out. (We could later tweak swap allocation to make it even more likely, but this patch is merely about removing cruft.) So, neither adjust nor increment addr in swapin_readahead, and then shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up once per cluster, and so few fields of pvma are used, let's skip the memset - from shmem_alloc_page also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	vmalloc: clean up page array indexing	Christoph Lameter
	The page array is repeatedly indexed both in vunmap and vmalloc_area_node(). Add a temporary variable to make it easier to read (and easier to patch later). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	is_vmalloc_addr(): Check if an address is within the vmalloc boundaries	Christoph Lameter
	Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	vmalloc: add const to void* parameters	Christoph Lameter
	Make vmalloc functions work the same way as kfree() and friends that take a const void * argument. [akpm@linux-foundation.org: fix consts, coding-style] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	Move vmalloc_to_page() to mm/vmalloc.	Christoph Lameter
	We already have page table manipulation for vmalloc in vmalloc.c. Move the vmalloc_to_page() function there as well. Move the definitions for vmalloc related functions in mm.h to a newly created section. A better place would be vmalloc.h but mm.h is basic and may depend on these functions. An alternative would be to include vmalloc.h in mm.h (like done for vmstat.h). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user	Christoph Lameter
	Simplify page cache zeroing of segments of pages through 3 functions zero_user_segments(page, start1, end1, start2, end2) Zeros two segments of the page. It takes the position where to start and end the zeroing which avoids length calculations and makes code clearer. zero_user_segment(page, start, end) Same for a single segment. zero_user(page, start, length) Length variant for the case where we know the length. We remove the zero_user_page macro. Issues: 1. Its a macro. Inline functions are preferable. 2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this special case everywhere makes the code needlessly complex. The parameter for zeroing is always KM_USER0 except in one single case that we open code. Avoiding KM_USER0 makes a lot of code not having to be dealing with the special casing for HIGHMEM anymore. Dealing with kmap is only necessary for HIGHMEM configurations. In those configurations we use KM_USER0 like we do for a series of other functions defined in highmem.h. Since KM_USER0 is depends on HIGHMEM the existing zero_user_page function could not be a macro. zero_user_* functions introduced here can be be inline because that constant is not used when these functions are called. Also extract the flushing of the caches to be outside of the kmap. [akpm@linux-foundation.org: fix nfs and ntfs build] [akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05	sys_remap_file_pages: fix ->vm_file accounting	Oleg Nesterov
	Fix ->vm_file accounting, mmap_region() may do do_munmap(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-04	Merge branch 'slub-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm * 'slub-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm: Explain kmem_cache_cpu fields SLUB: Do not upset lockdep SLUB: Fix coding style violations Add parameter to add_partial to avoid having two functions SLUB: rename defrag to remote_node_defrag_ratio Move count_partial before kmem_cache_shrink SLUB: Fix sysfs refcounting slub: fix shadowed variable sparse warnings
2008-02-04	SLUB: Do not upset lockdep	root
	inconsistent {softirq-on-W} -> {in-softirq-W} usage. swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes: (&n->list_lock){-+..}, at: [<ffffffff802935c1>] add_partial+0x31/0xa0 {softirq-on-W} state was registered at: [<ffffffff80259fb8>] __lock_acquire+0x3e8/0x1140 [<ffffffff80259838>] debug_check_no_locks_freed+0x188/0x1a0 [<ffffffff8025ad65>] lock_acquire+0x55/0x70 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff805c76de>] _spin_lock+0x1e/0x30 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff80296f9c>] kmem_cache_open+0x1cc/0x330 [<ffffffff805c7984>] _spin_unlock_irq+0x24/0x30 [<ffffffff802974f4>] create_kmalloc_cache+0x64/0xf0 [<ffffffff80295640>] init_alloc_cpu_cpu+0x70/0x90 [<ffffffff8080ada5>] kmem_cache_init+0x65/0x1d0 [<ffffffff807f1b4e>] start_kernel+0x23e/0x350 [<ffffffff807f112d>] _sinittext+0x12d/0x140 [<ffffffffffffffff>] 0xffffffffffffffff This change isn't really necessary for correctness, but it prevents lockdep from getting upset and then disabling itself. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04	SLUB: Fix coding style violations	Pekka Enberg
	This fixes most of the obvious coding style violations in mm/slub.c as reported by checkpatch. Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04	Add parameter to add_partial to avoid having two functions	Christoph Lameter
	Add a parameter to add_partial instead of having separate functions. The parameter allows a more detailed control of where the slab pages is placed in the partial queues. If we put slabs back to the front then they are likely immediately used for allocations. If they are put at the end then we can maximize the time that the partial slabs spent without being subject to allocations. When deactivating slab we can put the slabs that had remote objects freed (we can see that because objects were put on the freelist that requires locks) to them at the end of the list so that the cachelines of remote processors can cool down. Slabs that had objects from the local cpu freed to them (objects exist in the lockless freelist) are put in the front of the list to be reused ASAP in order to exploit the cache hot state of the local cpu. Patch seems to slightly improve tbench speed (1-2%). Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04	SLUB: rename defrag to remote_node_defrag_ratio	Christoph Lameter
	The NUMA defrag works by allocating objects from partial slabs on remote nodes. Rename it to remote_node_defrag_ratio to be clear about this. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04	Move count_partial before kmem_cache_shrink	Christoph Lameter
	Move the counting function for objects in partial slabs so that it is placed before kmem_cache_shrink. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04	SLUB: Fix sysfs refcounting	Christoph Lameter
	If CONFIG_SYSFS is set then free the kmem_cache structure when sysfs tells us its okay. Otherwise there is the danger (as pointed out by Al Viro) that sysfs thinks the kobject still exists after kmem_cache_destroy() removed it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>
2008-02-04	slub: fix shadowed variable sparse warnings	Harvey Harrison
	Introduce 'len' at outer level: mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one mm/slub.c:3393:6: originally declared here No need to declare new node: mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one mm/slub.c:3491:6: originally declared here No need to declare new x: mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one mm/slub.c:3492:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial	Linus Torvalds
	* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits) Jesper Juhl is the new trivial patches maintainer Documentation: mention email-clients.txt in SubmittingPatches fs/binfmt_elf.c: spello fix do_invalidatepage() comment typo fix Documentation/filesystems/porting fixes typo fixes in net/core/net_namespace.c typo fix in net/rfkill/rfkill.c typo fixes in net/sctp/sm_statefuns.c lib/: Spelling fixes kernel/: Spelling fixes include/scsi/: Spelling fixes include/linux/: Spelling fixes include/asm-m68knommu/: Spelling fixes include/asm-frv/: Spelling fixes fs/: Spelling fixes drivers/watchdog/: Spelling fixes drivers/video/: Spelling fixes drivers/ssb/: Spelling fixes drivers/serial/: Spelling fixes drivers/scsi/: Spelling fixes ...
2008-02-04	vm audit: add VM_DONTEXPAND to mmap for drivers that need it	Nick Piggin
	Drivers that register a ->fault handler, but do not range-check the offset argument, must set VM_DONTEXPAND in the vm_flags in order to prevent an expanding mremap from overflowing the resource. I've audited the tree and attempted to fix these problems (usually by adding VM_DONTEXPAND where it is not obvious). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-03	do_invalidatepage() comment typo fix	Fengguang Wu
	Fix a typo in the comment for do_invalidatepage(). Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-02-03	fix writev regression: pan hanging unkillable and un-straceable	Nick Piggin
	Frederik Himpe reported an unkillable and un-straceable pan process. Zero length iovecs can go into an infinite loop in writev, because the iovec iterator does not always advance over them. The sequence required to trigger this is not trivial. I think it requires that a zero-length iovec be followed by a non-zero-length iovec which causes a pagefault in the atomic usercopy. This causes the writev code to drop back into single-segment copy mode, which then tries to copy the 0 bytes of the zero-length iovec; a zero length copy looks like a failure though, so it loops. Put a test into iov_iter_advance to catch zero-length iovecs. We could just put the test in the fallback path, but I feel it is more robust to skip over zero-length iovecs throughout the code (iovec iterator may be used in filesystems too, so it should be robust). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-01	Merge branch 'task_killable' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits) Remove commented-out code copied from NFS NFS: Switch from intr mount option to TASK_KILLABLE Add wait_for_completion_killable Add wait_event_killable Add schedule_timeout_killable Use mutex_lock_killable in vfs_readdir Add mutex_lock_killable Use lock_page_killable Add lock_page_killable Add fatal_signal_pending Add TASK_WAKEKILL exit: Use task_is_* signal: Use task_is_* sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL ptrace: Use task_is_* power: Use task_is_* wait: Use TASK_NORMAL proc/base.c: Use task_is_* proc/array.c: Use TASK_REPORT perfmon: Use task_is_* ... Fixed up conflicts in NFS/sunrpc manually..
2008-01-30	x86: print which shared library/executable faulted in segfault etc. messages v3	Andi Kleen
	They now look like: hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000] This makes it easier to pinpoint bugs to specific libraries. And printing the offset into a mapping also always allows to find the correct fault point in a library even with randomized mappings. Previously there was no way to actually find the correct code address inside the randomized mapping. Relies on earlier patch to shorten the printk formats. They are often now longer than 80 characters, but I think that's worth it. [includes fix from Eric Dumazet to check d_path error value] Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30	spinlock: lockbreak cleanup	Nick Piggin
	The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30	x86: randomize brk	Jiri Kosina
	Randomize the location of the heap (brk) for i386 and x86_64. The range is randomized in the range starting at current brk location up to 0x02000000 offset for both architectures. This, together with pie-executable-randomization.patch and pie-executable-randomization-fix.patch, should make the address space randomization on i386 and x86_64 complete. Arjan says: This is known to break older versions of some emacs variants, whose dumper code assumed that the last variable declared in the program is equal to the start of the dynamically allocated memory region. (The dumper is the code where emacs effectively dumps core at the end of it's compilation stage; this coredump is then loaded as the main program during normal use) iirc this was 5 years or so; we found this way back when I was at RH and we first did the security stuff there (including this brk randomization). It wasn't all variants of emacs, and it got fixed as a result (I vaguely remember that emacs already had code to deal with it for other archs/oses, just ifdeffed wrongly). It's a rare and wrong assumption as a general thing, just on x86 it mostly happened to be true (but to be honest, it'll break too if gcc does something fancy or if the linker does a non-standard order). Still its something we should at least document. Note 2: afaik it only broke the emacs build. I'm not 100% sure about that (it IS 5 years ago) though. [ akpm@linux-foundation.org: deuglification ] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-28	sh: Bump number of quicklists for SH-5.	Paul Mundt
	Sync up with the SH definitions. Signed-off-by: Paul Mundt <lethal@linux-sh.org>