aboutsummaryrefslogtreecommitdiff
path: root/ipc/shm.c
AgeCommit message (Collapse)Author
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-23mm: make do_mmap_pgoff return populate as a size in bytes, not as a boolMichel Lespinasse
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple, however there was no equivalent code in mm_populate(), which caused issues. This could be fixed by introduced the same rounding in mm_populate(), however I think it's preferable to make do_mmap_pgoff() return populate as a size rather than as a boolean, so we don't have to duplicate the size rounding logic in mm_populate(). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: introduce mm_populate() for populating new vmasMichel Lespinasse
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was using during mlock() / mlockall(). Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22fs: Preserve error code in get_empty_filp(), part 2Anatol Pomozov
Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-11mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLBAndi Kleen
There was some desire in large applications using MAP_HUGETLB or SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on others. This is useful together with NUMA policy: use 2MB interleaving on some mappings, but 1GB on local mappings. This patch extends the IPC/SHM syscall interfaces slightly to allow specifying the page size. It borrows some upper bits in the existing flag arguments and allows encoding the log of the desired page size in addition to the *_HUGETLB flag. When 0 is specified the default size is used, this makes the change fully compatible. Extending the internal hugetlb code to handle this is straight forward. Instead of a single mount it just keeps an array of them and selects the right mount based on the specified page size. When no page size is specified it uses the mount of the default page size. The change is not visible in /proc/mounts because internal mounts don't appear there. It also has very little overhead: the additional mounts just consume a super block, but not more memory when not used. I also exported the new flags to the user headers (they were previously under __KERNEL__). Right now only symbols for x86 and some other architecture for 1GB and 2MB are defined. The interface should already work for all other architectures though. Only architectures that define multiple hugetlb sizes actually need it (that is currently x86, tile, powerpc). However tile and powerpc have user configurable hugetlb sizes, so it's not easy to add defines. A program on those architectures would need to query sysfs and use the appropiate log2. [akpm@linux-foundation.org: cleanups] [rientjes@google.com: fix build] [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-06userns: Convert ipc to use kuid and kgid where appropriateEric W. Biederman
- Store the ipc owner and creator with a kuid - Store the ipc group and the crators group with a kgid. - Add error handling to ipc_update_perms, allowing it to fail if the uids and gids can not be converted to kuids or kgids. - Modify the proc files to display the ipc creator and owner in the user namespace of the opener of the proc file. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-07-30ipc: add COMPAT_SHMLBA supportWill Deacon
If the SHMLBA definition for a native task differs from the definition for a compat task, the do_shmat() function would need to handle both. This patch introduces COMPAT_SHMLBA, which is used by the compat shmat syscall when calling the ipc code and allows architectures such as AArch64 (where the native SHMLBA is 64k but the compat (AArch32) definition is 16k) to provide the correct semantics for compat IPC system calls. Cc: David S. Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07ipc: shm: restore MADV_REMOVE functionality on shared memory segmentsWill Deacon
Commit 17cf28afea2a ("mm/fs: remove truncate_range") removed the truncate_range inode operation in favour of the fallocate file operation. When using SYSV IPC shared memory segments, calling madvise with the MADV_REMOVE advice on an area of shared memory will attempt to invoke the .fallocate function for the shm_file_operations, which is NULL and therefore returns -EOPNOTSUPP to userspace. The previous behaviour would inherit the inode_operations from the underlying tmpfs file and invoke truncate_range there. This patch restores the previous behaviour by wrapping the underlying fallocate function in shm_fallocate, as we do for fsync. [hughd@google.com: use -ENOTSUPP in shm_fallocate()] Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01switch aio and shm to do_mmap_pgoff(), make do_mmap() staticAl Viro
after all, 0 bytes and 0 pages is the same thing... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01take security_mmap_file() outside of ->mmap_semAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-21hugetlbfs: fix alignment of huge page requestsSteven Truelove
When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23SHM_UNLOCK: fix Unevictable pages stranded after swapHugh Dickins
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in find_get_pages") correctly fixed an infinite loop; but left a problem that find_get_pages() on shmem would return 0 (appearing to callers to mean end of tree) when it meets a run of nr_pages swap entries. The only uses of find_get_pages() on shmem are via pagevec_lookup(), called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's scan_mapping_unevictable_pages(). The first is already commented, and not worth worrying about; but the second can leave pages on the Unevictable list after an unusual sequence of swapping and locking. Fix that by using shmem_find_get_pages_and_swap() (then ignoring the swap) instead of pagevec_lookup(). But I don't want to contaminate vmscan.c with shmem internals, nor shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into shmem.c, renaming it shmem_unlock_mapping(); and rename check_move_unevictable_page() to check_move_unevictable_pages(), looping down an array of pages, oftentimes under the same lock. Leave out the "rotate unevictable list" block: that's a leftover from when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed handling involved looking at pages at tail of LRU. Was there significance to the sequence first ClearPageUnevictable, then test page_evictable, then SetPageUnevictable here? I think not, we're under LRU lock, and have no barriers between those. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23SHM_UNLOCK: fix long unpreemptible sectionHugh Dickins
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages evictable again once the shared memory is unlocked. It does this with pagevec_lookup()s across the whole object (which might occupy most of memory), and takes 300ms to unlock 7GB here. A cond_resched() every PAGEVEC_SIZE pages would be good. However, KOSAKI-san points out that this is called under shmem.c's info->lock, and it's also under shm.c's shm_lock(), both spinlocks. There is no strong reason for that: we need to take these pages off the unevictable list soonish, but those locks are not required for it. So move the call to scan_mapping_unevictable_pages() from shmem.c's unlock handling up to shm.c's unlock handling. Remove the recently added barrier, not needed now we have spin_unlock() before the scan. Use get_file(), with subsequent fput(), to make sure we have a reference to mapping throughout scan_mapping_unevictable_pages(): that's something that was previously guaranteed by the shm_lock(). Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK time, and we lazily discover them to be Unevictable later, so it serves no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since pages still on pagevec are not marked Unevictable. The original code avoided redundant rescans by checking VM_LOCKED flag at its level: now avoid them by checking shp's SHM_LOCKED. The original code called scan_mapping_unevictable_pages() on a locked area at shm_destroy() time: perhaps we once had accounting cross-checks which required that, but not now, so skip the overhead and just let inode eviction deal with them. Put check_move_unevictable_page() and scan_mapping_unevictable_pages() under CONFIG_SHMEM (with stub for the TINY case when ramfs is used), more as comment than to save space; comment them used for SHM_UNLOCK. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-04Do 'shm_init_ns()' in an early pure_initcallLinus Torvalds
This isn't really critical any more, since other patches (commit 298507d4d2cf: "shm: optimize exit_shm()") have caused us to not actually need to touch the rw_mutex unless there are actual shm segments associated with the namespace, but we really should do tne shm_init_ns() earlier than we do now. This, together with commit 288d5abec831 ("Boot up with usermodehelper disabled") will mean that we really do initialize the initial ipc namespace data structure before we run any tasks. Tested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03shm: optimize exit_shm()Vasiliy Kulikov
We may optimistically check .in_use == 0 without holding the rw_mutex: it's the common case, and if it's zero, there certainly won't be any segments associated with us. After taking the lock, the idr_for_each() will do the right thing, so we could now drop the re-check inside the lock without any real cost. But it won't hurt. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03shm: fix wrong testsVasiliy Kulikov
Commit 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting") introduced a copy-paste bug. Due to the bug cycle optimizations were disabled. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-30shm: optimize locking and ipc_namespace gettingVasiliy Kulikov
shm_lock() does a lookup of shm segment in shm_ids(ns).ipcs_idr, which is redundant as we already know shmid_kernel address. An actual lock is also not required for reads until we really want to destroy the segment. exit_shm() and shm_destroy_orphaned() may avoid the loop by checking whether there is at least one segment in current ipc_namespace. The check of nsproxy and ipc_ns against NULL is redundant as exit_shm() is called from do_exit() before the call to exit_notify(), so the dereferencing current->nsproxy->ipc_ns is guaranteed to be safe. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-30shm: handle separate PID namespaces caseVasiliy Kulikov
shm_try_destroy_orphaned() and shm_try_destroy_current() didn't handle the case of separate PID namespaces, but a single IPC namespace. If there are tasks with the same PID values using the same shmem object, the wrong destroy decision could be reached. On shm segment creation store the pointer to the creator task in shmid_kernel->shm_creator field and zero it on task exit. Then use the ->shm_creator insread of shm_cprid in both functions. As shmid_kernel object is already locked at this stage, no additional locking is needed. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26ipc: introduce shm_rmid_forced sysctlVasiliy Kulikov
Add support for the shm_rmid_forced sysctl. If set to 1, all shared memory objects in current ipc namespace will be automatically forced to use IPC_RMID. The POSIX way of handling shmem allows one to create shm objects and call shmdt(), leaving shm object associated with no process, thus consuming memory not counted via rlimits. With shm_rmid_forced=1 the shared memory object is counted at least for one process, so OOM killer may effectively kill the fat process holding the shared memory. It obviously breaks POSIX - some programs relying on the feature would stop working. So set shm_rmid_forced=1 only if you're sure nobody uses "orphaned" memory. Use shm_rmid_forced=0 by default for compatability reasons. The feature was previously impemented in -ow as a configure option. [akpm@linux-foundation.org: fix documentation, per Randy] [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: readability/conventionality tweaks] [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout] Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Solar Designer <solar@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik
Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26mm: don't access vm_flags as 'int'KOSAKI Motohiro
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor 'unsigned int'. This patch fixes such misuse. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [ Changed to use a typedef - we'll extend it to cover more cases later, since there has been discussion about making it a 64-bit type.. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-23userns: user namespaces: convert several capable() callsSerge E. Hallyn
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(), because the resource comes from current's own ipc namespace. setuid/setgid are to uids in own namespace, so again checks can be against current_user_ns(). Changelog: Jan 11: Use task_ns_capable() in place of sched_capable(). Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Clarify (hopefully) some logic in futex and sched.c Feb 15: use ns_capable for ipc, not nsown_capable Feb 23: let copy_ipcs handle setting ipc_ns->user_ns Feb 23: pass ns down rather than taking it from current [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30ipc: shm: fix information leak to userlandVasiliy Kulikov
The shmid_ds structure is copied to userland with shm_unused{,2,3} fields unitialized. It leads to leaking of contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27ipc/shm.c: add RSS and swap size information to /proc/sysvipc/shmHelge Deller
The kernel currently provides no functionality to analyze the RSS and swap space usage of each individual sysvipc shared memory segment. This patch adds this info for each existing shm segment by extending the output of /proc/sysvipc/shm by two columns for RSS and swap. Since shmctl(SHM_INFO) already provides a similiar calculation (it currently sums up all RSS/swap info for all segments), I did split out a static function which is now used by the /proc/sysvipc/shm output and shmctl(SHM_INFO). SAP products (esp. the SAP Netweaver ABAP Kernel) uses lots of big shared memory segments (we often have Linux systems with >= 16GB shm usage). Sometimes we get customer reports about "slow" system responses and while looking into their configurations we often find massive swapping activity on the system. With this patch it's now easy to see from the command line if and which shm segments gets swapped out (and how much) and can more easily give recommendations for system tuning. Without the patch it's currently not possible to do such shm analysis at all. Also... Add some spaces in front of the "size" field for 64bit kernels to get the columns correct if you cat the contents of the file. In sysvipc_shm_proc_show() the kernel prints the size value in "SPEC_SIZE" format, which is defined like this: #if BITS_PER_LONG <= 32 #define SIZE_SPEC "%10lu" #else #define SIZE_SPEC "%21lu" #endif So, if the header is not adjusted, the columns are not correctly aligned. I actually tested this on 32- and 64-bit and it seems correct now. Signed-off-by: Helge Deller <deller@gmx.de> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-15llseek: automatically add .llseek fopArnd Bergmann
All file_operations should get a .llseek operation so we can make nonseekable_open the default for future file operations without a .llseek pointer. The three cases that we can automatically detect are no_llseek, seq_lseek and default_llseek. For cases where we can we can automatically prove that the file offset is always ignored, we use noop_llseek, which maintains the current behavior of not returning an error from a seek. New drivers should normally not use noop_llseek but instead use no_llseek and call nonseekable_open at open time. Existing drivers can be converted to do the same when the maintainer knows for certain that no user code relies on calling seek on the device file. The generated code is often incorrectly indented and right now contains comments that clarify for each added line why a specific variant was chosen. In the version that gets submitted upstream, the comments will be gone and I will manually fix the indentation, because there does not seem to be a way to do that using coccinelle. Some amount of new code is currently sitting in linux-next that should get the same modifications, which I will do at the end of the merge window. Many thanks to Julia Lawall for helping me learn to write a semantic patch that does all this. ===== begin semantic patch ===== // This adds an llseek= method to all file operations, // as a preparation for making no_llseek the default. // // The rules are // - use no_llseek explicitly if we do nonseekable_open // - use seq_lseek for sequential files // - use default_llseek if we know we access f_pos // - use noop_llseek if we know we don't access f_pos, // but we still want to allow users to call lseek // @ open1 exists @ identifier nested_open; @@ nested_open(...) { <+... nonseekable_open(...) ...+> } @ open exists@ identifier open_f; identifier i, f; identifier open1.nested_open; @@ int open_f(struct inode *i, struct file *f) { <+... ( nonseekable_open(...) | nested_open(...) ) ...+> } @ read disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ read_no_fpos disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { ... when != off } @ write @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ write_no_fpos @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { ... when != off } @ fops0 @ identifier fops; @@ struct file_operations fops = { ... }; @ has_llseek depends on fops0 @ identifier fops0.fops; identifier llseek_f; @@ struct file_operations fops = { ... .llseek = llseek_f, ... }; @ has_read depends on fops0 @ identifier fops0.fops; identifier read_f; @@ struct file_operations fops = { ... .read = read_f, ... }; @ has_write depends on fops0 @ identifier fops0.fops; identifier write_f; @@ struct file_operations fops = { ... .write = write_f, ... }; @ has_open depends on fops0 @ identifier fops0.fops; identifier open_f; @@ struct file_operations fops = { ... .open = open_f, ... }; // use no_llseek if we call nonseekable_open //////////////////////////////////////////// @ nonseekable1 depends on !has_llseek && has_open @ identifier fops0.fops; identifier nso ~= "nonseekable_open"; @@ struct file_operations fops = { ... .open = nso, ... +.llseek = no_llseek, /* nonseekable */ }; @ nonseekable2 depends on !has_llseek @ identifier fops0.fops; identifier open.open_f; @@ struct file_operations fops = { ... .open = open_f, ... +.llseek = no_llseek, /* open uses nonseekable */ }; // use seq_lseek for sequential files ///////////////////////////////////// @ seq depends on !has_llseek @ identifier fops0.fops; identifier sr ~= "seq_read"; @@ struct file_operations fops = { ... .read = sr, ... +.llseek = seq_lseek, /* we have seq_read */ }; // use default_llseek if there is a readdir /////////////////////////////////////////// @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier readdir_e; @@ // any other fop is used that changes pos struct file_operations fops = { ... .readdir = readdir_e, ... +.llseek = default_llseek, /* readdir is present */ }; // use default_llseek if at least one of read/write touches f_pos ///////////////////////////////////////////////////////////////// @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read.read_f; @@ // read fops use offset struct file_operations fops = { ... .read = read_f, ... +.llseek = default_llseek, /* read accesses f_pos */ }; @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, ... + .llseek = default_llseek, /* write accesses f_pos */ }; // Use noop_llseek if neither read nor write accesses f_pos /////////////////////////////////////////////////////////// @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; identifier write_no_fpos.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, .read = read_f, ... +.llseek = noop_llseek, /* read and write both use no f_pos */ }; @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write_no_fpos.write_f; @@ struct file_operations fops = { ... .write = write_f, ... +.llseek = noop_llseek, /* write uses no f_pos */ }; @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; @@ struct file_operations fops = { ... .read = read_f, ... +.llseek = noop_llseek, /* read uses no f_pos */ }; @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; @@ struct file_operations fops = { ... +.llseek = noop_llseek, /* no read or write fn */ }; ===== End semantic patch ===== Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Julia Lawall <julia@diku.dk> Cc: Christoph Hellwig <hch@infradead.org>
2010-05-27drop unused dentry argument to ->fsyncChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-12ipc: use rlimit helpersJiri Slaby
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in 3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16nommu: fix SYSV SHM for NOMMUDavid Howells
Commit c4caa778157dbbf04116f0ac2111e389b5cd7a29 ("file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()") broke SYSV SHM for NOMMU by taking away the pointer to shm_get_unmapped_area() from shm_file_operations. Put it back conditionally on CONFIG_MMU=n. file->f_ops->get_unmapped_area() is used to find out the base address for a mapping of a mappable chardev device or mappable memory-based file (such as a ramfs file). It needs to be called prior to file->f_ops->mmap() being called. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits) direct I/O fallback sync simplification ocfs: stop using do_sync_mapping_range cleanup blockdev_direct_IO locking make generic_acl slightly more generic sanitize xattr handler prototypes libfs: move EXPORT_SYMBOL for d_alloc_name vfs: force reval of target when following LAST_BIND symlinks (try #7) ima: limit imbalance msg Untangling ima mess, part 3: kill dead code in ima Untangling ima mess, part 2: deal with counters Untangling ima mess, part 1: alloc_file() O_TRUNC open shouldn't fail after file truncation ima: call ima_inode_free ima_inode_free IMA: clean up the IMA counts updating code ima: only insert at inode creation time ima: valid return code from ima_inode_alloc fs: move get_empty_filp() deffinition to internal.h Sanitize exec_permission_lite() Kill cached_lookup() and real_lookup() Kill path_lookup_open() ... Trivial conflicts in fs/direct-io.c
2009-12-16Untangling ima mess, part 1: alloc_file()Al Viro
There are 2 groups of alloc_file() callers: * ones that are followed by ima_counts_get * ones giving non-regular files So let's pull that ima_counts_get() into alloc_file(); it's a no-op in case of non-regular files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16switch alloc_file() to passing struct pathAl Viro
... and have the caller grab both mnt and dentry; kill leak in infiniband, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16ipc ns: fix memory leak (idr)Serge E. Hallyn
We have apparently had a memory leak since 7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in 2007. The idr of which 3 exist for each ipc namespace is never freed. This patch simply frees them when the ipcns is freed. I don't believe any idr_remove() are done from rcu (and could therefore be delayed until after this idr_destroy()), so the patch should be safe. Some quick testing showed no harm, and the memory leak fixed. Caught by kmemleak. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-11file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()Al Viro
... we should call mm ->get_unmapped_area() instead and let our caller do the final checks. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-27const: mark struct vm_struct_operationsAlexey Dobriyan
* mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs ↵Eric B Munson
internal mount This patchset adds a flag to mmap that allows the user to request that an anonymous mapping be backed with huge pages. This mapping will borrow functionality from the huge page shm code to create a file on the kernel internal mount and use it to approximate an anonymous mapping. The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without both flags being preset. A new flag is necessary because there is no other way to hook into huge pages without creating a file on a hugetlbfs mount which wouldn't be MAP_ANONYMOUS. To userspace, this mapping will behave just like an anonymous mapping because the file is not accessible outside of the kernel. This patchset is meant to simplify the programming model. Presently there is a large chunk of boiler platecode, contained in libhugetlbfs, required to create private, hugepage backed mappings. This patch set would allow use of hugepages without linking to libhugetlbfs or having hugetblfs mounted. Unification of the VM code would provide these same benefits, but it has been resisted each time that it has been suggested for several reasons: it would break PAGE_SIZE assumptions across the kernel, it makes page-table abstractions really expensive, and it does not provide any benefit on architectures that do not support huge pages, incurring fast path penalties without providing any benefit on these architectures. This patch: There are two means of creating mappings backed by huge pages: 1. mmap() a file created on hugetlbfs 2. Use shm which creates a file on an internal mount which essentially maps it MAP_SHARED The internal mount is only used for shared mappings but there is very little that stops it being used for private mappings. This patch extends hugetlbfs_file_setup() to deal with the creation of files that will be mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is used in a subsequent patch to implement the MAP_HUGETLB mmap() flag. Signed-off-by: Eric Munson <ebmunson@us.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-14fix undefined reference to user_shm_unlockHugh Dickins
My 353d5c30c666580347515da609dd74a2b8e9b828 "mm: fix hugetlb bug due to user_shm_unlock call" broke the CONFIG_SYSVIPC !CONFIG_MMU build of both 2.6.31 and 2.6.30.6: "undefined reference to `user_shm_unlock'". gcc didn't understand my comment! so couldn't figure out to optimize away user_shm_unlock() from the error path in the hugetlb-less case, as it does elsewhere. Help it to do so, in a language it understands. Reported-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-24mm: fix hugetlb bug due to user_shm_unlock callHugh Dickins
2.6.30's commit 8a0bdec194c21c8fdef840989d0d7b742bb5d4bc removed user_shm_lock() calls in hugetlb_file_setup() but left the user_shm_unlock call in shm_destroy(). In detail: Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock() is not called in hugetlb_file_setup(). However, user_shm_unlock() is called in any case in shm_destroy() and in the following atomic_dec_and_lock(&up->__count) in free_uid() is executed and if up->__count gets zero, also cleanup_user_struct() is scheduled. Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set. However, the ref counter up->__count gets unexpectedly non-positive and the corresponding structs are freed even though there are live references to them, resulting in a kernel oops after a lots of shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set. Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the time of shm_destroy() may give a different answer from at the time of hugetlb_file_setup(). And fixed newseg()'s no_id error path, which has missed user_shm_unlock() ever since it came in 2.6.9. Reported-by: Stefan Huber <shuber2@gmail.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Tested-by: Stefan Huber <shuber2@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-11Merge branch 'next' into for-linusJames Morris
2009-06-09shm: fix unused warnings on nommuMike Frysinger
The massive nommu update (8feae131) resulted in these warnings: ipc/shm.c: In function `sys_shmdt': ipc/shm.c:974: warning: unused variable `size' ipc/shm.c:972: warning: unused variable `next' Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-22integrity: move ima_counts_getMimi Zohar
Based on discussion on lkml (Andrew Morton and Eric Paris), move ima_counts_get down a layer into shmem/hugetlb__file_setup(). Resolves drm shmem_file_setup() usage case as well. HD comment: I still think you're doing this at the wrong level, but recognize that you probably won't be persuaded until a few more users of alloc_file() emerge, all wanting your ima_counts_get(). Resolving GEM's shmem_file_setup() is an improvement, so I'll say Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22integrity: path_check updateMimi Zohar
- Add support in ima_path_check() for integrity checking without incrementing the counts. (Required for nfsd.) - rename and export opencount_get to ima_counts_get - replace ima_shm_check calls with ima_counts_get - export ima_path_check Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-04-02ipc: make shm_get_stat() more robustTony Battersby
shm_get_stat() assumes idr_find(&shm_ids(ns).ipcs_idr) returns "struct shmid_kernel *"; all other callers assume that it returns "struct kern_ipc_perm *". This works because "struct kern_ipc_perm" is currently the first member of "struct shmid_kernel", but it would be better to use container_of() to prevent future breakage. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-24Merge branch 'master' into nextJames Morris
2009-02-10Do not account for the address space used by hugetlbfs using VM_ACCOUNTMel Gorman
When overcommit is disabled, the core VM accounts for pages used by anonymous shared, private mappings and special mappings. It keeps track of VMAs that should be accounted for with VM_ACCOUNT and VMAs that never had a reserve with VM_NORESERVE. Overcommit for hugetlbfs is much riskier than overcommit for base pages due to contiguity requirements. It avoids overcommiting on both shared and private mappings using reservation counters that are checked and updated during mmap(). This ensures (within limits) that hugepages exist in the future when faults occurs or it is too easy to applications to be SIGKILLed. As hugetlbfs makes its own reservations of a different unit to the base page size, VM_ACCOUNT should never be set. Even if the units were correct, we would double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may be set because an application can request no reserves be made for hugetlbfs at the risk of getting killed later. With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This breaks the accounting for both the core VM and hugetlbfs, can trigger an OOM storm when hugepage pools are too small lockups and corrupted counters otherwise are used. This patch brings hugetlbfs more in line with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06Merge branch 'master' into nextJames Morris
Conflicts: fs/namei.c Manually merged per: diff --cc fs/namei.c index 734f2b5,bbc15c2..0000000 --- a/fs/namei.c +++ b/fs/namei.c @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char nd->flags |= LOOKUP_CONTINUE; err = exec_permission_lite(inode); if (err == -EAGAIN) - err = vfs_permission(nd, MAY_EXEC); + err = inode_permission(nd->path.dentry->d_inode, + MAY_EXEC); + if (!err) + err = ima_path_check(&nd->path, MAY_EXEC); if (err) break; @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc flag &= ~O_TRUNC; } - error = vfs_permission(nd, acc_mode); + error = inode_permission(inode, acc_mode); if (error) return error; + - error = ima_path_check(&nd->path, ++ error = ima_path_check(path, + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC)); + if (error) + return error; /* * An append-only file must be opened in append mode for writing. */ Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06Integrity: IMA file free imbalanceMimi Zohar
The number of calls to ima_path_check()/ima_file_free() should be balanced. An extra call to fput(), indicates the file could have been accessed without first being measured. Although f_count is incremented/decremented in places other than fget/fput, like fget_light/fput_light and get_file, the current task must already hold a file refcnt. The call to __fput() is delayed until the refcnt becomes 0, resulting in ima_file_free() flagging any changes. - add hook to increment opencount for IPC shared memory(SYSV), shmat files, and /dev/zero - moved NULL iint test in opencount_get() Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-02-05shm: fix shmctl(SHM_INFO) lockup with !CONFIG_SHMEMTony Battersby
shm_get_stat() assumes that the inode is a "struct shmem_inode_info", which is incorrect for !CONFIG_SHMEM (see fs/ramfs/inode.c: ramfs_get_inode() vs. mm/shmem.c: shmem_get_inode()). This bad assumption can cause shmctl(SHM_INFO) to lockup when shm_get_stat() tries to spin_lock(&info->lock). Users of !CONFIG_SHMEM may encounter this lockup simply by invoking the 'ipcs' command. Reported by Jiri Olsa back in February 2008: http://lkml.org/lkml/2008/2/29/74 Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Cc: Jiri Kosina <jkosina@suse.cz> Reported-by: Jiri Olsa <olsajiri@gmail.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> [2.6.everything] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-31Stop playing silly games with the VM_ACCOUNT flagLinus Torvalds
The mmap_region() code would temporarily set the VM_ACCOUNT flag for anonymous shared mappings just to inform shmem_zero_setup() that it should enable accounting for the resulting shm object. It would then clear the flag after calling ->mmap (for the /dev/zero case) or doing shmem_zero_setup() (for the MAP_ANON case). This just resulted in vma merge issues, but also made for just unnecessary confusion. Use the already-existing VM_NORESERVE flag for this instead, and let shmem_{zero|file}_setup() just figure it out from that. This also happens to make it obvious that the new DRI2 GEM layer uses a non-reserving backing store for its object allocation - which is quite possibly not intentional. But since I didn't want to change semantics in this patch, I left it alone, and just updated the caller to use the new flag semantics. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>