aboutsummaryrefslogtreecommitdiff
path: root/ipc
AgeCommit message (Collapse)Author
2013-09-24ipc: fix race with LSMsDavidlohr Bueso
Currently, IPC mechanisms do security and auditing related checks under RCU. However, since security modules can free the security structure, for example, through selinux_[sem,msg_queue,shm]_free_security(), we can race if the structure is freed before other tasks are done with it, creating a use-after-free condition. Manfred illustrates this nicely, for instance with shared mem and selinux: -> do_shmat calls rcu_read_lock() -> do_shmat calls shm_object_check(). Checks that the object is still valid - but doesn't acquire any locks. Then it returns. -> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat) -> selinux_shm_shmat calls ipc_has_perm() -> ipc_has_perm accesses ipc_perms->security shm_close() -> shm_close acquires rw_mutex & shm_lock -> shm_close calls shm_destroy -> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security) -> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm) -> ipc_free_security calls kfree(ipc_perms->security) This patch delays the freeing of the security structures after all RCU readers are done. Furthermore it aligns the security life cycle with that of the rest of IPC - freeing them based on the reference counter. For situations where we need not free security, the current behavior is kept. Linus states: "... the old behavior was suspect for another reason too: having the security blob go away from under a user sounds like it could cause various other problems anyway, so I think the old code was at least _prone_ to bugs even if it didn't have catastrophic behavior." I have tested this patch with IPC testcases from LTP on both my quad-core laptop and on a 64 core NUMA server. In both cases selinux is enabled, and tests pass for both voluntary and forced preemption models. While the mentioned races are theoretical (at least no one as reported them), I wanted to make sure that this new logic doesn't break anything we weren't aware of. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: drop ipc_lock_checkDavidlohr Bueso
No remaining users, we now use ipc_obtain_object_check(). Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc, shm: drop shm_lock_checkDavidlohr Bueso
This function was replaced by a the lockless shm_obtain_object_check(), and no longer has any users. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: drop ipc_lock_by_ptrDavidlohr Bueso
After previous cleanups and optimizations, this function is no longer heavily used and we don't have a good reason to keep it. Update the few remaining callers and get rid of it. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc, shm: guard against non-existant vma in shmdt(2)Davidlohr Bueso
When !CONFIG_MMU there's a chance we can derefence a NULL pointer when the VM area isn't found - check the return value of find_vma(). Also, remove the redundant -EINVAL return: retval is set to the proper return code and *only* changed to 0, when we actually unmap the segments. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: document general ipc locking schemeDavidlohr Bueso
As suggested by Andrew, add a generic initial locking scheme used throughout all sysv ipc mechanisms. Documenting the ids rwsem, how rcu can be enough to do the initial checks and when to actually acquire the kern_ipc_perm.lock spinlock. I found that adding it to util.c was generic enough. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,msg: drop msg_unlockDavidlohr Bueso
There is only one user left, drop this function and just call ipc_unlock_object() and rcu_read_unlock(). Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: rename ids->rw_mutexDavidlohr Bueso
Since in some situations the lock can be shared for readers, we shouldn't be calling it a mutex, rename it to rwsem. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: shorten critical region for shmatDavidlohr Bueso
Similar to other system calls, acquire the kern_ipc_perm lock after doing the initial permission and security checks. [sasha.levin@oracle.com: dont leave do_shmat with rcu lock held] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: cleanup do_shmat pastaDavidlohr Bueso
Clean up some of the messy do_shmat() spaghetti code, getting rid of out_free and out_put_dentry labels. This makes shortening the critical region of this function in the next patch a little easier to do and read. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: shorten critical region for shmctlDavidlohr Bueso
With the *_INFO, *_STAT, IPC_RMID and IPC_SET commands already optimized, deal with the remaining SHM_LOCK and SHM_UNLOCK commands. Take the shm_perm lock after doing the initial auditing and security checks. The rest of the logic remains unchanged. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: make shmctl_nolock locklessDavidlohr Bueso
While the INFO cmd doesn't take the ipc lock, the STAT commands do acquire it unnecessarily. We can do the permissions and security checks only holding the rcu lock. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: introduce shmctl_nolockDavidlohr Bueso
Similar to semctl and msgctl, when calling msgctl, the *_INFO and *_STAT commands can be performed without acquiring the ipc object. Add a shmctl_nolock() function and move the logic of *_INFO and *_STAT out of msgctl(). Since we are just moving functionality, this change still takes the lock and it will be properly lockless in the next patch. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: drop ipcctl_pre_downDavidlohr Bueso
Now that sem, msgque and shm, through *_down(), all use the lockless variant of ipcctl_pre_down(), go ahead and delete it. [akpm@linux-foundation.org: fix function name in kerneldoc, cleanups] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: shorten critical region in shmctl_downDavidlohr Bueso
Instead of holding the ipc lock for the entire function, use the ipcctl_pre_down_nolock and only acquire the lock for specific commands: RMID and SET. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc,shm: introduce lockless functions to obtain the ipc objectDavidlohr Bueso
This is the third and final patchset that deals with reducing the amount of contention we impose on the ipc lock (kern_ipc_perm.lock). These changes mostly deal with shared memory, previous work has already been done for semaphores and message queues: http://lkml.org/lkml/2013/3/20/546 (sems) http://lkml.org/lkml/2013/5/15/584 (mqueues) With these patches applied, a custom shm microbenchmark stressing shmctl doing IPC_STAT with 4 threads a million times, reduces the execution time by 50%. A similar run, this time with IPC_SET, reduces the execution time from 3 mins and 35 secs to 27 seconds. Patches 1-8: replaces blindly taking the ipc lock for a smarter combination of rcu and ipc_obtain_object, only acquiring the spinlock when updating. Patch 9: renames the ids rw_mutex to rwsem, which is what it already was. Patch 10: is a trivial mqueue leftover cleanup Patch 11: adds a brief lock scheme description, requested by Andrew. This patch: Add shm_obtain_object() and shm_obtain_object_check(), which will allow us to get the ipc object without acquiring the lock. Just as with other forms of ipc, these functions are basically wrappers around ipc_obtain_object*(). Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users
2013-09-03ipc/msg.c: Fix lost wakeup in msgsnd().Manfred Spraul
The check if the queue is full and adding current to the wait queue of pending msgsnd() operations (ss_add()) must be atomic. Otherwise: - the thread that performs msgsnd() finds a full queue and decides to sleep. - the thread that performs msgrcv() first reads all messages from the queue and then sleeps, because the queue is empty. - the msgrcv() calls do not perform any wakeups, because the msgsnd() task has not yet called ss_add(). - then the msgsnd()-thread first calls ss_add() and then sleeps. Net result: msgsnd() and msgrcv() both sleep forever. Observed with msgctl08 from ltp with a preemptible kernel. Fix: Call ipc_lock_object() before performing the check. The patch also moves security_msg_queue_msgsnd() under ipc_lock_object: - msgctl(IPC_SET) explicitely mentions that it tries to expunge any pending operations that are not allowed anymore with the new permissions. If security_msg_queue_msgsnd() is called without locks, then there might be races. - it makes the patch much simpler. Reported-and-tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: stable@vger.kernel.org # for 3.11 Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-30userns: Kill nsown_capable it makes the wrong thing easyEric W. Biederman
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and CAP_SETGID. For the existing users it doesn't noticably simplify things and from the suggested patches I have seen it encourages people to do the wrong thing. So remove nsown_capable. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-28IPC: bugfix for msgrcv with msgtyp < 0Svenning Sørensen
According to 'man msgrcv': "If msgtyp is less than 0, the first message of the lowest type that is less than or equal to the absolute value of msgtyp shall be received." Bug: The kernel only returns a message if its type is 1; other messages with type < abs(msgtype) will never get returned. Fix: After having traversed the list to find the first message with the lowest type, we need to actually return that message. This regression was introduced by commit daaf74cf0867 ("ipc: refactor msg list search into separate function") Signed-off-by: Svenning Soerensen <sss@secomea.dk> Reviewed-by: Peter Hurley <peter@hurleysoftware.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: rename try_atomic_semop() to perform_atomic_semop(), docu updateManfred Spraul
Cleanup: Some minor points that I noticed while writing the previous patches 1) The name try_atomic_semop() is misleading: The function performs the operation (if it is possible). 2) Some documentation updates. No real code change, a rename and documentation changes. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: replace shared sem_otime with per-semaphore valueManfred Spraul
sem_otime contains the time of the last semaphore operation that completed successfully. Every operation updates this value, thus access from multiple cpus can cause thrashing. Therefore the patch replaces the variable with a per-semaphore variable. The per-array sem_otime is only calculated when required. No performance improvement on a single-socket i3 - only important for larger systems. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: always use only one queue for alter operationsManfred Spraul
There are two places that can contain alter operations: - the global queue: sma->pending_alter - the per-semaphore queues: sma->sem_base[].pending_alter. Since one of the queues must be processed first, this causes an odd priorization of the wakeups: complex operations have priority over simple ops. The patch restores the behavior of linux <=3.0.9: The longest waiting operation has the highest priority. This is done by using only one queue: - if there are complex ops, then sma->pending_alter is used. - otherwise, the per-semaphore queues are used. As a side effect, do_smart_update_queue() becomes much simpler: no more goto logic. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem: separate wait-for-zero and alter tasks into seperate queuesManfred Spraul
Introduce separate queues for operations that do not modify the semaphore values. Advantages: - Simpler logic in check_restart(). - Faster update_queue(): Right now, all wait-for-zero operations are always tested, even if the semaphore value is not 0. - wait-for-zero gets again priority, as in linux <=3.0.9 Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: cacheline align the semaphore structuresManfred Spraul
As now each semaphore has its own spinlock and parallel operations are possible, give each semaphore its own cacheline. On a i3 laptop, this gives up to 28% better performance: #semscale 10 | grep "interleave 2" - before: Cpus 1, interleave 2 delay 0: 36109234 in 10 secs Cpus 2, interleave 2 delay 0: 55276317 in 10 secs Cpus 3, interleave 2 delay 0: 62411025 in 10 secs Cpus 4, interleave 2 delay 0: 81963928 in 10 secs -after: Cpus 1, interleave 2 delay 0: 35527306 in 10 secs Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28% Cpus 3, interleave 2 delay 0: 80518538 in 10 secs Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7% i3, with 2 cores and with hyperthreading enabled. Interleave 2 in order use first the full cores. HT partially hides the delay from cacheline trashing, thus the improvement is "only" 8.7% if 4 threads are running. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/util.c, ipc_rcu_alloc: cacheline align allocationManfred Spraul
Enforce that ipc_rcu_alloc returns a cacheline aligned pointer on SMP. Rationale: The SysV sem code tries to move the main spinlock into a seperate cacheline (____cacheline_aligned_in_smp). This works only if ipc_rcu_alloc returns cacheline aligned pointers. vmalloc and kmalloc return cacheline algined pointers, the implementation of ipc_rcu_alloc breaks that. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: remove unused functionsDavidlohr Bueso
We can now drop the msg_lock and msg_lock_check functions along with a bogus comment introduced previously in semctl_down. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: shorten critical region in msgrcvDavidlohr Bueso
do_msgrcv() is the last msg queue function that abuses the ipc lock Take it only when needed when actually updating msq. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: shorten critical region in msgsndDavidlohr Bueso
do_msgsnd() is another function that does too many things with the ipc object lock acquired. Take it only when needed when actually updating msq. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: make msgctl_nolock locklessDavidlohr Bueso
While the INFO cmd doesn't take the ipc lock, the STAT commands do acquire it unnecessarily. We can do the permissions and security checks only holding the rcu lock. This function now mimics semctl_nolock(). Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: introduce lockless functions to obtain the ipc objectDavidlohr Bueso
Add msq_obtain_object() and msq_obtain_object_check(), which will allow us to get the ipc object without acquiring the lock. Just as with semaphores, these functions are basically wrappers around ipc_obtain_object*(). Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: introduce msgctl_nolockDavidlohr Bueso
Similar to semctl, when calling msgctl, the *_INFO and *_STAT commands can be performed without acquiring the ipc object. Add a msgctl_nolock() function and move the logic of *_INFO and *_STAT out of msgctl(). This change still takes the lock and it will be properly lockless in the next patch Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc,msg: shorten critical region in msgctl_downDavidlohr Bueso
Instead of holding the ipc lock for the entire function, use the ipcctl_pre_down_nolock and only acquire the lock for specific commands: RMID and SET. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: move locking out of ipcctl_pre_down_nolockDavidlohr Bueso
This function currently acquires both the rw_mutex and the rcu lock on successful lookups, leaving the callers to explicitly unlock them, creating another two level locking situation. Make the callers (including those that still use ipcctl_pre_down()) explicitly lock and unlock the rwsem and rcu lock. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: close open coded spin lock callsDavidlohr Bueso
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: introduce ipc object locking helpersDavidlohr Bueso
Simple helpers around the (kern_ipc_perm *)->lock spinlock. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: move rcu lock out of ipc_addidDavidlohr Bueso
This patchset continues the work that began in the sysv ipc semaphore scaling series, see https://lkml.org/lkml/2013/3/20/546 Just like semaphores used to be, sysv shared memory and msg queues also abuse the ipc lock, unnecessarily holding it for operations such as permission and security checks. This patchset mostly deals with mqueues, and while shared mem can be done in a very similar way, I want to get these patches out in the open first. It also does some pending cleanups, mostly focused on the two level locking we have in ipc code, taking care of ipc_addid() and ipcctl_pre_down_nolock() - yes there are still functions that need to be updated as well. This patch: Make all callers explicitly take and release the RCU read lock. This addresses the two level locking seen in newary(), newseg() and newqueue(). For the last two, explicitly unlock the ipc object and the rcu lock, instead of calling the custom shm_unlock and msg_unlock functions. The next patch will deal with the open coded locking for ->perm.lock Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/shmc.c: eliminate ugly 80-col tricksAndrew Morton
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent ↵Jeff Layton
audit_names record The old audit PATH records for mq_open looked like this: type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777 dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732 dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023 ...with the audit related changes that went into 3.7, they now look like this: type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655 dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926 dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmpfs_t:s15:c0.c1023 type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq" Both of these look wrong to me. As Steve Grubb pointed out: "What we need is 1 PATH record that identifies the MQ. The other PATH records probably should not be there." Fix it to record the mq root as a parent, and flag it such that it should be hidden from view when the names are logged, since the root of the mq filesystem isn't terribly interesting. With this change, we get a single PATH record that looks more like this: type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914 dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmpfs_t:s0 In order to do this, a new audit_inode_parent_hidden() function is added. If we do it this way, then we avoid having the existing callers of audit_inode needing to do any sort of flag conversion if auditing is inactive. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reported-by: Jiri Jaburek <jjaburek@redhat.com> Cc: Steve Grubb <sgrubb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-26ipc/sem.c: Fix missing wakeups in do_smart_update_queue()Manfred Spraul
do_smart_update_queue() is called when an operation (semop, semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check which of the sleeping tasks can proceed. do_smart_update_queue() missed a few wakeups: - if a sleeping complex op was completed, then all per-semaphore queues must be scanned - not only those that were modified by *sops - if a sleeping simple op proceeded, then the global queue must be scanned again And: - the test for "|sops == NULL) before scanning the global queue is not required: If the global queue is empty, then it doesn't need to be scanned - regardless of the reason for calling do_smart_update_queue() The patch is not optimized, i.e. even completing a wait-for-zero operation causes a rescan. This is done to keep the patch as simple as possible. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-09shm: fix null pointer deref when userspace specifies invalid hugepage sizeLi Zefan
Dave reported an oops triggered by trinity: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: newseg+0x10d/0x390 PGD cf8c1067 PUD cf8c2067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67 ... Call Trace: ipcget+0x182/0x380 SyS_shmget+0x5a/0x60 tracesys+0xdd/0xe2 This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap failure in unaligned size request"). Reported-by: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Li Zefan <lizfan@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-09ipc,sem: fix semctl(..., GETNCNT)Rik van Riel
The semctl GETNCNT returns the number of semops waiting for the specified semaphore to become nonzero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-09ipc,sem: fix semctl(..., GETZCNT)Rik van Riel
The semctl GETZCNT returns the number of semops waiting for the specified semaphore to become zero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. This bug broke dbench; it works again with this patch applied. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Kent Overstreet <koverstreet@google.com> Tested-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07hugetlbfs: fix mmap failure in unaligned size requestNaoya Horiguchi
The current kernel returns -EINVAL unless a given mmap length is "almost" hugepage aligned. This is because in sys_mmap_pgoff() the given length is passed to vm_mmap_pgoff() as it is without being aligned with hugepage boundary. This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix alignment of huge page requests"), where alignment code is pushed into hugetlb_file_setup() and the variable len in caller side is not changed. To fix this, this patch partially reverts that commit, and adds alignment code in caller side. And it also introduces hstate_sizelog() in order to get proper hstate to specified hugepage size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881 [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: <iceman_dvd@yahoo.com> Cc: Steven Truelove <steven.truelove@utoronto.ca> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: simplify rcu_read_lock() in semctl_nolock()Linus Torvalds
This trivially combines two rcu_read_lock() calls in both sides of a if-statement into one single one in front of the if-statement. Split out as an independent cleanup from the previous commit. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: simplify semtimedop/semctl_main() common error path handlingLinus Torvalds
With various straight RCU lock/unlock movements, one common exit path pattern had become rcu_read_unlock(); goto out_wakeup; and in fact there were no cases where we wanted to exit to out_wakeup _without_ releasing the RCU read lock. So replace that pattern with "goto out_rcu_wakeup", and remove the old out_wakeup. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move sem_obtain_lock() rcu locking into the only callerLinus Torvalds
sem_obtain_lock() was another of those functions that returned with the RCU lock held for reading in the success case. Move the RCU locking to the caller (semtimedop()), making it more obvious. We already did RCU locking elsewhere in that function. Side note: why does semtimedop() re-do the semphore lookup after the sleep, rather than just getting a reference to the semaphore it already looked up originally? Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: fix double sem unlock in semctl error pathLinus Torvalds
Fix another ipc locking buglet introduced by the scalability patches: when semctl_down() was changed to delay the semaphore locking, one error path for security_sem_semctl() went through the semaphore unlock logic even though the semaphore had never been locked. Introduced by commit 16df3674efe3 ("ipc,sem: do not hold ipc lock more than necessary") Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move the rcu_read_lock() from sem_lock_and_putref() into callersLinus Torvalds
This is another ipc semaphore locking cleanup, trying to make the locking more straightforward. We move the rcu read locking into the callers of sem_lock_and_putref(), which in general means that we now mostly do the rcu_read_lock() and rcu_read_unlock() in the same function. Mostly. We still have the ipc_addid/newary/freeary mess, and things like ipcctl_pre_down_nolock(). Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: sem_putref() does not need the semaphore lock any moreLinus Torvalds
ipc_rcu_putref() uses atomics for the refcount, and the games to lock and unlock the semaphore just to try to keep the reference counting working are no longer useful. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>