aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-09-11mm: mempolicy: turn vma_set_policy() into vma_dup_policy()Oleg Nesterov
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checksOleg Nesterov
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID. Then later copy_process() denies CLONE_SIGHAND if the new process will be in a different pid namespace (task_active_pid_ns() doesn't match current->nsproxy->pid_ns). This looks confusing and inconsistent. CLONE_NEWPID is very similar to the case when ->pid_ns was already unshared, we want the same restrictions so copy_process() should also nack CLONE_PARENT. And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well just for consistency. Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change copy_process() to do the same check along with ->pid_ns check we already have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11pidns: kill the unnecessary CLONE_NEWPID in copy_process()Oleg Nesterov
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does the same check. Remove the CLONE_NEWPID check to cleanup the code and prepare for the next change. Test-case: static int child(void *arg) { return 0; } static char stack[16 * 1024]; int main(void) { pid_t pid; assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); pid = clone(child, stack + sizeof(stack) / 2, CLONE_NEWPID | SIGCHLD, NULL); assert(pid < 0 && errno == EINVAL); return 0; } clone(CLONE_NEWPID) correctly fails with or without this change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11pidns: fix vfork() after unshare(CLONE_NEWPID)Oleg Nesterov
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared pid_ns, this obviously breaks vfork: int main(void) { assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); assert(vfork() >= 0); _exit(0); return 0; } fails without this patch. Change this check to use CLONE_SIGHAND instead. This also forbids CLONE_THREAD automatically, and this is what the comment implies. We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it would be safer to not do this. The current check denies CLONE_SIGHAND implicitely and there is no reason to change this. Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better. Having shared signal handling between two different pid namespaces is the case that we are fundamentally guarding against." Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Colin Walters <walters@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11perf: Fix up MMAP2 buffer space reservationArnaldo Carvalho de Melo
The ino_generation field was added in the PERF_RECORD_MMAP2 record in the 13d7a24 cset but no space for it was allocated, corrupting the PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it. Detected with one of the regression tests done by 'perf test': [root@sandy ~]# perf test -v 7 7: Validate PERF_RECORD_* events & perf_sample fields : --- start --- 61315294449606 0 PERF_RECORD_SAMPLE 61315294453161 0 PERF_RECORD_SAMPLE 61315294454441 0 PERF_RECORD_SAMPLE 61315294455709 0 PERF_RECORD_SAMPLE 61315295600899 0 PERF_RECORD_COMM: sleep:6500 27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500 MMAP2 with unexpected cpu, expected 0, got 342521613 MMAP2 with unexpected pid, expected 6500, got 1701606191 MMAP2 with unexpected tid, expected 6500, got 28773 27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so MMAP2 with unexpected cpu, expected 0, got 342561333 MMAP2 with unexpected pid, expected 6500, got 1932408369 MMAP2 with unexpected tid, expected 6500, got 111 27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso] MMAP2 with unexpected cpu, expected 0, got 342600095 MMAP2 with unexpected pid, expected 6500, got 1935963739 MMAP2 with unexpected tid, expected 6500, got 23919 27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so MMAP2 with unexpected cpu, expected 0, got 342882834 MMAP2 with unexpected pid, expected 6500, got 909192754 MMAP2 with unexpected tid, expected 6500, got 7303982 61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500) ---- end ---- Validate PERF_RECORD_* events & perf_sample fields: FAILED! [root@sandy ~]# After this patch: [root@sandy ~]# perf test 7 7: Validate PERF_RECORD_* events & perf_sample fields : Ok [root@sandy ~]# Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Stephane Eranian <eranian@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-09-10fs: bump inode and dentry counters to longGlauber Costa
This series reworks our current object cache shrinking infrastructure in two main ways: * Noticing that a lot of users copy and paste their own version of LRU lists for objects, we put some effort in providing a generic version. It is modeled after the filesystem users: dentries, inodes, and xfs (for various tasks), but we expect that other users could benefit in the near future with little or no modification. Let us know if you have any issues. * The underlying list_lru being proposed automatically and transparently keeps the elements in per-node lists, and is able to manipulate the node lists individually. Given this infrastructure, we are able to modify the up-to-now hammer called shrink_slab to proceed with node-reclaim instead of always searching memory from all over like it has been doing. Per-node lru lists are also expected to lead to less contention in the lru locks on multi-node scans, since we are now no longer fighting for a global lock. The locks usually disappear from the profilers with this change. Although we have no official benchmarks for this version - be our guest to independently evaluate this - earlier versions of this series were performance tested (details at http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no visible performance regressions while yielding a better qualitative behavior in NUMA machines. With this infrastructure in place, we can use the list_lru entry point to provide memcg isolation and per-memcg targeted reclaim. Historically, those two pieces of work have been posted together. This version presents only the infrastructure work, deferring the memcg work for a later time, so we can focus on getting this part tested. You can see more about the history of such work at http://lwn.net/Articles/552769/ Dave Chinner (18): dcache: convert dentry_stat.nr_unused to per-cpu counters dentry: move to per-sb LRU locks dcache: remove dentries from LRU before putting on dispose list mm: new shrinker API shrinker: convert superblock shrinkers to new API list: add a new LRU list type inode: convert inode lru list to generic lru list code. dcache: convert to use new lru list infrastructure list_lru: per-node list infrastructure shrinker: add node awareness fs: convert inode and dentry shrinking to be node aware xfs: convert buftarg LRU to generic code xfs: rework buffer dispose list tracking xfs: convert dquot cache lru to list_lru fs: convert fs shrinkers to new scan/count API drivers: convert shrinkers to new count/scan API shrinker: convert remaining shrinkers to count/scan API shrinker: Kill old ->shrink API. Glauber Costa (7): fs: bump inode and dentry counters to long super: fix calculation of shrinkable objects for small numbers list_lru: per-node API vmscan: per-node deferred work i915: bail out earlier when shrinker cannot acquire mutex hugepage: convert huge zero page shrinker to new shrinker API list_lru: dynamically adjust node arrays This patch: There are situations in very large machines in which we can have a large quantity of dirty inodes, unused dentries, etc. This is particularly true when umounting a filesystem, where eventually since every live object will eventually be discarded. Dave Chinner reported a problem with this while experimenting with the shrinker revamp patchset. So we believe it is time for a change. This patch just moves int to longs. Machines where it matters should have a big long anyway. Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10Merge branch 'acpi-hotplug'Rafael J. Wysocki
* acpi-hotplug: PM / hibernate / memory hotplug: Rework mutual exclusion PM / hibernate: Create memory bitmaps after freezing user space ACPI / scan: Change ordering of locks for device hotplug
2013-09-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks
2013-09-10sched: Fix load balancing performance regression in should_we_balance()Joonsoo Kim
Commit 23f0d20 ("sched: Factor out code to should_we_balance()") introduces the should_we_balance() function. This function should return 1 if this cpu is appropriate for balancing. But the newly introduced code doesn't do so, it returns 0 instead of 1. This introduces performance regression, reported by Dave Chinner: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s You can find more detailed information at: https://lkml.org/lkml/2013/9/10/1 This patch corrects the return value of should_we_balance() function as orignally intended. With this patch, Dave Chinner reports that the regression is gone: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s 3.12-git-fix 249k files/s 248k files/s Reported-by: Dave Chinner <dchinner@redhat.com> Tested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Link: http://lkml.kernel.org/r/20130910065448.GA20368@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-09Merge tag 'trace-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Not much changes for the 3.12 merge window. The major tracing changes are still in flux, and will have to wait for 3.13. The changes for 3.12 are mostly clean ups and minor fixes. H Peter Anvin added a check to x86_32 static function tracing that helps a small segment of the kernel community. Oleg Nesterov had a few changes from 3.11, but were mostly clean ups and not worth pushing in the -rc time frame. Li Zefan had small clean up with annotating a raw_init with __init. I fixed a slight race in updating function callbacks, but the race is so small and the bug that happens when it occurs is so minor it's not even worth pushing to stable. The only real enhancement is from Alexander Z Lam that made the tracing_cpumask work for trace buffer instances, instead of them all sharing a global cpumask" * tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/rcu: Do not trace debug_lockdep_rcu_enabled() x86-32, ftrace: Fix static ftrace when early microcode is enabled ftrace: Fix a slight race in modifying what function callback gets traced tracing: Make tracing_cpumask available for all instances tracing: Kill the !CONFIG_MODULES code in trace_events.c tracing: Don't pass file_operations array to event_create_dir() tracing: Kill trace_create_file_ops() and friends tracing/syscalls: Annotate raw_init function with __init
2013-09-09Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs updates from Ben Myers: "For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage of shared code between libxfs and the kernel, the rest of the work to enable project and group quotas to be used simultaneously, performance optimisations in the log and the CIL, directory entry file type support, fixes for log space reservations, some spelling/grammar cleanups, and the addition of user namespace support. - introduce readahead to log recovery - add directory entry file type support - fix a number of spelling errors in comments - introduce new Q_XGETQSTATV quotactl for project quotas - add USER_NS support - log space reservation rework - CIL optimisations - kernel/userspace libxfs rework" * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits) xfs: XFS_MOUNT_QUOTA_ALL needed by userspace xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino Fix wrong flag ASSERT in xfs_attr_shortform_getvalue xfs: finish removing IOP_* macros. xfs: inode log reservations are too small xfs: check correct status variable for xfs_inobt_get_rec() call xfs: inode buffers may not be valid during recovery readahead xfs: check LSN ordering for v5 superblocks during recovery xfs: btree block LSN escaping to disk uninitialised XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 xfs: fix bad dquot buffer size in log recovery readahead xfs: don't account buffer cancellation during log recovery readahead xfs: check for underflow in xfs_iformat_fork() xfs: xfs_dir3_sfe_put_ino can be static xfs: introduce object readahead to log recovery xfs: Simplify xfs_ail_min() with list_first_entry_or_null() xfs: Register hotcpu notifier after initialization xfs: add xfs sb v4 support for dirent filetype field xfs: Add write support for dirent filetype field xfs: Add read-only support for dirent filetype field ...
2013-09-07Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.cAl Viro
kernel/cgroup.c is the only place in the tree that relies on eventfd.h pulling file.h; move that include there. Switch from eventfd_fget()/fput() to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail on non-eventfd descriptors just fine, no need to do that check twice... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users
2013-09-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
Pull crypto update from Herbert Xu: "Here is the crypto update for 3.12: - Added MODULE_SOFTDEP to allow pre-loading of modules. - Reinstated crct10dif driver using the module softdep feature. - Allow via rng driver to be auto-loaded. - Split large input data when necessary in nx. - Handle zero length messages correctly for GCM/XCBC in nx. - Handle SHA-2 chunks bigger than block size properly in nx. - Handle unaligned lengths in omap-aes. - Added SHA384/SHA512 to omap-sham. - Added OMAP5/AM43XX SHAM support. - Added OMAP4 TRNG support. - Misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (66 commits) Reinstate "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework" hwrng: via - Add MODULE_DEVICE_TABLE crypto: fcrypt - Fix bitoperation for compilation with clang crypto: nx - fix SHA-2 for chunks bigger than block size crypto: nx - fix GCM for zero length messages crypto: nx - fix XCBC for zero length messages crypto: nx - fix limits to sg lists for AES-CCM crypto: nx - fix limits to sg lists for AES-XCBC crypto: nx - fix limits to sg lists for AES-GCM crypto: nx - fix limits to sg lists for AES-CTR crypto: nx - fix limits to sg lists for AES-CBC crypto: nx - fix limits to sg lists for AES-ECB crypto: nx - add offset to nx_build_sg_lists() padata - Register hotcpu notifier after initialization padata - share code between CPU_ONLINE and CPU_DOWN_FAILED, same to CPU_DOWN_PREPARE and CPU_UP_CANCELED hwrng: omap - reorder OMAP TRNG driver code crypto: omap-sham - correct dma burst size crypto: omap-sham - Enable Polling mode if DMA fails crypto: tegra-aes - bitwise vs logical and crypto: sahara - checking the wrong variable ...
2013-09-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linuxHerbert Xu
Merge upstream tree in order to reinstate crct10dif.
2013-09-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...
2013-09-05Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cputime fix from Ingo Molnar: "This fixes a longer-standing cputime accounting bug that Stanislaw Gruszka finally managed to track down" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime: Do not scale when utime == 0
2013-09-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
2013-09-05ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()Steven Rostedt (Red Hat)
The function debug_lockdep_rcu_enabled() is part of the RCU lockdep debugging, and is called very frequently. I found that if I enable a lot of debugging and run the function graph tracer, this function can cause a live lock of the system. We don't usually trace lockdep infrastructure, no need to trace this either. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-09-04Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Gleb Natapov: "The highlights of the release are nested EPT and pv-ticketlocks support (hypervisor part, guest part, which is most of the code, goes through tip tree). Apart of that there are many fixes for all arches" Fix up semantic conflicts as discussed in the pull request thread.. * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits) ARM: KVM: Add newlines to panic strings ARM: KVM: Work around older compiler bug ARM: KVM: Simplify tracepoint text ARM: KVM: Fix kvm_set_pte assignment ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs ARM: KVM: vgic: fix GICD_ICFGRn access ARM: KVM: vgic: simplify vgic_get_target_reg KVM: MMU: remove unused parameter KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate() KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX KVM: x86: update masterclock when kvmclock_offset is calculated (v2) KVM: PPC: Book3S: Fix compile error in XICS emulation KVM: PPC: Book3S PR: return appropriate error when allocation fails arch: powerpc: kvm: add signed type cast for comparation KVM: x86: add comments where MMIO does not return to the emulator KVM: vmx: count exits to userspace during invalid guest emulation KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp kvm: optimize away THP checks in kvm_is_mmio_pfn() ...
2013-09-04Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Minor fixes mainly, including a potential use-after-free on remove found by CONFIG_DEBUG_KOBJECT_RELEASE which may be theoretical" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: Fix mod->mkobj.kobj potentially freed too early kernel/params.c: use scnprintf() instead of sprintf() kernel/module.c: use scnprintf() instead of sprintf() module/lsm: Have apparmor module parameters work with no args module: Add NOARG flag for ops with param_set_bool_enable_only() set function module: Add flag to allow mod params to have no arguments modules: add support for soft module dependencies scripts/mod/modpost.c: permit '.cranges' secton for sh64 architecture. module: fix sprintf format specifier in param_get_byte()
2013-09-04Merge branch 'x86-spinlocks-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 spinlock changes from Ingo Molnar: "The biggest change here are paravirtualized ticket spinlocks (PV spinlocks), which bring a nice speedup on various benchmarks. The KVM host side will come to you via the KVM tree" * 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static" kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor kvm guest: Add configuration support to enable debug information for KVM Guests kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi xen, pvticketlock: Allow interrupts to be enabled while blocking x86, ticketlock: Add slowpath logic jump_label: Split jumplabel ratelimit x86, pvticketlock: When paravirtualizing ticket locks, increment by 2 x86, pvticketlock: Use callee-save for lock_spinning xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks xen, pvticketlock: Xen implementation for PV ticket locks xen: Defer spinlock setup until boot CPU setup x86, ticketlock: Collapse a layer of functions x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks x86, spinlock: Replace pv spinlocks with pv ticketlocks
2013-09-04Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers/nohz changes from Ingo Molnar: "It mostly contains fixes and full dynticks off-case optimizations, by Frederic Weisbecker" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) nohz: Include local CPU in full dynticks global kick nohz: Optimize full dynticks's sched hooks with static keys nohz: Optimize full dynticks state checks with static keys nohz: Rename a few state variables vtime: Always debug check snapshot source _before_ updating it vtime: Always scale generic vtime accounting results vtime: Optimize full dynticks accounting off case with static keys vtime: Describe overriden functions in dedicated arch headers m68k: hardirq_count() only need preempt_mask.h hardirq: Split preempt count mask definitions context_tracking: Split low level state headers vtime: Fix racy cputime delta update vtime: Remove a few unneeded generic vtime state checks context_tracking: User/kernel broundary cross trace events context_tracking: Optimize context switch off case with static keys context_tracking: Optimize guest APIs off case with static key context_tracking: Optimize main APIs off case with static key context_tracking: Ground setup for static key use context_tracking: Remove full dynticks' hacky dependency on wide context tracking nohz: Only enable context tracking on full dynticks CPUs ...
2013-09-04Merge branch 'x86-asmlinkage-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/asmlinkage changes from Ingo Molnar: "As a preparation for Andi Kleen's LTO patchset (link time optimizations using GCC's -flto which build time optimization has steadily increased in quality over the past few years and might eventually be usable for the kernel too) this tree includes a handful of preparatory patches that make function calling convention annotations consistent again: - Mark every function without arguments (or 64bit only) that is used by assembly code with asmlinkage() - Mark every function with parameters or variables that is used by assembly code as __visible. For the vanilla kernel this has documentation, consistency and debuggability advantages, for the time being" * 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asmlinkage: Fix warning in xen asmlinkage change x86, asmlinkage, vdso: Mark vdso variables __visible x86, asmlinkage, power: Make various symbols used by the suspend asm code visible x86, asmlinkage: Make dump_stack visible x86, asmlinkage: Make 64bit checksum functions visible x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops x86, asmlinkage, apm: Make APM data structure used from assembler visible x86, asmlinkage: Make syscall tables visible x86, asmlinkage: Make several variables used from assembler/linker script visible x86, asmlinkage: Make kprobes code visible and fix assembler code x86, asmlinkage: Make various syscalls asmlinkage x86, asmlinkage: Make 32bit/64bit __switch_to visible x86, asmlinkage: Make _*_start_kernel visible x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible x86, asmlinkage: Change dotraplinkage into __visible on 32bit x86: Fix sys_call_table type in asm/syscall.h
2013-09-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Various optimizations, cleanups and smaller fixes - no major changes in scheduler behavior" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix the sd_parent_degenerate() code sched/fair: Rework and comment the group_imb code sched/fair: Optimize find_busiest_queue() sched/fair: Make group power more consistent sched/fair: Remove duplicate load_per_task computations sched/fair: Shrink sg_lb_stats and play memset games sched: Clean-up struct sd_lb_stat sched: Factor out code to should_we_balance() sched: Remove one division operation in find_busiest_queue() sched/cputime: Use this_cpu_add() in task_group_account_field() cpumask: Fix cpumask leak in partition_sched_domains() sched/x86: Optimize switch_mm() for multi-threaded workloads generic-ipi: Kill unnecessary variable - csd_flags numa: Mark __node_set() as __always_inline sched/fair: Cleanup: remove duplicate variable declaration sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
2013-09-04Merge branches 'perf-urgent-for-linus' and 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf changes from Ingo Molnar: "As a first remark I'd like to point out that the obsolete '-f' (--force) option, which has not done anything for several releases, has been removed from 'perf record' and related utilities. Everyone please update muscle memory accordingly! :-) Main changes on the perf kernel side: - Performance optimizations: . for trace events, by Steve Rostedt. . for time values, by Peter Zijlstra - New hardware support: . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan . for Intel SNB-EP uncore PMUs, by Zheng Yan - Enhanced hardware support: . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan - Core perf events code enhancements and fixes: . for full-nohz feature handling, by Frederic Weisbecker . for group events, by Jiri Olsa . for call chains, by Frederic Weisbecker . for event stream parsing, by Adrian Hunter - New ABI details: . Add attr->mmap2 attribute, by Stephane Eranian . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa . Export u64 time_zero on the mmap header page to allow TSC calculation, by Adrian Hunter . Add dummy software event, by Adrian Hunter. . Add a new PERF_SAMPLE_IDENTIFIER to make samples always parseable, by Adrian Hunter. . Make Power7 events available via sysfs, by Runzhen Wang. - Code cleanups and refactorings: . for nohz-full, by Frederic Weisbecker . for group events, by Jiri Olsa - Documentation updates: . for perf_event_type, by Peter Zijlstra Main changes on the perf tooling side (some of these tooling changes utilize the above kernel side changes): - Lots of 'perf trace' enhancements: . Make 'perf trace' command line arguments consistent with 'perf record', by David Ahern. . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo. . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo. . Support ! in -e expressions, to filter a list of syscalls, by Arnaldo Carvalho de Melo. . Arg formatting improvements to allow masking arguments in syscalls such as futex and open, where the some arguments are ignored and thus should not be printed depending on other args, by Arnaldo Carvalho de Melo. . Beautify futex open, openat, open_by_handle_at, lseek and futex syscalls, by Arnaldo Carvalho de Melo. . Add option to analyze events in a file versus live, so that one can do: [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1 [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ] [root@zoo ~]# perf trace -i perf.data -e futex --duration 1 17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua 113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967 133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496 [root@zoo ~]# By David Ahern. . Honor target pid / tid options when analyzing a file, by David Ahern. . Introduce better formatting of syscall arguments, including so far beautifiers for mmap, madvise, syscall return values, by Arnaldo Carvalho de Melo. . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern. - 'perf report/top' enhancements: . Do annotation using /proc/kcore and /proc/kallsyms when available, removing the forced need for a vmlinux file kernel assembly annotation. This also improves this use case because vmlinux has just the initial kernel image, not what is actually in use after various code patchings by things like alternatives. By Adrian Hunter. . Add --ignore-callees=<regex> option to collapse undesired parts of call graphs, by Greg Price. . Simplify symbol filtering by doing it at machine class level, by Adrian Hunter. . Add support for callchains in the gtk UI, by Namhyung Kim. . Add --objdump option to 'perf top', by Sukadev Bhattiprolu. - 'perf kvm' enhancements: . Add option to print only events that exceed a specified time duration, by David Ahern. . Improve stack trace printing, by David Ahern. . Update documentation of the live command, by David Ahern . Add perf kvm stat live mode that combines aspects of 'perf kvm stat' record and report, by David Ahern. . Add option to analyze specific VM in perf kvm stat report, by David Ahern. . Do not require /lib/modules/* on a guest, by Jason Wessel. - 'perf script' enhancements: . Fix symbol offset computation for some dsos, by David Ahern. . Fix named threads support, by David Ahern. . Don't install scripting files files when perl/python support is disabled, by Arnaldo Carvalho de Melo. - 'perf test' enhancements: . Add various improvements and fixes to the "vmlinux matches kallsyms" 'perf test' entry, related to the /proc/kcore annotation feature. By Adrian Hunter. . Add sample parsing test, by Adrian Hunter. . Add test for reading object code, by Adrian Hunter. . Add attr record group sampling test, by Jiri Olsa. . Misc testing infrastructure improvements and other details, by Jiri Olsa. - 'perf list' enhancements: . Skip unsupported hardware events, by Namhyung Kim. . List pmu events, by Andi Kleen. - 'perf diff' enhancements: . Add support for more than two files comparison, by Jiri Olsa. - 'perf sched' enhancements: . Various improvements, including removing reliance on some scheduler tracepoints that provide the same information as the PERF_RECORD_{FORK,EXIT} events. By David Ahern. . Remove odd build stall by moving a large struct initialization from a local variable to a global one, by Namhyung Kim. - 'perf stat' enhancements: . Add --initial-delay option to skip measuring for a defined startup phase, by Andi Kleen. - Generic perf tooling infrastructure/plumbing changes: . Tidy up sample parsing validation, by Adrian Hunter. . Fix up jobserver setup in libtraceevent Makefile. by Arnaldo Carvalho de Melo. . Debug improvements, by Adrian Hunter. . Fix correlation of samples coming after PERF_RECORD_EXIT event, by David Ahern. . Improve robustness of the topology parsing code, by Stephane Eranian. . Add group leader sampling, that allows just one event in a group to sample while the other events have just its values read, by Jiri Olsa. . Add support for a new modifier "D", which requests that the event, or group of events, be pinned to the PMU. By Michael Ellerman. . Support callchain sorting based on addresses, by Andi Kleen . Prep work for multi perf data file storage, by Jiri Olsa. . libtraceevent cleanups, by Namhyung Kim. And lots and lots of other fixes and code reorganizations that did not make it into the list, see the shortlog, diffstat and the Git log for details!" [ Also merge a leftover from the 3.11 cycle ] * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Prevent race in unthrottling code * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits) perf trace: Tell arg formatters the arg index perf trace: Add beautifier for open's flags arg perf trace: Add beautifier for lseek's whence arg perf tools: Fix symbol offset computation for some dsos perf list: Skip unsupported events perf tests: Add 'keep tracking' test perf tools: Add support for PERF_COUNT_SW_DUMMY perf: Add a dummy software event to keep tracking perf trace: Add beautifier for futex 'operation' parm perf trace: Allow syscall arg formatters to mask args perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node() perf: Export struct perf_branch_entry to userspace perf: Add attr->mmap2 attribute to an event perf/x86: Add Silvermont (22nm Atom) support perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X perf trace: Handle missing HUGEPAGE defines perf trace: Honor target pid / tid options when analyzing a file perf trace: Add option to analyze events in a file versus live perf evlist: Add tracepoint lookup by name perf tests: Add a sample parsing test ...
2013-09-04Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core/locking changes from Ingo Molnar: "Main changes: - another mutex optimization, from Davidlohr Bueso - improved lglock lockdep tracking, from Michel Lespinasse - [ assorted smaller updates, improvements, cleanups. ]" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: generic-ipi/locking: Fix misleading smp_call_function_any() description hung_task debugging: Print more info when reporting the problem mutex: Avoid label warning when !CONFIG_MUTEX_SPIN_ON_OWNER mutex: Do not unnecessarily deal with waiters mutex: Fix/document access-once assumption in mutex_can_spin_on_owner() lglock: Update lockdep annotations to report recursive local locks lockdep: Introduce lock_acquire_exclusive()/shared() helper macros
2013-09-04Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "Main RCU changes this cycle were: - Full-system idle detection. This is for use by Frederic Weisbecker's adaptive-ticks mechanism. Its purpose is to allow the timekeeping CPU to shut off its tick when all other CPUs are idle. - Miscellaneous fixes. - Improved rcutorture test coverage. - Updated RCU documentation" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU nohz_full: Add full-system-idle state machine jiffies: Avoid undefined behavior from signed overflow rcu: Simplify _rcu_barrier() processing rcu: Make rcutorture emit online failures if verbose rcu: Remove unused variable from rcu_torture_writer() rcu: Sort rcutorture module parameters rcu: Increase rcutorture test coverage rcu: Add duplicate-callback tests to rcutorture doc: Fix memory-barrier control-dependency example rcu: Update RTFP documentation nohz_full: Add full-system-idle arguments to API nohz_full: Add full-system idle states and variables nohz_full: Add per-CPU idle-state tracking nohz_full: Add rcu_dyntick data for scalable detection of all-idle state nohz_full: Add Kconfig parameter for scalable detection of all-idle state nohz_full: Add testing information to documentation rcu: Eliminate unused APIs intended for adaptive ticks rcu: Select IRQ_WORK from TREE_PREEMPT_RCU rculist: list_first_or_null_rcu() should use list_entry_rcu() ...
2013-09-04sched/cputime: Do not scale when utime == 0Stanislaw Gruszka
scale_stime() silently assumes that stime < rtime, otherwise when stime == rtime and both values are big enough (operations on them do not fit in 32 bits), the resulting scaling stime can be bigger than rtime. In consequence utime = rtime - stime results in negative value. User space visible symptoms of the bug are overflowed TIME values on ps/top, for example: $ ps aux | grep rcu root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0] root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0] root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] or overflowed utime values read directly from /proc/$PID/stat Reference: https://lkml.org/lkml/2013/8/20/259 Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-03switch copy_module_from_fd() to fdgetAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-03Merge branch 'for-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
2013-09-03Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Nothing interesting. All are doc / comment updates" * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Correct/Drop references to gcwq in Documentation workqueue: Fix manage_workers() RETURNS description workqueue: Comment correction in file header workqueue: mark WQ_NON_REENTRANT deprecated
2013-09-03ftrace: Fix a slight race in modifying what function callback gets tracedSteven Rostedt (Red Hat)
There's a slight race when going from a list function to a non list function. That is, when only one callback is registered to the function tracer, it gets called directly by the mcount trampoline. But if this function has filters, it may be called by the wrong functions. As the list ops callback that handles multiple callbacks that are registered to ftrace, it also handles what functions they call. While the transaction is taking place, use the list function always, and after all the updates are finished (only the functions that should be traced are being traced), then we can update the trampoline to call the function directly. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-09-03Merge tag 'pm+acpi-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: 1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction of Intel Thunderbolt support on systems that use ACPI for signalling Thunderbolt hotplug events. This also should make ACPIPHP work in some cases in which it was known to have problems. From Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov. 2) ACPI core code cleanups and dock station support cleanups from Jiang Liu and Rafael J Wysocki. 3) Fixes for locking problems related to ACPI device hotplug from Rafael J Wysocki. 4) ACPICA update to version 20130725 includig fixes, cleanups, support for more than 256 GPEs per GPE block and a change to make the ACPI PM Timer optional (we've seen systems without the PM Timer in the field already). One of the fixes, related to the DeRefOf operator, is necessary to prevent some Windows 8 oriented AML from causing problems to happen. From Bob Moore, Lv Zheng, and Jung-uk Kim. 5) Removal of the old and long deprecated /proc/acpi/event interface and related driver changes from Thomas Renninger. 6) ACPI and Xen changes to make the reduced hardware sleep work with the latter from Ben Guthro. 7) ACPI video driver cleanups and a blacklist of systems that should not tell the BIOS that they are compatible with Windows 8 (or ACPI backlight and possibly other things will not work on them). From Felipe Contreras. 8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo, Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen, Toshi Kani, and Wei Yongjun. 9) cpufreq ondemand governor target frequency selection change to reduce oscillations between min and max frequencies (essentially, it causes the governor to choose target frequencies proportional to load) from Stratos Karafotis. 10) cpufreq fixes allowing sysfs attributes file permissions to be preserved over suspend/resume cycles Srivatsa S Bhat. 11) Removal of Device Tree parsing for CPU device nodes from multiple cpufreq drivers that required some changes related to of_get_cpu_node() to be made in a few architectures and in the driver core. From Sudeep KarkadaNagesha. 12) cpufreq core fixes and cleanups related to mutual exclusion and driver module references from Viresh Kumar, Lukasz Majewski and Rafael J Wysocki. 13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap, Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo, Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd, Stratos Karafotis, and Viresh Kumar. 14) Fixes to prevent race conditions in coupled cpuidle from happening from Colin Cross. 15) cpuidle core fixes and cleanups from Daniel Lezcano and Tuukka Tikkanen. 16) Assorted cpuidle fixes and cleanups from Daniel Lezcano, Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij, and Sahara. 17) System sleep tracing changes from Todd E Brandt and Shuah Khan. 18) PNP subsystem conversion to using struct dev_pm_ops for power management from Shuah Khan. * tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (217 commits) cpufreq: Don't use smp_processor_id() in preemptible context cpuidle: coupled: fix race condition between pokes and safe state cpuidle: coupled: abort idle if pokes are pending cpuidle: coupled: disable interrupts after entering safe state ACPI / hotplug: Remove containers synchronously driver core / ACPI: Avoid device hot remove locking issues cpufreq: governor: Fix typos in comments cpufreq: governors: Remove duplicate check of target freq in supported range cpufreq: Fix timer/workqueue corruption due to double queueing ACPI / EC: Add ASUSTEK L4R to quirk list in order to validate ECDT ACPI / thermal: Add check of "_TZD" availability and evaluating result cpufreq: imx6q: Fix clock enable balance ACPI: blacklist win8 OSI for buggy laptops cpufreq: tegra: fix the wrong clock name cpuidle: Change struct menu_device field types cpuidle: Add a comment warning about possible overflow cpuidle: Fix variable domains in get_typical_interval() cpuidle: Fix menu_device->intervals type cpuidle: CodingStyle: Break up multiple assignments on single line cpuidle: Check called function parameter in get_typical_interval() ...
2013-09-03Merge tag 'tty-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver patches from Greg KH: "Here's the big tty/serial driver pull request for 3.12-rc1. Lots of n_tty reworks to resolve some very long-standing issues, removing the 3-4 different locks that were taken for every character. This code has been beaten on for a long time in linux-next with no reported regressions. Other than that, a range of serial and tty driver updates and revisions. Full details in the shortlog" * tag 'tty-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (226 commits) hvc_xen: Remove unnecessary __GFP_ZERO from kzalloc serial: imx: initialize the local variable tty: ar933x_uart: add device tree support and binding documentation tty: ar933x_uart: allow to build the driver as a module ARM: dts: msm: Update uartdm compatible strings devicetree: serial: Document msm_serial bindings serial: unify serial bindings into a single dir serial: fsl-imx-uart: Cleanup duplicate device tree binding tty: ar933x_uart: use config_enabled() macro to clean up ifdefs tty: ar933x_uart: remove superfluous assignment of ar933x_uart_driver.nr tty: ar933x_uart: use the clk API to get the uart clock tty: serial: cpm_uart: Adding proper request of GPIO used by cpm_uart driver serial: sirf: fix the amount of serial ports serial: sirf: define macro for some magic numbers of USP serial: icom: move array overflow checks earlier TTY: amiserial, remove unnecessary platform_set_drvdata() serial: st-asc: remove unnecessary platform_set_drvdata() msm_serial: Send more than 1 character on the console w/ UARTDM msm_serial: Add support for non-GSBI UARTDM devices msm_serial: Switch clock consumer strings and simplify code ...
2013-09-03Merge tag 'driver-core-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers" * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits) firmware loader: fix pending_fw_head list corruption drivers/base/memory.c: introduce help macro to_memory_block dynamic debug: line queries failing due to uninitialized local variable sysfs: sysfs_create_groups returns a value. debugfs: provide debugfs_create_x64() when disabled rbd: convert bus code to use bus_groups firmware: dcdbas: use binary attribute groups sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled driver core: add #include <linux/sysfs.h> to core files. HID: convert bus code to use dev_groups Input: serio: convert bus code to use drv_groups Input: gameport: convert bus code to use drv_groups driver core: firmware: use __ATTR_RW() driver core: core: use DEVICE_ATTR_RO driver core: bus: use DRIVER_ATTR_WO() driver core: create write-only attribute macros for devices and drivers sysfs: create __ATTR_WO() driver-core: platform: convert bus code to use dev_groups workqueue: convert bus code to use dev_groups MEI: convert bus code to use dev_groups ...
2013-09-03module: Fix mod->mkobj.kobj potentially freed too earlyLi Zhong
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(ffffffffa01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. This patch tries to fix it by waiting for the mod->mkobj.kobj to be really released in the module removing process (and some error code paths). [ 1844.175287] kobject: 'holders' (ffff88007c1f1600): kobject_release, parent ffffffffa01600d0 (delayed) [ 1844.178991] kobject: 'notes' (ffff8800370b2a00): kobject_release, parent ffffffffa01600d0 (delayed) [ 1845.180118] kobject: 'holders' (ffff88007c1f1600): kobject_cleanup, parent ffffffffa01600d0 [ 1845.182130] kobject: 'holders' (ffff88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at ffffffffa01601d0 [ 1845.185026] IP: [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: 0000 [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: ffff88007ca51f00 ti: ffff88007ca5c000 task.ti: ffff88007ca5c000 [ 1845.185026] RIP: 0010:[<ffffffff812cda81>] [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:ffff88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 0000000000002000 RBX: ffffffffa01600d0 RCX: ffffffff8177d638 [ 1845.185026] RDX: ffff88007ca5dc18 RSI: 0000000000000000 RDI: ffffffffa01600d0 [ 1845.185026] RBP: ffff88007ca5dd18 R08: ffffffff824e9810 R09: ffffffffffffffff [ 1845.185026] R10: ffff8800ffffffff R11: dead4ead00000001 R12: ffffffff81a95040 [ 1845.185026] R13: ffff88007b27a960 R14: ffff88007c1f1600 R15: 0000000000000000 [ 1845.185026] FS: 0000000000000000(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000 [ 1845.185026] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1845.185026] CR2: ffffffffa01601d0 CR3: 0000000037207000 CR4: 00000000000006b0 [ 1845.185026] Stack: [ 1845.185026] ffff88007c1f1600 ffff88007c1f1600 ffff88007ca5dd38 ffffffff812cdb7e [ 1845.185026] 0000000000000000 ffff88007c1f1640 ffff88007ca5dd68 ffffffff812cdbfe [ 1845.185026] ffff88007c974800 ffff88007c1f1640 ffff88007ff61a00 0000000000000000 [ 1845.185026] Call Trace: [ 1845.185026] [<ffffffff812cdb7e>] kobject_del+0x2e/0x40 [ 1845.185026] [<ffffffff812cdbfe>] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [<ffffffff81063a45>] process_one_work+0x1e5/0x670 [ 1845.185026] [<ffffffff810639e3>] ? process_one_work+0x183/0x670 [ 1845.185026] [<ffffffff810642b3>] worker_thread+0x113/0x370 [ 1845.185026] [<ffffffff810641a0>] ? rescuer_thread+0x290/0x290 [ 1845.185026] [<ffffffff8106bfba>] kthread+0xda/0xe0 [ 1845.185026] [<ffffffff814ff0f0>] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [<ffffffff8150751a>] ret_from_fork+0x7a/0xb0 [ 1845.185026] [<ffffffff8106bee0>] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d <f6> 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [<ffffffff812cda81>] kobject_put+0x11/0x60 [ 1845.185026] RSP <ffff88007ca5dd08> [ 1845.185026] CR2: ffffffffa01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-09-03Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: " * Update RCU documentation. These were posted to LKML at https://lkml.org/lkml/2013/8/19/611. * Miscellaneous fixes. These were posted to LKML at https://lkml.org/lkml/2013/8/19/619. * Full-system idle detection. This is for use by Frederic Weisbecker's adaptive-ticks mechanism. Its purpose is to allow the timekeeping CPU to shut off its tick when all other CPUs are idle. These were posted to LKML at https://lkml.org/lkml/2013/8/19/648. * Improve rcutorture test coverage. These were posted to LKML at https://lkml.org/lkml/2013/8/19/675. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02perf: Add attr->mmap2 attribute to an eventStephane Eranian
Adds a new PERF_RECORD_MMAP2 record type which is essence an expanded version of PERF_RECORD_MMAP. Used to request mmap records with more information about the mapping, including device major, minor and the inode number and generation for mappings associated with files or shared memory segments. Works for code and data (with attr->mmap_data set). Existing PERF_RECORD_MMAP record is unmodified by this patch. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com [ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Fix the sd_parent_degenerate() codePeter Zijlstra
I found that on my WSM box I had a redundant domain: [ 0.949769] CPU0 attaching sched-domain: [ 0.953765] domain 0: span 0,12 level SIBLING [ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.964548] domain 1: span 0-5,12-17 level MC [ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.984993] domain 2: span 0-5,12-17 level CPU [ 0.989822] groups: 0-5,12-17 (cpu_power = 7055) [ 0.995049] domain 3: span 0-23 level NUMA [ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Note how domain 2 has only a single group and spans the same CPUs as domain 1. We should not keep such domains and do in fact have code to prune these. It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it makes sd_parent_degenerate() fail on the CPU domain. We can easily fix this by 'ignoring' the SD_PREFER_SIBLING bit and transfering it to whatever domain ends up covering the span. With this patch the domains now look like this: [ 0.950419] CPU0 attaching sched-domain: [ 0.954454] domain 0: span 0,12 level SIBLING [ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.965271] domain 1: span 0-5,12-17 level MC [ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.985737] domain 2: span 0-23 level NUMA [ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Rework and comment the group_imb codePeter Zijlstra
Rik reported some weirdness due to the group_imb code. As a start to looking at it, clean it up a little and add a few explanatory comments. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-caeeqttnla4wrrmhp5uf89gp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Optimize find_busiest_queue()Peter Zijlstra
Use for_each_cpu_and() and thereby avoid computing the capacity for CPUs we know we're not interested in. Reviewed-by: Paul Turner <pjt@google.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-lppceyv6kb3a19g8spmrn20b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Make group power more consistentPeter Zijlstra
For easier access, less dereferences and more consistent value, store the group power in update_sg_lb_stats() and use it thereafter. The actual value in sched_group::sched_group_power::power can change throughout the load-balance pass if we're unlucky. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-739xxqkyvftrhnh9ncudutc7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Remove duplicate load_per_task computationsPeter Zijlstra
Since we already compute (but don't store) the sgs load_per_task value in update_sg_lb_stats() we might as well store it and not re-compute it later on. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ym1vmljiwbzgdnnrwp9azftq@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched/fair: Shrink sg_lb_stats and play memset gamesPeter Zijlstra
We can shrink sg_lb_stats because rq::nr_running is an unsigned int and cpu numbers are 'int' Before: sgs: /* size: 72, cachelines: 2, members: 10 */ sds: /* size: 184, cachelines: 3, members: 7 */ After: sgs: /* size: 56, cachelines: 1, members: 10 */ sds: /* size: 152, cachelines: 3, members: 7 */ Further we can avoid clearing all of sds since we do a total clear/assignment of sg_stats in update_sg_lb_stats() with exception of busiest_stat.avg_load which is referenced in update_sd_pick_busiest(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-0klzmz9okll8wc0nsudguc9p@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched: Clean-up struct sd_lb_statJoonsoo Kim
There is no reason to maintain separate variables for this_group and busiest_group in sd_lb_stat, except saving some space. But this structure is always allocated in stack, so this saving isn't really benificial [peterz: reducing stack space is good; in this case readability increases enough that I think its still beneficial] This patch unify these variables, so IMO, readability may be improved. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ Rename this to local -- avoids confusion between this_cpu and the C++ this pointer. ] Reviewed-by: Paul Turner <pjt@google.com> [ Lots of style edits, a few fixes and a rename. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched: Factor out code to should_we_balance()Joonsoo Kim
Now checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking at update_sg_lb_stats(), except saving one iteration for sched_group_cpus. In this patch, I factor out this checking to should_we_balance() function. And before doing actual work for load_balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quit the work immediately. With this change, we can save two memset cost and can expect better compiler optimization. Below is result of this patch. * Vanilla * text data bss dec hex filename 34499 1136 116 35751 8ba7 kernel/sched/fair.o * Patched * text data bss dec hex filename 34243 1136 116 35495 8aa7 kernel/sched/fair.o In addition, rename @balance to @continue_balancing in order to represent its purpose more clearly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ s/should_balance/continue_balancing/g ] Reviewed-by: Paul Turner <pjt@google.com> [ Made style changes and a fix in should_we_balance(). ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02sched: Remove one division operation in find_busiest_queue()Joonsoo Kim
Remove one division operation in find_busiest_queue() by using crosswise multiplication: wl_i / power_i > wl_j / power_j := wl_i * power_j > wl_j * power_i Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ Expanded the changelog. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02perf: Prevent race in unthrottling codeJiri Olsa
The current throttling code triggers WARN below via following workload (only hit on AMD machine with 48 CPUs): # while [ 1 ]; do perf record perf bench sched messaging; done WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100() SNIP Call Trace: <IRQ> [<ffffffff815f62d6>] dump_stack+0x19/0x1b [<ffffffff8105f531>] warn_slowpath_common+0x61/0x80 [<ffffffff8105f60a>] warn_slowpath_null+0x1a/0x20 [<ffffffff810213a6>] x86_pmu_start+0xc6/0x100 [<ffffffff81129dd2>] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0 [<ffffffff8112a058>] perf_event_task_tick+0xc8/0xf0 [<ffffffff81093221>] scheduler_tick+0xd1/0x140 [<ffffffff81070176>] update_process_times+0x66/0x80 [<ffffffff810b9565>] tick_sched_handle.isra.15+0x25/0x60 [<ffffffff810b95e1>] tick_sched_timer+0x41/0x60 [<ffffffff81087c24>] __run_hrtimer+0x74/0x1d0 [<ffffffff810b95a0>] ? tick_sched_handle.isra.15+0x60/0x60 [<ffffffff81088407>] hrtimer_interrupt+0xf7/0x240 [<ffffffff81606829>] smp_apic_timer_interrupt+0x69/0x9c [<ffffffff8160569d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81129f74>] ? __perf_event_task_sched_in+0x184/0x1a0 [<ffffffff814dd937>] ? kfree_skbmem+0x37/0x90 [<ffffffff815f2c47>] ? __slab_free+0x1ac/0x30f [<ffffffff8118143d>] ? kfree+0xfd/0x130 [<ffffffff81181622>] kmem_cache_free+0x1b2/0x1d0 [<ffffffff814dd937>] kfree_skbmem+0x37/0x90 [<ffffffff814e03c4>] consume_skb+0x34/0x80 [<ffffffff8158b057>] unix_stream_recvmsg+0x4e7/0x820 [<ffffffff814d5546>] sock_aio_read.part.7+0x116/0x130 [<ffffffff8112c10c>] ? __perf_sw_event+0x19c/0x1e0 [<ffffffff814d5581>] sock_aio_read+0x21/0x30 [<ffffffff8119a5d0>] do_sync_read+0x80/0xb0 [<ffffffff8119ac85>] vfs_read+0x145/0x170 [<ffffffff8119b699>] SyS_read+0x49/0xa0 [<ffffffff810df516>] ? __audit_syscall_exit+0x1f6/0x2a0 [<ffffffff81604a19>] system_call_fastpath+0x16/0x1b ---[ end trace 622b7e226c4a766a ]--- The reason is a race in perf_event_task_tick() throttling code. The race flow (simplified code): - perf_throttled_count is per cpu variable and is CPU throttling flag, here starting with 0 - perf_throttled_seq is sequence/domain for allowed count of interrupts within the tick, gets increased each tick on single CPU (CPU bounded event): ... workload perf_event_task_tick: | | T0 inc(perf_throttled_seq) | T1 needs_unthr = xchg(perf_throttled_count, 0) == 0 tick gets interrupted: ... event gets throttled under new seq ... T2 last NMI comes, event is throttled - inc(perf_throttled_count) back to tick: | perf_adjust_freq_unthr_context: | | T3 unthrottling is skiped for event (needs_unthr == 0) | T4 event is stop and started via freq adjustment | tick ends ... workload ... no sample is hit for event ... perf_event_task_tick: | | T5 needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2) | T6 unthrottling is done on event (interrupts == MAX_INTERRUPTS) | event is already started (from T4) -> WARN Fixing this by not checking needs_unthr again and thus check all events for unthrottling. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-31Merge branches 'doc.2013.08.19a', 'fixes.2013.08.20a', 'sysidle.2013.08.31a' ↵Paul E. McKenney
and 'torture.2013.08.20a' into HEAD doc.2013.08.19a: Documentation updates fixes.2013.08.20a: Miscellaneous fixes sysidle.2013.08.31a: Detect system-wide idle state. torture.2013.08.20a: rcutorture updates.