aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2011-02-17sched: Export account_system_vtime()Ingo Molnar
Commit: b7dadc38797584f6203386da1947ed5edf516646 upstream KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Call tick_check_idle before __irq_enterVenkatesh Pallipadi
Commit: d267f87fb8179c6dba03d08b91952e81bc3723c7 upstream When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Remove irq time from available CPU powerVenkatesh Pallipadi
Commit: aa483808516ca5cacfa0e5849691f64fec25828e upstream The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and remaining 7 tasks are spread around among other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Do not account irq time to current taskVenkatesh Pallipadi
Commit: 305e6835e05513406fa12820e40e4a8ecb63743c upstream Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to acoount only actual task time from currently running task. Now update_curr(), modifies the delta_exec to depend on rq->clock_task. Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing. CPU 3 spending around 35% time running nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task as remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17x86: Add IRQ_TIME_ACCOUNTINGVenkatesh Pallipadi
Commit: e82b8e4ea4f3dffe6e7939f90e78da675fcc450e upstream This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it when TSC is enabled. This change just enables fine grained irq time accounting, isn't used yet. Following patches use it for different purposes. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi
Commit: b52bfee445d315549d41eacf2fa7c156e7d153d5 upstream s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Add a PF flag for ksoftirqd identificationVenkatesh Pallipadi
Commit: 6cdd5199daf0cb7b0fcc8dca941af08492612887 upstream To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Remove unused PF_ALIGNWARN flagDave Young
Commit: 637bbdc5b83615ef9f45f50399d1c7f27473c713 upstream PF_ALIGNWARN is not implemented and it is for 486 as the comment. It is not likely someone will implement this flag feature. So here remove this flag and leave the valuable 0x00000001 for future use. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20100913121903.GB22238@darkstar> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Consolidate account_system_vtime extern declarationVenkatesh Pallipadi
Commit: e1e10a265d28273ab8c70be19d43dcbdeead6c5a upstream Just a minor cleanup patch that makes things easier to the following patches. No functionality change in this patch. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Fix softirq time accountingVenkatesh Pallipadi
Commit: 75e1056f5c57050415b64cb761a3acc35d91f013 upstream Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Drop group_capacity to 1 only if local group has extra capacityNikhil Rao
Commit: 75dd321d79d495a0ee579e6249ebc38ddbb2667f upstream When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Force balancing on newidle balance if local group has capacityNikhil Rao
Commit: fab476228ba37907ad75216d0fd9732ada9c119e upstream This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Set group_imb only a task can be pulled from the busiest cpuNikhil Rao
Commit: 2582f0eba54066b5e98ff2b27ef0cfa833b59f54 upstream When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, and hence do not consider this group to be imbalanced. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com> [ small code readability edits ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Do not consider SCHED_IDLE tasks to be cache hotNikhil Rao
Commit: ef8002f6848236de5adc613063ebeabddea8a6fb upstream This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: fix RCU lockdep splat from task_group()Peter Zijlstra
Commit: 6506cf6ce68d78a5470a8360c965dafe8e4b78e3 upstream This addresses the following RCU lockdep splat: [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03 [0.052999] lockdep: fixing up alternatives. [0.054105] [0.054106] =================================================== [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ] [0.054999] --------------------------------------------------- [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection! [0.054999] [0.054999] other info that might help us debug this: [0.054999] [0.054999] [0.054999] rcu_scheduler_active = 1, debug_locks = 1 [0.054999] 3 locks held by swapper/1: [0.054999] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a [0.054999] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51 [0.054999] #2: (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113 [0.054999] [0.054999] stack backtrace: [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1 [0.054999] Call Trace: [0.054999] [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3 [0.054999] [<ffffffff810325c3>] task_group+0x7b/0x8a [0.054999] [<ffffffff810325e5>] set_task_rq+0x13/0x40 [0.054999] [<ffffffff814be39a>] init_idle+0xd2/0x113 [0.054999] [<ffffffff814be78a>] fork_idle+0xb8/0xc7 [0.054999] [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b [0.054999] [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b [0.054999] [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724 [0.054999] [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b [0.054999] [<ffffffff814be876>] _cpu_up+0xac/0x127 [0.054999] [<ffffffff814be946>] cpu_up+0x55/0x6a [0.054999] [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff [0.054999] [<ffffffff81003854>] kernel_thread_helper+0x4/0x10 [0.054999] [<ffffffff814c353c>] ? restore_args+0x0/0x30 [0.054999] [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff [0.054999] [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10 [0.056074] Booting Node 0, Processors #1lockdep: fixing up alternatives. [0.130045] #2lockdep: fixing up alternatives. [0.203089] #3 Ok. [0.275286] Brought up 4 CPUs [0.276005] Total of 4 processors activated (16017.17 BogoMIPS). The cgroup_subsys_state structures referenced by idle tasks are never freed, because the idle tasks should be part of the root cgroup, which is not removable. The problem is that while we do in-fact hold rq->lock, the newly spawned idle thread's cpu is not yet set to the correct cpu so the lockdep check in task_group(): lockdep_is_held(&task_rq(p)->lock) will fail. But this is a chicken and egg problem. Setting the CPU's runqueue requires that the CPU's runqueue already be set. ;-) So insert an RCU read-side critical section to avoid the complaint. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: suppress RCU lockdep splat in task_fork_fairPaul E. McKenney
Commit: b0a0f667a349247bd7f05f806b662a25653822bc upstream > =================================================== > [ INFO: suspicious rcu_dereference_check() usage. ] > --------------------------------------------------- > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection! > > other info that might help us debug this: > > rcu_scheduler_active = 1, debug_locks = 1 > 1 lock held by ifup/23517: > #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108 > > stack backtrace: > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5 > Call Trace: > [<c075e219>] ? printk+0xf/0x16 > [<c0455842>] lockdep_rcu_dereference+0x74/0x7d > [<c0426854>] task_group+0x6d/0x79 > [<c042686e>] set_task_rq+0xe/0x57 > [<c042f79e>] task_fork_fair+0x57/0x108 > [<c042e965>] sched_fork+0x82/0xf9 > [<c04334b3>] copy_process+0x569/0xe8e > [<c0433ef0>] do_fork+0x118/0x262 > [<c076302f>] ? do_page_fault+0x16a/0x2cf > [<c044b80c>] ? up_read+0x16/0x2a > [<c04085ae>] sys_clone+0x1b/0x20 > [<c04030a5>] ptregs_clone+0x15/0x30 > [<c0402f1c>] ? sysenter_do_call+0x12/0x38 Here a newly created task is having its runqueue assigned. The new task is not yet on the tasklist, so cannot go away. This is therefore a false positive, suppress with an RCU read-side critical section. Reported-by: Ben Greear <greearb@candelatech.com Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Ben Greear <greearb@candelatech.com Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Give CPU bound RT tasks preferencestable-bot for Steven Rostedt
From:: Steven Rostedt <srostedt@redhat.com> Commit: b3bc211cfe7d5fe94b310480d78e00bea96fbf2a upstream If a high priority task is waking up on a CPU that is running a lower priority task that is bound to a CPU, see if we can move the high RT task to another CPU first. Note, if all other CPUs are running higher priority tasks than the CPU bounded current task, then it will be preempted regardless. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.888922071@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Try not to migrate higher priority RT tasksSteven Rostedt
Commit: 43fa5460fe60dea5c610490a1d263415419c60f6 upstream When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling. When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority. This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it. But in reality, the above does not happen much either, because the wake up of the lower prio task, which has already been boosted, if it was on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either. To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The options used in this change log was to have priority inheritance mutexes or not. All threads did the following loop: static void grab_lock(long id, int iter, int l) { ftrace_write("thread %ld iter %d, taking lock %d\n", id, iter, l); pthread_mutex_lock(&locks[l]); ftrace_write("thread %ld iter %d, took lock %d\n", id, iter, l); busy_loop(nr_tasks - id); ftrace_write("thread %ld iter %d, unlock lock %d\n", id, iter, l); pthread_mutex_unlock(&locks[l]); } void *start_task(void *id) { [...] while (!done) { for (l = 0; l < nr_locks; l++) { grab_lock(id, i, l); ftrace_write("thread %ld iter %d sleeping\n", id, i); ms_sleep(id); } i++; } [...] } The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace. The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it spins. This is usually the case with RT tasks, the lower priority tasks usually run longer than higher priority tasks. At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status. Running this on a 4 CPU processor, the results without changes to the kernel looked like this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 53 3220 1470 98 1: 562 773 724 98 2: 752 933 1375 98 3: 749 39 697 98 4: 758 5 515 98 5: 764 2 679 99 6: 761 2 535 99 7: 757 3 346 99 total: 5156 4977 6341 787 Each thread regardless of priority migrated a few hundred times. The higher priority tasks, were a little better but still took quite an impact. By letting higher priority tasks bump the lower prio task from the CPU, things changed a bit: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 37 2835 1937 98 1: 666 1821 1865 98 2: 654 1003 1385 98 3: 664 635 973 99 4: 698 197 352 99 5: 703 101 159 99 6: 708 1 75 99 7: 713 1 2 99 total: 4843 6594 6748 789 The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run). Notice that the # of iterations did not change either. The above was with priority inheritance mutexes. That is, when the higher prority task blocked on a lower priority task, the lower priority task would inherit the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 56 3101 1892 95 1: 594 713 937 95 2: 625 188 618 95 3: 628 4 491 96 4: 640 7 468 96 5: 631 2 501 96 6: 641 1 466 96 7: 643 2 497 96 total: 4458 4018 5870 765 Not much changed with or without priority inheritance mutexes. But if we let the high priority task bump lower priority tasks on wakeup we see: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 115 3439 2782 98 1: 633 1354 1583 99 2: 652 919 1218 99 3: 645 713 934 99 4: 690 3 3 99 5: 694 1 4 99 6: 720 3 4 99 7: 747 0 1 100 Which shows a even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference. Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion. One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.605460343@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Increment cache_nice_tries only on periodic lbVenkatesh Pallipadi
Commit: 58b26c4c025778c09c7a1438ff185080e11b7d0a upstream scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance, when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in. Making the cache_nice_tries ignore failed new_idle balance seems to make better sense. With that only the failed load balance in periodic load balance gets accounted and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks. schedstat diff (after-before) excerpt from a workload that has frequent and short wakeup-idle pattern (:2 in cpu col below refers to NEWIDLE idx) This snapshot was across ~400 seconds. Without this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 306487 219575 73167 110069413 44583 19070 1172 218403 1:2 292139 194853 81421 120893383 50745 21902 1259 193594 2:2 283166 174607 91359 129699642 54931 23688 1287 173320 3:2 273998 161788 93991 132757146 57122 24351 1366 160422 4:2 289851 215692 62190 83398383 36377 13680 851 214841 5:2 316312 222146 77605 117582154 49948 20281 988 221158 6:2 297172 195596 83623 122133390 52801 21301 929 194667 7:2 283391 178078 86378 126622761 55122 22239 928 177150 8:2 297655 210359 72995 110246694 45798 19777 1125 209234 9:2 297357 202011 79363 119753474 50953 22088 1089 200922 10:2 278797 178703 83180 122514385 52969 22726 1128 177575 11:2 272661 167669 86978 127342327 55857 24342 1195 166474 12:2 293039 204031 73211 110282059 47285 19651 948 203083 13:2 289502 196762 76803 114712942 49339 20547 1016 195746 14:2 264446 169609 78292 115715605 50459 21017 982 168627 15:2 260968 163660 80142 116811793 51483 21281 1064 162596 With this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 272347 187380 77455 105420270 24975 1 953 186427 1:2 267276 172360 86234 116242264 28087 6 1028 171332 2:2 259769 156777 93281 123243134 30555 1 1043 155734 3:2 250870 143129 97627 127370868 32026 6 1188 141941 4:2 248422 177116 64096 78261112 22202 2 757 176359 5:2 275595 180683 84950 116075022 29400 6 778 179905 6:2 262418 162609 88944 119256898 31056 4 817 161792 7:2 252204 147946 92646 122388300 32879 4 824 147122 8:2 262335 172239 81631 110477214 26599 4 864 171375 9:2 261563 164775 88016 117203621 28331 3 849 163926 10:2 243389 140949 93379 121353071 29585 2 909 140040 11:2 242795 134651 98310 124768957 30895 2 1016 133635 12:2 255234 166622 79843 104696912 26483 4 746 165876 13:2 244944 151595 83855 109808099 27787 3 801 150794 14:2 241301 140982 89935 116954383 30403 6 845 140137 15:2 232271 128564 92821 119185207 31207 4 1416 127148 Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Move sched_avg_update() to update_cpu_load()Suresh Siddha
Commit: da2b71edd8a7db44fe1746261410a981f3e03632 upstream Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting called from scale_rt_power() (in the load balance context) which doesn't take rq->lock. Fix it by moving the sched_avg_update() to more appropriate update_cpu_load() where the CFS load gets updated as well. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17sched: Remove remaining USER_SCHED codeLi Zefan
Commit: 32bd7eb5a7f4596c8440dd9440322fe9e686634d upstream This is left over from commit 7c9414385e ("sched: Remove USER_SCHED"") Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Howells <dhowells@redhat.com> LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2011-02-17sched: Remove USER_SCHEDDhaval Giani
Commit: 7c9414385ebfdd87cc542d4e7e3bb0dbb2d3ce25 upstream Remove the USER_SCHED feature. It has been scheduled to be removed in 2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2 [trace from referenced thread] [1046577.884289] general protection fault: 0000 [#1] SMP [1046577.911332] last sysfs file: /sys/devices/platform/coretemp.7/temp1_input [1046577.938715] CPU 3 [1046577.965814] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables coretemp k8temp [1046577.994456] Pid: 38, comm: events/3 Not tainted 2.6.32.27intel #1 X8DT3 [1046578.023166] RIP: 0010:[] [] sched_destroy_group+0x3c/0x10d [1046578.052639] RSP: 0000:ffff88043e5abe10 EFLAGS: 00010097 [1046578.081360] RAX: ffff880139fa5540 RBX: ffff8803d18419c0 RCX: ffff8801d2f8fb78 [1046578.109903] RDX: dead000000200200 RSI: 0000000000000000 RDI: 0000000000000000 [1046578.109905] RBP: 0000000000000246 R08: 0000000000000020 R09: ffffffff816339b8 [1046578.109907] R10: 0000000004e6e5f0 R11: 0000000000000006 R12: ffffffff816339b8 [1046578.109909] R13: ffff8803d63ac4e0 R14: ffff88043e582340 R15: ffffffff8104a216 [1046578.109911] FS: 0000000000000000(0000) GS:ffff880028260000(0000) knlGS:0000000000000000 [1046578.109914] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [1046578.109915] CR2: 00007f55ab220000 CR3: 00000001e5797000 CR4: 00000000000006e0 [1046578.109917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1046578.109919] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [1046578.109922] Process events/3 (pid: 38, threadinfo ffff88043e5aa000, task ffff88043e582340) [1046578.109923] Stack: [1046578.109924] ffff8803d63ac498 ffff8803d63ac4d8 ffff8803d63ac440 ffffffff8104a2c3 [1046578.109927] <0> ffff88043e5abef8 ffff880028276040 ffff8803d63ac4d8 ffffffff81050395 [1046578.109929] <0> ffff88043e582340 ffff88043e5826c8 ffff88043e582340 ffff88043e5abfd8 [1046578.109932] Call Trace: [1046578.109938] [] ? cleanup_user_struct+0xad/0xcc [1046578.109942] [] ? worker_thread+0x148/0x1d4 [1046578.109946] [] ? autoremove_wake_function+0x0/0x2e [1046578.109948] [] ? worker_thread+0x0/0x1d4 [1046578.109951] [] ? kthread+0x79/0x81 [1046578.109955] [] ? child_rip+0xa/0x20 [1046578.109957] [] ? kthread+0x0/0x81 [1046578.109959] [] ? child_rip+0x0/0x20 [1046578.109961] Code: 3c 00 4c 8b 25 02 98 3d 00 48 89 c5 83 cf ff eb 5c 48 8b 43 10 48 63 f7 48 8b 04 f0 48 8b 90 80 00 00 00 48 8b 48 78 48 89 51 08 <48> 89 0a 48 b9 00 02 20 00 00 00 ad de 48 89 88 80 00 00 00 48 [1046578.109975] RIP [] sched_destroy_group+0x3c/0x10d [1046578.109979] RSP [1046578.109981] ---[ end trace 5ebc2944b7872d4a ]--- Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263990378.24844.3.camel@localhost> LKML-Reference: http://marc.info/?l=linux-kernel&m=129466345327931 Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2011-02-17usb: Realloc xHCI structures after a hub is verified.Sarah Sharp
commit 653a39d1f61bdc9f277766736d21d2e9be0391cb upstream. When there's an xHCI host power loss after a suspend from memory, the USB core attempts to reset and verify the USB devices that are attached to the system. The xHCI driver has to reallocate those devices, since the hardware lost all knowledge of them during the power loss. When a hub is plugged in, and the host loses power, the xHCI hardware structures are not updated to say the device is a hub. This is usually done in hub_configure() when the USB hub is detected. That function is skipped during a reset and verify by the USB core, since the core restores the old configuration and alternate settings, and the hub driver has no idea this happened. This bug makes the xHCI host controller reject the enumeration of low speed devices under the resumed hub. Therefore, make the USB core re-setup the internal xHCI hub device information by calling update_hub_device() when hub_activate() is called for a hub reset resume. After a host power loss, all devices under the roothub get a reset-resume or a disconnect. This patch should be queued for the 2.6.37 stable tree. Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after ↵Suresh Siddha
switching mm commit 831d52bc153971b70e64eccfbed2b232394f22f8 upstream. Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb IPI's while the cr3 is still pointing to the prev mm. And this window can lead to the possibility of bogus TLB fills resulting in strange failures. One such problematic scenario is mentioned below. T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI etc between the point of clearing the cpu from the mm_cpumask(mm1) and before reloading the cr3 with the new mm2. T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with flushing the TLB for mm1. It doesn't send the flush TLB to CPU-1 as it doesn't see that cpu listed in the mm_cpumask(mm1). T3. After the TLB flush is complete, CPU-2 goes ahead and frees the page-table pages associated with the removed vma mapping. T4. CPU-2 now allocates those freed page-table pages for something else. T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1 can potentially speculate and walk through the page-table caches and can insert new TLB entries. As the page-table pages are already freed and being used on CPU-2, this page walk can potentially insert a bogus global TLB entry depending on the (random) contents of the page that is being used on CPU-2. T6. This bogus TLB entry being global will be active across future CR3 changes and can result in weird memory corruption etc. To avoid this issue, for the prev mm that is handing over the cpu to another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is changed. Marking it for -stable, though we haven't seen any reported failure that can be attributed to this. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/i915: Add dependency on CONFIG_TMPFSChris Wilson
commit f7ab9b407b3bc83161c2aa74c992ba4782e87c9c upstream. Without tmpfs, shmem_readpage() is not compiled in causing an OOPS as soon as we try to allocate some swappable pages for GEM. Jan 19 22:52:26 harlie kernel: Modules linked in: i915(+) drm_kms_helper cfbcopyarea video backlight cfbimgblt cfbfillrect Jan 19 22:52:26 harlie kernel: Jan 19 22:52:26 harlie kernel: Pid: 1125, comm: modprobe Not tainted 2.6.37Harlie #10 To be filled by O.E.M./To be filled by O.E.M. Jan 19 22:52:26 harlie kernel: EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 3 Jan 19 22:52:26 harlie kernel: EIP is at 0x0 Jan 19 22:52:26 harlie kernel: EAX: 00000000 EBX: f7b7d000 ECX: f3383100 EDX: f7b7d000 Jan 19 22:52:26 harlie kernel: ESI: f1456118 EDI: 00000000 EBP: f2303c98 ESP: f2303c7c Jan 19 22:52:26 harlie kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Jan 19 22:52:26 harlie kernel: Process modprobe (pid: 1125, ti=f2302000 task=f259cd80 task.ti=f2302000) Jan 19 22:52:26 harlie kernel: Stack: Jan 19 22:52:26 harlie udevd-work[1072]: '/sbin/modprobe -b pci:v00008086d00000046sv00000000sd00000000bc03sc00i00' unexpected exit with status 0x0009 Jan 19 22:52:26 harlie kernel: c1074061 000000d0 f2f42b80 00000000 000a13d2 f2d5dcc0 00000001 f2303cac Jan 19 22:52:26 harlie kernel: c107416f 00000000 000a13d2 00000000 f2303cd4 f8d620ed f2cee620 00001000 Jan 19 22:52:26 harlie kernel: 00000000 000a13d2 f1456118 f2d5dcc0 f1a40000 00001000 f2303d04 f8d637ab Jan 19 22:52:26 harlie kernel: Call Trace: Jan 19 22:52:26 harlie kernel: [<c1074061>] ? do_read_cache_page+0x71/0x160 Jan 19 22:52:26 harlie kernel: [<c107416f>] ? read_cache_page_gfp+0x1f/0x30 Jan 19 22:52:26 harlie kernel: [<f8d620ed>] ? i915_gem_object_get_pages+0xad/0x1d0 [i915] Jan 19 22:52:26 harlie kernel: [<f8d637ab>] ? i915_gem_object_bind_to_gtt+0xeb/0x2d0 [i915] Jan 19 22:52:26 harlie kernel: [<f8d65961>] ? i915_gem_object_pin+0x151/0x190 [i915] Jan 19 22:52:26 harlie kernel: [<c11e16ed>] ? drm_gem_object_init+0x3d/0x60 Jan 19 22:52:26 harlie kernel: [<f8d65aa5>] ? i915_gem_init_ringbuffer+0x105/0x1e0 [i915] Jan 19 22:52:26 harlie kernel: [<f8d571b7>] ? i915_driver_load+0x667/0x1160 [i915] Reported-by: John J. Stimson-III <john@idsfa.net> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/i915/lvds: Add AOpen i915GMm-HFS to the list of false-positive LVDSKnut Petersen
commit 22ab70d3262ddb6e69b3c246a34e2967ba5eb1e8 upstream. Signed-off-by: Knut Petersen <knut_petersen@t-online.de> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/radeon/kms: fix s/r issues with bios scratch regsAlex Deucher
commit 87364760de5d631390c478fcbac8db1b926e0adf upstream. The accelerate mode bit gets checked by certain atom command tables to set up some register state. It needs to be clear when setting modes and set when not. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=26942 Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/radeon: remove 0x4243 pci idAlex Deucher
commit 63a507800c8aca5a1891d598ae13f829346e8e39 upstream. 0x4243 is a PCI bridge, not a GPU. Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=33815 Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/radeon/kms: add pll debugging outputAlex Deucher
commit 51d4bf840a27fe02c883ddc6d9708af056773769 upstream. Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/radeon/kms: make the mac rv630 quirk genericAlex Deucher
commit be23da8ad219650517cbbb7acbeaeb235667113a upstream. Seems some other boards do this as well. Reported-by: Andrea Merello <andrea.merello@gmail.com> Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drm/radeon/kms: add quirk for Mac Radeon HD 2600 cardAlex Deucher
commit f598aa7593427ffe3a61e7767c34bd695a5e7ed0 upstream. Reported-by: 屋国遥 <hyagni@gmail.com> Signed-off-by: Alex Deucher <alexdeucher@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17dm mpath: disable blk_abort_queueMike Snitzer
commit 09c9d4c9b6a2b5909ae3c6265e4cd3820b636863 upstream. Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2 dm: Call blk_abort_queue on failed paths Multipath began to use blk_abort_queue() to allow for lower latency path deactivation. This was found to cause list corruption: the cmd gets blk_abort_queued/timedout run on it and the scsi eh somehow is able to complete and run scsi_queue_insert while scsi_request_fn is still trying to process the request. https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17dm: dont take i_mutex to change device sizeMike Snitzer
commit c217649bf2d60ac119afd71d938278cffd55962b upstream. No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the size of a DM device. This additional locking is unnecessary because i_size_write() is already protected by the existing critical section in dm_swap_table(). DM already has a reference on md->bdev so the associated bd_inode may be changed without lifetime concerns. A negative side-effect of having held md->bdev->bd_inode->i_mutex was that a concurrent DM device resize and flush (via fsync) would deadlock. Dropping md->bdev->bd_inode->i_mutex eliminates this potential for deadlock. The following reproducer no longer deadlocks: https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ieee80211: correct IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK macroAmitkumar Karwar
commit 8d661f1e462d50bd83de87ee628aaf820ce3c66c upstream. It is defined in include/linux/ieee80211.h. As per IEEE spec. bit6 to bit15 in block ack parameter represents buffer size. So the bitmask should be 0xFFC0. Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Bing Zhao <bzhao@marvell.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17SELinux: do not compute transition labels on mountpoint labeled filesystemsEric Paris
commit 415103f9932d45f7927f4b17e3a9a13834cdb9a1 upstream. selinux_inode_init_security computes transitions sids even for filesystems that use mount point labeling. It shouldn't do that. It should just use the mount point label always and no matter what. This causes 2 problems. 1) it makes file creation slower than it needs to be since we calculate the transition sid and 2) it allows files to be created with a different label than the mount point! # id -Z staff_u:sysadm_r:sysadm_t:s0-s0:c0.c1023 # sesearch --type --class file --source sysadm_t --target tmp_t Found 1 semantic te rules: type_transition sysadm_t tmp_t : file user_tmp_t; # mount -o loop,context="system_u:object_r:tmp_t:s0" /tmp/fs /mnt/tmp # ls -lZ /mnt/tmp drwx------. root root system_u:object_r:tmp_t:s0 lost+found # touch /mnt/tmp/file1 # ls -lZ /mnt/tmp -rw-r--r--. root root staff_u:object_r:user_tmp_t:s0 file1 drwx------. root root system_u:object_r:tmp_t:s0 lost+found Whoops, we have a mount point labeled filesystem tmp_t with a user_tmp_t labeled file! Signed-off-by: Eric Paris <eparis@redhat.com> Reviewed-by: Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17SELinux: define permissions for DCB netlink messagesEric Paris
commit 350e4f31e0eaf56dfc3b328d24a11bdf42a41fb8 upstream. Commit 2f90b865 added two new netlink message types to the netlink route socket. SELinux has hooks to define if netlink messages are allowed to be sent or received, but it did not know about these two new message types. By default we allow such actions so noone likely noticed. This patch adds the proper definitions and thus proper permissions enforcement. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17tpm_tis: Use timeouts returned from TPMStefan Berger
commit 9b29050f8f75916f974a2d231ae5d3cd59792296 upstream. The current TPM TIS driver in git discards the timeout values returned from the TPM. The check of the response packet needs to consider that the return_code field is 0 on success and the size of the expected packet is equivalent to the header size + u32 length indicator for the TPM_GetCapability() result + 3 timeout indicators of type u32. I am also adding a sysfs entry 'timeouts' showing the timeouts that are being used. Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com> Tested-by: Guillaume Chazarain <guichaz@gmail.com> Signed-off-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17TPM: Long default timeout fixRajiv Andrade
commit c4ff4b829ef9e6353c0b133b7adb564a68054979 upstream. If duration variable value is 0 at this point, it's because chip->vendor.duration wasn't filled by tpm_get_timeouts() yet. This patch sets then the lowest timeout just to give enough time for tpm_get_timeouts() to further succeed. This fix avoids long boot times in case another entity attempts to send commands to the TPM when the TPM isn't accessible. Signed-off-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17pata_mpc52xx: inherit from ata_bmdma_port_opsTejun Heo
commit 77c5fd19075d299fe820bb59bb21b0b113676e20 upstream. pata_mpc52xx supports BMDMA but inherits ata_sff_port_ops which triggers BUG_ON() when a DMA command is issued. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Roman Fietze <roman.fietze@telemotive.de> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17md: fix regression with re-adding devices to arrays with no metadataNeilBrown
commit bf572541ab44240163eaa2d486b06f306a31d45a upstream. Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were re-added when they shouldn't be but caused a regression in a less common case that means sometimes devices cannot be re-added when they should be. In particular, when re-adding a device to an array without metadata we should always access the device, but after the above commit we didn't. This patch sets the In_sync flag in that case so that the re-add succeeds. This patch is suitable for any -stable kernel to which 1a855a0606 was applied. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17hostap_cs: fix sleeping function called from invalid contextStanislaw Gruszka
commit 4e5518ca53be29c1ec3c00089c97bef36bfed515 upstream. pcmcia_request_irq() and pcmcia_enable_device() are intended to be called from process context (first function allocate memory with GFP_KERNEL, second take a mutex). We can not take spin lock and call them. It's safe to move spin lock after pcmcia_enable_device() as we still hold off IRQ until dev->base_addr is 0 and driver will not proceed with interrupts when is not ready. Patch resolves: https://bugzilla.redhat.com/show_bug.cgi?id=643758 Reported-and-tested-by: rbugz@biobind.com Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17kernel/smp.c: fix smp_call_function_many() SMP raceAnton Blanchard
commit 6dc19899958e420a931274b94019e267e2396d3e upstream. I noticed a failure where we hit the following WARN_ON in generic_smp_call_function_interrupt: if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) continue; data->csd.func(data->csd.info); refs = atomic_dec_return(&data->refs); WARN_ON(refs < 0); <------------------------- We atomically tested and cleared our bit in the cpumask, and yet the number of cpus left (ie refs) was 0. How can this be? It turns out commit 54fdade1c3332391948ec43530c02c4794a38172 ("generic-ipi: make struct call_function_data lockless") is at fault. It removes locking from smp_call_function_many and in doing so creates a rather complicated race. The problem comes about because: - The smp_call_function_many interrupt handler walks call_function.queue without any locking. - We reuse a percpu data structure in smp_call_function_many. - We do not wait for any RCU grace period before starting the next smp_call_function_many. Imagine a scenario where CPU A does two smp_call_functions back to back, and CPU B does an smp_call_function in between. We concentrate on how CPU C handles the calls: CPU A CPU B CPU C CPU D smp_call_function smp_call_function_interrupt walks call_function.queue sees data from CPU A on list smp_call_function smp_call_function_interrupt walks call_function.queue sees (stale) CPU A on list smp_call_function int clears last ref on A list_del_rcu, unlock smp_call_function reuses percpu *data A data->cpumask sees and clears bit in cpumask might be using old or new fn! decrements refs below 0 set data->refs (too late!) The important thing to note is since the interrupt handler walks a potentially stale call_function.queue without any locking, then another cpu can view the percpu *data structure at any time, even when the owner is in the process of initialising it. The following test case hits the WARN_ON 100% of the time on my PowerPC box (having 128 threads does help :) #include <linux/module.h> #include <linux/init.h> #define ITERATIONS 100 static void do_nothing_ipi(void *dummy) { } static void do_ipis(struct work_struct *dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); I tried to fix it by ordering the read and the write of ->cpumask and ->refs. In doing so I missed a critical case but Paul McKenney was able to spot my bug thankfully :) To ensure we arent viewing previous iterations the interrupt handler needs to read ->refs then ->cpumask then ->refs _again_. Thanks to Milton Miller and Paul McKenney for helping to debug this issue. [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ] [miltonm@bga.com: remove excess tests] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Milton Miller <miltonm@bga.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17parisc : Remove broken line wrapping handling pdc_iodc_print()Guy Martin
commit fbea668498e93bb38ac9226c7af9120a25957375 upstream. Remove the broken line wrapping handling in pdc_iodc_print(). It is broken in 3 ways : - It doesn't keep track of the current screen position, it just assumes that the new buffer will be printed at the begining of the screen. - It doesn't take in account that non printable characters won't increase the current position on the screen. - And last but not least, it triggers a kernel panic if a backspace is the first char in the provided buffer : Backtrace: [<0000000040128ec4>] pdc_console_write+0x44/0x78 [<0000000040128f18>] pdc_console_tty_write+0x20/0x38 [<000000004032f1ac>] n_tty_write+0x2a4/0x550 [<000000004032b158>] tty_write+0x1e0/0x2d8 [<00000000401bb420>] vfs_write+0xb8/0x188 [<00000000401bb630>] sys_write+0x68/0xb8 [<0000000040104eb8>] syscall_exit+0x0/0x14 Most terminals handle the line wrapping just fine. I've confirmed that it works correctly on a C8000 with both vga and serial output. Signed-off-by: Guy Martin <gmsoft@tuxicoman.be> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17powerpc: Fix some 6xx/7xxx CPU setup functionsBenjamin Herrenschmidt
commit 1f1936ff3febf38d582177ea319eaa278f32c91f upstream. Some of those functions try to adjust the CPU features, for example to remove NAP support on some revisions. However, they seem to use r5 as an index into the CPU table entry, which might have been right a long time ago but no longer is. r4 is the right register to use. This probably caused some off behaviours on some PowerMac variants using 750cx or 7455 processor revisions. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17klist: Fix object alignment on 64-bit.David Miller
commit 795abaf1e4e188c4171e3cd3dbb11a9fcacaf505 upstream. Commit c0e69a5bbc6f ("klist.c: bit 0 in pointer can't be used as flag") intended to make sure that all klist objects were at least pointer size aligned, but used the constant "4" which only works on 32-bit. Use "sizeof(void *)" which is correct in all cases. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17drivers: update to pl2303 usb-serial to support Motorola cablesDario Lombardo
commit 96a3e79edff6f41b0f115a82f1a39d66218077a7 upstream. Added 0x0307 device id to support Motorola cables to the pl2303 usb serial driver. This cable has a modified chip that is a pl2303, but declares itself as 0307. Fixed by adding the right device id to the supported devices list, assigning it the code labeled PL2303_PRODUCT_ID_MOTOROLA. Signed-off-by: Dario Lombardo <dario.lombardo@libero.it> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17USB: serial: pl2303: Hybrid reader Uniform HCR331Simone Contini
commit 18344a1cd5889d48dac67229fcf024ed300030d5 upstream. I tried a magnetic stripe reader (http://www.kimaldi.com/kimaldi_eng/productos/lectores_de_tarjetas/lectores_tarjeta_chip_y_dni/lector_hibrido_uniform_hcr_331) and I see that it is interfaced with a PL2303. I wrote a patch to use your driver which simply adds the product ID for the device and it seems working fine. From: Simone Contini <s.contini@oltrelinux.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17fix jiffy calculations in calibrate_delay_direct to handle overflowTim Deegan
commit 70a062286b9dfcbd24d2e11601aecfead5cf709a upstream. Fixes a hang when booting as dom0 under Xen, when jiffies can be quite large by the time the kernel init gets this far. Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com> [jbeulich@novell.com: !time_after() -> time_before_eq() as suggested by Jiri Slaby] Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17x86, mtrr: Avoid MTRR reprogramming on BP during boot on UP platformsSuresh Siddha
commit f7448548a9f32db38f243ccd4271617758ddfe2c upstream. Markus Kohn ran into a hard hang regression on an acer aspire 1310, when acpi is enabled. git bisect showed the following commit as the bad one that introduced the boot regression. commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3 Author: Suresh Siddha <suresh.b.siddha@intel.com> Date: Wed Aug 19 18:05:36 2009 -0700 x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init Because of the UP configuration of that platform, native_smp_prepare_cpus() bailed out (in smp_sanity_check()) before doing the set_mtrr_aps_delayed_init() Further down the boot path, native_smp_cpus_done() will call the delayed MTRR initialization for the AP's (mtrr_aps_init()) with mtrr_aps_delayed_init not set. This resulted in the boot processor reprogramming its MTRR's to the values seen during the start of the OS boot. While this is not needed ideally, this shouldn't have caused any side-effects. This is because the reprogramming of MTRR's (set_mtrr_state() that gets called via set_mtrr()) will check if the live register contents are different from what is being asked to write and will do the actual write only if they are different. BP's mtrr state is read during the start of the OS boot and typically nothing would have changed when we ask to reprogram it on BP again because of the above scenario on an UP platform. So on a normal UP platform no reprogramming of BP MTRR MSR's happens and all is well. However, on this platform, bios seems to be modifying the fixed mtrr range registers between the start of OS boot and when we double check the live registers for reprogramming BP MTRR registers. And as the live registers are modified, we end up reprogramming the MTRR's to the state seen during the start of the OS boot. During ACPI initialization, something in the bios (probably smi handler?) don't like this fact and results in a hard lockup. We didn't see this boot hang issue on this platform before the commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3, because only the AP's (if any) will program its MTRR's to the value that BP had at the start of the OS boot. Fix this issue by checking mtrr_aps_delayed_init before continuing further in the mtrr_aps_init(). Now, only AP's (if any) will program its MTRR's to the BP values during boot. Addresses https://bugzilla.novell.com/show_bug.cgi?id=623393 [ By the way, this behavior of the bios modifying MTRR's after the start of the OS boot is not common and the kernel is not prepared to handle this situation well. Irrespective of this issue, during suspend/resume, linux kernel will try to reprogram the BP's MTRR values to the values seen during the start of the OS boot. So suspend/resume might be already broken on this platform for all linux kernel versions. ] Reported-and-bisected-by: Markus Kohn <jabber@gmx.org> Tested-by: Markus Kohn <jabber@gmx.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Thomas Renninger <trenn@novell.com> Cc: Rafael Wysocki <rjw@novell.com> Cc: Venkatesh Pallipadi <venki@google.com> LKML-Reference: <1296694975.4418.402.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ptrace: use safer wake up on ptrace_detach()Tejun Heo
commit 01e05e9a90b8f4c3997ae0537e87720eb475e532 upstream. The wake_up_process() call in ptrace_detach() is spurious and not interlocked with the tracee state. IOW, the tracee could be running or sleeping in any place in the kernel by the time wake_up_process() is called. This can lead to the tracee waking up unexpectedly which can be dangerous. The wake_up is spurious and should be removed but for now reduce its toxicity by only waking up if the tracee is in TRACED or STOPPED state. This bug can possibly be used as an attack vector. I don't think it will take too much effort to come up with an attack which triggers oops somewhere. Most sleeps are wrapped in condition test loops and should be safe but we have quite a number of places where sleep and wakeup conditions are expected to be interlocked. Although the window of opportunity is tiny, ptrace can be used by non-privileged users and with some loading the window can definitely be extended and exploited. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>