Age    Commit message    Author
2012-07-26  Merge branches 'arm-asymmetric-support-v3', 'cpuidle-next-v4', 'per-cpu-thread-hotplug-v3-fixed', 'fast-slow-cpu-dt-v1', 'wq-hotplug-v1' and 'config-fragments' into big-LITTLE-MP-v4  [big-LITTLE-MP-v4]  (Viresh Kumar)
Based on v3.5
$ gco -b big-LITTLE-MP-v4 v3.5
$ git merge arm-asymmetric-support-v3 cpuidle-next-v4 per-cpu-thread-hotplug-v3-fixed per-task-load-average-v2 task-placement-v1 fast-slow-cpu-dt-v1 wq-hotplug-v1 config-fragments
2012-07-26  hotplug: Fix UP bug in smpboot hotplug code  (Paul E. McKenney)
Because kernel subsystems need their per-CPU kthreads on UP systems as well as on SMP systems, the smpboot hotplug kthread functions must be provided in UP builds as well as in SMP builds. This commit therefore adds smpboot.c to UP builds and excludes irrelevant code via #ifdef. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-26  infiniband: ehca: Use hotplug thread infrastructure  (Thomas Gleixner)
Get rid of the hotplug notifiers and use the generic hotplug thread infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-26  rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthread  (Paul E. McKenney)
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU kthread code to use the new smp_hotplug_thread facility. [ tglx: Adapted it to use callbacks and to the simplified rcu yield ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-26  watchdog: Use hotplug thread infrastructure  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-26  softirq: Use hotplug thread infrastructure  (Thomas Gleixner)
[ paulmck: Updated to avoid invoking rcu_note_context_switch() with preemption enabled. ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-26  smpboot: Provide infrastructure for percpu hotplug threads  (Thomas Gleixner)
Provide a generic interface for setting up and tearing down percpu threads. On registration the threads for already online cpus are created and started. On deregistration (modules) the threads are stopped. During hotplug operations the threads are created, started, parked and unparked. The data structure for registration provides a pointer to percpu storage space and optional setup, cleanup, park, unpark functions. These functions are called when the thread state changes. Each implementation has to provide a function which is queried and returns whether the thread should run and the thread function itself. The core code handles all state transitions and avoids duplicated code in the call sites. [ paulmck: Updated to fix preempt_disable() misnesting. ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
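For illustration, a minimal sketch of how a subsystem might describe its per-CPU thread to this infrastructure; the field and function names follow the later softirq/RCU conversions, so treat the exact layout of struct smp_hotplug_thread here as an assumption rather than a definitive interface:

#include <linux/smpboot.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/init.h>

static DEFINE_PER_CPU(struct task_struct *, example_thread);

static int example_should_run(unsigned int cpu)
{
	/* Queried by the core code; non-zero means "there is work to do". */
	return 0;
}

static void example_thread_fn(unsigned int cpu)
{
	/* Per-CPU work; parking/unparking across hotplug is handled by the core. */
}

static struct smp_hotplug_thread example_threads = {
	.store			= &example_thread,	/* percpu storage for the task pointers */
	.thread_should_run	= example_should_run,
	.thread_fn		= example_thread_fn,
	.thread_comm		= "example/%u",
};

static int __init example_init(void)
{
	/* Creates and starts a thread for every already-online CPU. */
	return smpboot_register_percpu_thread(&example_threads);
}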
2012-07-25  sched: Use device-tree to provide fast/slow CPU list for HMP  (Jon Medhurst (Tixy))
We can't rely on Kconfig options to set the fast and slow CPU lists for HMP scheduling if we want a single kernel binary to support multiple devices with different CPU topology. E.g. ARM's TC2, Fast Models, or even non big.LITTLE devices. This patch adds the function arch_get_fast_and_slow_cpus() to generate the lists at run-time by parsing the CPU nodes in device-tree; it assumes slow cores are A7s and everything else is fast. The function still supports the old Kconfig options as this is useful for testing the HMP scheduler on devices without big.LITTLE. Signed-off-by: Jon Medhurst <tixy@linaro.org>
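A rough sketch of the run-time classification described above, assuming the standard of_* helpers; the function name and the A7-only test are illustrative of the approach, not the exact patch:

#include <linux/of.h>
#include <linux/cpumask.h>

/* Illustrative sketch: classify CPUs by device-tree compatible string.
 * Assumes slow cores are Cortex-A7 and everything else is fast, as the
 * commit message describes. Names are illustrative. */
static void example_get_fast_and_slow_cpus(struct cpumask *fast,
					   struct cpumask *slow)
{
	struct device_node *cn = NULL;
	unsigned int cpu = 0;

	cpumask_clear(fast);
	cpumask_clear(slow);

	while ((cn = of_find_node_by_type(cn, "cpu"))) {
		if (cpu >= num_possible_cpus())
			break;
		if (of_device_is_compatible(cn, "arm,cortex-a7"))
			cpumask_set_cpu(cpu, slow);
		else
			cpumask_set_cpu(cpu, fast);
		cpu++;
	}
}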
2012-07-25  sched: Add HMP forced task migration ftrace event  (Morten Rasmussen)
Adds ftrace event for tracing forced task migrations using HMP optimized scheduling. Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2012-07-25  sched: Forced migration of high load task on HMP platforms  (Morten Rasmussen)
This patch introduces a periodic check to look for high load tasks on runqueues of low-performance cpus on heterogeneous platforms. These will be migrated immediately rather than waiting until the next time they go to sleep and go through the wakeup migration. The patch is proof-of-concept code and therefore attempts to have minimal impact on existing scheduler code paths. Most of the functions can potentially be merged with existing functions, which would reduce the size of this patch considerably. Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2012-07-25  sched: load-tracking driven wakeup migration for HMP platforms  (Morten Rasmussen)
Attempts to migrate tasks to an appropriate cpu on heterogeneous platforms based on the task's individual tracked load at wakeup. The migration decision is based on task load thresholds and task priority. Currently only two types of cpus are supported: fast (high-performance) and slow (power-efficient). The HMP setup (fast/slow cpuids) is currently hardcoded in the scheduler. Obviously, this hack needs to be replaced by a generic way to expose this information to the scheduler. Ideally this could be done using device tree and a not yet implemented scheduler interface. Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
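As a sketch of the threshold idea (the numbers and helper names below are invented for illustration; the actual patch also factors in task priority):

/* Illustrative threshold-based placement decision on the tracked load
 * ratio (0..1023). Values and names are made up for illustration. */
#define EXAMPLE_UP_THRESHOLD	768	/* prefer a fast cpu above this */
#define EXAMPLE_DOWN_THRESHOLD	256	/* fits on a slow cpu below this */

static inline int example_task_wants_fast_cpu(unsigned long load_avg_ratio)
{
	return load_avg_ratio > EXAMPLE_UP_THRESHOLD;
}

static inline int example_task_fits_slow_cpu(unsigned long load_avg_ratio)
{
	return load_avg_ratio < EXAMPLE_DOWN_THRESHOLD;
}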
2012-07-25  sched: entity load-tracking load_avg_ratio  (Morten Rasmussen)
load_avg_contrib includes the task's load.weight and is therefore not the pure tracked load of the task. This patch adds load_avg_ratio, which does not include the task's load.weight. Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2012-07-25  sched: Add ftrace events for entity load-tracking  (Morten Rasmussen)
Adds ftrace events for key variables related to the entity load-tracking to help debugging scheduler behaviour. Allows tracing of load contribution and runqueue residency ratio for both entities and runqueues as well as entity CPU usage ratio. Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2012-07-25  sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking  (Paul Turner)
While per-entity load-tracking is generally useful beyond computing shares distribution (e.g. runnable-based load-balance (in progress), governors, power-management, etc.), these facilities are not yet consumers of this data. This may be trivially reverted when the information is required; but avoid paying the overhead for calculations we will not use until then. Signed-off-by: Paul Turner <pjt@google.com>
2012-07-25  sched: implement usage tracking  (Paul Turner)
With the framework for runnable tracking now fully in place, per-entity usage tracking is a simple and low-overhead addition. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: make __update_entity_runnable_avg() fast  (Paul Turner)
__update_entity_runnable_avg forms the core of maintaining an entity's runnable load average. In this function we charge the accumulated run-time since last update and handle appropriate decay. In some cases, e.g. a waking task, this time interval may be much larger than our period unit. Fortunately we can exploit some properties of our series to perform decay for a blocked update in constant time and account the contribution for a running update in essentially-constant* time. [*]: For any running entity they should be performing updates at the tick which gives us a soft limit of 1 jiffy between updates, and we can compute up to a 32 jiffy update in a single pass. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
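A stand-alone userspace model of that constant-time decay, assuming only the y^32 = 1/2 property described here (the kernel uses a fixed-point inverse table rather than doubles):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

static double y_rem[32];	/* y^0 .. y^31, precomputed once */

static void init_table(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2 */
	double f = 1.0;
	for (int i = 0; i < 32; i++) {
		y_rem[i] = f;
		f *= y;
	}
}

static uint64_t decay_load(uint64_t val, unsigned int periods)
{
	val >>= periods / 32;				/* whole half-lives in one shift */
	return (uint64_t)(val * y_rem[periods % 32]);	/* remainder factor, one lookup */
}

int main(void)
{
	init_table();
	printf("1024 after 32 periods: %llu\n",
	       (unsigned long long)decay_load(1024, 32));
	printf("1024 after 16 periods: %llu\n",
	       (unsigned long long)decay_load(1024, 16));
	return 0;
}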
2012-07-25  sched: update_cfs_shares at period edge  (Paul Turner)
Now that our measurement intervals are small (~1ms) we can amortize the posting of update_shares() to be about each period overflow. This is a large cost saving for frequently switching tasks. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: refactor update_shares_cpu() -> update_blocked_avgs()  (Paul Turner)
Now that running entities maintain their own load-averages the work we must do in update_shares() is largely restricted to the periodic decay of blocked entities. This allows us to be a little less pessimistic regarding our occupancy on rq->lock and the associated rq->clock updates required. Signed-off-by: Paul Turner <pjt@google.com>
2012-07-25  sched: replace update_shares weight distribution with per-entity computation  (Paul Turner)
Now that the machinery is in place to compute contributed load in a bottom-up fashion, replace the shares distribution code within update_shares() accordingly. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: maintain runnable averages across throttled periods  (Paul Turner)
With bandwidth control tracked entities may cease execution according to user specified bandwidth limits. Charging this time as either throttled or blocked however, is incorrect and would falsely skew in either direction. What we actually want is for any throttled periods to be "invisible" to load-tracking as they are removed from the system for that interval and contribute normally otherwise. Do this by moderating the progression of time to omit any periods in which the entity belonged to a throttled hierarchy. Signed-off-by: Paul Turner <pjt@google.com>
2012-07-25  sched: normalize tg load contributions against runnable time  (Paul Turner)
Entities of equal weight should receive equitable distribution of cpu time. This is challenging in the case of a task_group's shares as execution may be occurring on multiple cpus simultaneously. To handle this we divide up the shares into weights proportionate with the load on each cfs_rq. This does not, however, account for the fact that the sum of the parts may be less than one cpu and so we need to normalize: load(tg) = min(runnable_avg(tg), 1) * tg->shares, where runnable_avg is the aggregate time in which the task_group had runnable children. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: compute load contribution by a group entity  (Paul Turner)
Unlike task entities, which have a fixed weight, group entities instead own a fraction of their parenting task_group's shares as their contributed weight. Compute this fraction so that we can correctly account hierarchies and shared entity nodes. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: aggregate total task_group load  (Paul Turner)
Maintain a global running sum of the average load seen on each cfs_rq belonging to each task group so that it may be used in calculating an appropriate shares:weight distribution. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: account for blocked load waking back up  (Paul Turner)
When a running entity blocks we migrate its tracked load to cfs_rq->blocked_runnable_avg. In the sleep case this occurs while holding rq->lock and so is a natural transition. Wake-ups however, are potentially asynchronous in the presence of migration and so special care must be taken. We use an atomic counter to track such migrated load, taking care to match this with the previously introduced decay counters so that we don't migrate too much load. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: add an rq migration call-back to sched_class  (Paul Turner)
Since we are now doing bottom up load accumulation we need explicit notification when a task has been re-parented so that the old hierarchy can be updated. Adds task_migrate_rq(struct rq *prev, struct rq *new_rq); (The alternative is to do this out of __set_task_cpu, but it was suggested that this would be a cleaner encapsulation.) Signed-off-by: Paul Turner <pjt@google.com>
2012-07-25  sched: maintain the load contribution of blocked entities  (Paul Turner)
We are currently maintaining: runnable_load(cfs_rq) = \Sum task_load(t) For all running children t of cfs_rq. While this can be naturally updated for tasks in a runnable state (as they are scheduled); this does not account for the load contributed by blocked task entities. This can be solved by introducing a separate accounting for blocked load: blocked_load(cfs_rq) = \Sum runnable(b) * weight(b) Obviously we do not want to iterate over all blocked entities to account for their decay, we instead observe that: runnable_load(t) = \Sum p_i*y^i and that to account for an additional idle period we only need to compute: y*runnable_load(t). This means that we can compute all blocked entities at once by evaluating: blocked_load(cfs_rq)` = y * blocked_load(cfs_rq) Finally we maintain a decay counter so that when a sleeping entity re-awakens we can determine how much of its load should be removed from the blocked sum. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: aggregate load contributed by task entities on parenting cfs_rq  (Paul Turner)
For a given task t, we can compute its contribution to load as: task_load(t) = runnable_avg(t) * weight(t) On a parenting cfs_rq we can then aggregate: runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t Maintain this bottom up, with task entities adding their contributed load to the parenting cfs_rq sum. When a task entity's load changes we add the same delta to the maintained sum. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com>
2012-07-25  sched: maintain per-rq runnable averages  (Ben Segall)
Since runqueues do not have a corresponding sched_entity we instead embed a sched_avg structure directly. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Paul Turner <pjt@google.com>
2012-07-25  sched: track the runnable average on a per-task entity basis  (Paul Turner)
Instead of tracking the average of the load parented by a cfs_rq, we can track entity load directly, with the load for a given cfs_rq then being the sum of its children. To do this we represent the historical contribution to runnable average within each trailing 1024us of execution as the coefficients of a geometric series. We can express this for a given task t as: runnable_sum(t) = \Sum u_i * y^i, load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i) where u_i is the usage in the last i-th 1024us period (approximately 1ms) and y is chosen such that y^k = 1/2. We currently choose k to be 32 which roughly translates to about a sched period. Signed-off-by: Paul Turner <pjt@google.com>
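A small userspace model of this series, assuming only the relations quoted above (u_i accumulated per ~1024us period, y^32 = 1/2); for a task that is runnable every period the sum approaches the geometric limit 1024 / (1 - y):

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2 */
	double sum = 0.0;

	for (int period = 1; period <= 345; period++) {
		sum = sum * y + 1024.0;		/* fully runnable this period */
		if (period % 69 == 0)
			printf("period %3d: runnable_sum ~ %.0f\n", period, sum);
	}
	printf("geometric limit: %.0f\n", 1024.0 / (1.0 - y));
	return 0;
}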
2012-07-25  kthread: Implement park/unpark facility  (Thomas Gleixner)
To avoid the full teardown/setup of per cpu kthreads in the case of cpu hot(un)plug, provide a facility which allows putting the kthread into a park position and unparking it when the cpu comes online again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Namhyung Kim <namhyung@kernel.org>
2012-07-25  rcu: Yield simpler  (Thomas Gleixner)
The rcu_yield() code is amazing. It's there to avoid starvation of the system when lots of (boosting) work is to be done. Looking at the code, its functionality is: make the thread SCHED_OTHER and very nice (i.e. get it out of the way), arm a timer with 2 ticks, and schedule(). Now if the system goes idle the rcu task returns, regains SCHED_FIFO and plugs on. If the system stays busy the timer fires and wakes a per node kthread which in turn makes the per cpu thread SCHED_FIFO and brings it back on the cpu. For the boosting thread the "make it FIFO" bit is missing and it just runs some magic boost checks. Now this is a lot of code with extra threads and complexity. It's way simpler to let the tasks, when they detect overload, schedule away for 2 ticks and defer the normal wakeup as long as they are in yielded state and the cpu is not idle. That solves the same problem and the only difference is that when the cpu goes idle it's not guaranteed that the thread returns right away, but it won't be out longer than two ticks, so no harm is done. If that's an issue then it is way simpler just to wake the task from idle as RCU has callbacks there anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-07-25  cpuidle: coupled: add parallel barrier function  (Colin Cross)
Adds cpuidle_coupled_parallel_barrier, which can be used by coupled cpuidle state enter functions to handle resynchronization after determining if any cpu needs to abort. The normal use case will be:

static bool abort_flag;
static atomic_t abort_barrier;

int arch_cpuidle_enter(struct cpuidle_device *dev, ...)
{
	if (arch_turn_off_irq_controller()) {
		/* returns an error if an irq is pending and would be lost
		   if idle continued and turned off power */
		abort_flag = true;
	}
	cpuidle_coupled_parallel_barrier(dev, &abort_barrier);
	if (abort_flag) {
		/* One of the cpus didn't turn off its irq controller */
		arch_turn_on_irq_controller();
		return -EINTR;
	}
	/* continue with idle */
	...
}

This will cause all cpus to abort idle together if one of them needs to abort. Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Tested-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2012-07-25  cpuidle: add support for states that affect multiple cpus  (Colin Cross)
On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the cpus cannot be independently powered down, either due to sequencing restrictions (on Tegra 2, cpu 0 must be the last to power down), or due to HW bugs (on OMAP4460, a cpu powering up will corrupt the gic state unless the other cpu runs a work around). Each cpu has a power state that it can enter without coordinating with the other cpu (usually Wait For Interrupt, or WFI), and one or more "coupled" power states that affect blocks shared between the cpus (L2 cache, interrupt controller, and sometimes the whole SoC). Entering a coupled power state must be tightly controlled on both cpus.

The easiest solution to implementing coupled cpu power states is to hotplug all but one cpu whenever possible, usually using a cpufreq governor that looks at cpu load to determine when to enable the secondary cpus. This causes problems, as hotplug is an expensive operation, so the number of hotplug transitions must be minimized, leading to very slow response to loads, often on the order of seconds.

This file implements an alternative solution, where each cpu will wait in the WFI state until all cpus are ready to enter a coupled state, at which point the coupled state function will be called on all cpus at approximately the same time. Once all cpus are ready to enter idle, they are woken by an smp cross call. At this point, there is a chance that one of the cpus will find work to do, and choose not to enter idle. A final pass is needed to guarantee that all cpus will call the power state enter function at the same time. During this pass, each cpu will increment the ready counter, and continue once the ready counter matches the number of online coupled cpus. If any cpu exits idle, the other cpus will decrement their counter and retry.

To use coupled cpuidle states, a cpuidle driver must:

 * Set struct cpuidle_device.coupled_cpus to the mask of all coupled cpus, usually the same as cpu_possible_mask if all cpus are part of the same cluster. The coupled_cpus mask must be set in the struct cpuidle_device for each cpu.
 * Set struct cpuidle_device.safe_state to a state that is not a coupled state. This is usually WFI.
 * Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each state that affects multiple cpus.
 * Provide a struct cpuidle_state.enter function for each state that affects multiple cpus. This function is guaranteed to be called on all cpus at approximately the same time. The driver should ensure that the cpus all abort together if any cpu tries to abort once the function is called.

Cc: Len Brown <len.brown@intel.com> Cc: Amit Kucheria <amit.kucheria@linaro.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Trinabh Gupta <g.trinabh@gmail.com> Cc: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Tested-by: Kevin Hilman <khilman@ti.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
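A hedged sketch of the driver-side setup implied by the list above; it only restates the fields named in this message (coupled_cpus, CPUIDLE_FLAG_COUPLED, a coupled enter callback), and the exact field placement may differ in later revisions of the API:

#include <linux/cpuidle.h>
#include <linux/cpumask.h>

/* Illustrative only: state 0 is assumed to be the independent WFI (safe)
 * state and state 1 a coupled state shared by the whole cluster. */
static int example_coupled_enter(struct cpuidle_device *dev,
				 struct cpuidle_driver *drv, int index)
{
	/* Called on all coupled cpus at approximately the same time. */
	return index;
}

static void example_setup_coupled(struct cpuidle_device *dev,
				  struct cpuidle_driver *drv)
{
	/* Every per-cpu device carries the full mask of coupled cpus. */
	cpumask_copy(&dev->coupled_cpus, cpu_possible_mask);

	drv->states[1].flags |= CPUIDLE_FLAG_COUPLED;
	drv->states[1].enter = example_coupled_enter;
	/* State 0 (WFI) remains the non-coupled safe state, per the text above. */
}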
2012-07-25  cpuidle: fix error handling in __cpuidle_register_device  (Colin Cross)
Fix the error handling in __cpuidle_register_device to include the missing list_del. Move it to a label, which will simplify the error handling when coupled states are added. Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Tested-by: Kevin Hilman <khilman@ti.com> Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2012-07-25  cpuidle: refactor out cpuidle_enter_state  (Colin Cross)
Split the code to enter a state and update the stats into a helper function, cpuidle_enter_state, and export it. This function will be called by the coupled state code to handle entering the safe state and the final coupled state. Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Tested-by: Kevin Hilman <khilman@ti.com> Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2012-07-25  sched: cpu_power: enable ARCH_POWER  (Vincent Guittot)
Heterogeneous ARM platforms use the arch_scale_freq_power function to reflect the relative capacity of each core. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
2012-07-25  sched, x86: Remove broken power estimation  (Peter Zijlstra)
The x86 sched power implementation has been broken forever and gets in the way of other stuff, remove it. For archaeological interest, fixing this code would require dealing with the cross-cpu calling of these functions and more importantly, we need to filter idle time out of the a/m-perf stuff because the ratio will go down to 0 when idle, giving a 0 capacity which is not what we'd want. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-wjjwelpti8f8k7i1pdnzmdr8@git.kernel.org
2012-07-25  ARM: topology: Update cpu_power according to DT information  (Vincent Guittot)
Use the cpu compatibility field and clock-frequency field of the DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch makes it possible to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
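A sketch of the idea under stated assumptions: read each cpu node's clock-frequency property and scale it into a cpu_power-style capacity. The scaling policy and names are illustrative; the real patch also weights by the core type taken from the compatible string:

#include <linux/of.h>
#include <linux/sched.h>

/* Illustrative: derive a relative capacity from a cpu node's
 * clock-frequency property, scaled against an assumed 1 GHz reference. */
static unsigned long example_cpu_capacity(struct device_node *cn)
{
	u32 freq_hz = 0;

	if (of_property_read_u32(cn, "clock-frequency", &freq_hz))
		return SCHED_POWER_SCALE;	/* property missing: default power */

	return (unsigned long)freq_hz / (1000000000UL / SCHED_POWER_SCALE);
}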
2012-07-25  ARM: topology: factorize the update of sibling masks  (Vincent Guittot)
The factorization has also been proposed in another patch that has not been merged yet: http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html So, this patch could be dropped depending of the state of the other one. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
2012-07-25  ARM: topology: Add arch_scale_freq_power function  (Vincent Guittot)
Add infrastructure to be able to modify the cpu_power of each core Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
2012-07-25  workqueue: fix spurious CPU locality WARN from process_one_work()  (Tejun Heo)
25511a4776 "workqueue: reimplement CPU online rebinding to handle idle workers" added a CPU locality sanity check in process_one_work(). It triggers if a worker is executing on a different CPU without UNBOUND or REBIND set. This works for all normal workers but rescuers can trigger this spuriously when they're serving the unbound or a disassociated global_cwq - rescuers don't have either flag set and thus their gcwq->cpu can be a different value including %WORK_CPU_UNBOUND. Fix it by additionally testing %GCWQ_DISASSOCIATED. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Reference: <20120721213656.GA7783@linux.vnet.ibm.com>
2012-07-25  kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed  (Tejun Heo)
kthread_worker provides a minimalistic workqueue-like interface for users which need a dedicated worker thread (e.g. for realtime priority). It has basic queue, flush_work, flush_worker operations which mostly match the workqueue counterparts; however, due to the way flush_work() is implemented, it has a noticeable difference of not allowing work items to be freed while being executed. While the current users of kthread_worker are okay with the current behavior, the restriction does impede some valid use cases. Also, removing this difference isn't difficult and actually makes the code easier to understand. This patch reimplements flush_kthread_work() such that it uses a flush_work item instead of queue/done sequence numbers. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-25  kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation  (Tejun Heo)
Make the following two non-functional changes.

 * Separate out insert_kthread_work() from queue_kthread_work().
 * Relocate struct kthread_flush_work and kthread_flush_work_fn() definitions above flush_kthread_work().

v2: Added lockdep_assert_held() in insert_kthread_work() as suggested by Andy Walls. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andy Walls <awalls@md.metrocast.net>
2012-07-25  workqueue: simplify CPU hotplug code  (Tejun Heo)
With trustee gone, CPU hotplug code can be simplified.

 * gcwq_claim/release_management() now grab and release gcwq lock too respectively and gained _and_lock and _and_unlock postfixes.
 * All CPU hotplug logic was implemented in workqueue_cpu_callback() which was called by workqueue_cpu_up/down_callback() for the correct priority. This was because up and down paths shared a lot of logic, which is no longer true. Remove workqueue_cpu_callback() and move all hotplug logic into the two actual callbacks.

This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-25  workqueue: remove CPU offline trustee  (Tejun Heo)
With the previous changes, a disassociated global_cwq now can run as an unbound one on its own - it can create workers as necessary to drain remaining works after the CPU has been brought down and manage the number of workers using the usual idle timer mechanism making trustee completely redundant except for the actual unbinding operation. This patch removes the trustee and lets a disassociated global_cwq manage itself. Unbinding is moved to a work item (for CPU affinity) which is scheduled and flushed from CPU_DOWN_PREPARE. This patch moves nr_running clearing outside gcwq and manager locks to simplify the code. As nr_running is unused at that point, this is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-25  workqueue: don't butcher idle workers on an offline CPU  (Tejun Heo)
Currently, during CPU offlining, after all pending work items are drained, the trustee butchers all workers. Also, on CPU onlining failure, workqueue_cpu_callback() ensures that the first idle worker is destroyed. Combined, these guarantee that an offline CPU doesn't have any worker for it once all the lingering work items are finished. This guarantee isn't really necessary and makes CPU on/offlining more expensive than it needs to be, especially for platforms which use CPU hotplug for powersaving. This patch removes idle worker butchering from the trustee and lets a CPU which failed onlining keep the created first worker. The first worker is created if the CPU doesn't have any during CPU_DOWN_PREPARE and started right away. If onlining succeeds, the rebind_workers() call in CPU_ONLINE will rebind it like any other workers. If onlining fails, the worker is left alone till the next try. This makes CPU hotplugs cheaper by allowing global_cwqs to keep workers across them and simplifies code. Note that trustee doesn't re-arm idle timer when it's done and thus the disassociated global_cwq will keep all workers until it comes back online. This will be improved by further patches. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-25  workqueue: reimplement CPU online rebinding to handle idle workers  (Tejun Heo)
Currently, if there are left workers when a CPU is being brought back online, the trustee kills all idle workers and schedules rebind_work so that they re-bind to the CPU after the currently executing work is finished. This works for busy workers because concurrency management doesn't try to wake them up from scheduler callbacks, which require the target task to be on the local run queue. The busy worker bumps the concurrency counter appropriately as it clears WORKER_UNBOUND from the rebind work item and it's bound to the CPU before returning to the idle state. To reduce CPU on/offlining overhead (as many embedded systems use it for powersaving) and simplify the code path, workqueue is planned to be modified to retain idle workers across CPU on/offlining. This patch reimplements CPU online rebinding such that it can also handle idle workers. As noted earlier, due to the local wakeup requirement, rebinding idle workers is tricky. All idle workers must be re-bound before scheduler callbacks are enabled. This is achieved by interlocking idle re-binding. Idle workers are requested to re-bind and then hold until all idle re-binding is complete so that no bound worker starts executing work items. Only after all idle workers are re-bound and parked, CPU_ONLINE proceeds to release them and queue the rebind work item to busy workers thus guaranteeing scheduler callbacks aren't invoked until all idle workers are ready. worker_rebind_fn() is renamed to busy_worker_rebind_fn() and idle_worker_rebind() for idle workers is added. Rebinding logic is moved to rebind_workers() and now called from CPU_ONLINE after flushing trustee. While at it, add CPU sanity check in worker_thread(). Note that now a worker may become idle or the manager between trustee release and rebinding during CPU_ONLINE. As the previous patch updated create_worker() so that it can be used by a regular manager while unbound and this patch implements idle re-binding, this is safe. This prepares for removal of trustee and keeping idle workers across CPU hotplugs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-25  workqueue: drop @bind from create_worker()  (Tejun Heo)
Currently, create_worker()'s callers are responsible for deciding whether the newly created worker should be bound to the associated CPU and create_worker() sets WORKER_UNBOUND only for the workers for the unbound global_cwq. Creation during normal operation is always via maybe_create_worker() and @bind is true. For workers created during hotplug, @bind is false. Normal operation path is planned to be used even while the CPU is going through hotplug operations or offline and this static decision won't work. Drop @bind from create_worker() and decide whether to bind by looking at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND automatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED while create_worker() is in progress, the flag is now allowed to be changed only while holding all manager_mutexes on the global_cwq. This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's back. CPU_ONLINE no longer clears DISASSOCIATED before flushing trustee, which clears DISASSOCIATED before rebinding remaining workers if asked to release. For cases where trustee isn't around, CPU_ONLINE clears DISASSOCIATED after flushing trustee. Also, now, first_idle has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE while binding it. These convolutions will soon be removed by further simplification of CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2012-07-25  workqueue: use mutex for global_cwq manager exclusion  (Tejun Heo)
POOL_MANAGING_WORKERS is used to ensure that at most one worker takes the manager role at any given time on a given global_cwq. The trustee later hitched onto it to assume the manager role, adding a blocking wait for the bit. As trustee already needed a custom wait mechanism, waiting for MANAGING_WORKERS was rolled into the same mechanism. Trustee is scheduled to be removed. This patch separates out the MANAGING_WORKERS wait into a per-pool mutex. Workers use mutex_trylock() to test for the manager role and trustee uses mutex_lock() to claim manager roles. gcwq_claim/release_management() helpers are added to grab and release manager roles of all pools on a global_cwq. gcwq_claim_management() always grabs pool manager mutexes in ascending pool index order and uses pool index as lockdep subclass. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
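A minimal sketch of the try-lock pattern described here, with illustrative names: ordinary workers only trylock, so at most one becomes manager without ever blocking, while the exclusion path takes the mutex outright:

#include <linux/mutex.h>
#include <linux/types.h>

static DEFINE_MUTEX(example_pool_manager_mutex);

static bool example_try_become_manager(void)
{
	/* Workers: non-blocking attempt; failure means someone else manages. */
	return mutex_trylock(&example_pool_manager_mutex);
}

static void example_claim_management(void)
{
	/* Hotplug path: blocking claim excludes any manager. */
	mutex_lock(&example_pool_manager_mutex);
}

static void example_release_management(void)
{
	mutex_unlock(&example_pool_manager_mutex);
}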
2012-07-25  workqueue: ROGUE workers are UNBOUND workers  (Tejun Heo)
Currently, WORKER_UNBOUND is used to mark workers for the unbound global_cwq and WORKER_ROGUE is used to mark workers for disassociated per-cpu global_cwqs. Both are used to make the marked worker skip concurrency management and the only place they make any difference is in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling idle timer, which can easily be replaced with trustee state testing. This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops WORKER_ROGUE. This is to prepare for removing trustee and handling disassociated global_cwqs as unbound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>