eas-backports.git - Energy Aware Scheduler backports

Age	Commit message	Author
2014-12-04	sched/fair: Make calculate_imbalance() independent	Peter Zijlstra
	Rik noticed that calculate_imbalance() relies on update_sd_pick_busiest() to guarantee that busiest->sum_nr_running > busiest->group_capacity_factor. Break this implicit assumption (with the intent of not providing it anymore) by having calculate_imbalance() verify it and not rely on others. Reported-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140729152631.GW12054@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 743cb1ff191f00fee653212bdbcee1e56086d6ce) Signed-off-by: Alex Shi <alex.shi@linaro.org>
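The idea of checking the condition locally, instead of trusting the caller, can be sketched in plain C. This is a minimal user-space illustration and not the kernel's code: `struct group_stats` and `tasks_above_capacity()` are hypothetical names, and only the two fields named in the commit message are modelled.

```c
/* Hypothetical, simplified stand-in for the scheduler's per-group statistics. */
struct group_stats {
	unsigned int sum_nr_running;        /* runnable tasks in the group */
	unsigned int group_capacity_factor; /* number of tasks the group can hold */
};

/*
 * Return how many tasks exceed the group's capacity factor.
 * The comparison is made here rather than assumed from the caller,
 * so the unsigned subtraction can never wrap around.
 */
unsigned int tasks_above_capacity(const struct group_stats *busiest)
{
	if (busiest->sum_nr_running > busiest->group_capacity_factor)
		return busiest->sum_nr_running - busiest->group_capacity_factor;
	return 0;
}
```

Without the comparison, a group with fewer tasks than its capacity factor would produce a very large unsigned value, which is the kind of hidden dependency the commit removes.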
2014-12-04	sched: Robustify topology setup	Peter Zijlstra
	We hard assume that higher topology levels are supersets of lower levels. Detect, warn and try to fixup when we encounter this violated. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Josh Boyer <jwboyer@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Bruno Wolff III <bruno@wolff.to> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140722094740.GJ12054@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 6ae72dff37596f736b795426306404f0793e4b1b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
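The superset assumption and a possible fixup can be sketched as follows. This is a user-space illustration under stated assumptions: each topology level is modelled as a 64-bit CPU mask, and `span_is_subset()` and `fixup_topology_spans()` are hypothetical helpers, not kernel functions. The fixup shown here (widening the higher level) is one plausible repair, not necessarily what the patch does.

```c
#include <stddef.h>
#include <stdint.h>

/* True if every CPU in `child` is also present in `parent`. */
int span_is_subset(uint64_t child, uint64_t parent)
{
	return (child & ~parent) == 0;
}

/*
 * Walk the levels from lowest (index 0) to highest. Whenever a level does
 * not cover the level below it, count the violation and widen the span so
 * that the superset assumption holds again. Returns the number of fixups.
 */
size_t fixup_topology_spans(uint64_t *span, size_t nr_levels)
{
	size_t fixups = 0;

	for (size_t i = 1; i < nr_levels; i++) {
		if (!span_is_subset(span[i - 1], span[i])) {
			span[i] |= span[i - 1]; /* make the higher level a superset */
			fixups++;
		}
	}
	return fixups;
}
```

A caller would typically log a warning whenever the returned count is non-zero, since a violation points at broken architecture topology data.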
2014-12-04	ARM64: add IPI tracepoints	Nicolas Pitre
	The strings used to list IPIs in /proc/interrupts are reused for tracing purposes. While at it, the code is slightly cleaned up so the ipi_types array indices are no longer offset by IPI_RESCHEDULE whose value is 0 anyway. Link: http://lkml.kernel.org/p/1406318733-26754-5-git-send-email-nicolas.pitre@linaro.org Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (cherry picked from commit 45ed695ac10a23cb4e60a3e0b68b3f21a8670670) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/arm64/kernel/smp.c
2014-12-04	arm64: Support arch_irq_work_raise() via self IPIs	Larry Bassel
	Support for arch_irq_work_raise() was missing from arm64 (a prerequisite for FULL_NOHZ). This patch is based on the arm32 patch ARM 7872/1. commit bf18525fd793101df42a1344ecc48b49b62e48c9 Author: Stephen Boyd <sboyd@codeaurora.org> Date: Tue Oct 29 20:32:56 2013 +0100 ARM: 7872/1: Support arch_irq_work_raise() via self IPIs By default, IRQ work is run from the tick interrupt (see irq_work_run() in update_process_times()). When we're in full NOHZ mode, restarting the tick requires the use of IRQ work and if the only place we run IRQ work is in the tick interrupt we have an unbreakable cycle. Implement arch_irq_work_raise() via self IPIs to break this cycle and get the tick started again. Note that we implement this via IPIs which are only available on SMP builds. This shouldn't be a problem because full NOHZ is only supported on SMP builds anyway. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Kevin Hilman <khilman@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Larry Bassel <larry.bassel@linaro.org> Reviewed-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit eb631bb5bf5b042202aaaee4a8dd8f863ba2a900) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	arm64: enable generic clockevent broadcast	Lorenzo Pieralisi
	On platforms with power management capabilities, timers that are shut down when a CPU enters deep C-states must be emulated using an always-on timer and a timer IPI to relay the timer IRQ to target CPUs on an SMP system. This patch enables the generic clockevents broadcast infrastructure for arm64, by providing the required Kconfig entries and adding the timer IPI infrastructure. Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> (cherry picked from commit 1f85008e74768a88e1ddb96cc1fe45bb2378166c) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/arm64/Kconfig
2014-12-04	ARM: add IPI tracepoints	Nicolas Pitre
	The strings used to list IPIs in /proc/interrupts are reused for tracing purposes. While at it, prevent a negative ipinr from escaping the range check in handle_IPI(). Link: http://lkml.kernel.org/p/1406318733-26754-4-git-send-email-nicolas.pitre@linaro.org Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (cherry picked from commit 365ec7b17327329efc71276722ca8db3f21f2edd) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/arm/kernel/smp.c
2014-12-04	tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks	Steven Rostedt (Red Hat)
	Being able to show a cpumask of events can be useful as some events may affect only some CPUs. There is no standard way to record the cpumask and converting it to a string is rather expensive during the trace as traces happen in hotpaths. It would be better to record the raw event mask and be able to parse it at print time. The following macros were added for use with the TRACE_EVENT() macro: __bitmask() __assign_bitmask() __get_bitmask() To test this, I added this to the sched_migrate_task event, which looked like this: TRACE_EVENT(sched_migrate_task, TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus), TP_ARGS(p, dest_cpu, cpus), TP_STRUCT__entry( __array( char, comm, TASK_COMM_LEN ) __field( pid_t, pid ) __field( int, prio ) __field( int, orig_cpu ) __field( int, dest_cpu ) __bitmask( cpumask, num_possible_cpus() ) ), TP_fast_assign( memcpy(__entry->comm, p->comm, TASK_COMM_LEN); __entry->pid = p->pid; __entry->prio = p->prio; __entry->orig_cpu = task_cpu(p); __entry->dest_cpu = dest_cpu; __assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus()); ), TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s", __entry->comm, __entry->pid, __entry->prio, __entry->orig_cpu, __entry->dest_cpu, __get_bitmask(cpumask)) ); With the output of: ksmtuned-3613 [003] d..2 485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f migration/1-13 [001] d..5 485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f awk-3615 [002] d.H5 485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff migration/2-18 [002] d..5 485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home Suggested-by: Javi Merino <javi.merino@arm.com> Tested-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (cherry picked from commit 4449bf927b61bdb4389393c6fea6837214d1ace7) Signed-off-by: Alex Shi <alex.shi@linaro.org>
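The `cpumask=00000000,0000000f` text in the sample output is the raw mask rendered at print time as comma-separated 32-bit hex groups, most significant word first. A rough user-space sketch of that rendering follows. `format_bitmask()` is a hypothetical helper written for illustration only; it is not the kernel's print-time implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render words[nwords-1] .. words[0] as 8-digit hex groups separated by
 * commas, e.g. {0xf, 0x0} becomes "00000000,0000000f".
 * Returns the number of characters written (excluding the NUL); the
 * output is cut at a group boundary if the buffer is too small.
 */
size_t format_bitmask(const uint32_t *words, size_t nwords,
		      char *buf, size_t len)
{
	size_t off = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = nwords; i-- > 0; ) {
		int n = snprintf(buf + off, len - off, "%08x%s",
				 (unsigned int)words[i], i ? "," : "");
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0'; /* drop the partially written group */
			break;
		}
		off += (size_t)n;
	}
	return off;
}
```

Deferring this string conversion to print time is what keeps the cost out of the tracing hot path: the trace record only stores the raw words.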
2014-12-04	ARM: SMP: basic IPI triggered completion support	Nicolas Pitre
	We need a mechanism to let an inbound CPU signal that it is alive before even getting into the kernel environment i.e. from early assembly code. Using an IPI is the simplest way to achieve that. This adds some basic infrastructure to register a struct completion pointer to be "completed" when the dedicated IPI for this task is received. Signed-off-by: Nicolas Pitre <nico@linaro.org> (cherry picked from commit 5135d875e1457ef946a055003d8f80713e862135) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/arm/kernel/smp.c
2014-12-04	ARM: 7872/1: Support arch_irq_work_raise() via self IPIs	Stephen Boyd
	By default, IRQ work is run from the tick interrupt (see irq_work_run() in update_process_times()). When we're in full NOHZ mode, restarting the tick requires the use of IRQ work and if the only place we run IRQ work is in the tick interrupt we have an unbreakable cycle. Implement arch_irq_work_raise() via self IPIs to break this cycle and get the tick started again. Note that we implement this via IPIs which are only available on SMP builds. This shouldn't be a problem because full NOHZ is only supported on SMP builds anyway. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Kevin Hilman <khilman@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> (cherry picked from commit bf18525fd793101df42a1344ecc48b49b62e48c9) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	tracepoint: add generic tracepoint definitions for IPI tracing	Nicolas Pitre
	The Inter Processor Interrupt is used to make another processor do a specific action such as rescheduling tasks, signal a timer event or execute something in another CPU's context. IRQs are already traceable but IPIs were not. Tracing them is useful for monitoring IPI latency, or to verify when they are the source of CPU wake-ups with power management implications. Three trace hooks are defined: ipi_raise, ipi_entry and ipi_exit. To make them portable, a string is used to identify them and correlate related events. Additionally, ipi_raise records a bitmask representing targeted CPUs. Link: http://lkml.kernel.org/p/1406318733-26754-3-git-send-email-nicolas.pitre@linaro.org Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (cherry picked from commit f6d9804d145b9c42dbbabefdda208a6a492b2236) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	tracing: Do not do anything special with tracepoint_string when tracing is disabled	Steven Rostedt
	When CONFIG_TRACING is not enabled, there's no reason to save the trace strings either by the linker or as a static variable that can be referenced later. Simply pass back the string that is given to tracepoint_string(). Had to move the define to include/linux/tracepoint.h so that it is still visible when CONFIG_TRACING is not set. Link: http://lkml.kernel.org/p/1406318733-26754-2-git-send-email-nicolas.pitre@linaro.org Suggested-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (cherry picked from commit 3c49b52b155d0f723792377e1a4480a0e7ca0ba2) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	cpuidle: fix permission for driver name sysfs node	Mohammad Merajul Islam Molla
	cpuidle driver name sysfs node is read-only, so permissions should be 0444. Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 4f8eea9b9ff464ce93ab10d72993755b7d86d587) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	cpuidle: big.LITTLE: init driver for exynos5420	Chander Kashyap
	Add "samsung,exynos5420" compatible string to initialize generic big-little cpuidle driver for Exynos5420. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tomasz Figa <t.figa@samsung.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> (cherry picked from commit 64a3c4caa91c72a00ba2e464a0b2a0a5ce7a312b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	cpuidle: big.LITTLE: Add ARCH_EXYNOS entry in config	Chander Kashyap
	Add support to select generic big-little cpuidle driver for Samsung Exynos series SoCs. This is required for Exynos big-little SoCs, e.g. Exynos5420. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tomasz Figa <t.figa@samsung.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> (cherry picked from commit 2aaafcdb68830cb849a08e0ff57f7ca1cffde57d) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	cpuidle: big.LITTLE: add of_device_id structure	Chander Kashyap
	This driver will be used by many big.LITTLE SoCs. As of now it does string matching of a hardcoded compatible string to init the driver. This comparison list will keep on growing with the addition of new SoCs. Hence add an of_device_id structure to collect the compatible strings of SoCs using this driver. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tomasz Figa <t.figa@samsung.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> (cherry picked from commit e2e54362d9c8c1e8d52ff576a4e0f6e61f569356) Signed-off-by: Alex Shi <alex.shi@linaro.org>
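The move from a hardcoded string comparison to a match table can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: `struct compat_entry` only mimics the shape of a sentinel-terminated match table, `soc_is_supported()` is a hypothetical helper, and `"vendor,example-soc"` is an invented entry. Only `"samsung,exynos5420"` comes from this log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device-tree match table entry. */
struct compat_entry {
	const char *compatible;
};

/* Supporting a new SoC means adding one line here; the lookup is unchanged. */
static const struct compat_entry supported_socs[] = {
	{ "samsung,exynos5420" },
	{ "vendor,example-soc" }, /* invented entry, for illustration only */
	{ NULL },                 /* sentinel terminates the table */
};

/* Return true if the machine's compatible string appears in the table. */
bool soc_is_supported(const char *machine_compatible)
{
	for (const struct compat_entry *e = supported_socs; e->compatible; e++) {
		if (strcmp(e->compatible, machine_compatible) == 0)
			return true;
	}
	return false;
}
```

The design benefit is that the init path no longer grows a new comparison per SoC; the data lives in one table.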
2014-12-04	cpuidle: Remove time measurement in poll state	Daniel Lezcano
	The time measurement is already done in the cpuidle framework in the 'cpuidle_enter_state' function. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit dd38c9d35ba8e40011b36659cae2719aefd11904) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	cpuidle: move idle traces to cpuidle_enter_state()	Sandeep Tripathy
	idle_exit event is the first event after a core exits idle state. So this should be traced before local irq is enabled. Likewise idle_entry is the last event before a core enters idle state. This will ease visualising the cpu idle state from kernel traces. Signed-off-by: Sandeep Tripathy <sandeep.tripathy@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> [rjw: Subject, rebase] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 30fe6884021b9fa0124609e898a6341be188eb44) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-04	sched: Fix CACHE_HOT_BUDDY condition	Hillf Danton
	When computing cache hot, we should check if the migration dst cpu is idle, instead of the current cpu. Though they are same in normal balancing, that is false nowadays in nohz idle balancing at least. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140607090452.4696E301D2@webmail.sinamail.sina.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 5d5e2b1bcbdc996e72815c03fdc5ea82c4642397) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/fair.c
2014-12-03	sched: Rename capacity related flags	Nicolas Pitre
	It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Let's rename the following feature flags since they do relate to capacity: SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY ARCH_POWER -> ARCH_CAPACITY NONTASK_POWER -> NONTASK_CAPACITY Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Andy Fleming <afleming@freescale.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Likely <grant.likely@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: devicetree@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 5d4dfddd4f02b028d6ddaaa04d75d3b0cad1c9ae) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	of: move of_get_cpu_node implementation to DT core library	Sudeep KarkadaNagesha
	This patch moves the generalized implementation of of_get_cpu_node from PowerPC to DT core library, thereby adding support for retrieving cpu node for a given logical cpu index on any architecture. The CPU subsystem can now use this function to assign of_node in the cpu device while registering CPUs. It is recommended to use this helper function only in pre-SMP/early initialisation stages to retrieve CPU device node pointers in logical ordering. Once the cpu devices are registered, it can be retrieved easily from cpu device of_node which avoids unnecessary parsing and matching. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Likely <grant.likely@linaro.org> Acked-by: Rob Herring <rob.herring@calxeda.com> Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com> (cherry picked from commit 183912d352a242a276a7877852f107459a13aff9) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: arch/powerpc/kernel/prom.c
2014-12-03	sched: Final power vs. capacity cleanups	Nicolas Pitre
	It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This contains the architecture visible changes. Incidentally, only ARM takes advantage of the available pow^H^H^Hcapacity scaling hooks and therefore those changes outside kernel/sched/ are confined to one ARM specific file. The default arch_scale_smt_power() hook is not overridden by anyone. Replacements are as follows: arch_scale_freq_power --> arch_scale_freq_capacity arch_scale_smt_power --> arch_scale_smt_capacity SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT The local usage of "power" in arch/arm/kernel/topology.c is also changed to "capacity" as appropriate. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: devicetree@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-48zba9qbznvglwelgq2cfygh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit ca8ce3d0b144c318a5a9ce99649053e9029061ea) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/fair.c
2014-12-03	ARM: 7920/1: topology: Staticise non-exported symbols	Mark Brown
	These symbols are only referenced in this source file so can be made static, and the efficiency table is constant data so can be declared as such. This avoids polluting the global namespace and fixes warnings from sparse. The function arch_scale_freq_power() is still not prototyped or static, this is a separate issue as this is overriding a weak symbol from the scheduler which neglects to provide a prototype. Signed-off-by: Mark Brown <broonie@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> (cherry picked from commit 145bc292dce9dbdface2acf1e7e1f175729fb5fb) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	ARM: topology: remove hwid/MPIDR dependency from cpu_capacity	Sudeep KarkadaNagesha
	Currently the topology code computes cpu capacity and stores it in the list along with hwid (which is MPIDR) as it parses the CPU nodes in the device tree. This is required as it needs to be mapped to the logical CPU later. Since the CPU device nodes can be retrieved in the logical ordering using DT/OF helpers, it's possible to store cpu_capacity also in logical ordering and avoid storing hwid for each entry. This patch removes hwid by making use of of_get_cpu_node. Cc: Russell King <linux@arm.linux.org.uk> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com> (cherry picked from commit 816a8de0017f16c32e747abc5367bf379515b20a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched: Remove remaining dubious usage of "power"	Nicolas Pitre
	It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This is the remaining "power" -> "capacity" rename for local symbols. Those symbols visible to the rest of the kernel are not included yet. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit ced549fa5fc1fdaf7fac93894dc673092eb3dc20) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/fair.c
2014-12-03	sched: Let 'struct sched_group_power' care about CPU capacity	Nicolas Pitre
	It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 63b2ca30bdb3dbf60bc7ac5f46713c0d32308261) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/fair.c
2014-12-03	sched/fair: Disambiguate existing/remaining "capacity" usage	Nicolas Pitre
	We have "power" (which should actually become "capacity") and "capacity" which is a scaled down "capacity factor" in terms of unitary tasks. Let's use "capacity_factor" to make room for proper usage of "capacity" later. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-gk1co8sqdev3763opqm6ovml@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 0fedc6c8e34f4ce0b37b1f25c3619b4a8faa244c) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/fair.c
2014-12-03	sched/fair: Rework Change "has_capacity" to "has_free_capacity"	Nicolas Pitre
	The capacity of a CPU/group should be some intrinsic value that doesn't change with task placement. It is like a container whose capacity is stable regardless of the amount of liquid in it (its "utilization")... unless the container itself is crushed that is, but that's another story. Therefore let's rename "has_capacity" to "has_free_capacity" in order to better convey the wanted meaning. Alex removed the numa part changes. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linaro-kernel@lists.linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/n/tip-djzkk027jm0e8x8jxy70opzh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (rework commit from 1b6a749) Signed-off-by: Alex Shi <alex.shi@linaro.org>
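The container analogy can be made concrete with a tiny sketch: the capacity factor is fixed, and "free capacity" is derived by comparing it with the current number of running tasks. This is a simplified user-space illustration; `struct group_stats` and `group_has_free_capacity()` are hypothetical and only loosely follow the field names used elsewhere in this log.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-group statistics. */
struct group_stats {
	unsigned int sum_nr_running;        /* the "liquid": tasks placed so far */
	unsigned int group_capacity_factor; /* the "container": fixed size */
};

/*
 * The capacity factor itself never changes with task placement; only the
 * answer to "is there room left?" does.
 */
bool group_has_free_capacity(const struct group_stats *sgs)
{
	return sgs->group_capacity_factor > sgs->sum_nr_running;
}
```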
2014-12-03	sched/balancing: Reduce the rate of needless idle load balancing	Tim Chen
	The current no_hz idle load balancer does load balancing for all idle cpus, even though the time due to load balance for a particular idle cpu could be still a while in the future. This introduces a much higher load balancing rate than what is necessary. The patch changes the behavior by doing idle load balancing on behalf of an idle cpu only when it is due for load balancing. On SGI's systems with over 3000 cores, the cpu responsible for idle balancing got overwhelmed with idle balancing, and introduced a lot of OS noise to workloads. This patch fixes the issue. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Russ Anderson <rja@sgi.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Len Brown <len.brown@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Hedi Berriche <hedi@sgi.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit ed61bbc69c773465782476c7e5869fa5607fa73a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
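The "only when it is due" rule can be sketched as a per-CPU deadline check. This is a user-space illustration under assumptions: `next_balance[]` holds a hypothetical per-idle-CPU deadline in ticks, `tick_after_eq()` is a wraparound-tolerant comparison in the style of the kernel's jiffies helpers, and `count_due_idle_cpus()` stands in for the loop that would actually run the balancer.

```c
#include <stddef.h>

/*
 * Wraparound-tolerant "a is at or after b" for a free-running tick counter.
 * Relies on the usual two's-complement conversion to signed long.
 */
static int tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Visit every idle CPU, but only count (i.e. balance on behalf of) those
 * whose deadline has been reached. CPUs that are not yet due are skipped.
 */
size_t count_due_idle_cpus(const unsigned long *next_balance,
			   size_t nr_idle, unsigned long now)
{
	size_t balanced = 0;

	for (size_t i = 0; i < nr_idle; i++) {
		if (tick_after_eq(now, next_balance[i]))
			balanced++; /* real code would rebalance this CPU here */
	}
	return balanced;
}
```

With thousands of idle CPUs, skipping the ones that are not yet due is what keeps the balancing CPU from being overwhelmed.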
2014-12-03	sched/fair: Fix unlocked reads of some cfs_b->quota/period	Ben Segall
	sched_cfs_period_timer() reads cfs_b->period without locks before calling do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs() would read cfs_b->period without the right lock. Thus a simultaneous change of bandwidth could cause corruption on any platform where ktime_t or u64 writes/reads are not atomic. Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just use 1 rather than the exact quota, much like distribute_cfs_runtime() does. There is also an unlocked read of cfs_b->runtime_expires, but a race there would only delay runtime expiry by a tick. Still, the comparison should just be != anyway, which clarifies even that problem. Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Roman Gushchin <klamm@yandex-team.ru> [peterz: Fix compile warn] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com Cc: pjt@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 51f2176d74ace4c3f58579a605ef5a9720befb00) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched, powerpc: Create a dedicated topology table	Vincent Guittot
	Create a dedicated topology table for handling asymmetric feature of powerpc. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andy Fleming <afleming@freescale.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Likely <grant.likely@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Preeti U. Murthy <preeti@linux.vnet.ibm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: schwidefsky@de.ibm.com Cc: cmetcalf@tilera.com Cc: dietmar.eggemann@arm.com Cc: devicetree@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1397209481-28542-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 607b45e9a216e89a63351556e488eea06be0ff48) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()	Dietmar Eggemann
	There is no need to zero struct sched_group member cpumask and struct sched_group_power member power since both structures are already allocated as zeroed memory in __sdt_alloc(). This patch has been tested with BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power); in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including CPU hotplug scenarios. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit caffcdd8d27ba78730d5540396ce72ad022aff2c) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: replace smp_mb__after_atomic function in cpu_idle_loop	Alex Shi
	The function is used widely in drivers, so it is better not to use it in the backport. Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: Optimize try-to-wake-up IPI	Peter Zijlstra
	[ This series reduces the number of IPIs on Andy's workload by something like 99%. It's down from many hundreds per second to very few. The basic idea behind this series is to make TIF_POLLING_NRFLAG be a reliable indication that the idle task is polling. Once that's done, the rest is reasonably straightforward. ] When enqueueing tasks on remote LLC domains, we send an IPI to do the work 'locally' and avoid bouncing all the cachelines over. However, when the remote CPU is idle (and polling, say x86 mwait), we don't need to send an IPI, we can simply kick the TIF word to wake it up and have the 'idle' loop do the work. So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet) set, set _TIF_NEED_RESCHED and avoid sending the IPI. Much-requested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: umgwanakikbuti@gmail.com Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit e3baac47f0e82c4be632f4f97215bb93bf16b342) Signed-off-by: Alex Shi <alex.shi@linaro.org>
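The rule "if the polling flag is set and need-resched is not, set need-resched and skip the IPI" can be sketched with C11 atomics. This is a user-space analogue and not the kernel's code: `FLAG_POLLING` and `FLAG_NEED_RESCHED` are hypothetical stand-ins for `_TIF_POLLING_NRFLAG` and `_TIF_NEED_RESCHED`, and `set_need_resched_if_polling()` is an illustrative name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the thread-info flags. */
#define FLAG_NEED_RESCHED 0x1u
#define FLAG_POLLING      0x2u

/*
 * Returns true if the remote CPU was polling, in which case the
 * need-resched bit is guaranteed to be set on return and the caller can
 * skip the IPI. Returns false if the CPU was not polling, in which case
 * the flags are left untouched and an IPI is still required.
 */
bool set_need_resched_if_polling(atomic_uint *flags)
{
	unsigned int val = atomic_load(flags);

	for (;;) {
		if (!(val & FLAG_POLLING))
			return false; /* not polling: fall back to an IPI */
		if (val & FLAG_NEED_RESCHED)
			return true;  /* someone already kicked it */
		if (atomic_compare_exchange_weak(flags, &val,
						 val | FLAG_NEED_RESCHED))
			return true;
		/* CAS failed: val now holds the current flags, so retry. */
	}
}
```

The compare-and-swap loop matters because the polling bit can be cleared concurrently by the idle loop; the flag must only be set while polling is still observed in the same atomic step.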
2014-12-03	sched/idle: Simplify wake_up_idle_cpu()	Andy Lutomirski
	Now that rq->idle's polling bit is a reliable indication that the cpu is polling, use it. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: umgwanakikbuti@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/922f00761445a830ebb23d058e2ae53956ce2d73.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 67b9ca70c3030e832999e8d1cdba2984c7bb5bfc) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: Clear polling before descheduling the idle thread	Andy Lutomirski
	Currently, the only real guarantee provided by the polling bit is that, if you hold rq->lock and the polling bit is set, then you can set need_resched to force a reschedule. The only reason the lock is needed is that the idle thread might not be running at all when setting its need_resched bit, and rq->lock keeps it pinned. This is easy to fix: just clear the polling bit before scheduling. Now the idle thread's polling bit is only ever set when rq->curr == rq->idle. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: umgwanakikbuti@gmail.com Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/b2059fcb4c613d520cb503b6fad6e47033c7c203.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 82c65d60d64401aedc1006d6572469bbfdf148de) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched, trace: Add a tracepoint for IPI-less remote wakeups	Andy Lutomirski
	Remote wakeups of polling CPUs are a valuable performance improvement; add a tracepoint to make it much easier to verify that they're working. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: umgwanakikbuti@gmail.com Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit dfc68f29ae67f2a6e799b44e6a4eb3417dffbfcd) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: include/trace/events/sched.h
2014-12-03	cpuidle: Set polling in poll_idle	Andy Lutomirski
	poll_idle is the archetypal polling idle loop; tell the core idle code about it. This avoids pointless IPIs when all of the other cpuidle states are disabled. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: nicolas.pitre@linaro.org Cc: umgwanakikbuti@gmail.com Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org Link: http://lkml.kernel.org/r/c65ce49615d338bae8fb79df5daffab19353c900.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 84c407084137d4e491b07ea5ff8665d19106a5ac) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	cpuidle: declare cpuidle_dev in cpuidle.h	Paul Burton
	Declaring this allows drivers which need to initialise each struct cpuidle_device at initialisation time to make use of the structures already defined in cpuidle.c, rather than having to wastefully define their own. Signed-off-by: Paul Burton <paul.burton@imgtec.com> (cherry picked from commit f08dbf8a61462aa122b9b5077849a3f4bd84702a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: Make cpuidle_idle_call() void	Rafael J. Wysocki
	The only value ever returned by cpuidle_idle_call() is 0 and its only caller ignores that value anyway, so make it void. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/4717784.WmVEpDoliM@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 08c373e5123b4595588ae1a7aa7e00a046c61cc6) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: Reflow cpuidle_idle_call()	Peter Zijlstra
	Apply goto to reduce lines and nesting levels. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-cc6vb0snt3sr7op6rlbfeqfh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 37352273ad48f2d177ed1b06ced32d5536b773fb) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/idle.c
2014-12-03	sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock	Roman Gushchin
	tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to force the period timer restart. This is not safe, because it can lead to the deadlock described in commit 927b54fccbf0: "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock, waiting for the hrtimer to finish. However, if sched_cfs_period_timer runs for another loop iteration, the hrtimer can attempt to take rq->lock, resulting in deadlock." Three CPUs must be involved:
	CPU0                           CPU1                       CPU2
	take rq->lock                  period timer fired
	...                            take cfs_b lock
	...                            ...                        tg_set_cfs_bandwidth()
	throttle_cfs_rq()              release cfs_b lock         take cfs_b lock
	...                            distribute_cfs_runtime()   timer_active = 0
	take cfs_b->lock               wait for rq->lock          ...
	__start_cfs_bandwidth()
	{wait for timer callback
	 break if timer_active == 1}
	So, CPU0 and CPU1 are deadlocked. Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can wait for period timer callbacks (ignoring cfs_b->timer_active) and restart the timer explicitly. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru Cc: pjt@google.com Cc: chris.j.arges@canonical.com Cc: gregkh@linuxfoundation.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 09dc4ab03936df5c5aa711d27c81283c6d09f495) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/idle: Delay clearing the polling bit	Peter Zijlstra
	With the generic idle functions assuming !polling we should only clear the polling bit at the very last opportunity in order to avoid spurious IPIs. Ideally we'd flip the default to polling, but that means auditing all arch idle functions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-vq7719foqzf6z5h4j7eh7f9e@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit c444117f0f39d59733ec23da67c44424df529230) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/idle.c
2014-12-03	sched/idle: Avoid spurious wakeup IPIs	Peter Zijlstra
	Because mwait_idle_with_hints() gets called from !idle context it must call current_clr_polling(). This however means that resched_task() is very likely to send an IPI even when we were polling:
	CPU0                                    CPU1
	if (current_set_polling_and_test())
	        goto out;
	__monitor(&ti->flags);
	if (!need_resched())
	        __mwait(eax, ecx);
	                                        set_tsk_need_resched(p);
	                                        smp_mb();
	out:
	current_clr_polling();
	                                        if (!tsk_is_polling(p))
	                                                smp_send_reschedule(cpu);
	So while it is correct (extra IPIs aren't a problem, whereas a missed IPI would be) it is a performance problem (for some). Avoid this issue by using fetch_or() to atomically set NEED_RESCHED and test if POLLING_NRFLAG is set. Since a CPU stuck in mwait is unlikely to modify the flags word, contention on the cmpxchg is unlikely and thus we should mostly succeed in a single go. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nico@linaro.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-kf5suce6njh5xf5d3od13rr0@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit fd99f91aa007ba255aac44fe6cf21c1db398243a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
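The "atomically set NEED_RESCHED and test POLLING_NRFLAG" step described above can be illustrated with a small user-space sketch. It assumes C11 `atomic_fetch_or` as a stand-in for the kernel's fetch_or() (which the message implies is built on cmpxchg), and the flag values and names below are hypothetical, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for _TIF_NEED_RESCHED and
 * _TIF_POLLING_NRFLAG. */
#define SKETCH_NEED_RESCHED   (1u << 0)
#define SKETCH_POLLING_NRFLAG (1u << 1)

/*
 * Set NEED_RESCHED and, in the same atomic operation, observe whether the
 * target was polling. Because the set and the test are one step, the
 * target cannot clear its polling bit in between, which is the window
 * shown in the CPU0/CPU1 interleaving.
 * Returns true if the target was polling, i.e. no IPI is needed.
 */
static bool sketch_set_nr_and_test_polling(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_or(flags, SKETCH_NEED_RESCHED);

    return (old & SKETCH_POLLING_NRFLAG) != 0;
}
```

Unlike a conditional set, this variant always sets NEED_RESCHED; the return value only decides whether an IPI is sent afterwards.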
2014-12-03	sched/idle: Remove TS_POLLING support	Peter Zijlstra
	Now that there are no architectures left using it, kill the support for TS_POLLING. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/n/tip-6yurip2tfix2f4bfc5agu2s0@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 69dd0f848879328ae6c6f54c2ec80e49eef042d8) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched/fair: Stop searching for tasks in newidle balance if there are runnable tasks	Jason Low
	It was found that when running some workloads (such as AIM7) on large systems with many cores, CPUs do not remain idle for long. Thus, tasks can wake/get enqueued while doing idle balancing. In this patch, while traversing the domains in idle balance, in addition to checking for pulled_task, we add an extra check for this_rq->nr_running for determining if we should stop searching for tasks to pull. If there are runnable tasks on this rq, then we will stop traversing the domains. This reduces the chance that idle balance delays a task from running. This patch resulted in approximately a 6% performance improvement when running a Java Server workload on an 8 socket machine. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: daniel.lezcano@linaro.org Cc: alex.shi@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: efault@gmx.de Cc: vincent.guittot@linaro.org Cc: morten.rasmussen@arm.com Cc: aswin@hp.com Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/1398303035-18255-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 39a4d9ca77a31503c6317e49742341d0859d5cb2) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain	Vincent Guittot
	A new flag SD_SHARE_POWERDOMAIN is created to reflect whether or not groups of CPUs in a sched_domain level can reach different power states. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used in the load balance decision or to add a load balancing level between groups of CPUs that can power gate independently. This flag is part of the topology flags that can be set by arch. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: schwidefsky@de.ibm.com Cc: cmetcalf@tilera.com Cc: benh@kernel.crashing.org Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1397209481-28542-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit d77b3ed5c9f8ebedf154b52b5e943c461f3d37e6) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched: added SD_NUMA definition	Alex Shi
	Commit 143e1e2 ("sched: Rework sched_domain topology definition") moves the SD_NUMA definition out of CONFIG_NUMA, so we need it even when we don't use NUMA. Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	sched: Rework sched_domain topology definition	Vincent Guittot
	We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which either wants to add a new level where load balancing makes sense, like BOOK or a powergating level, or wants to change the flags configuration of some levels. For each level, we need a function pointer that returns the cpumask for each cpu, a function pointer that returns the flags for the level, and a name. Only flags that describe topology can be set by an architecture. The current topology flags are: SD_SHARE_CPUPOWER SD_SHARE_PKG_RESOURCES SD_NUMA SD_ASYM_PACKING Then, each level must be a subset of the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its children. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hanjun Guo <hanjun.guo@linaro.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux390@de.ibm.com Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit 143e1e28cb40bed836b0a06567208bd7347c9672) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: kernel/sched/core.c
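The per-level description in the message above (a cpumask function, a flags function and a name, with each level nested in the next) can be pictured with a small stand-alone sketch. Everything here is an assumption made for illustration: the struct layout, the `sketch_` names, the single-word bitmask used in place of a real cpumask, the flag value and the 4-CPU example platform are hypothetical and are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* One bit per CPU; a simplified stand-in for a kernel cpumask. */
typedef unsigned long sketch_mask_t;

/* Hypothetical flag value standing in for SD_SHARE_PKG_RESOURCES. */
#define SKETCH_SHARE_PKG_RESOURCES 0x1

/* One topology level: span function, flags function and a name. */
struct sketch_topology_level {
    sketch_mask_t (*mask)(int cpu);
    int (*flags)(void);
    const char *name;
};

/* Example platform: 4 CPUs arranged as two 2-CPU clusters. */
static sketch_mask_t sketch_mc_mask(int cpu)  { return cpu < 2 ? 0x3ul : 0xcul; }
static sketch_mask_t sketch_cpu_mask(int cpu) { (void)cpu; return 0xful; }
static int sketch_mc_flags(void) { return SKETCH_SHARE_PKG_RESOURCES; }
static int sketch_no_flags(void) { return 0; }

/* Table ordered from the smallest span to the largest, NULL terminated. */
static const struct sketch_topology_level sketch_topology[] = {
    { sketch_mc_mask,  sketch_mc_flags, "MC"  },
    { sketch_cpu_mask, sketch_no_flags, "CPU" },
    { NULL, NULL, NULL },
};

/* Check that, for every CPU, each level's span is contained in the
 * span of the level that follows it. */
static bool sketch_levels_are_nested(const struct sketch_topology_level *tl,
                                     int nr_cpus)
{
    for (; tl[0].mask && tl[1].mask; tl++)
        for (int cpu = 0; cpu < nr_cpus; cpu++)
            if (tl[0].mask(cpu) & ~tl[1].mask(cpu))
                return false;
    return true;
}
```

A platform wanting an extra level would, in this sketch, add one more row between "MC" and "CPU" with its own mask and flags functions.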
2014-12-03	cpuidle / menu: move repeated correction factor check to init	Chander Kashyap
	In the menu_select function we check the correction factor every time and, if it is zero, initialize it to unity. Move this check to the init function and initialise the factor to unity there, which avoids the repeated comparisons. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit bed4d597a0f99b380d24ab3a9da47b62cbf1ad0e) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2014-12-03	cpuidle / menu: Return (-1) if there are no suitable states	Rafael J. Wysocki
	If there is a PM QoS latency limit and all of the sufficiently shallow C-states are disabled, the cpuidle menu governor returns 0 which on some systems is CPUIDLE_DRIVER_STATE_START and shouldn't be returned if that C-state has been disabled. Fix the issue by modifying the menu governor to return (-1) in such situations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 3836785a1bdcd6706c68ad46bf53adc0b057b310) Signed-off-by: Alex Shi <alex.shi@linaro.org>