diff options
author | Jon Medhurst <tixy@linaro.org> | 2016-07-13 10:22:07 +0100 |
---|---|---|
committer | Jon Medhurst <tixy@linaro.org> | 2016-07-13 10:22:07 +0100 |
commit | 2a37c2e1bc185224cef4526918647f698908f9a3 (patch) | |
tree | d8b9bcceddaf10440233f05050ba5c172b82b69a | |
parent | 77cdc1ee445e211b95fc2cd964db3ea814991aa8 (diff) | |
parent | 6733f7207b94731023ed4b539108282162e5a526 (diff) |
Merge branch 'lsk-3.18-armlt-hmp' into lsk-3.18-armlt-hmp-testlsk-3.18-armlt-20160713-hmp-test
Conflicts:
arch/arm/common/bL_switcher.c
-rw-r--r-- | Documentation/arm/small_task_packing.txt | 136 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 9 | ||||
-rw-r--r-- | arch/arm/Kconfig | 98 | ||||
-rw-r--r-- | arch/arm/kernel/topology.c | 106 | ||||
-rw-r--r-- | arch/arm64/Kconfig | 91 | ||||
-rw-r--r-- | arch/arm64/kernel/topology.c | 122 | ||||
-rw-r--r-- | include/linux/sched.h | 33 | ||||
-rw-r--r-- | include/trace/events/sched.h | 276 | ||||
-rw-r--r-- | include/trace/events/smp.h | 90 | ||||
-rw-r--r-- | kernel/irq/irqdesc.c | 29 | ||||
-rw-r--r-- | kernel/sched/core.c | 44 | ||||
-rw-r--r-- | kernel/sched/debug.c | 3 | ||||
-rw-r--r-- | kernel/sched/fair.c | 2016 | ||||
-rw-r--r-- | kernel/sched/sched.h | 12 | ||||
-rw-r--r-- | kernel/smp.c | 11 | ||||
-rw-r--r-- | linaro/configs/big-LITTLE-MP.conf | 11 |
16 files changed, 3026 insertions, 61 deletions
diff --git a/Documentation/arm/small_task_packing.txt b/Documentation/arm/small_task_packing.txt new file mode 100644 index 000000000000..43f0a8b80234 --- /dev/null +++ b/Documentation/arm/small_task_packing.txt @@ -0,0 +1,136 @@ +Small Task Packing in the big.LITTLE MP Reference Patch Set + +What is small task packing? +---- +Simply that the scheduler will fit as many small tasks on a single CPU +as possible before using other CPUs. A small task is defined as one +whose tracked load is less than 90% of a NICE_0 task. This is a change +from the usual behavior since the scheduler will normally use an idle +CPU for a waking task unless that task is considered cache hot. + + +How is it implemented? +---- +Since all small tasks must wake up relatively frequently, the main +requirement for packing small tasks is to select a partly-busy CPU when +waking rather than looking for an idle CPU. We use the tracked load of +the CPU runqueue to determine how heavily loaded each CPU is and the +tracked load of the task to determine if it will fit on the CPU. We +always start with the lowest-numbered CPU in a sched domain and stop +looking when we find a CPU with enough space for the task. + +Some further tweaks are necessary to suppress load balancing when the +CPU is not fully loaded, otherwise the scheduler attempts to spread +tasks evenly across the domain. + + +How does it interact with the HMP patches? +---- +Firstly, we only enable packing on the little domain. The intent is that +the big domain is intended to spread tasks amongst the available CPUs +one-task-per-CPU. The little domain however is attempting to use as +little power as possible while servicing its tasks. + +Secondly, since we offload big tasks onto little CPUs in order to try +to devote one CPU to each task, we have a threshold above which we do +not try to pack a task and instead will select an idle CPU if possible. +This maintains maximum forward progress for busy tasks temporarily +demoted from big CPUs. + + +Can the behaviour be tuned? +---- +Yes, the load level of a 'full' CPU can be easily modified in the source +and is exposed through sysfs as /sys/kernel/hmp/packing_limit to be +changed at runtime. The presence of the packing behaviour is controlled +by CONFIG_SCHED_HMP_LITTLE_PACKING and can be disabled at run-time +using /sys/kernel/hmp/packing_enable. +The definition of a small task is hard coded as 90% of NICE_0_LOAD +and cannot be modified at run time. + + +Why do I need to tune it? +---- +The optimal configuration is likely to be different depending upon the +design and manufacturing of your SoC. + +In the main, there are two system effects from enabling small task +packing. + +1. CPU operating point may increase +2. wakeup latency of tasks may be increased + +There are also likely to be secondary effects from loading one CPU +rather than spreading tasks. + +Note that all of these system effects are dependent upon the workload +under consideration. + + +CPU Operating Point +---- +The primary impact of loading one CPU with a number of light tasks is to +increase the compute requirement of that CPU since it is no longer idle +as often. Increased compute requirement causes an increase in the +frequency of the CPU through CPUfreq. + +Consider this example: +We have a system with 3 CPUs which can operate at any frequency between +350MHz and 1GHz. The system has 6 tasks which would each produce 10% +load at 1GHz. The scheduler has frequency-invariant load scaling +enabled. Our DVFS governor aims for 80% utilization at the chosen +frequency. + +Without task packing, these tasks will be spread out amongst all CPUs +such that each has 2. This will produce roughly 20% system load, and +the frequency of the package will remain at 350MHz. + +With task packing set to the default packing_limit, all of these tasks +will sit on one CPU and require a package frequency of ~750MHz to reach +80% utilization. (0.75 = 0.6 * 0.8). + +When a package operates on a single frequency domain, all CPUs in that +package share frequency and voltage. + +Depending upon the SoC implementation there can be a significant amount +of energy lost to leakage from idle CPUs. The decision about how +loaded a CPU must be to be considered 'full' is therefore controllable +through sysfs (sys/kernel/hmp/packing_limit) and directly in the code. + +Continuing the example, lets set packing_limit to 450 which means we +will pack tasks until the total load of all running tasks >= 450. In +practise, this is very similar to a 55% idle 1Ghz CPU. + +Now we are only able to place 4 tasks on CPU0, and two will overflow +onto CPU1. CPU0 will have a load of 40% and CPU1 will have a load of +20%. In order to still hit 80% utilization, CPU0 now only needs to +operate at (0.4*0.8=0.32) 320MHz, which means that the lowest operating +point will be selected, the same as in the non-packing case, except that +now CPU2 is no longer needed and can be power-gated. + +In order to use less energy, the saving from power-gating CPU2 must be +more than the energy spent running CPU0 for the extra cycles. This +depends upon the SoC implementation. + +This is obviously a contrived example requiring all the tasks to +be runnable at the same time, but it illustrates the point. + + +Wakeup Latency +---- +This is an unavoidable consequence of trying to pack tasks together +rather than giving them a CPU each. If you cannot find an acceptable +level of wakeup latency, you should turn packing off. + +Cyclictest is a good test application for determining the added latency +when configuring packing. + + +Why is it turned off for the VersatileExpress V2P_CA15A7 CoreTile? +---- +Simply, this core tile only has power gating for the whole A7 package. +When small task packing is enabled, all our low-energy use cases +normally fit onto one A7 CPU. We therefore end up with 2 mostly-idle +CPUs and one mostly-busy CPU. This decreases the amount of time +available where the whole package is idle and can be turned off. + diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 105ee3defd36..f71feef79215 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1496,6 +1496,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. ip= [IP_PNP] See Documentation/filesystems/nfs/nfsroot.txt. + irqaffinity= [SMP] Set the default irq affinity mask + Format: + <cpu number>,...,<cpu number> + or + <cpu number>-<cpu number> + (must be a positive range in ascending order) + or a mixture + <cpu number>,...,<cpu number>-<cpu number> + irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index a9e02fddecdd..3326da7e28a4 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1395,6 +1395,104 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here. +config SCHED_HMP + bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling" + depends on SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP + help + Experimental scheduler optimizations for heterogeneous platforms. + Attempts to introspectively select task affinity to optimize power + and performance. Basic support for multiple (>2) cpu types is in place, + but it has only been tested with two types of cpus. + There is currently no support for migration of task groups, hence + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE). + When turned on, this option adds sys/kernel/hmp directory which + contains the following files: + up_threshold - the load average threshold used for up migration + (0 - 1023) + down_threshold - the load average threshold used for down migration + (0 - 1023) + hmp_domains - a list of cpumasks for the present HMP domains, + starting with the 'biggest' and ending with the + 'smallest'. + Note that both the threshold files can be written at runtime to + control scheduler behaviour. + +config SCHED_HMP_PRIO_FILTER + bool "(EXPERIMENTAL) Filter HMP migrations by task priority" + depends on SCHED_HMP + help + Enables task priority based HMP migration filter. Any task with + a NICE value above the threshold will always be on low-power cpus + with less compute capacity. + +config SCHED_HMP_PRIO_FILTER_VAL + int "NICE priority threshold" + default 5 + depends on SCHED_HMP_PRIO_FILTER + +config HMP_FAST_CPU_MASK + string "HMP scheduler fast CPU mask" + depends on SCHED_HMP + help + Leave empty to use device tree information. + Specify the cpuids of the fast CPUs in the system as a list string, + e.g. cpuid 0+1 should be specified as 0-1. + +config HMP_SLOW_CPU_MASK + string "HMP scheduler slow CPU mask" + depends on SCHED_HMP + help + Leave empty to use device tree information. + Specify the cpuids of the slow CPUs in the system as a list string, + e.g. cpuid 0+1 should be specified as 0-1. + +config HMP_VARIABLE_SCALE + bool "Allows changing the load tracking scale through sysfs" + depends on SCHED_HMP + help + When turned on, this option exports the load average period value + for the load tracking patches through sysfs. + The values can be modified to change the rate of load accumulation + used for HMP migration. 'load_avg_period_ms' is the time in ms to + reach a load average of 0.5 for an idle task of 0 load average + ratio which becomes 100% busy. + For example, with load_avg_period_ms = 128 and up_threshold = 512, + a running task with a load of 0 will be migrated to a bigger CPU after + 128ms, because after 128ms its load_avg_ratio is 0.5 and the real + up_threshold is 0.5. + This patch has the same behavior as changing the Y of the load + average computation to + (1002/1024)^(LOAD_AVG_PERIOD/load_avg_period_ms) + but removes intermediate overflows in computation. + +config HMP_FREQUENCY_INVARIANT_SCALE + bool "(EXPERIMENTAL) Frequency-Invariant Tracked Load for HMP" + depends on SCHED_HMP && CPU_FREQ + help + Scales the current load contribution in line with the frequency + of the CPU that the task was executed on. + In this version, we use a simple linear scale derived from the + maximum frequency reported by CPUFreq. + Restricting tracked load to be scaled by the CPU's frequency + represents the consumption of possible compute capacity + (rather than consumption of actual instantaneous capacity as + normal) and allows the HMP migration's simple threshold + migration strategy to interact more predictably with CPUFreq's + asynchronous compute capacity changes. + +config SCHED_HMP_LITTLE_PACKING + bool "Small task packing for HMP" + depends on SCHED_HMP + default n + help + Allows the HMP Scheduler to pack small tasks into CPUs in the + smallest HMP domain. + Controlled by two sysfs files in sys/kernel/hmp. + packing_enable: 1 to enable, 0 to disable packing. Default 1. + packing_limit: runqueue load ratio where a RQ is considered + to be full. Default is NICE_0_LOAD * 9/8. + config HAVE_ARM_SCU bool help diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 89cfdd6e50cb..efc2cb392e9b 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -23,6 +23,7 @@ #include <linux/slab.h> #include <asm/cputype.h> +#include <asm/smp_plat.h> #include <asm/topology.h> /* @@ -289,6 +290,111 @@ static struct sched_domain_topology_level arm_topology[] = { { NULL, }, }; +#ifdef CONFIG_SCHED_HMP + +static const char * const little_cores[] = { + "arm,cortex-a7", + NULL, +}; + +static bool is_little_cpu(struct device_node *cn) +{ + const char * const *lc; + for (lc = little_cores; *lc; lc++) + if (of_device_is_compatible(cn, *lc)) + return true; + return false; +} + +void __init arch_get_fast_and_slow_cpus(struct cpumask *fast, + struct cpumask *slow) +{ + struct device_node *cn = NULL; + int cpu; + + cpumask_clear(fast); + cpumask_clear(slow); + + /* + * Use the config options if they are given. This helps testing + * HMP scheduling on systems without a big.LITTLE architecture. + */ + if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) { + if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast)) + WARN(1, "Failed to parse HMP fast cpu mask!\n"); + if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow)) + WARN(1, "Failed to parse HMP slow cpu mask!\n"); + return; + } + + /* + * Else, parse device tree for little cores. + */ + while ((cn = of_find_node_by_type(cn, "cpu"))) { + + const u32 *mpidr; + int len; + + mpidr = of_get_property(cn, "reg", &len); + if (!mpidr || len != 4) { + pr_err("* %s missing reg property\n", cn->full_name); + continue; + } + + cpu = get_logical_index(be32_to_cpup(mpidr)); + if (cpu == -EINVAL) { + pr_err("couldn't get logical index for mpidr %x\n", + be32_to_cpup(mpidr)); + break; + } + + if (is_little_cpu(cn)) + cpumask_set_cpu(cpu, slow); + else + cpumask_set_cpu(cpu, fast); + } + + if (!cpumask_empty(fast) && !cpumask_empty(slow)) + return; + + /* + * We didn't find both big and little cores so let's call all cores + * fast as this will keep the system running, with all cores being + * treated equal. + */ + cpumask_setall(fast); + cpumask_clear(slow); +} + +struct cpumask hmp_slow_cpu_mask; + +void __init arch_get_hmp_domains(struct list_head *hmp_domains_list) +{ + struct cpumask hmp_fast_cpu_mask; + struct hmp_domain *domain; + + arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask); + + /* + * Initialize hmp_domains + * Must be ordered with respect to compute capacity. + * Fastest domain at head of list. + */ + if(!cpumask_empty(&hmp_slow_cpu_mask)) { + domain = (struct hmp_domain *) + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL); + cpumask_copy(&domain->possible_cpus, &hmp_slow_cpu_mask); + cpumask_and(&domain->cpus, cpu_online_mask, &domain->possible_cpus); + list_add(&domain->hmp_domains, hmp_domains_list); + } + domain = (struct hmp_domain *) + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL); + cpumask_copy(&domain->possible_cpus, &hmp_fast_cpu_mask); + cpumask_and(&domain->cpus, cpu_online_mask, &domain->possible_cpus); + list_add(&domain->hmp_domains, hmp_domains_list); +} +#endif /* CONFIG_SCHED_HMP */ + /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index c1cc24f0baab..62e94d95c4b3 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -454,6 +454,97 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here. +config SCHED_HMP + bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling" + depends on SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP + help + Experimental scheduler optimizations for heterogeneous platforms. + Attempts to introspectively select task affinity to optimize power + and performance. Basic support for multiple (>2) cpu types is in place, + but it has only been tested with two types of cpus. + There is currently no support for migration of task groups, hence + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE). + +config SCHED_HMP_PRIO_FILTER + bool "(EXPERIMENTAL) Filter HMP migrations by task priority" + depends on SCHED_HMP + help + Enables task priority based HMP migration filter. Any task with + a NICE value above the threshold will always be on low-power cpus + with less compute capacity. + +config SCHED_HMP_PRIO_FILTER_VAL + int "NICE priority threshold" + default 5 + depends on SCHED_HMP_PRIO_FILTER + +config HMP_FAST_CPU_MASK + string "HMP scheduler fast CPU mask" + depends on SCHED_HMP + help + Leave empty to use device tree information. + Specify the cpuids of the fast CPUs in the system as a list string, + e.g. cpuid 0+1 should be specified as 0-1. + +config HMP_SLOW_CPU_MASK + string "HMP scheduler slow CPU mask" + depends on SCHED_HMP + help + Leave empty to use device tree information. + Specify the cpuids of the slow CPUs in the system as a list string, + e.g. cpuid 0+1 should be specified as 0-1. + +config HMP_VARIABLE_SCALE + bool "Allows changing the load tracking scale through sysfs" + depends on SCHED_HMP + help + When turned on, this option exports the thresholds and load average + period value for the load tracking patches through sysfs. + The values can be modified to change the rate of load accumulation + and the thresholds used for HMP migration. + The load_avg_period_ms is the time in ms to reach a load average of + 0.5 for an idle task of 0 load average ratio that start a busy loop. + The up_threshold and down_threshold is the value to go to a faster + CPU or to go back to a slower cpu. + The {up,down}_threshold are devided by 1024 before being compared + to the load average. + For examples, with load_avg_period_ms = 128 and up_threshold = 512, + a running task with a load of 0 will be migrated to a bigger CPU after + 128ms, because after 128ms its load_avg_ratio is 0.5 and the real + up_threshold is 0.5. + This patch has the same behavior as changing the Y of the load + average computation to + (1002/1024)^(LOAD_AVG_PERIOD/load_avg_period_ms) + but it remove intermadiate overflows in computation. + +config HMP_FREQUENCY_INVARIANT_SCALE + bool "(EXPERIMENTAL) Frequency-Invariant Tracked Load for HMP" + depends on HMP_VARIABLE_SCALE && CPU_FREQ + help + Scales the current load contribution in line with the frequency + of the CPU that the task was executed on. + In this version, we use a simple linear scale derived from the + maximum frequency reported by CPUFreq. + Restricting tracked load to be scaled by the CPU's frequency + represents the consumption of possible compute capacity + (rather than consumption of actual instantaneous capacity as + normal) and allows the HMP migration's simple threshold + migration strategy to interact more predictably with CPUFreq's + asynchronous compute capacity changes. + +config SCHED_HMP_LITTLE_PACKING + bool "Small task packing for HMP" + depends on SCHED_HMP + default n + help + Allows the HMP Scheduler to pack small tasks into CPUs in the + smallest HMP domain. + Controlled by two sysfs files in sys/kernel/hmp. + packing_enable: 1 to enable, 0 to disable packing. Default 1. + packing_limit: runqueue load ratio where a RQ is considered + to be full. Default is NICE_0_LOAD * 9/8. + config NR_CPUS int "Maximum number of CPUs (2-64)" range 2 64 diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index b6ee26b0939a..4a59e6922e88 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c @@ -18,9 +18,11 @@ #include <linux/node.h> #include <linux/nodemask.h> #include <linux/of.h> +#include <linux/slab.h> #include <linux/sched.h> #include <asm/cputype.h> +#include <asm/smp_plat.h> #include <asm/topology.h> static int __init get_cpu_for_node(struct device_node *node) @@ -271,6 +273,126 @@ topology_populated: update_siblings_masks(cpuid); } +#ifdef CONFIG_SCHED_HMP + +/* + * Retrieve logical cpu index corresponding to a given MPIDR[23:0] + * - mpidr: MPIDR[23:0] to be used for the look-up + * + * Returns the cpu logical index or -EINVAL on look-up error + */ +static inline int get_logical_index(u32 mpidr) +{ + int cpu; + for (cpu = 0; cpu < nr_cpu_ids; cpu++) + if (cpu_logical_map(cpu) == mpidr) + return cpu; + return -EINVAL; +} + +static const char * const little_cores[] = { + "arm,cortex-a53", + NULL, +}; + +static bool is_little_cpu(struct device_node *cn) +{ + const char * const *lc; + for (lc = little_cores; *lc; lc++) + if (of_device_is_compatible(cn, *lc)) + return true; + return false; +} + +void __init arch_get_fast_and_slow_cpus(struct cpumask *fast, + struct cpumask *slow) +{ + struct device_node *cn = NULL; + int cpu; + + cpumask_clear(fast); + cpumask_clear(slow); + + /* + * Use the config options if they are given. This helps testing + * HMP scheduling on systems without a big.LITTLE architecture. + */ + if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) { + if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast)) + WARN(1, "Failed to parse HMP fast cpu mask!\n"); + if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow)) + WARN(1, "Failed to parse HMP slow cpu mask!\n"); + return; + } + + /* + * Else, parse device tree for little cores. + */ + while ((cn = of_find_node_by_type(cn, "cpu"))) { + + const u32 *mpidr; + int len; + + mpidr = of_get_property(cn, "reg", &len); + if (!mpidr || len != 8) { + pr_err("%s missing reg property\n", cn->full_name); + continue; + } + + cpu = get_logical_index(be32_to_cpup(mpidr+1)); + if (cpu == -EINVAL) { + pr_err("couldn't get logical index for mpidr %x\n", + be32_to_cpup(mpidr+1)); + break; + } + + if (is_little_cpu(cn)) + cpumask_set_cpu(cpu, slow); + else + cpumask_set_cpu(cpu, fast); + } + + if (!cpumask_empty(fast) && !cpumask_empty(slow)) + return; + + /* + * We didn't find both big and little cores so let's call all cores + * fast as this will keep the system running, with all cores being + * treated equal. + */ + cpumask_setall(fast); + cpumask_clear(slow); +} + +struct cpumask hmp_slow_cpu_mask; + +void __init arch_get_hmp_domains(struct list_head *hmp_domains_list) +{ + struct cpumask hmp_fast_cpu_mask; + struct hmp_domain *domain; + + arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask); + + /* + * Initialize hmp_domains + * Must be ordered with respect to compute capacity. + * Fastest domain at head of list. + */ + if(!cpumask_empty(&hmp_slow_cpu_mask)) { + domain = (struct hmp_domain *) + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL); + cpumask_copy(&domain->possible_cpus, &hmp_slow_cpu_mask); + cpumask_and(&domain->cpus, cpu_online_mask, &domain->possible_cpus); + list_add(&domain->hmp_domains, hmp_domains_list); + } + domain = (struct hmp_domain *) + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL); + cpumask_copy(&domain->possible_cpus, &hmp_fast_cpu_mask); + cpumask_and(&domain->cpus, cpu_online_mask, &domain->possible_cpus); + list_add(&domain->hmp_domains, hmp_domains_list); +} +#endif /* CONFIG_SCHED_HMP */ + static void __init reset_cpu_topology(void) { unsigned int cpu; diff --git a/include/linux/sched.h b/include/linux/sched.h index 1e8b3a5aa473..319a305f442c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1035,6 +1035,14 @@ extern void wake_up_if_idle(int cpu); # define SD_INIT_NAME(type) #endif +#ifdef CONFIG_SCHED_HMP +struct hmp_domain { + struct cpumask cpus; + struct cpumask possible_cpus; + struct list_head hmp_domains; +}; +#endif /* CONFIG_SCHED_HMP */ + #else /* CONFIG_SMP */ struct sched_domain_attr; @@ -1082,8 +1090,33 @@ struct sched_avg { u64 last_runnable_update; s64 decay_count; unsigned long load_avg_contrib; + unsigned long load_avg_ratio; +#ifdef CONFIG_SCHED_HMP + u64 hmp_last_up_migration; + u64 hmp_last_down_migration; +#endif + u32 usage_avg_sum; }; +#ifdef CONFIG_SCHED_HMP +/* + * We want to avoid boosting any processes forked from init (PID 1) + * and kthreadd (assumed to be PID 2). + * + * System services are generally started by init, whilst kernel threads are + * started by kthreadd. We do not want to give those tasks a head start, as + * this costs power for very little benefit. We do however wish to do that for + * tasks which the user launches. + * + * Further, some tasks allocate per-cpu timers directly after launch which can + * lead to those tasks being always scheduled on a big CPU when there is no + * computational need to do so. Not promoting services to big CPUs on launch + * will prevent that unless a service allocates their per-cpu resources after + * a period of intense computation, which is not a common pattern. + */ +#define hmp_task_should_forkboost(task) ((task->parent && task->parent->pid > 2)) +#endif + #ifdef CONFIG_SCHEDSTATS struct sched_statistics { u64 wait_start; diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 7fcebf07d373..2f2e6c447ab1 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -599,6 +599,282 @@ TRACE_EVENT(sched_wake_idle_without_ipi, TP_printk("cpu=%d", __entry->cpu) ); + +/* + * Tracepoint for showing tracked load contribution. + */ +TRACE_EVENT(sched_task_load_contrib, + + TP_PROTO(struct task_struct *tsk, unsigned long load_contrib), + + TP_ARGS(tsk, load_contrib), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, load_contrib) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->load_contrib = load_contrib; + ), + + TP_printk("comm=%s pid=%d load_contrib=%lu", + __entry->comm, __entry->pid, + __entry->load_contrib) +); + +/* + * Tracepoint for showing tracked task runnable ratio [0..1023]. + */ +TRACE_EVENT(sched_task_runnable_ratio, + + TP_PROTO(struct task_struct *tsk, unsigned long ratio), + + TP_ARGS(tsk, ratio), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->ratio = ratio; + ), + + TP_printk("comm=%s pid=%d ratio=%lu", + __entry->comm, __entry->pid, + __entry->ratio) +); + +/* + * Tracepoint for showing tracked rq runnable ratio [0..1023]. + */ +TRACE_EVENT(sched_rq_runnable_ratio, + + TP_PROTO(int cpu, unsigned long ratio), + + TP_ARGS(cpu, ratio), + + TP_STRUCT__entry( + __field(int, cpu) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->ratio = ratio; + ), + + TP_printk("cpu=%d ratio=%lu", + __entry->cpu, + __entry->ratio) +); + +/* + * Tracepoint for showing tracked rq runnable load. + */ +TRACE_EVENT(sched_rq_runnable_load, + + TP_PROTO(int cpu, u64 load), + + TP_ARGS(cpu, load), + + TP_STRUCT__entry( + __field(int, cpu) + __field(u64, load) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->load = load; + ), + + TP_printk("cpu=%d load=%llu", + __entry->cpu, + __entry->load) +); + +TRACE_EVENT(sched_rq_nr_running, + + TP_PROTO(int cpu, unsigned int nr_running, int nr_iowait), + + TP_ARGS(cpu, nr_running, nr_iowait), + + TP_STRUCT__entry( + __field(int, cpu) + __field(unsigned int, nr_running) + __field(int, nr_iowait) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->nr_running = nr_running; + __entry->nr_iowait = nr_iowait; + ), + + TP_printk("cpu=%d nr_running=%u nr_iowait=%d", + __entry->cpu, + __entry->nr_running, __entry->nr_iowait) +); + +/* + * Tracepoint for showing tracked task cpu usage ratio [0..1023]. + */ +TRACE_EVENT(sched_task_usage_ratio, + + TP_PROTO(struct task_struct *tsk, unsigned long ratio), + + TP_ARGS(tsk, ratio), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->ratio = ratio; + ), + + TP_printk("comm=%s pid=%d ratio=%lu", + __entry->comm, __entry->pid, + __entry->ratio) +); + +/* + * Tracepoint for HMP (CONFIG_SCHED_HMP) task migrations, + * marking the forced transition of runnable or running tasks. + */ +TRACE_EVENT(sched_hmp_migrate_force_running, + + TP_PROTO(struct task_struct *tsk, int running), + + TP_ARGS(tsk, running), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(int, running) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->running = running; + ), + + TP_printk("running=%d comm=%s", + __entry->running, __entry->comm) +); + +/* + * Tracepoint for HMP (CONFIG_SCHED_HMP) task migrations, + * marking the forced transition of runnable or running + * tasks when a task is about to go idle. + */ +TRACE_EVENT(sched_hmp_migrate_idle_running, + + TP_PROTO(struct task_struct *tsk, int running), + + TP_ARGS(tsk, running), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(int, running) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->running = running; + ), + + TP_printk("running=%d comm=%s", + __entry->running, __entry->comm) +); + +/* + * Tracepoint for HMP (CONFIG_SCHED_HMP) task migrations. + */ +#define HMP_MIGRATE_WAKEUP 0 +#define HMP_MIGRATE_FORCE 1 +#define HMP_MIGRATE_OFFLOAD 2 +#define HMP_MIGRATE_IDLE_PULL 3 +#define HMP_MIGRATE_FORCED_IDLE_PULL 4 +TRACE_EVENT(sched_hmp_migrate, + + TP_PROTO(struct task_struct *tsk, int dest, int force), + + TP_ARGS(tsk, dest, force), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(int, dest) + __field(int, force) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->dest = dest; + __entry->force = force; + ), + + TP_printk("comm=%s pid=%d dest=%d force=%d", + __entry->comm, __entry->pid, + __entry->dest, __entry->force) +); + +TRACE_EVENT(sched_hmp_offload_abort, + + TP_PROTO(int cpu, int data, char *label), + + TP_ARGS(cpu,data,label), + + TP_STRUCT__entry( + __array(char, label, 64) + __field(int, cpu) + __field(int, data) + ), + + TP_fast_assign( + strncpy(__entry->label, label, 64); + __entry->cpu = cpu; + __entry->data = data; + ), + + TP_printk("cpu=%d data=%d label=%63s", + __entry->cpu, __entry->data, + __entry->label) +); + +TRACE_EVENT(sched_hmp_offload_succeed, + + TP_PROTO(int cpu, int dest_cpu), + + TP_ARGS(cpu,dest_cpu), + + TP_STRUCT__entry( + __field(int, cpu) + __field(int, dest_cpu) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->dest_cpu = dest_cpu; + ), + + TP_printk("cpu=%d dest=%d", + __entry->cpu, + __entry->dest_cpu) +); + #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h new file mode 100644 index 000000000000..da0baf27a39a --- /dev/null +++ b/include/trace/events/smp.h @@ -0,0 +1,90 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM smp + +#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SMP_H + +#include <linux/tracepoint.h> + +DECLARE_EVENT_CLASS(smp_call_class, + + TP_PROTO(void * fnc), + + TP_ARGS(fnc), + + TP_STRUCT__entry( + __field( void *, func ) + ), + + TP_fast_assign( + __entry->func = fnc; + ), + + TP_printk("func=%pf", __entry->func) +); + +/** + * smp_call_func_entry - called in the generic smp-cross-call-handler + * immediately before calling the destination + * function + * @func: function pointer + * + * When used in combination with the smp_call_func_exit tracepoint + * we can determine the cross-call runtime. + */ +DEFINE_EVENT(smp_call_class, smp_call_func_entry, + + TP_PROTO(void * fnc), + + TP_ARGS(fnc) +); + +/** + * smp_call_func_exit - called in the generic smp-cross-call-handler + * immediately after the destination function + * returns + * @func: function pointer + * + * When used in combination with the smp_call_entry tracepoint + * we can determine the cross-call runtime. + */ +DEFINE_EVENT(smp_call_class, smp_call_func_exit, + + TP_PROTO(void * fnc), + + TP_ARGS(fnc) +); + +/** + * smp_call_func_send - called as destination function is set + * in the per-cpu storage + * @func: function pointer + * @dest: cpu to send to + * + * When used in combination with the smp_cross_call_entry tracepoint + * we can determine the call-to-run latency. + */ +TRACE_EVENT(smp_call_func_send, + + TP_PROTO(void * func, int dest), + + TP_ARGS(func, dest), + + TP_STRUCT__entry( + __field( void * , func ) + __field( int , dest ) + ), + + TP_fast_assign( + __entry->func = func; + __entry->dest = dest; + ), + + TP_printk("dest=%d func=%pf", __entry->dest, + __entry->func) +); + +#endif /* _TRACE_SMP_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 99793b9b6d23..fec632d8ef98 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -24,10 +24,35 @@ static struct lock_class_key irq_desc_lock_class; #if defined(CONFIG_SMP) +static int __init irq_affinity_setup(char *str) +{ + zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT); + cpulist_parse(str, irq_default_affinity); + /* + * Set at least the boot cpu. We don't want to end up with + * bugreports caused by random comandline masks + */ + cpumask_set_cpu(smp_processor_id(), irq_default_affinity); + return 1; +} +__setup("irqaffinity=", irq_affinity_setup); + +extern struct cpumask hmp_slow_cpu_mask; + static void __init init_irq_default_affinity(void) { - alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT); - cpumask_setall(irq_default_affinity); +#ifdef CONFIG_CPUMASK_OFFSTACK + if (!irq_default_affinity) + zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT); +#endif +#ifdef CONFIG_SCHED_HMP + if (!cpumask_empty(&hmp_slow_cpu_mask)) { + cpumask_copy(irq_default_affinity, &hmp_slow_cpu_mask); + return; + } +#endif + if (cpumask_empty(irq_default_affinity)) + cpumask_setall(irq_default_affinity); } #else static void __init init_irq_default_affinity(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1a839dc713f7..c42e25856f18 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1582,7 +1582,12 @@ void scheduler_ipi(void) */ preempt_fold_need_resched(); - if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()) + if (llist_empty(&this_rq()->wake_list) + && !got_nohz_idle_kick() +#ifdef CONFIG_SCHED_HMP + && !this_rq()->wake_for_idle_pull +#endif + ) return; /* @@ -1608,6 +1613,11 @@ void scheduler_ipi(void) this_rq()->idle_balance = 1; raise_softirq_irqoff(SCHED_SOFTIRQ); } +#ifdef CONFIG_SCHED_HMP + else if (unlikely(this_rq()->wake_for_idle_pull)) + raise_softirq_irqoff(SCHED_SOFTIRQ); +#endif + irq_exit(); } @@ -1834,6 +1844,20 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node); +#ifdef CONFIG_SCHED_HMP + /* keep LOAD_AVG_MAX in sync with fair.c if load avg series is changed */ +#define LOAD_AVG_MAX 47742 + p->se.avg.hmp_last_up_migration = 0; + p->se.avg.hmp_last_down_migration = 0; + if (hmp_task_should_forkboost(p)) { + p->se.avg.load_avg_ratio = 1023; + p->se.avg.load_avg_contrib = + (1023 * scale_load_down(p->se.load.weight)); + p->se.avg.runnable_avg_period = LOAD_AVG_MAX; + p->se.avg.runnable_avg_sum = LOAD_AVG_MAX; + p->se.avg.usage_avg_sum = LOAD_AVG_MAX; + } +#endif #ifdef CONFIG_SCHEDSTATS memset(&p->se.statistics, 0, sizeof(p->se.statistics)); #endif @@ -3332,6 +3356,8 @@ static void __setscheduler_params(struct task_struct *p, set_load_weight(p); } +extern struct cpumask hmp_slow_cpu_mask; + /* Actually do priority change: must hold pi & rq lock. */ static void __setscheduler(struct rq *rq, struct task_struct *p, const struct sched_attr *attr, bool keep_boost) @@ -3349,8 +3375,18 @@ static void __setscheduler(struct rq *rq, struct task_struct *p, if (dl_prio(p->prio)) p->sched_class = &dl_sched_class; - else if (rt_prio(p->prio)) + else if (rt_prio(p->prio)) { p->sched_class = &rt_sched_class; +#ifdef CONFIG_SCHED_HMP + if (!cpumask_empty(&hmp_slow_cpu_mask)) + if (cpumask_equal(&p->cpus_allowed, + cpu_possible_mask)) { + p->nr_cpus_allowed = + cpumask_weight(&hmp_slow_cpu_mask); + do_set_cpus_allowed(p, &hmp_slow_cpu_mask); + } +#endif + } else p->sched_class = &fair_sched_class; } @@ -6242,6 +6278,10 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) #endif } else { sd->flags |= SD_PREFER_SIBLING; +#ifdef CONFIG_SCHED_HMP + /* Disable load balance on DIE level */ + sd->flags &= ~SD_LOAD_BALANCE; +#endif sd->cache_nice_tries = 1; sd->busy_idx = 2; sd->idle_idx = 1; diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index ce33780d8f20..c5c876442f51 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -95,6 +95,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #ifdef CONFIG_SMP P(se->avg.runnable_avg_sum); P(se->avg.runnable_avg_period); + P(se->avg.usage_avg_sum); P(se->avg.load_avg_contrib); P(se->avg.decay_count); #endif @@ -223,6 +224,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) atomic_long_read(&cfs_rq->tg->load_avg)); SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg", atomic_read(&cfs_rq->tg->runnable_avg)); + SEQ_printf(m, " .%-30s: %d\n", "tg->usage_avg", + atomic_read(&cfs_rq->tg->usage_avg)); #endif #endif #ifdef CONFIG_CFS_BANDWIDTH diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d821f4985de5..87f7b29bb2e0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -32,6 +32,17 @@ #include <linux/task_work.h> #include <trace/events/sched.h> +#include <linux/sysfs.h> +#include <linux/vmalloc.h> +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +/* Include cpufreq header to add a notifier so that cpu frequency + * scaling can track the current CPU frequency + */ +#include <linux/cpufreq.h> +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ +#ifdef CONFIG_SCHED_HMP +#include <linux/cpuidle.h> +#endif #include "sched.h" @@ -2289,6 +2300,93 @@ static u32 __compute_runnable_contrib(u64 n) return contrib + runnable_avg_yN_sum[n]; } +#ifdef CONFIG_SCHED_HMP +#define HMP_VARIABLE_SCALE_SHIFT 16ULL +struct hmp_global_attr { + struct attribute attr; + ssize_t (*show)(struct kobject *kobj, + struct attribute *attr, char *buf); + ssize_t (*store)(struct kobject *a, struct attribute *b, + const char *c, size_t count); + int *value; + int (*to_sysfs)(int); + int (*from_sysfs)(int); + ssize_t (*to_sysfs_text)(char *buf, int buf_size); +}; + +#define HMP_DATA_SYSFS_MAX 8 + +struct hmp_data_struct { +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + int freqinvar_load_scale_enabled; +#endif + int multiplier; /* used to scale the time delta */ + struct attribute_group attr_group; + struct attribute *attributes[HMP_DATA_SYSFS_MAX + 1]; + struct hmp_global_attr attr[HMP_DATA_SYSFS_MAX]; +} hmp_data; + +static u64 hmp_variable_scale_convert(u64 delta); + +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +/* Frequency-Invariant Load Modification: + * Loads are calculated as in PJT's patch however we also scale the current + * contribution in line with the frequency of the CPU that the task was + * executed on. + * In this version, we use a simple linear scale derived from the maximum + * frequency reported by CPUFreq. As an example: + * + * Consider that we ran a task for 100% of the previous interval. + * + * Our CPU was under asynchronous frequency control through one of the + * CPUFreq governors. + * + * The CPUFreq governor reports that it is able to scale the CPU between + * 500MHz and 1GHz. + * + * During the period, the CPU was running at 1GHz. + * + * In this case, our load contribution for that period is calculated as + * 1 * (number_of_active_microseconds) + * + * This results in our task being able to accumulate maximum load as normal. + * + * + * Consider now that our CPU was executing at 500MHz. + * + * We now scale the load contribution such that it is calculated as + * 0.5 * (number_of_active_microseconds) + * + * Our task can only record 50% maximum load during this period. + * + * This represents the task consuming 50% of the CPU's *possible* compute + * capacity. However the task did consume 100% of the CPU's *available* + * compute capacity which is the value seen by the CPUFreq governor and + * user-side CPU Utilization tools. + * + * Restricting tracked load to be scaled by the CPU's frequency accurately + * represents the consumption of possible compute capacity and allows the + * HMP migration's simple threshold migration strategy to interact more + * predictably with CPUFreq's asynchronous compute capacity changes. + */ +#define SCHED_FREQSCALE_SHIFT 10 +struct cpufreq_extents { + u32 curr_scale; + u32 min; + u32 max; + u32 flags; +}; + +/* Flag set when the governor in use only allows one frequency. + * Disables scaling. + */ +#define SCHED_LOAD_FREQINVAR_SINGLEFREQ 0x01 + +static struct cpufreq_extents freq_scale[CONFIG_NR_CPUS]; + +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ +#endif /* CONFIG_SCHED_HMP */ + /* * We can represent the historical contribution to runnable average as the * coefficients of a geometric series. To do this we sub-divide our runnable @@ -2319,13 +2417,24 @@ static u32 __compute_runnable_contrib(u64 n) */ static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, - int runnable) + int runnable, + int running, + int cpu) { u64 delta, periods; u32 runnable_contrib; int delta_w, decayed = 0; +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + u64 scaled_delta; + u32 scaled_runnable_contrib; + int scaled_delta_w; + u32 curr_scale = 1024; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ delta = now - sa->last_runnable_update; +#ifdef CONFIG_SCHED_HMP + delta = hmp_variable_scale_convert(delta); +#endif /* * This should only happen when time goes backwards, which it * unfortunately does during sched clock init when we swap over to TSC. @@ -2344,6 +2453,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, return 0; sa->last_runnable_update = now; +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + /* retrieve scale factor for load */ + if (hmp_data.freqinvar_load_scale_enabled) + curr_scale = freq_scale[cpu].curr_scale; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ + /* delta_w is the amount already accumulated against our next period */ delta_w = sa->runnable_avg_period % 1024; if (delta + delta_w >= 1024) { @@ -2356,8 +2471,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now, * period and accrue it. */ delta_w = 1024 - delta_w; + /* scale runnable time if necessary */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_delta_w = (delta_w * curr_scale) + >> SCHED_FREQSCALE_SHIFT; + if (runnable) + sa->runnable_avg_sum += scaled_delta_w; + if (running) + sa->usage_avg_sum += scaled_delta_w; +#else if (runnable) sa->runnable_avg_sum += delta_w; + if (running) + sa->usage_avg_sum += delta_w; +#endif /* #ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += delta_w; delta -= delta_w; @@ -2365,22 +2492,51 @@ static __always_inline int __update_entity_runnable_avg(u64 now, /* Figure out how many additional periods this update spans */ periods = delta / 1024; delta %= 1024; + /* decay the load we have accumulated so far */ sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); sa->runnable_avg_period = decay_load(sa->runnable_avg_period, periods + 1); + sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1); + /* add the contribution from this period */ /* Efficiently calculate \sum (1..n_period) 1024*y^i */ runnable_contrib = __compute_runnable_contrib(periods); + /* Apply load scaling if necessary. + * Note that multiplying the whole series is same as + * multiplying all terms + */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_runnable_contrib = (runnable_contrib * curr_scale) + >> SCHED_FREQSCALE_SHIFT; + if (runnable) + sa->runnable_avg_sum += scaled_runnable_contrib; + if (running) + sa->usage_avg_sum += scaled_runnable_contrib; +#else if (runnable) sa->runnable_avg_sum += runnable_contrib; + if (running) + sa->usage_avg_sum += runnable_contrib; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += runnable_contrib; } /* Remainder of delta accrued against u_0` */ + /* scale if necessary */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_delta = ((delta * curr_scale) >> SCHED_FREQSCALE_SHIFT); + if (runnable) + sa->runnable_avg_sum += scaled_delta; + if (running) + sa->usage_avg_sum += scaled_delta; +#else if (runnable) sa->runnable_avg_sum += delta; + if (running) + sa->usage_avg_sum += delta; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += delta; return decayed; @@ -2393,10 +2549,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se) u64 decays = atomic64_read(&cfs_rq->decay_counter); decays -= se->avg.decay_count; - if (!decays) - return 0; - - se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays); + if (decays) + se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays); se->avg.decay_count = 0; return decays; @@ -2429,16 +2583,28 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa, struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; - long contrib; + long contrib, usage_contrib; /* The fraction of a cpu used by this cfs_rq */ contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT, sa->runnable_avg_period + 1); contrib -= cfs_rq->tg_runnable_contrib; - if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) { + usage_contrib = div_u64(sa->usage_avg_sum << NICE_0_SHIFT, + sa->runnable_avg_period + 1); + usage_contrib -= cfs_rq->tg_usage_contrib; + + /* + * contrib/usage at this point represent deltas, only update if they + * are substantive. + */ + if ((abs(contrib) > cfs_rq->tg_runnable_contrib / 64) || + (abs(usage_contrib) > cfs_rq->tg_usage_contrib / 64)) { atomic_add(contrib, &tg->runnable_avg); cfs_rq->tg_runnable_contrib += contrib; + + atomic_add(usage_contrib, &tg->usage_avg); + cfs_rq->tg_usage_contrib += usage_contrib; } } @@ -2486,8 +2652,17 @@ static inline void __update_group_entity_contrib(struct sched_entity *se) static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); + int cpu = -1; /* not used in normal case */ + +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + cpu = rq->cpu; +#endif + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable, + runnable, cpu); __update_tg_runnable_avg(&rq->avg, &rq->cfs); + trace_sched_rq_runnable_ratio(cpu_of(rq), rq->avg.load_avg_ratio); + trace_sched_rq_runnable_load(cpu_of(rq), rq->cfs.runnable_load_avg); + trace_sched_rq_nr_running(cpu_of(rq), rq->nr_running, rq->nr_iowait.counter); } #else /* CONFIG_FAIR_GROUP_SCHED */ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq, @@ -2506,12 +2681,18 @@ static inline void __update_task_entity_contrib(struct sched_entity *se) contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight); contrib /= (se->avg.runnable_avg_period + 1); se->avg.load_avg_contrib = scale_load(contrib); + trace_sched_task_load_contrib(task_of(se), se->avg.load_avg_contrib); + contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD); + contrib /= (se->avg.runnable_avg_period + 1); + se->avg.load_avg_ratio = scale_load(contrib); + trace_sched_task_runnable_ratio(task_of(se), se->avg.load_avg_ratio); } /* Compute the current contribution to load_avg by se, return any delta */ -static long __update_entity_load_avg_contrib(struct sched_entity *se) +static long __update_entity_load_avg_contrib(struct sched_entity *se, long *ratio) { long old_contrib = se->avg.load_avg_contrib; + long old_ratio = se->avg.load_avg_ratio; if (entity_is_task(se)) { __update_task_entity_contrib(se); @@ -2520,6 +2701,8 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) __update_group_entity_contrib(se); } + if (ratio) + *ratio = se->avg.load_avg_ratio - old_ratio; return se->avg.load_avg_contrib - old_contrib; } @@ -2539,9 +2722,13 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, ratio_delta; u64 now; + int cpu = -1; /* not used in normal case */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + cpu = cfs_rq->rq->cpu; +#endif /* * For a group entity we need to use their owned cfs_rq_clock_task() in * case they are the parent of a throttled hierarchy. @@ -2551,18 +2738,23 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se)); - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, + cfs_rq->curr == se, cpu)) return; - contrib_delta = __update_entity_load_avg_contrib(se); + contrib_delta = __update_entity_load_avg_contrib(se, &ratio_delta); if (!update_cfs_rq) return; - if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; - else +#ifdef CONFIG_SCHED_HMP + rq_of(cfs_rq)->avg.load_avg_ratio += ratio_delta; +#endif /* CONFIG_SCHED_HMP */ + } else { subtract_blocked_load_contrib(cfs_rq, -contrib_delta); + } } /* @@ -2637,6 +2829,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, } cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; +#ifdef CONFIG_SCHED_HMP + rq_of(cfs_rq)->avg.load_avg_ratio += se->avg.load_avg_ratio; +#endif /* CONFIG_SCHED_HMP */ + /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2655,6 +2851,10 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep); cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; +#ifdef CONFIG_SCHED_HMP + rq_of(cfs_rq)->avg.load_avg_ratio -= se->avg.load_avg_ratio; +#endif /* CONFIG_SCHED_HMP */ + if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); @@ -2993,6 +3193,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); } update_stats_curr_start(cfs_rq, se); @@ -4529,6 +4730,844 @@ done: return target; } +#ifdef CONFIG_SCHED_HMP +/* + * Heterogenous multiprocessor (HMP) optimizations + * + * The cpu types are distinguished using a list of hmp_domains + * which each represent one cpu type using a cpumask. + * The list is assumed ordered by compute capacity with the + * fastest domain first. + */ +DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain); +static const int hmp_max_tasks = 5; + +extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list); + +#ifdef CONFIG_CPU_IDLE +/* + * hmp_idle_pull: + * + * In this version we have stopped using forced up migrations when we + * detect that a task running on a little CPU should be moved to a bigger + * CPU. In most cases, the bigger CPU is in a deep sleep state and a forced + * migration means we stop the task immediately but need to wait for the + * target CPU to wake up before we can restart the task which is being + * moved. Instead, we now wake a big CPU with an IPI and ask it to pull + * a task when ready. This allows the task to continue executing on its + * current CPU, reducing the amount of time that the task is stalled for. + * + * keepalive timers: + * + * The keepalive timer is used as a way to keep a CPU engaged in an + * idle pull operation out of idle while waiting for the source + * CPU to stop and move the task. Ideally this would not be necessary + * and we could impose a temporary zero-latency requirement on the + * current CPU, but in the current QoS framework this will result in + * all CPUs in the system being unable to enter idle states which is + * not desirable. The timer does not perform any work when it expires. + */ +struct hmp_keepalive { + bool init; + ktime_t delay; /* if zero, no need for timer */ + struct hrtimer timer; +}; +DEFINE_PER_CPU(struct hmp_keepalive, hmp_cpu_keepalive); + +/* setup per-cpu keepalive timers */ +static enum hrtimer_restart hmp_cpu_keepalive_notify(struct hrtimer *hrtimer) +{ + return HRTIMER_NORESTART; +} + +/* + * Work out if any of the idle states have an exit latency too high for us. + * ns_delay is passed in containing the max we are willing to tolerate. + * If there are none, set ns_delay to zero. + * If there are any, set ns_delay to + * ('target_residency of state with shortest too-big latency' - 1) * 1000. + */ +static void hmp_keepalive_delay(int cpu, unsigned int *ns_delay) +{ + struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu); + struct cpuidle_driver *drv; + + drv = cpuidle_get_cpu_driver(dev); + if (drv) { + unsigned int us_delay = UINT_MAX; + unsigned int us_max_delay = *ns_delay / 1000; + int idx; + /* if cpuidle states are guaranteed to be sorted we + * could stop at the first match. + */ + for (idx = 0; idx < drv->state_count; idx++) { + if (drv->states[idx].exit_latency > us_max_delay && + drv->states[idx].target_residency < us_delay) { + us_delay = drv->states[idx].target_residency; + } + } + if (us_delay == UINT_MAX) + *ns_delay = 0; /* no timer required */ + else + *ns_delay = 1000 * (us_delay - 1); + } +} + +static void hmp_cpu_keepalive_trigger(void) +{ + int cpu = smp_processor_id(); + struct hmp_keepalive *keepalive = &per_cpu(hmp_cpu_keepalive, cpu); + if (!keepalive->init) { + unsigned int ns_delay = 100000; /* tolerate 100usec delay */ + + hrtimer_init(&keepalive->timer, + CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); + keepalive->timer.function = hmp_cpu_keepalive_notify; + + hmp_keepalive_delay(cpu, &ns_delay); + keepalive->delay = ns_to_ktime(ns_delay); + keepalive->init = true; + } + if (ktime_to_ns(keepalive->delay)) + __hrtimer_start_range_ns(&keepalive->timer, + keepalive->delay, 0, HRTIMER_MODE_REL_PINNED, 0); +} + +static void hmp_cpu_keepalive_cancel(int cpu) +{ + struct hmp_keepalive *keepalive = &per_cpu(hmp_cpu_keepalive, cpu); + if (keepalive->init) + hrtimer_cancel(&keepalive->timer); +} +#else /* !CONFIG_CPU_IDLE */ +static void hmp_cpu_keepalive_trigger(void) +{ +} + +static void hmp_cpu_keepalive_cancel(int cpu) +{ +} +#endif + +/* Setup hmp_domains */ +static int __init hmp_cpu_mask_setup(void) +{ + char buf[64]; + struct hmp_domain *domain; + struct list_head *pos; + int dc, cpu; + + pr_debug("Initializing HMP scheduler:\n"); + + /* Initialize hmp_domains using platform code */ + arch_get_hmp_domains(&hmp_domains); + if (list_empty(&hmp_domains)) { + pr_debug("HMP domain list is empty!\n"); + return 0; + } + + /* Print hmp_domains */ + dc = 0; + list_for_each(pos, &hmp_domains) { + domain = list_entry(pos, struct hmp_domain, hmp_domains); + cpulist_scnprintf(buf, 64, &domain->possible_cpus); + pr_debug(" HMP domain %d: %s\n", dc, buf); + + for_each_cpu_mask(cpu, domain->possible_cpus) { + per_cpu(hmp_cpu_domain, cpu) = domain; + } + dc++; + } + + return 1; +} + +static struct hmp_domain *hmp_get_hmp_domain_for_cpu(int cpu) +{ + struct hmp_domain *domain; + struct list_head *pos; + + list_for_each(pos, &hmp_domains) { + domain = list_entry(pos, struct hmp_domain, hmp_domains); + if(cpumask_test_cpu(cpu, &domain->possible_cpus)) + return domain; + } + return NULL; +} + +static void hmp_online_cpu(int cpu) +{ + struct hmp_domain *domain = hmp_get_hmp_domain_for_cpu(cpu); + + if(domain) + cpumask_set_cpu(cpu, &domain->cpus); +} + +static void hmp_offline_cpu(int cpu) +{ + struct hmp_domain *domain = hmp_get_hmp_domain_for_cpu(cpu); + + if(domain) + cpumask_clear_cpu(cpu, &domain->cpus); + + hmp_cpu_keepalive_cancel(cpu); +} + +/* + * Needed to determine heaviest tasks etc. + */ +static inline unsigned int hmp_cpu_is_fastest(int cpu); +static inline unsigned int hmp_cpu_is_slowest(int cpu); +static inline struct hmp_domain *hmp_slower_domain(int cpu); +static inline struct hmp_domain *hmp_faster_domain(int cpu); + +/* must hold runqueue lock for queue se is currently on */ +static struct sched_entity *hmp_get_heaviest_task( + struct sched_entity *se, int target_cpu) +{ + int num_tasks = hmp_max_tasks; + struct sched_entity *max_se = se; + unsigned long int max_ratio = se->avg.load_avg_ratio; + const struct cpumask *hmp_target_mask = NULL; + struct hmp_domain *hmp; + + if (hmp_cpu_is_fastest(cpu_of(se->cfs_rq->rq))) + return max_se; + + hmp = hmp_faster_domain(cpu_of(se->cfs_rq->rq)); + hmp_target_mask = &hmp->cpus; + if (target_cpu >= 0) { + /* idle_balance gets run on a CPU while + * it is in the middle of being hotplugged + * out. Bail early in that case. + */ + if(!cpumask_test_cpu(target_cpu, hmp_target_mask)) + return NULL; + hmp_target_mask = cpumask_of(target_cpu); + } + /* The currently running task is not on the runqueue */ + se = __pick_first_entity(cfs_rq_of(se)); + + while (num_tasks && se) { + if (entity_is_task(se) && + se->avg.load_avg_ratio > max_ratio && + cpumask_intersects(hmp_target_mask, + tsk_cpus_allowed(task_of(se)))) { + max_se = se; + max_ratio = se->avg.load_avg_ratio; + } + se = __pick_next_entity(se); + num_tasks--; + } + return max_se; +} + +static struct sched_entity *hmp_get_lightest_task( + struct sched_entity *se, int migrate_down) +{ + int num_tasks = hmp_max_tasks; + struct sched_entity *min_se = se; + unsigned long int min_ratio = se->avg.load_avg_ratio; + const struct cpumask *hmp_target_mask = NULL; + + if (migrate_down) { + struct hmp_domain *hmp; + if (hmp_cpu_is_slowest(cpu_of(se->cfs_rq->rq))) + return min_se; + hmp = hmp_slower_domain(cpu_of(se->cfs_rq->rq)); + hmp_target_mask = &hmp->cpus; + } + /* The currently running task is not on the runqueue */ + se = __pick_first_entity(cfs_rq_of(se)); + + while (num_tasks && se) { + if (entity_is_task(se) && + (se->avg.load_avg_ratio < min_ratio && + hmp_target_mask && + cpumask_intersects(hmp_target_mask, + tsk_cpus_allowed(task_of(se))))) { + min_se = se; + min_ratio = se->avg.load_avg_ratio; + } + se = __pick_next_entity(se); + num_tasks--; + } + return min_se; +} + +/* + * Migration thresholds should be in the range [0..1023] + * hmp_up_threshold: min. load required for migrating tasks to a faster cpu + * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu + * + * hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio) + * hmp_next_up_threshold: Delay before next up migration (1024 ~= 1 ms) + * hmp_next_down_threshold: Delay before next down migration (1024 ~= 1 ms) + * + * Small Task Packing: + * We can choose to fill the littlest CPUs in an HMP system rather than + * the typical spreading mechanic. This behavior is controllable using + * two variables. + * hmp_packing_enabled: runtime control over pack/spread + * hmp_full_threshold: Consider a CPU with this much unweighted load full + */ +unsigned int hmp_up_threshold = 700; +unsigned int hmp_down_threshold = 512; +#ifdef CONFIG_SCHED_HMP_PRIO_FILTER +unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL); +#endif +unsigned int hmp_next_up_threshold = 4096; +unsigned int hmp_next_down_threshold = 4096; + +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING +/* + * Set the default packing threshold to try to keep little + * CPUs at no more than 80% of their maximum frequency if only + * packing a small number of small tasks. Bigger tasks will + * raise frequency as normal. + * In order to pack a task onto a CPU, the sum of the + * unweighted runnable_avg load of existing tasks plus the + * load of the new task must be less than hmp_full_threshold. + * + * This works in conjunction with frequency-invariant load + * and DVFS governors. Since most DVFS governors aim for 80% + * utilisation, we arrive at (0.8*0.8*(max_load=1024))=655 + * and use a value slightly lower to give a little headroom + * in the decision. + * Note that the most efficient frequency is different for + * each system so /sys/kernel/hmp/packing_limit should be + * configured at runtime for any given platform to achieve + * optimal energy usage. Some systems may not benefit from + * packing, so this feature can also be disabled at runtime + * with /sys/kernel/hmp/packing_enable + */ +unsigned int hmp_packing_enabled = 1; +unsigned int hmp_full_threshold = 650; +#endif + +static unsigned int hmp_up_migration(int cpu, int *target_cpu, struct sched_entity *se); +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); +static inline unsigned int hmp_domain_min_load(struct hmp_domain *hmpd, + int *min_cpu, struct cpumask *affinity); + +static inline struct hmp_domain *hmp_smallest_domain(void) +{ + return list_entry(hmp_domains.prev, struct hmp_domain, hmp_domains); +} + +/* Check if cpu is in fastest hmp_domain */ +static inline unsigned int hmp_cpu_is_fastest(int cpu) +{ + struct list_head *pos; + + pos = &hmp_cpu_domain(cpu)->hmp_domains; + return pos == hmp_domains.next; +} + +/* Check if cpu is in slowest hmp_domain */ +static inline unsigned int hmp_cpu_is_slowest(int cpu) +{ + struct list_head *pos; + + pos = &hmp_cpu_domain(cpu)->hmp_domains; + return list_is_last(pos, &hmp_domains); +} + +/* Next (slower) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_slower_domain(int cpu) +{ + struct list_head *pos; + + pos = &hmp_cpu_domain(cpu)->hmp_domains; + return list_entry(pos->next, struct hmp_domain, hmp_domains); +} + +/* Previous (faster) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_faster_domain(int cpu) +{ + struct list_head *pos; + + pos = &hmp_cpu_domain(cpu)->hmp_domains; + return list_entry(pos->prev, struct hmp_domain, hmp_domains); +} + +/* + * Selects a cpu in previous (faster) hmp_domain + */ +static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk, + int cpu) +{ + int lowest_cpu=NR_CPUS; + __always_unused int lowest_ratio; + struct hmp_domain *hmp; + + if (hmp_cpu_is_fastest(cpu)) + hmp = hmp_cpu_domain(cpu); + else + hmp = hmp_faster_domain(cpu); + + lowest_ratio = hmp_domain_min_load(hmp, &lowest_cpu, + tsk_cpus_allowed(tsk)); + + return lowest_cpu; +} + +/* + * Selects a cpu in next (slower) hmp_domain + * Note that cpumask_any_and() returns the first cpu in the cpumask + */ +static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk, + int cpu) +{ + int lowest_cpu=NR_CPUS; + struct hmp_domain *hmp; + __always_unused int lowest_ratio; + + if (hmp_cpu_is_slowest(cpu)) + hmp = hmp_cpu_domain(cpu); + else + hmp = hmp_slower_domain(cpu); + + lowest_ratio = hmp_domain_min_load(hmp, &lowest_cpu, + tsk_cpus_allowed(tsk)); + + return lowest_cpu; +} + +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING +/* + * Select the 'best' candidate little CPU to wake up on. + * Implements a packing strategy which examines CPU in + * logical CPU order, and selects the first which will + * be loaded less than hmp_full_threshold according to + * the sum of the tracked load of the runqueue and the task. + */ +static inline unsigned int hmp_best_little_cpu(struct task_struct *tsk, + int cpu) { + int tmp_cpu; + unsigned long estimated_load; + struct hmp_domain *hmp; + struct sched_avg *avg; + struct cpumask allowed_hmp_cpus; + + if(!hmp_packing_enabled || + tsk->se.avg.load_avg_ratio > ((NICE_0_LOAD * 90)/100)) + return hmp_select_slower_cpu(tsk, cpu); + + if (hmp_cpu_is_slowest(cpu)) + hmp = hmp_cpu_domain(cpu); + else + hmp = hmp_slower_domain(cpu); + + /* respect affinity */ + cpumask_and(&allowed_hmp_cpus, &hmp->cpus, + tsk_cpus_allowed(tsk)); + + for_each_cpu_mask(tmp_cpu, allowed_hmp_cpus) { + avg = &cpu_rq(tmp_cpu)->avg; + /* estimate new rq load if we add this task */ + estimated_load = avg->load_avg_ratio + + tsk->se.avg.load_avg_ratio; + if (estimated_load <= hmp_full_threshold) { + cpu = tmp_cpu; + break; + } + } + /* if no match was found, the task uses the initial value */ + return cpu; +} +#endif + +static inline void hmp_next_up_delay(struct sched_entity *se, int cpu) +{ + /* hack - always use clock from first online CPU */ + u64 now = cpu_rq(cpumask_first(cpu_online_mask))->clock_task; + se->avg.hmp_last_up_migration = now; + se->avg.hmp_last_down_migration = 0; + cpu_rq(cpu)->avg.hmp_last_up_migration = now; + cpu_rq(cpu)->avg.hmp_last_down_migration = 0; +} + +static inline void hmp_next_down_delay(struct sched_entity *se, int cpu) +{ + /* hack - always use clock from first online CPU */ + u64 now = cpu_rq(cpumask_first(cpu_online_mask))->clock_task; + se->avg.hmp_last_down_migration = now; + se->avg.hmp_last_up_migration = 0; + cpu_rq(cpu)->avg.hmp_last_down_migration = now; + cpu_rq(cpu)->avg.hmp_last_up_migration = 0; +} + +/* + * Heterogenous multiprocessor (HMP) optimizations + * + * These functions allow to change the growing speed of the load_avg_ratio + * by default it goes from 0 to 0.5 in LOAD_AVG_PERIOD = 32ms + * This can now be changed with /sys/kernel/hmp/load_avg_period_ms. + * + * These functions also allow to change the up and down threshold of HMP + * using /sys/kernel/hmp/{up,down}_threshold. + * Both must be between 0 and 1023. The threshold that is compared + * to the load_avg_ratio is up_threshold/1024 and down_threshold/1024. + * + * For instance, if load_avg_period = 64 and up_threshold = 512, an idle + * task with a load of 0 will reach the threshold after 64ms of busy loop. + * + * Changing load_avg_periods_ms has the same effect than changing the + * default scaling factor Y=1002/1024 in the load_avg_ratio computation to + * (1002/1024.0)^(LOAD_AVG_PERIOD/load_avg_period_ms), but the last one + * could trigger overflows. + * For instance, with Y = 1023/1024 in __update_task_entity_contrib() + * "contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);" + * could be overflowed for a weight > 2^12 even is the load_avg_contrib + * should still be a 32bits result. This would not happen by multiplicating + * delta time by 1/22 and setting load_avg_period_ms = 706. + */ + +/* + * By scaling the delta time it end-up increasing or decrease the + * growing speed of the per entity load_avg_ratio + * The scale factor hmp_data.multiplier is a fixed point + * number: (32-HMP_VARIABLE_SCALE_SHIFT).HMP_VARIABLE_SCALE_SHIFT + */ +static inline u64 hmp_variable_scale_convert(u64 delta) +{ +#ifdef CONFIG_HMP_VARIABLE_SCALE + u64 high = delta >> 32ULL; + u64 low = delta & 0xffffffffULL; + low *= hmp_data.multiplier; + high *= hmp_data.multiplier; + return (low >> HMP_VARIABLE_SCALE_SHIFT) + + (high << (32ULL - HMP_VARIABLE_SCALE_SHIFT)); +#else + return delta; +#endif +} + +static ssize_t hmp_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct hmp_global_attr *hmp_attr = + container_of(attr, struct hmp_global_attr, attr); + int temp; + + if (hmp_attr->to_sysfs_text != NULL) + return hmp_attr->to_sysfs_text(buf, PAGE_SIZE); + + temp = *(hmp_attr->value); + if (hmp_attr->to_sysfs != NULL) + temp = hmp_attr->to_sysfs(temp); + + return (ssize_t)sprintf(buf, "%d\n", temp); +} + +static ssize_t hmp_store(struct kobject *a, struct attribute *attr, + const char *buf, size_t count) +{ + int temp; + ssize_t ret = count; + struct hmp_global_attr *hmp_attr = + container_of(attr, struct hmp_global_attr, attr); + char *str = vmalloc(count + 1); + if (str == NULL) + return -ENOMEM; + memcpy(str, buf, count); + str[count] = 0; + if (sscanf(str, "%d", &temp) < 1) + ret = -EINVAL; + else { + if (hmp_attr->from_sysfs != NULL) + temp = hmp_attr->from_sysfs(temp); + if (temp < 0) + ret = -EINVAL; + else + *(hmp_attr->value) = temp; + } + vfree(str); + return ret; +} + +static ssize_t hmp_print_domains(char *outbuf, int outbufsize) +{ + char buf[64]; + const char nospace[] = "%s", space[] = " %s"; + const char *fmt = nospace; + struct hmp_domain *domain; + struct list_head *pos; + int outpos = 0; + list_for_each(pos, &hmp_domains) { + domain = list_entry(pos, struct hmp_domain, hmp_domains); + if (cpumask_scnprintf(buf, 64, &domain->possible_cpus)) { + outpos += sprintf(outbuf+outpos, fmt, buf); + fmt = space; + } + } + strcat(outbuf, "\n"); + return outpos+1; +} + +#ifdef CONFIG_HMP_VARIABLE_SCALE +static int hmp_period_tofrom_sysfs(int value) +{ + return (LOAD_AVG_PERIOD << HMP_VARIABLE_SCALE_SHIFT) / value; +} +#endif + +/* max value for threshold is 1024 */ +static int hmp_theshold_from_sysfs(int value) +{ + if (value > 1024) + return -1; + return value; +} + +#if defined(CONFIG_SCHED_HMP_LITTLE_PACKING) || \ + defined(CONFIG_HMP_FREQUENCY_INVARIANT_SCALE) +/* toggle control is only 0,1 off/on */ +static int hmp_toggle_from_sysfs(int value) +{ + if (value < 0 || value > 1) + return -1; + return value; +} +#endif + +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING +/* packing value must be non-negative */ +static int hmp_packing_from_sysfs(int value) +{ + if (value < 0) + return -1; + return value; +} +#endif + +static void hmp_attr_add( + const char *name, + int *value, + int (*to_sysfs)(int), + int (*from_sysfs)(int), + ssize_t (*to_sysfs_text)(char *, int), + umode_t mode) +{ + int i = 0; + while (hmp_data.attributes[i] != NULL) { + i++; + if (i >= HMP_DATA_SYSFS_MAX) + return; + } + if (mode) + hmp_data.attr[i].attr.mode = mode; + else + hmp_data.attr[i].attr.mode = 0644; + hmp_data.attr[i].show = hmp_show; + hmp_data.attr[i].store = hmp_store; + hmp_data.attr[i].attr.name = name; + hmp_data.attr[i].value = value; + hmp_data.attr[i].to_sysfs = to_sysfs; + hmp_data.attr[i].from_sysfs = from_sysfs; + hmp_data.attr[i].to_sysfs_text = to_sysfs_text; + hmp_data.attributes[i] = &hmp_data.attr[i].attr; + hmp_data.attributes[i + 1] = NULL; +} + +static int hmp_attr_init(void) +{ + int ret; + memset(&hmp_data, 0, sizeof(hmp_data)); + hmp_attr_add("hmp_domains", + NULL, + NULL, + NULL, + hmp_print_domains, + 0444); + hmp_attr_add("up_threshold", + &hmp_up_threshold, + NULL, + hmp_theshold_from_sysfs, + NULL, + 0); + hmp_attr_add("down_threshold", + &hmp_down_threshold, + NULL, + hmp_theshold_from_sysfs, + NULL, + 0); +#ifdef CONFIG_HMP_VARIABLE_SCALE + /* by default load_avg_period_ms == LOAD_AVG_PERIOD + * meaning no change + */ + hmp_data.multiplier = hmp_period_tofrom_sysfs(LOAD_AVG_PERIOD); + hmp_attr_add("load_avg_period_ms", + &hmp_data.multiplier, + hmp_period_tofrom_sysfs, + hmp_period_tofrom_sysfs, + NULL, + 0); +#endif +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + /* default frequency-invariant scaling ON */ + hmp_data.freqinvar_load_scale_enabled = 1; + hmp_attr_add("frequency_invariant_load_scale", + &hmp_data.freqinvar_load_scale_enabled, + NULL, + hmp_toggle_from_sysfs, + NULL, + 0); +#endif +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING + hmp_attr_add("packing_enable", + &hmp_packing_enabled, + NULL, + hmp_toggle_from_sysfs, + NULL, + 0); + hmp_attr_add("packing_limit", + &hmp_full_threshold, + NULL, + hmp_packing_from_sysfs, + NULL, + 0); +#endif + hmp_data.attr_group.name = "hmp"; + hmp_data.attr_group.attrs = hmp_data.attributes; + ret = sysfs_create_group(kernel_kobj, + &hmp_data.attr_group); + return 0; +} +late_initcall(hmp_attr_init); + +/* + * return the load of the lowest-loaded CPU in a given HMP domain + * min_cpu optionally points to an int to receive the CPU. + * affinity optionally points to a cpumask containing the + * CPUs to be considered. note: + * + min_cpu = NR_CPUS only if no CPUs are in the set of + * affinity && hmp_domain cpus + * + min_cpu will always otherwise equal one of the CPUs in + * the hmp domain + * + when more than one CPU has the same load, the one which + * is least-recently-disturbed by an HMP migration will be + * selected + * + if all CPUs are equally loaded or idle and the times are + * all the same, the first in the set will be used + * + if affinity is not set, cpu_online_mask is used + */ +static inline unsigned int hmp_domain_min_load(struct hmp_domain *hmpd, + int *min_cpu, struct cpumask *affinity) +{ + int cpu; + int min_cpu_runnable_temp = NR_CPUS; + u64 min_target_last_migration = ULLONG_MAX; + u64 curr_last_migration; + unsigned long min_runnable_load = INT_MAX; + unsigned long contrib; + struct sched_avg *avg; + struct cpumask temp_cpumask; + /* + * only look at CPUs allowed if specified, + * otherwise look at all online CPUs in the + * right HMP domain + */ + cpumask_and(&temp_cpumask, &hmpd->cpus, affinity ? affinity : cpu_online_mask); + + for_each_cpu_mask(cpu, temp_cpumask) { + avg = &cpu_rq(cpu)->avg; + /* used for both up and down migration */ + curr_last_migration = avg->hmp_last_up_migration ? + avg->hmp_last_up_migration : avg->hmp_last_down_migration; + + contrib = avg->load_avg_ratio; + /* + * Consider a runqueue completely busy if there is any load + * on it. Definitely not the best for overall fairness, but + * does well in typical Android use cases. + */ + if (contrib) + contrib = 1023; + + if ((contrib < min_runnable_load) || + (contrib == min_runnable_load && + curr_last_migration < min_target_last_migration)) { + /* + * if the load is the same target the CPU with + * the longest time since a migration. + * This is to spread migration load between + * members of a domain more evenly when the + * domain is fully loaded + */ + min_runnable_load = contrib; + min_cpu_runnable_temp = cpu; + min_target_last_migration = curr_last_migration; + } + } + + if (min_cpu) + *min_cpu = min_cpu_runnable_temp; + + return min_runnable_load; +} + +/* + * Calculate the task starvation + * This is the ratio of actually running time vs. runnable time. + * If the two are equal the task is getting the cpu time it needs or + * it is alone on the cpu and the cpu is fully utilized. + */ +static inline unsigned int hmp_task_starvation(struct sched_entity *se) +{ + u32 starvation; + + starvation = se->avg.usage_avg_sum * scale_load_down(NICE_0_LOAD); + starvation /= (se->avg.runnable_avg_sum + 1); + + return scale_load(starvation); +} + +static inline unsigned int hmp_offload_down(int cpu, struct sched_entity *se) +{ + int min_usage; + int dest_cpu = NR_CPUS; + + if (hmp_cpu_is_slowest(cpu)) + return NR_CPUS; + + /* Is there an idle CPU in the current domain */ + min_usage = hmp_domain_min_load(hmp_cpu_domain(cpu), NULL, NULL); + if (min_usage == 0) { + trace_sched_hmp_offload_abort(cpu, min_usage, "load"); + return NR_CPUS; + } + + /* Is the task alone on the cpu? */ + if (cpu_rq(cpu)->cfs.h_nr_running < 2) { + trace_sched_hmp_offload_abort(cpu, + cpu_rq(cpu)->cfs.h_nr_running, "nr_running"); + return NR_CPUS; + } + + /* Is the task actually starving? */ + /* >=25% ratio running/runnable = starving */ + if (hmp_task_starvation(se) > 768) { + trace_sched_hmp_offload_abort(cpu, hmp_task_starvation(se), + "starvation"); + return NR_CPUS; + } + + /* Does the slower domain have any idle CPUs? */ + min_usage = hmp_domain_min_load(hmp_slower_domain(cpu), &dest_cpu, + tsk_cpus_allowed(task_of(se))); + + if (min_usage == 0) { + trace_sched_hmp_offload_succeed(cpu, dest_cpu); + return dest_cpu; + } else + trace_sched_hmp_offload_abort(cpu,min_usage,"slowdomain"); + + return NR_CPUS; +} +#endif /* CONFIG_SCHED_HMP */ + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -4553,6 +5592,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (p->nr_cpus_allowed == 1) return prev_cpu; +#ifdef CONFIG_SCHED_HMP + /* always put non-kernel forking tasks on a big domain */ + if (unlikely(sd_flag & SD_BALANCE_FORK) && hmp_task_should_forkboost(p)) { + new_cpu = hmp_select_faster_cpu(p, prev_cpu); + if (new_cpu != NR_CPUS) { + hmp_next_up_delay(&p->se, new_cpu); + return new_cpu; + } + /* failed to perform HMP fork balance, use normal balance */ + new_cpu = cpu; + } +#endif + if (sd_flag & SD_BALANCE_WAKE) want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); @@ -4620,9 +5672,49 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f unlock: rcu_read_unlock(); +#ifdef CONFIG_SCHED_HMP + prev_cpu = task_cpu(p); + + if (hmp_up_migration(prev_cpu, &new_cpu, &p->se)) { + hmp_next_up_delay(&p->se, new_cpu); + trace_sched_hmp_migrate(p, new_cpu, HMP_MIGRATE_WAKEUP); + return new_cpu; + } + if (hmp_down_migration(prev_cpu, &p->se)) { +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING + new_cpu = hmp_best_little_cpu(p, prev_cpu); +#else + new_cpu = hmp_select_slower_cpu(p, prev_cpu); +#endif + /* + * we might have no suitable CPU + * in which case new_cpu == NR_CPUS + */ + if (new_cpu < NR_CPUS && new_cpu != prev_cpu) { + hmp_next_down_delay(&p->se, new_cpu); + trace_sched_hmp_migrate(p, new_cpu, HMP_MIGRATE_WAKEUP); + return new_cpu; + } + } + /* Make sure that the task stays in its previous hmp domain */ + if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus)) + return prev_cpu; +#endif + return new_cpu; } +#ifdef CONFIG_SCHED_HMP +#ifdef CONFIG_NO_HZ_COMMON +static int nohz_test_cpu(int cpu); +#else +static inline int nohz_test_cpu(int cpu) +{ + return 0; +} +#endif +#endif + /* * Called immediately before a task is migrated to a new cpu; task_cpu(p) and * cfs_rq_of(p) references at time of call are still valid and identify the @@ -4642,6 +5734,27 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu) * be negative here since on-rq tasks have decay-count == 0. */ if (se->avg.decay_count) { +#ifdef CONFIG_SCHED_HMP + /* + * If we migrate a sleeping task away from a CPU + * which has the tick stopped, then both the clock_task + * and decay_counter will be out of date for that CPU + * and we will not decay load correctly. + */ + if (!se->on_rq && nohz_test_cpu(task_cpu(p))) { + struct rq *rq = cpu_rq(task_cpu(p)); + unsigned long flags; + /* + * Current CPU cannot be holding rq->lock in this + * circumstance, but another might be. We must hold + * rq->lock before we go poking around in its clocks + */ + raw_spin_lock_irqsave(&rq->lock, flags); + update_rq_clock(rq); + update_cfs_rq_blocked_load(cfs_rq, 0); + raw_spin_unlock_irqrestore(&rq->lock, flags); + } +#endif /* CONFIG_SCHED_HMP */ se->avg.decay_count = -__synchronize_entity_decay(se); atomic_long_add(se->avg.load_avg_contrib, &cfs_rq->removed_load); @@ -6924,6 +8037,10 @@ update_next_balance(struct sched_domain *sd, int cpu_busy, unsigned long *next_b *next_balance = next; } +#ifdef CONFIG_SCHED_HMP +static unsigned int hmp_idle_pull(int this_cpu); +#endif + /* * idle_balance is called by schedule() if this_cpu is about to become * idle. Attempts to pull tasks from other CPUs. @@ -6998,7 +8115,10 @@ static int idle_balance(struct rq *this_rq) break; } rcu_read_unlock(); - +#ifdef CONFIG_SCHED_HMP + if (!pulled_task) + pulled_task = hmp_idle_pull(this_cpu); +#endif raw_spin_lock(&this_rq->lock); if (curr_cost > this_rq->max_idle_balance_cost) @@ -7029,24 +8149,128 @@ out: return pulled_task; } -/* - * active_load_balance_cpu_stop is run by cpu stopper. It pushes - * running tasks off the busiest CPU onto idle CPUs. It requires at - * least 1 task to be running on each physical CPU where possible, and - * avoids physical / logical imbalances. - */ -static int active_load_balance_cpu_stop(void *data) +static int active_load_balance_fair(void *data) +{ + struct rq *busiest_rq = data; + int busiest_cpu = cpu_of(busiest_rq); + int target_cpu = busiest_rq->push_cpu; + struct rq *target_rq = cpu_rq(target_cpu); + struct task_struct *tsk = NULL; + struct lb_env env = { + .dst_cpu = target_cpu, + .dst_rq = target_rq, + .src_cpu = busiest_rq->cpu, + .src_rq = busiest_rq, + .idle = CPU_IDLE, + }; + + /* Search for an sd spanning us and the target CPU. */ + rcu_read_lock(); + for_each_domain(target_cpu, env.sd) + if ((env.sd->flags & SD_LOAD_BALANCE) && + cpumask_test_cpu(busiest_cpu, sched_domain_span(env.sd))) + break; + + /* Task could not be migrated between these two CPUs */ + if (unlikely(env.sd == NULL)) + goto out_done; + + /* FAIR active balance */ + schedstat_inc(env.sd, alb_count); + + tsk = detach_one_task(&env); + if (tsk) + schedstat_inc(env.sd, alb_pushed); + else + schedstat_inc(env.sd, alb_failed); + +out_done: + + rcu_read_unlock(); + + busiest_rq->active_balance = 0; + raw_spin_unlock(&busiest_rq->lock); + + if (tsk) + attach_one_task(target_rq, tsk); + + local_irq_enable(); + + return 0; +} + +#ifdef CONFIG_SCHED_HMP +static int move_specific_task(struct lb_env *env, struct task_struct *pm); +static int active_load_balance_hmp(void *data) +{ + struct rq *busiest_rq = data; + int busiest_cpu = cpu_of(busiest_rq); + int target_cpu = busiest_rq->push_cpu; + struct rq *target_rq = cpu_rq(target_cpu); + struct task_struct *tsk = busiest_rq->migrate_task; + struct lb_env env = { + .dst_cpu = target_cpu, + .dst_rq = target_rq, + .src_cpu = busiest_rq->cpu, + .src_rq = busiest_rq, + .idle = CPU_IDLE, + }; + + /* move a task from busiest_rq to target_rq */ + double_lock_balance(busiest_rq, target_rq); + + /* Search for an sd spanning us and the target CPU. */ + rcu_read_lock(); + for_each_domain(target_cpu, env.sd) + if (cpumask_test_cpu(busiest_cpu, sched_domain_span(env.sd))) + break; + + /* Task could not be migrated between these two CPUs */ + if (unlikely(env.sd == NULL)) + goto out_done; + + schedstat_inc(env.sd, alb_count); + + if (move_specific_task(&env, tsk)) + schedstat_inc(env.sd, alb_pushed); + else + schedstat_inc(env.sd, alb_failed); + +out_done: + + rcu_read_unlock(); + double_unlock_balance(busiest_rq, target_rq); + + busiest_rq->active_balance = 0; + raw_spin_unlock(&busiest_rq->lock); + + /* + * Reset active balance request and release the task + * + * NOTE: The put_task_struct() is done while the runqueue spinlock is + * held, but put_task_struct() can also cause a reschedule causing the + * runqueue lock to be acquired recursively. + * To avoid this race condition, we keep the put_task_struct() outside + * the runqueue spinlock. + */ + put_task_struct(tsk); + + local_irq_enable(); + + return 0; +} +#endif + +static int __do_active_load_balance_cpu_stop(void *data, bool hmp_active_balance) { struct rq *busiest_rq = data; int busiest_cpu = cpu_of(busiest_rq); int target_cpu = busiest_rq->push_cpu; struct rq *target_rq = cpu_rq(target_cpu); - struct sched_domain *sd; - struct task_struct *p = NULL; raw_spin_lock_irq(&busiest_rq->lock); - /* make sure the requested cpu hasn't gone down in the meantime */ + /* Make sure the requested cpu hasn't gone down in the meantime */ if (unlikely(busiest_cpu != smp_processor_id() || !busiest_rq->active_balance)) goto out_unlock; @@ -7062,43 +8286,41 @@ static int active_load_balance_cpu_stop(void *data) */ BUG_ON(busiest_rq == target_rq); - /* Search for an sd spanning us and the target CPU. */ - rcu_read_lock(); - for_each_domain(target_cpu, sd) { - if ((sd->flags & SD_LOAD_BALANCE) && - cpumask_test_cpu(busiest_cpu, sched_domain_span(sd))) - break; - } - - if (likely(sd)) { - struct lb_env env = { - .sd = sd, - .dst_cpu = target_cpu, - .dst_rq = target_rq, - .src_cpu = busiest_rq->cpu, - .src_rq = busiest_rq, - .idle = CPU_IDLE, - }; +#ifdef CONFIG_SCHED_HMP + /* HMP active balance (using double locking) */ + if (hmp_active_balance) + return active_load_balance_hmp(data); +#endif - schedstat_inc(sd, alb_count); + /* FAIR active balance (using attach/detach) */ + return active_load_balance_fair(data); - p = detach_one_task(&env); - if (p) - schedstat_inc(sd, alb_pushed); - else - schedstat_inc(sd, alb_failed); - } - rcu_read_unlock(); out_unlock: + busiest_rq->active_balance = 0; + raw_spin_unlock(&busiest_rq->lock); - if (p) - attach_one_task(target_rq, p); +#ifdef CONFIG_SCHED_HMP + if (hmp_active_balance) + put_task_struct(busiest_rq->migrate_task); +#endif local_irq_enable(); return 0; + +} + +/* + * active_load_balance_cpu_stop is run by cpu stopper. It pushes + * running tasks off the busiest CPU onto idle CPUs. It requires at + * least 1 task to be running on each physical CPU where possible, and + * avoids physical / logical imbalances. + */ +static int active_load_balance_cpu_stop(void *data) +{ + return __do_active_load_balance_cpu_stop(data, false); } static inline int on_null_domain(struct rq *rq) @@ -7119,12 +8341,79 @@ static struct { unsigned long next_balance; /* in jiffy units */ } nohz ____cacheline_aligned; -static inline int find_new_ilb(void) +#ifdef CONFIG_SCHED_HMP +static int nohz_test_cpu(int cpu) +{ + return cpumask_test_cpu(cpu, nohz.idle_cpus_mask); +} +#endif + +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING +/* + * Decide if the tasks on the busy CPUs in the littlest domain would benefit + * from an idle balance + * + * When packing is enabled, only enforce this behaviour when we are not in the + * smallest domain - there we idle balance whenever a CPU is over the + * up_threshold regardless of tasks in case one needs to be moved. + */ +static int hmp_packing_ilb_needed(int cpu, int ilb_needed) +{ + struct hmp_domain *hmp; + /* allow previous decision on non-slowest domain */ + if (!hmp_cpu_is_slowest(cpu)) + return ilb_needed; + + /* if disabled, use normal ILB behaviour */ + if (!hmp_packing_enabled) + return ilb_needed; + + hmp = hmp_cpu_domain(cpu); + for_each_cpu_and(cpu, &hmp->cpus, nohz.idle_cpus_mask) { + /* only idle balance if a CPU is loaded over threshold */ + if (cpu_rq(cpu)->avg.load_avg_ratio > hmp_full_threshold) + return 1; + } + return 0; +} +#endif + +DEFINE_PER_CPU(cpumask_var_t, ilb_tmpmask); + +static inline int find_new_ilb(int call_cpu) { int ilb = cpumask_first(nohz.idle_cpus_mask); +#ifdef CONFIG_SCHED_HMP + int ilb_needed = 0; + int cpu; + struct cpumask* tmp = per_cpu(ilb_tmpmask, smp_processor_id()); + + /* restrict nohz balancing to occur in the same hmp domain */ + ilb = cpumask_first_and(nohz.idle_cpus_mask, + &((struct hmp_domain *)hmp_cpu_domain(call_cpu))->cpus); + + /* check to see if it's necessary within this domain */ + cpumask_andnot(tmp, + &((struct hmp_domain *)hmp_cpu_domain(call_cpu))->cpus, + nohz.idle_cpus_mask); + for_each_cpu(cpu, tmp) { + if (cpu_rq(cpu)->nr_running > 1) { + ilb_needed = 1; + break; + } + } +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING + if (ilb < nr_cpu_ids) + ilb_needed = hmp_packing_ilb_needed(ilb, ilb_needed); +#endif + + if (ilb_needed && ilb < nr_cpu_ids && idle_cpu(ilb)) + return ilb; +#else if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb; +#endif return nr_cpu_ids; } @@ -7134,13 +8423,13 @@ static inline int find_new_ilb(void) * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle * CPU (if there is one). */ -static void nohz_balancer_kick(void) +static void nohz_balancer_kick(int cpu) { int ilb_cpu; nohz.next_balance++; - ilb_cpu = find_new_ilb(); + ilb_cpu = find_new_ilb(cpu); if (ilb_cpu >= nr_cpu_ids) return; @@ -7435,6 +8724,18 @@ static inline int nohz_kick_needed(struct rq *rq) if (time_before(now, nohz.next_balance)) return 0; +#ifdef CONFIG_SCHED_HMP + /* + * Bail out if there are no nohz CPUs in our + * HMP domain, since we will move tasks between + * domains through wakeup and force balancing + * as necessary based upon task load. + */ + if (cpumask_first_and(nohz.idle_cpus_mask, + &((struct hmp_domain *)hmp_cpu_domain(cpu))->cpus) >= nr_cpu_ids) + return 0; +#endif + if (rq->nr_running >= 2) goto need_kick; @@ -7467,6 +8768,456 @@ need_kick: static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { } #endif +#ifdef CONFIG_SCHED_HMP +static unsigned int hmp_task_eligible_for_up_migration(struct sched_entity *se) +{ + /* below hmp_up_threshold, never eligible */ + if (se->avg.load_avg_ratio < hmp_up_threshold) + return 0; + return 1; +} + +/* Check if task should migrate to a faster cpu */ +static unsigned int hmp_up_migration(int cpu, int *target_cpu, struct sched_entity *se) +{ + struct task_struct *p = task_of(se); + int temp_target_cpu; + u64 now; + + if (hmp_cpu_is_fastest(cpu)) + return 0; + +#ifdef CONFIG_SCHED_HMP_PRIO_FILTER + /* Filter by task priority */ + if (p->prio >= hmp_up_prio) + return 0; +#endif + if (!hmp_task_eligible_for_up_migration(se)) + return 0; + + /* Let the task load settle before doing another up migration */ + /* hack - always use clock from first online CPU */ + now = cpu_rq(cpumask_first(cpu_online_mask))->clock_task; + if (((now - se->avg.hmp_last_up_migration) >> 10) + < hmp_next_up_threshold) + return 0; + + /* hmp_domain_min_load only returns 0 for an + * idle CPU or 1023 for any partly-busy one. + * Be explicit about requirement for an idle CPU. + */ + if (hmp_domain_min_load(hmp_faster_domain(cpu), &temp_target_cpu, + tsk_cpus_allowed(p)) == 0 && temp_target_cpu != NR_CPUS) { + if(target_cpu) + *target_cpu = temp_target_cpu; + return 1; + } + return 0; +} + +/* Check if task should migrate to a slower cpu */ +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) +{ + struct task_struct *p = task_of(se); + u64 now; + + if (hmp_cpu_is_slowest(cpu)) { +#ifdef CONFIG_SCHED_HMP_LITTLE_PACKING + if(hmp_packing_enabled) + return 1; + else +#endif + return 0; + } + +#ifdef CONFIG_SCHED_HMP_PRIO_FILTER + /* Filter by task priority */ + if ((p->prio >= hmp_up_prio) && + cpumask_intersects(&hmp_slower_domain(cpu)->cpus, + tsk_cpus_allowed(p))) { + return 1; + } +#endif + + /* Let the task load settle before doing another down migration */ + /* hack - always use clock from first online CPU */ + now = cpu_rq(cpumask_first(cpu_online_mask))->clock_task; + if (((now - se->avg.hmp_last_down_migration) >> 10) + < hmp_next_down_threshold) + return 0; + + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, + tsk_cpus_allowed(p)) + && se->avg.load_avg_ratio < hmp_down_threshold) { + return 1; + } + return 0; +} + +/* + * hmp_can_migrate_task - may task p from runqueue rq be migrated to this_cpu? + * Ideally this function should be merged with can_migrate_task() to avoid + * redundant code. + */ +static int hmp_can_migrate_task(struct task_struct *p, struct lb_env *env) +{ + int tsk_cache_hot = 0; + + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed + */ + if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) { + schedstat_inc(p, se.statistics.nr_failed_migrations_affine); + return 0; + } + env->flags &= ~LBF_ALL_PINNED; + + if (task_running(env->src_rq, p)) { + schedstat_inc(p, se.statistics.nr_failed_migrations_running); + return 0; + } + + /* + * Aggressive migration if: + * 1) task is cache cold, or + * 2) too many balance attempts have failed. + */ + + tsk_cache_hot = task_hot(p, env); + if (!tsk_cache_hot || + env->sd->nr_balance_failed > env->sd->cache_nice_tries) { +#ifdef CONFIG_SCHEDSTATS + if (tsk_cache_hot) { + schedstat_inc(env->sd, lb_hot_gained[env->idle]); + schedstat_inc(p, se.statistics.nr_forced_migrations); + } +#endif + return 1; + } + + return 1; +} + +/* + * move_task - move a task from one runqueue to another runqueue. + * Both runqueues must be locked. + */ +static void move_task(struct task_struct *p, struct lb_env *env) +{ + deactivate_task(env->src_rq, p, 0); + set_task_cpu(p, env->dst_cpu); + activate_task(env->dst_rq, p, 0); + check_preempt_curr(env->dst_rq, p, 0); +} + +/* + * move_specific_task tries to move a specific task. + * Returns 1 if successful and 0 otherwise. + * Called with both runqueues locked. + */ +static int move_specific_task(struct lb_env *env, struct task_struct *pm) +{ + struct task_struct *p, *n; + + list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) { + if (throttled_lb_pair(task_group(p), + env->src_rq->cpu, env->dst_cpu)) + continue; + + if (!hmp_can_migrate_task(p, env)) + continue; + /* Check if we found the right task */ + if (p != pm) + continue; + + move_task(p, env); + /* + * Right now, this is only the third place move_task() + * is called, so we can safely collect move_task() + * stats here rather than inside move_task(). + */ + schedstat_inc(env->sd, lb_gained[env->idle]); + return 1; + } + return 0; +} + +/* + * hmp_active_task_migration_cpu_stop is run by cpu stopper and used to + * migrate a specific task from one runqueue to another. + * hmp_force_up_migration uses this to push a currently running task + * off a runqueue. hmp_idle_pull uses this to pull a currently + * running task to an idle runqueue. + * Reuses __do_active_load_balance_cpu_stop to actually do the work. + */ +static int hmp_active_task_migration_cpu_stop(void *data) +{ + return __do_active_load_balance_cpu_stop(data, true); +} + +/* + * Move task in a runnable state to another CPU. + * + * Tailored on 'active_load_balance_cpu_stop' with slight + * modification to locking and pre-transfer checks. Note + * rq->lock must be held before calling. + */ +static void hmp_migrate_runnable_task(struct rq *rq) +{ + struct sched_domain *sd; + int src_cpu = cpu_of(rq); + struct rq *src_rq = rq; + int dst_cpu = rq->push_cpu; + struct rq *dst_rq = cpu_rq(dst_cpu); + struct task_struct *p = rq->migrate_task; + /* + * One last check to make sure nobody else is playing + * with the source rq. + */ + if (src_rq->active_balance) + goto out; + + if (src_rq->nr_running <= 1) + goto out; + + if (task_rq(p) != src_rq) + goto out; + /* + * Not sure if this applies here but one can never + * be too cautious + */ + BUG_ON(src_rq == dst_rq); + + double_lock_balance(src_rq, dst_rq); + + rcu_read_lock(); + for_each_domain(dst_cpu, sd) { + if (cpumask_test_cpu(src_cpu, sched_domain_span(sd))) + break; + } + + if (likely(sd)) { + struct lb_env env = { + .sd = sd, + .dst_cpu = dst_cpu, + .dst_rq = dst_rq, + .src_cpu = src_cpu, + .src_rq = src_rq, + .idle = CPU_IDLE, + }; + + schedstat_inc(sd, alb_count); + + if (move_specific_task(&env, p)) + schedstat_inc(sd, alb_pushed); + else + schedstat_inc(sd, alb_failed); + } + + rcu_read_unlock(); + double_unlock_balance(src_rq, dst_rq); +out: + put_task_struct(p); +} + +static DEFINE_SPINLOCK(hmp_force_migration); + +/* + * hmp_force_up_migration checks runqueues for tasks that need to + * be actively migrated to a faster cpu. + */ +static void hmp_force_up_migration(int this_cpu) +{ + int cpu, target_cpu; + struct sched_entity *curr, *orig; + struct rq *target; + unsigned long flags; + unsigned int force, got_target; + struct task_struct *p; + + if (!spin_trylock(&hmp_force_migration)) + return; + for_each_online_cpu(cpu) { + force = 0; + got_target = 0; + target = cpu_rq(cpu); + raw_spin_lock_irqsave(&target->lock, flags); + curr = target->cfs.curr; + if (!curr || target->active_balance) { + raw_spin_unlock_irqrestore(&target->lock, flags); + continue; + } + if (!entity_is_task(curr)) { + struct cfs_rq *cfs_rq; + + cfs_rq = group_cfs_rq(curr); + while (cfs_rq) { + curr = cfs_rq->curr; + cfs_rq = group_cfs_rq(curr); + } + } + orig = curr; + curr = hmp_get_heaviest_task(curr, -1); + if (!curr) { + raw_spin_unlock_irqrestore(&target->lock, flags); + continue; + } + p = task_of(curr); + if (hmp_up_migration(cpu, &target_cpu, curr)) { + cpu_rq(target_cpu)->wake_for_idle_pull = 1; + raw_spin_unlock_irqrestore(&target->lock, flags); + spin_unlock(&hmp_force_migration); + smp_send_reschedule(target_cpu); + return; + } + if (!got_target) { + /* + * For now we just check the currently running task. + * Selecting the lightest task for offloading will + * require extensive book keeping. + */ + curr = hmp_get_lightest_task(orig, 1); + p = task_of(curr); + target->push_cpu = hmp_offload_down(cpu, curr); + if (target->push_cpu < NR_CPUS) { + get_task_struct(p); + target->migrate_task = p; + got_target = 1; + trace_sched_hmp_migrate(p, target->push_cpu, HMP_MIGRATE_OFFLOAD); + hmp_next_down_delay(&p->se, target->push_cpu); + } + } + /* + * We have a target with no active_balance. If the task + * is not currently running move it, otherwise let the + * CPU stopper take care of it. + */ + if (got_target) { + if (!task_running(target, p)) { + trace_sched_hmp_migrate_force_running(p, 0); + hmp_migrate_runnable_task(target); + } else { + target->active_balance = 1; + force = 1; + } + } + + raw_spin_unlock_irqrestore(&target->lock, flags); + + if (force) + stop_one_cpu_nowait(cpu_of(target), + hmp_active_task_migration_cpu_stop, + target, &target->active_balance_work); + } + spin_unlock(&hmp_force_migration); +} + +/* + * hmp_idle_pull looks at little domain runqueues to see + * if a task should be pulled. + * + * Reuses hmp_force_migration spinlock. + * + */ +static unsigned int hmp_idle_pull(int this_cpu) +{ + int cpu; + struct sched_entity *curr, *orig; + struct hmp_domain *hmp_domain = NULL; + struct rq *target = NULL, *rq; + unsigned long flags, ratio = 0; + unsigned int force = 0; + struct task_struct *p = NULL; + + if (!hmp_cpu_is_slowest(this_cpu)) + hmp_domain = hmp_slower_domain(this_cpu); + if (!hmp_domain) + return 0; + + if (!spin_trylock(&hmp_force_migration)) + return 0; + + /* first select a task */ + for_each_cpu(cpu, &hmp_domain->cpus) { + rq = cpu_rq(cpu); + raw_spin_lock_irqsave(&rq->lock, flags); + curr = rq->cfs.curr; + if (!curr) { + raw_spin_unlock_irqrestore(&rq->lock, flags); + continue; + } + if (!entity_is_task(curr)) { + struct cfs_rq *cfs_rq; + + cfs_rq = group_cfs_rq(curr); + while (cfs_rq) { + curr = cfs_rq->curr; + if (!entity_is_task(curr)) + cfs_rq = group_cfs_rq(curr); + else + cfs_rq = NULL; + } + } + orig = curr; + curr = hmp_get_heaviest_task(curr, this_cpu); + /* check if heaviest eligible task on this + * CPU is heavier than previous task + */ + if (curr && hmp_task_eligible_for_up_migration(curr) && + curr->avg.load_avg_ratio > ratio && + cpumask_test_cpu(this_cpu, + tsk_cpus_allowed(task_of(curr)))) { + p = task_of(curr); + target = rq; + ratio = curr->avg.load_avg_ratio; + } + raw_spin_unlock_irqrestore(&rq->lock, flags); + } + + if (!p) + goto done; + + /* now we have a candidate */ + raw_spin_lock_irqsave(&target->lock, flags); + if (!target->active_balance && task_rq(p) == target) { + get_task_struct(p); + target->push_cpu = this_cpu; + target->migrate_task = p; + trace_sched_hmp_migrate(p, target->push_cpu, HMP_MIGRATE_IDLE_PULL); + hmp_next_up_delay(&p->se, target->push_cpu); + /* + * if the task isn't running move it right away. + * Otherwise setup the active_balance mechanic and let + * the CPU stopper do its job. + */ + if (!task_running(target, p)) { + trace_sched_hmp_migrate_idle_running(p, 0); + hmp_migrate_runnable_task(target); + } else { + target->active_balance = 1; + force = 1; + } + } + raw_spin_unlock_irqrestore(&target->lock, flags); + + if (force) { + /* start timer to keep us awake */ + hmp_cpu_keepalive_trigger(); + stop_one_cpu_nowait(cpu_of(target), + hmp_active_task_migration_cpu_stop, + target, &target->active_balance_work); + } +done: + spin_unlock(&hmp_force_migration); + return force; +} + +#else +static void hmp_force_up_migration(int this_cpu) { } +#endif /* CONFIG_SCHED_HMP */ + /* * run_rebalance_domains is triggered when needed from the scheduler tick. * Also triggered for nohz idle balancing (with nohz_balancing_kick set). @@ -7477,6 +9228,20 @@ static void run_rebalance_domains(struct softirq_action *h) enum cpu_idle_type idle = this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE; +#ifdef CONFIG_SCHED_HMP + /* shortcut for hmp idle pull wakeups */ + if (unlikely(this_rq->wake_for_idle_pull)) { + this_rq->wake_for_idle_pull = 0; + if (hmp_idle_pull(cpu_of(this_rq))) { + /* break out unless running nohz idle as well */ + if (idle != CPU_IDLE) + return; + } + } +#endif + + hmp_force_up_migration(cpu_of(this_rq)); + rebalance_domains(this_rq, idle); /* @@ -7500,12 +9265,15 @@ void trigger_load_balance(struct rq *rq) raise_softirq(SCHED_SOFTIRQ); #ifdef CONFIG_NO_HZ_COMMON if (nohz_kick_needed(rq)) - nohz_balancer_kick(); + nohz_balancer_kick(cpu_of(rq)); #endif } static void rq_online_fair(struct rq *rq) { +#ifdef CONFIG_SCHED_HMP + hmp_online_cpu(rq->cpu); +#endif update_sysctl(); update_runtime_enabled(rq); @@ -7513,6 +9281,9 @@ static void rq_online_fair(struct rq *rq) static void rq_offline_fair(struct rq *rq) { +#ifdef CONFIG_SCHED_HMP + hmp_offline_cpu(rq->cpu); +#endif update_sysctl(); /* Ensure any throttled groups are reachable by pick_next_task */ @@ -7996,6 +9767,139 @@ __init void init_sched_fair_class(void) zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); cpu_notifier(sched_ilb_notifier, 0); #endif + +#ifdef CONFIG_SCHED_HMP + hmp_cpu_mask_setup(); +#endif #endif /* SMP */ } + +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +static u32 cpufreq_calc_scale(u32 min, u32 max, u32 curr) +{ + u32 result = curr / max; + return result; +} + +/* Called when the CPU Frequency is changed. + * Once for each CPU. + */ +static int cpufreq_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_freqs *freq = data; + int cpu = freq->cpu; + struct cpufreq_extents *extents; + + if (freq->flags & CPUFREQ_CONST_LOOPS) + return NOTIFY_OK; + + if (val != CPUFREQ_POSTCHANGE) + return NOTIFY_OK; + + /* if dynamic load scale is disabled, set the load scale to 1.0 */ + if (!hmp_data.freqinvar_load_scale_enabled) { + freq_scale[cpu].curr_scale = 1024; + return NOTIFY_OK; + } + + extents = &freq_scale[cpu]; + if (extents->flags & SCHED_LOAD_FREQINVAR_SINGLEFREQ) { + /* If our governor was recognised as a single-freq governor, + * use 1.0 + */ + extents->curr_scale = 1024; + } else { + extents->curr_scale = cpufreq_calc_scale(extents->min, + extents->max, freq->new); + } + + return NOTIFY_OK; +} + +/* Called when the CPUFreq governor is changed. + * Only called for the CPUs which are actually changed by the + * userspace. + */ +static int cpufreq_policy_callback(struct notifier_block *nb, + unsigned long event, void *data) +{ + struct cpufreq_policy *policy = data; + struct cpufreq_extents *extents; + int cpu, singleFreq = 0; + static const char performance_governor[] = "performance"; + static const char powersave_governor[] = "powersave"; + + if (event == CPUFREQ_START) + return 0; + + if (event != CPUFREQ_INCOMPATIBLE) + return 0; + + /* CPUFreq governors do not accurately report the range of + * CPU Frequencies they will choose from. + * We recognise performance and powersave governors as + * single-frequency only. + */ + if (!strncmp(policy->governor->name, performance_governor, + strlen(performance_governor)) || + !strncmp(policy->governor->name, powersave_governor, + strlen(powersave_governor))) + singleFreq = 1; + + /* Make sure that all CPUs impacted by this policy are + * updated since we will only get a notification when the + * user explicitly changes the policy on a CPU. + */ + for_each_cpu(cpu, policy->cpus) { + extents = &freq_scale[cpu]; + extents->max = policy->max >> SCHED_FREQSCALE_SHIFT; + extents->min = policy->min >> SCHED_FREQSCALE_SHIFT; + if (!hmp_data.freqinvar_load_scale_enabled) { + extents->curr_scale = 1024; + } else if (singleFreq) { + extents->flags |= SCHED_LOAD_FREQINVAR_SINGLEFREQ; + extents->curr_scale = 1024; + } else { + extents->flags &= ~SCHED_LOAD_FREQINVAR_SINGLEFREQ; + extents->curr_scale = cpufreq_calc_scale(extents->min, + extents->max, policy->cur); + } + } + + return 0; +} + +static struct notifier_block cpufreq_notifier = { + .notifier_call = cpufreq_callback, +}; +static struct notifier_block cpufreq_policy_notifier = { + .notifier_call = cpufreq_policy_callback, +}; + +static int __init register_sched_cpufreq_notifier(void) +{ + int ret = 0; + + /* init safe defaults since there are no policies at registration */ + for (ret = 0; ret < CONFIG_NR_CPUS; ret++) { + /* safe defaults */ + freq_scale[ret].max = 1024; + freq_scale[ret].min = 1024; + freq_scale[ret].curr_scale = 1024; + } + + pr_info("sched: registering cpufreq notifiers for scale-invariant loads\n"); + ret = cpufreq_register_notifier(&cpufreq_policy_notifier, + CPUFREQ_POLICY_NOTIFIER); + + if (ret != -EINVAL) + ret = cpufreq_register_notifier(&cpufreq_notifier, + CPUFREQ_TRANSITION_NOTIFIER); + + return ret; +} + +core_initcall(register_sched_cpufreq_notifier); +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f698089e10ca..9d09f277d915 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -219,6 +219,7 @@ struct task_group { #ifdef CONFIG_SMP atomic_long_t load_avg; atomic_t runnable_avg; + atomic_t usage_avg; #endif #endif @@ -352,6 +353,7 @@ struct cfs_rq { #ifdef CONFIG_FAIR_GROUP_SCHED /* Required to track per-cpu representation of a task_group */ u32 tg_runnable_contrib; + u32 tg_usage_contrib; unsigned long tg_load_contrib; /* @@ -586,6 +588,10 @@ struct rq { int active_balance; int push_cpu; struct cpu_stop_work active_balance_work; +#ifdef CONFIG_SCHED_HMP + struct task_struct *migrate_task; + int wake_for_idle_pull; +#endif /* cpu of this runqueue: */ int cpu; int online; @@ -805,6 +811,12 @@ static inline unsigned int group_first_cpu(struct sched_group *group) extern int group_balance_cpu(struct sched_group *sg); +#ifdef CONFIG_SCHED_HMP +static LIST_HEAD(hmp_domains); +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain); +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu))) +#endif /* CONFIG_SCHED_HMP */ + #else static inline void sched_ttwu_pending(void) { } diff --git a/kernel/smp.c b/kernel/smp.c index f38a1e692259..ca0789f1755a 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -14,6 +14,8 @@ #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> +#define CREATE_TRACE_POINTS +#include <trace/events/smp.h> #include "smpboot.h" @@ -152,7 +154,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, if (cpu == smp_processor_id()) { local_irq_save(flags); + trace_smp_call_func_entry(func); func(info); + trace_smp_call_func_exit(func); local_irq_restore(flags); return 0; } @@ -187,8 +191,10 @@ static int generic_exec_single(int cpu, struct call_single_data *csd, * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ - if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) + if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) { + trace_smp_call_func_send(csd->func, cpu); arch_send_call_function_single_ipi(cpu); + } if (wait) csd_lock_wait(csd); @@ -250,7 +256,9 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline) } llist_for_each_entry_safe(csd, csd_next, entry, llist) { + trace_smp_call_func_entry(csd->func); csd->func(csd->info); + trace_smp_call_func_exit(csd->func); csd_unlock(csd); } @@ -277,6 +285,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info, int this_cpu; int err; + trace_smp_call_func_send(func, cpu); /* * prevent preemption and reschedule on another processor, * as well as CPU removal diff --git a/linaro/configs/big-LITTLE-MP.conf b/linaro/configs/big-LITTLE-MP.conf new file mode 100644 index 000000000000..fc1b64f8b090 --- /dev/null +++ b/linaro/configs/big-LITTLE-MP.conf @@ -0,0 +1,11 @@ +CONFIG_CGROUPS=y +CONFIG_CGROUP_SCHED=y +CONFIG_FAIR_GROUP_SCHED=y +CONFIG_NO_HZ=y +CONFIG_SCHED_MC=y +CONFIG_SCHED_HMP=y +CONFIG_HMP_FAST_CPU_MASK="" +CONFIG_HMP_SLOW_CPU_MASK="" +CONFIG_HMP_VARIABLE_SCALE=y +CONFIG_HMP_FREQUENCY_INVARIANT_SCALE=y +CONFIG_SCHED_HMP_LITTLE_PACKING=y |