author    | Ryan Harkin <ryan.harkin@linaro.org> | 2015-11-11 14:38:58 +0000
committer | Ryan Harkin <ryan.harkin@linaro.org> | 2015-11-11 14:39:15 +0000
commit    | f4ea0b713a154a4a4af7ea28eecc8ce82ef727db (patch)
tree      | 2387633545e7f1bf2b117eddfc3f718b0aba6e52
parent    | 57a4270035bc749057dcdac83a9d1b3307bed622 (diff)
parent    | 45637436056fd3a554596f04a653434ba11728b7 (diff)
Merge branch 'linux-linaro-lsk-v3.18-eas-test' into juno-easlsk-3.18-armlt-20151102-eas-test
Merged from repo:
git.linaro.org/kernel/linux-linaro-stable.git
Branch:
linux-linaro-lsk-v3.18-eas-test
Commit ID:
45637436056fd3a554596f04a653434ba11728b7
2015-10-08 Merge branch 'linaro/3.18/eas_debug' into linux-linaro-lsk-v3.18 [Alex Shi]
Signed-off-by: Ryan Harkin <ryan.harkin@linaro.org>
Conflicts:
drivers/cpufreq/Kconfig
include/linux/cpufreq.h
32 files changed, 5090 insertions, 295 deletions
diff --git a/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt b/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt new file mode 100644 index 000000000000..11216f09e596 --- /dev/null +++ b/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt @@ -0,0 +1,360 @@ +=========================================================== +Energy cost bindings for Energy Aware Scheduling +=========================================================== + +=========================================================== +1 - Introduction +=========================================================== + +This note specifies bindings required for energy-aware scheduling +(EAS)[1]. Historically, the scheduler's primary objective has been +performance. EAS aims to provide an alternative objective - energy +efficiency. EAS relies on a simple platform energy cost model to +guide scheduling decisions. The model only considers the CPU +subsystem. + +This note is aligned with the definition of the layout of physical +CPUs in the system as described in the ARM topology binding +description [2]. The concept is applicable to any system so long as +the cost model data is provided for those processing elements in +that system's topology that EAS is required to service. + +Processing elements refer to hardware threads, CPUs and clusters of +related CPUs in increasing order of hierarchy. + +EAS requires two key cost metrics - busy costs and idle costs. Busy +costs comprise a list of compute capacities for the processing +element in question and the corresponding power consumption at each +capacity. Idle costs comprise a list of power consumption values +for each idle state [C-state] that the processing element supports. +For a detailed description of these metrics, their derivation and +their use see [3]. + +These cost metrics are required for processing elements in all +scheduling domain levels that EAS is required to service. + +=========================================================== +2 - energy-costs node +=========================================================== + +Energy costs for the processing elements in scheduling domains that +EAS is required to service are defined in the energy-costs node, +which acts as a container for the actual per processing element cost +nodes. A single energy-costs node is required for a given system. + +- energy-costs node + + Usage: Required + + Description: The energy-costs node is a container node and + its sub-nodes describe costs for each processing element at + all scheduling domain levels that EAS is required to + service. + + Node name must be "energy-costs". + + The energy-costs node's parent node must be the cpus node. + + The energy-costs node's child nodes can be: + + - one or more cost nodes. + + Any other configuration is considered invalid. + +The energy-costs node can only contain a single type of child node +whose bindings are described in paragraph 4. + +=========================================================== +3 - energy-costs node child nodes naming convention +=========================================================== + +energy-costs child nodes must follow a naming convention where the +node name must be "thread-costN", "core-costN" or "cluster-costN" +depending on whether the costs in the node are for a thread, core or +cluster. N (where N = {0, 1, ...}) is the node number and has no +bearing on the OS' logical thread, core or cluster index.
+ +=========================================================== +4 - cost node bindings +=========================================================== + +Bindings for cost nodes are defined as follows: + +- cluster-cost node + + Description: must be declared within an energy-costs node. A + system can contain multiple clusters and each cluster + serviced by EAS must have a corresponding cluster-cost + node. + + The cluster-cost node name must be "cluster-costN" as + described in 3 above. + + A cluster-cost node must be a leaf node with no children. + + Properties for cluster-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +- core-cost node + + Description: must be declared within an energy-costs node. A + system can contain multiple cores and each core serviced by + EAS must have a corresponding core-cost node. + + The core-cost node name must be "core-costN" as described in + 3 above. + + A core-cost node must be a leaf node with no children. + + Properties for core-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +- thread-cost node + + Description: must be declared within an energy-costs node. A + system can contain cores with multiple hardware threads and + each thread serviced by EAS must have a corresponding + thread-cost node. + + The thread-cost node name must be "thread-costN" as described + in 3 above. + + A thread-cost node must be a leaf node with no children. + + Properties for thread-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +=========================================================== +5 - Cost node properties +=========================================================== + +All cost node types must have only the following properties: + +- busy-cost-data + + Usage: required + Value type: An array of 2-item tuples. Each item is of type + u32. + Definition: The first item in the tuple is the capacity + value as described in [3]. The second item in the tuple is + the energy cost value as described in [3]. + +- idle-cost-data + + Usage: required + Value type: An array of 1-item tuples. The item is of type + u32. + Definition: The item in the tuple is the energy cost value + as described in [3]. + +=========================================================== +6 - Extensions to the cpu node +=========================================================== + +The cpu node is extended with a property that establishes the +connection between the processing element represented by the cpu +node and the cost nodes associated with this processing element. + +The connection is expressed in line with the topological hierarchy +that this processing element belongs to, starting with the level in +the hierarchy that this processing element itself belongs to through +to the highest level that EAS is required to service. The +connection cannot be sparse and must be contiguous from the +processing element's level through to the highest desired level. The +highest desired level must be the same for all processing elements. + +Example: Given that a cpu node may represent a thread that is a part +of a core, this property may contain multiple elements which +associate the thread with cost nodes describing the costs for the +thread itself, the core the thread belongs to, the cluster the core +belongs to and so on. The elements must be ordered from the lowest +level nodes to the highest desired level that EAS must service.
The +highest desired level must be the same for all cpu nodes. The +elements must not be sparse: there must be elements for the current +thread, the next level of hierarchy (core) and so on without any +'holes'. + +Example: Given that a cpu node may represent a core that is a part +of a cluster of related cpus, this property may contain multiple +elements which associate the core with cost nodes describing the +costs for the core itself, the cluster the core belongs to and so +on. The elements must be ordered from the lowest level nodes to the +highest desired level that EAS must service. The highest desired +level must be the same for all cpu nodes. The elements must not be +sparse: there must be elements for the current core, the next +level of hierarchy (cluster) and so on without any 'holes'. + +If the system comprises hierarchical clusters of clusters, this +property will contain multiple associations with the relevant number +of cluster elements in hierarchical order. + +Property added to the cpu node: + +- sched-energy-costs + + Usage: required + Value type: List of phandles + Definition: a list of phandles to specific cost nodes in the + energy-costs parent node that correspond to the processing + element represented by this cpu node in hierarchical order + of topology. + + The order of phandles in the list is significant. The first + phandle is to the current processing element's own cost + node. Subsequent phandles are to higher hierarchical level + cost nodes up until the maximum level that EAS is to + service. + + All cpu nodes must have the same highest level cost node. + + The phandle list must not be sparsely populated with handles + to non-contiguous hierarchical levels. See commentary above + for clarity. + + Any other configuration is invalid. + +=========================================================== +7 - Example dts +=========================================================== + +Example 1 (ARM 64-bit, 6-cpu system, two clusters of cpus, one +cluster of 2 Cortex-A57 cpus, one cluster of 4 Cortex-A53 cpus): + +cpus { + #address-cells = <2>; + #size-cells = <0>; + . + . + .
+ A57_0: cpu@0 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x0>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + clocks = <&scpi_dvfs 0>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_0 &CLUSTER_COST_0>; + }; + + A57_1: cpu@1 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x1>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + clocks = <&scpi_dvfs 0>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_0 &CLUSTER_COST_0>; + }; + + A53_0: cpu@100 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x100>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_1: cpu@101 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x101>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_2: cpu@102 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x102>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_3: cpu@103 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x103>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + energy-costs { + CPU_COST_0: core-cost0 { + busy-cost-data = < + 417 168 + 579 251 + 744 359 + 883 479 + 1024 616 + >; + idle-cost-data = < + 15 + 0 + >; + }; + CPU_COST_1: core-cost1 { + busy-cost-data = < + 235 33 + 302 46 + 368 61 + 406 76 + 447 93 + >; + idle-cost-data = < + 6 + 0 + >; + }; + CLUSTER_COST_0: cluster-cost0 { + busy-cost-data = < + 417 24 + 579 32 + 744 43 + 883 49 + 1024 64 + >; + idle-cost-data = < + 65 + 24 + >; + }; + CLUSTER_COST_1: cluster-cost1 { + busy-cost-data = < + 235 26 + 303 30 + 368 39 + 406 47 + 447 57 + >; + idle-cost-data = < + 56 + 17 + >; + }; + }; +}; + +=============================================================================== +[1] https://lkml.org/lkml/2015/5/12/728 +[2] Documentation/devicetree/bindings/topology.txt +[3] Documentation/scheduler/sched-energy.txt diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt new file mode 100644 index 000000000000..37be110c2bad --- /dev/null +++ b/Documentation/scheduler/sched-energy.txt @@ -0,0 +1,363 @@ +Energy cost model for energy-aware scheduling (EXPERIMENTAL) + +Introduction +============= + +The basic energy model uses platform energy data stored in sched_group_energy +data structures attached to the sched_groups in the sched_domain hierarchy. The +energy cost model offers two functions that can be used to guide scheduling +decisions: + +1. static unsigned int sched_group_energy(struct energy_env *eenv) +2. static int energy_diff(struct energy_env *eenv) + +sched_group_energy() estimates the energy consumed by all cpus in a specific +sched_group including any shared resources owned exclusively by this group of +cpus. 
Resources shared with other cpus are excluded (e.g. higher-level caches shared +with other groups of cpus). + +energy_diff() estimates the total energy impact of a utilization change. That +is, adding, removing, or migrating utilization (tasks). + +Both functions use a struct energy_env to specify the scenario to be evaluated: + + struct energy_env { + struct sched_group *sg_top; + struct sched_group *sg_cap; + int cap_idx; + int usage_delta; + int src_cpu; + int dst_cpu; + int energy; + }; + +sg_top: sched_group to be evaluated. Not used by energy_diff(). + +sg_cap: sched_group covering the cpus in the same frequency domain. Set by +sched_group_energy(). + +cap_idx: Capacity state to be used for energy calculations. Set by +find_new_capacity(). + +usage_delta: Amount of utilization to be added, removed, or migrated. + +src_cpu: Source cpu from which 'usage_delta' utilization is removed. Should be +-1 if there is no source (e.g. task wake-up). + +dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1 +if utilization is removed (e.g. terminating tasks). + +energy: Result of sched_group_energy(). + +The metric used to represent utilization is the actual per-entity running time +averaged over time using a geometric series. It is very similar to the existing +per-entity load-tracking, but is _not_ scaled by task priority and is capped by +the capacity of the cpu. The latter property means that utilization may +underestimate the compute requirements of tasks on fully/over-utilized cpus. +The greatest potential for energy savings without affecting performance too much +is in scenarios where the system isn't fully utilized. If the system is deemed +fully utilized, load-balancing should instead be done with task load (which +includes task priority), in the interest of fairness and performance. + + +Background and Terminology +=========================== + +To make it clear from the start: + +energy = [joule] (resource like a battery on powered devices) +power = energy/time = [joule/second] = [watt] + +The goal of energy-aware scheduling is to minimize energy, while still getting +the job done. That is, we want to maximize: + + performance [inst/s] + -------------------- + power [W] + +which is equivalent to minimizing: + + energy [J] + ----------- + instruction + +while still getting 'good' performance. It is essentially an alternative +optimization objective to the current performance-only objective for the +scheduler. This alternative considers two objectives: energy-efficiency and +performance. Hence, there needs to be a user-controllable knob to switch the +objective. Since it is early days, this is currently a sched_feature +(ENERGY_AWARE). + +The idea behind introducing an energy cost model is to allow the scheduler to +evaluate the implications of its decisions rather than blindly applying +energy-saving techniques that may only have positive effects on some platforms. +At the same time, the energy cost model must be as simple as possible to +minimize the scheduler latency impact. + +Platform topology +------------------ + +The system topology (cpus, caches, and NUMA information, not peripherals) is +represented in the scheduler by the sched_domain hierarchy which has +sched_groups attached at each level that cover one or more cpus (see +sched-domains.txt for more details). To add energy awareness to the scheduler +we need to consider power and frequency domains. + +Power domain: + +A power domain is a part of the system that can be powered on/off +independently.
Power domains are typically organized in a hierarchy where you +may be able to power down just a cpu or a group of cpus along with any +associated resources (e.g. shared caches). Powering up a cpu means that all +power domains it is a part of in the hierarchy must be powered up. Hence, it is +more expensive to power up the first cpu that belongs to a higher level power +domain than to power up additional cpus in the same high level domain. Two +level power domain hierarchy example: + + Power source + +-------------------------------+----... +per group PD G G + | +----------+ | + +--------+-------| Shared | (other groups) +per-cpu PD G G | resource | + | | +----------+ + +-------+ +-------+ + | CPU 0 | | CPU 1 | + +-------+ +-------+ + +Frequency domain: + +Frequency domains (P-states) typically cover the same group of cpus as one of +the power domain levels. That is, there might be several smaller power domains +sharing the same frequency (P-state) or there might be a power domain spanning +multiple frequency domains. + +From a scheduling point of view there is no need to know the actual frequencies +[Hz]. All the scheduler cares about is the compute capacity available at the +current state (P-state) the cpu is in and any other available states. For that +reason, and to also factor in any cpu micro-architecture differences, compute +capacity scaling states are called 'capacity states' in this document. For SMP +systems this is equivalent to P-states. For mixed micro-architecture systems +(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture +performance relative to the other cpus in the system. + +Energy modelling: +------------------ + +Due to the hierarchical nature of the power domains, the most obvious way to +model energy costs is to associate power and energy costs with +domains (groups of cpus). Energy costs of shared resources are associated with +the group of cpus that share the resources; only the cost of powering the +cpu itself and any private resources (e.g. private L1 caches) is associated +with the per-cpu groups (lowest level). + +For example, for an SMP system with per-cpu power domains and a cluster level +(group of cpus) power domain we get the overall energy costs to be: + + energy = energy_cluster + n * energy_cpu + +where 'n' is the number of cpus powered up and energy_cluster is the cost paid +as soon as any cpu in the cluster is powered up. + +The power and frequency domains can naturally be mapped onto the existing +sched_domain hierarchy and sched_groups by adding the necessary data to the +existing data structures. + +The energy model considers energy consumption from two contributors (shown in +the illustration below): + +1. Busy energy: Energy consumed while a cpu and the higher level groups that it +belongs to are busy running tasks. Busy energy is associated with the state of +the cpu, not an event. The time the cpu spends in this state varies. Thus, the +most obvious platform parameter for this contribution is busy power +(energy/time). + +2. Idle energy: Energy consumed while a cpu and higher level groups that it +belongs to are idle (in a C-state). Like busy energy, idle energy is associated +with the state of the cpu. Thus, the platform parameter for this contribution +is idle power (energy/time). + +Energy consumed during transitions from an idle-state (C-state) to a busy state +(P-state) or going the other way is ignored by the model to simplify the energy +model calculations.
+ + + Power + ^ + | busy->idle idle->busy + | transition transition + | + | _ __ + | / \ / \__________________ + |______________/ \ / + | \ / + | Busy \ Idle / Busy + | low P-state \____________/ high P-state + | + +------------------------------------------------------------> time + +Busy |--------------| |-----------------| + +Wakeup |------| |------| + +Idle |------------| + + +The basic algorithm +==================== + +The basic idea is to determine the total energy impact when utilization is +added or removed by estimating the impact at each level in the sched_domain +hierarchy starting from the bottom (sched_group contains just a single cpu). +The energy cost comes from busy time (sched_group is awake because one or more +cpus are busy) and idle time (in an idle-state). Energy model numbers account +for energy costs associated with all cpus in the sched_group as a group. + + for_each_domain(cpu, sd) { + sg = sched_group_of(cpu) + energy_before = curr_util(sg) * busy_power(sg) + + (1-curr_util(sg)) * idle_power(sg) + energy_after = new_util(sg) * busy_power(sg) + + (1-new_util(sg)) * idle_power(sg) + energy_diff += energy_before - energy_after + + } + + return energy_diff + +{curr, new}_util: The cpu utilization at the lowest level and the overall +non-idle time for the entire group for higher levels. Utilization is in the +range 0.0 to 1.0 in the pseudo-code. + +busy_power: The power consumption of the sched_group while busy. + +idle_power: The power consumption of the sched_group when idle. + +Note: It is a fundamental assumption that the utilization is (roughly) scale +invariant. Task utilization tracking factors in any frequency scaling and +performance scaling differences due to different cpu micro-architectures such +that task utilization can be used across the entire system. + + +Platform energy data +===================== + +struct sched_group_energy can be attached to sched_groups in the sched_domain +hierarchy and has the following members: + +cap_states: + List of struct capacity_state representing the supported capacity states + (P-states). struct capacity_state has two members: cap and power, which + represent the compute capacity and the busy power of the state. The + list must be ordered by capacity low->high. + +nr_cap_states: + Number of capacity states in the cap_states list. + +idle_states: + List of struct idle_state containing the idle-state power cost for each + idle-state supported by the sched_group. Note that the energy model + calculations will use this table to determine idle power even if no idle + state is actually entered by cpuidle, e.g. if latency constraints + prevent the group from entering a coupled state, or if no idle-states are + supported. Hence, the first entry of the list must be the power consumed + when the group is idle but no idle state has actually been entered + ('active idle'). This state may be left out for groups with one cpu if + the cpu is guaranteed to enter the state when idle. + +nr_idle_states: + Number of idle states in the idle_states list. + +nr_idle_states_below: + Number of idle-states below the current level. Filled in by generic code, + not to be provided by the platform. + +There are no unit requirements for the energy cost data. Data can be normalized +to any reference; however, the normalization must be consistent across all +energy cost data. That is, one bogo-joule/watt must be the same quantity for +all data, but we don't care what it is.
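As an illustrative C sketch (an assumption reconstructed from the member
descriptions above, not code copied verbatim from this patch), the platform
energy data could be laid out like this:

	/* Field names follow the text above; exact types may differ. */
	struct capacity_state {
		unsigned long cap;	/* compute capacity of the state */
		unsigned long power;	/* busy power of the state */
	};

	struct idle_state {
		unsigned long power;	/* power consumed in this idle state */
	};

	struct sched_group_energy {
		unsigned int nr_cap_states;		/* entries in cap_states[] */
		struct capacity_state *cap_states;	/* ordered low -> high capacity */
		unsigned int nr_idle_states;		/* entries in idle_states[] */
		struct idle_state *idle_states;		/* 'active idle' entry first */
		unsigned int nr_idle_states_below;	/* filled in by generic code */
	};

A platform would then provide one such structure for each sched_group that EAS
is required to service.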
+ +A recipe for platform characterization +======================================= + +Obtaining the actual model data for a particular platform requires some way of +measuring power/energy. There isn't a tool to help with this (yet). This +section provides a recipe for use as reference. It covers the steps used to +characterize the ARM TC2 development platform. This sort of measurement is +expected to be done anyway when tuning cpuidle and cpufreq for a given +platform. + +The energy model needs two types of data (struct sched_group_energy holds +these) for each sched_group where energy costs should be taken into account: + +1. Capacity state information + +A list containing the compute capacity and the power consumption when fully +utilized attributed to the group as a whole for each available capacity state. +At the lowest level (group contains just a single cpu) this is the power of the +cpu alone without including power consumed by resources shared with other cpus. +It basically needs to fit the basic modelling approach described in the +"Background and Terminology" section: + + energy_system = energy_shared + n * energy_cpu + +for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at +the lowest level. 'energy_shared' is included at the next level which +represents the group of cpus among which the resources are shared. + +This model is, of course, a simplification of reality. Thus, power/energy +attributions might not always exactly represent how the hardware is designed. +Also, busy power is likely to depend on the workload. It is therefore +recommended to use a representative mix of workloads when characterizing the +capacity states. + +If the group has no capacity scaling support, the list will contain a single +state where power is the busy power attributed to the group. The capacity +should be set to a default value (1024). + +When frequency domains include multiple power domains, the group representing +the frequency domain and all child groups share capacity states. This must be +indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at +all levels that share the capacity state must have the list of capacity states +with the power set to the contribution of the individual group. + +2. Idle power information + +Stored in the idle_states list. The power number is the group idle power +consumption in each idle state, as well as when the group is idle but has not +entered an idle-state ('active idle' as mentioned earlier). Due to the way the +energy model is defined, the idle power of the deepest group idle state can +alternatively be accounted for in the parent group busy power. In that case the +group idle state power values are offset such that the idle power of the +deepest state is zero. It is less intuitive, but it is easier to measure, as +the idle power consumed by the group and the busy/idle power of the parent +group cannot be distinguished without per-group measurement points. + +Measuring capacity states and idle power: + +The capacity states' capacity and power can be estimated by running a benchmark +workload at each available capacity state. By restricting the benchmark to run +on subsets of cpus it is possible to extrapolate the power consumption of +shared resources. + +ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a +shared L2 cache. TC2 has on-chip energy counters per cluster.
Running a +benchmark workload on just one cpu in a cluster means that power is consumed in +the cluster (higher level group) and a single cpu (lowest level group). Adding +another benchmark task to another cpu increases the power consumption by the +amount consumed by the additional cpu. Hence, it is possible to extrapolate the +cluster busy power. + +For platforms that don't have energy counters or equivalent instrumentation +built-in, it may be possible to use an external DAQ to acquire similar data. + +If the benchmark includes some performance score (for example the sysbench cpu +benchmark), this can be used to record the compute capacity. + +Measuring idle power requires insight into the idle state implementation on the +particular platform, specifically whether the platform has coupled idle-states +(or package states). To measure non-coupled per-cpu idle-states it is necessary +to keep one cpu busy so that any shared resources stay alive; this isolates the +idle power of the cpu from the idle/busy power of the shared resources. The cpu +can be tricked into different per-cpu idle states by disabling the other +states. Based on various combinations of measurements with specific cpus busy +and specific idle-states disabled it is possible to extrapolate the idle-state +power. diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt new file mode 100644 index 000000000000..4c960ac98a86 --- /dev/null +++ b/Documentation/scheduler/sched-tune.txt @@ -0,0 +1,619 @@ + Central, scheduler-driven, power-performance control + (EXPERIMENTAL) + +Abstract +======== + +The topic of a single, simple power-performance tunable that is wholly +scheduler-centric and has well-defined and predictable properties has come up +on several occasions in the past [4,5]. +With techniques such as energy cost model driven task placement and scheduler +driven DVFS, we now have a good framework for implementing such a tunable. +This document describes the overall ideas behind its design and implementation. + + +Table of Contents +================= + +1. Motivations + +2. Introduction + - Signals Boosting Strategy + - Energy-Performance Space + +3. Design details + - CPU selection using boosted task utilization + - Energy payoff evaluation + - OPP selection using boosted CPU usage + +4. Per task group boosting + - Setup and usage + +5. Questions and Answers + - What about "auto" mode? + - What about boosting on a congested system? + - How are multiple groups of tasks with different boost values managed? + +6. References + + +1. Motivations +============== + +Energy aware scheduling (EAS) [1,2] adds a new objective - energy efficiency - +to the scheduler's current performance-oriented objectives. +As a foundation component, EAS uses a simple energy cost model (EM) to drive +task placement decisions. Another component is sched-DVFS [3], a new +event-driven cpufreq governor, which allows the scheduler to select the optimal +DVFS operating point (OPP) for running a task allocated to a CPU. + +The combination of EAS and sched-DVFS enables running workloads using a +combination of the most energy efficient OPPs and CPUs, which minimizes +the energy consumption. +However, sometimes it may be desired to intentionally boost the performance of +a workload even if that could imply a reasonable increase in energy +consumption. For example, in order to reduce the response time of a task, we +may want to run the task at a higher OPP than the one that is actually required +by its CPU bandwidth demand.
+ +This last requirement is especially important if we consider that one of the +main goals of the sched-DVFS component is to replace all currently available +CPUFreq policies. Since sched-DVFS is event-based, as opposed to the sampling +driven governors we currently have, it is already more responsive at selecting +the optimal OPP to run tasks allocated to a CPU. However, just tracking the +actual task load demand may not be enough from a performance standpoint. +For example, it is not possible to get behaviors similar to those provided by +the "performance" and "interactive" CPUFreq governors. + +This document describes an implementation of a tunable, stacked on top of the +EAS EM and sched-DVFS, which extends their functionality to support task +performance boosting. +By "performance boosting" we mean the reduction of the time required to +complete a task activation, i.e. the time elapsed from a task wakeup to its +next deactivation (e.g. because it goes back to sleep or it terminates). +For example, if we consider a simple periodic task which executes the same +workload for 5[s] every 20[s] while running at a certain OPP, a boosted +execution of that task must complete each of its activations in less than 5[s]. + +A previous attempt [5] to introduce such a boosting feature has not been +successful, mainly because of the complexity of the proposed solution. +The approach described in this document exposes a single simple interface to +user-space. This single tunable knob allows the tuning of system-wide +scheduler behaviours ranging from energy efficiency at one end through to +incremental performance boosting at the other end. +The tunable affects all tasks. A more advanced extension of the concept is also +provided which uses CGroups to boost the performance of only selected tasks +while using the energy-efficient default for all others. + +The rest of this document introduces in more detail the proposed solution, +which has been named SchedTune. + + +2. Introduction +=============== + +SchedTune exposes a simple user-space interface with a single power-performance +tunable: + + /proc/sys/kernel/sched_cfs_boost + +This permits expressing a boost value as an integer in the range [0..100]. + +A value of 0 (default) configures the Energy-Aware Scheduler (EAS) for maximum +energy efficiency. This means that the EM will always try to schedule +tasks on the most energy-efficient CPU while sched-DVFS runs them at the minimum +OPP required to satisfy the workload demand. +A value of 100 configures EAS for maximum performance, with the scheduler doing +its best to put tasks on CPUs with the maximum capacity. This translates to +the maximum OPP on that CPU and, for heterogeneous systems like ARM big.LITTLE, +the CPU type with the highest capacity. + +Values between 0 and 100 can be set to suitably satisfy other scenarios, for +example interactive response weighed against the energy expense, or other +system events (battery level etc). + +A CGroup based extension is also provided, which permits further user-space +defined task classification to tune the scheduler for different goals depending +on the specific nature of the task, e.g. background vs interactive vs +low-priority. + +The overall design of the SchedTune module is built on top of EAS +by introducing two main biases: + +1. bias the Scheduling Group (SG) and CPU selection + Each time a task wakes up, EAS has the opportunity to allocate the task to + the most appropriate SG/CPU.
This decision is influenced by the global boost + value, or the boost value for the task CGroup when in use. + +2. bias the Operating Performance Point (OPP) selection + Each time a task is allocated on a CPU, sched-DVFS has the opportunity to + tune the operating frequency of that CPU to better match the workload + demand. The selection of the actual OPP being activated is influenced by the + global boost value, or the boost value for the task CGroup when in use. + +This simple biasing approach leverages existing frameworks, which means minimal +modifications to the scheduler, and yet it achieves a range of +different behaviours, all from a single simple tunable knob. +The only new concepts introduced are those of signal boosting +and the energy-performance space, which are detailed in the following sections. + + +2.1. Signals Boosting Strategy +============================== + +The whole EAS machinery works based on the value of a few load tracking signals +which basically track the CPU bandwidth requirements for tasks and the capacity +of CPUs. The basic idea behind the SchedTune knob is to artificially inflate +some of these load tracking signals to make a task or RQ appear more demanding +than it actually is. + +Which signals have to be inflated depends on the specific "consumer". However, +independently of the specific (signal, consumer) pair, it is important to +define a simple and possibly consistent strategy for the concept of boosting a +signal. + +A boosting strategy defines how the "abstract" user-space defined +sched_cfs_boost value is translated into an internal "margin" value to be added +to a signal to get its inflated value: + + margin := boosting_strategy(sched_cfs_boost, signal) + boosted_signal := signal + margin + +Different boosting strategies have been identified and analyzed before choosing +the one found to be the most effective. + +Signal Proportional Compensation (SPC) +-------------------------------------- + +In this boosting strategy the sched_cfs_boost value is used to compute a +margin which is proportional to the complement of the original signal. +When a signal has a maximum possible value, its complement is defined as +the delta between the actual value and the possible maximum. +Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as +the maximum possible value, the margin becomes: + + margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal) + +Using this boosting strategy: +- a 100% sched_cfs_boost means that the signal is scaled to the maximum value +- each value in the range of sched_cfs_boost effectively inflates the signal in + question by a quantity which is proportional to the maximum value. + +For example, by applying the SPC boosting strategy to the selection of the OPP +to run a task it is possible to achieve these behaviors: + +- 0% boosting: run the task at the minimum OPP required by its workload +- 100% boosting: run the task at the maximum OPP available for the CPU +- 50% boosting: run at the half-way OPP between minimum and maximum + +This means that at 50% boosting a task will be scheduled to run at half of the +maximum theoretically achievable performance on the specific target platform.
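As a minimal C sketch of the SPC strategy just described (the helper name
schedtune_margin() and the explicit percentage scaling are assumptions for
illustration; the patch's exact code may differ):

	/* Illustrative sketch of the SPC boosting strategy described above. */
	#define SCHED_LOAD_SCALE	1024UL	/* maximum signal value */

	/* boost is the sched_cfs_boost percentage in [0..100] */
	static unsigned long schedtune_margin(unsigned long signal, int boost)
	{
		/* margin is proportional to the complement of the signal */
		return (boost * (SCHED_LOAD_SCALE - signal)) / 100;
	}

	static unsigned long boosted_signal(unsigned long signal, int boost)
	{
		return signal + schedtune_margin(signal, boost);
	}

With boost = 50 and signal = 512, the margin is (50 * 512) / 100 = 256, so the
boosted signal is 768: midway between the original value and SCHED_LOAD_SCALE,
matching the 50% behavior described above.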
+ +A graphical representation of an SPC boosted signal is shown in the +following figure where: + a) "-" represents the original signal + b) "b" represents a 50% boosted signal + c) "p" represents a 100% boosted signal + + + ^ + | SCHED_LOAD_SCALE + +-----------------------------------------------------------------+ + |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp + | + | boosted_signal + | bbbbbbbbbbbbbbbbbbbbbbbb + | + | signal + | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+ + | | + |bbbbbbbbbbbbbbbbbb | + | | + | | + | | + | +-----------------------+ + | | + | | + | | + |------------------+ + | + | + +-----------------------------------------------------------------------> + +This plot represents a "ramp" signal. For each step of the original signal the +boosted signal corresponding to a 50% boost is midway between the original +signal and the upper bound. A 100% boost instead generates a boosted signal +which is always saturated to the upper bound. + + +2.2 Energy-Performance Space +============================ + +Boosting a task to get EAS to schedule it on a more capable CPU and/or running +it at a higher OPP implies a higher energy expense to do a certain amount of +work. Conversely, by scheduling a task in an energy efficient way we could +affect its performance such that it takes longer to complete each of its +activations. + +Thus, using the sched_cfs_boost knob requires identifying an effective strategy +to evaluate two different conditions: + a) how much more energy is worth spending + for a certain performance increase + b) how much performance reduction is possible + to save a certain amount of energy + + +To support this kind of evaluation at run-time, the implementation of SchedTune +uses a representation of a scheduling candidate as a point in the +Performance-Energy Space (P-E space). + +A Scheduling Candidate (SC) is a possible scheduling decision to switch a +task from the current CPU and OPP to another CPU and/or OPP. +Such switching involves a certain variation in both the expected +energy consumption and the task performance. + +Thus, each scheduling candidate can be represented in the P-E space where: + a) the energy variation (dE) is represented on the X axis + b) the performance variation is represented on the Y axis + +A graphical representation of the P-E space is depicted in the following figure. + + dP ^ + | + | Performance Boost + | Region (B) + Optimal Region (O) | bbb + | bbb + | +sd1 bbb + | bbb + | bbb + | bbb + | bbb + | bbb dE + --------------------------------------------------------------> + cccc| + cccc | + cccc | + cccc | + cccc +sd2 | Suboptimal Region (S) + cccc | + cccc | + | + Performance Constraint | + Region (C) | + | + | + + +Four main regions can be identified in this space: + + 1) Optimal region (O) + The space of scheduling decisions which correspond to a decreased energy + consumption with better performance; all these decisions must always be + selected. + + 2) Suboptimal region (S) + The space of scheduling decisions which correspond to an increased energy + consumption for worse performance; all these decisions must always be + discarded. + + 3) Performance Boost region (B) + The space of scheduling decisions which correspond to an increased energy + consumption for better performance. + These decisions could be selected only if the increase in energy + consumption is "reasonable" with respect to the performance gain.
+ + 4) Performance Constraint region (C) + The space of scheduling decisions which correspond to a decreased energy + consumption for worse performance. + These decisions could be selected only if the decrease in energy + consumption is reasonable with respect to the performance loss. + + +The acceptability criteria defined for the B and C regions are based on +evaluating how reasonable the energy variation is compared to the +performance variation. + +From a mathematical/geometrical standpoint, the degree of "reasonableness" of a +scheduling candidate is defined by its location in the P-E space with respect +to a pair of thresholds. In the previous figure, these two thresholds are +represented by the two lines in the P-E space: + + a) the "boosting" threshold, represented by the "b" line + b) the "constraining" threshold, represented by the "c" line + +Boosting threshold +------------------ +The boosting threshold is the acceptability criterion for a scheduling +candidate belonging to the B region. A scheduling candidate which increases the +energy consumption can only be accepted if it provides a corresponding minimum +performance increment. +The boosting threshold defines this minimum required increment of performance +for each possible energy increase. Thus, the slope of the line representing the +boosting threshold indicates the minimum expected performance boost that can +amortize the corresponding energy increase. + +For example, the point named sd1 in the figure represents a scheduling +candidate which could be accepted given the specific configuration of the +boosting threshold. + +Constraining threshold +---------------------- +The constraining threshold is the acceptability criterion for a scheduling +candidate belonging to the C region. A scheduling candidate which decreases the +energy consumption can only be accepted if it does not involve an excessive +decrement in the expected performance. +The constraining threshold defines the maximum acceptable degradation of +performance for each possible decrease in energy expense. Thus, the slope of +the line representing the constraining threshold indicates the minimum energy +saving expected for the corresponding decrease in performance. + +For example, the point named sd2 in the figure represents a scheduling +candidate which could not be accepted given the specific configuration of the +constraining threshold. + + +3. Design details +================= + +Based on the concepts of signal boosting and the P-E space described +previously, the implementation of the SchedTune tunable extends EAS with three +simple modifications. + +It is worth calling out that the implementation does not introduce any new load +signals. Instead, it provides an API to tune existing signals. This tuning is +done on demand and only in scheduler code paths where it is sensible to do so. +The new API calls are defined to return either the default signal or a boosted +one, depending on the value of sched_cfs_boost. This is a clean and non-invasive +modification of the existing EAS code paths, specifically the EM and +sched-DVFS.
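As a sketch of this "default or boosted" API pattern (task_utilization() and
boosted_task_utilization() are named later in this document, but the glue code
and the sysctl_sched_cfs_boost variable name are assumptions; schedtune_margin()
is the helper sketched earlier):

	/* Illustrative sketch of the boosted-signal API pattern. */
	extern unsigned long task_utilization(struct task_struct *p);
	extern int sysctl_sched_cfs_boost;	/* assumed name for the knob's value */

	static unsigned long boosted_task_utilization(struct task_struct *p)
	{
		unsigned long util = task_utilization(p);

		/* Return the default signal, inflated by the current boost */
		return util + schedtune_margin(util, sysctl_sched_cfs_boost);
	}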
+ +The following diagram depicts the integration of SchedTune with EAS: + + + sched_cfs_boost + +----------------+ + | + +-------------------v-----------------+ + | SchedTune | + +---------------+-------+-------------+ + | | + (SG/CPU selection biasing) | | (OPP selection biasing) + | | + boosted_task_utilization() | | get_boosted_cpu_usage() + | | + +---------------v-+ +-v-------------+ + | EnergyModel | | sched-DVFS | + +-----------------+ +---------------+ + + + +1) CPU selection using boosted task utilization +----------------------------------------------- + +The signal representing a task's utilization is boosted according to the +previously described SPC boosting strategy. This allows representing a task to +the scheduler as being more CPU demanding than it actually is. + +Thus, with the SchedTune tunable enabled we have two main functions to get the +utilization of a task: + + task_utilization() + boosted_task_utilization() + +The new boosted_task_utilization() is similar to the first but returns a +boosted utilization signal which is a function of the sched_cfs_boost value. + +This function is used in the EAS code paths where it is required to decide on +which CPU a task could be allocated. +For example, this allows the selection of the most capable CPU on the system +when a task is boosted 100%. + +Thus, the new boosted_task_utilization() function is used to bias the selection +of a possible scheduling candidate. + + +2) Energy payoff evaluation +--------------------------- + +As previously described, by considering a boosted task utilization we could end +up with a scheduling candidate which increases the energy consumption to +hopefully get more performance for a task. + +A new function: + + schedtune_accept_deltas(energy_delta, performance_delta) + +has been added by the SchedTune implementation, which makes it possible to +evaluate the scheduling candidate in the P-E space. + +The P-E space requires the definition of boosting and constraining thresholds. +In order to keep the user-space interface simple, the SchedTune implementation +binds the single sched_cfs_boost value to the definition of these thresholds. + +Specifically: + a) the two thresholds have the same slope + b) a 0% sched_cfs_boost value corresponds to a vertical line in the P-E space, + centered at the origin, and a consequent threshold which accepts only those + scheduling candidates that correspond to a decrease in the expected energy + consumption + c) a 100% sched_cfs_boost value corresponds to a horizontal line in the P-E + space, centered at the origin, and a consequent threshold which accepts all + the scheduling candidates that correspond to an increase in expected + performance + d) a sched_cfs_boost value in between 0% and 100% translates to a line whose + slope is inversely proportional to the boost value + +This definition of the thresholds in the P-E space has the following +interesting properties: + 1) a 0% boost value provides power-saving behaviors + 2) a 100% boost value provides performance-oriented behaviors + 3) intermediate values provide a smooth transition from power saving to + performance boosting. + + +3) OPP selection using boosted CPU usage +---------------------------------------- + +The signal representing a CPU's usage is boosted according to the previously +described SPC boosting strategy. This allows representing a CPU (i.e. a CFS RQ) +to sched-DVFS as being more used than it actually is.
+ +Thus, with sched_cfs_boost enabled we have the following main functions to +get the current usage of a CPU: + + get_cpu_usage() + get_boosted_cpu_usage() + +The new get_boosted_cpu_usage() is similar to the first but returns a boosted +usage signal which is a function of the sched_cfs_boost value. + +This function is used in the EAS code paths where sched-DVFS needs to decide +the OPP to run a CPU at. +For example, this allows selecting the highest OPP for a CPU which has +the boost value set to 100%. + +Thus, the new get_boosted_cpu_usage() function is used to bias the selection of +the CPU's operating frequency. + + +4. Per task group boosting +========================== + +The availability of a single knob which is used to boost all tasks in the +system is certainly a simple solution, but it quite likely doesn't fit many +usage scenarios, especially in the mobile device space. + +For example, on battery powered devices there usually are many background +services which are long running and need energy-efficient scheduling. On the +other hand, some applications are more performance sensitive and require an +interactive response and/or maximum performance, regardless of the energy cost. +To better service such scenarios, the SchedTune implementation has an extension +that provides a more fine grained boosting interface. + +A new CGroup controller, namely "schedtune", can be enabled, which allows +task groups with different boost values to be defined and configured. +Tasks that require special power-performance treatment can be put into +separate CGroups. +The value of the boost associated with the tasks in this group can be specified +using a single knob exposed by the CGroup controller: + + schedtune.boost + +This knob allows the definition of a boost value that is to be used for +SPC boosting of all tasks attached to this group. + +The current schedtune controller implementation is really simple and has these +main characteristics: + + 1) it is only possible to create 1 level depth hierarchies + The root control group defines the system-wide boost value to be applied + by default to all tasks. Its direct subgroups are named "boost groups" and + they define the boost value for a specific set of tasks. + Further nested subgroups are not allowed since they do not have a sensible + meaning from a user-space standpoint. + + 2) it is possible to define only a limited number of "boost groups" + This number is defined at compile time and by default configured to 16. + This is a design decision motivated by two main reasons: + a) in a real system we do not expect usage scenarios with more than a few + boost groups. For example, a reasonable collection of groups could be + just "background", "interactive" and "performance". + b) it simplifies the implementation considerably, especially for the code + which has to compute the per-CPU boosting once there are multiple + RUNNABLE tasks with different boost values. + +Such a simple design should allow servicing the main usage scenarios identified +so far. It provides a simple interface which can be used to manage the +power-performance of all tasks or only selected tasks. +Moreover, this interface can be easily integrated by user-space run-times (e.g. +Android, ChromeOS) to implement a QoS solution for task boosting based on task +classification, which has been a long standing requirement. + +Setup and usage +--------------- + +0. Use a kernel with CGROUP_SCHEDTUNE support enabled + +1.
Check that the "schedtune" CGroup controller is available: + + root@linaro-nano:~# cat /proc/cgroups + #subsys_name hierarchy num_cgroups enabled + cpuset 0 1 1 + cpu 0 1 1 + schedtune 0 1 1 + +2. Mount a tmpfs to create the CGroups mount point (Optional) + + root@linaro-nano:~# mount -t tmpfs cgroups /sys/fs/cgroup + +3. Mount the "schedtune" controller + + root@linaro-nano:~# mkdir /sys/fs/cgroup/stune + root@linaro-nano:~# mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune + +4. Set up the system-wide boost value (Optional) + If not configured, the root control group has a 0% boost value, which + basically disables boosting for all tasks in the system, thus running in + energy-efficient mode. + + root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost + +5. Create task groups and configure their specific boost value (Optional) + For example, here we create a "performance" boost group configured to boost + all its tasks to 100% + + root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance + root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost + +6. Move tasks into the boost group + For example, the following moves the task with PID $TASKPID (and all its + threads) into the "performance" boost group. + + root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs + +This simple configuration allows only the threads of the $TASKPID task to run, +when needed, at the highest OPP on the most capable CPU of the system. + + +5. Questions and Answers +======================== + +What about "auto" mode? +----------------------- + +The "auto" mode described in [5] can still be implemented on top of the +SchedTune implementation, provided a suitable integration with a user-space +run-time which tunes the simple boost knob exposed by either the system-wide or +cgroup-based interface. + +What about boosting on a congested system? +------------------------------------------ + +The current implementation of the sched_cfs_boost tunable has the most impact +only while EAS runs under the so-called 'tipping point' [5] and the system has +spare capacity. +This makes sense since, when the tipping point is reached, the system is +likely already running at the maximum OPP and CFS allocates tasks to try and +maximize their performance. +Put differently, this kind of power-performance boosting makes sense only when +the system has spare capacity and there are tasks that can be boosted. + +How are multiple groups of tasks with different boost values managed? +--------------------------------------------------------------------- + +The current SchedTune implementation keeps track of the boosted RUNNABLE tasks +on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU usage is +boosted with a value which is the maximum of the boost values of the currently +RUNNABLE tasks in its RQ. +This allows sched-DVFS to boost a CPU only while there are boosted tasks ready +to run, and to switch back to the energy-efficient mode as soon as the last +boosted task is dequeued. + + +6.
References +============= +[1] http://lkml.org/lkml/2015/5/12/728 +[2] http://lkml.org/lkml/2015/5/12/757 +[3] http://lkml.org/lkml/2015/6/26/620 +[4] http://lwn.net/Articles/552889 +[5] http://lkml.org/lkml/2012/5/18/91 +[6] http://lkml.org/lkml/2015/5/12/749 diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts index eb43d9263e36..4edf022ff2fb 100644 --- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts +++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts @@ -39,6 +39,7 @@ reg = <0>; cci-control-port = <&cci_control1>; cpu-idle-states = <&CLUSTER_SLEEP_BIG>; + clock-frequency = <1000000000>; }; cpu1: cpu@1 { @@ -47,6 +48,7 @@ reg = <1>; cci-control-port = <&cci_control1>; cpu-idle-states = <&CLUSTER_SLEEP_BIG>; + clock-frequency = <1000000000>; }; cpu2: cpu@2 { @@ -55,6 +57,7 @@ reg = <0x100>; cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; + clock-frequency = <800000000>; }; cpu3: cpu@3 { @@ -63,6 +66,7 @@ reg = <0x101>; cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; + clock-frequency = <800000000>; }; cpu4: cpu@4 { @@ -71,6 +75,7 @@ reg = <0x102>; cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; + clock-frequency = <800000000>; }; idle-states { diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index 2fe85fff5cca..cf66aca3ba25 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h @@ -24,6 +24,17 @@ void init_cpu_topology(void); void store_cpu_topology(unsigned int cpuid); const struct cpumask *cpu_coregroup_mask(int cpu); +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity +struct sched_domain; +extern +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu); + +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity); + +#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity +extern +unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu); + #else static inline void init_cpu_topology(void) { } diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c index 9ce02e6e3963..a27337d46a9f 100644 --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -47,6 +47,7 @@ #include <asm/mach/arch.h> #include <asm/mpu.h> +#include <trace/events/power.h> #define CREATE_TRACE_POINTS #include <trace/events/ipi.h> @@ -730,12 +731,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref); static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq); static unsigned long global_l_p_j_ref; static unsigned long global_l_p_j_ref_freq; +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq); +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity); + +/* + * Scheduler load-tracking scale-invariance + * + * Provides the scheduler with a scale-invariance correction factor that + * compensates for frequency scaling through arch_scale_freq_capacity() + * (implemented in topology.c). 
+ */ +static inline +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max) +{ + unsigned long capacity; + + if (!max) + return; + + capacity = (curr << SCHED_CAPACITY_SHIFT) / max; + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity); +} static int cpufreq_callback(struct notifier_block *nb, unsigned long val, void *data) { struct cpufreq_freqs *freq = data; int cpu = freq->cpu; + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu)); if (freq->flags & CPUFREQ_CONST_LOOPS) return NOTIFY_OK; @@ -760,6 +783,12 @@ static int cpufreq_callback(struct notifier_block *nb, per_cpu(l_p_j_ref_freq, cpu), freq->new); } + + if (val == CPUFREQ_PRECHANGE) { + scale_freq_capacity(cpu, freq->new, max); + trace_cpu_capacity(capacity_curr_of(cpu), cpu); + } + return NOTIFY_OK; } @@ -767,11 +796,38 @@ static struct notifier_block cpufreq_notifier = { .notifier_call = cpufreq_callback, }; +static int cpufreq_policy_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_policy *policy = data; + int i; + + if (val != CPUFREQ_NOTIFY) + return NOTIFY_OK; + + for_each_cpu(i, policy->cpus) { + scale_freq_capacity(i, policy->cur, policy->max); + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max); + } + + return NOTIFY_OK; +} + +static struct notifier_block cpufreq_policy_notifier = { + .notifier_call = cpufreq_policy_callback, +}; + static int __init register_cpufreq_notifier(void) { - return cpufreq_register_notifier(&cpufreq_notifier, + int ret; + + ret = cpufreq_register_notifier(&cpufreq_notifier, CPUFREQ_TRANSITION_NOTIFIER); + if (ret) + return ret; + + return cpufreq_register_notifier(&cpufreq_policy_notifier, + CPUFREQ_POLICY_NOTIFIER); } core_initcall(register_cpufreq_notifier); - #endif diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 89cfdd6e50cb..da450709eaa9 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -42,7 +42,7 @@ */ static DEFINE_PER_CPU(unsigned long, cpu_scale); -unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { return per_cpu(cpu_scale, cpu); } @@ -62,9 +62,7 @@ struct cpu_efficiency { * Table of relative efficiency of each processors * The efficiency value must fit in 20bit and the final * cpu_scale value must be in the range - * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2 - * in order to return at most 1 when DIV_ROUND_CLOSEST - * is used to compute the capacity of a CPU. + * 0 < cpu_scale < SCHED_CAPACITY_SCALE. * Processors that are not defined in the table, * use the default SCHED_CAPACITY_SCALE value for cpu_scale. */ @@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = { static unsigned long *__cpu_capacity; #define cpu_capacity(cpu) __cpu_capacity[cpu] -static unsigned long middle_capacity = 1; +static unsigned long max_cpu_perf; /* * Iterate all CPUs' descriptor in DT and compute the efficiency - * (as per table_efficiency). Also calculate a middle efficiency - * as close as possible to (max{eff_i} - min{eff_i}) / 2 - * This is later used to scale the cpu_capacity field such that an - * 'average' CPU is of middle capacity. Also see the comments near - * table_efficiency[] and update_cpu_capacity(). + * (as per table_efficiency). Calculate the max cpu performance too. 
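+ * The max cpu performance is later used by update_cpu_capacity() to
+ * scale each cpu's capacity relative to the fastest cpu, and is reset
+ * to 0 if any possible cpu lacks efficiency/clock-frequency data.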
*/ + static void __init parse_dt_topology(void) { const struct cpu_efficiency *cpu_eff; struct device_node *cn = NULL; - unsigned long min_capacity = ULONG_MAX; - unsigned long max_capacity = 0; - unsigned long capacity = 0; - int cpu = 0; + int cpu = 0, i = 0; __cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity), GFP_NOWAIT); @@ -102,6 +94,7 @@ static void __init parse_dt_topology(void) for_each_possible_cpu(cpu) { const u32 *rate; int len; + unsigned long cpu_perf; /* too early to use cpu->of_node */ cn = of_get_cpu_node(cpu, NULL); @@ -124,51 +117,57 @@ static void __init parse_dt_topology(void) continue; } - capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency; - - /* Save min capacity of the system */ - if (capacity < min_capacity) - min_capacity = capacity; - - /* Save max capacity of the system */ - if (capacity > max_capacity) - max_capacity = capacity; - - cpu_capacity(cpu) = capacity; + cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency; + cpu_capacity(cpu) = cpu_perf; + max_cpu_perf = max(max_cpu_perf, cpu_perf); + i++; } - /* If min and max capacities are equals, we bypass the update of the - * cpu_scale because all CPUs have the same capacity. Otherwise, we - * compute a middle_capacity factor that will ensure that the capacity - * of an 'average' CPU of the system will be as close as possible to - * SCHED_CAPACITY_SCALE, which is the default value, but with the - * constraint explained near table_efficiency[]. - */ - if (4*max_capacity < (3*(max_capacity + min_capacity))) - middle_capacity = (min_capacity + max_capacity) - >> (SCHED_CAPACITY_SHIFT+1); - else - middle_capacity = ((max_capacity / 3) - >> (SCHED_CAPACITY_SHIFT-1)) + 1; - + if (i < num_possible_cpus()) + max_cpu_perf = 0; } /* * Look for a customed capacity of a CPU in the cpu_capacity table during the * boot. The update of all CPUs is in O(n^2) for heteregeneous system but the - * function returns directly for SMP system. + * function returns directly for SMP systems or if there is no complete set + * of cpu efficiency, clock frequency data for each cpu. */ static void update_cpu_capacity(unsigned int cpu) { - if (!cpu_capacity(cpu)) + unsigned long capacity = cpu_capacity(cpu); + + if (!capacity || !max_cpu_perf) { + cpu_capacity(cpu) = 0; return; + } + + capacity *= SCHED_CAPACITY_SCALE; + capacity /= max_cpu_perf; - set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity); + set_capacity_scale(cpu, capacity); printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n", cpu, arch_scale_cpu_capacity(NULL, cpu)); } +/* + * Scheduler load-tracking scale-invariance + * + * Provides the scheduler with a scale-invariance correction factor that + * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling + * factor is updated in smp.c + */ +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +{ + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu)); + + if (!curr) + return SCHED_CAPACITY_SCALE; + + return curr; +} + #else static inline void parse_dt_topology(void) {} static inline void update_cpu_capacity(unsigned int cpuid) {} @@ -275,17 +274,130 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); } +/* + * ARM TC2 specific energy cost model data. There are no unit requirements for + * the data. Data can be normalized to any reference point, but the + * normalization must be consistent. That is, one bogo-joule/watt must be the + * same quantity for all data, but we don't care what it is. 
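+ *
+ * Each cap_states[] entry below pairs a relative compute capacity with
+ * the power consumed at the corresponding OPP; each idle_states[]
+ * entry gives the power consumed in one idle state (WFI or cluster
+ * sleep).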
+ */ +static struct idle_state idle_states_cluster_a7[] = { + { .power = 25 }, /* WFI */ + { .power = 10 }, /* cluster-sleep-l */ + }; + +static struct idle_state idle_states_cluster_a15[] = { + { .power = 70 }, /* WFI */ + { .power = 25 }, /* cluster-sleep-b */ + }; + +static struct capacity_state cap_states_cluster_a7[] = { + /* Cluster only power */ + { .cap = 150, .power = 2967, }, /* 350 MHz */ + { .cap = 172, .power = 2792, }, /* 400 MHz */ + { .cap = 215, .power = 2810, }, /* 500 MHz */ + { .cap = 258, .power = 2815, }, /* 600 MHz */ + { .cap = 301, .power = 2919, }, /* 700 MHz */ + { .cap = 344, .power = 2847, }, /* 800 MHz */ + { .cap = 387, .power = 3917, }, /* 900 MHz */ + { .cap = 430, .power = 4905, }, /* 1000 MHz */ + }; + +static struct capacity_state cap_states_cluster_a15[] = { + /* Cluster only power */ + { .cap = 426, .power = 7920, }, /* 500 MHz */ + { .cap = 512, .power = 8165, }, /* 600 MHz */ + { .cap = 597, .power = 8172, }, /* 700 MHz */ + { .cap = 682, .power = 8195, }, /* 800 MHz */ + { .cap = 768, .power = 8265, }, /* 900 MHz */ + { .cap = 853, .power = 8446, }, /* 1000 MHz */ + { .cap = 938, .power = 11426, }, /* 1100 MHz */ + { .cap = 1024, .power = 15200, }, /* 1200 MHz */ + }; + +static struct sched_group_energy energy_cluster_a7 = { + .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7), + .idle_states = idle_states_cluster_a7, + .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7), + .cap_states = cap_states_cluster_a7, +}; + +static struct sched_group_energy energy_cluster_a15 = { + .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15), + .idle_states = idle_states_cluster_a15, + .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15), + .cap_states = cap_states_cluster_a15, +}; + +static struct idle_state idle_states_core_a7[] = { + { .power = 0 }, /* WFI */ + }; + +static struct idle_state idle_states_core_a15[] = { + { .power = 0 }, /* WFI */ + }; + +static struct capacity_state cap_states_core_a7[] = { + /* Power per cpu */ + { .cap = 150, .power = 187, }, /* 350 MHz */ + { .cap = 172, .power = 275, }, /* 400 MHz */ + { .cap = 215, .power = 334, }, /* 500 MHz */ + { .cap = 258, .power = 407, }, /* 600 MHz */ + { .cap = 301, .power = 447, }, /* 700 MHz */ + { .cap = 344, .power = 549, }, /* 800 MHz */ + { .cap = 387, .power = 761, }, /* 900 MHz */ + { .cap = 430, .power = 1024, }, /* 1000 MHz */ + }; + +static struct capacity_state cap_states_core_a15[] = { + /* Power per cpu */ + { .cap = 426, .power = 2021, }, /* 500 MHz */ + { .cap = 512, .power = 2312, }, /* 600 MHz */ + { .cap = 597, .power = 2756, }, /* 700 MHz */ + { .cap = 682, .power = 3125, }, /* 800 MHz */ + { .cap = 768, .power = 3524, }, /* 900 MHz */ + { .cap = 853, .power = 3846, }, /* 1000 MHz */ + { .cap = 938, .power = 5177, }, /* 1100 MHz */ + { .cap = 1024, .power = 6997, }, /* 1200 MHz */ + }; + +static struct sched_group_energy energy_core_a7 = { + .nr_idle_states = ARRAY_SIZE(idle_states_core_a7), + .idle_states = idle_states_core_a7, + .nr_cap_states = ARRAY_SIZE(cap_states_core_a7), + .cap_states = cap_states_core_a7, +}; + +static struct sched_group_energy energy_core_a15 = { + .nr_idle_states = ARRAY_SIZE(idle_states_core_a15), + .idle_states = idle_states_core_a15, + .nr_cap_states = ARRAY_SIZE(cap_states_core_a15), + .cap_states = cap_states_core_a15, +}; + +/* sd energy functions */ +static inline const struct sched_group_energy *cpu_cluster_energy(int cpu) +{ + return cpu_topology[cpu].socket_id ? 
&energy_cluster_a7 : + &energy_cluster_a15; +} + +static inline const struct sched_group_energy *cpu_core_energy(int cpu) +{ + return cpu_topology[cpu].socket_id ? &energy_core_a7 : + &energy_core_a15; +} + static inline int cpu_corepower_flags(void) { - return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN; + return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \ + SD_SHARE_CAP_STATES; } static struct sched_domain_topology_level arm_topology[] = { #ifdef CONFIG_SCHED_MC - { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) }, - { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, + { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) }, #endif - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) }, { NULL, }, }; diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index 7ebcd31ce51c..b3496efc17ba 100644 --- a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h @@ -24,6 +24,16 @@ void init_cpu_topology(void); void store_cpu_topology(unsigned int cpuid); const struct cpumask *cpu_coregroup_mask(int cpu); +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity +struct sched_domain; +extern +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu); + +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity); + +#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity +extern unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu); + #else static inline void init_cpu_topology(void) { } diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 0ef87896e4ae..bf31185fac2f 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -35,6 +35,7 @@ #include <linux/clockchips.h> #include <linux/completion.h> #include <linux/of.h> +#include <linux/cpufreq.h> #include <linux/irq_work.h> #include <asm/alternative.h> @@ -52,6 +53,7 @@ #include <asm/tlbflush.h> #include <asm/ptrace.h> +#include <trace/events/power.h> #define CREATE_TRACE_POINTS #include <trace/events/ipi.h> @@ -664,3 +666,85 @@ int setup_profiling_timer(unsigned int multiplier) { return -EINVAL; } + +#ifdef CONFIG_CPU_FREQ + +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq); +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity); + +/* + * Scheduler load-tracking scale-invariance + * + * Provides the scheduler with a scale-invariance correction factor that + * compensates for frequency scaling through arch_scale_freq_capacity() + * (implemented in topology.c). 
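+ *
+ * At the maximum OPP the factor equals SCHED_CAPACITY_SCALE, so load
+ * tracking is left unscaled on a CPU running flat out.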
+ */ +static inline +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max) +{ + unsigned long capacity; + + if (!max) + return; + + capacity = (curr << SCHED_CAPACITY_SHIFT) / max; + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity); +} + +static int cpufreq_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_freqs *freq = data; + int cpu = freq->cpu; + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu)); + + if (freq->flags & CPUFREQ_CONST_LOOPS) + return NOTIFY_OK; + + if (val == CPUFREQ_PRECHANGE) { + scale_freq_capacity(cpu, freq->new, max); + trace_cpu_capacity(capacity_curr_of(cpu), cpu); + } + + return NOTIFY_OK; +} + +static struct notifier_block cpufreq_notifier = { + .notifier_call = cpufreq_callback, +}; + +static int cpufreq_policy_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_policy *policy = data; + int i; + + if (val != CPUFREQ_NOTIFY) + return NOTIFY_OK; + + for_each_cpu(i, policy->cpus) { + scale_freq_capacity(i, policy->cur, policy->max); + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max); + } + + return NOTIFY_OK; +} + +static struct notifier_block cpufreq_policy_notifier = { + .notifier_call = cpufreq_policy_callback, +}; + +static int __init register_cpufreq_notifier(void) +{ + int ret; + + ret = cpufreq_register_notifier(&cpufreq_notifier, + CPUFREQ_TRANSITION_NOTIFIER); + if (ret) + return ret; + + return cpufreq_register_notifier(&cpufreq_policy_notifier, + CPUFREQ_POLICY_NOTIFIER); +} +core_initcall(register_cpufreq_notifier); +#endif diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index b6ee26b0939a..ade0419ee97d 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c @@ -19,10 +19,24 @@ #include <linux/nodemask.h> #include <linux/of.h> #include <linux/sched.h> +#include <linux/sched.h> +#include <linux/sched_energy.h> #include <asm/cputype.h> #include <asm/topology.h> +static DEFINE_PER_CPU(unsigned long, cpu_scale); + +unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{ + return per_cpu(cpu_scale, cpu); +} + +static void set_capacity_scale(unsigned int cpu, unsigned long capacity) +{ + per_cpu(cpu_scale, cpu) = capacity; +} + static int __init get_cpu_for_node(struct device_node *node) { struct device_node *cpu_node; @@ -201,16 +215,89 @@ out: } /* + * Scheduler load-tracking scale-invariance + * + * Provides the scheduler with a scale-invariance correction factor that + * compensates for frequency scaling (arch_scale_freq_capacity()). 
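+ * It defaults to SCHED_CAPACITY_SCALE until the first cpufreq
+ * notification for the CPU arrives (cpu_freq_capacity still 0).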
The scaling + * factor is updated in smp.c + */ +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +{ + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu)); + + if (!curr) + return SCHED_CAPACITY_SCALE; + + return curr; +} + +/* * cpu topology table */ struct cpu_topology cpu_topology[NR_CPUS]; EXPORT_SYMBOL_GPL(cpu_topology); +/* sd energy functions */ +static inline const struct sched_group_energy *cpu_cluster_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL1]; + + if (!sge) { + pr_warn("Invalid sched_group_energy for Cluster%d\n", cpu); + return NULL; + } + + return sge; +} + +static inline const struct sched_group_energy *cpu_core_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL0]; + + if (!sge) { + pr_warn("Invalid sched_group_energy for CPU%d\n", cpu); + return NULL; + } + + return sge; +} + const struct cpumask *cpu_coregroup_mask(int cpu) { return &cpu_topology[cpu].core_sibling; } +static inline int cpu_corepower_flags(void) +{ + return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \ + SD_SHARE_CAP_STATES; +} + +static struct sched_domain_topology_level arm64_topology[] = { +#ifdef CONFIG_SCHED_MC + { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) }, +#endif + { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void update_cpu_capacity(unsigned int cpu) +{ + unsigned long capacity; + + if (!cpu_core_energy(cpu)) { + capacity = SCHED_CAPACITY_SCALE; + } else { + int max_cap_idx = cpu_core_energy(cpu)->nr_cap_states - 1; + capacity = cpu_core_energy(cpu)->cap_states[max_cap_idx].cap; + } + + set_capacity_scale(cpu, capacity); + + pr_info("CPU%d: update cpu_capacity %lu\n", + cpu, arch_scale_cpu_capacity(NULL, cpu)); +} + static void update_siblings_masks(unsigned int cpuid) { struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; @@ -269,6 +356,7 @@ void store_cpu_topology(unsigned int cpuid) topology_populated: update_siblings_masks(cpuid); + update_cpu_capacity(cpuid); } static void __init reset_cpu_topology(void) @@ -289,6 +377,14 @@ static void __init reset_cpu_topology(void) } } +static void __init reset_cpu_capacity(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + set_capacity_scale(cpu, SCHED_CAPACITY_SCALE); +} + void __init init_cpu_topology(void) { reset_cpu_topology(); @@ -299,4 +395,10 @@ void __init init_cpu_topology(void) */ if (parse_dt_topology()) reset_cpu_topology(); + else + set_sched_topology(arm64_topology); + + reset_cpu_capacity(); + + init_sched_energy_costs(); } diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 548311fd04c0..85343418a684 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -112,6 +112,14 @@ config CPU_FREQ_DEFAULT_GOV_INTERACTIVE loading your cpufreq low-level hardware driver, using the 'interactive' governor for latency-sensitive workloads. +config CPU_FREQ_DEFAULT_GOV_SCHED + bool "sched" + select CPU_FREQ_GOV_SCHED + select CPU_FREQ_GOV_PERFORMANCE + help + Use the CPUfreq governor 'sched' as default. This scales + cpu frequency from the scheduler as per-entity load tracking + statistics are updated. endchoice config CPU_FREQ_GOV_PERFORMANCE @@ -210,6 +218,23 @@ config CPU_FREQ_GOV_CONSERVATIVE If in doubt, say N. 
+config CPU_FREQ_GOV_SCHED + tristate "'sched' cpufreq governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_COMMON + help + 'sched' - this governor scales cpu frequency from the + scheduler as a function of cpu capacity utilization. It does + not evaluate utilization on a periodic basis (as ondemand + does) but instead is invoked from the completely fair + scheduler when updating per-entity load tracking statistics. + Latency to respond to changes in load is improved over polling + governors due to its event-driven design. + + If in doubt, say N. + +comment "CPU frequency scaling drivers" + config CPUFREQ_DT tristate "Generic DT based cpufreq driver" depends on HAVE_CLK && OF diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 90e8deb6c15e..297087fb6b1a 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -102,6 +102,12 @@ bool have_governor_per_policy(void) } EXPORT_SYMBOL_GPL(have_governor_per_policy); +bool cpufreq_driver_might_sleep(void) +{ + return !(cpufreq_driver->flags & CPUFREQ_DRIVER_WILL_NOT_SLEEP); +} +EXPORT_SYMBOL_GPL(cpufreq_driver_might_sleep); + struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy) { if (have_governor_per_policy()) diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index 98c4f9b12b03..0719c98bb682 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -15,6 +15,10 @@ SUBSYS(cpu) SUBSYS(cpuacct) #endif +#if IS_ENABLED(CONFIG_CGROUP_SCHEDTUNE) +SUBSYS(schedtune) +#endif + #if IS_ENABLED(CONFIG_MEMCG) SUBSYS(memory) #endif diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 56ea8a336016..0a82db9082a7 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -158,6 +158,7 @@ u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy); int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu); int cpufreq_update_policy(unsigned int cpu); bool have_governor_per_policy(void); +bool cpufreq_driver_might_sleep(void); struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy); #else static inline unsigned int cpufreq_get(unsigned int cpu) @@ -311,6 +312,14 @@ struct cpufreq_driver { */ #define CPUFREQ_NEED_INITIAL_FREQ_CHECK (1 << 5) +/* + * Set by drivers that will never block or sleep during their frequency + * transition. Used to indicate when it is safe to call cpufreq_driver_target + * from non-interruptable context. Drivers must opt-in to this flag, as the + * safe default is that they might sleep. 
+ */ +#define CPUFREQ_DRIVER_WILL_NOT_SLEEP (1 << 6) + int cpufreq_register_driver(struct cpufreq_driver *driver_data); int cpufreq_unregister_driver(struct cpufreq_driver *driver_data); @@ -486,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_conservative; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE) extern struct cpufreq_governor cpufreq_gov_interactive; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_interactive) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED) +extern struct cpufreq_governor cpufreq_gov_sched; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched) #endif /********************************************************************* diff --git a/include/linux/sched.h b/include/linux/sched.h index 4169de53eae3..94980db2fb66 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -873,6 +873,7 @@ enum cpu_idle_type { #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */ #define SD_NUMA 0x4000 /* cross-node balancing */ +#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */ #ifdef CONFIG_SCHED_SMT static inline int cpu_smt_flags(void) @@ -905,6 +906,26 @@ struct sched_domain_attr { extern int sched_domain_level_max; +struct capacity_state { + unsigned long cap; /* compute capacity */ + unsigned long power; /* power consumption at this compute capacity */ +}; + +struct idle_state { + unsigned long power; /* power consumption in this idle state */ +}; + +struct sched_group_energy { + atomic_t ref; + unsigned int nr_idle_states; /* number of idle states */ + struct idle_state *idle_states; /* ptr to idle state array */ + unsigned int nr_idle_states_below; /* number idle states in lower groups */ + unsigned int nr_cap_states; /* number of capacity states */ + struct capacity_state *cap_states; /* ptr to capacity state array */ +}; + +unsigned long capacity_curr_of(int cpu); + struct sched_group; struct sched_domain { @@ -1003,6 +1024,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu); typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef int (*sched_domain_flags_f)(void); +typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu); #define SDTL_OVERLAP 0x01 @@ -1010,11 +1032,13 @@ struct sd_data { struct sched_domain **__percpu sd; struct sched_group **__percpu sg; struct sched_group_capacity **__percpu sgc; + struct sched_group_energy **__percpu sge; }; struct sched_domain_topology_level { sched_domain_mask_f mask; sched_domain_flags_f sd_flags; + sched_domain_energy_f energy; int flags; int numa_level; struct sd_data data; @@ -1072,15 +1096,28 @@ struct load_weight { }; struct sched_avg { + u64 last_runnable_update; + s64 decay_count; + /* + * utilization_avg_contrib describes the amount of time that a + * sched_entity is running on a CPU. It is based on running_avg_sum + * and is scaled in the range [0..SCHED_LOAD_SCALE]. + * load_avg_contrib described the amount of time that a sched_entity + * is runnable on a rq. It is based on both runnable_avg_sum and the + * weight of the task. + */ + unsigned long load_avg_contrib, utilization_avg_contrib; /* * These sums represent an infinite geometric series and so are bound * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. + * running_avg_sum reflects the time that the sched_entity is + * effectively running on the CPU. 
+ * runnable_avg_sum represents the amount of time a sched_entity is on + * a runqueue which includes the running time that is monitored by + * running_avg_sum. */ - u32 runnable_avg_sum, runnable_avg_period; - u64 last_runnable_update; - s64 decay_count; - unsigned long load_avg_contrib; + u32 runnable_avg_sum, avg_period, running_avg_sum; }; #ifdef CONFIG_SCHEDSTATS diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 596a0e007c62..4ce6a05c6a0c 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -89,6 +89,22 @@ extern int sysctl_sched_rt_runtime; extern unsigned int sysctl_sched_cfs_bandwidth_slice; #endif +#ifdef CONFIG_SCHED_TUNE +extern unsigned int sysctl_sched_cfs_boost; +int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, + loff_t *ppos); +static inline unsigned int get_sysctl_sched_cfs_boost(void) +{ + return sysctl_sched_cfs_boost; +} +#else +static inline unsigned int get_sysctl_sched_cfs_boost(void) +{ + return 0; +} +#endif + #ifdef CONFIG_SCHED_AUTOGROUP extern unsigned int sysctl_sched_autogroup_enabled; #endif diff --git a/include/linux/sched_energy.h b/include/linux/sched_energy.h new file mode 100644 index 000000000000..a3f1627ac609 --- /dev/null +++ b/include/linux/sched_energy.h @@ -0,0 +1,36 @@ +#ifndef _LINUX_SCHED_ENERGY_H +#define _LINUX_SCHED_ENERGY_H + +#include <linux/sched.h> +#include <linux/slab.h> + +/* + * There doesn't seem to be an NR_CPUS style max number of sched domain + * levels so here's an arbitrary constant one for the moment. + * + * The levels alluded to here correspond to entries in struct + * sched_domain_topology_level that are meant to be populated by arch + * specific code (topology.c). + */ +#define NR_SD_LEVELS 8 + +#define SD_LEVEL0 0 +#define SD_LEVEL1 1 +#define SD_LEVEL2 2 +#define SD_LEVEL3 3 +#define SD_LEVEL4 4 +#define SD_LEVEL5 5 +#define SD_LEVEL6 6 +#define SD_LEVEL7 7 + +/* + * Convenience macro for iterating through said sd levels. 
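+ *
+ * Typical use (the caller below is hypothetical, for illustration):
+ *
+ *	int level;
+ *	struct sched_group_energy *sge;
+ *
+ *	for_each_possible_sd_level(level)
+ *		sge = sge_array[cpu][level];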
+ */ +#define for_each_possible_sd_level(level) \ + for (level = 0; level < NR_SD_LEVELS; level++) + +extern struct sched_group_energy *sge_array[NR_CPUS][NR_SD_LEVELS]; + +void init_sched_energy_costs(void); + +#endif diff --git a/include/trace/events/power.h b/include/trace/events/power.h index 274021214b4e..50d8a9fc7166 100644 --- a/include/trace/events/power.h +++ b/include/trace/events/power.h @@ -111,6 +111,13 @@ DEFINE_EVENT(cpu, cpu_frequency, TP_ARGS(frequency, cpu_id) ); +DEFINE_EVENT(cpu, cpu_capacity, + + TP_PROTO(unsigned int capacity, unsigned int cpu_id), + + TP_ARGS(capacity, cpu_id) +); + TRACE_EVENT(device_pm_callback_start, TP_PROTO(struct device *dev, const char *pm_ops, int event), diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index a7d67bc14906..7308931f88d7 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -550,6 +550,332 @@ TRACE_EVENT(sched_wake_idle_without_ipi, TP_printk("cpu=%d", __entry->cpu) ); + +TRACE_EVENT(sched_contrib_scale_f, + + TP_PROTO(int cpu, unsigned long freq_scale_factor, + unsigned long cpu_scale_factor), + + TP_ARGS(cpu, freq_scale_factor, cpu_scale_factor), + + TP_STRUCT__entry( + __field(int, cpu) + __field(unsigned long, freq_scale_factor) + __field(unsigned long, cpu_scale_factor) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->freq_scale_factor = freq_scale_factor; + __entry->cpu_scale_factor = cpu_scale_factor; + ), + + TP_printk("cpu=%d freq_scale_factor=%lu cpu_scale_factor=%lu", + __entry->cpu, __entry->freq_scale_factor, + __entry->cpu_scale_factor) +); + +/* + * Tracepoint for accounting sched averages for tasks. + */ +TRACE_EVENT(sched_load_avg_task, + + TP_PROTO(struct task_struct *tsk, struct sched_avg *avg), + + TP_ARGS(tsk, avg), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, cpu ) + __field( unsigned long, load ) + __field( unsigned long, utilization ) + __field( unsigned int, runnable_avg_sum ) + __field( unsigned int, running_avg_sum ) + __field( unsigned int, avg_period ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->cpu = task_cpu(tsk); + __entry->load = avg->load_avg_contrib; + __entry->utilization = avg->utilization_avg_contrib; + __entry->runnable_avg_sum = avg->runnable_avg_sum; + __entry->running_avg_sum = avg->running_avg_sum; + __entry->avg_period = avg->avg_period; + ), + + TP_printk("comm=%s pid=%d cpu=%d load=%lu utilization=%lu runnable_avg_sum=%u" + " running_avg_sum=%u avg_period=%u", + __entry->comm, __entry->pid, __entry->cpu, + __entry->load, __entry->utilization, + (unsigned int)__entry->runnable_avg_sum, + (unsigned int)__entry->running_avg_sum, + (unsigned int)__entry->avg_period) +); + +/* + * Tracepoint for accounting sched averages for cpus. 
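+ * Reports the runnable load and the utilization of the given cfs_rq.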
+ */ +TRACE_EVENT(sched_load_avg_cpu, + + TP_PROTO(int cpu, struct cfs_rq *cfs_rq), + + TP_ARGS(cpu, cfs_rq), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, load ) + __field( unsigned long, utilization ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->load = cfs_rq->runnable_load_avg; + __entry->utilization = cfs_rq->utilization_load_avg; + ), + + TP_printk("cpu=%d load=%lu utilization=%lu", + __entry->cpu, __entry->load, __entry->utilization) +); + +/* + * Tracepoint for sched_tune_config settings + */ +TRACE_EVENT(sched_tune_config, + + TP_PROTO(int boost, int pb_nrg_gain, int pb_cap_gain, int pc_nrg_gain, int pc_cap_gain), + + TP_ARGS(boost, pb_nrg_gain, pb_cap_gain, pc_nrg_gain, pc_cap_gain), + + TP_STRUCT__entry( + __field( int, boost ) + __field( int, pb_nrg_gain ) + __field( int, pb_cap_gain ) + __field( int, pc_nrg_gain ) + __field( int, pc_cap_gain ) + ), + + TP_fast_assign( + __entry->boost = boost; + __entry->pb_nrg_gain = pb_nrg_gain; + __entry->pb_cap_gain = pb_cap_gain; + __entry->pc_nrg_gain = pc_nrg_gain; + __entry->pc_cap_gain = pc_cap_gain; + ), + + TP_printk("boost=%d " + "pb_nrg_gain=%d pb_cap_gain=%d " + "pc_nrg_gain=%d pc_cap_gain=%d", + __entry->boost, + __entry->pb_nrg_gain, __entry->pb_cap_gain, + __entry->pc_nrg_gain, __entry->pc_cap_gain) +); + +/* + * Tracepoint for accounting task boosted utilization + */ +TRACE_EVENT(sched_boost_task, + + TP_PROTO(struct task_struct *tsk, unsigned long utilization, unsigned long margin), + + TP_ARGS(tsk, utilization, margin), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( unsigned long, utilization ) + __field( unsigned long, margin ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->utilization = utilization; + __entry->margin = margin; + ), + + TP_printk("comm=%s pid=%d utilization=%lu margin=%lu", + __entry->comm, __entry->pid, + __entry->utilization, + __entry->margin) +); + +/* + * Tracepoint for accounting CPU boosted utilization + */ +TRACE_EVENT(sched_boost_cpu, + + TP_PROTO(int cpu, unsigned long usage, unsigned long margin), + + TP_ARGS(cpu, usage, margin), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, usage ) + __field( unsigned long, margin ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->usage = usage; + __entry->margin = margin; + ), + + TP_printk("cpu=%d usage=%lu margin=%lu", + __entry->cpu, + __entry->usage, + __entry->margin) +); + +/* + * Tracepoint for accounting sched group energy + */ +TRACE_EVENT(sched_energy_diff, + + TP_PROTO(struct task_struct *tsk, int scpu, int dcpu, int udelta, + int nrgb, int nrga, int nrgd, int capb, int capa, int capd, + int nrgn, int nrgp), + + TP_ARGS(tsk, scpu, dcpu, udelta, + nrgb, nrga, nrgd, capb, capa, capd, + nrgn, nrgp), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, scpu ) + __field( int, dcpu ) + __field( int, udelta ) + __field( int, nrgb ) + __field( int, nrga ) + __field( int, nrgd ) + __field( int, capb ) + __field( int, capa ) + __field( int, capd ) + __field( int, nrgn ) + __field( int, nrgp ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->scpu = scpu; + __entry->dcpu = dcpu; + __entry->udelta = udelta; + __entry->nrgb = nrgb; + __entry->nrga = nrga; + __entry->nrgd = nrgd; + __entry->capb = capb; + __entry->capa = capa; + __entry->capd = capd; + __entry->nrgn = 
nrgn; + __entry->nrgp = nrgp; + ), + + TP_printk("pid=%d comm=%s " + "src_cpu=%d dst_cpu=%d usage_delta=%d " + "nrg_before=%d nrg_after=%d nrg_diff=%d " + "cap_before=%d cap_after=%d cap_delta=%d " + "nrg_delta=%d nrg_payoff=%d", + __entry->pid, __entry->comm, + __entry->scpu, __entry->dcpu, __entry->udelta, + __entry->nrgb, __entry->nrga, __entry->nrgd, + __entry->capb, __entry->capa, __entry->capd, + __entry->nrgn, __entry->nrgp) +); + +/* + * Tracepoint for schedtune_tasks_update + */ +TRACE_EVENT(sched_tune_tasks_update, + + TP_PROTO(struct task_struct *tsk, int cpu, int tasks, int idx, + unsigned int boost, unsigned int max_boost), + + TP_ARGS(tsk, cpu, tasks, idx, boost, max_boost), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, cpu ) + __field( int, tasks ) + __field( int, idx ) + __field( unsigned int, boost ) + __field( unsigned int, max_boost ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->cpu = cpu; + __entry->tasks = tasks; + __entry->idx = idx; + __entry->boost = boost; + __entry->max_boost = max_boost; + ), + + TP_printk("pid=%d comm=%s " + "cpu=%d tasks=%d idx=%d boost=%u max_boost=%u", + __entry->pid, __entry->comm, + __entry->cpu, __entry->tasks, __entry->idx, + __entry->boost, __entry->max_boost) +); + +/* + * Tracepoint for schedtune_tasks_update + */ +TRACE_EVENT(sched_tune_filter, + + TP_PROTO(int nrg_delta, int cap_delta, int nrg_payoff, int region), + + TP_ARGS(nrg_delta, cap_delta, nrg_payoff, region), + + TP_STRUCT__entry( + __field( int, nrg_delta ) + __field( int, cap_delta ) + __field( int, nrg_payoff ) + __field( int, region ) + ), + + TP_fast_assign( + __entry->nrg_delta = nrg_delta; + __entry->cap_delta = cap_delta; + __entry->nrg_payoff = nrg_payoff; + __entry->region = region; + ), + + TP_printk("nrg_delta=%d cap_delta=%d nrg_payoff=%d region=%d", + __entry->nrg_delta, __entry->cap_delta, + __entry->nrg_payoff, __entry->region) +); + +/* + * Tracepoint for schedtune_boostgroup_update + */ +TRACE_EVENT(sched_tune_boostgroup_update, + + TP_PROTO(int cpu, int variation, int max_boost), + + TP_ARGS(cpu, variation, max_boost), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( int, variation ) + __field( int, max_boost ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->variation = variation; + __entry->max_boost = max_boost; + ), + + TP_printk("cpu=%d variation=%d max_boost=%d", + __entry->cpu, __entry->variation, __entry->max_boost) +); + #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/init/Kconfig b/init/Kconfig index 2081a4d3d917..ff2c8fa956ba 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -985,6 +985,24 @@ config RESOURCE_COUNTERS This option enables controller independent resource accounting infrastructure that works with cgroups. +config CGROUP_SCHEDTUNE + bool "Task boosting cgroup subsystem for EAS (EXPERIMENTAL)" + depends on SCHED_TUNE + help + This option provides the "schedtune" controller which improve the + flexibility of the task boosting mechanism by introducing the support + to define "per task" boost values. + + This new controller: + 1. allows only a two layers hierarchy, where the root defines the + system-wide boost value and its direct childs define each one a + different "class of tasks" to be boosted with a different value + 2. 
supports up to 16 different task classes, each one which could be + configured with a different boost value + + Only if you are testing a kernel with energy-aware scheduler + support, you might want to say Y here. + config MEMCG bool "Memory Resource Controller for Control Groups" depends on RESOURCE_COUNTERS @@ -1230,6 +1248,33 @@ config SCHED_AUTOGROUP desktop applications. Task group autogeneration is currently based upon task session. +config SCHED_TUNE + bool "Tasks boosting for energy-aware scheduler (EXPERIMENTAL)" + help + This option enable the system-wide support for task boosting. + When this support is enabled a new sysctl interface is exposed to + userspace via: + /proc/sys/kernel/sched_cfs_boost + which allows to set a system-wide boost value in range [0..100]. + + The currently boosting strategy is implemented in such a way that: + - a 0% boost value requires to operate in "standard" EAS mode by + scheduling all tasks at the minimum capacities required by their + workload demand + - a 100% boost value requires to push at maximum the task + performances, "regardless" of the incurred energy consumption + + A boost value in between these two boundaries is used to bias the + power/performance trade-off, the higher the boost value the more the + EAS scheduler is biased toward performance boosting instead of energy + efficiency. + + Since this support exposes a single system-wide knob, the specified + boost value is applied to all (CFS) tasks in the system. + + Only if you are testing a kernel with energy-aware scheduler support, + you might want to say Y here. + config SYSFS_DEPRECATED bool "Enable deprecated sysfs features to support old userspace tools" depends on SYSFS diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index ab32b7b0db5c..7351b011137c 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -12,10 +12,12 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer endif obj-y += core.o proc.o clock.o cputime.o -obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o +obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o energy.o obj-y += wait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o +obj-$(CONFIG_SCHED_TUNE) += tune.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 147e86916b09..ed0674c36812 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2124,6 +2124,8 @@ void wake_up_new_task(struct task_struct *p) struct rq *rq; raw_spin_lock_irqsave(&p->pi_lock, flags); + /* Initialize new task's runnable average */ + init_task_runnable_average(p); #ifdef CONFIG_SMP /* * Fork balancing, do it here and not earlier because: @@ -2133,10 +2135,8 @@ void wake_up_new_task(struct task_struct *p) set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - /* Initialize new task's runnable average */ - init_task_runnable_average(p); rq = __task_rq_lock(p); - activate_task(rq, p, 0); + activate_task(rq, p, ENQUEUE_WAKEUP_NEW); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p, true); check_preempt_curr(rq, p, WF_FORK); @@ -5062,9 +5062,62 @@ set_table_entry(struct ctl_table *entry, } static struct ctl_table * +sd_alloc_ctl_energy_table(struct sched_group_energy *sge) +{ + struct ctl_table *table = sd_alloc_ctl_entry(6); + + if (table == NULL) + return NULL; + + 
set_table_entry(&table[0], "nr_idle_states", &sge->nr_idle_states, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[1], "idle_states", &sge->idle_states[0].power, + sge->nr_idle_states*sizeof(struct idle_state), 0644, + proc_doulongvec_minmax, false); + set_table_entry(&table[2], "nr_idle_states_below", &sge->nr_idle_states_below, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[3], "nr_cap_states", &sge->nr_cap_states, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[4], "cap_states", &sge->cap_states[0].cap, + sge->nr_cap_states*sizeof(struct capacity_state), 0644, + proc_doulongvec_minmax, false); + + return table; +} + +static struct ctl_table * +sd_alloc_ctl_group_table(struct sched_group *sg) +{ + struct ctl_table *table = sd_alloc_ctl_entry(2); + + if (table == NULL) + return NULL; + + table->procname = kstrdup("energy", GFP_KERNEL); + table->mode = 0555; + table->child = sd_alloc_ctl_energy_table(sg->sge); + + return table; +} + +static struct ctl_table * sd_alloc_ctl_domain_table(struct sched_domain *sd) { - struct ctl_table *table = sd_alloc_ctl_entry(14); + struct ctl_table *table; + unsigned int nr_entries = 14; + + int i = 0; + struct sched_group *sg = sd->groups; + + if (sg->sge) { + int nr_sgs = 0; + + do {} while (nr_sgs++, sg = sg->next, sg != sd->groups); + + nr_entries += nr_sgs; + } + + table = sd_alloc_ctl_entry(nr_entries); if (table == NULL) return NULL; @@ -5097,7 +5150,19 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd) sizeof(long), 0644, proc_doulongvec_minmax, false); set_table_entry(&table[12], "name", sd->name, CORENAME_MAX_SIZE, 0444, proc_dostring, false); - /* &table[13] is terminator */ + sg = sd->groups; + if (sg->sge) { + char buf[32]; + struct ctl_table *entry = &table[13]; + + do { + snprintf(buf, 32, "group%d", i); + entry->procname = kstrdup(buf, GFP_KERNEL); + entry->mode = 0555; + entry->child = sd_alloc_ctl_group_table(sg); + } while (entry++, i++, sg = sg->next, sg != sd->groups); + } + /* &table[nr_entries-1] is terminator */ return table; } @@ -5399,17 +5464,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; } - /* - * Even though we initialize ->capacity to something semi-sane, - * we leave capacity_orig unset. This allows us to detect if - * domain iteration is still funny without causing /0 traps. 
- */ - if (!group->sgc->capacity_orig) { - printk(KERN_CONT "\n"); - printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n"); - break; - } - if (!cpumask_weight(sched_group_cpus(group))) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: empty group\n"); @@ -5429,7 +5483,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, printk(KERN_CONT " %s", str); if (group->sgc->capacity != SCHED_CAPACITY_SCALE) { - printk(KERN_CONT " (cpu_capacity = %d)", + printk(KERN_CONT " (cpu_capacity = %lu)", group->sgc->capacity); } @@ -5490,7 +5544,8 @@ static int sd_degenerate(struct sched_domain *sd) SD_BALANCE_EXEC | SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | - SD_SHARE_POWERDOMAIN)) { + SD_SHARE_POWERDOMAIN | + SD_SHARE_CAP_STATES)) { if (sd->groups != sd->groups->next) return 0; } @@ -5522,7 +5577,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_PREFER_SIBLING | - SD_SHARE_POWERDOMAIN); + SD_SHARE_POWERDOMAIN | + SD_SHARE_CAP_STATES); if (nr_node_ids == 1) pflags &= ~SD_SERIALIZE; } @@ -5658,6 +5714,9 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc) if (free_sgc && atomic_dec_and_test(&sg->sgc->ref)) kfree(sg->sgc); + if (free_sgc && atomic_dec_and_test(&sg->sge->ref)) + kfree(sg->sge); + kfree(sg); sg = tmp; } while (sg != first); @@ -5675,6 +5734,7 @@ static void free_sched_domain(struct rcu_head *rcu) free_sched_groups(sd->groups, 1); } else if (atomic_dec_and_test(&sd->groups->ref)) { kfree(sd->groups->sgc); + kfree(sd->groups->sge); kfree(sd->groups); } kfree(sd); @@ -5706,11 +5766,12 @@ DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain *, sd_numa); DEFINE_PER_CPU(struct sched_domain *, sd_busy); DEFINE_PER_CPU(struct sched_domain *, sd_asym); +DEFINE_PER_CPU(struct sched_domain *, sd_ea); static void update_top_cache_domain(int cpu) { struct sched_domain *sd; - struct sched_domain *busy_sd = NULL; + struct sched_domain *busy_sd = NULL, *ea_sd = NULL; int id = cpu; int size = 1; @@ -5731,6 +5792,14 @@ static void update_top_cache_domain(int cpu) sd = highest_flag_domain(cpu, SD_ASYM_PACKING); rcu_assign_pointer(per_cpu(sd_asym, cpu), sd); + + for_each_domain(cpu, sd) { + if (sd->groups->sge) + ea_sd = sd; + else + break; + } + rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd); } /* @@ -5894,7 +5963,9 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) * die on a /0 trap. 
*/ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->capacity_orig = sg->sgc->capacity; + sg->sgc->max_capacity = SCHED_CAPACITY_SCALE; + + sg->sge = *per_cpu_ptr(sdd->sge, i); /* * Make sure the first group of this domain contains the @@ -5934,6 +6005,7 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg) *sg = *per_cpu_ptr(sdd->sg, cpu); (*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu); atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */ + (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu); } return cpu; @@ -6023,6 +6095,64 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd) atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight); } +static void init_sched_energy(int cpu, struct sched_domain *sd, + struct sched_domain_topology_level *tl) +{ + struct sched_group *sg = sd->groups; + struct sched_group_energy *sge = sg->sge; + sched_domain_energy_f fn = tl->energy; + struct cpumask *mask = sched_group_cpus(sg); + int nr_idle_states_below = 0; + + if (fn && sd->child && !sd->child->groups->sge) { + pr_err("BUG: EAS setup broken for CPU%d\n", cpu); +#ifdef CONFIG_SCHED_DEBUG + pr_err(" energy data on %s but not on %s domain\n", + sd->name, sd->child->name); +#endif + return; + } + + if (cpu != group_balance_cpu(sg)) + return; + + if (!fn || !fn(cpu)) { + sg->sge = NULL; + return; + } + + atomic_set(&sg->sge->ref, 1); /* for claim_allocations */ + + if (cpumask_weight(mask) > 1) + check_sched_energy_data(cpu, fn, mask); + + /* Figure out the number of true cpuidle states below current group */ + sd = sd->child; + for_each_lower_domain(sd) { + nr_idle_states_below += sd->groups->sge->nr_idle_states; + + /* Disregard non-cpuidle 'active' idle states */ + if (sd->child) + nr_idle_states_below--; + } + + sge->nr_idle_states = fn(cpu)->nr_idle_states; + sge->nr_idle_states_below = nr_idle_states_below; + sge->nr_cap_states = fn(cpu)->nr_cap_states; + sge->idle_states = (struct idle_state *) + ((void *)&sge->cap_states + + sizeof(sge->cap_states)); + sge->cap_states = (struct capacity_state *) + ((void *)&sge->cap_states + + sizeof(sge->cap_states) + + sge->nr_idle_states * + sizeof(struct idle_state)); + memcpy(sge->idle_states, fn(cpu)->idle_states, + sge->nr_idle_states*sizeof(struct idle_state)); + memcpy(sge->cap_states, fn(cpu)->cap_states, + sge->nr_cap_states*sizeof(struct capacity_state)); +} + /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() @@ -6113,6 +6243,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd) if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref)) *per_cpu_ptr(sdd->sgc, cpu) = NULL; + + if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref)) + *per_cpu_ptr(sdd->sge, cpu) = NULL; } #ifdef CONFIG_NUMA @@ -6129,6 +6262,7 @@ static int sched_domains_curr_level; * SD_SHARE_PKG_RESOURCES - describes shared caches * SD_NUMA - describes NUMA topologies * SD_SHARE_POWERDOMAIN - describes shared power domain + * SD_SHARE_CAP_STATES - describes shared capacity states * * Odd one out: * SD_ASYM_PACKING - describes SMT quirks @@ -6138,7 +6272,8 @@ static int sched_domains_curr_level; SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ SD_ASYM_PACKING | \ - SD_SHARE_POWERDOMAIN) + SD_SHARE_POWERDOMAIN | \ + SD_SHARE_CAP_STATES) static struct sched_domain * sd_init(struct sched_domain_topology_level *tl, int cpu) @@ -6178,7 +6313,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) | 1*SD_BALANCE_NEWIDLE | 1*SD_BALANCE_EXEC | 1*SD_BALANCE_FORK - | 
0*SD_BALANCE_WAKE + | 1*SD_BALANCE_WAKE | 1*SD_WAKE_AFFINE | 0*SD_SHARE_CPUCAPACITY | 0*SD_SHARE_PKG_RESOURCES @@ -6203,6 +6338,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */ if (sd->flags & SD_SHARE_CPUCAPACITY) { + sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ @@ -6522,10 +6658,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map) if (!sdd->sgc) return -ENOMEM; + sdd->sge = alloc_percpu(struct sched_group_energy *); + if (!sdd->sge) + return -ENOMEM; + for_each_cpu(j, cpu_map) { struct sched_domain *sd; struct sched_group *sg; struct sched_group_capacity *sgc; + struct sched_group_energy *sge; + sched_domain_energy_f fn = tl->energy; + unsigned int nr_idle_states = 0; + unsigned int nr_cap_states = 0; + + if (fn && fn(j)) { + nr_idle_states = fn(j)->nr_idle_states; + nr_cap_states = fn(j)->nr_cap_states; + BUG_ON(!nr_idle_states || !nr_cap_states); + } sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(), GFP_KERNEL, cpu_to_node(j)); @@ -6549,6 +6699,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map) return -ENOMEM; *per_cpu_ptr(sdd->sgc, j) = sgc; + + sge = kzalloc_node(sizeof(struct sched_group_energy) + + nr_idle_states*sizeof(struct idle_state) + + nr_cap_states*sizeof(struct capacity_state), + GFP_KERNEL, cpu_to_node(j)); + + if (!sge) + return -ENOMEM; + + *per_cpu_ptr(sdd->sge, j) = sge; } } @@ -6577,6 +6737,8 @@ static void __sdt_free(const struct cpumask *cpu_map) kfree(*per_cpu_ptr(sdd->sg, j)); if (sdd->sgc) kfree(*per_cpu_ptr(sdd->sgc, j)); + if (sdd->sge) + kfree(*per_cpu_ptr(sdd->sge, j)); } free_percpu(sdd->sd); sdd->sd = NULL; @@ -6584,6 +6746,8 @@ static void __sdt_free(const struct cpumask *cpu_map) sdd->sg = NULL; free_percpu(sdd->sgc); sdd->sgc = NULL; + free_percpu(sdd->sge); + sdd->sge = NULL; } } @@ -6631,6 +6795,7 @@ static int build_sched_domains(const struct cpumask *cpu_map, enum s_alloc alloc_state; struct sched_domain *sd; struct s_data d; + struct rq *rq; int i, ret = -ENOMEM; alloc_state = __visit_domain_allocation_hell(&d, cpu_map); @@ -6669,10 +6834,13 @@ static int build_sched_domains(const struct cpumask *cpu_map, /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { + struct sched_domain_topology_level *tl = sched_domain_topology; + if (!cpumask_test_cpu(i, cpu_map)) continue; - for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) { + init_sched_energy(i, sd, tl); claim_allocations(i, sd); init_sched_groups_capacity(i, sd); } @@ -6681,11 +6849,18 @@ static int build_sched_domains(const struct cpumask *cpu_map, /* Attach the domains */ rcu_read_lock(); for_each_cpu(i, cpu_map) { + rq = cpu_rq(i); sd = *per_cpu_ptr(d.sd, i); cpu_attach_domain(sd, d.rd, i); + + if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity) + rq->rd->max_cpu_capacity = rq->cpu_capacity_orig; } rcu_read_unlock(); + rq = cpu_rq(cpumask_first(cpu_map)); + pr_info("Max cpu capacity: %lu\n", rq->rd->max_cpu_capacity); + ret = 0; error: __free_domain_allocs(&d, alloc_state, cpu_map); @@ -7117,7 +7292,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq->cpu_capacity = SCHED_CAPACITY_SCALE; + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies; diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c new file mode 100644 index 000000000000..8c5ad4805e29 --- 
/dev/null +++ b/kernel/sched/cpufreq_sched.c @@ -0,0 +1,349 @@ +/* + * Copyright (C) 2015 Michael Turquette <mturquette@linaro.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h> +#include <linux/percpu.h> +#include <linux/irq_work.h> + +#include "sched.h" + +#define THROTTLE_NSEC 50000000 /* 50ms default */ + +static DEFINE_PER_CPU(unsigned long, pcpu_capacity); +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy); +static DEFINE_PER_CPU(int, governor_started); + +/** + * gov_data - per-policy data internal to the governor + * @throttle: next throttling period expiry. Derived from throttle_nsec + * @throttle_nsec: throttle period length in nanoseconds + * @task: worker thread for dvfs transition that may block/sleep + * @irq_work: callback used to wake up worker thread + * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread + * + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A + * per-policy instance of it is created when the cpufreq_sched governor receives + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data + * member of struct cpufreq_policy. + * + * Readers of this data must call down_read(policy->rwsem). Writers must + * call down_write(policy->rwsem). + */ +struct gov_data { + ktime_t throttle; + unsigned int throttle_nsec; + struct task_struct *task; + struct irq_work irq_work; + struct cpufreq_policy *policy; + unsigned int freq; +}; + +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq) +{ + struct gov_data *gd = policy->governor_data; + + /* avoid race with cpufreq_sched_stop */ + if (!down_write_trylock(&policy->rwsem)) + return; + + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L); + + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec); + up_write(&policy->rwsem); +} + +/* + * we pass in struct cpufreq_policy. 
This is safe because changing out the + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP), + * which tears down all of the data structures and __cpufreq_governor(policy, + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the + * new policy pointer + */ +static int cpufreq_sched_thread(void *data) +{ + struct sched_param param; + struct cpufreq_policy *policy; + struct gov_data *gd; + int ret; + + policy = (struct cpufreq_policy *) data; + if (!policy) { + pr_warn("%s: missing policy\n", __func__); + do_exit(-EINVAL); + } + + gd = policy->governor_data; + if (!gd) { + pr_warn("%s: missing governor data\n", __func__); + do_exit(-EINVAL); + } + + param.sched_priority = 50; + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, ¶m); + if (ret) { + pr_warn("%s: failed to set SCHED_FIFO\n", __func__); + do_exit(-EINVAL); + } else { + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n", + __func__, gd->task->pid); + } + + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus); + if (ret) { + pr_warn("%s: failed to set allowed ptr\n", __func__); + do_exit(-EINVAL); + } + + /* main loop of the per-policy kthread */ + do { + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + if (kthread_should_stop()) + break; + + cpufreq_sched_try_driver_target(policy, gd->freq); + } while (!kthread_should_stop()); + + do_exit(0); +} + +static void cpufreq_sched_irq_work(struct irq_work *irq_work) +{ + struct gov_data *gd; + + gd = container_of(irq_work, struct gov_data, irq_work); + if (!gd) { + return; + } + + wake_up_process(gd->task); +} + +/** + * cpufreq_sched_set_capacity - interface to scheduler for changing capacity values + * @cpu: cpu whose capacity utilization has recently changed + * @capacity: the new capacity requested by cpu + * + * cpufreq_sched_sched_capacity is an interface exposed to the scheduler so + * that the scheduler may inform the governor of updates to capacity + * utilization and make changes to cpu frequency. Currently this interface is + * designed around PELT values in CFS. It can be expanded to other scheduling + * classes in the future if needed. + * + * cpufreq_sched_set_capacity raises an IPI. The irq_work handler for that IPI + * wakes up the thread that does the actual work, cpufreq_sched_thread. + * + * This functions bails out early if either condition is true: + * 1) this cpu did not the new maximum capacity for its frequency domain + * 2) no change in cpu frequency is necessary to meet the new capacity request + */ +void cpufreq_sched_set_cap(int cpu, unsigned long capacity) +{ + unsigned int freq_new, cpu_tmp; + struct cpufreq_policy *policy; + struct gov_data *gd; + unsigned long capacity_max = 0; + + if (!per_cpu(governor_started, cpu)) + return; + + /* update per-cpu capacity request */ + per_cpu(pcpu_capacity, cpu) = capacity; + + policy = cpufreq_cpu_get(cpu); + if (IS_ERR_OR_NULL(policy)) { + return; + } + + if (!policy->governor_data) + goto out; + + gd = policy->governor_data; + + /* bail early if we are throttled */ + if (ktime_before(ktime_get(), gd->throttle)) + goto out; + + /* find max capacity requested by cpus in this policy */ + for_each_cpu(cpu_tmp, policy->cpus) + capacity_max = max(capacity_max, per_cpu(pcpu_capacity, cpu_tmp)); + + /* + * We only change frequency if this cpu's capacity request represents a + * new max. If another cpu has requested a capacity greater than the + * previous max then we rely on that cpu to hit this code path and make + * the change. 
+ * IOW, the cpu with the new max capacity is responsible
+ * for setting the new capacity/frequency.
+ *
+ * If this cpu is not the new maximum then bail.
+ */
+ if (capacity_max > capacity)
+ goto out;
+
+ /* Convert the new maximum capacity request into a cpu frequency */
+ freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
+
+ /* No change in frequency? Bail and return current capacity. */
+ if (freq_new == policy->cur)
+ goto out;
+
+ /* store the new frequency and perform the transition */
+ gd->freq = freq_new;
+
+ if (cpufreq_driver_might_sleep())
+ irq_work_queue_on(&gd->irq_work, cpu);
+ else
+ cpufreq_sched_try_driver_target(policy, freq_new);
+
+out:
+ cpufreq_cpu_put(policy);
+ return;
+}
+
+/**
+ * cpufreq_sched_reset_cap - interface to scheduler for resetting capacity
+ * requests
+ * @cpu: cpu whose capacity request has to be reset
+ *
+ * This _won't trigger_ any capacity update.
+ */
+void cpufreq_sched_reset_cap(int cpu)
+{
+ per_cpu(pcpu_capacity, cpu) = 0;
+}
+
+static inline void set_sched_energy_freq(void)
+{
+ static_key_slow_inc(&__sched_energy_freq);
+}
+
+static inline void clear_sched_energy_freq(void)
+{
+ static_key_slow_dec(&__sched_energy_freq);
+}
+
+static int cpufreq_sched_start(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+ int cpu;
+
+ /* prepare per-policy private data */
+ gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+ if (!gd) {
+ pr_debug("%s: failed to allocate private data\n", __func__);
+ return -ENOMEM;
+ }
+
+ /* initialize per-cpu data */
+ for_each_cpu(cpu, policy->cpus) {
+ per_cpu(pcpu_capacity, cpu) = 0;
+ per_cpu(pcpu_policy, cpu) = policy;
+ }
+
+ /*
+ * Don't ask for freq changes at a higher rate than what
+ * the driver advertises as transition latency.
+ */
+ gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+ policy->cpuinfo.transition_latency :
+ THROTTLE_NSEC;
+ pr_debug("%s: throttle threshold = %u [ns]\n",
+ __func__, gd->throttle_nsec);
+
+ if (cpufreq_driver_might_sleep()) {
+ /* init per-policy kthread */
+ gd->task = kthread_run(cpufreq_sched_thread, policy, "kcpufreq_sched_task");
+ if (IS_ERR_OR_NULL(gd->task)) {
+ pr_err("%s: failed to create kcpufreq_sched_task thread\n", __func__);
+ goto err;
+ }
+ init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
+ }
+
+ policy->governor_data = gd;
+ gd->policy = policy;
+ set_sched_energy_freq();
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(governor_started, cpu) = 1;
+
+ return 0;
+
+err:
+ kfree(gd);
+ return -ENOMEM;
+}
+
+static int cpufreq_sched_stop(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd = policy->governor_data;
+ int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(governor_started, cpu) = 0;
+
+ clear_sched_energy_freq();
+ if (cpufreq_driver_might_sleep()) {
+ kthread_stop(gd->task);
+ }
+
+ policy->governor_data = NULL;
+
+ /* FIXME replace with devm counterparts?
*/ + kfree(gd); + return 0; +} + +static int cpufreq_sched_setup(struct cpufreq_policy *policy, unsigned int event) +{ + switch (event) { + case CPUFREQ_GOV_START: + /* Start managing the frequency */ + return cpufreq_sched_start(policy); + + case CPUFREQ_GOV_STOP: + return cpufreq_sched_stop(policy); + + case CPUFREQ_GOV_LIMITS: /* unused */ + case CPUFREQ_GOV_POLICY_INIT: /* unused */ + case CPUFREQ_GOV_POLICY_EXIT: /* unused */ + break; + } + return 0; +} + +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED +static +#endif +struct cpufreq_governor cpufreq_gov_sched = { + .name = "sched", + .governor = cpufreq_sched_setup, + .owner = THIS_MODULE, +}; + +static int __init cpufreq_sched_init(void) +{ + int cpu; + + for_each_cpu(cpu, cpu_possible_mask) { + per_cpu(governor_started, cpu) = 0; + } + return cpufreq_register_governor(&cpufreq_gov_sched); +} + +static void __exit cpufreq_sched_exit(void) +{ + cpufreq_unregister_governor(&cpufreq_gov_sched); +} + +/* Try to make this the default governor */ +fs_initcall(cpufreq_sched_init); + +MODULE_LICENSE("GPL v2"); diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index ce33780d8f20..efb47edc04c1 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group if (!se) { struct sched_avg *avg = &cpu_rq(cpu)->avg; P(avg->runnable_avg_sum); - P(avg->runnable_avg_period); + P(avg->avg_period); return; } @@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group P(se->load.weight); #ifdef CONFIG_SMP P(se->avg.runnable_avg_sum); - P(se->avg.runnable_avg_period); + P(se->avg.running_avg_sum); + P(se->avg.avg_period); P(se->avg.load_avg_contrib); + P(se->avg.utilization_avg_contrib); P(se->avg.decay_count); #endif #undef PN @@ -214,6 +216,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) cfs_rq->runnable_load_avg); SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg", cfs_rq->blocked_load_avg); + SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg", + cfs_rq->utilization_load_avg); #ifdef CONFIG_FAIR_GROUP_SCHED SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib", cfs_rq->tg_load_contrib); @@ -628,8 +632,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.load.weight); #ifdef CONFIG_SMP P(se.avg.runnable_avg_sum); - P(se.avg.runnable_avg_period); + P(se.avg.running_avg_sum); + P(se.avg.avg_period); P(se.avg.load_avg_contrib); + P(se.avg.utilization_avg_contrib); P(se.avg.decay_count); #endif P(policy); diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c new file mode 100644 index 000000000000..b0656b7a93e3 --- /dev/null +++ b/kernel/sched/energy.c @@ -0,0 +1,124 @@ +/* + * Obtain energy cost data from DT and populate relevant scheduler data + * structures. + * + * Copyright (C) 2015 ARM Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ +#define pr_fmt(fmt) "sched-energy: " fmt + +#define DEBUG + +#include <linux/gfp.h> +#include <linux/of.h> +#include <linux/printk.h> +#include <linux/sched.h> +#include <linux/sched_energy.h> +#include <linux/stddef.h> + +struct sched_group_energy *sge_array[NR_CPUS][NR_SD_LEVELS]; + +static void free_resources(void) +{ + int cpu, sd_level; + struct sched_group_energy *sge; + + for_each_possible_cpu(cpu) { + for_each_possible_sd_level(sd_level) { + sge = sge_array[cpu][sd_level]; + if (sge) { + kfree(sge->cap_states); + kfree(sge->idle_states); + kfree(sge); + } + } + } +} + +void init_sched_energy_costs(void) +{ + struct device_node *cn, *cp; + struct capacity_state *cap_states; + struct idle_state *idle_states; + struct sched_group_energy *sge; + const struct property *prop; + int sd_level, i, nstates, cpu; + const __be32 *val; + + for_each_possible_cpu(cpu) { + cn = of_get_cpu_node(cpu, NULL); + if (!cn) { + pr_warn("CPU device node missing for CPU %d\n", cpu); + return; + } + + if (!of_find_property(cn, "sched-energy-costs", NULL)) { + pr_warn("CPU device node has no sched-energy-costs\n"); + return; + } + + for_each_possible_sd_level(sd_level) { + cp = of_parse_phandle(cn, "sched-energy-costs", sd_level); + if (!cp) + break; + + prop = of_find_property(cp, "busy-cost-data", NULL); + if (!prop || !prop->value) { + pr_warn("No busy-cost data, skipping sched_energy init\n"); + goto out; + } + + sge = kcalloc(1, sizeof(struct sched_group_energy), + GFP_NOWAIT); + + nstates = (prop->length / sizeof(u32)) / 2; + cap_states = kcalloc(nstates, + sizeof(struct capacity_state), + GFP_NOWAIT); + + for (i = 0, val = prop->value; i < nstates; i++) { + cap_states[i].cap = be32_to_cpup(val++); + cap_states[i].power = be32_to_cpup(val++); + } + + sge->nr_cap_states = nstates; + sge->cap_states = cap_states; + + prop = of_find_property(cp, "idle-cost-data", NULL); + if (!prop || !prop->value) { + pr_warn("No idle-cost data, skipping sched_energy init\n"); + goto out; + } + + nstates = (prop->length / sizeof(u32)); + idle_states = kcalloc(nstates, + sizeof(struct idle_state), + GFP_NOWAIT); + + for (i = 0, val = prop->value; i < nstates; i++) + idle_states[i].power = be32_to_cpup(val++); + + sge->nr_idle_states = nstates; + sge->idle_states = idle_states; + + sge_array[cpu][sd_level] = sge; + } + } + + pr_info("Sched-energy-costs installed from DT\n"); + return; + +out: + free_resources(); +} diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2246a36050f9..c5aae8925fe7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -34,6 +34,7 @@ #include <trace/events/sched.h> #include "sched.h" +#include "tune.h" /* * Targeted preemption latency for CPU-bound tasks: @@ -670,17 +671,18 @@ static int select_idle_sibling(struct task_struct *p, int cpu); static unsigned long task_h_load(struct task_struct *p); static inline void __update_task_entity_contrib(struct sched_entity *se); +static inline void __update_task_entity_utilization(struct sched_entity *se); /* Give new task start runnable values to heavy its load in infant time */ void init_task_runnable_average(struct task_struct *p) { - u32 slice; + u32 start_load = sysctl_sched_latency >> 10; p->se.avg.decay_count = 0; - slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; - p->se.avg.runnable_avg_sum = slice; - p->se.avg.runnable_avg_period = slice; + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = start_load; + p->se.avg.avg_period = start_load; __update_task_entity_contrib(&p->se); + 
__update_task_entity_utilization(&p->se); } #else void init_task_runnable_average(struct task_struct *p) @@ -1571,7 +1573,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) *period = now - p->last_task_numa_placement; } else { delta = p->se.avg.runnable_avg_sum; - *period = p->se.avg.runnable_avg_period; + *period = p->se.avg.avg_period; } p->last_sum_exec_runtime = runtime; @@ -2317,13 +2319,19 @@ static u32 __compute_runnable_contrib(u64 n) * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... ) * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}] */ -static __always_inline int __update_entity_runnable_avg(u64 now, +static __always_inline int __update_entity_runnable_avg(u64 now, int cpu, struct sched_avg *sa, - int runnable) + int runnable, + int running) { - u64 delta, periods; - u32 runnable_contrib; - int delta_w, decayed = 0; + u64 delta, scaled_delta, periods; + u32 runnable_contrib, scaled_runnable_contrib; + int delta_w, scaled_delta_w, decayed = 0; + unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu); + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu); + + trace_sched_contrib_scale_f(cpu, scale_freq, scale_cpu); + delta = now - sa->last_runnable_update; /* @@ -2345,7 +2353,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, sa->last_runnable_update = now; /* delta_w is the amount already accumulated against our next period */ - delta_w = sa->runnable_avg_period % 1024; + delta_w = sa->avg_period % 1024; if (delta + delta_w >= 1024) { /* period roll-over */ decayed = 1; @@ -2356,9 +2364,17 @@ static __always_inline int __update_entity_runnable_avg(u64 now, * period and accrue it. */ delta_w = 1024 - delta_w; + scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT; + if (runnable) - sa->runnable_avg_sum += delta_w; - sa->runnable_avg_period += delta_w; + sa->runnable_avg_sum += scaled_delta_w; + + scaled_delta_w *= scale_cpu; + scaled_delta_w >>= SCHED_CAPACITY_SHIFT; + + if (running) + sa->running_avg_sum += scaled_delta_w; + sa->avg_period += delta_w; delta -= delta_w; @@ -2368,20 +2384,39 @@ static __always_inline int __update_entity_runnable_avg(u64 now, sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); - sa->runnable_avg_period = decay_load(sa->runnable_avg_period, + sa->running_avg_sum = decay_load(sa->running_avg_sum, + periods + 1); + sa->avg_period = decay_load(sa->avg_period, periods + 1); /* Efficiently calculate \sum (1..n_period) 1024*y^i */ runnable_contrib = __compute_runnable_contrib(periods); + scaled_runnable_contrib = (runnable_contrib * scale_freq) + >> SCHED_CAPACITY_SHIFT; + if (runnable) - sa->runnable_avg_sum += runnable_contrib; - sa->runnable_avg_period += runnable_contrib; + sa->runnable_avg_sum += scaled_runnable_contrib; + + scaled_runnable_contrib *= scale_cpu; + scaled_runnable_contrib >>= SCHED_CAPACITY_SHIFT; + + if (running) + sa->running_avg_sum += scaled_runnable_contrib; + sa->avg_period += runnable_contrib; } /* Remainder of delta accrued against u_0` */ + scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT; + if (runnable) - sa->runnable_avg_sum += delta; - sa->runnable_avg_period += delta; + sa->runnable_avg_sum += scaled_delta; + + scaled_delta *= scale_cpu; + scaled_delta >>= SCHED_CAPACITY_SHIFT; + + if (running) + sa->running_avg_sum += scaled_delta; + sa->avg_period += delta; return decayed; } @@ -2393,11 +2428,13 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se) u64 decays = 
atomic64_read(&cfs_rq->decay_counter); decays -= se->avg.decay_count; + se->avg.decay_count = 0; if (!decays) return 0; se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays); - se->avg.decay_count = 0; + se->avg.utilization_avg_contrib = + decay_load(se->avg.utilization_avg_contrib, decays); return decays; } @@ -2433,7 +2470,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa, /* The fraction of a cpu used by this cfs_rq */ contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT, - sa->runnable_avg_period + 1); + sa->avg_period + 1); contrib -= cfs_rq->tg_runnable_contrib; if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) { @@ -2486,7 +2523,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se) static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); + __update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg, + runnable, runnable); __update_tg_runnable_avg(&rq->avg, &rq->cfs); } #else /* CONFIG_FAIR_GROUP_SCHED */ @@ -2504,7 +2542,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se) /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight); - contrib /= (se->avg.runnable_avg_period + 1); + contrib /= (se->avg.avg_period + 1); se->avg.load_avg_contrib = scale_load(contrib); } @@ -2523,6 +2561,31 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) return se->avg.load_avg_contrib - old_contrib; } + +static inline void __update_task_entity_utilization(struct sched_entity *se) +{ + u32 contrib; + + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); + contrib /= (se->avg.avg_period + 1); + se->avg.utilization_avg_contrib = scale_load(contrib); +} + +static long __update_entity_utilization_avg_contrib(struct sched_entity *se) +{ + long old_contrib = se->avg.utilization_avg_contrib; + + if (entity_is_task(se)) + __update_task_entity_utilization(se); + else + se->avg.utilization_avg_contrib = + group_cfs_rq(se)->utilization_load_avg + + group_cfs_rq(se)->utilization_blocked_avg; + + return se->avg.utilization_avg_contrib - old_contrib; +} + static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, long load_contrib) { @@ -2532,6 +2595,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, cfs_rq->blocked_load_avg = 0; } +static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq, + long utilization_contrib) +{ + if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg)) + cfs_rq->utilization_blocked_avg -= utilization_contrib; + else + cfs_rq->utilization_blocked_avg = 0; +} + static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); /* Update a sched_entity's runnable average */ @@ -2539,7 +2611,8 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, utilization_delta; + int cpu = cpu_of(rq_of(cfs_rq)); u64 now; /* @@ -2551,18 +2624,29 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se)); - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq, + cfs_rq->curr == se)) return; contrib_delta = 
__update_entity_load_avg_contrib(se); + utilization_delta = __update_entity_utilization_avg_contrib(se); + + if (entity_is_task(se)) + trace_sched_load_avg_task(task_of(se), &se->avg); if (!update_cfs_rq) return; - if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; - else + cfs_rq->utilization_load_avg += utilization_delta; + } else { subtract_blocked_load_contrib(cfs_rq, -contrib_delta); + subtract_utilization_blocked_contrib(cfs_rq, + -utilization_delta); + } + + trace_sched_load_avg_cpu(cpu, cfs_rq); } /* @@ -2579,14 +2663,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update) return; if (atomic_long_read(&cfs_rq->removed_load)) { - unsigned long removed_load; + unsigned long removed_load, removed_utilization; removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0); + removed_utilization = + atomic_long_xchg(&cfs_rq->removed_utilization, 0); subtract_blocked_load_contrib(cfs_rq, removed_load); + subtract_utilization_blocked_contrib(cfs_rq, + removed_utilization); } if (decays) { cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg, decays); + cfs_rq->utilization_blocked_avg = + decay_load(cfs_rq->utilization_blocked_avg, decays); atomic64_add(decays, &cfs_rq->decay_counter); cfs_rq->last_decay = now; } @@ -2633,10 +2723,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, /* migrated tasks did not contribute to our blocked load */ if (wakeup) { subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); + subtract_utilization_blocked_contrib(cfs_rq, + se->avg.utilization_avg_contrib); update_entity_load_avg(se, 0); } cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib; /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2655,8 +2748,11 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep); cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib; if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; + cfs_rq->utilization_blocked_avg += + se->avg.utilization_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); } /* migrations, e.g. sleep=0 leave decay_count == 0 */ } @@ -2902,6 +2998,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * Update run-time statistics of the 'current'. */ update_curr(cfs_rq); + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD) + flags &= !DEQUEUE_SLEEP; dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP); update_stats_dequeue(cfs_rq, se); @@ -2992,6 +3090,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); } update_stats_curr_start(cfs_rq, se); @@ -3960,6 +4059,13 @@ static inline void hrtick_update(struct rq *rq) } #endif +static unsigned int capacity_margin = 1280; /* ~20% margin */ + +static bool cpu_overutilized(int cpu); +static unsigned long get_cpu_usage(int cpu); +static inline unsigned long get_boosted_cpu_usage(int cpu); +struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE; + /* * The enqueue_task method is called before nr_running is * increased. 
* Here we update the fair scheduling stats and
@@ -3970,6 +4076,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 struct cfs_rq *cfs_rq;
 struct sched_entity *se = &p->se;
+ int task_new = flags & ENQUEUE_WAKEUP_NEW;
+ int task_wakeup = flags & ENQUEUE_WAKEUP;
 for_each_sched_entity(se) {
 if (se->on_rq)
@@ -4004,6 +4112,30 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 if (!se) {
 update_rq_runnable_avg(rq, rq->nr_running);
 add_nr_running(rq, 1);
+ if (!task_new && !rq->rd->overutilized &&
+ cpu_overutilized(rq->cpu))
+ rq->rd->overutilized = true;
+
+ schedtune_enqueue_task(p, cpu_of(rq));
+
+ /*
+ * We want to trigger a freq switch request only for tasks that
+ * are waking up; this is because we get here also during
+ * load balancing, but in these cases it seems wise to trigger
+ * a single request after load balancing is done.
+ *
+ * Also, we add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's utilization
+ * further increases.
+ */
+ if (sched_energy_freq() && (task_new || task_wakeup)) {
+ unsigned long req_cap =
+ get_boosted_cpu_usage(cpu_of(rq));
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ }
 }
 hrtick_update(rq);
 }
@@ -4065,6 +4197,31 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 if (!se) {
 sub_nr_running(rq, 1);
 update_rq_runnable_avg(rq, 1);
+
+ schedtune_dequeue_task(p, cpu_of(rq));
+
+ /*
+ * We want to trigger a freq switch request only for tasks that
+ * are going to sleep; this is because we get here also during
+ * load balancing, but in these cases it seems wise to trigger
+ * a single request after load balancing is done.
+ *
+ * Also, we add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's utilization
+ * further increases.
+ */
+ if (sched_energy_freq() && task_sleep) {
+ unsigned long req_cap =
+ get_boosted_cpu_usage(cpu_of(rq));
+
+ if (rq->cfs.nr_running) {
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ } else {
+ cpufreq_sched_reset_cap(cpu_of(rq));
+ }
+ }
 }
 hrtick_update(rq);
 }
@@ -4114,6 +4271,11 @@ static unsigned long capacity_of(int cpu)
 return cpu_rq(cpu)->cpu_capacity;
 }
+unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 struct rq *rq = cpu_rq(cpu);
@@ -4272,6 +4434,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 return wl;
 }
+
 #else
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
@@ -4281,6 +4444,410 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 #endif
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * tasks (i.e. cpu_capacity).
+ *
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU.
+ * It represents the amount of utilization of a CPU in the range
+ * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
+ * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
+ * towards a sum equal to or less than the current capacity (capacity_curr <=
+ * capacity_orig) of the CPU because it is the running time on this CPU scaled
+ * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
+ * capacity_curr or even higher than capacity_orig because of unfortunate
+ * rounding in avg_period and running_load_avg or just after migrating tasks
+ * (and new task wakeups) until the average stabilizes with the new running
+ * time. We need to check that the usage stays within the range
+ * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
+ * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
+ * CPU1 has 20% of available capacity. We allow usage to overshoot
+ * capacity_curr (but not capacity_orig) as it is useful for predicting the
+ * capacity required after task migrations (scheduler-driven DVFS).
+ */
+static unsigned long __get_cpu_usage(int cpu, int delta)
+{
+ int sum;
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+ unsigned long capacity_orig = capacity_orig_of(cpu);
+
+ sum = usage + blocked + delta;
+
+ if (sum < 0)
+ return 0;
+
+ if (sum >= capacity_orig)
+ return capacity_orig;
+
+ return sum;
+}
+
+static unsigned long get_cpu_usage(int cpu)
+{
+ return __get_cpu_usage(cpu, 0);
+}
+
+static inline bool energy_aware(void)
+{
+ return sched_feat(ENERGY_AWARE);
+}
+
+struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int cap_idx;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+ int energy_payoff;
+ struct task_struct *task;
+ struct {
+ int before;
+ int after;
+ int diff;
+ int delta;
+ } nrg;
+ struct {
+ int before;
+ int after;
+ int delta;
+ } cap;
+};
+
+/*
+ * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * i.e. its busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using the specific capacity.
+ *
+ * capacity = capacity_orig * curr_freq/max_freq
+ *
+ * norm_usage = running_time/time ~ usage/capacity
+ */
+static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
+{
+ int usage = __get_cpu_usage(cpu, delta);
+
+ if (usage >= capacity)
+ return SCHED_CAPACITY_SCALE;
+
+ return (usage << SCHED_CAPACITY_SHIFT)/capacity;
+}
+
+static int calc_usage_delta(struct energy_env *eenv, int cpu)
+{
+ if (cpu == eenv->src_cpu)
+ return -eenv->usage_delta;
+ if (cpu == eenv->dst_cpu)
+ return eenv->usage_delta;
+ return 0;
+}
+
+static
+unsigned long group_max_usage(struct energy_env *eenv)
+{
+ int i, delta;
+ unsigned long max_usage = 0;
+
+ for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
+ delta = calc_usage_delta(eenv, i);
+ max_usage = max(max_usage, __get_cpu_usage(i, delta));
+ }
+
+ return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to its
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations.
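+ * For example (illustrative numbers): a group of two cpus whose
+ * normalized per-cpu usages are 256 and 256 is estimated as 512 busy
+ * out of SCHED_LOAD_SCALE.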
+ * Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus i in the group. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned
+long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)
+{
+ int i, delta;
+ unsigned long usage_sum = 0;
+ unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap;
+
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ usage_sum += __cpu_norm_usage(i, capacity, delta);
+ }
+
+ if (usage_sum > SCHED_CAPACITY_SCALE)
+ return SCHED_CAPACITY_SCALE;
+ return usage_sum;
+}
+
+static int find_new_capacity(struct energy_env *eenv,
+ struct sched_group_energy *sge)
+{
+ int idx;
+ unsigned long util = group_max_usage(eenv);
+
+ /* Pick the first capacity state able to hold the group's usage. */
+ for (idx = 0; idx < sge->nr_cap_states; idx++) {
+ if (sge->cap_states[idx].cap >= util)
+ break;
+ }
+
+ eenv->cap_idx = idx;
+
+ return idx;
+}
+
+static bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) <
+ (get_cpu_usage(cpu) * capacity_margin);
+}
+
+static int group_idle_state(struct sched_group *sg)
+{
+ int i, state = INT_MAX;
+
+ /* Find the shallowest idle state in the sched group. */
+ for_each_cpu(i, sched_group_cpus(sg))
+ state = min(state, idle_get_state_idx(cpu_rq(i)));
+
+ /* Transform system into sched domain idle state. */
+ if (sg->sge->nr_idle_states_below > 1)
+ state -= sg->sge->nr_idle_states_below - 1;
+
+ /* Clamp state to the range of sched domain idle states. */
+ return clamp_t(int, state, 0, sg->sge->nr_idle_states - 1);
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working its way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct energy_env *eenv)
+{
+ struct sched_domain *sd;
+ int cpu, total_energy = 0;
+ struct cpumask visit_cpus;
+ struct sched_group *sg;
+
+ WARN_ON(!eenv->sg_top->sge);
+
+ cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
+
+ while (!cpumask_empty(&visit_cpus)) {
+ struct sched_group *sg_shared_cap = NULL;
+
+ cpu = cpumask_first(&visit_cpus);
+
+ /*
+ * Is the group utilization affected by cpus outside this
+ * sched_group?
+ */
+ sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+ if (sd && sd->parent)
+ sg_shared_cap = sd->parent->groups;
+
+ for_each_domain(cpu, sd) {
+ sg = sd->groups;
+
+ /* Has this sched_domain already been visited? */
+ if (sd->child && group_first_cpu(sg) != cpu)
+ break;
+
+ do {
+ unsigned long group_util;
+ int sg_busy_energy, sg_idle_energy;
+ int cap_idx, idle_idx;
+
+ if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+ eenv->sg_cap = sg_shared_cap;
+ else
+ eenv->sg_cap = sg;
+
+ cap_idx = find_new_capacity(eenv, sg->sge);
+
+ if (sg->group_weight == 1) {
+ /* Remove capacity of src CPU (before task move) */
+ if (eenv->usage_delta == 0 &&
+ cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) {
+ eenv->cap.before = sg->sge->cap_states[cap_idx].cap;
+ eenv->cap.delta -= eenv->cap.before;
+ }
+ /* Add capacity of dst CPU (after task move) */
+ if (eenv->usage_delta != 0 &&
+ cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) {
+ eenv->cap.after = sg->sge->cap_states[cap_idx].cap;
+ eenv->cap.delta += eenv->cap.after;
+ }
+ }
+
+ idle_idx = group_idle_state(sg);
+ group_util = group_norm_usage(eenv, sg);
+ sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
+ * sg->sge->idle_states[idle_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
+
+ total_energy += sg_busy_energy + sg_idle_energy;
+
+ if (!sd->child)
+ cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
+ goto next_cpu;
+
+ } while (sg = sg->next, sg != sd->groups);
+ }
+next_cpu:
+ continue;
+ }
+
+ eenv->energy = total_energy;
+ return total_energy;
+}
+
+#ifdef CONFIG_SCHED_TUNE
+static int energy_diff_evaluate(struct energy_env *eenv)
+{
+ unsigned int boost;
+ int nrg_delta;
+
+ /* Return energy diff when boost margin is 0 */
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ boost = schedtune_taskgroup_boost(eenv->task);
+#else
+ boost = get_sysctl_sched_cfs_boost();
+#endif
+ if (boost == 0)
+ return eenv->nrg.diff;
+
+ /* Compute normalized energy diff */
+ nrg_delta = schedtune_normalize_energy(eenv->nrg.diff);
+ eenv->nrg.delta = nrg_delta;
+
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ eenv->energy_payoff = schedtune_accept_deltas(
+ eenv->nrg.delta,
+ eenv->cap.delta,
+ eenv->task);
+#else
+ eenv->energy_payoff = schedtune_accept_deltas(
+ eenv->nrg.delta,
+ eenv->cap.delta);
+#endif
+
+ /*
+ * When SchedTune is enabled, the energy_diff() function will return
+ * the computed energy payoff value. Since the energy_diff() return
+ * value is expected to be negative by its callers, this evaluation
+ * function returns a negative value each time the evaluation returns
+ * a positive energy payoff, which is the condition for accepting a
+ * scheduling decision.
+ */
+ return -eenv->energy_payoff;
+}
+#else /* CONFIG_SCHED_TUNE */
+#define energy_diff_evaluate(eenv) ((eenv)->nrg.diff)
+#endif
+
+/*
+ * energy_diff(): Estimate the energy impact of changing the utilization
+ * distribution. eenv specifies the change: utilization amount, source, and
+ * destination cpu. Source or destination cpu may be -1 in which case the
+ * utilization is removed from or added to the system (e.g. task wake-up). If
+ * both are specified, the utilization is migrated.
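+ *
+ * For example (values invented for this note): migrating a task with
+ * usage_delta == 200 from cpu 0 to cpu 4 computes nrg.before from
+ * eenv_before (no delta applied) and nrg.after with -200 applied at
+ * cpu 0 and +200 at cpu 4; a negative nrg.diff then means the move is
+ * expected to save energy.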
+ */
+static int energy_diff(struct energy_env *eenv)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+ int sd_cpu = -1, energy_before = 0, energy_after = 0;
+ int result;
+
+ struct energy_env eenv_before = {
+ .usage_delta = 0,
+ .src_cpu = eenv->src_cpu,
+ .dst_cpu = eenv->dst_cpu,
+ .nrg = { 0, 0, 0, 0 },
+ .cap = { 0, 0, 0 },
+ };
+
+ if (eenv->src_cpu == eenv->dst_cpu)
+ return 0;
+
+ sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+ sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+
+ if (!sd)
+ return 0; /* Error */
+
+ sg = sd->groups;
+ do {
+ if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+
+ /* Keep track of SRC cpu (before) capacity */
+ eenv->cap.before = eenv_before.cap.before;
+ eenv->cap.delta = eenv_before.cap.delta;
+
+ energy_after += sched_group_energy(eenv);
+
+ /* src_cpu and dst_cpu may belong to the same group */
+ continue;
+ }
+
+ if (eenv->dst_cpu != -1 && cpumask_test_cpu(eenv->dst_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ eenv->nrg.before = energy_before;
+ eenv->nrg.after = energy_after;
+ eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
+ eenv->energy_payoff = 0;
+
+ result = energy_diff_evaluate(eenv);
+ trace_sched_energy_diff(eenv->task,
+ eenv->src_cpu, eenv->dst_cpu, eenv->usage_delta,
+ eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
+ eenv->cap.before, eenv->cap.after, eenv->cap.delta,
+ eenv->nrg.delta, eenv->energy_payoff);
+
+ return result;
+}
+
 static int wake_wide(struct task_struct *p)
 {
 int factor = this_cpu_read(sd_llc_size);
@@ -4376,6 +4943,174 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 return 1;
 }
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+ return p->se.avg.utilization_avg_contrib;
+}
+
+#ifdef CONFIG_SCHED_TUNE
+
+static unsigned long
+schedtune_margin(unsigned long signal, unsigned long boost)
+{
+ unsigned long long margin = 0;
+
+ /*
+ * Signal proportional compensation (SPC)
+ *
+ * The Boost (B) value is used to compute a Margin (M) which is
+ * proportional to the complement of the original Signal (S):
+ * M = B * (1024-S)
+ * The obtained M could be used by the caller to "boost" S.
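+ *
+ * For example, with S == 700 and B == 50 (percent) the margin is
+ * M = 50 * (1024 - 700) / 100 = 162, so the boosted signal becomes
+ * 862. The reciprocal division below computes the same value:
+ *
+ * m = (1024 - 700) * 50; -> 16200
+ * m = (m * 1311) >> 17; -> 162 (~= 16200 / 100)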
+ */
+ margin = SCHED_LOAD_SCALE - signal;
+ margin *= boost;
+
+ /*
+ * Fast integer division by constant:
+ * Constant : (C) = 100
+ * Precision : 0.1% (P) = 0.1
+ * Reference : C * 100 / P (R) = 100000
+ *
+ * Thus:
+ * Shift bits : ceil(log(R,2)) (S) = 17
+ * Mult const : round(2^S/C) (M) = 1311
+ */
+ margin *= 1311;
+ margin >>= 17;
+
+ return margin;
+}
+
+static unsigned long
+schedtune_task_margin(struct task_struct *task)
+{
+ unsigned int boost;
+ unsigned long utilization;
+ unsigned long margin;
+
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ boost = schedtune_taskgroup_boost(task);
+#else
+ boost = get_sysctl_sched_cfs_boost();
+#endif
+ if (boost == 0)
+ return 0;
+
+ utilization = task_utilization(task);
+ margin = schedtune_margin(utilization, boost);
+
+ return margin;
+}
+
+static unsigned long
+boosted_task_utilization(struct task_struct *task)
+{
+ unsigned long utilization;
+ unsigned long margin = 0;
+
+ utilization = task_utilization(task);
+
+ /*
+ * Boosting of task utilization is enabled only while the system is
+ * not over-utilized, i.e. while the scheduler is working in
+ * energy-aware mode.
+ */
+ if (!task_rq(task)->rd->overutilized)
+ margin = schedtune_task_margin(task);
+
+ trace_sched_boost_task(task, utilization, margin);
+
+ utilization += margin;
+
+ return utilization;
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static unsigned long
+boosted_task_utilization(struct task_struct *task)
+{
+ return task_utilization(task);
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
+static inline bool __task_fits(struct task_struct *p, int cpu, int usage)
+{
+ unsigned long capacity = capacity_of(cpu);
+
+ usage += boosted_task_utilization(p);
+
+ return (capacity * 1024) > (usage * capacity_margin);
+}
+
+static inline bool task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long capacity = capacity_of(cpu);
+ unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+ if (capacity == max_capacity)
+ return true;
+
+ if (capacity * capacity_margin > max_capacity * 1024)
+ return true;
+
+ return __task_fits(p, cpu, 0);
+}
+
+static inline bool task_fits_cpu(struct task_struct *p, int cpu)
+{
+ return __task_fits(p, cpu, get_cpu_usage(cpu));
+}
+
+#ifdef CONFIG_SCHED_TUNE
+
+static inline unsigned int
+schedtune_cpu_margin(int cpu, unsigned long usage)
+{
+ unsigned int boost;
+ unsigned long margin;
+
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ boost = schedtune_cpu_boost(cpu);
+#else
+ boost = get_sysctl_sched_cfs_boost();
+#endif
+ if (boost == 0)
+ return 0;
+ margin = schedtune_margin(usage, boost);
+
+ return margin;
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static inline unsigned int
+schedtune_cpu_margin(int cpu, unsigned long usage)
+{
+ return 0;
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
+static inline unsigned long
+get_boosted_cpu_usage(int cpu)
+{
+ unsigned long usage;
+ unsigned long margin;
+
+ usage = get_cpu_usage(cpu);
+ margin = schedtune_cpu_margin(cpu, usage);
+
+ trace_sched_boost_cpu(cpu, usage, margin);
+
+ usage += margin;
+ return usage;
+}
+
 /*
 * find_idlest_group finds and returns the least busy CPU group within the
 * domain.
@@ -4385,7 +5120,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 int this_cpu, int sd_flag)
 {
 struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *fit_group = NULL, *spare_group = NULL;
 unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long fit_capacity = ULONG_MAX;
+ unsigned long max_spare_capacity = capacity_margin - SCHED_LOAD_SCALE;
 int load_idx = sd->forkexec_idx;
 int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -4393,7 +5131,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 load_idx = sd->wake_idx;
 do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, spare_capacity;
 int local_group;
 int i;
@@ -4416,6 +5154,26 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 load = target_load(i, load_idx);
 avg_load += load;
+
+ /*
+ * Look for the most energy-efficient group that can
+ * fit the task.
+ */
+ if (energy_aware() && capacity_of(i) < fit_capacity &&
+ task_fits_cpu(p, i)) {
+ fit_capacity = capacity_of(i);
+ fit_group = group;
+ }
+
+ /*
+ * Look for the group which has the most spare capacity
+ * on a single cpu.
+ */
+ spare_capacity = capacity_of(i) - get_cpu_usage(i);
+ if (spare_capacity > max_spare_capacity) {
+ max_spare_capacity = spare_capacity;
+ spare_group = group;
+ }
 }
 /* Adjust by relative CPU capacity of the group */
@@ -4429,6 +5187,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 }
 } while (group = group->next, group != sd->groups);
+ if (fit_group)
+ return fit_group;
+
+ if (spare_group)
+ return spare_group;
+
 if (!idlest || 100*this_load < imbalance*min_load)
 return NULL;
 return idlest;
@@ -4449,7 +5213,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /* Traverse only the allowed CPUs */
 for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
- if (idle_cpu(i)) {
+ if (task_fits_cpu(p, i)) {
 struct rq *rq = cpu_rq(i);
 struct cpuidle_state *idle = idle_get_state(rq);
 if (idle && idle->exit_latency < min_exit_latency) {
@@ -4461,7 +5225,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 min_exit_latency = idle->exit_latency;
 latest_idle_timestamp = rq->idle_stamp;
 shallowest_idle_cpu = i;
- } else if ((!idle || idle->exit_latency == min_exit_latency) &&
+ } else if (idle_cpu(i) &&
+ (!idle || idle->exit_latency == min_exit_latency) &&
 rq->idle_stamp > latest_idle_timestamp) {
 /*
 * If equal or no active idle state, then
@@ -4470,6 +5235,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 */
 latest_idle_timestamp = rq->idle_stamp;
 shallowest_idle_cpu = i;
+ } else if (shallowest_idle_cpu == -1) {
+ /*
+ * If we haven't found an idle CPU yet
+ * pick a non-idle one that can fit the task as
+ * fallback.
+ */
+ shallowest_idle_cpu = i;
 }
 } else {
 load = weighted_cpuload(i);
@@ -4528,6 +5300,87 @@ done:
 return target;
 }
+static int energy_aware_wake_cpu(struct task_struct *p, int target)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg, *sg_target;
+ int target_max_cap = INT_MAX;
+ int target_cpu = task_cpu(p);
+ int i;
+
+ sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+ if (!sd)
+ return target;
+
+ sg = sd->groups;
+ sg_target = sg;
+
+ /*
+ * Find group with sufficient capacity. We only get here if no cpu is
+ * overutilized. We may end up overutilizing a cpu by adding the task,
+ * but that should not be any worse than select_idle_sibling().
+ * load_balance() should sort it out later as we get above the tipping
+ * point.
+ */
+ do {
+ /* Assuming all cpus are the same in group */
+ int max_cap_cpu = group_first_cpu(sg);
+
+ /*
+ * Assume smaller max capacity means more energy-efficient.
+ * Ideally we should query the energy model for the right
+ * answer but it easily ends up in an exhaustive search.
+ */
+ if (capacity_of(max_cap_cpu) < target_max_cap &&
+ task_fits_capacity(p, max_cap_cpu)) {
+ sg_target = sg;
+ target_max_cap = capacity_of(max_cap_cpu);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* Find cpu with sufficient capacity */
+ for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+ /*
+ * p's blocked utilization is still accounted for on prev_cpu
+ * so prev_cpu will receive a negative bias due to the double
+ * accounting. However, the blocked utilization may be zero.
+ */
+ int new_usage = get_cpu_usage(i) + boosted_task_utilization(p);
+
+ if (new_usage > capacity_orig_of(i))
+ continue;
+
+ if (new_usage < capacity_curr_of(i)) {
+ target_cpu = i;
+ if (cpu_rq(i)->nr_running)
+ break;
+ }
+
+ /* cpu has capacity at higher OPP, keep it as fallback */
+ if (target_cpu == task_cpu(p))
+ target_cpu = i;
+ }
+
+ if (target_cpu != task_cpu(p)) {
+ struct energy_env eenv = {
+ .usage_delta = task_utilization(p),
+ .src_cpu = task_cpu(p),
+ .dst_cpu = target_cpu,
+ .task = p,
+ };
+
+ /* Not enough spare capacity on previous cpu */
+ if (cpu_overutilized(task_cpu(p)))
+ return target_cpu;
+
+ if (energy_diff(&eenv) >= 0)
+ return task_cpu(p);
+ }
+
+ return target_cpu;
+}
+
 /*
 * select_task_rq_fair: Select target runqueue for the waking task in domains
 * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -4547,12 +5400,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 int cpu = smp_processor_id();
 int new_cpu = cpu;
 int want_affine = 0;
+ int want_sibling = true;
 int sync = wake_flags & WF_SYNC;
 if (p->nr_cpus_allowed == 1)
 return prev_cpu;
- if (sd_flag & SD_BALANCE_WAKE)
+ /* Check if prev_cpu can fit us ignoring its current usage */
+ if (energy_aware() && !task_fits_capacity(p, prev_cpu))
+ want_sibling = false;
+
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling)
 want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 rcu_read_lock();
@@ -4577,8 +5435,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 prev_cpu = cpu;
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
+ if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+ new_cpu = energy_aware_wake_cpu(p, prev_cpu);
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu);
 goto unlock;
 }
@@ -4644,6 +5505,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 se->avg.decay_count = -__synchronize_entity_decay(se);
 atomic_long_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
 }
 /* We have migrated, no longer consider this task hot */
@@ -4892,6 +5755,8 @@ again:
 if (hrtick_enabled(rq))
 hrtick_start_fair(rq, p);
+ rq->misfit_task = !task_fits_capacity(p, rq->cpu);
+
 return p;
simple:
 cfs_rq = &rq->cfs;
@@ -4913,9 +5778,13 @@ simple:
 if (hrtick_enabled(rq))
 hrtick_start_fair(rq, p);
+ rq->misfit_task = !task_fits_capacity(p, rq->cpu);
+
 return p;
idle:
+ rq->misfit_task = 0;
+
 new_tasks = idle_balance(rq);
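+
+ /*
+ * The misfit_task flag set on the paths above is consumed on the
+ * load-balance side: update_sg_lb_stats() folds it into
+ * group_misfit_task so that a big task stuck on a lower-capacity cpu
+ * can be migrated even when the average load looks balanced (see
+ * calculate_imbalance()).
+ */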
/* * Because idle_balance() releases (and re-acquires) rq->lock, it is @@ -5120,6 +5989,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; +enum group_type { + group_other = 0, + group_misfit_task, + group_imbalanced, + group_overloaded, +}; + #define LBF_ALL_PINNED 0x01 #define LBF_NEED_BREAK 0x02 #define LBF_DST_PINNED 0x04 @@ -5138,6 +6014,7 @@ struct lb_env { int new_dst_cpu; enum cpu_idle_type idle; long imbalance; + unsigned int src_grp_nr_running; /* The set of CPUs under consideration for load-balancing */ struct cpumask *cpus; @@ -5148,6 +6025,7 @@ struct lb_env { unsigned int loop_max; enum fbq_type fbq_type; + enum group_type busiest_group_type; struct list_head tasks; }; @@ -5641,12 +6519,6 @@ static unsigned long task_h_load(struct task_struct *p) /********** Helpers for find_busiest_group ************************/ -enum group_type { - group_other = 0, - group_imbalanced, - group_overloaded, -}; - /* * sg_lb_stats - stats of a sched_group required for load_balancing */ @@ -5656,12 +6528,13 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_has_free_capacity; + int group_no_capacity; + int group_misfit_task; /* A cpu has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5732,33 +6605,10 @@ static inline int get_sd_load_idx(struct sched_domain *sd, return load_idx; } -static unsigned long default_scale_capacity(struct sched_domain *sd, int cpu) -{ - return SCHED_CAPACITY_SCALE; -} - -unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu) -{ - return default_scale_capacity(sd, cpu); -} - -static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu) -{ - if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) - return sd->smt_gain / sd->span_weight; - - return SCHED_CAPACITY_SCALE; -} - -unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) -{ - return default_scale_cpu_capacity(sd, cpu); -} - static unsigned long scale_rt_capacity(int cpu) { struct rq *rq = cpu_rq(cpu); - u64 total, available, age_stamp, avg; + u64 total, used, age_stamp, avg; s64 delta; /* @@ -5774,41 +6624,20 @@ static unsigned long scale_rt_capacity(int cpu) total = sched_avg_period() + delta; - if (unlikely(total < avg)) { - /* Ensures that capacity won't end up being negative */ - available = 0; - } else { - available = total - avg; - } - - if (unlikely((s64)total < SCHED_CAPACITY_SCALE)) - total = SCHED_CAPACITY_SCALE; + used = div_u64(avg, total); - total >>= SCHED_CAPACITY_SHIFT; + if (likely(used < SCHED_CAPACITY_SCALE)) + return SCHED_CAPACITY_SCALE - used; - return div_u64(available, total); + return 1; } static void update_cpu_capacity(struct sched_domain *sd, int cpu) { - unsigned long capacity = SCHED_CAPACITY_SCALE; + unsigned long capacity = arch_scale_cpu_capacity(sd, cpu); struct sched_group *sdg = sd->groups; - if (sched_feat(ARCH_CAPACITY)) - capacity *= arch_scale_cpu_capacity(sd, cpu); - else - capacity *= default_scale_cpu_capacity(sd, cpu); - - capacity >>= SCHED_CAPACITY_SHIFT; - - sdg->sgc->capacity_orig = 
capacity;
-
- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_freq_capacity(sd, cpu);
- else
- capacity *= default_scale_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
 capacity *= scale_rt_capacity(cpu);
 capacity >>= SCHED_CAPACITY_SHIFT;
@@ -5818,13 +6647,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 cpu_rq(cpu)->cpu_capacity = capacity;
 sdg->sgc->capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
 }
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 struct sched_domain *child = sd->child;
 struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, capacity_orig;
+ unsigned long capacity, max_capacity;
 unsigned long interval;
 interval = msecs_to_jiffies(sd->balance_interval);
@@ -5836,7 +6666,8 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 return;
 }
- capacity_orig = capacity = 0;
+ capacity = 0;
+ max_capacity = 0;
 if (child->flags & SD_OVERLAP) {
 /*
@@ -5856,20 +6687,17 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 * Use capacity_of(), which is set irrespective of domains
 * in update_cpu_capacity().
 *
- * This avoids capacity/capacity_orig from being 0 and
+ * This avoids capacity from being 0 and
 * causing divide-by-zero issues on boot.
- *
- * Runtime updates will correct capacity_orig.
 */
 if (unlikely(!rq->sd)) {
- capacity_orig += capacity_of(cpu);
 capacity += capacity_of(cpu);
- continue;
+ } else {
+ sgc = rq->sd->groups->sgc;
+ capacity += sgc->capacity;
 }
- sgc = rq->sd->groups->sgc;
- capacity_orig += sgc->capacity_orig;
- capacity += sgc->capacity;
+ max_capacity = max(capacity, max_capacity);
 }
 } else {
 /*
@@ -5879,39 +6707,28 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 group = child->groups;
 do {
- capacity_orig += group->sgc->capacity_orig;
- capacity += group->sgc->capacity;
+ struct sched_group_capacity *sgc = group->sgc;
+
+ capacity += sgc->capacity;
+ max_capacity = max(sgc->max_capacity, max_capacity);
 group = group->next;
 } while (group != child->groups);
 }
- sdg->sgc->capacity_orig = capacity_orig;
 sdg->sgc->capacity = capacity;
+ sdg->sgc->max_capacity = max_capacity;
 }
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true if the capacity is reduced.
 */
 static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 {
- /*
- * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
- */
- if (!(sd->flags & SD_SHARE_CPUCAPACITY))
- return 0;
-
- /*
- * If ~90% of the cpu_capacity is still there, we're good.
- */
- if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
- return 1;
-
- return 0;
+ return ((rq->cpu_capacity * sd->imbalance_pct) <
+ (rq->cpu_capacity_orig * 100));
 }
 /*
@@ -5949,42 +6766,76 @@ static inline int sg_imbalanced(struct sched_group *group)
 }
 /*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks.
+ * We consider that a group has spare capacity if the number of tasks is
+ * smaller than the number of CPUs or if the usage is lower than the available
+ * capacity for CFS tasks.
+ * For the latter, we use a threshold to stabilize the state, to take into
+ * account the variance of the tasks' load and to return true if the available
+ * capacity is meaningful for the load balancer.
+ * As an example, an available capacity of 1% can appear but it doesn't bring
+ * any benefit to the load balancer.
+ */
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running < sgs->group_weight)
+ return true;
+
+ if ((sgs->group_capacity * 100) >
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
+
+ return false;
+}
+
+/*
+ * group_is_overloaded returns true if the group has more tasks than it can
+ * handle.
+ * group_is_overloaded is not equal to !group_has_capacity because a group
+ * with exactly the right number of tasks has no spare capacity left but is
+ * not overloaded, so both group_has_capacity and group_is_overloaded return
+ * false.
 */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
- unsigned int capacity_factor, smt, cpus;
- unsigned int capacity, capacity_orig;
+ if (sgs->sum_nr_running <= sgs->group_weight)
+ return false;
- capacity = group->sgc->capacity;
- capacity_orig = group->sgc->capacity_orig;
- cpus = group->group_weight;
+ if ((sgs->group_capacity * 100) <
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
- capacity_factor = cpus / smt; /* cores */
+ return false;
+}
- capacity_factor = min_t(unsigned,
- capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
- return capacity_factor;
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity + capacity_margin - SCHED_LOAD_SCALE <
+ ref->sgc->max_capacity;
 }
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+ struct sched_group *group,
+ struct sg_lb_stats *sgs)
 {
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_no_capacity)
 return group_overloaded;
 if (sg_imbalanced(group))
 return group_imbalanced;
+ if (sgs->group_misfit_task)
+ return group_misfit_task;
+
 return group_other;
 }
@@ -5996,11 +6847,12 @@ group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
 * @local_group: Does group contain this_cpu.
 * @sgs: variable to hold the statistics for this group.
 * @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
*/
 static inline void update_sg_lb_stats(struct lb_env *env,
 struct sched_group *group, int load_idx,
 int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *overload, bool *overutilized)
 {
 unsigned long load;
 int i;
@@ -6017,6 +6869,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 load = source_load(i, load_idx);
 sgs->group_load += load;
+ sgs->group_usage += get_cpu_usage(i);
 sgs->sum_nr_running += rq->cfs.h_nr_running;
 if (rq->nr_running > 1)
@@ -6029,6 +6882,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 sgs->sum_weighted_load += weighted_cpuload(i);
 if (idle_cpu(i))
 sgs->idle_cpus++;
+
+ if (cpu_overutilized(i)) {
+ *overutilized = true;
+ if (!sgs->group_misfit_task && rq->misfit_task)
+ sgs->group_misfit_task = capacity_of(i);
+ }
 }
 /* Adjust by relative CPU capacity of the group */
@@ -6039,11 +6898,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 sgs->group_weight = group->group_weight;
- sgs->group_capacity_factor = sg_capacity_factor(env, group);
- sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running)
- sgs->group_has_free_capacity = 1;
+ sgs->group_no_capacity = group_is_overloaded(env, sgs);
+ sgs->group_type = group_classify(env, group, sgs);
 }
 /**
@@ -6072,9 +6929,25 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 if (sgs->group_type < busiest->group_type)
 return false;
+ /*
+ * Candidate sg doesn't face any serious load-balance problems
+ * so don't pick it if the local sg is already filled up.
+ */
+ if (sgs->group_type == group_other &&
+ !group_has_capacity(env, &sds->local_stat))
+ return false;
+
 if (sgs->avg_load <= busiest->avg_load)
 return false;
+ /*
+ * Candidate sg has no more than one task per cpu and has higher
+ * per-cpu capacity. No reason to pull tasks to less capable cpus.
+ */
+ if (sgs->sum_nr_running <= sgs->group_weight &&
+ group_smaller_cpu_capacity(sds->local, sg))
+ return false;
+
 /* This is the busiest node in its class. */
 if (!(env->sd->flags & SD_ASYM_PACKING))
 return true;
@@ -6136,7 +7009,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 struct sched_group *sg = env->sd->groups;
 struct sg_lb_stats tmp_sgs;
 int load_idx, prefer_sibling = 0;
- bool overload = false;
+ bool overload = false, overutilized = false;
 if (child && child->flags & SD_PREFER_SIBLING)
 prefer_sibling = 1;
@@ -6158,24 +7031,36 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &overload, &overutilized);
 if (local_group)
 goto next_group;
 /*
 * In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity factor to one so that we'll try
+ * first, lower the sg capacity so that we'll try
 * and move all the excess tasks away. We lower the capacity
 * of a group only if the local group has the capacity to fit
- * these excess tasks, i.e. nr_running < group_capacity_factor. The
- * extra check prevents the case where you always pull from the
- * heaviest group when it is already under-utilized (possible
- * with a large weight task outweighs the tasks on the system).
+ * these excess tasks. The extra check prevents the case where
+ * you always pull from the heaviest group when it is already
+ * under-utilized (possible when a large weight task outweighs
+ * the tasks on the system).
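+ *
+ * For example, with SD_PREFER_SIBLING set, a remote group running two
+ * tasks while the local group still has spare capacity is forcibly
+ * classified group_overloaded below, so that one of its tasks can be
+ * pulled towards the local group.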
*/ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + group_has_capacity(env, &sds->local_stat) && + (sgs->sum_nr_running > 1)) { + sgs->group_no_capacity = 1; + sgs->group_type = group_overloaded; + } + + /* + * Ignore task groups with misfit tasks if local group has no + * capacity or if per-cpu capacity isn't higher. + */ + if (sgs->group_type == group_misfit_task && + (!group_has_capacity(env, &sds->local_stat) || + !group_smaller_cpu_capacity(sg, sds->local))) + sgs->group_type = group_other; if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6193,12 +7078,20 @@ next_group: if (env->sd->flags & SD_NUMA) env->fbq_type = fbq_classify_group(&sds->busiest_stat); + env->src_grp_nr_running = sds->busiest_stat.sum_nr_running; + if (!env->sd->parent) { /* update overload indicator if we are at root domain */ if (env->dst_rq->rd->overload != overload) env->dst_rq->rd->overload = overload; - } + /* Update over-utilization (tipping point, U >= 0) indicator */ + if (env->dst_rq->rd->overutilized != overutilized) + env->dst_rq->rd->overutilized = overutilized; + } else { + if (!env->dst_rq->rd->overutilized && overutilized) + env->dst_rq->rd->overutilized = true; + } } /** @@ -6345,6 +7238,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load) { + /* Misfitting tasks should be migrated in any case */ + if (busiest->group_type == group_misfit_task) { + env->imbalance = busiest->group_misfit_task; + return; + } + + /* + * Busiest group is overloaded, local is not, use the spare + * cycles to maximize throughput + */ + if (busiest->group_type == group_overloaded && + local->group_type <= group_misfit_task) { + env->imbalance = busiest->load_per_task; + return; + } + env->imbalance = 0; return fix_small_imbalance(env, sds); } @@ -6354,11 +7263,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity_factor); - - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE); - load_above_capacity /= busiest->group_capacity; + load_above_capacity = busiest->sum_nr_running * + SCHED_LOAD_SCALE; + if (load_above_capacity > busiest->group_capacity) + load_above_capacity -= busiest->group_capacity; + else + load_above_capacity = ~0UL; } /* @@ -6377,6 +7287,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; + /* Boost imbalance to allow misfit task to be balanced. */ + if (busiest->group_type == group_misfit_task) + env->imbalance = max_t(long, env->imbalance, + busiest->group_misfit_task); + /* * if *imbalance is less than the average load per runnable task * there is no guarantee that any tasks will be moved so we'll have @@ -6418,9 +7333,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * this level. 
*/ update_sd_lb_stats(env, &sds); + + if (energy_aware() && !env->dst_rq->rd->overutilized) + goto out_balanced; + local = &sds.local_stat; busiest = &sds.busiest_stat; + /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6441,10 +7361,15 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance; /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity && - !busiest->group_has_free_capacity) + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) && + busiest->group_no_capacity) goto force_balance; + /* Misfitting tasks should be dealt with regardless of the avg load */ + if (busiest->group_type == group_misfit_task) { + goto force_balance; + } + /* * If the local group is busier than the selected busiest group * don't try and pull any tasks. @@ -6468,7 +7393,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * might end up to just move the imbalance on another group */ if ((busiest->group_type != group_overloaded) && - (local->idle_cpus <= (busiest->idle_cpus + 1))) + (local->idle_cpus <= (busiest->idle_cpus + 1)) && + !group_smaller_cpu_capacity(sds.busiest, sds.local)) goto out_balanced; } else { /* @@ -6481,6 +7407,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) } force_balance: + env->busiest_group_type = busiest->group_type; /* Looks like there is an imbalance. Compute it */ calculate_imbalance(env, &sds); return sds.busiest; @@ -6501,7 +7428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i; for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long capacity, capacity_factor, wl; + unsigned long capacity, wl; enum fbq_type rt; rq = cpu_rq(i); @@ -6530,9 +7457,6 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue; capacity = capacity_of(i); - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group); wl = weighted_cpuload(i); @@ -6540,7 +7464,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) + + if (rq->nr_running == 1 && wl > env->imbalance && + !check_cpu_capacity(rq, env->sd) && + env->busiest_group_type != group_misfit_task) continue; /* @@ -6588,6 +7515,26 @@ static int need_active_balance(struct lb_env *env) return 1; } + /* + * The dst_cpu is idle and the src_cpu has only 1 CFS task. + * It's worth migrating the task if the src_cpu's capacity is reduced + * because of other sched_class activity or IRQs while more capacity + * stays available on dst_cpu. 
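+ * For example, with an imbalance_pct of 117 (a typical value at the + * lower domain levels), the migration below is triggered when + * capacity_of(src_cpu) * 117 < capacity_of(dst_cpu) * 100, i.e. when + * src currently offers less than ~85% of dst's capacity.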
+ */ + if ((env->idle != CPU_NOT_IDLE) && + (env->src_rq->cfs.h_nr_running == 1)) { + if ((check_cpu_capacity(env->src_rq, sd)) && + (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100)) + return 1; + } + + if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) && + env->src_rq->cfs.h_nr_running == 1 && + cpu_overutilized(env->src_cpu) && + !cpu_overutilized(env->dst_cpu)) { + return 1; + } + return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); } @@ -6687,6 +7634,9 @@ redo: schedstat_add(sd, lb_imbalance[idle], env.imbalance); + env.src_cpu = busiest->cpu; + env.src_rq = busiest; + ld_moved = 0; if (busiest->nr_running > 1) { /* @@ -6696,8 +7646,6 @@ redo: * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.src_cpu = busiest->cpu; - env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running); more_balance: @@ -6708,6 +7656,21 @@ more_balance: * ld_moved - cumulative load moved across iterations */ cur_ld_moved = detach_tasks(&env); + /* + * We want to potentially update env.src_cpu's OPP. + * + * Add a margin (same ~20% used for the tipping point) + * to our request to provide some head room for the remaining + * tasks. + */ + if (sched_energy_freq() && cur_ld_moved) { + unsigned long req_cap = + get_boosted_cpu_usage(env.src_cpu); + + req_cap = req_cap * capacity_margin + >> SCHED_CAPACITY_SHIFT; + cpufreq_sched_set_cap(env.src_cpu, req_cap); + } /* * We've detached some tasks from busiest_rq. Every @@ -6722,6 +7685,21 @@ more_balance: if (cur_ld_moved) { attach_tasks(&env); ld_moved += cur_ld_moved; + /* + * We want to potentially update env.dst_cpu's OPP. + * + * Add a margin (same ~20% used for the tipping point) + * to our request to provide some head room if p's + * utilization further increases. + */ + if (sched_energy_freq()) { + unsigned long req_cap = + get_boosted_cpu_usage(env.dst_cpu); + + req_cap = req_cap * capacity_margin + >> SCHED_CAPACITY_SHIFT; + cpufreq_sched_set_cap(env.dst_cpu, req_cap); + } } local_irq_restore(flags); @@ -6799,7 +7777,8 @@ more_balance: * excessive cache_hot migrations and active balances. */ if (idle != CPU_NEWLY_IDLE) - sd->nr_balance_failed++; + if (env.src_grp_nr_running > 1) + sd->nr_balance_failed++; if (need_active_balance(&env)) { raw_spin_lock_irqsave(&busiest->lock, flags); @@ -6940,8 +7919,9 @@ static int idle_balance(struct rq *this_rq) */ this_rq->idle_stamp = rq_clock(this_rq); - if (this_rq->avg_idle < sysctl_sched_migration_cost || - !this_rq->rd->overload) { + if ((!energy_aware() && (this_rq->avg_idle < sysctl_sched_migration_cost + || !this_rq->rd->overload)) || + (energy_aware() && !this_rq->rd->overutilized)) { rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); if (sd) @@ -7079,8 +8059,24 @@ static int active_load_balance_cpu_stop(void *data) schedstat_inc(sd, alb_count); p = detach_one_task(&env); - if (p) + if (p) { schedstat_inc(sd, alb_pushed); + /* + * We want to potentially update env.src_cpu's OPP. + * + * Add a margin (same ~20% used for the tipping point) + * to our request to provide some head room for the + * remaining task. 
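+ * E.g., assuming capacity_margin = 1280 (the same ~20% margin used for + * the tipping point elsewhere in this series), the request below becomes + * usage * 1280 >> 10, i.e. the current usage scaled by 1.25.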
+ */ + if (sched_energy_freq()) { + unsigned long req_cap = + get_boosted_cpu_usage(env.src_cpu); + + req_cap = req_cap * capacity_margin + >> SCHED_CAPACITY_SHIFT; + cpufreq_sched_set_cap(env.src_cpu, req_cap); + } + } else schedstat_inc(sd, alb_failed); } @@ -7089,8 +8085,24 @@ out_unlock: busiest_rq->active_balance = 0; raw_spin_unlock(&busiest_rq->lock); - if (p) + if (p) { attach_one_task(target_rq, p); + /* + * We want to potentially update target_cpu's OPP. + * + * Add a margin (same ~20% used for the tipping point) + * to our request to provide some head room if p's utilization + * further increases. + */ + if (sched_energy_freq()) { + unsigned long req_cap = + get_boosted_cpu_usage(target_cpu); + + req_cap = req_cap * capacity_margin + >> SCHED_CAPACITY_SHIFT; + cpufreq_sched_set_cap(target_cpu, req_cap); + } + } local_irq_enable(); @@ -7397,22 +8409,25 @@ end: /* * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu is the system. + * of an idle cpu in the system. * - This rq has more than one task. - * - At any scheduler domain level, this cpu's scheduler group has multiple - * busy cpu's exceeding the group's capacity. + * - This rq has at least one CFS task and the capacity of the CPU is + * significantly reduced because of RT tasks or IRQs. + * - At parent of LLC scheduler domain level, this cpu's scheduler group has + * multiple busy cpu. * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ -static inline int nohz_kick_needed(struct rq *rq) +static inline bool nohz_kick_needed(struct rq *rq) { unsigned long now = jiffies; struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu; + bool kick = false; if (unlikely(rq->idle_balance)) - return 0; + return false; /* * We may be recently in ticked or tickless idle mode. At the first @@ -7426,38 +8441,47 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false; if (time_before(now, nohz.next_balance)) - return 0; + return false; - if (rq->nr_running >= 2) - goto need_kick; + if (rq->nr_running >= 2 && + (!energy_aware() || cpu_overutilized(cpu))) + return true; rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - - if (sd) { + if (sd && !energy_aware()) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus); - if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } + } - sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(rq->sd); + if (sd) { + if ((rq->cfs.h_nr_running >= 1) && + check_cpu_capacity(rq, sd)) { + kick = true; + goto unlock; + } + } + sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, - sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; - - rcu_read_unlock(); - return 0; + sched_domain_span(sd)) < cpu)) { + kick = true; + goto unlock; + } -need_kick_unlock: +unlock: rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { } @@ -7473,14 +8497,16 @@ static void run_rebalance_domains(struct softirq_action *h) enum cpu_idle_type idle = this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE; - rebalance_domains(this_rq, idle); - /* * If this cpu has a pending nohz_balance_kick, then do the * balancing on behalf of the other idle cpus whose ticks are - * stopped. + * stopped. 
Do nohz_idle_balance *before* rebalance_domains to + * give the idle cpus a chance to load balance. Else we may + * load balance only within the local sched_domain hierarchy + * and abort nohz_idle_balance altogether if we pull some load. */ nohz_idle_balance(this_rq, idle); + rebalance_domains(this_rq, idle); } /* @@ -7534,6 +8560,30 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) task_tick_numa(rq, curr); update_rq_runnable_avg(rq, 1); + + if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr))) + rq->rd->overutilized = true; + + rq->misfit_task = !task_fits_capacity(curr, rq->cpu); + + /* + * To make free room for a task that is building up its "real" + * utilization and to harm its performance the least, request a + * jump to max OPP as soon as get_cpu_usage() crosses the UP + * threshold. The UP threshold is built relative to the current + * capacity (OPP), by using same margin used to tell if a cpu + * is overutilized (capacity_margin). + */ + if (sched_energy_freq()) { + int cpu = cpu_of(rq); + unsigned long capacity_orig = capacity_orig_of(cpu); + unsigned long capacity_curr = capacity_curr_of(cpu); + + if (capacity_curr < capacity_orig && + (capacity_curr * SCHED_LOAD_SCALE) < + (get_cpu_usage(cpu) * capacity_margin)) + cpufreq_sched_set_cap(cpu, capacity_orig); + } } /* @@ -7640,6 +8690,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p) if (se->avg.decay_count) { __synchronize_entity_decay(se); subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); + subtract_utilization_blocked_contrib(cfs_rq, + se->avg.utilization_avg_contrib); } #endif } @@ -7699,6 +8751,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) #ifdef CONFIG_SMP atomic64_set(&cfs_rq->decay_counter, 1); atomic_long_set(&cfs_rq->removed_load, 0); + atomic_long_set(&cfs_rq->removed_utilization, 0); #endif } @@ -7751,6 +8804,8 @@ static void task_move_group_fair(struct task_struct *p, int queued) */ se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; + cfs_rq->utilization_blocked_avg += + se->avg.utilization_avg_contrib; #endif } } diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 90284d117fe6..6f6e6c67c546 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true) */ SCHED_FEAT(WAKEUP_PREEMPTION, true) -/* - * Use arch dependent cpu capacity functions - */ -SCHED_FEAT(ARCH_CAPACITY, true) - SCHED_FEAT(HRTICK, false) SCHED_FEAT(DOUBLE_TICK, false) SCHED_FEAT(LB_BIAS, true) @@ -83,3 +78,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) */ SCHED_FEAT(NUMA_RESIST_LOWER, false) #endif + +/* + * Energy aware scheduling. Use platform energy model to guide scheduling + * decisions optimizing for energy efficiency. + */ +SCHED_FEAT(ENERGY_AWARE, false) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c47fce75e666..e46c85c1df78 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -149,6 +149,7 @@ use_default: /* Take note of the planned idle state. */ idle_set_state(this_rq(), &drv->states[next_state]); + idle_set_state_idx(this_rq(), next_state); /* * Enter the idle state previously returned by the governor decision. @@ -159,6 +160,7 @@ use_default: /* The cpu is no longer idle or about to enter idle. 
*/ idle_set_state(this_rq(), NULL); + idle_set_state_idx(this_rq(), -1); if (broadcast) clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2df8ef067cc5..0b5d64781e37 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -343,11 +343,21 @@ struct cfs_rq { * Under CFS, load is tracked on a per-entity basis and aggregated up. * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case). + * runnable_load_avg is the sum of the load_avg_contrib of the + * sched_entities on the rq. + * blocked_load_avg is similar to runnable_load_avg except that its + * the blocked sched_entities on the rq. + * utilization_load_avg is the sum of the average running time of the + * sched_entities on the rq. + * utilization_blocked_avg is the utilization equivalent of + * blocked_load_avg, i.e. the sum of running contributions of blocked + * sched_entities associated with the rq. */ unsigned long runnable_load_avg, blocked_load_avg; + unsigned long utilization_load_avg, utilization_blocked_avg; atomic64_t decay_counter; u64 last_decay; - atomic_long_t removed_load; + atomic_long_t removed_load, removed_utilization; #ifdef CONFIG_FAIR_GROUP_SCHED /* Required to track per-cpu representation of a task_group */ @@ -488,6 +498,9 @@ struct root_domain { /* Indicate more than one runnable task for any CPU */ bool overload; + /* Indicate one or more cpus over-utilized (tipping point) */ + bool overutilized; + /* * The bit corresponding to a CPU gets set here if such CPU has more * than one runnable -deadline task (as it is below for RT tasks). @@ -503,6 +516,9 @@ struct root_domain { */ cpumask_var_t rto_mask; struct cpupri cpupri; + + /* Maximum cpu capacity in the system. */ + unsigned long max_cpu_capacity; }; extern struct root_domain def_root_domain; @@ -532,6 +548,7 @@ struct rq { #define CPU_LOAD_IDX_MAX 5 unsigned long cpu_load[CPU_LOAD_IDX_MAX]; unsigned long last_load_update_tick; + unsigned int misfit_task; #ifdef CONFIG_NO_HZ_COMMON u64 nohz_stamp; unsigned long nohz_flags; @@ -579,6 +596,7 @@ struct rq { struct sched_domain *sd; unsigned long cpu_capacity; + unsigned long cpu_capacity_orig; unsigned char idle_balance; /* For active balancing */ @@ -648,6 +666,7 @@ struct rq { #ifdef CONFIG_CPU_IDLE /* Must be inspected within a rcu lock section */ struct cpuidle_state *idle_state; + int idle_state_idx; #endif }; @@ -745,6 +764,7 @@ DECLARE_PER_CPU(int, sd_llc_id); DECLARE_PER_CPU(struct sched_domain *, sd_numa); DECLARE_PER_CPU(struct sched_domain *, sd_busy); DECLARE_PER_CPU(struct sched_domain *, sd_asym); +DECLARE_PER_CPU(struct sched_domain *, sd_ea); struct sched_group_capacity { atomic_t ref; @@ -752,7 +772,8 @@ struct sched_group_capacity { * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity * for a single CPU. */ - unsigned int capacity, capacity_orig; + unsigned long capacity; + unsigned long max_capacity; /* Max per-cpu capacity in group */ unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ /* @@ -769,6 +790,7 @@ struct sched_group { unsigned int group_weight; struct sched_group_capacity *sgc; + struct sched_group_energy *sge; /* * The CPUs this group covers. @@ -805,6 +827,39 @@ static inline unsigned int group_first_cpu(struct sched_group *group) extern int group_balance_cpu(struct sched_group *sg); +/* + * Check that the per-cpu provided sd energy data is consistent for all cpus + * within the mask. 
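+ * Any mismatch between two cpus in the mask (a different number of idle + * or capacity states, or different power/cap values for any state) is a + * BUG(): the energy tables must be identical across an energy domain.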
+ */ +static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn, + const struct cpumask *cpumask) +{ + struct cpumask mask; + int i; + + cpumask_xor(&mask, cpumask, get_cpu_mask(cpu)); + + for_each_cpu(i, &mask) { + int y; + + BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states); + + for (y = 0; y < (fn(i)->nr_idle_states); y++) { + BUG_ON(fn(i)->idle_states[y].power != + fn(cpu)->idle_states[y].power); + } + + BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states); + + for (y = 0; y < (fn(i)->nr_cap_states); y++) { + BUG_ON(fn(i)->cap_states[y].cap != + fn(cpu)->cap_states[y].cap); + BUG_ON(fn(i)->cap_states[y].power != + fn(cpu)->cap_states[y].power); + } + } +} + #else static inline void sched_ttwu_pending(void) { } @@ -1080,6 +1135,7 @@ static const u32 prio_to_wmult[40] = { #define ENQUEUE_WAKING 0 #endif #define ENQUEUE_REPLENISH 8 +#define ENQUEUE_WAKEUP_NEW 16 #define DEQUEUE_SLEEP 1 @@ -1186,6 +1242,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq) WARN_ON(!rcu_read_lock_held()); return rq->idle_state; } + +static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx) +{ + rq->idle_state_idx = idle_state_idx; +} + +static inline int idle_get_state_idx(struct rq *rq) +{ + WARN_ON(!rcu_read_lock_held()); + return rq->idle_state_idx; +} #else static inline void idle_set_state(struct rq *rq, struct cpuidle_state *idle_state) @@ -1196,6 +1263,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq) { return NULL; } + +static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx) +{ +} + +static inline int idle_get_state_idx(struct rq *rq) +{ + return -1; +} #endif extern void sysrq_sched_debug_show(void); @@ -1308,14 +1384,53 @@ static inline int hrtick_enabled(struct rq *rq) #ifdef CONFIG_SMP extern void sched_avg_update(struct rq *rq); + +#ifndef arch_scale_freq_capacity +static __always_inline +unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +{ + return SCHED_CAPACITY_SCALE; +} +#endif + +#ifndef arch_scale_cpu_capacity +static __always_inline +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{ + if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) + return sd->smt_gain / sd->span_weight; + + return SCHED_CAPACITY_SCALE; +} +#endif + +unsigned long capacity_orig_of(int cpu); + +extern struct static_key __sched_energy_freq; +static inline bool sched_energy_freq(void) +{ + return static_key_false(&__sched_energy_freq); +} + +#ifdef CONFIG_CPU_FREQ_GOV_SCHED +void cpufreq_sched_set_cap(int cpu, unsigned long util); +void cpufreq_sched_reset_cap(int cpu); +#else +static inline void cpufreq_sched_set_cap(int cpu, unsigned long util) +{ } +static inline void cpufreq_sched_reset_cap(int cpu) +{ } +#endif + static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { - rq->rt_avg += rt_delta; + rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq)); sched_avg_update(rq); } #else static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { } static inline void sched_avg_update(struct rq *rq) { } +static inline void gov_cfs_update_cpu(int cpu) {} #endif extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period); diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c new file mode 100644 index 000000000000..2ec7e63524a5 --- /dev/null +++ b/kernel/sched/tune.c @@ -0,0 +1,681 @@ +#include <linux/cgroup.h> +#include <linux/err.h> +#include <linux/kernel.h> +#include <linux/percpu.h> +#include 
<linux/printk.h> +#include <linux/rcupdate.h> +#include <linux/slab.h> + +#include <trace/events/sched.h> + +#include "sched.h" +#include "tune.h" + +unsigned int sysctl_sched_cfs_boost __read_mostly = 0; + +/* Performance Boost region (B) threshold params */ +static int perf_boost_idx; + +/* Performance Constraint region (C) threshold params */ +static int perf_constrain_idx; + +/** + * Performance-Energy (P-E) Space threshold constants + */ +struct threshold_params { + int nrg_gain; + int cap_gain; +}; + +/* + * System-specific P-E space threshold constants + */ +static struct threshold_params +threshold_gains[] = { + { 0, 4 }, /* >= 0% */ + { 1, 4 }, /* >= 10% */ + { 2, 4 }, /* >= 20% */ + { 3, 4 }, /* >= 30% */ + { 4, 4 }, /* >= 40% */ + { 4, 3 }, /* >= 50% */ + { 4, 2 }, /* >= 60% */ + { 4, 1 }, /* >= 70% */ + { 4, 0 }, /* >= 80% */ + { 4, 0 } /* >= 90% */ +}; + +/* + * System energy normalization constants + */ +struct target_nrg { + unsigned long min_power; + unsigned long max_power; + unsigned long nrg_shift; + unsigned long nrg_mult; +}; + +/* + * Target-specific system energy normalization constants + * NOTE: These values are specific to ARM TC2 and they are derived from the + * energy model data defined in: arch/arm/kernel/topology.c + */ +static struct target_nrg +schedtune_target_nrg = { + + /* + * TC2 Min CPUs power: + * all CPUs idle, all clusters in deep idle: + * 0 * 3 + 0 * 2 + 10 + 25 + */ + .min_power = 35, + + /* + * TC2 Max CPUs power: + * all CPUs fully utilized while running at max OPP: + * 1024 * 3 + 6997 * 2 + 4905 + 15200 + */ + + .max_power = 37171, + + /* + * Fast integer division by constant: + * Constant : Max - Min (C) = 37171 - 35 = 37136 + * Precision : 0.1% (P) = 0.1 + * Reference : C * 100 / P (R) = 37136000 + * + * Thus: + * Shift bits : ceil(log(R,2)) (S) = 26 + * Mult const : round(2^S/C) (M) = 1807 + * + * This allows us to compute the normalized energy: + * system_energy / C + * as: + * (system_energy * M) >> S + */ + .nrg_shift = 26, /* S */ + .nrg_mult = 1807, /* M */ +}; + +/* + * System energy normalization + * Returns the normalized value, in the range [0..SCHED_LOAD_SCALE], + * corresponding to the specified energy variation. 
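+ * For example, with the TC2 constants above (C = 37136, M = 1807, + * S = 26), an energy_diff covering the whole range normalizes to + * ((37136 << SCHED_LOAD_SHIFT) * 1807) >> 26 = 1023, i.e. almost + * exactly SCHED_LOAD_SCALE.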
+ */ +int +schedtune_normalize_energy(int energy_diff) +{ + long long normalized_nrg = energy_diff; + int max_delta; + + /* Check for boundaries */ + max_delta = schedtune_target_nrg.max_power; + max_delta -= schedtune_target_nrg.min_power; + WARN_ON(abs(energy_diff) >= max_delta); + + /* Scale by energy magnitude */ + normalized_nrg <<= SCHED_LOAD_SHIFT; + + /* Normalize on max energy for target platform */ + normalized_nrg *= schedtune_target_nrg.nrg_mult; + normalized_nrg >>= schedtune_target_nrg.nrg_shift; + + return normalized_nrg; +} + +static int +__schedtune_accept_deltas(int nrg_delta, int cap_delta, + int perf_boost_idx, int perf_constrain_idx) { + int energy_payoff; + + /* Performance Boost (B) region */ + if (nrg_delta > 0 && cap_delta > 0) { + /* + * energy_payoff criteria: + * cap_delta / nrg_delta > cap_gain / nrg_gain + * which is: + * nrg_delta * cap_gain < cap_delta * nrg_gain + */ + energy_payoff = cap_delta * threshold_gains[perf_boost_idx].nrg_gain; + energy_payoff -= nrg_delta * threshold_gains[perf_boost_idx].cap_gain; + + trace_sched_tune_filter( + threshold_gains[perf_boost_idx].nrg_gain, + threshold_gains[perf_boost_idx].cap_gain, + energy_payoff, 8); + + return energy_payoff; + } + + /* Performance Constraint (C) region */ + if (nrg_delta < 0 && cap_delta < 0) { + /* + * energy_payoff criteria: + * cap_delta / nrg_delta < cap_gain / nrg_gain + * which is: + * nrg_delta * cap_gain > cap_delta * nrg_gain + */ + energy_payoff = nrg_delta * threshold_gains[perf_constrain_idx].cap_gain; + energy_payoff -= cap_delta * threshold_gains[perf_constrain_idx].nrg_gain; + + trace_sched_tune_filter( + threshold_gains[perf_constrain_idx].nrg_gain, + threshold_gains[perf_constrain_idx].cap_gain, + energy_payoff, 6); + + return energy_payoff; + } + + /* Default: reject schedule candidate */ + return -INT_MAX; +} + +#ifdef CONFIG_CGROUP_SCHEDTUNE + +/* + * EAS scheduler tunables for task groups. + */ + +/* SchedTune tunables for a group of tasks */ +struct schedtune { + /* SchedTune CGroup subsystem */ + struct cgroup_subsys_state css; + + /* Boost group allocated ID */ + int idx; + + /* Boost value for tasks on that SchedTune CGroup */ + int boost; + + /* Performance Boost (B) region threshold params */ + int perf_boost_idx; + + /* Performance Constraint (C) region threshold params */ + int perf_constrain_idx; + +}; + +static inline struct schedtune *css_st(struct cgroup_subsys_state *css) +{ + return css ? container_of(css, struct schedtune, css) : NULL; +} + +static inline struct schedtune *task_schedtune(struct task_struct *tsk) +{ + return css_st(task_css(tsk, schedtune_cgrp_id)); +} + +static inline struct schedtune *parent_st(struct schedtune *st) +{ + return css_st(st->css.parent); +} + +/* + * SchedTune root control group + * The root control group is used to define a system-wide boost tuning, + * which is applied to all tasks in the system. + * Task-specific boost tuning can be specified by creating and + * configuring a child control group under the root one. + * By default, system-wide boosting is disabled, i.e. no boost is applied + * to tasks that are not in a child control group. 
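+ * + * For example (mount point and group name are illustrative only, not + * mandated by this patch): + * # mount -t cgroup -o schedtune none /sys/fs/cgroup/stune + * # echo 10 > /sys/fs/cgroup/stune/schedtune.boost + * # mkdir /sys/fs/cgroup/stune/performance + * # echo 50 > /sys/fs/cgroup/stune/performance/schedtune.boost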
+ */ +static struct schedtune +root_schedtune = { + .boost = 0, + .perf_boost_idx = 0, + .perf_constrain_idx = 0, +}; + +int +schedtune_accept_deltas(int nrg_delta, int cap_delta, struct task_struct *task) { + struct schedtune *ct; + int perf_boost_idx; + int perf_constrain_idx; + + /* Optimal (O) region */ + if (nrg_delta < 0 && cap_delta > 0) { + trace_sched_tune_filter(0, 0, 1, 0); + return INT_MAX; + } + + /* Suboptimal (S) region */ + if (nrg_delta > 0 && cap_delta < 0) { + trace_sched_tune_filter(0, 0, -1, 5); + return -INT_MAX; + } + + /* Get task-specific perf Boost/Constraint indexes */ + rcu_read_lock(); + ct = task_schedtune(task); + perf_boost_idx = ct->perf_boost_idx; + perf_constrain_idx = ct->perf_constrain_idx; + rcu_read_unlock(); + + return __schedtune_accept_deltas(nrg_delta, cap_delta, + perf_boost_idx, perf_constrain_idx); + +} + +/* + * Maximum number of boost groups to support + * When per-task boosting is used we still allow only a limited number of + * boost groups for two main reasons: + * 1. on a real system we usually have only a few classes of workloads which + * make sense to boost with different values (e.g. background vs foreground + * tasks, interactive vs low-priority tasks) + * 2. a limited number allows for a simpler and more memory/time efficient + * implementation especially for the computation of the per-CPU boost + * value + */ +#define BOOSTGROUPS_COUNT 16 + +/* Array of configured boostgroups */ +static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = { + &root_schedtune, + NULL, +}; + +/* SchedTune boost groups + * Each CPU in the system could be affected by multiple boost groups, for + * example when a CPU has two RUNNABLE tasks belonging to two different boost + * groups and thus likely with different boost values. + * This data structure keeps track of all the boost groups which could have + * an impact on a CPU. + * Since on each system we expect only a limited number of boost + * groups, here we use a simple array to keep track of the metrics required to + * compute the maximum per-CPU boosting value. 
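+ * E.g., if one group boosted to 10 and another boosted to 30 both have + * RUNNABLE tasks on a CPU, that CPU's boost_max is 30; it is recomputed + * (and drops back) once the last task of the 30% group leaves that CPU.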
+ */ +struct boost_groups { + /* Maximum boost value for all RUNNABLE tasks on a CPU */ + unsigned boost_max; + struct { + /* The boost for tasks on that boost group */ + unsigned boost; + /* Count of RUNNABLE tasks on that boost group */ + unsigned tasks; + } group[BOOSTGROUPS_COUNT]; +}; + +/* Boost groups affecting each CPU in the system */ +DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups); + +static void +schedtune_cpu_update(int cpu) +{ + struct boost_groups *bg; + unsigned boost_max; + int idx; + + bg = &per_cpu(cpu_boost_groups, cpu); + + /* The root boost group is always active */ + boost_max = bg->group[0].boost; + for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) { + /* + * A boost group affects a CPU only if it has + * RUNNABLE tasks on that CPU + */ + if (bg->group[idx].tasks == 0) + continue; + boost_max = max(boost_max, bg->group[idx].boost); + } + + bg->boost_max = boost_max; +} + +static int +schedtune_boostgroup_update(int idx, int boost) +{ + struct boost_groups *bg; + int cur_boost_max; + int old_boost; + int cpu; + + /* Update per-CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + + /* + * Keep track of current boost values to compute the per-CPU + * maximum only when it has been affected by the new value of + * the updated boost group + */ + cur_boost_max = bg->boost_max; + old_boost = bg->group[idx].boost; + + /* Update the boost value of this boost group */ + bg->group[idx].boost = boost; + + /* Check if this update increases the current max */ + if (boost > cur_boost_max && bg->group[idx].tasks) { + bg->boost_max = boost; + trace_sched_tune_boostgroup_update(cpu, 1, bg->boost_max); + continue; + } + + /* Check if this update has decreased the current max */ + if (cur_boost_max == old_boost && old_boost > boost) { + schedtune_cpu_update(cpu); + trace_sched_tune_boostgroup_update(cpu, -1, bg->boost_max); + continue; + } + + trace_sched_tune_boostgroup_update(cpu, 0, bg->boost_max); + } + + return 0; +} + +static inline void +schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count) +{ + struct boost_groups *bg; + int tasks; + + bg = &per_cpu(cpu_boost_groups, cpu); + + /* Update the boosted-tasks count while avoiding making it negative */ + if (task_count < 0 && bg->group[idx].tasks <= -task_count) + bg->group[idx].tasks = 0; + else + bg->group[idx].tasks += task_count; + + /* Boost group activation or deactivation on that RQ */ + tasks = bg->group[idx].tasks; + if (tasks == 1 || tasks == 0) + schedtune_cpu_update(cpu); + + trace_sched_tune_tasks_update(p, cpu, tasks, idx, + bg->group[idx].boost, bg->boost_max); + +} + +/* + * NOTE: This function must be called while holding the lock on the CPU RQ + */ +void schedtune_enqueue_task(struct task_struct *p, int cpu) +{ + struct schedtune *st; + int idx; + + /* + * When a task is marked PF_EXITING by do_exit() it's going to be + * dequeued and enqueued multiple times in the exit path. + * Thus we avoid any further update, since we do not want to change + * CPU boosting while the task is exiting. 
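+ * The final accounting fix-up for an exiting task is done by the cgroup + * exit() callback, see schedtune_exit() below.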
+ */ + if (p->flags & PF_EXITING) + return; + + /* Get task boost group */ + rcu_read_lock(); + st = task_schedtune(p); + idx = st->idx; + rcu_read_unlock(); + + schedtune_tasks_update(p, cpu, idx, 1); +} + +/* + * NOTE: This function must be called while holding the lock on the CPU RQ + */ +void schedtune_dequeue_task(struct task_struct *p, int cpu) +{ + struct schedtune *st; + int idx; + + /* + * When a task is marked PF_EXITING by do_exit() it's going to be + * dequeued and enqueued multiple times in the exit path. + * Thus we avoid any further update, since we do not want to change + * CPU boosting while the task is exiting. + * The last dequeue will be done by the cgroup exit() callback. + */ + if (p->flags & PF_EXITING) + return; + + /* Get task boost group */ + rcu_read_lock(); + st = task_schedtune(p); + idx = st->idx; + rcu_read_unlock(); + + schedtune_tasks_update(p, cpu, idx, -1); +} + +int schedtune_taskgroup_boost(struct task_struct *p) +{ + struct schedtune *ct; + int task_boost; + + rcu_read_lock(); + ct = task_schedtune(p); + task_boost = ct->boost; + rcu_read_unlock(); + + return task_boost; +} + +int schedtune_cpu_boost(int cpu) +{ + struct boost_groups *bg; + bg = &per_cpu(cpu_boost_groups, cpu); + return bg->boost_max; +} + +static u64 +boost_read(struct cgroup_subsys_state *css, struct cftype *cft) +{ + struct schedtune *st = css_st(css); + return st->boost; +} + +static int +boost_write(struct cgroup_subsys_state *css, struct cftype *cft, + u64 boost) +{ + struct schedtune *st = css_st(css); + int err = 0; + + if (boost < 0 || boost > 100) { + err = -EINVAL; + goto out; + } + + st->boost = boost; + if (css == &root_schedtune.css) + sysctl_sched_cfs_boost = boost; + + if (boost == 100) + boost = 99; + + /* Performance Boost (B) region threshold params */ + st->perf_boost_idx = boost; + st->perf_boost_idx /= 10; + + /* Performance Constraint (C) region threshold params */ + st->perf_constrain_idx = 100 - boost; + st->perf_constrain_idx /= 10; + + /* Update CPU boost */ + schedtune_boostgroup_update(st->idx, st->boost); + + trace_sched_tune_config(st->boost, + threshold_gains[st->perf_boost_idx].nrg_gain, + threshold_gains[st->perf_boost_idx].cap_gain, + threshold_gains[st->perf_constrain_idx].nrg_gain, + threshold_gains[st->perf_constrain_idx].cap_gain); + +out: + return err; +} + +static struct cftype files[] = { + { + .name = "boost", + .read_u64 = boost_read, + .write_u64 = boost_write, + }, + { } /* terminate */ +}; + +static int +schedtune_boostgroup_init(struct schedtune *st) +{ + struct boost_groups *bg; + int cpu; + + /* Keep track of allocated boost group */ + allocated_group[st->idx] = st; + + /* Initialize the per-CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + bg->group[st->idx].boost = 0; + bg->group[st->idx].tasks = 0; + } + + return 0; +} + +static int +schedtune_init(void) +{ + struct boost_groups *bg; + int cpu; + + /* Initialize the per-CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + memset(bg, 0, sizeof(struct boost_groups)); + } + + pr_info("schedtune configured to support %d boost groups\n", + BOOSTGROUPS_COUNT); + return 0; +} + +static struct cgroup_subsys_state * +schedtune_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct schedtune *st; + int idx; + + if (!parent_css) { + schedtune_init(); + return &root_schedtune.css; + } + + /* Allows only single-level hierarchies */ + if (parent_css != &root_schedtune.css) { + pr_err("Nested SchedTune boosting 
groups not allowed\n"); + return ERR_PTR(-ENOMEM); + } + + /* Allows only a limited number of boosting groups */ + for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) + if (allocated_group[idx] == NULL) + break; + if (idx == BOOSTGROUPS_COUNT) { + pr_err("Trying to create more than %d SchedTune boosting groups\n", + BOOSTGROUPS_COUNT); + return ERR_PTR(-ENOSPC); + } + + st = kzalloc(sizeof(*st), GFP_KERNEL); + if (!st) + goto out; + + /* Initialize per CPUs boost group support */ + st->idx = idx; + if (schedtune_boostgroup_init(st)) + goto release; + + return &st->css; + +release: + kfree(st); +out: + return ERR_PTR(-ENOMEM); +} + +static void +schedtune_boostgroup_release(struct schedtune *st) +{ + /* Reset this group boost */ + schedtune_boostgroup_update(st->idx, 0); + + /* Keep track of allocated boost group */ + allocated_group[st->idx] = NULL; +} + +static void +schedtune_css_free(struct cgroup_subsys_state *css) +{ + struct schedtune *st = css_st(css); + schedtune_boostgroup_release(st); + kfree(st); +} + +static void +schedtune_exit(struct cgroup_subsys_state *css, + struct cgroup_subsys_state *old_css, + struct task_struct *tsk) +{ + struct schedtune *old_st = css_st(old_css); + int cpu = task_cpu(tsk); + + schedtune_tasks_update(tsk, cpu, old_st->idx, -1); +} + +struct cgroup_subsys schedtune_cgrp_subsys = { + .css_alloc = schedtune_css_alloc, + .css_free = schedtune_css_free, + .exit = schedtune_exit, + .legacy_cftypes = files, + .early_init = 1, +}; + +#else /* CONFIG_CGROUP_SCHEDTUNE */ + +int +schedtune_accept_deltas(int nrg_delta, int cap_delta) { + + /* Optimal (O) region */ + if (nrg_delta < 0 && cap_delta > 0) { + trace_printk("schedtune_filter: region=O ngain=0 pgain=0 nrg_payoff=-1"); + return INT_MAX; + } + + /* Suboptimal (S) region */ + if (nrg_delta > 0 && cap_delta < 0) { + trace_printk("schedtune_filter: region=S ngain=0 pgain=0 nrg_payoff=1"); + return -INT_MAX; + } + + return __schedtune_accept_deltas(nrg_delta, cap_delta, + perf_boost_idx, perf_constrain_idx); + +} + +#endif /* CONFIG_CGROUP_SCHEDTUNE */ + +int +sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + + /* Performance Boost (B) region threshold params */ + perf_boost_idx = sysctl_sched_cfs_boost; + perf_boost_idx /= 10; + + /* Performance Constraint (C) region threshold params */ + perf_constrain_idx = 100 - sysctl_sched_cfs_boost; + perf_constrain_idx /= 10; + + return 0; +} + diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h new file mode 100644 index 000000000000..6e878eac3b1d --- /dev/null +++ b/kernel/sched/tune.h @@ -0,0 +1,34 @@ + +#ifdef CONFIG_SCHED_TUNE + +extern int schedtune_normalize_energy(int energy); + +#ifdef CONFIG_CGROUP_SCHEDTUNE + +extern int schedtune_taskgroup_boost(struct task_struct *tsk); +extern int schedtune_cpu_boost(int cpu); + +extern void schedtune_enqueue_task(struct task_struct *p, int cpu); +extern void schedtune_dequeue_task(struct task_struct *p, int cpu); + +extern int schedtune_accept_deltas(int nrg_delta, int cap_delta, + struct task_struct *task); + +#else /* CONFIG_CGROUP_SCHEDTUNE */ + +extern int schedtune_accept_deltas(int nrg_delta, int cap_delta); + +#define schedtune_enqueue_task(task, cpu) while(0){} +#define schedtune_dequeue_task(task, cpu) while(0){} + +#endif /* CONFIG_CGROUP_SCHEDTUNE */ + +#else /* CONFIG_SCHED_TUNE */ + +#define schedtune_normalize_energy(energy) energy 
+#define schedtune_accept_deltas(nrg_delta, cap_delta) nrg_delta + +#define schedtune_enqueue_task(task, cpu) while(0){} +#define schedtune_dequeue_task(task, cpu) while(0){} + +#endif /* CONFIG_SCHED_TUNE */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8479c4567add..29ac0a9485f3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -445,6 +445,21 @@ static struct ctl_table kern_table[] = { .extra1 = &one, }, #endif +#ifdef CONFIG_SCHED_TUNE + { + .procname = "sched_cfs_boost", + .data = &sysctl_sched_cfs_boost, + .maxlen = sizeof(sysctl_sched_cfs_boost), +#ifdef CONFIG_CGROUP_SCHEDTUNE + .mode = 0444, +#else + .mode = 0644, +#endif + .proc_handler = &sysctl_sched_cfs_boost_handler, + .extra1 = &zero, + .extra2 = &one_hundred, + }, +#endif #ifdef CONFIG_PROVE_LOCKING { .procname = "prove_locking",
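As a worked example of the boost-to-index mapping implemented by sysctl_sched_cfs_boost_handler() above (the /proc/sys/kernel/sched_cfs_boost knob is writable only when CONFIG_CGROUP_SCHEDTUNE is not set, per the 0644 vs 0444 modes): writing 35 yields perf_boost_idx = 35 / 10 = 3 and perf_constrain_idx = (100 - 35) / 10 = 6, selecting threshold_gains[3] = { 3, 4 } for the Performance Boost region and threshold_gains[6] = { 4, 2 } for the Performance Constraint region.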