diff options
author | Viresh Kumar <viresh.kumar@linaro.org> | 2019-11-25 13:00:59 +0530 |
---|---|---|
committer | Viresh Kumar <viresh.kumar@linaro.org> | 2019-11-25 13:00:59 +0530 |
commit | eead365e2da5a984c1b477098c97b8fcfcffd4c3 (patch) | |
tree | ab1561807a9a8cd6ff18efb87bfba3a227f98ea5 | |
parent | 3723bc8fb261248cd035a480c971142f3d836c41 (diff) |
updates
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
-rw-r--r-- | scheduler/lwn_sched_idle.txt | 260 |
1 files changed, 260 insertions, 0 deletions
diff --git a/scheduler/lwn_sched_idle.txt b/scheduler/lwn_sched_idle.txt new file mode 100644 index 0000000..923a95c --- /dev/null +++ b/scheduler/lwn_sched_idle.txt @@ -0,0 +1,260 @@ +Fixing SCHED_IDLE + +The Linux kernel scheduler (or the task scheduler) is a complicated beast and a +lot of effort goes into improving it during every kernel release cycle. The 5.4 +kernel release includes few improvements to the existing SCHED_IDLE +scheduling policy that can help users to improve the scheduling latency of +their high-priority (interactive) tasks if they use the SCHED_IDLE policy for +the lowest-priority (background) tasks. + +Scheduling classes and policies +The Linux kernel scheduler implements many "scheduling classes", an extensible +hierarchy of modules, and each class may further encapsulate +"scheduling policies" that are handled by the scheduler core in a +policy-independent way. + + + + + + + + + + + + + + + + + + + +The scheduling classes are described below in descending priority order; the +Stop class has the highest priority, and Idle class has the lowest. + +The Stop scheduling class is a special class that is used internally by the +kernel. It doesn't implement any scheduling policy and no user task ever gets +scheduled with it. The Stop class is, instead, a mechanism to force a CPU to +stop running everything else and perform a specific task. As this is the +highest-priority class, it can preempt everything else and nothing ever +preempts it. It is used by one CPU to stop another in order to run a specific +function, so it is only available on SMP systems. The Stop class creates a +single, per-CPU kernel thread (or kthread) named migration/N, where N is the +CPU number. This class is used by the kernel for task migration, CPU hotplug, +RCU, ftrace, clock events, and more. + +The Deadline scheduling class implements a single scheduling policy, +SCHED_DEADLINE, and it serves the highest-priority user tasks in the system. It +is used for tasks with hard deadlines, like video encoding and decoding. The +task with the earliest deadline is served first under this policy. The policy +of a task can be set to SCHED_DEADLINE using the sched_setattr() system call by +passing three parameters: the run time, deadline, and period. + +To ensure deadline-scheduling guarantees, the kernel must prevent situations +where the current set of SCHED_DEADLINE threads is not schedulable within the +given constraints. The kernel thus performs an admittance test when setting or +changing SCHED_DEADLINE policy and attributes. This admission test calculates +whether the change can be successfully scheduled; if not, sched_setattr() fails +with the error EBUSY. + +The POSIX realtime (or RT) scheduling class comes after the deadline class and +is used for short, latency-sensitive tasks, like IRQ threads. This is a +fixed-priority class that schedules higher-priority tasks before lower-priority +tasks. It implements two scheduling policies: SCHED_FIFO and SCHED_RR. In +SCHED_FIFO, a task runs until it relinquishes the CPU, either because it blocks +for a resource or it has completed its execution. In SCHED_RR (round-robin), a +task will run for the maximum time slice; if the task doesn't block before the +end of its time slice, the scheduler will put it at the end of the round-robin +queue of tasks with the same priority and select the next task to run. The +priority of the tasks under the realtime policies range from 1 (low) to 99 +(high). + +The CFS (completely fair scheduling) class hosts most of the user tasks; it +implements three scheduling policies: SCHED_NORMAL, SCHED_BATCH, and +SCHED_IDLE. A task under any of these policies gets a chance to run only if no +other tasks are enqueued in the deadline or realtime classes (though by default +the scheduler reserves 5% of the CPU for CFS tasks regardless). The scheduler +tracks the virtual runtime (vruntime) for all tasks, runnable and blocked. The +lower a task's vruntime, the more deserving the task is for time on the +processor. CFS accordingly moves low-vruntime tasks toward the front of the +scheduling queue. + +The priority of a task is calculated by adding 120 to its nice value, which +ranges from -20 to +19. The priority of the task is used to set the weight of +the task, which in turn affects the vruntime of the task; the lower the nice +value, the higher the priority. The task's weight will thus be higher, and its +vruntime will increase more slowly as the task runs. + +The SCHED_NORMAL policy (called SCHED_OTHER in user space) is used for most of +the tasks that run in a Linux environment, like the shell. The SCHED_BATCH +policy is used for batch processing by non-interactive tasks — tasks that +should run uninterrupted for a period of time and hence are normally scheduled +only after finishing all the SCHED_NORMAL activity. The SCHED_IDLE policy is +designed for the lowest-priority tasks in the system; these tasks get a chance +to run only if there is nothing else to run. Though, in practice, even in the +presence of other SCHED_NORMAL tasks a SCHED_IDLE task will get some time to +run (around 1.4% for a task with a nice value of zero). This policy isn't +widely used currently and efforts are being made to improve how it works. + +Last is the Idle scheduling class (which should not be confused with the +SCHED_IDLE scheduling policy). This is the lowest-priority scheduling class; +like the Stop class, it doesn't manage any user tasks and so doesn't implement +a policy. It only keeps a single per-CPU kthread which is named swapper/N, +where N is the CPU number. These kthreads are also called the "idle threads" +and aren't visible to user space. These threads are responsible for saving +system power by putting the CPUs into deep idle states when there is no work to +do. + +Scheduling classes in the kernel +The scheduling classes are represented by struct sched_class in the kernel +source code: + + struct sched_class { + const struct sched_class *next; + void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); + void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); + struct task_struct *(*pick_next_task) (struct rq *rq, struct task_struct *prev, + struct rq_flags *rf); + /* many fields omitted */ + }; + +This structure mostly consists of function pointers (callbacks) to +class-specific implementations that are called by the scheduler core in a +class-independent manner. The classes are kept in a singly linked list in +descending order of their priorities; the head node points to the Stop +scheduling class (highest priority) and the last node in the list points to the +Idle class (lowest priority). + +The Linux kernel calls schedule() when it needs to pick a new task to run on +the local CPU, which further calls pick_next_task() to find the next task. +pick_next_task() traverses the list of scheduling classes, with the help of the +for_each_class() macro, to find the highest-priority scheduling class that has +a task available to run. Once a task is found, it is returned to the caller, +which then runs it on the local CPU. There should always be a task available to +run in the Idle class, which will run only if there is nothing else to run. + +SCHED_IDLE improvements +The CFS scheduler tries to be fair to all tasks by giving more CPU time to the +higher-priority tasks as compared to the lower-priority tasks. It normally +doesn't provide special treatment to tasks based on their scheduling policy, +for example tasks running under the SCHED_NORMAL and SCHED_IDLE policies are +managed in the same way. They are all kept in the same CFS run queues, the load +and utilization of the CPUs change in the same way for all the tasks, and the +PELT signal and CPU-frequency changes are impacted similarly by all tasks. The +only differentiating factor is the priority (derived from the nice value) of +the tasks, which affects the weight of the tasks. + +The weight of a task defines how the load and utilization of the CPU will +change because of that task. For this reason, we don't see a lot of SCHED_IDLE +policy-related code in the CFS scheduler. As the SCHED_IDLE policy tasks have +the lowest priority, they automatically get scheduled for the least amount of +time. Also, since there aren't many known users of the SCHED_IDLE policy in the +Linux community, no one attempted to improve it since it was first introduced +in Linux 2.6.23. + +When a newly woken-up task is available to run, the scheduler core finds the +target run queue (i.e. a CPU to run it on) by calling the select_task_rq() +callback of the respective scheduling class. This callback returns the CPU +where the task should be enqueued. Once the task is enqueued there, the +scheduler checks if that task should preempt the currently running task on that +CPU by calling the check_preempt_curr() callback of the respective scheduling +class. + +Until now, the SCHED_IDLE policy was getting special treatment only in the +check_preempt_curr() callback, where a currently running SCHED_IDLE task will +be immediately preempted by a newly woken-up SCHED_NORMAL task. But this +preemption will only happen if the newly woken-up task is enqueued on a CPU +that is running a SCHED_IDLE task currently. As there was no special handling +of the SCHED_IDLE policy in the select_task_rq() callback, there was no +specific effort made to enqueue the newly woken-up SCHED_NORMAL task on a CPU +running a SCHED_IDLE task. + +Normally, the scheduler tries to spread tasks across the available CPUs by +searching for an idle CPU for newly woken-up tasks. The 5.4 kernel contains a +patch set that makes the necessary changes to the CFS scheduler's +select_task_rq() callback to queue tasks more aggressively on CPUs that are +running only SCHED_IDLE tasks, even if a few CPUs are currently idle. There are +two separate code paths in the CFS select_task_rq() callback: the slow path and +the fast path. The slow path is mostly executed for newly forked tasks, where +it tries to find the optimal CPU to run the task on. The fast path, instead, is +taken for existing tasks that have become runnable again; it tries to find a +target CPU (an idle CPU if possible) as soon as possible even if it is not the +optimal one. + +Both these code paths were updated by the new patch set to consider a CPU that +is running only SCHED_IDLE tasks as equivalent to an idle CPU. The scheduler +now prefers to queue the newly woken-up tasks on CPUs with only SCHED_IDLE +activity; the newly queued task will immediately preempt the currently running +SCHED_IDLE task when check_preempt_curr() is called. This reduces the +scheduling latency for the newly queued task as compared to selecting a fully +idle CPU, as we don't need to bring an idle CPU out of its deep idle state, +which normally takes a few milliseconds to complete. + +The results of this change + +This patch set was initially tested with rt-app on an arm64 octa-core HiKey +platform, where all the CPUs change frequency together. Rt-app is a test +application that starts multiple periodic threads in order to simulate a +real-time periodic load. For this test, eight SCHED_OTHER tasks and five +SCHED_IDLE tasks were created. The tasks weren't bound to any particular CPU +and could be queued anywhere by the scheduler. The SCHED_NORMAL tasks executed +(busy loops) for 5333µs out of a period of 7777µs, while the SCHED_IDLE tasks +kept on running forever. The idea was to check whether the SCHED_NORMAL tasks +were being scheduled together (thus preempting each other) or if they were able +to preempt SCHED_IDLE tasks instead. The result showed that the average +scheduling latency (wu_lat field in rt-app results) for the SCHED_NORMAL tasks +reduced to 102µs after the patch set was applied, down from 1116µs without the +patch set; that was a reduction of 90% in scheduling latency for the +SCHED_NORMAL tasks, which looks quite encouraging. + +Further testing showed that the average scheduling latency of a SCHED_NORMAL +task, on the above-mentioned arm64 platform is 64µs when it preempts a +SCHED_IDLE task, 177µs when it runs on a shallow-idle (no cluster idle) CPU, +and 307µs when it runs on a deep-idle (cluster idle) CPU. The same behavior can +be observed with the kernel function tracer; the traces are shown below with +help of the KernelShark tool. First, the output from the 5.3 kernel: + +If you look closely at the above figure, you can see that occasionally, for +long periods of time, a few CPUs were running a single task (solid single-color +lines) without being preempted by another task. The long-running tasks are the +SCHED_IDLE tasks which should ideally be preempted by the SCHED_NORMAL tasks, +but that wasn't happening then. + +The results from the 5.4 kernel are different: + +If you look closely at the above figure, you can see that the pattern is quite +consistent now. The SCHED_IDLE tasks are preempted by the SCHED_NORMAL tasks as +soon as one is available to run, which then runs for 5333µs and then gives the +CPU back to a SCHED_IDLE task. This is exactly the behavior this patch set was +meant to create. + +Other applications +Recently, Song Liu was trying to solve a problem seen on servers at Facebook. +The servers running latency-sensitive workloads usually weren't fully loaded +for various reasons, including disaster readiness. The machines running +Facebook's interactive workloads (referred as the main workload) have a lot of +spare CPU cycles that they would like to use for opportunistic side jobs like +video encoding. However, Song's experiments showed that the side workload has a +strong impact on the latency of main workload. Song was asked to try the +SCHED_IDLE patch set and he found that it solved the problems he was facing to +a great extent; though he tested an earlier version of the patch set where only +the fast path was updated. + +Another potential user of this work is the Android operating system, which has +knowledge about the importance of a task for the current user's experience +ranging from "background" (not important) to "top-app" (most important). The +SCHED_IDLE policy can potentially be used for all the background tasks as that +would increase the probability of finding an idle CPU for top-app tasks by +preempting the background tasks. + +Clearly this work has a lot of potential, and more mainstream products should +be using the SCHED_IDLE policy. Though there may be a need for more SCHED_IDLE +policy-specific optimizations in the CFS scheduler for that. One such +optimization is under discussion right now on the kernel mailing list, where I +am trying to be more aggressive in selecting a SCHED_IDLE CPU in both the slow +and fast paths of the CFS scheduler. Also, improvements can be made to the CFS +load balancer, which doesn't give any special treatment to the SCHED_IDLE CPUs +currently and rather attempts to spread the tasks to all the CPUs; that is +future work, though. |