author	Viresh Kumar <viresh.kumar@linaro.org>	2019-11-25 13:00:59 +0530
committer	Viresh Kumar <viresh.kumar@linaro.org>	2019-11-25 13:00:59 +0530
commit	eead365e2da5a984c1b477098c97b8fcfcffd4c3 (patch)
tree	ab1561807a9a8cd6ff18efb87bfba3a227f98ea5
parent	3723bc8fb261248cd035a480c971142f3d836c41 (diff)
updates
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
-rw-r--r--	scheduler/lwn_sched_idle.txt	260
1 file changed, 260 insertions(+), 0 deletions(-)
diff --git a/scheduler/lwn_sched_idle.txt b/scheduler/lwn_sched_idle.txt
new file mode 100644
index 0000000..923a95c
--- /dev/null
+++ b/scheduler/lwn_sched_idle.txt
@@ -0,0 +1,260 @@
+Fixing SCHED_IDLE
+
+The Linux kernel scheduler (or the task scheduler) is a complicated beast, and a
+lot of effort goes into improving it during every kernel release cycle. The 5.4
+kernel release includes a few improvements to the existing SCHED_IDLE
+scheduling policy that can help users improve the scheduling latency of
+their high-priority (interactive) tasks if they use the SCHED_IDLE policy for
+the lowest-priority (background) tasks.
+
+Scheduling classes and policies
+The Linux kernel scheduler implements many "scheduling classes", an extensible
+hierarchy of modules, and each class may further encapsulate
+"scheduling policies" that are handled by the scheduler core in a
+policy-independent way.
+
+The scheduling classes are described below in descending priority order; the
+Stop class has the highest priority and the Idle class the lowest.
+
+The Stop scheduling class is a special class that is used internally by the
+kernel. It doesn't implement any scheduling policy and no user task ever gets
+scheduled with it. The Stop class is, instead, a mechanism to force a CPU to
+stop running everything else and perform a specific task. As this is the
+highest-priority class, it can preempt everything else and nothing ever
+preempts it. It is used by one CPU to stop another in order to run a specific
+function, so it is only available on SMP systems. The Stop class creates a
+single, per-CPU kernel thread (or kthread) named migration/N, where N is the
+CPU number. This class is used by the kernel for task migration, CPU hotplug,
+RCU, ftrace, clock events, and more.
+
+The Deadline scheduling class implements a single scheduling policy,
+SCHED_DEADLINE, and it serves the highest-priority user tasks in the system. It
+is used for tasks with hard deadlines, like video encoding and decoding. The
+task with the earliest deadline is served first under this policy. The policy
+of a task can be set to SCHED_DEADLINE using the sched_setattr() system call by
+passing three parameters: the run time, deadline, and period.
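
As a sketch of how those three parameters fit together: the helper below fills a struct whose layout follows the sched_setattr(2) man page. Actually applying the attributes requires invoking the sched_setattr() system call with appropriate privileges, which is omitted here; the function name is illustrative.

```c
#include <stdint.h>
#include <string.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Field layout per sched_setattr(2); only the deadline fields are used here. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    /* The three SCHED_DEADLINE parameters, in nanoseconds: */
    uint64_t sched_runtime;   /* worst-case execution time per period */
    uint64_t sched_deadline;  /* relative deadline within each period */
    uint64_t sched_period;    /* activation period */
};

/* Fill the attributes for a deadline task; the kernel's admission test
 * additionally requires runtime <= deadline <= period. */
void fill_deadline_attr(struct sched_attr *attr, uint64_t runtime_ns,
                        uint64_t deadline_ns, uint64_t period_ns)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->sched_policy = SCHED_DEADLINE;
    attr->sched_runtime = runtime_ns;
    attr->sched_deadline = deadline_ns;
    attr->sched_period = period_ns;
}
```

A task needing 10ms of CPU in every 100ms period, with a 30ms deadline, would pass 10000000, 30000000, and 100000000 respectively.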
+
+To ensure deadline-scheduling guarantees, the kernel must prevent situations
+where the current set of SCHED_DEADLINE threads is not schedulable within the
+given constraints. The kernel thus performs an admission test when setting or
+changing SCHED_DEADLINE policy and attributes. The test calculates whether the
+change can be successfully scheduled; if not, sched_setattr() fails with the
+error EBUSY.
+
+The POSIX realtime (or RT) scheduling class comes after the deadline class and
+is used for short, latency-sensitive tasks, like IRQ threads. This is a
+fixed-priority class that schedules higher-priority tasks before lower-priority
+tasks. It implements two scheduling policies: SCHED_FIFO and SCHED_RR. In
+SCHED_FIFO, a task runs until it relinquishes the CPU, either because it blocks
+for a resource or it has completed its execution. In SCHED_RR (round-robin), a
+task will run for the maximum time slice; if the task doesn't block before the
+end of its time slice, the scheduler will put it at the end of the round-robin
+queue of tasks with the same priority and select the next task to run. The
+priority of tasks under the realtime policies ranges from 1 (low) to 99
+(high).
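
That range can be confirmed from user space with standard POSIX calls, which need no privileges; on Linux both policies report 1 as the lowest and 99 as the highest static priority:

```c
#include <sched.h>

/* Thin wrappers querying the valid static-priority range for a
 * realtime scheduling policy (SCHED_FIFO or SCHED_RR). */
static int rt_prio_min(int policy)
{
    return sched_get_priority_min(policy);
}

static int rt_prio_max(int policy)
{
    return sched_get_priority_max(policy);
}
```

On Linux, rt_prio_min(SCHED_FIFO) returns 1 and rt_prio_max(SCHED_FIFO) returns 99, and SCHED_RR reports the same range.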
+
+The CFS (completely fair scheduling) class hosts most of the user tasks; it
+implements three scheduling policies: SCHED_NORMAL, SCHED_BATCH, and
+SCHED_IDLE. A task under any of these policies gets a chance to run only if no
+other tasks are enqueued in the deadline or realtime classes (though by default
+the scheduler reserves 5% of the CPU for CFS tasks regardless). The scheduler
+tracks the virtual runtime (vruntime) for all tasks, runnable and blocked. The
+lower a task's vruntime, the more deserving the task is for time on the
+processor. CFS accordingly moves low-vruntime tasks toward the front of the
+scheduling queue.
+
+The priority of a task is calculated by adding 120 to its nice value, which
+ranges from -20 to +19. The priority of the task is used to set the weight of
+the task, which in turn affects the vruntime of the task; the lower the nice
+value, the higher the priority. The task's weight will thus be higher, and its
+vruntime will increase more slowly as the task runs.
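
The relationship can be sketched numerically. The 120 offset is as described above; the weight model below (nice 0 maps to 1024, each nice step changes the weight by roughly 1.25x) approximates the kernel's precomputed sched_prio_to_weight table rather than reproducing it exactly:

```c
#include <stdint.h>

#define NICE_0_WEIGHT 1024

/* CFS priority of a task: its nice value plus 120. */
static int prio_from_nice(int nice)
{
    return 120 + nice;
}

/* Approximate task weight: nice 0 maps to 1024, and each nice step
 * scales the weight by about 1.25x (the kernel uses an exact
 * precomputed table instead of computing this at run time). */
static uint64_t weight_from_nice(int nice)
{
    double w = NICE_0_WEIGHT;
    int steps = nice > 0 ? nice : -nice;
    for (int i = 0; i < steps; i++)
        w = nice > 0 ? w / 1.25 : w * 1.25;
    return (uint64_t)(w + 0.5);
}

/* vruntime advances inversely to weight: a heavier (higher-priority)
 * task accrues vruntime more slowly for the same wall-clock runtime. */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

Under this model a nice -5 task has weight 3125, so 1ms of execution advances its vruntime by only about 328µs, keeping it ahead of its nice-0 peers in the queue.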
+
+The SCHED_NORMAL policy (called SCHED_OTHER in user space) is used for most of
+the tasks that run in a Linux environment, like the shell. The SCHED_BATCH
+policy is used for batch processing by non-interactive tasks — tasks that
+should run uninterrupted for a period of time and hence are normally scheduled
+only after finishing all the SCHED_NORMAL activity. The SCHED_IDLE policy is
+designed for the lowest-priority tasks in the system; these tasks get a chance
+to run only if there is nothing else to run. In practice, though, even in the
+presence of other SCHED_NORMAL tasks, a SCHED_IDLE task will get some time to
+run (around 1.4% for a task with a nice value of zero). This policy isn't
+widely used currently, and efforts are being made to improve how it works.
+
+Last is the Idle scheduling class (which should not be confused with the
+SCHED_IDLE scheduling policy). This is the lowest-priority scheduling class;
+like the Stop class, it doesn't manage any user tasks and so doesn't implement
+a policy. It only keeps a single per-CPU kthread which is named swapper/N,
+where N is the CPU number. These kthreads are also called the "idle threads"
+and aren't visible to user space. These threads are responsible for saving
+system power by putting the CPUs into deep idle states when there is no work to
+do.
+
+Scheduling classes in the kernel
+The scheduling classes are represented by struct sched_class in the kernel
+source code:
+
+ struct sched_class {
+ const struct sched_class *next;
+ void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
+ void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+ struct task_struct *(*pick_next_task) (struct rq *rq, struct task_struct *prev,
+ struct rq_flags *rf);
+ /* many fields omitted */
+ };
+
+This structure mostly consists of function pointers (callbacks) to
+class-specific implementations that are called by the scheduler core in a
+class-independent manner. The classes are kept in a singly linked list in
+descending order of their priorities; the head node points to the Stop
+scheduling class (highest priority) and the last node in the list points to the
+Idle class (lowest priority).
+
+The Linux kernel calls schedule() when it needs to pick a new task to run on
+the local CPU; schedule() in turn calls pick_next_task() to find that task.
+pick_next_task() traverses the list of scheduling classes, with the help of the
+for_each_class() macro, to find the highest-priority scheduling class that has
+a task available to run. Once a task is found, it is returned to the caller,
+which then runs it on the local CPU. There should always be a task available to
+run in the Idle class, which will run only if there is nothing else to run.
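
A toy model of that traversal is shown below. The names (toy_class, has_task) are illustrative; in the kernel, for_each_class() walks the real class list and each class's own pick_next_task() callback is invoked rather than a flag being checked:

```c
#include <stddef.h>

/* Each class links to the next-lower-priority class; the has_task flag
 * stands in for "this class's pick_next_task() found something to run". */
struct toy_class {
    const char *name;
    const struct toy_class *next;
    int has_task;
};

/* Walk the list from the highest-priority class downward, mirroring
 * the kernel's for_each_class() macro. */
#define for_each_toy_class(c, head) \
    for ((c) = (head); (c) != NULL; (c) = (c)->next)

static const struct toy_class *toy_pick_next(const struct toy_class *head)
{
    const struct toy_class *c;

    for_each_toy_class(c, head)
        if (c->has_task)
            return c;
    /* Unreachable in the kernel: the Idle class always has its kthread. */
    return NULL;
}
```

Building the list as Stop → Deadline → RT → CFS → Idle with runnable tasks only in CFS and Idle, toy_pick_next() returns the CFS class; if CFS empties, the Idle class is picked.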
+
+SCHED_IDLE improvements
+The CFS scheduler tries to be fair to all tasks by giving more CPU time to the
+higher-priority tasks as compared to the lower-priority tasks. It normally
+doesn't provide special treatment to tasks based on their scheduling policy;
+for example, tasks running under the SCHED_NORMAL and SCHED_IDLE policies are
+managed in the same way. They are all kept in the same CFS run queues, they
+contribute to the load and utilization of the CPUs in the same way, and they
+affect the PELT signals and CPU-frequency decisions alike. The
+only differentiating factor is the priority (derived from the nice value) of
+the tasks, which affects the weight of the tasks.
+
+The weight of a task defines how the load and utilization of the CPU will
+change because of that task. For this reason, we don't see a lot of SCHED_IDLE
+policy-related code in the CFS scheduler. As the SCHED_IDLE policy tasks have
+the lowest priority, they automatically get scheduled for the least amount of
+time. Also, since there aren't many known users of the SCHED_IDLE policy in the
+Linux community, no one had attempted to improve it since it was first
+introduced in Linux 2.6.23.
+
+When a newly woken-up task is available to run, the scheduler core finds the
+target run queue (i.e. a CPU to run it on) by calling the select_task_rq()
+callback of the respective scheduling class. This callback returns the CPU
+where the task should be enqueued. Once the task is enqueued there, the
+scheduler checks if that task should preempt the currently running task on that
+CPU by calling the check_preempt_curr() callback of the respective scheduling
+class.
+
+Until now, the SCHED_IDLE policy got special treatment only in the
+check_preempt_curr() callback, where a currently running SCHED_IDLE task is
+immediately preempted by a newly woken-up SCHED_NORMAL task. But this
+preemption only happens if the newly woken-up task is enqueued on a CPU
+that is currently running a SCHED_IDLE task. As there was no special handling
+of the SCHED_IDLE policy in the select_task_rq() callback, there was no
+specific effort made to enqueue the newly woken-up SCHED_NORMAL task on a CPU
+running a SCHED_IDLE task.
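
That preemption decision can be modeled as below. This is a simplification: in the real callback, when neither task is SCHED_IDLE, CFS falls back to a vruntime and wakeup-granularity comparison, which is elided here:

```c
/* Illustrative policy constants; only the relative roles matter here. */
enum { POLICY_NORMAL, POLICY_IDLE };

/* A toy version of the check_preempt_curr() special case: a newly woken
 * SCHED_NORMAL task immediately preempts a running SCHED_IDLE task. */
static int wakeup_preempts_curr(int curr_policy, int woken_policy)
{
    if (curr_policy == POLICY_IDLE && woken_policy == POLICY_NORMAL)
        return 1;
    /* Otherwise the vruntime-based comparison decides (elided). */
    return 0;
}
```

The asymmetry is the point: a SCHED_IDLE wakeup never preempts a running SCHED_NORMAL task, but the reverse preemption is immediate.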
+
+Normally, the scheduler tries to spread tasks across the available CPUs by
+searching for an idle CPU for newly woken-up tasks. The 5.4 kernel contains a
+patch set that makes the necessary changes to the CFS scheduler's
+select_task_rq() callback to queue tasks more aggressively on CPUs that are
+running only SCHED_IDLE tasks, even if a few CPUs are currently idle. There are
+two separate code paths in the CFS select_task_rq() callback: the slow path and
+the fast path. The slow path is mostly executed for newly forked tasks, where
+it tries to find the optimal CPU to run the task on. The fast path, instead, is
+taken for existing tasks that have become runnable again; it tries to find a
+target CPU (an idle CPU if possible) as soon as possible even if it is not the
+optimal one.
+
+Both these code paths were updated by the new patch set to consider a CPU that
+is running only SCHED_IDLE tasks as equivalent to an idle CPU. The scheduler
+now prefers to queue the newly woken-up tasks on CPUs with only SCHED_IDLE
+activity; the newly queued task will immediately preempt the currently running
+SCHED_IDLE task when check_preempt_curr() is called. This reduces the
+scheduling latency for the newly queued task as compared to selecting a fully
+idle CPU, as we don't need to bring an idle CPU out of its deep idle state,
+which normally takes a few milliseconds to complete.
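
The selection logic can be sketched as follows. This is a toy model, not the kernel code: the real change lives in the CFS select_task_rq() fast and slow paths, and the struct and function names below are invented for illustration:

```c
#include <stddef.h>

/* Minimal per-CPU state for the sketch. */
struct toy_cpu {
    int nr_running;     /* total runnable tasks on this CPU */
    int nr_idle_policy; /* of those, how many use SCHED_IDLE */
};

/* A CPU counts as "sched-idle" when it is busy but running only
 * SCHED_IDLE tasks. */
static int cpu_is_sched_idle(const struct toy_cpu *c)
{
    return c->nr_running > 0 && c->nr_running == c->nr_idle_policy;
}

/* Pick a target CPU for a woken task: prefer a CPU running only
 * SCHED_IDLE tasks (it can be preempted immediately, with no
 * wakeup-from-idle latency), fall back to a fully idle CPU, and
 * return -1 if every CPU is busy with non-idle-policy work. */
static int select_target_cpu(const struct toy_cpu *cpus, int n)
{
    int fallback = -1;

    for (int i = 0; i < n; i++) {
        if (cpu_is_sched_idle(&cpus[i]))
            return i;
        if (cpus[i].nr_running == 0 && fallback < 0)
            fallback = i;
    }
    return fallback;
}
```

Given one CPU running SCHED_NORMAL work, one fully idle CPU, and one running only a SCHED_IDLE task, this picks the sched-idle CPU even though an idle CPU is available, mirroring the 5.4 behavior described above.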
+
+The results of this change
+
+This patch set was initially tested with rt-app on an arm64 octa-core HiKey
+platform, where all the CPUs change frequency together. Rt-app is a test
+application that starts multiple periodic threads in order to simulate a
+real-time periodic load. For this test, eight SCHED_OTHER tasks and five
+SCHED_IDLE tasks were created. The tasks weren't bound to any particular CPU
+and could be queued anywhere by the scheduler. The SCHED_NORMAL tasks executed
+(busy loops) for 5333µs out of a period of 7777µs, while the SCHED_IDLE tasks
+kept on running forever. The idea was to check whether the SCHED_NORMAL tasks
+were being scheduled together (thus preempting each other) or if they were able
+to preempt SCHED_IDLE tasks instead. The result showed that the average
+scheduling latency (wu_lat field in rt-app results) for the SCHED_NORMAL tasks
+reduced to 102µs after the patch set was applied, down from 1116µs without the
+patch set; that was a reduction of 90% in scheduling latency for the
+SCHED_NORMAL tasks, which looks quite encouraging.
+
+Further testing showed that the average scheduling latency of a SCHED_NORMAL
+task on the above-mentioned arm64 platform is 64µs when it preempts a
+SCHED_IDLE task, 177µs when it runs on a shallow-idle (no cluster idle) CPU,
+and 307µs when it runs on a deep-idle (cluster idle) CPU. The same behavior can
+be observed with the kernel function tracer; the traces are shown below with
+the help of the KernelShark tool. First, the output from the 5.3 kernel:
+
+If you look closely at the above figure, you can see that occasionally, for
+long periods of time, a few CPUs were running a single task (solid single-color
+lines) without being preempted by another task. The long-running tasks are the
+SCHED_IDLE tasks which should ideally be preempted by the SCHED_NORMAL tasks,
+but that wasn't happening then.
+
+The results from the 5.4 kernel are different:
+
+If you look closely at the above figure, you can see that the pattern is quite
+consistent now. The SCHED_IDLE tasks are preempted by the SCHED_NORMAL tasks as
+soon as one is available to run, which then runs for 5333µs and then gives the
+CPU back to a SCHED_IDLE task. This is exactly the behavior this patch set was
+meant to create.
+
+Other applications
+Recently, Song Liu was trying to solve a problem seen on servers at Facebook.
+The servers running latency-sensitive workloads usually aren't fully loaded
+for various reasons, including disaster readiness. The machines running
+Facebook's interactive workloads (referred to as the main workload) have a lot
+of spare CPU cycles that Facebook would like to use for opportunistic side
+jobs like video encoding. However, Song's experiments showed that the side
+workload had a strong impact on the latency of the main workload. Song was
+asked to try the SCHED_IDLE patch set and found that it solved his problems to
+a great extent, though he tested an earlier version of the patch set in which
+only the fast path was updated.
+
+Another potential user of this work is the Android operating system, which
+classifies tasks by their importance to the current user's experience,
+ranging from "background" (not important) to "top-app" (most important). The
+SCHED_IDLE policy can potentially be used for all the background tasks, as
+that would increase the probability of a top-app task finding a CPU quickly
+by preempting a background task.
+
+Clearly this work has a lot of potential, and more mainstream products should
+be using the SCHED_IDLE policy, though that may require more SCHED_IDLE
+policy-specific optimizations in the CFS scheduler. One such
+optimization is under discussion right now on the kernel mailing list, where I
+am trying to be more aggressive in selecting a SCHED_IDLE CPU in both the slow
+and fast paths of the CFS scheduler. Also, improvements can be made to the CFS
+load balancer, which doesn't give any special treatment to the SCHED_IDLE CPUs
+currently and rather attempts to spread the tasks to all the CPUs; that is
+future work, though.