1 files changed, 736 insertions, 215 deletions
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ff49cf901148..cb9ea281ab72 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1,7 +1,9 @@
-
+================
 Control Group v2
+================
 
-October, 2015		Tejun Heo <tj@kernel.org>
+:Date: October, 2015
+:Author: Tejun Heo <tj@kernel.org>
 
 This is the authoritative documentation on the design, interface and
 conventions of cgroup v2.  It describes all userland-visible aspects
@@ -9,59 +11,74 @@ of cgroup including core and specific controller behaviors.  All
 future changes must be reflected in this document.  Documentation for
 v1 is available under Documentation/cgroup-v1/.
 
-CONTENTS
-
-1. Introduction
-  1-1. Terminology
-  1-2. What is cgroup?
-2. Basic Operations
-  2-1. Mounting
-  2-2. Organizing Processes
-  2-3. [Un]populated Notification
-  2-4. Controlling Controllers
-    2-4-1. Enabling and Disabling
-    2-4-2. Top-down Constraint
-    2-4-3. No Internal Process Constraint
-  2-5. Delegation
-    2-5-1. Model of Delegation
-    2-5-2. Delegation Containment
-  2-6. Guidelines
-    2-6-1. Organize Once and Control
-    2-6-2. Avoid Name Collisions
-3. Resource Distribution Models
-  3-1. Weights
-  3-2. Limits
-  3-3. Protections
-  3-4. Allocations
-4. Interface Files
-  4-1. Format
-  4-2. Conventions
-  4-3. Core Interface Files
-5. Controllers
-  5-1. CPU
-    5-1-1. CPU Interface Files
-  5-2. Memory
-    5-2-1. Memory Interface Files
-    5-2-2. Usage Guidelines
-    5-2-3. Memory Ownership
-  5-3. IO
-    5-3-1. IO Interface Files
-    5-3-2. Writeback
-P. Information on Kernel Programming
-  P-1. Filesystem Support for Writeback
-D. Deprecated v1 Core Features
-R. Issues with v1 and Rationales for v2
-  R-1. Multiple Hierarchies
-  R-2. Thread Granularity
-  R-3. Competition Between Inner Nodes and Threads
-  R-4. Other Interface Issues
-  R-5. Controller Issues and Remedies
-    R-5-1. Memory
-
-
-1. Introduction
-
-1-1. Terminology
+.. CONTENTS
+
+   1. Introduction
+     1-1. Terminology
+     1-2. What is cgroup?
+   2. Basic Operations
+     2-1. Mounting
+     2-2. Organizing Processes and Threads
+       2-2-1. Processes
+       2-2-2. Threads
+     2-3. [Un]populated Notification
+     2-4. Controlling Controllers
+       2-4-1. Enabling and Disabling
+       2-4-2. Top-down Constraint
+       2-4-3. No Internal Process Constraint
+     2-5. Delegation
+       2-5-1. Model of Delegation
+       2-5-2. Delegation Containment
+     2-6. Guidelines
+       2-6-1. Organize Once and Control
+       2-6-2. Avoid Name Collisions
+   3. Resource Distribution Models
+     3-1. Weights
+     3-2. Limits
+     3-3. Protections
+     3-4. Allocations
+   4. Interface Files
+     4-1. Format
+     4-2. Conventions
+     4-3. Core Interface Files
+   5. Controllers
+     5-1. CPU
+       5-1-1. CPU Interface Files
+     5-2. Memory
+       5-2-1. Memory Interface Files
+       5-2-2. Usage Guidelines
+       5-2-3. Memory Ownership
+     5-3. IO
+       5-3-1. IO Interface Files
+       5-3-2. Writeback
+     5-4. PID
+       5-4-1. PID Interface Files
+     5-5. RDMA
+       5-5-1. RDMA Interface Files
+     5-6. Misc
+       5-6-1. perf_event
+   6. Namespace
+     6-1. Basics
+     6-2. The Root and Views
+     6-3. Migration and setns(2)
+     6-4. Interaction with Other Namespaces
+   P. Information on Kernel Programming
+     P-1. Filesystem Support for Writeback
+   D. Deprecated v1 Core Features
+   R. Issues with v1 and Rationales for v2
+     R-1. Multiple Hierarchies
+     R-2. Thread Granularity
+     R-3. Competition Between Inner Nodes and Threads
+     R-4. Other Interface Issues
+     R-5. Controller Issues and Remedies
+       R-5-1. Memory
+
+
+Introduction
+============
+
+Terminology
+-----------
 
 "cgroup" stands for "control group" and is never capitalized.  The
 singular form is used to designate the whole feature and also as a
@@ -69,7 +86,8 @@ qualifier as in "cgroup controllers".  When explicitly referring to
 multiple individual control groups, the plural form "cgroups" is used.
 
 
-1-2. What is cgroup?
+What is cgroup?
+---------------
 
 cgroup is a mechanism to organize processes hierarchically and
 distribute system resources along the hierarchy in a controlled and
@@ -99,12 +117,14 @@ restrictions set closer to the root in the hierarchy can not be
 overridden from further away.
 
 
-2. Basic Operations
+Basic Operations
+================
 
-2-1. Mounting
+Mounting
+--------
 
 Unlike v1, cgroup v2 has only single hierarchy.  The cgroup v2
-hierarchy can be mounted with the following mount command.
+hierarchy can be mounted with the following mount command::
 
   # mount -t cgroup2 none $MOUNT_POINT
 
@@ -132,11 +152,31 @@ strongly discouraged for production use.  It is recommended to decide
 the hierarchies and controller associations before starting using the
 controllers after system boot.
 
+During transition to v2, system management software might still
+automount the v1 cgroup filesystem and so hijack all controllers
+during boot, before manual intervention is possible. To make testing
+and experimenting easier, the kernel parameter cgroup_no_v1= allows
+disabling controllers in v1 and make them always available in v2.
+
+cgroup v2 currently supports the following mount options.
+
+  nsdelegate
+
+	Consider cgroup namespaces as delegation boundaries.  This
+	option is system wide and can only be set on mount or modified
+	through remount from the init namespace.  The mount option is
+	ignored on non-init namespace mounts.  Please refer to the
+	Delegation section for details.
+
+
+Organizing Processes and Threads
+--------------------------------
 
-2-2. Organizing Processes
+Processes
+~~~~~~~~~
 
 Initially, only the root cgroup exists to which all processes belong.
-A child cgroup can be created by creating a sub-directory.
+A child cgroup can be created by creating a sub-directory::
 
   # mkdir $CGROUP_NAME
 
@@ -163,28 +203,127 @@ moved to another cgroup.
 A cgroup which doesn't have any children or live processes can be
 destroyed by removing the directory.  Note that a cgroup which doesn't
 have any children and is associated only with zombie processes is
-considered empty and can be removed.
+considered empty and can be removed::
 
   # rmdir $CGROUP_NAME
 
 "/proc/$PID/cgroup" lists a process's cgroup membership.  If legacy
 cgroup is in use in the system, this file may contain multiple lines,
 one for each hierarchy.  The entry for cgroup v2 is always in the
-format "0::$PATH".
+format "0::$PATH"::
 
   # cat /proc/842/cgroup
   ...
   0::/test-cgroup/test-cgroup-nested
 
 If the process becomes a zombie and the cgroup it was associated with
-is removed subsequently, " (deleted)" is appended to the path.
+is removed subsequently, " (deleted)" is appended to the path::
 
   # cat /proc/842/cgroup
   ...
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
-2-3. [Un]populated Notification
+Threads
+~~~~~~~
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Controllers which support thread mode are called threaded controllers.
+The ones which don't are called domain controllers.
+
+Marking a cgroup threaded makes it join the resource domain of its
+parent as a threaded cgroup.  The parent may be another threaded
+cgroup whose resource domain is further up in the hierarchy.  The root
+of a threaded subtree, that is, the nearest ancestor which is not
+threaded, is called threaded domain or thread root interchangeably and
+serves as the resource domain for the entire subtree.
+
+Inside a threaded subtree, threads of a process can be put in
+different cgroups and are not subject to the no internal process
+constraint - threaded controllers can be enabled on non-leaf cgroups
+whether they have threads in them or not.
+
+As the threaded domain cgroup hosts all the domain resource
+consumptions of the subtree, it is considered to have internal
+resource consumptions whether there are processes in it or not and
+can't have populated child cgroups which aren't threaded.  Because the
+root cgroup is not subject to no internal process constraint, it can
+serve both as a threaded domain and a parent to domain cgroups.
+
+The current operation mode or type of the cgroup is shown in the
+"cgroup.type" file which indicates whether the cgroup is a normal
+domain, a domain which is serving as the domain of a threaded subtree,
+or a threaded cgroup.
+
+On creation, a cgroup is always a domain cgroup and can be made
+threaded by writing "threaded" to the "cgroup.type" file.  The
+operation is single direction::
+
+  # echo threaded > cgroup.type
+
+Once threaded, the cgroup can't be made a domain again.  To enable the
+thread mode, the following conditions must be met.
+
+- As the cgroup will join the parent's resource domain.  The parent
+  must either be a valid (threaded) domain or a threaded cgroup.
+
+- The cgroup must be empty.  No enabled controllers, child cgroups or
+  processes.
+
+Topology-wise, a cgroup can be in an invalid state.  Please consider
+the following toplogy::
+
+  A (threaded domain) - B (threaded) - C (domain, just created)
+
+C is created as a domain but isn't connected to a parent which can
+host child domains.  C can't be used until it is turned into a
+threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
+these cases.  Operations which fail due to invalid topology use
+EOPNOTSUPP as the errno.
+
+A domain cgroup is turned into a threaded domain when one of its child
+cgroup becomes threaded or threaded controllers are enabled in the
+"cgroup.subtree_control" file while there are processes in the cgroup.
+A threaded domain reverts to a normal domain when the conditions
+clear.
+
+When read, "cgroup.threads" contains the list of the thread IDs of all
+threads in the cgroup.  Except that the operations are per-thread
+instead of per-process, "cgroup.threads" has the same format and
+behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
+written to in any cgroup, as it can only move threads inside the same
+threaded domain, its operations are confined inside each threaded
+subtree.
+
+The threaded domain cgroup serves as the resource domain for the whole
+subtree, and, while the threads can be scattered across the subtree,
+all the processes are considered to be in the threaded domain cgroup.
+"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
+processes in the subtree and is not readable in the subtree proper.
+However, "cgroup.procs" can be written to from anywhere in the subtree
+to migrate all threads of the matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the threaded domain cgroup.
+
+Because a threaded subtree is exempt from no internal process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+
+[Un]populated Notification
+--------------------------
 
 Each non-root cgroup has a "cgroup.events" file which contains
 "populated" field indicating whether the cgroup's sub-hierarchy has
@@ -195,7 +334,7 @@ example, to start a clean-up operation after all processes of a given
 sub-hierarchy have exited.  The populated state updates and
 notifications are recursive.  Consider the following sub-hierarchy
 where the numbers in the parentheses represent the numbers of processes
-in each cgroup.
+in each cgroup::
 
   A(4) - B(0) - C(1)
               \ D(0)
@@ -206,18 +345,20 @@ file modified events will be generated on the "cgroup.events" files of
 both cgroups.
 
 
-2-4. Controlling Controllers
+Controlling Controllers
+-----------------------
 
-2-4-1. Enabling and Disabling
+Enabling and Disabling
+~~~~~~~~~~~~~~~~~~~~~~
 
 Each cgroup has a "cgroup.controllers" file which lists all
-controllers available for the cgroup to enable.
+controllers available for the cgroup to enable::
 
   # cat cgroup.controllers
   cpu io memory
 
 No controller is enabled by default.  Controllers can be enabled and
-disabled by writing to the "cgroup.subtree_control" file.
+disabled by writing to the "cgroup.subtree_control" file::
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
@@ -229,7 +370,7 @@ are specified, the last one is effective.
 Enabling a controller in a cgroup indicates that the distribution of
 the target resource across its immediate children will be controlled.
 Consider the following sub-hierarchy.  The enabled controllers are
-listed in parentheses.
+listed in parentheses::
 
   A(cpu,memory) - B(memory) - C()
                             \ D()
@@ -249,7 +390,8 @@ controller interface files - anything which doesn't start with
 "cgroup." are owned by the parent rather than the cgroup itself.
 
 
-2-4-2. Top-down Constraint
+Top-down Constraint
+~~~~~~~~~~~~~~~~~~~
 
 Resources are distributed top-down and a cgroup can further distribute
 a resource only if the resource has been distributed to it from the
@@ -260,17 +402,18 @@ the parent has the controller enabled and a controller can't be
 disabled if one or more children have it enabled.
 
 
-2-4-3. No Internal Process Constraint
+No Internal Process Constraint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Non-root cgroups can only distribute resources to their children when
-they don't have any processes of their own.  In other words, only
-cgroups which don't contain any processes can have controllers enabled
-in their "cgroup.subtree_control" files.
+Non-root cgroups can distribute domain resources to their children
+only when they don't have any processes of their own.  In other words,
+only domain cgroups which don't contain any processes can have domain
+controllers enabled in their "cgroup.subtree_control" files.
 
-This guarantees that, when a controller is looking at the part of the
-hierarchy which has it enabled, processes are always only on the
-leaves.  This rules out situations where child cgroups compete against
-internal processes of the parent.
+This guarantees that, when a domain controller is looking at the part
+of the hierarchy which has it enabled, processes are always only on
+the leaves.  This rules out situations where child cgroups compete
+against internal processes of the parent.
 
 The root cgroup is exempt from this restriction.  Root contains
 processes and anonymous resource consumption which can't be associated
@@ -287,50 +430,62 @@ children before enabling controllers in its "cgroup.subtree_control"
 file.
 
 
-2-5. Delegation
+Delegation
+----------
 
-2-5-1. Model of Delegation
+Model of Delegation
+~~~~~~~~~~~~~~~~~~~
 
-A cgroup can be delegated to a less privileged user by granting write
-access of the directory and its "cgroup.procs" file to the user.  Note
-that resource control interface files in a given directory control the
-distribution of the parent's resources and thus must not be delegated
-along with the directory.
+A cgroup can be delegated in two ways.  First, to a less privileged
+user by granting write access of the directory and its "cgroup.procs",
+"cgroup.threads" and "cgroup.subtree_control" files to the user.
+Second, if the "nsdelegate" mount option is set, automatically to a
+cgroup namespace on namespace creation.
 
-Once delegated, the user can build sub-hierarchy under the directory,
-organize processes as it sees fit and further distribute the resources
-it received from the parent.  The limits and other settings of all
-resource controllers are hierarchical and regardless of what happens
-in the delegated sub-hierarchy, nothing can escape the resource
-restrictions imposed by the parent.
+Because the resource control interface files in a given directory
+control the distribution of the parent's resources, the delegatee
+shouldn't be allowed to write to them.  For the first method, this is
+achieved by not granting access to these files.  For the second, the
+kernel rejects writes to all files other than "cgroup.procs" and
+"cgroup.subtree_control" on a namespace root from inside the
+namespace.
+
+The end results are equivalent for both delegation types.  Once
+delegated, the user can build sub-hierarchy under the directory,
+organize processes inside it as it sees fit and further distribute the
+resources it received from the parent.  The limits and other settings
+of all resource controllers are hierarchical and regardless of what
+happens in the delegated sub-hierarchy, nothing can escape the
+resource restrictions imposed by the parent.
 
 Currently, cgroup doesn't impose any restrictions on the number of
 cgroups in or nesting depth of a delegated sub-hierarchy; however,
 this may be limited explicitly in the future.
 
 
-2-5-2. Delegation Containment
+Delegation Containment
+~~~~~~~~~~~~~~~~~~~~~~
 
 A delegated sub-hierarchy is contained in the sense that processes
-can't be moved into or out of the sub-hierarchy by the delegatee.  For
-a process with a non-root euid to migrate a target process into a
-cgroup by writing its PID to the "cgroup.procs" file, the following
-conditions must be met.
+can't be moved into or out of the sub-hierarchy by the delegatee.
 
-- The writer's euid must match either uid or suid of the target process.
+For delegations to a less privileged user, this is achieved by
+requiring the following conditions for a process with a non-root euid
+to migrate a target process into a cgroup by writing its PID to the
+"cgroup.procs" file.
 
 - The writer must have write access to the "cgroup.procs" file.
 
 - The writer must have write access to the "cgroup.procs" file of the
   common ancestor of the source and destination cgroups.
 
-The above three constraints ensure that while a delegatee may migrate
+The above two constraints ensure that while a delegatee may migrate
 processes around freely in the delegated sub-hierarchy it can't pull
 in from or push out to outside the sub-hierarchy.
 
 For an example, let's assume cgroups C0 and C1 have been delegated to
 user U0 who created C00, C01 under C0 and C10 under C1 as follows and
-all processes under C0 and C1 belong to U0.
+all processes under C0 and C1 belong to U0::
 
   ~~~~~~~~~~~~~ - C0 - C00
   ~ cgroup    ~      \ C01
@@ -339,15 +494,22 @@ all processes under C0 and C1 belong to U0.
 
 Let's also say U0 wants to write the PID of a process which is
 currently in C10 into "C00/cgroup.procs".  U0 has write access to the
-file and uid match on the process; however, the common ancestor of the
-source cgroup C10 and the destination cgroup C00 is above the points
-of delegation and U0 would not have write access to its "cgroup.procs"
-files and thus the write will be denied with -EACCES.
+file; however, the common ancestor of the source cgroup C10 and the
+destination cgroup C00 is above the points of delegation and U0 would
+not have write access to its "cgroup.procs" files and thus the write
+will be denied with -EACCES.
+
+For delegations to namespaces, containment is achieved by requiring
+that both the source and destination cgroups are reachable from the
+namespace of the process which is attempting the migration.  If either
+is not reachable, the migration is rejected with -ENOENT.
 
 
-2-6. Guidelines
+Guidelines
+----------
 
-2-6-1. Organize Once and Control
+Organize Once and Control
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Migrating a process across cgroups is a relatively expensive operation
 and stateful resources such as memory are not moved together with the
@@ -363,7 +525,8 @@ distribution can be made by changing controller configuration through
 the interface files.
 
 
-2-6-2. Avoid Name Collisions
+Avoid Name Collisions
+~~~~~~~~~~~~~~~~~~~~~
 
 Interface files for a cgroup and its children cgroups occupy the same
 directory and it is possible to create children cgroups which collide
@@ -381,14 +544,16 @@ cgroup doesn't do anything to prevent name collisions and it's the
 user's responsibility to avoid them.
 
 
-3. Resource Distribution Models
+Resource Distribution Models
+============================
 
 cgroup controllers implement several resource distribution schemes
 depending on the resource type and expected use cases.  This section
 describes major schemes in use along with their expected behaviors.
 
 
-3-1. Weights
+Weights
+-------
 
 A parent's resource is distributed by adding up the weights of all
 active children and giving each the fraction matching the ratio of its
@@ -409,7 +574,8 @@ process migrations.
 and is an example of this type.
 
 
-3-2. Limits
+Limits
+------
 
 A child can only consume upto the configured amount of the resource.
 Limits can be over-committed - the sum of the limits of children can
@@ -425,7 +591,8 @@ process migrations.
 on an IO device and is an example of this type.
 
 
-3-3. Protections
+Protections
+-----------
 
 A cgroup is protected to be allocated upto the configured amount of
 the resource if the usages of all its ancestors are under their
@@ -445,7 +612,8 @@ process migrations.
 example of this type.
 
 
-3-4. Allocations
+Allocations
+-----------
 
 A cgroup is exclusively allocated a certain amount of a finite
 resource.  Allocations can't be over-committed - the sum of the
@@ -464,12 +632,14 @@ may be rejected.
 type.
 
 
-4. Interface Files
+Interface Files
+===============
 
-4-1. Format
+Format
+------
 
 All interface files should be in one of the following formats whenever
-possible.
+possible::
 
   New-line separated values
   (when only one value can be written at once)
@@ -504,7 +674,8 @@ can be written at a time.  For nested keyed files, the sub key pairs
 may be specified in any order and not all pairs have to be specified.
 
 
-4-2. Conventions
+Conventions
+-----------
 
 - Settings for a single feature should be contained in a single file.
 
@@ -540,25 +711,25 @@ may be specified in any order and not all pairs have to be specified.
   with "default" as the value must not appear when read.
 
   For example, a setting which is keyed by major:minor device numbers
-  with integer values may look like the following.
+  with integer values may look like the following::
 
     # cat cgroup-example-interface-file
     default 150
     8:0 300
 
-  The default value can be updated by
+  The default value can be updated by::
 
     # echo 125 > cgroup-example-interface-file
 
-  or
+  or::
 
     # echo "default 125" > cgroup-example-interface-file
 
-  An override can be set by
+  An override can be set by::
 
     # echo "8:16 170" > cgroup-example-interface-file
 
-  and cleared by
+  and cleared by::
 
     # echo "8:0 default" > cgroup-example-interface-file
     # cat cgroup-example-interface-file
@@ -571,12 +742,35 @@ may be specified in any order and not all pairs have to be specified.
   generated on the file.
 
 
-4-3. Core Interface Files
+Core Interface Files
+--------------------
 
 All cgroup core files are prefixed with "cgroup."
 
-  cgroup.procs
+  cgroup.type
+
+	A read-write single value file which exists on non-root
+	cgroups.
+
+	When read, it indicates the current type of the cgroup, which
+	can be one of the following values.
+
+	- "domain" : A normal valid domain cgroup.
+
+	- "domain threaded" : A threaded domain cgroup which is
+          serving as the root of a threaded subtree.
+
+	- "domain invalid" : A cgroup which is in an invalid state.
+	  It can't be populated or have controllers enabled.  It may
+	  be allowed to become a threaded cgroup.
+
+	- "threaded" : A threaded cgroup which is a member of a
+          threaded subtree.
 
+	A cgroup can be turned into a threaded cgroup by writing
+	"threaded" to this file.
+
+  cgroup.procs
 	A read-write new-line separated values file which exists on
 	all cgroups.
 
@@ -590,9 +784,6 @@ All cgroup core files are prefixed with "cgroup."
 	the PID to the cgroup.  The writer should match all of the
 	following conditions.
 
-	- Its euid is either root or must match either uid or suid of
-          the target process.
-
 	- It must have write access to the "cgroup.procs" file.
 
 	- It must have write access to the "cgroup.procs" file of the
@@ -601,8 +792,36 @@ All cgroup core files are prefixed with "cgroup."
 	When delegating a sub-hierarchy, write access to this file
 	should be granted along with the containing directory.
 
-  cgroup.controllers
+	In a threaded cgroup, reading this file fails with EOPNOTSUPP
+	as all the processes belong to the thread root.  Writing is
+	supported and moves every thread of the process to the cgroup.
 
+  cgroup.threads
+	A read-write new-line separated values file which exists on
+	all cgroups.
+
+	When read, it lists the TIDs of all threads which belong to
+	the cgroup one-per-line.  The TIDs are not ordered and the
+	same TID may show up more than once if the thread got moved to
+	another cgroup and then back or the TID got recycled while
+	reading.
+
+	A TID can be written to migrate the thread associated with the
+	TID to the cgroup.  The writer should match all of the
+	following conditions.
+
+	- It must have write access to the "cgroup.threads" file.
+
+	- The cgroup that the thread is currently in must be in the
+          same resource domain as the destination cgroup.
+
+	- It must have write access to the "cgroup.procs" file of the
+	  common ancestor of the source and destination cgroups.
+
+	When delegating a sub-hierarchy, write access to this file
+	should be granted along with the containing directory.
+
+  cgroup.controllers
 	A read-only space separated values file which exists on all
 	cgroups.
 
@@ -610,7 +829,6 @@ All cgroup core files are prefixed with "cgroup."
 	the cgroup.  The controllers are not ordered.
 
   cgroup.subtree_control
-
 	A read-write space separated values file which exists on all
 	cgroups.  Starts out empty.
 
@@ -626,23 +844,25 @@ All cgroup core files are prefixed with "cgroup."
 	operations are specified, either all succeed or all fail.
 
   cgroup.events
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
 	otherwise, a value change in this file generates a file
 	modified event.
 
 	  populated
-
 		1 if the cgroup or its descendants contains any live
 		processes; otherwise, 0.
 
 
-5. Controllers
+Controllers
+===========
+
+CPU
+---
 
-5-1. CPU
+.. note::
 
-[NOTE: The interface for the cpu controller hasn't been merged yet]
+	The interface for the cpu controller hasn't been merged yet
 
 The "cpu" controllers regulates distribution of CPU cycles.  This
 controller implements weight and absolute bandwidth limit models for
@@ -650,36 +870,34 @@ normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
 
-5-1-1. CPU Interface Files
+CPU Interface Files
+~~~~~~~~~~~~~~~~~~~
 
 All time durations are in microseconds.
 
   cpu.stat
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 
-	It reports the following six stats.
+	It reports the following six stats:
 
-	  usage_usec
-	  user_usec
-	  system_usec
-	  nr_periods
-	  nr_throttled
-	  throttled_usec
+	- usage_usec
+	- user_usec
+	- system_usec
+	- nr_periods
+	- nr_throttled
+	- throttled_usec
 
   cpu.weight
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "100".
 
 	The weight in the range [1, 10000].
 
   cpu.max
-
 	A read-write two value file which exists on non-root cgroups.
 	The default is "max 100000".
 
-	The maximum bandwidth limit.  It's in the following format.
+	The maximum bandwidth limit.  It's in the following format::
 
 	  $MAX $PERIOD
 
@@ -688,9 +906,10 @@ All time durations are in microseconds.
 	one number is written, $MAX is updated.
 
   cpu.rt.max
+	.. note::
 
-  [NOTE: The semantics of this file is still under discussion and the
-   interface hasn't been merged yet]
+	   The semantics of this file is still under discussion and the
+	   interface hasn't been merged yet
 
 	A read-write two value file which exists on all cgroups.
 	The default is "0 100000".
@@ -698,7 +917,7 @@ All time durations are in microseconds.
 	The maximum realtime runtime allocation.  Over-committing
 	configurations are disallowed and process migrations are
 	rejected if not enough bandwidth is available.  It's in the
-	following format.
+	following format::
 
 	  $MAX $PERIOD
 
@@ -707,7 +926,8 @@ All time durations are in microseconds.
 	updated.
 
 
-5-2. Memory
+Memory
+------
 
 The "memory" controller regulates distribution of memory.  Memory is
 stateful and implements both limit and protection models.  Due to the
@@ -729,14 +949,14 @@ following types of memory usages are tracked.
 The above list may expand in the future for better coverage.
 
 
-5-2-1. Memory Interface Files
+Memory Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
 
 All memory amounts are in bytes.  If a value which is not aligned to
 PAGE_SIZE is written, the value may be rounded up to the closest
 PAGE_SIZE multiple when read back.
 
   memory.current
-
 	A read-only single value file which exists on non-root
 	cgroups.
 
@@ -744,7 +964,6 @@ PAGE_SIZE multiple when read back.
 	and its descendants.
 
   memory.low
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "0".
 
@@ -757,7 +976,6 @@ PAGE_SIZE multiple when read back.
 	protection is discouraged.
 
   memory.high
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -770,7 +988,6 @@ PAGE_SIZE multiple when read back.
 	under extreme conditions the limit may be breached.
 
   memory.max
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -785,21 +1002,18 @@ PAGE_SIZE multiple when read back.
 	utility is limited to providing the final safety net.
 
   memory.events
-
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
 	otherwise, a value change in this file generates a file
 	modified event.
 
 	  low
-
 		The number of times the cgroup is reclaimed due to
 		high memory pressure even though its usage is under
 		the low boundary.  This usually indicates that the low
 		boundary is over-committed.
 
 	  high
-
 		The number of times processes of the cgroup are
 		throttled and routed to perform direct memory reclaim
 		because the high memory boundary was exceeded.  For a
@@ -808,19 +1022,27 @@ PAGE_SIZE multiple when read back.
 		occurrences are expected.
 
 	  max
-
 		The number of times the cgroup's memory usage was
 		about to go over the max boundary.  If direct reclaim
-		fails to bring it down, the OOM killer is invoked.
+		fails to bring it down, the cgroup goes to OOM state.
 
 	  oom
+		The number of time the cgroup's memory usage was
+		reached the limit and allocation was about to fail.
 
-		The number of times the OOM killer has been invoked in
-		the cgroup.  This may not exactly match the number of
-		processes killed but should generally be close.
+		Depending on context result could be invocation of OOM
+		killer and retrying allocation or failing alloction.
 
-  memory.stat
+		Failed allocation in its turn could be returned into
+		userspace as -ENOMEM or siletly ignored in cases like
+		disk readahead.  For now OOM in memory cgroup kills
+		tasks iff shortage has happened inside page fault.
+
+	  oom_kill
+		The number of processes belonging to this cgroup
+		killed by any kind of OOM killer.
 
+  memory.stat
 	A read-only flat-keyed file which exists on non-root cgroups.
 
 	This breaks down the cgroup's memory footprint into different
@@ -834,53 +1056,98 @@ PAGE_SIZE multiple when read back.
 	fixed position; use the keys to look up specific values!
 
 	  anon
-
 		Amount of memory used in anonymous mappings such as
 		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
 
 	  file
-
 		Amount of memory used to cache filesystem data,
 		including tmpfs and shared memory.
 
-	  sock
+	  kernel_stack
+		Amount of memory allocated to kernel stacks.
+
+	  slab
+		Amount of memory used for storing in-kernel data
+		structures.
 
+	  sock
 		Amount of memory used in network transmission buffers
 
-	  file_mapped
+	  shmem
+		Amount of cached filesystem data that is swap-backed,
+		such as tmpfs, shm segments, shared anonymous mmap()s
 
+	  file_mapped
 		Amount of cached filesystem data mapped with mmap()
 
 	  file_dirty
-
 		Amount of cached filesystem data that was modified but
 		not yet written back to disk
 
 	  file_writeback
-
 		Amount of cached filesystem data that was modified and
 		is currently being written back to disk
 
-	  inactive_anon
-	  active_anon
-	  inactive_file
-	  active_file
-	  unevictable
-
+	  inactive_anon, active_anon, inactive_file, active_file, unevictable
 		Amount of memory, swap-backed and filesystem-backed,
 		on the internal memory management lists used by the
 		page reclaim algorithm
 
-	  pgfault
+	  slab_reclaimable
+		Part of "slab" that might be reclaimed, such as
+		dentries and inodes.
+
+	  slab_unreclaimable
+		Part of "slab" that cannot be reclaimed on memory
+		pressure.
 
+	  pgfault
 		Total number of page faults incurred
 
 	  pgmajfault
-
 		Number of major page faults incurred
 
-  memory.swap.current
+	  workingset_refault
+
+		Number of refaults of previously evicted pages
+
+	  workingset_activate
+
+		Number of refaulted pages that were immediately activated
+
+	  workingset_nodereclaim
+
+		Number of times a shadow node has been reclaimed
+
+	  pgrefill
+
+		Amount of scanned pages (in an active LRU list)
+
+	  pgscan
+
+		Amount of scanned pages (in an inactive LRU list)
+
+	  pgsteal
+
+		Amount of reclaimed pages
+
+	  pgactivate
+
+		Amount of pages moved to the active LRU list
+
+	  pgdeactivate
+
+		Amount of pages moved to the inactive LRU lis
+
+	  pglazyfree
 
+		Amount of pages postponed to be freed under memory pressure
+
+	  pglazyfreed
+
+		Amount of reclaimed lazyfree pages
+
+  memory.swap.current
 	A read-only single value file which exists on non-root
 	cgroups.
 
@@ -888,7 +1155,6 @@ PAGE_SIZE multiple when read back.
 	and its descendants.
 
   memory.swap.max
-
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
@@ -896,7 +1162,8 @@ PAGE_SIZE multiple when read back.
 	limit, anonymous meomry of the cgroup will not be swapped out.
 
 
-5-2-2. General Usage
+Usage Guidelines
+~~~~~~~~~~~~~~~~
 
 "memory.high" is the main mechanism to control memory usage.
 Over-committing on high limit (sum of high limits > available memory)
@@ -919,7 +1186,8 @@ memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
 
-5-2-3. Memory Ownership
+Memory Ownership
+~~~~~~~~~~~~~~~~
 
 A memory area is charged to the cgroup which instantiated it and stays
 charged to the cgroup until the area is released.  Migrating a process
@@ -937,7 +1205,8 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
-5-3. IO
+IO
+--
 
 The "io" controller regulates the distribution of IO resources.  This
 controller implements both weight based and absolute bandwidth or IOPS
@@ -946,28 +1215,29 @@ only if cfq-iosched is in use and neither scheme is available for
 blk-mq devices.
 
 
-5-3-1. IO Interface Files
+IO Interface Files
+~~~~~~~~~~~~~~~~~~
 
   io.stat
-
 	A read-only nested-keyed file which exists on non-root
 	cgroups.
 
 	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
 	The following nested keys are defined.
 
+	  ======	===================
 	  rbytes	Bytes read
 	  wbytes	Bytes written
 	  rios		Number of read IOs
 	  wios		Number of write IOs
+	  ======	===================
 
-	An example read output follows.
+	An example read output follows:
 
 	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
 	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
 
   io.weight
-
 	A read-write flat-keyed file which exists on non-root cgroups.
 	The default is "default 100".
 
@@ -981,14 +1251,13 @@ blk-mq devices.
 	$WEIGHT" or simply "$WEIGHT".  Overrides can be set by writing
 	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
 
-	An example read output follows.
+	An example read output follows::
 
 	  default 100
 	  8:16 200
 	  8:0 50
 
   io.max
-
 	A read-write nested-keyed file which exists on non-root
 	cgroups.
 
@@ -996,10 +1265,12 @@ blk-mq devices.
 	device numbers and not ordered.  The following nested keys are
 	defined.
 
+	  =====		==================================
 	  rbps		Max read bytes per second
 	  wbps		Max write bytes per second
 	  riops		Max read IO operations per second
 	  wiops		Max write IO operations per second
+	  =====		==================================
 
 	When writing, any number of nested key-value pairs can be
 	specified in any order.  "max" can be specified as the value
@@ -1009,24 +1280,25 @@ blk-mq devices.
 	BPS and IOPS are measured in each IO direction and IOs are
 	delayed if limit is reached.  Temporary bursts are allowed.
 
-	Setting read limit at 2M BPS and write at 120 IOPS for 8:16.
+	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
 
 	  echo "8:16 rbps=2097152 wiops=120" > io.max
 
-	Reading returns the following.
+	Reading returns the following::
 
 	  8:16 rbps=2097152 wbps=max riops=max wiops=120
 
-	Write IOPS limit can be removed by writing the following.
+	Write IOPS limit can be removed by writing the following::
 
 	  echo "8:16 wiops=max" > io.max
 
-	Reading now returns the following.
+	Reading now returns the following::
 
 	  8:16 rbps=2097152 wbps=max riops=max wiops=max
 
 
-5-3-2. Writeback
+Writeback
+~~~~~~~~~
 
 Page cache is dirtied through buffered writes and shared mmaps and
 written asynchronously to the backing filesystem by the writeback
@@ -1074,42 +1346,277 @@ patterns.
 The sysctl knobs which affect writeback behavior are applied to cgroup
 writeback as follows.
 
-  vm.dirty_background_ratio
-  vm.dirty_ratio
-
+  vm.dirty_background_ratio, vm.dirty_ratio
 	These ratios apply the same to cgroup writeback with the
 	amount of available memory capped by limits imposed by the
 	memory controller and system-wide clean memory.
 
-  vm.dirty_background_bytes
-  vm.dirty_bytes
-
+  vm.dirty_background_bytes, vm.dirty_bytes
 	For cgroup writeback, this is calculated into ratio against
 	total available memory and applied the same way as
 	vm.dirty[_background]_ratio.
 
 
-P. Information on Kernel Programming
+PID
+---
+
+The process number controller is used to allow a cgroup to stop any
+new tasks from being fork()'d or clone()'d after a specified limit is
+reached.
+
+The number of tasks in a cgroup can be exhausted in ways which other
+controllers cannot prevent, thus warranting its own controller.  For
+example, a fork bomb is likely to exhaust the number of tasks before
+hitting memory restrictions.
+
+Note that PIDs used in this controller refer to TIDs, process IDs as
+used by the kernel.
+
+
+PID Interface Files
+~~~~~~~~~~~~~~~~~~~
+
+  pids.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Hard limit of number of processes.
+
+  pids.current
+	A read-only single value file which exists on all cgroups.
+
+	The number of processes currently in the cgroup and its
+	descendants.
+
+Organisational operations are not blocked by cgroup policies, so it is
+possible to have pids.current > pids.max.  This can be done by either
+setting the limit to be smaller than pids.current, or attaching enough
+processes to the cgroup such that pids.current is larger than
+pids.max.  However, it is not possible to violate a cgroup PID policy
+through fork() or clone(). These will return -EAGAIN if the creation
+of a new process would cause a cgroup policy to be violated.
+
+
+RDMA
+----
+
+The "rdma" controller regulates the distribution and accounting of
+of RDMA resources.
+
+RDMA Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  rdma.max
+	A readwrite nested-keyed file that exists for all the cgroups
+	except root that describes current configured resource limit
+	for a RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+	The following nested keys are defined.
+
+	  ==========	=============================
+	  hca_handle	Maximum number of HCA Handles
+	  hca_object 	Maximum number of HCA Objects
+	  ==========	=============================
+
+	An example for mlx4 and ocrdma device follows::
+
+	  mlx4_0 hca_handle=2 hca_object=2000
+	  ocrdma1 hca_handle=3 hca_object=max
+
+  rdma.current
+	A read-only file that describes current resource usage.
+	It exists for all the cgroup except root.
+
+	An example for mlx4 and ocrdma device follows::
+
+	  mlx4_0 hca_handle=1 hca_object=20
+	  ocrdma1 hca_handle=1 hca_object=23
+
+
+Misc
+----
+
+perf_event
+~~~~~~~~~~
+
+perf_event controller, if not mounted on a legacy hierarchy, is
+automatically enabled on the v2 hierarchy so that perf events can
+always be filtered by cgroup v2 path.  The controller can still be
+moved to a legacy hierarchy after v2 hierarchy is populated.
+
+
+Namespace
+=========
+
+Basics
+------
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
+flag can be used with clone(2) and unshare(2) to create a new cgroup
+namespace.  The process running inside the cgroup namespace will have
+its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
+cgroupns root is the cgroup of the process at the time of creation of
+the cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For Example::
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered as system-data
+and undesirable to expose to the isolated processes.  cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see::
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes::
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside or
+mounts pinning it.  When the last usage goes away, the cgroup
+namespace is destroyed.  The cgroupns root and the actual cgroups
+remain.
+
+
+The Root and Views
+------------------
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup::
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns::
+
+  # sleep 100000 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible::
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see::
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+its relative to the cgroup namespace root of the caller.
+
+
+Migration and setns(2)
+----------------------
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns root at
+/batchjobs/container_id1, and assuming that the global hierarchy is
+still accessible inside cgroupns::
+
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+  # echo 7353 > batchjobs/container_id2/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/../container_id2
+
+Note that this kind of setup is not encouraged.  A task inside cgroup
+namespace should only be exposed to its own cgroupns hierarchy.
+
+setns(2) to another cgroup namespace is allowed when:
+
+(a) the process has CAP_SYS_ADMIN against its current user namespace
+(b) the process has CAP_SYS_ADMIN against the target cgroup
+    namespace's userns
+
+No implicit cgroup changes happen with attaching to another cgroup
+namespace.  It is expected that the someone moves the attaching
+process under the target cgroup namespace root.
+
+
+Interaction with Other Namespaces
+---------------------------------
+
+Namespace specific cgroup hierarchy can be mounted by a process
+running inside a non-init cgroup namespace::
+
+  # mount -t cgroup2 none $MOUNT_POINT
+
+This will mount the unified cgroup hierarchy with cgroupns root as the
+filesystem root.  The process needs CAP_SYS_ADMIN against its user and
+mount namespaces.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+provides a properly isolated cgroup view inside the container.
+
+
+Information on Kernel Programming
+=================================
 
 This section contains kernel programming information in the areas
 where interacting with cgroup is necessary.  cgroup core and
 controllers are not covered.
 
 
-P-1. Filesystem Support for Writeback
+Filesystem Support for Writeback
+--------------------------------
 
 A filesystem can support cgroup writeback by updating
 address_space_operations->writepage[s]() to annotate bio's using the
 following two functions.
 
   wbc_init_bio(@wbc, @bio)
-
 	Should be called for each bio carrying writeback data and
 	associates the bio with the inode's owner cgroup.  Can be
 	called anytime between bio allocation and submission.
 
   wbc_account_io(@wbc, @page, @bytes)
-
 	Should be called for each data segment being written out.
 	While this function doesn't care exactly when it's called
 	during the writeback session, it's the easiest and most
@@ -1130,11 +1637,12 @@ cases by skipping wbc_init_bio() or using bio_associate_blkcg()
 directly.
 
 
-D. Deprecated v1 Core Features
+Deprecated v1 Core Features
+===========================
 
 - Multiple hierarchies including named ones are not supported.
 
-- All mount options and remounting are not supported.
+- All v1 mount options are not supported.
 
 - The "tasks" file is removed and "cgroup.procs" is not sorted.
 
@@ -1144,9 +1652,11 @@ D. Deprecated v1 Core Features
   at the root instead.
 
 
-R. Issues with v1 and Rationales for v2
+Issues with v1 and Rationales for v2
+====================================
 
-R-1. Multiple Hierarchies
+Multiple Hierarchies
+--------------------
 
 cgroup v1 allowed an arbitrary number of hierarchies and each
 hierarchy could host any number of controllers.  While this seemed to
@@ -1198,7 +1708,8 @@ how memory is distributed beyond a certain level while still wanting
 to control how CPU cycles are distributed.
 
 
-R-2. Thread Granularity
+Thread Granularity
+------------------
 
 cgroup v1 allowed threads of a process to belong to different cgroups.
 This didn't make sense for some controllers and those controllers
@@ -1241,7 +1752,8 @@ misbehaving and poorly abstracted interfaces and kernel exposing and
 locked into constructs inadvertently.
 
 
-R-3. Competition Between Inner Nodes and Threads
+Competition Between Inner Nodes and Threads
+-------------------------------------------
 
 cgroup v1 allowed threads to be in any cgroups which created an
 interesting problem where threads belonging to a parent cgroup and its
@@ -1260,7 +1772,7 @@ simply weren't available for threads.
 
 The io controller implicitly created a hidden leaf node for each
 cgroup to host the threads.  The hidden leaf had its own copies of all
-the knobs with "leaf_" prefixed.  While this allowed equivalent
+the knobs with ``leaf_`` prefixed.  While this allowed equivalent
 control over internal threads, it was with serious drawbacks.  It
 always added an extra layer of nesting which wouldn't be necessary
 otherwise, made the interface messy and significantly complicated the
@@ -1281,7 +1793,8 @@ This clearly is a problem which needs to be addressed from cgroup core
 in a uniform way.
 
 
-R-4. Other Interface Issues
+Other Interface Issues
+----------------------
 
 cgroup v1 grew without oversight and developed a large number of
 idiosyncrasies and inconsistencies.  One issue on the cgroup core side
@@ -1309,9 +1822,11 @@ cgroup v2 establishes common conventions where appropriate and updates
 controllers so that they expose minimal and consistent interfaces.
 
 
-R-5. Controller Issues and Remedies
+Controller Issues and Remedies
+------------------------------
 
-R-5-1. Memory
+Memory
+~~~~~~
 
 The original lower boundary, the soft limit, is defined as a limit
 that is per default unset.  As a result, the set of cgroups that
@@ -1368,6 +1883,12 @@ system than killing the group.  Otherwise, memory.max is there to
 limit this type of spillover and ultimately contain buggy or even
 malicious applications.
 
+Setting the original memory.limit_in_bytes below the current usage was
+subject to a race condition, where concurrent charges could cause the
+limit setting to fail. memory.max on the other hand will first set the
+limit to prevent new charges, and then reclaim and OOM kill until the
+new limit is met - or the task writing to memory.max is killed.
+
 The combined memory+swap accounting and limiting is replaced by real
 control over swap space.