mirror of
				https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
				synced 2025-11-04 08:34:47 +10:00 
			
		
		
		
	Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
		
			
				
	
	
		
			186 lines
		
	
	
		
			6.8 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			186 lines
		
	
	
		
			6.8 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
==========================
 | 
						|
Real-Time group scheduling
 | 
						|
==========================
 | 
						|
 | 
						|
.. CONTENTS
 | 
						|
 | 
						|
   0. WARNING
 | 
						|
   1. Overview
 | 
						|
     1.1 The problem
 | 
						|
     1.2 The solution
 | 
						|
   2. The interface
 | 
						|
     2.1 System-wide settings
 | 
						|
     2.2 Default behaviour
 | 
						|
     2.3 Basis for grouping tasks
 | 
						|
   3. Future plans
 | 
						|
 | 
						|
 | 
						|
0. WARNING
 | 
						|
==========
 | 
						|
 | 
						|
 Fiddling with these settings can result in an unstable system, the knobs are
 | 
						|
 root only and assumes root knows what he is doing.
 | 
						|
 | 
						|
Most notable:
 | 
						|
 | 
						|
 * very small values in sched_rt_period_us can result in an unstable
 | 
						|
   system when the period is smaller than either the available hrtimer
 | 
						|
   resolution, or the time it takes to handle the budget refresh itself.
 | 
						|
 | 
						|
 * very small values in sched_rt_runtime_us can result in an unstable
 | 
						|
   system when the runtime is so small the system has difficulty making
 | 
						|
   forward progress (NOTE: the migration thread and kstopmachine both
 | 
						|
   are real-time processes).
 | 
						|
 | 
						|
1. Overview
 | 
						|
===========
 | 
						|
 | 
						|
 | 
						|
1.1 The problem
 | 
						|
---------------
 | 
						|
 | 
						|
Realtime scheduling is all about determinism, a group has to be able to rely on
 | 
						|
the amount of bandwidth (eg. CPU time) being constant. In order to schedule
 | 
						|
multiple groups of realtime tasks, each group must be assigned a fixed portion
 | 
						|
of the CPU time available.  Without a minimum guarantee a realtime group can
 | 
						|
obviously fall short. A fuzzy upper limit is of no use since it cannot be
 | 
						|
relied upon. Which leaves us with just the single fixed portion.
 | 
						|
 | 
						|
1.2 The solution
 | 
						|
----------------
 | 
						|
 | 
						|
CPU time is divided by means of specifying how much time can be spent running
 | 
						|
in a given period. We allocate this "run time" for each realtime group which
 | 
						|
the other realtime groups will not be permitted to use.
 | 
						|
 | 
						|
Any time not allocated to a realtime group will be used to run normal priority
 | 
						|
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
 | 
						|
SCHED_OTHER.
 | 
						|
 | 
						|
Let's consider an example: a frame fixed realtime renderer must deliver 25
 | 
						|
frames a second, which yields a period of 0.04s per frame. Now say it will also
 | 
						|
have to play some music and respond to input, leaving it with around 80% CPU
 | 
						|
time dedicated for the graphics. We can then give this group a run time of 0.8
 | 
						|
* 0.04s = 0.032s.
 | 
						|
 | 
						|
This way the graphics group will have a 0.04s period with a 0.032s run time
 | 
						|
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
 | 
						|
needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
 | 
						|
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
 | 
						|
of 0.00015s.
 | 
						|
 | 
						|
The remaining CPU time will be used for user input and other tasks. Because
 | 
						|
realtime tasks have explicitly allocated the CPU time they need to perform
 | 
						|
their tasks, buffer underruns in the graphics or audio can be eliminated.
 | 
						|
 | 
						|
NOTE: the above example is not fully implemented yet. We still
 | 
						|
lack an EDF scheduler to make non-uniform periods usable.
 | 
						|
 | 
						|
 | 
						|
2. The Interface
 | 
						|
================
 | 
						|
 | 
						|
 | 
						|
2.1 System wide settings
 | 
						|
------------------------
 | 
						|
 | 
						|
The system wide settings are configured under the /proc virtual file system:
 | 
						|
 | 
						|
/proc/sys/kernel/sched_rt_period_us:
 | 
						|
  The scheduling period that is equivalent to 100% CPU bandwidth
 | 
						|
 | 
						|
/proc/sys/kernel/sched_rt_runtime_us:
 | 
						|
  A global limit on how much time realtime scheduling may use.  Even without
 | 
						|
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
 | 
						|
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
 | 
						|
  available to all realtime groups.
 | 
						|
 | 
						|
  * Time is specified in us because the interface is s32. This gives an
 | 
						|
    operating range from 1us to about 35 minutes.
 | 
						|
  * sched_rt_period_us takes values from 1 to INT_MAX.
 | 
						|
  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
 | 
						|
  * A run time of -1 specifies runtime == period, ie. no limit.
 | 
						|
 | 
						|
 | 
						|
2.2 Default behaviour
 | 
						|
---------------------
 | 
						|
 | 
						|
The default values for sched_rt_period_us (1000000 or 1s) and
 | 
						|
sched_rt_runtime_us (950000 or 0.95s).  This gives 0.05s to be used by
 | 
						|
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
 | 
						|
realtime tasks will not lock up the machine but leave a little time to recover
 | 
						|
it.  By setting runtime to -1 you'd get the old behaviour back.
 | 
						|
 | 
						|
By default all bandwidth is assigned to the root group and new groups get the
 | 
						|
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
 | 
						|
want to assign bandwidth to another group, reduce the root group's bandwidth
 | 
						|
and assign some or all of the difference to another group.
 | 
						|
 | 
						|
Realtime group scheduling means you have to assign a portion of total CPU
 | 
						|
bandwidth to the group before it will accept realtime tasks. Therefore you will
 | 
						|
not be able to run realtime tasks as any user other than root until you have
 | 
						|
done that, even if the user has the rights to run processes with realtime
 | 
						|
priority!
 | 
						|
 | 
						|
 | 
						|
2.3 Basis for grouping tasks
 | 
						|
----------------------------
 | 
						|
 | 
						|
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
 | 
						|
CPU bandwidth to task groups.
 | 
						|
 | 
						|
This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
 | 
						|
to control the CPU time reserved for each control group.
 | 
						|
 | 
						|
For more information on working with control groups, you should read
 | 
						|
Documentation/admin-guide/cgroup-v1/cgroups.rst as well.
 | 
						|
 | 
						|
Group settings are checked against the following limits in order to keep the
 | 
						|
configuration schedulable:
 | 
						|
 | 
						|
   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
 | 
						|
 | 
						|
For now, this can be simplified to just the following (but see Future plans):
 | 
						|
 | 
						|
   \Sum_{i} runtime_{i} <= global_runtime
 | 
						|
 | 
						|
 | 
						|
3. Future plans
 | 
						|
===============
 | 
						|
 | 
						|
There is work in progress to make the scheduling period for each group
 | 
						|
("<cgroup>/cpu.rt_period_us") configurable as well.
 | 
						|
 | 
						|
The constraint on the period is that a subgroup must have a smaller or
 | 
						|
equal period to its parent. But realistically its not very useful _yet_
 | 
						|
as its prone to starvation without deadline scheduling.
 | 
						|
 | 
						|
Consider two sibling groups A and B; both have 50% bandwidth, but A's
 | 
						|
period is twice the length of B's.
 | 
						|
 | 
						|
* group A: period=100000us, runtime=50000us
 | 
						|
 | 
						|
	- this runs for 0.05s once every 0.1s
 | 
						|
 | 
						|
* group B: period= 50000us, runtime=25000us
 | 
						|
 | 
						|
	- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
 | 
						|
 | 
						|
This means that currently a while (1) loop in A will run for the full period of
 | 
						|
B and can starve B's tasks (assuming they are of lower priority) for a whole
 | 
						|
period.
 | 
						|
 | 
						|
The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
 | 
						|
full deadline scheduling to the linux kernel. Deadline scheduling the above
 | 
						|
groups and treating end of the period as a deadline will ensure that they both
 | 
						|
get their allocated time.
 | 
						|
 | 
						|
Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
 | 
						|
the biggest challenge as the current linux PI infrastructure is geared towards
 | 
						|
the limited static priority levels 0-99. With deadline scheduling you need to
 | 
						|
do deadline inheritance (since priority is inversely proportional to the
 | 
						|
deadline delta (deadline - now)).
 | 
						|
 | 
						|
This means the whole PI machinery will have to be reworked - and that is one of
 | 
						|
the most complex pieces of code we have.
 |