Documentation/prctl/seccomp_filter.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214

		Seccomp filtering
		=================

Introduction
------------

A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated.  A
certain subset of userland applications benefit by having a reduced set
of available system calls.  The resulting set reduces the total kernel
surface exposed to the application.  System call filtering is meant for
use with those applications.

The implementation currently leverages both the existing seccomp
infrastructure and the kernel tracing infrastructure.  By centralizing
hooks for attack surface reduction in seccomp, it is possible to assure
attention to security that is less relevant in normal ftrace scenarios,
such as time-of-check, time-of-use attacks.  However, ftrace provides a
rich, human-friendly environment for interfacing with system call
specific arguments.  (As such, this requires FTRACE_SYSCALLS for any
introspective filtering support.)


What it isn't
-------------

System call filtering isn't a sandbox.  It provides a clearly defined
mechanism for minimizing the exposed kernel surface.  Beyond that,
policy for logical behavior and information flow should be managed with
a combinations of other system hardening techniques and, potentially, a
LSM of your choosing.  Expressive, dynamic filters based on the ftrace
filter engine provide further options down this path (avoiding
pathological sizes or selecting which of the multiplexed system calls in
socketcall() is allowed, for instance) which could be construed,
incorrectly, as a more complete sandboxing solution.


Usage
-----

An additional seccomp mode is exposed through mode '13'.
This mode depends on CONFIG_SECCOMP_FILTER.  By default, it provides
only the most trivial of filter support "1" or cleared.  However, if
CONFIG_FTRACE_SYSCALLS is enabled, the ftrace filter engine may be used
for more expressive filters.

A collection of filters may be supplied via prctl, and the current set
of filters is exposed in /proc/<pid>/seccomp_filter.

Interacting with seccomp filters can be done through three new prctl calls
and one existing one.

PR_SET_SECCOMP:
	A pre-existing option for enabling strict seccomp mode (1) or
	filtering seccomp (13).  Performing this call (13) will enable
	the current set of defined filters.

	Usage:
		prctl(PR_SET_SECCOMP, 1);  /* strict */
		prctl(PR_SET_SECCOMP, 13);  /* filters */

PR_SET_SECCOMP_FILTER:
	Allows the specification of a new filter for a given system
	call, by number, and filter string.  By default, the filter
	string may only be "1".  However, if CONFIG_FTRACE_SYSCALLS is
	supported, the filter string may make use of the ftrace
	filtering language's awareness of system call arguments.

	In addition, the event id for the system call entry may be
	specified in lieu of the system call number itself, as
	determined by the 'type' argument.  This allows for the future
	addition of seccomp-based filtering on other registered,
	relevant ftrace events.

	All calls to PR_SET_SECCOMP_FILTER for a given system
	call will append the supplied string to any existing filters.
	Filter construction looks as follows:
		(Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2
		... + "fd != 2" => (fd == 1 || fd == 2) && (fd != 2)
		... + "size < 100" =>
		    ((fd == 1 || fd == 2) && (fd != 2)) && (size < 100)
	If there is no filter and the seccomp mode has already
	transitioned to filtering, additions cannot be made.  Filters
	may only be added that reduce the available kernel surface.

	Usage (per the construction example above):
		unsigned long type = PR_SECCOMP_FILTER_SYSCALL;
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"fd == 1 || fd == 2");
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"fd != 2");
		prctl(PR_SET_SECCOMP_FILTER, type, __NR_write,
			"size < 100");

	The 'type' argument may be one of PR_SECCOMP_FILTER_SYSCALL or
	PR_SECCOMP_FILTER_EVENT.

PR_CLEAR_SECCOMP_FILTER:
	Removes all filter entries for a given system call number or
	event id.  When called prior to entering seccomp filtering mode,
	it allows for new filters to be applied to the same system call.
	After transition, however, it completely drops access to the
	call.

	Usage:
		prctl(PR_CLEAR_SECCOMP_FILTER,
			PR_SECCOMP_FILTER_SYSCALL, __NR_open);

PR_GET_SECCOMP_FILTER:
	Returns the aggregated filter string for a system call into a
	user-supplied buffer of a given length.

	Usage:
		prctl(PR_GET_SECCOMP_FILTER,
			PR_SECCOMP_FILTER_SYSCALL, __NR_write, buf,
			sizeof(buf));

All of the above calls return 0 on success and non-zero on error.  If
CONFIG_FTRACE_SYSCALLS is not supported and a rich-filter was specified,
the caller may check the errno for -ENOSYS.  The same is true if
specifying an filter by the event id fails to discover any relevant
event entries.


Example
-------

Assume a process would like to cleanly read and write to stdin/out/err
as well as access its filters after seccomp enforcement begins.  This
may be done as follows:

  int filter_syscall(int nr, char *buf) {
    return prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL,
                 nr, buf);
  }

  filter_syscall(__NR_read, "fd == 0");
  filter_syscall(_NR_write, "fd == 1 || fd == 2");
  filter_syscall(__NR_exit, "1");
  filter_syscall(__NR_prctl, "1");
  prctl(PR_SET_SECCOMP, 13);

  /* Do stuff with fdset . . .*/

  /* Drop read access and keep only write access to fd 1. */
  prctl(PR_CLEAR_SECCOMP_FILTER, PR_SECCOMP_FILTER_SYSCALL, __NR_read);
  filter_syscall(__NR_write, "fd != 2");

  /* Perform any final processing . . . */
  syscall(__NR_exit, 0);


Inheritance
-----------

Changing the availability of the kernel ABI at runtime runs the risk of
providing access to unreachable code paths in normal applications.  To
avoid the pitfalls that accompany this risk, seccomp filter inheritance
is restricted.

Filters may be inherited across a fork/clone if they have been activated
by a call to prctl(PR_SET_SECCOMP, 13).  If the process had the
CAP_SYS_ADMIN capability when configuring the filters, they may also be
inherited across an execve call.

Inherited filters may not be modified by the child process.  If child
would like to further restrict the available system calls, it may
perform the same calls as discussed earlier: set, clear, get, and
finally prctl(PR_SET_SECCOMP, 13).  Until the child calls
PR_SET_SECCOMP, their filters will be ignored and only the inherited
filter will be evaluated.  After a successful PR_SET_SECCOMP call, all
system calls performed by the child will be checked against the filters
that were specified by itself and then against the filters supplied by
any ancestors.  Any system call used must be allowed by all of the
filters tested.  This composition of the ancestral seccomp filters and
process-local filters guarantees that only the minimal set of system calls
will be permitted at any point and without child processes needing to be
aware of any prior system call filtering.

(If, for instance, the child process merely copied the parent filters
and then extended them, the child would be required to enumerate all
existing filters to determine which needed to be dropped.)


Caveats
-------

- Avoid using a filter of "0" to disable a filter.  Always favor calling
  prctl(PR_CLEAR_SECCOMP_FILTER, ...).  Otherwise the behavior may vary
  depending on if CONFIG_FTRACE_SYSCALLS support exists -- though an
  error will be returned if the support is missing.

- Some platforms support a 32-bit userspace with 64-bit kernels.  In
  these cases (CONFIG_COMPAT), system call numbers may not match across
  64-bit and 32-bit system calls. When the first PRCTL_SET_SECCOMP_FILTER
  is called, the in-memory filters state is annotated with whether the
  call has been made via the compat interface.  All subsequent calls will
  be checked for compat call mismatch.  In the long run, it may make sense
  to store compat and non-compat filters separately, but that is not
  supported at present. Once one type of system call interface has been
  used, it must be continued to be used.


Adding architecture support
-----------------------

Any platform with seccomp support should be able to support the bare
minimum of seccomp filter features.  However, since seccomp_filter
requires that execve be blocked, it expects the architecture to expose a
__NR_seccomp_execve define that maps to the execve system call number.
On platforms where CONFIG_COMPAT applies, __NR_seccomp_execve_32 must
also be provided.  Once those macros exist, "select HAVE_SECCOMP_FILTER"
support may be added to the architectures Kconfig.