diff --git a/isolate/isolate.1.txt b/isolate/isolate.1.txt
new file mode 100644
--- /dev/null
+++ b/isolate/isolate.1.txt
@@ -0,0 +1,348 @@
+ISOLATE(1)
+==========
+
+NAME
+----
+isolate - Isolate a process using Linux Containers
+
+SYNOPSIS
+--------
+*isolate* 'options' *--init*
+
+*isolate* 'options' *--run* +--+ 'program' 'arguments'
+
+*isolate* 'options' *--cleanup*
+
+DESCRIPTION
+-----------
+Run 'program' within a sandbox, so that it cannot communicate with the
+outside world and its resource consumption is limited. This can be used
+for example in a programming contest to run untrusted programs submitted
+by contestants in a controlled environment.
+
+The sandbox is used in the following way:
+
+* Run *isolate --init*, which initializes the sandbox, creates its working directory and
+prints its name to the standard output. Fails if the sandbox already exists.
+
+* Populate the directory with the executable file of the program and its
+input files.
+
+* Call *isolate --run* to run the program. A single line describing the
+status of the program is written to the standard error stream.
+
+* Fetch the output of the program from the directory.
+
+* Run *isolate --cleanup* to remove temporary files. Does nothing if the sandbox
+was already cleaned up.
+
+Please note that by default, the program is not allowed to start multiple
+processes or threads. If you need that, turn on the control group mode
+(see below).
+
+OPTIONS
+-------
+*-M, --meta=*'file'::
+	Output meta-data on the execution of the program to a given file.
+	See below for syntax of the meta-files.
+
+*-m, --mem=*'size'::
+	Limit address space of the program to 'size' kilobytes. If more processes
+	are allowed, this applies to each of them separately.
+
+*-t, --time=*'time'::
+	Limit run time of the program to 'time' seconds. Fractional numbers are allowed.
+	Time in which the OS assigns the processor to other tasks is not counted.
+
+*-w, --wall-time=*'time'::
+	Limit wall-clock time to 'time' seconds. Fractional values are allowed.
+	This clock measures the time from the start of the program to its exit,
+	so it does not stop when the program has lost the CPU or when it is waiting
+	for an external event. We recommend using *--time* as the main limit
+	and setting *--wall-time* to a much higher value as a precaution against
+	sleeping programs.
+
+*-x, --extra-time=*'time'::
+	When a time limit is exceeded, wait for extra 'time' seconds before
+	killing the program. This has the advantage that the real execution time
+	is reported, even though it slightly exceeds the limit. Fractional
+	numbers are again allowed.
+
+*-b, --box-id=*'id'::
+	When you run multiple sandboxes in parallel, you have to assign a unique
+	ID to each of them using this option. See the discussion on UIDs in the INSTALLATION
+	section. The ID defaults to 0.
+
+*-k, --stack=*'size'::
+	Limit process stack to 'size' kilobytes. By default, the whole address
+	space is available for the stack, but it is subject to the *--mem* limit.
+
+*-f, --fsize=*'size'::
+	Limit size of files created (or modified) by the program to 'size' kilobytes.
+	In most cases, it is better to restrict overall disk usage by a disk quota
+	(see below). This option can help in cases when quotas are not enabled
+	on the underlying filesystem.
+
+*-q, --quota=*'blocks'*,*'inodes'::
+	Set disk quota to a given number of blocks and inodes. This requires the
+	filesystem to be mounted with support for quotas. Please note that this
+	currently works only on the ext family of filesystems (other filesystems
+	use other interfaces for setting quotas).
+
+*-i, --stdin=*'file'::
+	Redirect standard input from 'file'. The 'file' has to be accessible
+	inside the sandbox. Without this option, standard input is inherited from the
+	parent process.
+
+*-o, --stdout=*'file'::
+	Redirect standard output to 'file'. The 'file' has to be accessible
+	inside the sandbox. Without this option, standard output is inherited from the
+	parent process and the sandbox manager does not write anything to it.
+
+*-r, --stderr=*'file'::
+	Redirect standard error output to 'file'. The 'file' has to be accessible
+	inside the sandbox. Without this option, standard error output is inherited from the
+	parent process. See also *--stderr-to-stdout*.
+
+*--stderr-to-stdout*::
+	Redirect standard error output to standard output. This is performed after
+	the standard output is redirected by *--stdout*. Mutually exclusive with *--stderr*.
+
+*-c, --chdir=*'dir'::
+	Change directory to 'dir' before executing the program. This path must be
+	relative to the root of the sandbox.
+
+*-p, --processes*[*=*'max']::
+	Permit the program to create up to 'max' processes and/or threads. Please
+	keep in mind that time and memory limits do not work with multiple processes
+	unless you enable the control group mode. If 'max' is not given, an arbitrary
+	number of processes can be run. By default, only one process is permitted.
+
+*--share-net*::
+	By default, isolate creates a new network namespace for its child process.
+	This namespace contains no network devices except for a per-namespace loopback.
+	This prevents the program from communicating with the outside world. If you want
+	to permit communication, you can use this switch to keep the child process
+	in the parent's network namespace.
+
+*--inherit-fds*::
+	By default, isolate closes all file descriptors passed from its parent
+	except for descriptors 0, 1, and 2.
+	This prevents unintentional descriptor leaks. In some cases, passing extra
+	descriptors to the sandbox can be desirable, so you can use this switch
+	to make them survive.
+
+*-v, --verbose*::
+	Tell the sandbox manager to be verbose and report on what is going on.
+	Using *-v* multiple times produces even more jabber.
+
+*-s, --silent*::
+	Tell the sandbox manager to keep silent. No status messages are printed
+	to stderr except for fatal errors of the sandbox itself. The combination of
+	*--verbose* and *--silent* has an undefined effect.
+
+ENVIRONMENT RULES
+-----------------
+UNIX processes normally inherit all environment variables from their parent. The
+sandbox, however, passes only those variables which are explicitly requested by
+environment rules:
+
+*-E, --env=*'var'::
+	Inherit the variable 'var' from the parent.
+
+*-E, --env=*'var'*=*'value'::
+	Set the variable 'var' to 'value'. When the 'value' is empty, the
+	variable is removed from the environment.
+
+*-e, --full-env*::
+	Inherit all variables from the parent.
+
+The rules are applied in the order in which they were given, except for
+*--full-env*, which is applied first.
+
+The list of rules is automatically initialized with *-ELIBC_FATAL_STDERR_=1*.
+
+DIRECTORY RULES
+---------------
+The sandboxed process gets its own filesystem namespace, which contains only subtrees
+requested by directory rules:
+
+*-d, --dir=*'in'*=*'out'[*:*'options']::
+	Bind the directory 'out' as seen by the caller to the path 'in' inside the sandbox.
+	If there already was a directory rule for 'in', it is replaced.
+
+*-d, --dir=*'dir'[*:*'options']::
+	Bind the directory +/+'dir' to 'dir' inside the sandbox.
+	If there already was a directory rule for 'dir', it is replaced.
+
+*-d, --dir=*'in'*=*::
+	Remove a directory rule for the path 'in' inside the sandbox.
+
+By default, all directories are bound read-only and restricted (no devices,
+no setuid binaries). This behavior can be modified using the 'options':
+
+*rw*::
+	Allow read-write access.
+
+*dev*::
+	Allow access to character and block devices.
+
+*noexec*::
+	Disallow execution of binaries.
+
+*maybe*::
+	Silently ignore the rule if the directory to be bound does not exist.
+
+*fs*::
+	Instead of binding a directory, mount a device-less filesystem called 'in'.
+	For example, this can be 'proc' or 'sysfs'.
+
+Unless *--no-default-dirs* is specified, the default set of directory rules binds +/bin+,
++/dev+ (with devices allowed), +/lib+, +/lib64+ (if it exists), and +/usr+. It also binds
+the working directory to +/box+ (read-write) and mounts the proc filesystem at +/proc+.
+
+*-D, --no-default-dirs*::
+	Do not bind the default set of directories. Care has to be taken to specify
+	the correct set of rules (using *--dir*) for the executed program to run
+	correctly. In particular, +/box+ has to be bound.
+
+CONTROL GROUPS
+--------------
+Isolate can make use of system control groups provided by the kernel
+to constrain programs consisting of multiple processes. Please note
+that this feature needs special system setup described in the INSTALLATION
+section.
+
+*--cg*::
+	Enable use of control groups. This should be specified with *--init*,
+	*--run* and *--cleanup*.
+
+*--cg-mem=*'size'::
+	Limit total memory usage by the whole control group to 'size' kilobytes.
+	This should be specified with *--run*.
+
+*--cg-timing*::
+	Use control groups for timing, so that the *--time* switch affects the
+	total run time of all processes and threads in the control group.
+	This should be specified with *--run*.
+	This option is turned on by default; use *--no-cg-timing* to turn it off.
+
+META-FILES
+----------
+The meta-file contains miscellaneous meta-information on execution of the
+program within the sandbox. It is a textual file consisting of lines
+of format 'key'*:*'value'. The following keys are defined:
+
+*cg-mem*::
+	When control groups are enabled, this is the total memory use
+	by the whole control group (in kilobytes).
+*cg-oom-killed*::
+	Present when the program was killed by the out-of-memory killer
+	(e.g., because it has exceeded the memory limit of its control group).
+	This is reported only on Linux 4.13 and later.
+*csw-forced*::
+	Number of context switches forced by the kernel.
+*csw-voluntary*::
+	Number of context switches caused by the process giving up the CPU
+	voluntarily.
+*exitcode*::
+	The program has exited normally with this exit code.
+*exitsig*::
+	The program has exited after receiving this fatal signal.
+*killed*::
+	Present when the program was terminated by the sandbox
+	(e.g., because it has exceeded the time limit).
+*max-rss*::
+	Maximum resident set size of the process (in kilobytes).
+*message*::
+	Status message, not intended for machine processing.
+	E.g., "Time limit exceeded."
+*status*::
+	Two-letter status code:
+	* *RE* -- run-time error, i.e., exited with a non-zero exit code
+	* *SG* -- program died on a signal
+	* *TO* -- timed out
+	* *XX* -- internal error of the sandbox
+*time*::
+	Run time of the program in fractional seconds.
+*time-wall*::
+	Wall clock time of the program in fractional seconds.
+
+Please note that not all keys have to be present.
+For example, neither *status* nor *message* is reported upon normal termination.
+
+RETURN VALUE
+------------
+When the program inside the sandbox finishes correctly, the sandbox returns 0.
+If it finishes incorrectly, it returns 1.
+All other return codes signal an internal error.
+
+INSTALLATION
+------------
+Isolate depends on several advanced features of the Linux kernel. Please
+make sure that your kernel supports
+PID namespaces (+CONFIG_PID_NS+),
+IPC namespaces (+CONFIG_IPC_NS+), and
+network namespaces (+CONFIG_NET_NS+).
+If you want to use control groups, you need
+the cpuset controller (+CONFIG_CPUSETS+),
+CPU accounting controller (+CONFIG_CGROUP_CPUACCT+), and
+memory resource controller (+CONFIG_MEMCG+). If your machine has swap enabled,
+you should also enable the swap controller (+CONFIG_MEMCG_SWAP+).
+
+Debian 7.x and newer require enabling the memory and swap cgroup controllers by
+adding the parameters "cgroup_enable=memory swapaccount=1" to the kernel
+command-line, which can be set using +GRUB_CMDLINE_LINUX_DEFAULT+ in
+/etc/default/grub.
+
+Isolate is designed to run setuid to root. The sub-process inside the sandbox
+then switches to a non-privileged user ID (different for each *--box-id*).
+The range of UIDs available and several filesystem paths are set in a configuration
+file, by default located in /usr/local/etc/isolate.
+
+Before you run isolate with control groups, you need to ensure that the cgroup
+filesystem is enabled and mounted. Most modern Linux distributions already
+provide cgroup support through a tmpfs mounted at /sys/fs/cgroup, with
+individual controllers mounted within subdirectories.
+
+REPRODUCIBILITY
+---------------
+
+The reproducibility of results can be improved by tuning some kernel
+parameters, listed below. Some of these parameters can be checked using the
+program isolate-check-environment.
+
+* Disable address space randomization: +sysctl kernel.randomize_va_space=0+.
+Address space randomization can affect timing, memory usage, and program
+behavior. This setting can be made persistent through /etc/sysctl.d/.
+
+* Disable dynamic CPU frequency scaling. This requires setting the cpufreq
+scaling governor to +performance+. The process for doing this varies between
+distributions.
+
+* Consider disabling Turbo Boost on CPUs that might support it (most i3/i5/i7
+Intel CPUs). Approach this one with caution. Disabling it on a CPU that boosts
+from 2.3 GHz to 2.6 GHz would have minimal impact on run-times in exchange
+for determinism, but the same on a CPU that boosts from 1.6 GHz to 2.8
+GHz will incur a much more dramatic slowdown. Perhaps if the ambient
+temperature is controlled and only one single-threaded task is keeping the
+CPU busy at 100%, then Turbo Boost's behavior may be reasonably deterministic;
+this requires further experimentation to confirm.
+
+* Run evaluations on a single CPU (core). The Linux scheduler has a tendency to randomly
+migrate tasks between CPUs, incurring cache migration costs. You can use isolate's
+configuration file to pin the process to a specified CPU.
+
+* Disable automatic kernel support for transparent huge pages. Both /sys/kernel/mm/transparent_hugepage/enabled
+and /sys/kernel/mm/transparent_hugepage/defrag should be set to "madvise" or "never", and
+/sys/kernel/mm/transparent_hugepage/khugepaged/defrag to 0.
+
+* Disable swapping. If you really need swap space and you are using cgroups,
+make sure that you have the memsw controller enabled, so that swap space is
+properly accounted for.
+
+LICENSE
+-------
+Isolate was written by Martin Mares and Bernard Blackham.
+It can be distributed and used under the terms of the GNU
+General Public License version 2 or any later version.
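Note following the patch (not part of the added file): the META-FILES section above describes the meta-file as lines of the form key:value, with keys such as status, time, and message that may or may not be present. A minimal sketch of how a caller might read such a file is given below. The function names parse_meta and describe, and the sample values, are made up for illustration and are not part of isolate; the status codes and the rule that no status is reported upon normal termination are taken from the man page text.

```python
from typing import Dict

# Two-letter status codes as listed in the META-FILES section.
STATUS_NAMES = {
    "RE": "run-time error",
    "SG": "died on a signal",
    "TO": "timed out",
    "XX": "internal error of the sandbox",
}


def parse_meta(text: str) -> Dict[str, str]:
    """Parse meta-file text made of 'key:value' lines into a dict (hypothetical helper)."""
    meta: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so a value may itself contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key:value line; this sketch simply skips it
        meta[key] = value
    return meta


def describe(meta: Dict[str, str]) -> str:
    """Turn the optional 'status' key into a human-readable phrase."""
    status = meta.get("status")
    if status is None:
        # The man page says no status is reported upon normal termination.
        return "normal termination"
    return STATUS_NAMES.get(status, "unknown status " + status)


# Invented sample contents, for illustration only.
sample = (
    "time:0.512\n"
    "time-wall:0.730\n"
    "max-rss:1840\n"
    "status:TO\n"
    "killed:1\n"
    "message:Time limit exceeded\n"
)

print(describe(parse_meta(sample)))
```

Values are kept as strings because not every key is numeric; a caller would convert time or max-rss explicitly, for example with float(meta["time"]), and should treat every key as optional.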