add chapter about cgroups

Alexander Kuleshov, 8 years ago
parent
commit
6992e97a21
3 files changed, with 456 additions and 0 deletions
  1. 5 0
      Cgroups/README.md
  2. 449 0
      Cgroups/cgroups1.md
  3. 2 0
      SUMMARY.md

+ 5 - 0
Cgroups/README.md

@@ -0,0 +1,5 @@
+# Cgroups
+
+This chapter describes the `control groups` mechanism in the Linux kernel.
+
+* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Cgroups/cgroups1.html)

+ 449 - 0
Cgroups/cgroups1.md

@@ -0,0 +1,449 @@
+Control Groups
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This is the first part of a new chapter of the [linux insides](http://0xax.gitbooks.io/linux-insides/content/) book and, as you may guess from the part's name, it covers the [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel.
+
+`Cgroups` are a special mechanism provided by the Linux kernel which allows us to allocate different kinds of `resources`, like processor time, number of processes per group, amount of memory per control group, or a combination of such resources, to a process or a set of processes. `Cgroups` are organized hierarchically, and in this respect the mechanism is similar to ordinary processes, which are hierarchical too: child `cgroups` inherit a set of certain parameters from their parents. But they are not the same. The main difference between `cgroups` and normal processes is that many different hierarchies of control groups may exist simultaneously, while the normal process tree is always single. This is not accidental, because each control group hierarchy is attached to a set of control group `subsystems`.
+
+One `control group subsystem` represents one kind of resource, like processor time or the number of [pids](https://en.wikipedia.org/wiki/Process_identifier), in other words the number of processes in a `control group`. The Linux kernel provides support for the following twelve `control group subsystems`:
+
+* `cpuset` - assigns individual processor(s) and memory nodes to task(s) in a group;
+* `cpu` - uses the scheduler to provide cgroup tasks access to the processor resources;
+* `cpuacct` - generates reports about processor usage by a group;
+* `io` - sets limits on reads from and writes to [block devices](https://en.wikipedia.org/wiki/Device_file) (it appears as `blkio` in the listings below);
+* `memory` - sets a limit on memory usage by the task(s) of a group;
+* `devices` - allows or denies access to devices by the task(s) of a group;
+* `freezer` - allows suspending and resuming the task(s) of a group;
+* `net_cls` - allows marking network packets that come from the task(s) of a group;
+* `net_prio` - provides a way to dynamically set the priority of network traffic per network interface for a group;
+* `perf_event` - provides access to [perf events](https://en.wikipedia.org/wiki/Perf_(Linux)) for a group;
+* `hugetlb` - activates support for [huge pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt) for a group;
+* `pids` - sets a limit on the number of processes in a group.
+
+Each of these control group subsystems depends on a related configuration option. For example, the `cpuset` subsystem is enabled via the `CONFIG_CPUSETS` kernel configuration option, the `io` subsystem via the `CONFIG_BLK_CGROUP` kernel configuration option, and so on. All of these kernel configuration options may be found in the `General setup → Control Group support` menu:
+
+![menuconfig](http://oi66.tinypic.com/2rc2a9e.jpg)
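To check which of these options a kernel was built with, we may also search its configuration file. This is a minimal sketch: the function name `cgroup_config` is hypothetical, and the default path `/boot/config-$(uname -r)` is an assumption that holds on many distributions but not on all of them.

```shell
# cgroup_config [FILE]
# Print the cgroup-related options that are set in a kernel config file.
# FILE defaults to /boot/config-<release>; this location is an assumption
# about where the distribution installs the kernel config.
cgroup_config() {
    grep -E '^CONFIG_(CGROUP|CPUSETS|BLK_CGROUP)[A-Z0-9_]*=' \
        "${1:-/boot/config-$(uname -r)}"
}
```

Calling `cgroup_config` without arguments reads the config of the running kernel, while `cgroup_config .config` may be used inside a kernel build tree.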
+
+You may see the control groups enabled on your computer via the [proc](https://en.wikipedia.org/wiki/Procfs) filesystem:
+
+```
+$ cat /proc/cgroups 
+#subsys_name	hierarchy	num_cgroups	enabled
+cpuset	8	1	1
+cpu	7	66	1
+cpuacct	7	66	1
+blkio	11	66	1
+memory	9	94	1
+devices	6	66	1
+freezer	2	1	1
+net_cls	4	1	1
+perf_event	3	1	1
+net_prio	4	1	1
+hugetlb	10	1	1
+pids	5	69	1
+```
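The fourth column of this table tells whether a subsystem is enabled. A small sketch that prints only the enabled ones is given here; the function name `enabled_cgroups` is hypothetical, and the optional file argument exists only so that the function can be tried on a saved copy of the table.

```shell
# enabled_cgroups [FILE]
# Print the names of the subsystems whose "enabled" column is 1.
# FILE defaults to /proc/cgroups and must use the same four-column layout.
enabled_cgroups() {
    awk '$1 !~ /^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```

Lines that start with `#` are skipped, so the header row never appears in the result.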
+
+or via [sysfs](https://en.wikipedia.org/wiki/Sysfs):
+
+```
+$ ls -l /sys/fs/cgroup/
+total 0
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 blkio
+lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpu -> cpu,cpuacct
+lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpuacct -> cpu,cpuacct
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 cpu,cpuacct
+dr-xr-xr-x 2 root root  0 Dec  2 22:37 cpuset
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 devices
+dr-xr-xr-x 2 root root  0 Dec  2 22:37 freezer
+dr-xr-xr-x 2 root root  0 Dec  2 22:37 hugetlb
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 memory
+lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_cls -> net_cls,net_prio
+dr-xr-xr-x 2 root root  0 Dec  2 22:37 net_cls,net_prio
+lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_prio -> net_cls,net_prio
+dr-xr-xr-x 2 root root  0 Dec  2 22:37 perf_event
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 pids
+dr-xr-xr-x 5 root root  0 Dec  2 22:37 systemd
+```
+
+As you may already guess, the `control groups` mechanism was not invented only for the needs of the Linux kernel itself, but mostly for userspace needs. To use a `control group`, we should create it first. We may create a `cgroup` in two ways.
+
+The first way is to create a subdirectory in any subsystem under `/sys/fs/cgroup` and add the pid of a task to the `tasks` file, which is created automatically right after we create the subdirectory.
+
+The second way is to create/destroy/manage `cgroups` with the utilities from the `libcgroup` library (`libcgroup-tools` in Fedora).
+
+Let's consider a simple example. The following [bash](https://www.gnu.org/software/bash/) script prints a line to the `/dev/tty` device, which represents the controlling terminal of the current process:
+
+```shell
+#!/bin/bash
+
+while :
+do
+    echo "print line" > /dev/tty
+    sleep 5
+done
+```
+
+If we run this script, we will see the following result:
+
+```
+$ sudo chmod +x cgroup_test_script.sh
+~$ ./cgroup_test_script.sh 
+print line
+print line
+print line
+...
+...
+...
+```
+
+Now let's go to the place where `cgroupfs` is mounted on our computer. As we just saw, this is the `/sys/fs/cgroup` directory, but you may mount it anywhere you want.
+
+```
+$ cd /sys/fs/cgroup
+```
+
+And now let's go to the `devices` subdirectory, which represents the kind of resource that allows or denies access to devices by the tasks in a `cgroup`:
+
+```
+# cd devices
+```
+
+and create the `cgroup_test_group` directory there:
+
+```
+# mkdir cgroup_test_group
+```
+
+After the `cgroup_test_group` directory is created, the following files will be generated there:
+
+```
+/sys/fs/cgroup/devices/cgroup_test_group$ ls -l
+total 0
+-rw-r--r-- 1 root root 0 Dec  3 22:55 cgroup.clone_children
+-rw-r--r-- 1 root root 0 Dec  3 22:55 cgroup.procs
+--w------- 1 root root 0 Dec  3 22:55 devices.allow
+--w------- 1 root root 0 Dec  3 22:55 devices.deny
+-r--r--r-- 1 root root 0 Dec  3 22:55 devices.list
+-rw-r--r-- 1 root root 0 Dec  3 22:55 notify_on_release
+-rw-r--r-- 1 root root 0 Dec  3 22:55 tasks
+```
+
+For now we are interested in the `tasks` and `devices.deny` files. The first, `tasks`, should contain the pid(s) of the processes which will be attached to the `cgroup_test_group`. The second, `devices.deny`, contains a list of denied devices. By default a newly created group has no limits on device access. To forbid a device (in our case `/dev/tty`) we should write the following line to `devices.deny`:
+
+```
+# echo "c 5:0 w" > devices.deny
+```
+
+Let's go step by step through this line. The first letter, `c`, represents the type of the device. In our case `/dev/tty` is a `char device`. We can verify this from the output of the `ls` command:
+
+```
+~$ ls -l /dev/tty
+crw-rw-rw- 1 root tty 5, 0 Dec  3 22:48 /dev/tty
+```
+
+see the first `c` letter in the permissions list. The second part, `5:0`, is the major and minor numbers of the device. You can see these numbers in the output of `ls` too. And the last letter, `w`, forbids tasks to write to the specified device. So let's start the `cgroup_test_script.sh` script:
+
+```
+~$ ./cgroup_test_script.sh 
+print line
+print line
+print line
+...
+...
+```
+
+and add the pid of this process to the `tasks` file of our group in the `devices` hierarchy:
+
+```
+# echo $(pidof -x cgroup_test_script.sh) > /sys/fs/cgroup/devices/cgroup_test_group/tasks
+```
+
+The result of this action will be as expected:
+
+```
+~$ ./cgroup_test_script.sh 
+print line
+print line
+print line
+print line
+print line
+print line
+./cgroup_test_script.sh: line 5: /dev/tty: Operation not permitted
+```
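The manual steps above may be collected into two small helpers. This is only a sketch under stated assumptions: the names `device_rule` and `deny_device` are hypothetical, `stat` is assumed to support `-c` with the `%F`, `%t` and `%T` formats (as GNU coreutils does), and writing into a real hierarchy requires root and a mounted `devices` subsystem.

```shell
# device_rule NODE ACCESS
# Build a "type major:minor access" rule for a device node, e.g. "c 5:0 w".
# Assumes a stat(1) that understands -c with %F, %t and %T.
device_rule() {
    case "$(stat -c %F "$1")" in
        "character special file") kind=c ;;
        "block special file")     kind=b ;;
        *) echo "$1: not a device node" >&2; return 1 ;;
    esac
    # %t and %T print the major and minor numbers in hexadecimal,
    # so they are converted to decimal with shell arithmetic.
    major=$(stat -c %t "$1")
    minor=$(stat -c %T "$1")
    echo "$kind $((0x$major)):$((0x$minor)) $2"
}

# deny_device ROOT GROUP RULE PID
# ROOT is the mount point of the devices hierarchy
# (/sys/fs/cgroup/devices in the walkthrough above).
# Creates GROUP, writes RULE to devices.deny and attaches PID via tasks.
deny_device() {
    mkdir -p "$1/$2" &&
    echo "$3" > "$1/$2/devices.deny" &&
    echo "$4" > "$1/$2/tasks"
}
```

With these helpers the whole example becomes `deny_device /sys/fs/cgroup/devices cgroup_test_group "$(device_rule /dev/tty w)" "$(pidof -x cgroup_test_script.sh)"`, run as root.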
+
+A similar situation occurs when you run [docker](https://en.wikipedia.org/wiki/Docker_(software)) containers, for example:
+
+```
+~$ docker ps
+CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
+fa2d2085cd1c        mariadb:10          "docker-entrypoint..."   12 days ago         Up 4 minutes        0.0.0.0:3306->3306/tcp   mysql-work
+
+~$ cat /sys/fs/cgroup/devices/docker/fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61/tasks | head -3
+5501
+5584
+5585
+...
+...
+...
+```
+
+So, during the startup of a `docker` container, `docker` creates a `cgroup` for the processes in this container:
+
+```
+$ docker exec -it mysql-work /bin/bash
+$ top
+  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
+    1 mysql     20   0  963996 101268  15744 S   0.0  0.6   0:00.46 mysqld
+   71 root      20   0   20248   3028   2732 S   0.0  0.0   0:00.01 bash
+   77 root      20   0   21948   2424   2056 R   0.0  0.0   0:00.00 top
+```
+
+And we may see this `cgroup` on the host machine:
+
+```
+$ systemd-cgls
+
+Control group /:
+-.slice
+├─docker
+│ └─fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61
+│   ├─5501 mysqld
+│   └─6404 /bin/bash
+```
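The same relation may be looked up from the side of a process: `/proc/<pid>/cgroup` holds one line per hierarchy in the form `hierarchy-ID:controller-list:path`. The sketch below extracts the path for one controller; the function name `cgroup_path` is hypothetical, and the optional file argument is there only to make it possible to try the function on saved data.

```shell
# cgroup_path CONTROLLER [FILE]
# Print the cgroup path that FILE records for CONTROLLER.
# FILE defaults to /proc/self/cgroup; pass /proc/<pid>/cgroup for another task.
cgroup_path() {
    awk -F: -v want="$1" '{
        # The second field may list several co-mounted controllers,
        # for example "cpu,cpuacct".
        n = split($2, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == want) {
                # Drop "ID:controllers:" and keep the path, which may
                # itself contain colons.
                sub(/^[^:]*:[^:]*:/, "")
                print
                exit
            }
    }' "${2:-/proc/self/cgroup}"
}
```

For the container above, `cgroup_path devices /proc/5501/cgroup` would be expected to print a path under `/docker/`, assuming the same cgroup layout as in the listing.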
+
+Now we know a little about the `control groups` mechanism, how to use it manually, and what the purpose of this mechanism is. It is time to look inside the Linux kernel source code and start to dive into the implementation of this mechanism.
+
+Early initialization of control groups
+--------------------------------------------------------------------------------
+
+Now that we have seen a little theory about the `control groups` mechanism of the Linux kernel, we may start to dive into the source code of the Linux kernel to get acquainted with this mechanism more closely. As always, we will start from the initialization of `control groups`. Initialization of `cgroups` is divided into two parts in the Linux kernel: early and late. In this part we will consider only the `early` part; the `late` part will be considered in the next parts.
+
+Early initialization of `cgroups` starts from the call of the:
+
+```C
+cgroup_init_early();
+```
+
+function in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) during early initialization of the Linux kernel. This function is defined in the [kernel/cgroup.c](https://github.com/torvalds/linux/blob/master/kernel/cgroup.c) source code file and starts with the definition of the following two local variables:
+
+```C
+int __init cgroup_init_early(void)
+{
+	static struct cgroup_sb_opts __initdata opts;
+	struct cgroup_subsys *ss;
+    ...
+    ...
+    ...
+}
+```
+
+The `cgroup_sb_opts` structure is defined in the same source code file and looks like this:
+
+```C
+struct cgroup_sb_opts {
+	u16 subsys_mask;
+	unsigned int flags;
+	char *release_agent;
+	bool cpuset_clone_children;
+	char *name;
+	bool none;
+};
+```
+
+which represents the mount options of `cgroupfs`. For example, we may create a named cgroup hierarchy (with the name `my_cgrp`) with the `name=` option and without any subsystems:
+
+```
+$ mount -t cgroup -o name=my_cgrp,none none /mnt/cgroups
+```
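After such a mount, the hierarchy and the options it was mounted with are typically visible in the mount table. A minimal sketch for listing them follows; the function name `cgroup_mounts` is hypothetical, and it assumes the usual `/proc/mounts` layout where the third field is the filesystem type.

```shell
# cgroup_mounts [FILE]
# Print "mount-point options" for every cgroup (v1) mount found in FILE.
# FILE defaults to /proc/mounts and must use the same field layout:
# device, mount point, filesystem type, options, dump, pass.
cgroup_mounts() {
    awk '$3 == "cgroup" { print $2, $4 }' "${1:-/proc/mounts}"
}
```

Options such as `name=` in the second column of the result correspond to fields of the `cgroup_sb_opts` structure shown above.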
+
+The second variable, `ss`, is a pointer to the `cgroup_subsys` structure, which is defined in the [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup-defs.h) header file and, as you may guess from the name of the type, represents a `cgroup` subsystem. This structure contains various fields and callback functions like:
+
+```C
+struct cgroup_subsys {
+    int (*css_online)(struct cgroup_subsys_state *css);
+    void (*css_offline)(struct cgroup_subsys_state *css);
+    ...
+    ...
+    ...
+    bool early_init:1;
+    int id;
+    const char *name;
+    struct cgroup_root *root;
+    ...
+    ...
+    ...
+}
+```
+
+Here, for example, the `css_online` and `css_offline` callbacks are called after a cgroup has successfully completed all allocations and before a cgroup is released, respectively. The `early_init` flag marks subsystems which may/should be initialized early. The `id` and `name` fields represent the unique identifier in the array of registered subsystems and the name of a subsystem, respectively. The last field, `root`, is a pointer to the root of a cgroup hierarchy.
+
+Of course the `cgroup_subsys` structure is bigger and has other fields, but this is enough for now. Now that we know the important structures related to the `cgroups` mechanism, let's return to the `cgroup_init_early` function. The main purpose of this function is to do early initialization of some subsystems. As you may already guess, these `early` subsystems should have `cgroup_subsys->early_init = 1`. Let's look at which subsystems may be initialized early.
+
+After the definition of the two local variables we may see the following lines of code:
+
+```C
+init_cgroup_root(&cgrp_dfl_root, &opts);
+cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
+```
+
+Here we may see a call to the `init_cgroup_root` function, which initializes the default unified hierarchy. After this we set the `CSS_NO_REF` flag in the state of this default `cgroup` to disable reference counting for this css. The `cgrp_dfl_root` is defined in the same source code file:
+
+```C
+struct cgroup_root cgrp_dfl_root;
+```
+
+Its `cgrp` field is a `cgroup` structure which, as you may already guess, represents a `cgroup` and is defined in the [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup-defs.h) header file. We already know that a process is represented by the `task_struct` in the Linux kernel. The `task_struct` does not contain a direct link to the `cgroup` that the task is attached to, but the `cgroup` may be reached via the `cgroups` field of the `task_struct`, which points to a `css_set` structure. This `css_set` structure holds an array of pointers to subsystem states:
+
+```C
+struct css_set {
+    ...
+    ...
+    ....
+    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
+    ...
+    ...
+    ...
+}
+```
+
+And via the `cgroup_subsys_state`, a process may get the `cgroup` that it is attached to:
+
+```C
+struct cgroup_subsys_state {
+    ...
+    ...
+    ...
+    struct cgroup *cgroup;
+    ...
+    ...
+    ...
+}
+```
+
+So, the overall picture of the `cgroups`-related data structures is the following:
+
+```                                                 
++-------------+         +---------------------+    +------------->+---------------------+          +----------------+
+| task_struct |         |       css_set       |    |              | cgroup_subsys_state |          |     cgroup     |
++-------------+         |                     |    |              +---------------------+          +----------------+
+|             |         |                     |    |              |                     |          |     flags      |
+|             |         |                     |    |              +---------------------+          |  cgroup.procs  |
+|             |         |                     |    |              |        cgroup       |--------->|       id       |
+|             |         |                     |    |              +---------------------+          |      ....      | 
+|-------------+         |---------------------+----+                                               +----------------+
+|   cgroups   | ------> | cgroup_subsys_state | array of cgroup_subsys_state
+|-------------+         +---------------------+------------------>+---------------------+          +----------------+
+|             |         |                     |                   | cgroup_subsys_state |          |      cgroup    |
++-------------+         +---------------------+                   +---------------------+          +----------------+
+                                                                  |                     |          |      flags     |
+                                                                  +---------------------+          |   cgroup.procs |
+                                                                  |        cgroup       |--------->|        id      |
+                                                                  +---------------------+          |       ....     |
+                                                                  |    cgroup_subsys    |          +----------------+
+                                                                  +---------------------+
+                                                                             |
+                                                                             |
+                                                                             ↓
+                                                                  +---------------------+
+                                                                  |    cgroup_subsys    |
+                                                                  +---------------------+
+                                                                  |         id          |
+                                                                  |        name         |
+                                                                  |      css_online     |
+                                                                  |     css_offline     |
+                                                                  |        attach       |
+                                                                  |         ....        |
+                                                                  +---------------------+
+```
+
+
+
+So, `init_cgroup_root` fills `cgrp_dfl_root` with default values. The next step is assigning the initial `css_set` to `init_task`, which represents the first process in the system:
+
+```C
+RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
+```
+
+And the last big thing in the `cgroup_init_early` function is the initialization of `early cgroups`. Here we go over all registered subsystems, assign each a unique identity number and a name, and call the `cgroup_init_subsys` function for the subsystems which are marked as early:
+
+```C
+for_each_subsys(ss, i) {
+	ss->id = i;
+	ss->name = cgroup_subsys_name[i];
+
+	if (ss->early_init)
+		cgroup_init_subsys(ss, true);
+}
+```
+
+The `for_each_subsys` here is a macro which is defined in the [kernel/cgroup.c](https://github.com/torvalds/linux/blob/master/kernel/cgroup.c) source code file and just expands to a `for` loop over the `cgroup_subsys` array. The definition of this array may be found in the same source code file and it looks a little unusual:
+
+```C
+#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
+static struct cgroup_subsys *cgroup_subsys[] = {
+#include <linux/cgroup_subsys.h>
+};
+#undef SUBSYS
+```
+
+Here the `SUBSYS` macro takes one argument (the name of a subsystem) and expands to one designated initializer of the `cgroup_subsys` array. The array itself is initialized with the content of the [linux/cgroup_subsys.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup_subsys.h) header file. If we look inside this header file, we will again see a set of `SUBSYS` macros with the given subsystem names:
+
+```C
+#if IS_ENABLED(CONFIG_CPUSETS)
+SUBSYS(cpuset)
+#endif
+
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)
+SUBSYS(cpu)
+#endif
+...
+...
+...
+```
+
+This works because of the `#undef` statement after each definition of the `SUBSYS` macro, which allows the same header to be included again with a different definition of `SUBSYS`. Look at the `&_x ## _cgrp_subsys` expression. The `##` operator concatenates its left and right operands in a `C` macro. So as we passed `cpuset`, `cpu`, etc., to the `SUBSYS` macro, `cpuset_cgrp_subsys`, `cpu_cgrp_subsys` and so on should be defined somewhere. And that's true. If you look in the [kernel/cpuset.c](https://github.com/torvalds/linux/blob/master/kernel/cpuset.c) source code file, you will see this definition:
+
+```C
+struct cgroup_subsys cpuset_cgrp_subsys = {
+    ...
+    ...
+    ...
+	.early_init	= true,
+};
+```
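To see what the array initializer looks like after the macro is applied, the expansion may be imitated with a plain text substitution. This is only an illustration and not how the preprocessor works internally: the function name `expand_subsys` is hypothetical, and the `#if IS_ENABLED(...)` guards are ignored, so every candidate subsystem is printed regardless of the kernel configuration.

```shell
# expand_subsys FILE
# For every SUBSYS(name) line in FILE print the array entry that the
# SUBSYS macro shown above would generate for it.
# The #if guards are not evaluated: all SUBSYS() lines are expanded.
expand_subsys() {
    sed -n 's/^SUBSYS(\([a-z0-9_]*\))$/[\1_cgrp_id] = \&\1_cgrp_subsys,/p' "$1"
}
```

Run from the root of a kernel source tree as `expand_subsys include/linux/cgroup_subsys.h`, it prints one line per subsystem, for example `[cpuset_cgrp_id] = &cpuset_cgrp_subsys,`.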
+
+So the last step in the `cgroup_init_early` function is the initialization of early subsystems with a call to the `cgroup_init_subsys` function. The following early subsystems will be initialized:
+
+* `cpuset`;
+* `cpu`;
+* `cpuacct`.
+
+The `cgroup_init_subsys` function initializes the given subsystem with default values. For example, it sets the root of the hierarchy, allocates space for the given subsystem with a call to the `css_alloc` callback function, links the subsystem with a parent if one exists, adds the allocated subsystem to the initial process, and so on.
+
+That's all. From this moment early subsystems are initialized.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the first part, which is an introduction to the `control groups` mechanism in the Linux kernel. We covered some theory and the first steps of initialization of the data structures related to the `control groups` mechanism. In the next part we will continue to dive into the more practical aspects of `control groups`.
+
+If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
+
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [control groups](https://en.wikipedia.org/wiki/Cgroups)
+* [PID](https://en.wikipedia.org/wiki/Process_identifier)
+* [cpuset](http://man7.org/linux/man-pages/man7/cpuset.7.html)
+* [block devices](https://en.wikipedia.org/wiki/Device_file)
+* [huge pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [proc](https://en.wikipedia.org/wiki/Procfs)
+* [cgroups kernel documentation](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
+* [cgroups v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt)
+* [bash](https://www.gnu.org/software/bash/)
+* [docker](https://en.wikipedia.org/wiki/Docker_(software))
+* [perf events](https://en.wikipedia.org/wiki/Perf_(Linux))
+* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html)

+ 2 - 0
SUMMARY.md

@@ -54,6 +54,8 @@
     * [Memblock](mm/linux-mm-1.md)
     * [Fixmaps and ioremap](mm/linux-mm-2.md)
     * [kmemcheck](mm/linux-mm-3.md)
+* [Cgroups](Cgroups/README.md)
+    * [Introduction to Control Groups](Cgroups/cgroups1.md)
 * [SMP]()
 * [Concepts](Concepts/README.md)
     * [Per-CPU variables](Concepts/per-cpu.md)