Create syscall-6.md

0xAX 7 years ago
parent
commit
8b4faeac4b
1 changed file with 221 additions and 0 deletions
+ 221 - 0
SysCall/syscall-6.md

@@ -0,0 +1,221 @@
+Limits on resources in Linux
+================================================================================
+
+Each process in the system uses a certain amount of different resources such as files, CPU time, memory, and so on.
+
+Such resources are not infinite, so we need an instrument to manage how much of them each process may use. Sometimes it is useful to know the current limit for a certain resource or to change its value. In this post we will consider the instruments that allow us to get information about the limits of a process and to increase or decrease them.
+
+We will start from the userspace view and then look at how it is implemented in the Linux kernel.
+
+There are three fundamental [system calls](https://en.wikipedia.org/wiki/System_call) for managing the resource limits of a process:
+
+  * `getrlimit`
+  * `setrlimit`
+  * `prlimit`
+
+The first two allow a process to read and set limits on a system resource. The last one extends the first two: `prlimit` allows setting and reading the resource limits of a process specified by [PID](https://en.wikipedia.org/wiki/Process_identifier). These functions are declared as follows:
+
+The `getrlimit` is:
+
+```C
+int getrlimit(int resource, struct rlimit *rlim);
+```
+
+The `setrlimit` is:
+
+```C
+int setrlimit(int resource, const struct rlimit *rlim);
+```
+
+And the definition of the `prlimit` is:
+
+```C
+int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
+            struct rlimit *old_limit);
+```
+
+The first two functions take two parameters:
+
+  * `resource` - represents resource type (we will see available types later);
+  * `rlim` - combination of `soft` and `hard` limits.
+
+There are two types of limits:
+
+  * `soft`
+  * `hard`
+
+The first provides the actual limit for a resource of a process. The second is a ceiling for the `soft` limit and can be raised only by the superuser; an unprivileged process may only lower it. So, the `soft` limit can never exceed the related `hard` limit.
+
+Both these values are combined in the `rlimit` structure:
+
+```C
+struct rlimit {
+    rlim_t rlim_cur;
+    rlim_t rlim_max;
+};
+```
+
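As a minimal sketch of how the two fields are read from userspace, the helper below calls `getrlimit` and prints both values. The names `print_limit` and `show_limits` are made up for this illustration and are not part of any library.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, spelling out RLIM_INFINITY instead of a huge number. */
static void print_limit(const char *label, rlim_t value)
{
	if (value == RLIM_INFINITY)
		printf("%s: unlimited\n", label);
	else
		printf("%s: %llu\n", label, (unsigned long long)value);
}

/*
 * show_limits() is a hypothetical helper for this sketch.
 * It reads both limits for `resource` and prints them.
 * Returns 0 on success and -1 if getrlimit() failed.
 */
int show_limits(int resource)
{
	struct rlimit lim;

	if (getrlimit(resource, &lim) == -1) {
		perror("getrlimit");
		return -1;
	}
	print_limit("soft (rlim_cur)", lim.rlim_cur);
	print_limit("hard (rlim_max)", lim.rlim_max);
	return 0;
}
```

Calling `show_limits(RLIMIT_NOFILE)` from a `main` function prints the limits on open file descriptors for the current process; the actual numbers depend on the system.
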
+The last function looks a little more complex and takes `4` arguments. Besides the `resource` argument, it takes:
+
+  * `pid` - specifies the ID of the process on which `prlimit` should be executed;
+  * `new_limit` - provides new limit values if it is not `NULL`;
+  * `old_limit` - current `soft` and `hard` limits will be placed here if it is not `NULL`.
+
+The `prlimit` function is exactly what the [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit) utility uses. We can verify this with the help of the [strace](https://linux.die.net/man/1/strace) utility.
+
+For example:
+
+```
+~$ strace ulimit -s 2>&1 | grep rl
+
+prlimit64(0, RLIMIT_NPROC, NULL, {rlim_cur=63727, rlim_max=63727}) = 0
+prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0
+prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
+```
+
+Here we see `prlimit64` rather than `prlimit`. This is because `strace` shows the underlying system call instead of the library call.
+
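To make the difference between the library call and the system call more tangible, here is a sketch that invokes `prlimit64` directly through `syscall()`. The wrapper name `raw_prlimit64` and the struct name `raw_rlimit64` are invented for this example; the struct is assumed to mirror the pair of 64-bit values that the kernel expects.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>   /* RLIMIT_* constants */
#include <sys/syscall.h>    /* SYS_prlimit64 */
#include <sys/types.h>
#include <unistd.h>         /* syscall() */

/*
 * Local mirror of the two 64-bit fields passed to prlimit64.
 * The struct name is made up for this sketch.
 */
struct raw_rlimit64 {
	uint64_t cur;   /* soft limit */
	uint64_t max;   /* hard limit */
};

/*
 * raw_prlimit64() is a hypothetical wrapper that bypasses the C library's
 * prlimit() and invokes the system call directly.
 * pid == 0 means "the calling process"; either pointer may be NULL.
 * Returns 0 on success, -1 with errno set on failure.
 */
int raw_prlimit64(pid_t pid, int resource,
                  const struct raw_rlimit64 *new_limit,
                  struct raw_rlimit64 *old_limit)
{
	return (int)syscall(SYS_prlimit64, (long)pid, (long)resource,
	                    new_limit, old_limit);
}
```

A call such as `raw_prlimit64(0, RLIMIT_NOFILE, NULL, &lim)` only reads the limits, which matches the shape of the `strace` lines above: the new limit is `NULL` and the old limit is filled in.
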
+Now let's look at the list of available resources:
+
+| Resource          | Description                                                                              |
+|-------------------|------------------------------------------------------------------------------------------|
+| RLIMIT_CPU        | CPU time limit given in seconds                                                          |
+| RLIMIT_FSIZE      | the maximum size of files that a process may create                                      |
+| RLIMIT_DATA       | the maximum size of the process's data segment                                           |
+| RLIMIT_STACK      | the maximum size of the process stack in bytes                                           |
+| RLIMIT_CORE       | the maximum size of a [core](http://man7.org/linux/man-pages/man5/core.5.html) file      |
+| RLIMIT_RSS        | the number of bytes that can be allocated for a process in RAM                           |
+| RLIMIT_NPROC      | the maximum number of processes that can be created by a user                            |
+| RLIMIT_NOFILE     | the maximum number of file descriptors that can be opened by a process                   |
+| RLIMIT_MEMLOCK    | the maximum number of bytes of memory that may be locked into RAM by [mlock](http://man7.org/linux/man-pages/man2/mlock.2.html) |
+| RLIMIT_AS         | the maximum size of virtual memory in bytes                                              |
+| RLIMIT_LOCKS      | the maximum number of [flock](https://linux.die.net/man/1/flock) locks and [fcntl](http://man7.org/linux/man-pages/man2/fcntl.2.html) leases |
+| RLIMIT_SIGPENDING | the maximum number of [signals](http://man7.org/linux/man-pages/man7/signal.7.html) that may be queued for the user of the calling process |
+| RLIMIT_MSGQUEUE   | the number of bytes that can be allocated for [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html) |
+| RLIMIT_NICE       | the maximum [nice](https://linux.die.net/man/1/nice) value that can be set by a process  |
+| RLIMIT_RTPRIO     | the maximum real-time priority value                                                     |
+| RLIMIT_RTTIME     | the maximum number of microseconds that a process may be scheduled under a real-time scheduling policy without making a blocking system call |
+
+If you look into the source code of open source projects, you will notice that reading or updating a resource limit is quite a widely used operation.
+
+For example, [systemd](https://github.com/systemd/systemd/blob/master/src/core/main.c):
+
+```C
+/* Don't limit the coredump size */
+(void) setrlimit(RLIMIT_CORE, &RLIMIT_MAKE_CONST(RLIM_INFINITY));
+```
+
+Or [haproxy](https://github.com/haproxy/haproxy/blob/master/src/haproxy.c):
+
+```C
+getrlimit(RLIMIT_NOFILE, &limit);
+if (limit.rlim_cur < global.maxsock) {
+	Warning("[%s.main()] FD limit (%d) too low for maxconn=%d/maxsock=%d. Please raise 'ulimit-n' to %d or more to avoid any trouble.\n",
+		argv[0], (int)limit.rlim_cur, global.maxconn, global.maxsock, global.maxsock);
+}
+```
+
+We've just seen a little about resource limits in userspace; now let's look at the same system calls in the Linux kernel.
+
+Limits on resources in the Linux kernel
+--------------------------------------------------------------------------------
+
+The implementations of the `getrlimit` and `setrlimit` system calls look similar. Both call the `do_prlimit` function, which is the core implementation of the `prlimit` system call, and copy the given `rlimit` to or from userspace:
+
+The `getrlimit`:
+
+```C
+SYSCALL_DEFINE2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+{
+	struct rlimit value;
+	int ret;
+
+	ret = do_prlimit(current, resource, NULL, &value);
+	if (!ret)
+		ret = copy_to_user(rlim, &value, sizeof(*rlim)) ? -EFAULT : 0;
+
+	return ret;
+}
+```
+
+and `setrlimit`:
+
+```C
+SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+{
+	struct rlimit new_rlim;
+
+	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
+		return -EFAULT;
+	return do_prlimit(current, resource, &new_rlim, NULL);
+}
+```
+
+Implementations of these system calls are defined in the [kernel/sys.c](https://github.com/torvalds/linux/blob/master/kernel/sys.c) kernel source code file.
+
+First of all, the `do_prlimit` function checks that the given resource is valid:
+
+```C
+if (resource >= RLIM_NLIMITS)
+	return -EINVAL;
+```
+
+and returns the `-EINVAL` error if it is not. If this check passes and the new limits were passed as a non-`NULL` value, the two following checks are executed:
+
+```C
+if (new_rlim) {
+	if (new_rlim->rlim_cur > new_rlim->rlim_max)
+		return -EINVAL;
+	if (resource == RLIMIT_NOFILE &&
+			new_rlim->rlim_max > sysctl_nr_open)
+		return -EPERM;
+}
+```
+
+These checks verify that the given `soft` limit does not exceed the `hard` limit and, when the given resource is the maximum number of file descriptors (`RLIMIT_NOFILE`), that the hard limit is not greater than the `sysctl_nr_open` value. The value of `sysctl_nr_open` can be found via [procfs](https://en.wikipedia.org/wiki/Procfs):
+
+```
+~$ cat /proc/sys/fs/nr_open 
+1048576
+```
+
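The first two `-EINVAL` checks can be observed from userspace. The sketch below makes two deliberately invalid requests and reports the resulting `errno`. Both helper names are hypothetical, and the value `1000` is simply assumed to be far beyond any real `RLIMIT_*` constant.

```c
#include <errno.h>
#include <sys/resource.h>

/*
 * Both helpers are hypothetical. Each returns the errno value produced by a
 * deliberately invalid request, or 0 if the call unexpectedly succeeded.
 */

/* Ask for a resource number far beyond any real RLIMIT_* constant. */
int errno_for_bad_resource(void)
{
	struct rlimit lim;

	errno = 0;
	if (getrlimit(1000, &lim) == -1)
		return errno;
	return 0;
}

/* Ask for a soft limit that is larger than the hard limit. */
int errno_for_soft_above_hard(void)
{
	struct rlimit lim = { .rlim_cur = 2, .rlim_max = 1 };

	errno = 0;
	if (setrlimit(RLIMIT_CORE, &lim) == -1)
		return errno;
	return 0;
}
```

Since the second request is rejected before any limit is updated, it should leave the `RLIMIT_CORE` setting of the calling process untouched.
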
+After all of these checks we take the `tasklist` lock to make sure that the [signal](http://man7.org/linux/man-pages/man7/signal.7.html) handler related data will not be destroyed while we are updating the limits for a given resource:
+
+```C
+read_lock(&tasklist_lock);
+...
+...
+...
+read_unlock(&tasklist_lock);
+```
+
+We need to do this because the `prlimit` system call allows us to update the limits of another task by the given pid. While the task list is locked, we take the `rlimit` instance that is responsible for the given resource limit of the given process:
+
+```C
+rlim = tsk->signal->rlim + resource;
+```
+
+where `tsk->signal->rlim` is just an array of `struct rlimit` entries, one per resource. If `new_rlim` is not `NULL`, we update the stored value from it. If `old_rlim` is not `NULL`, we fill it:
+
+```C
+if (old_rlim)
+    *old_rlim = *rlim;
+```
+
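The whole sequence can be summarized with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `model_task`, `model_prlimit` and `MODEL_NLIMITS` are invented names, the `privileged` flag is an assumption standing in for the kernel's permission check on raising a hard limit, and locking as well as the `RLIMIT_NOFILE` ceiling are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>

#define MODEL_NLIMITS 16   /* assumed table size, for this model only */

/* Stand-in for tsk->signal->rlim: one entry per resource. */
struct model_task {
	struct rlimit rlim[MODEL_NLIMITS];
};

/*
 * model_prlimit() follows the steps described in the text: validate the
 * resource, validate the new pair, then copy the old value out and the
 * new value in. Returns 0 or a negative errno value, as kernel code does.
 */
int model_prlimit(struct model_task *tsk, unsigned int resource,
                  const struct rlimit *new_rlim, struct rlimit *old_rlim,
                  int privileged)
{
	struct rlimit *rlim;

	if (resource >= MODEL_NLIMITS)
		return -EINVAL;
	if (new_rlim && new_rlim->rlim_cur > new_rlim->rlim_max)
		return -EINVAL;

	rlim = tsk->rlim + resource;

	/* Assumption: only a privileged caller may raise the hard limit. */
	if (new_rlim && new_rlim->rlim_max > rlim->rlim_max && !privileged)
		return -EPERM;

	if (old_rlim)
		*old_rlim = *rlim;
	if (new_rlim)
		*rlim = *new_rlim;
	return 0;
}
```

Passing both `new_rlim` and `old_rlim` returns the previous limits and installs the new ones in a single call, which is the behaviour `prlimit` exposes to userspace.
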
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the second part that describes implementation of the system calls in the Linux kernel. If you have questions or suggestions, ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [system calls](https://en.wikipedia.org/wiki/System_call)
+* [PID](https://en.wikipedia.org/wiki/Process_identifier)
+* [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit)
+* [strace](https://linux.die.net/man/1/strace)
+* [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html)