Merge pull request #1 from 0xAX/master

Pull
0xC0FFEE committed 8 years ago
Commit b10d0d80c1
76 files changed: 11705 additions and 1666 deletions
  1. Booting/README.md (+7, -7)
  2. Booting/linux-bootstrap-1.md (+153, -156)
  3. Booting/linux-bootstrap-2.md (+139, -138)
  4. Booting/linux-bootstrap-3.md (+78, -78)
  5. Booting/linux-bootstrap-4.md (+195, -126)
  6. Booting/linux-bootstrap-5.md (+164, -82)
  7. CONTRIBUTING.md (+8, -5)
  8. Concepts/README.md (+1, -0)
  9. Concepts/cpumask.md (+36, -26)
  10. Concepts/initcall.md (+395, -0)
  11. Concepts/per-cpu.md (+37, -37)
  12. DataStructures/README.md (+3, -2)
  13. DataStructures/bitmap.md (+384, -0)
  14. DataStructures/dlist.md (+27, -25)
  15. DataStructures/radix-tree.md (+22, -16)
  16. Initialization/README.md (+2, -2)
  17. Initialization/linux-initialization-1.md (+161, -64)
  18. Initialization/linux-initialization-10.md (+32, -32)
  19. Initialization/linux-initialization-2.md (+71, -44)
  20. Initialization/linux-initialization-3.md (+15, -15)
  21. Initialization/linux-initialization-4.md (+11, -11)
  22. Initialization/linux-initialization-5.md (+34, -34)
  23. Initialization/linux-initialization-6.md (+32, -34)
  24. Initialization/linux-initialization-7.md (+34, -34)
  25. Initialization/linux-initialization-8.md (+36, -38)
  26. Initialization/linux-initialization-9.md (+35, -35)
  27. KernelStructures/.gitkeep (+0, -0)
  28. KernelStructures/README.md (+7, -0)
  29. KernelStructures/idt.md (+190, -0)
  30. LINKS.md (+11, -0)
  31. Misc/contribute.md (+489, -0)
  32. Misc/how_kernel_compiled.md (+6, -6)
  33. Misc/linkers.md (+32, -28)
  34. Misc/program_startup.md (+486, -0)
  35. README.md (+10, -4)
  36. SUMMARY.md (+31, -3)
  37. SyncPrim/README.md (+10, -0)
  38. SyncPrim/sync-1.md (+433, -0)
  39. SyncPrim/sync-2.md (+487, -0)
  40. SyncPrim/sync-3.md (+354, -0)
  41. SyncPrim/sync-4.md (+440, -0)
  42. SyncPrim/sync-5.md (+433, -0)
  43. SyncPrim/sync-6.md (+352, -0)
  44. SysCall/README.md (+4, -2)
  45. SysCall/syscall-1.md (+26, -26)
  46. SysCall/syscall-2.md (+409, -0)
  47. SysCall/syscall-3.md (+403, -0)
  48. SysCall/syscall-4.md (+430, -0)
  49. Theory/ELF.md (+33, -30)
  50. Theory/Paging.md (+39, -40)
  51. Theory/README.md (+1, -0)
  52. Theory/asm.md (+441, -0)
  53. Timers/README.md (+11, -0)
  54. Timers/timers-1.md (+436, -0)
  55. Timers/timers-2.md (+451, -0)
  56. Timers/timers-3.md (+444, -0)
  57. Timers/timers-4.md (+427, -0)
  58. Timers/timers-5.md (+415, -0)
  59. Timers/timers-6.md (+413, -0)
  60. Timers/timers-7.md (+421, -0)
  61. contributors.md (+30, -0)
  62. interrupts/README.md (+11, -11)
  63. interrupts/interrupts-1.md (+28, -28)
  64. interrupts/interrupts-10.md (+12, -12)
  65. interrupts/interrupts-2.md (+30, -30)
  66. interrupts/interrupts-3.md (+273, -221)
  67. interrupts/interrupts-4.md (+22, -22)
  68. interrupts/interrupts-5.md (+12, -12)
  69. interrupts/interrupts-6.md (+6, -6)
  70. interrupts/interrupts-7.md (+16, -16)
  71. interrupts/interrupts-8.md (+17, -17)
  72. interrupts/interrupts-9.md (+37, -26)
  73. mm/README.md (+1, -0)
  74. mm/linux-mm-1.md (+42, -37)
  75. mm/linux-mm-2.md (+47, -48)
  76. mm/mm-3.md (+434, -0)

+ 7 - 7
Booting/README.md

@@ -1,10 +1,10 @@
-# Kernel boot process
+# Kernel Boot Process
 
-This chapter describes the linux kernel boot process. You will see here a
-couple of posts which describe the full cycle of the kernel loading process:
+This chapter describes the linux kernel boot process. Here you will see a
+couple of posts which describe the full cycle of the kernel loading process:
 
-* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to before the first instruction of the kernel;
-* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST and etc...
+* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to running the first instruction of the kernel.
+* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes the first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST, etc.
 * [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) - describes video mode initialization in the kernel setup code and transition to protected mode.
-* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for transition into 64-bit mode and transition into it.
-* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes preparation before kernel decompression and directly decompression.
+* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for the transition into 64-bit mode and the details of the transition itself.
+* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes the preparation before kernel decompression and the details of the decompression itself.

+ 153 - 156
Booting/linux-bootstrap-1.md

@@ -1,29 +1,29 @@
 Kernel booting process. Part 1.
 ================================================================================
 
-From the bootloader to kernel
+From the bootloader to the kernel
 --------------------------------------------------------------------------------
 
-If you have read my previous [blog posts](http://0xax.blogspot.com/search/label/asm), you can see that sometime ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works on low-level and many many other things. So, I decided to write yet another series of posts about the Linux kernel for **x86_64**.
+If you have been reading my previous [blog posts](http://0xax.blogspot.com/search/label/asm), then you can see that, for some time, I have been getting involved in low-level programming. I have written some posts about x86_64 assembly programming for Linux and, at the same time, I have also started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes & memory, how the network stack works at a low level, and many many other things. So, I have decided to write yet another series of posts about the Linux kernel for **x86_64**.
 
-Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). I appreciate it. All posts will also be accessible at [linux-insides](https://github.com/0xAX/linux-insides) and if you find something wrong with my English or the post content, feel free to send a pull request.
+Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). I appreciate it. All posts will also be accessible at [linux-insides](https://github.com/0xAX/linux-insides) and, if you find something wrong with my English or the post content, feel free to send a pull request.
 
 
-*Note that this isn't the official documentation, just learning and sharing knowledge.*
+*Note that this isn't official documentation, just learning and sharing knowledge.*
 
 **Required knowledge**
 
 * Understanding C code
 * Understanding assembly code (AT&T syntax)
 
-Anyway, if you just started to learn some tools, I will try to explain some parts during this and the following posts. Ok, little introduction finished and now we can start to dive into the kernel and low-level stuff.
+Anyway, if you have just started to learn some tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction, and now we can start to dive into the kernel and low-level stuff.
 
-All code is actually for kernel - 3.18. If there are changes, I will update the posts accordingly.
+All code is actually for the 3.18 kernel. If there are changes, I will update the posts accordingly.
 
-The Magic Power Button, What happens next?
+The Magical Power Button, What happens next?
 --------------------------------------------------------------------------------
 
-Despite that this is a series of posts about Linux kernel, we will not start from kernel code (at least in this paragraph). Ok, you pressed the magic power button on your laptop or desktop computer and it started to work. After the motherboard sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply), the power supply provides the computer with the proper amount of electricity. Once motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it tries to run the CPU. The CPU resets all leftover data in its registers and sets up predefined values for every register.
+Although this is a series of posts about the Linux kernel, we will not be starting from the kernel code - at least not in this paragraph. As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply). After receiving the signal, the power supply provides the proper amount of electricity to the computer. Once the motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.
 
 
 [80386](https://en.wikipedia.org/wiki/Intel_80386) and later CPUs define the following predefined data in CPU registers after the computer resets:
@@ -34,70 +34,66 @@ CS selector 0xf000
 CS base     0xffff0000
 ```
 
-The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode) and we need to back up a little to understand memory segmentation in this mode. Real mode is supported in all x86-compatible processors, from [8086](https://en.wikipedia.org/wiki/Intel_8086) to modern Intel 64-bit CPUs. The 8086 processor had a 20-bit address bus, which means that it could work with 0-2^20 bytes address space (1 megabyte). But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 or 0xffff (64 kilobytes). [Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all of the address space available. All memory is divided into small, fixed-size segments of 65535 bytes, or 64 KB. Since we cannot address memory below 64 KB with 16 bit registers, an alternate method to do it was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:
+The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode). Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the [8086](https://en.wikipedia.org/wiki/Intel_8086) all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it can work with a 0-0x100000 address space (1 megabyte). But it only has 16-bit registers, which have a maximum address of 2^16 - 1 or 0xffff (64 kilobytes). [Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes (64 KB). Since we cannot address memory above 64 KB with 16 bit registers, an alternate method is devised. An address consists of two parts: a segment selector, which has a base address, and an offset from this base address. In real mode, the associated base address of a segment selector is `Segment Selector * 16`. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset:
 
 ```
-PhysicalAddress = Segment * 16 + Offset
+PhysicalAddress = Segment Selector * 16 + Offset
 ```
 
-For example if `CS:IP` is `0x2000:0x0010`, the corresponding physical address will be:
+For example, if `CS:IP` is `0x2000:0x0010`, then the corresponding physical address will be:
 
 ```python
 >>> hex((0x2000 << 4) + 0x0010)
 '0x20010'
 ```
 
-But if we take the biggest segment part and offset: `0xffff:0xffff`, it will be:
+But, if we take the largest segment selector and offset, `0xffff:0xffff`, then the resulting address will be:
 
 ```python
 >>> hex((0xffff << 4) + 0xffff)
 '0x10ffef'
 ```
 
-which is 65519 bytes over first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](https://en.wikipedia.org/wiki/A20_line).
+which is 65520 bytes past the first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](https://en.wikipedia.org/wiki/A20_line).
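
A couple of lines of Python illustrate the wrap-around (the mask simply drops everything above the 20 address bits that actually reach memory):

```python
>>> address = (0xffff << 4) + 0xffff   # largest real mode segment:offset pair
>>> hex(address)
'0x10ffef'
>>> hex(address & 0xfffff)             # only 20 address bits reach memory with A20 disabled
'0xffef'
```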
 
-Ok, now we know about real mode and memory addressing. Let's get back to register values after reset.
+Ok, now we know about real mode and memory addressing. Let's get back to discussing register values after reset:
 
-`CS` register consists of two parts: the visible segment selector and hidden base address. We know predefined `CS` base and `IP` value, logical address will be:
+The `CS` register consists of two parts: the visible segment selector, and the hidden base address. While the base address is normally formed by multiplying the segment selector value by 16, during a hardware reset the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000; the processor uses this special base address until `CS` is changed.
 
-```
-0xffff0000:0xfff0
-```
-
-In this way starting address formed by adding the base address to the value in the EIP register:
+The starting address is formed by adding the base address to the value in the EIP register:
 
 ```python
 >>> hex(0xffff0000 + 0xfff0)
 '0xfffffff0'
 ```
 
-We get `0xfffffff0` which is 4GB - 16 bytes. This point is the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) instruction which usually points to the BIOS entry point. For example, if we look in [coreboot](http://www.coreboot.org/) source code, we will see it:
+We get `0xfffffff0`, which is 16 bytes below 4GB. This point is called the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) (`jmp`) instruction that usually points to the BIOS entry point. For example, if we look in the [coreboot](http://www.coreboot.org/) source code, we see:
 
 ```assembly
-	.section ".reset"
-	.code16
-.globl	reset_vector
+    .section ".reset"
+    .code16
+.globl  reset_vector
 reset_vector:
-	.byte  0xe9
-	.int   _start - ( . + 2 )
-	...
+    .byte  0xe9
+    .int   _start - ( . + 2 )
+    ...
 ```
 
-We can see here the jump instruction [opcode](http://ref.x86asm.net/coder32.html#xE9) - 0xe9 to the address `_start - ( . + 2)`. And we can see that `reset` section is 16 bytes and starts at `0xfffffff0`:
+Here we can see the `jmp` instruction [opcode](http://ref.x86asm.net/coder32.html#xE9), which is 0xe9, and its destination address at `_start - ( . + 2)`. We can also see that the `reset` section is 16 bytes, and that it starts at `0xfffffff0`:
 
 ```
 SECTIONS {
-	_ROMTOP = 0xfffffff0;
-	. = _ROMTOP;
-	.reset . : {
-		*(.reset)
-		. = 15 ;
-		BYTE(0x00);
-	}
+    _ROMTOP = 0xfffffff0;
+    . = _ROMTOP;
+    .reset . : {
+        *(.reset)
+        . = 15 ;
+        BYTE(0x00);
+    }
 }
 ```
 
-Now the BIOS has started to work. After initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration. The function of boot order is to control which devices the kernel attempts to boot. In the case of attempting to boot a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (512 bytes). The final two bytes of the first sector are `0x55` and `0xaa` which signals the BIOS that the device is bootable. For example:
+Now the BIOS starts; after initializing and checking the hardware, the BIOS needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector are `0x55` and `0xaa`, which signals to the BIOS that this device is bootable. For example:
 
 ```assembly
 ;
@@ -121,45 +117,45 @@ db 0x55
 db 0xaa
 ```
 
-Build and run it with:
+Build and run this with:
 
 ```
 nasm -f bin boot.nasm && qemu-system-x86_64 boot
 ```
 
-This will instruct [QEMU](http://qemu.org) to use the `boot` binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00`, and we end with the magic sequence). QEMU will treat the binary as the master boot record(MBR) of a disk image.
+This will instruct [QEMU](http://qemu.org) to use the `boot` binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00` and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
 
-We will see:
+You will see:
 
 ![Simple bootloader which prints only `!`](http://oi60.tinypic.com/2qbwup0.jpg)
 
-In this example we can see that this code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After the start it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which just prints `!` symbol. It fills rest of 510 bytes with zeros and finish with two magic bytes `0xaa` and `0x55`.
+In this example we can see that the code will be executed in 16 bit real mode and will start at `0x7c00` in memory. After starting, it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt, which just prints the `!` symbol; it fills the rest of the sector (up to byte 510) with zeros and finishes with the two magic bytes `0x55` and `0xaa`.
 
-You can see binary dump of it with `objdump` util:
+You can see a binary dump of this using the `objdump` utility:
 
 ```
 nasm -f bin boot.nasm
 objdump -D -b binary -mi386 -Maddr16,data16,intel boot
 ```
 
-A real-world boot sector has code for continuing the boot process and the partition table instead of a bunch of 0's and an exclamation point :) Ok so, from this point onwards BIOS hands over the control to the bootloader and we can go ahead.
+A real-world boot sector has code for continuing the boot process and a partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, the BIOS hands over control to the bootloader.
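
If you want to check an image yourself, a small Python sketch like the following is enough; it assumes the `boot` binary built above (any raw disk image works the same way):

```python
# "boot" is the binary built above; any raw disk image can be checked the same way
with open("boot", "rb") as f:
    sector = f.read(512)

# a bootable first sector is exactly 512 bytes and ends with 0x55, 0xaa
print(len(sector) == 512 and sector[510] == 0x55 and sector[511] == 0xaa)
```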
 
-**NOTE**: As you can read above the CPU is in real mode. In real mode, calculating the physical address in memory is done as following:
+**NOTE**: As explained above, the CPU is in real mode; in real mode, calculating the physical address in memory is done as follows:
 
 ```
-PhysicalAddress = Segment * 16 + Offset
+PhysicalAddress = Segment Selector * 16 + Offset
 ```
 
-Same as I mentioned before. But we have only 16 bit general purpose registers. The maximum value of 16 bit register is: `0xffff`; So if we take the biggest values the result will be:
+just as explained before. We have only 16 bit general purpose registers; the maximum value of a 16 bit register is `0xffff`, so if we take the largest values, the result will be:
 
 ```python
 >>> hex((0xffff * 16) + 0xffff)
 '0x10ffef'
 ```
 
-Where `0x10ffef` is equal to `1MB + 64KB - 16b`. But a [8086](https://en.wikipedia.org/wiki/Intel_8086) processor, which was the first processor with real mode. It had 20 bit address line and `2^20 = 1048576.0` is 1MB. So, it means that the actual  memory available is 1MB.
+where `0x10ffef` is equal to `1MB + 64KB - 16b`. An [8086](https://en.wikipedia.org/wiki/Intel_8086) processor (which was the first processor with real mode), in contrast, has a 20 bit address line. Since `2^20 = 1048576` is 1MB, this means that the actual available memory is 1MB.
 
-General real mode's memory map is:
+General real mode's memory map is as follows:
 
 ```
 0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
@@ -175,40 +171,40 @@ General real mode's memory map is:
 0x000F0000 - 0x000FFFFF - System BIOS
 ```
 
-But stop, at the beginning of post I wrote that first instruction executed by the CPU is located at the address `0xFFFFFFF0`, which is much bigger than `0xFFFFF` (1MB). How can CPU access it in real mode? As I write about it and you can read in [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
+In the beginning of this post, I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is in the [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
 
 ```
 0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
 ```
 
-At the start of execution BIOS is not in RAM, it is located in the ROM.
+At the start of execution, the BIOS is not in RAM, but in ROM.
 
 Bootloader
 --------------------------------------------------------------------------------
 
-There are a number of bootloaders which can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.
+There are a number of bootloaders that can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.
 
-Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple due to the limited amount of space available, and contains a pointer that it uses to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c).
+Continuing from before, now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c).
 
-`grub_main` initializes console, gets base address for modules, sets root device, loads/parses grub configuration file, loads modules etc. At the end of execution, `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes last preparation and shows a menu for selecting an operating system. When we select one of grub menu entries, `grub_menu_execute_entry` begins to be executed, which executes grub `boot` command. It starts to boot the selected operating system.
+`grub_main` initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, `grub_menu_execute_entry` runs, executing the grub `boot` command and booting the selected operating system.
 
-As we can read in the kernel boot protocol, the bootloader must read and fill some fields of kernel setup header which starts at `0x01f1` offset from the kernel setup code. Kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from:
+As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the `0x01f1` offset from the kernel setup code. The kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from:
 
 ```assembly
-	.globl hdr
+    .globl hdr
 hdr:
-	setup_sects: .byte 0
-	root_flags:  .word ROOT_RDONLY
-	syssize:     .long 0
-	ram_size:    .word 0
-	vid_mode:    .word SVGA_MODE
-	root_dev:    .word 0
-	boot_flag:   .word 0xAA55
+    setup_sects: .byte 0
+    root_flags:  .word ROOT_RDONLY
+    syssize:     .long 0
+    ram_size:    .word 0
+    vid_mode:    .word SVGA_MODE
+    root_dev:    .word 0
+    boot_flag:   .word 0xAA55
 ```
 
-The bootloader must fill this and the rest of the headers (only marked as `write` in the Linux boot protocol, for example [this](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it either got from command line or calculated. We will not see description and explanation of all fields of kernel setup header, we will get back to it when kernel uses it. Anyway, you can find description of any field in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).
+The bootloader must fill this and the rest of the headers (which are only marked as being type `write` in the Linux boot protocol, such as in [this example](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it has either received from the command line or calculated. (We will not go over full descriptions and explanations for all fields of the kernel setup header now, but will instead get back to them when we discuss how the kernel uses them; you can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).)
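
As a rough illustration of this layout, here is a small Python sketch that reads a few of these fields straight out of a kernel image; the `bzImage` path is just an example, and the offsets are the ones given in the boot protocol:

```python
# the path is just an example; the offsets come from Documentation/x86/boot.txt
with open("bzImage", "rb") as f:
    image = f.read(0x210)

setup_sects = image[0x1F1]                                  # number of setup sectors
boot_flag   = int.from_bytes(image[0x1FE:0x200], "little")  # must be 0xAA55
magic       = image[0x202:0x206]                            # must be b"HdrS"
version     = int.from_bytes(image[0x206:0x208], "little")  # boot protocol version

print(setup_sects, hex(boot_flag), magic, hex(version))
```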
 
-As we can see in kernel boot protocol, the memory map will be the following after kernel loading:
+As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
 
 ```shell
          | Protected-mode kernel  |
@@ -235,34 +231,34 @@ X+08000  +------------------------+
 
 ```
 
-So after the bootloader transferred control to the kernel, it starts somewhere at:
+So, when the bootloader transfers control to the kernel, it starts at:
 
 ```
 0x1000 + X + sizeof(KernelBootSector) + 1
 ```
 
-where `X` is the address of kernel bootsector loaded. In my case `X` is `0x10000`, we can see it in memory dump:
+where `X` is the address of the kernel boot sector being loaded. In my case, `X` is `0x10000`, as we can see in a memory dump:
 
 ![kernel first address](http://oi57.tinypic.com/16bkco2.jpg)
 
-Ok, now the bootloader has loaded Linux kernel into the memory, filled header fields and jumped to it. Now we can move directly to the kernel setup code.
+The bootloader has now loaded the Linux kernel into memory, filled the header fields, and then jumped to the corresponding memory address. We can now move directly to the kernel setup code.
 
 Start of Kernel Setup
 --------------------------------------------------------------------------------
 
-Finally we are in the kernel. Technically kernel didn't run yet, first of all we need to setup kernel, memory manager, process manager etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at the first look, there are many instructions before it.
+Finally, we are in the kernel! Technically, the kernel hasn't run yet; first, we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at first sight, as there are several instructions before it.
 
-Actually Long time ago Linux kernel had its own bootloader, but now if you run for example:
+A long time ago, the Linux kernel used to have its own bootloader. Now, however, if you run, for example,
 
 ```
 qemu-system-x86_64 vmlinuz-3.18-generic
 ```
 
-You will see:
+then you will see:
 
 ![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg)
 
-Actually `header.S` starts from [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), error message printing and following [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
+Actually, `header.S` starts with the [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) magic bytes (see image above), the error message to be printed, and, following that, the [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
 
 ```assembly
 #ifdef CONFIG_EFI_STUB
@@ -274,21 +270,21 @@ Actually `header.S` starts from [MZ](https://en.wikipedia.org/wiki/DOS_MZ_execut
 ...
 ...
 pe_header:
-	.ascii "PE"
-	.word 0
+    .ascii "PE"
+    .word 0
 ```
 
-It needs this for loading the operating system with [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). Here we will not see how it works (we will these later in the next parts).
+It needs this to load an operating system with [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). We won't be looking into its inner workings right now and will cover it in upcoming chapters.
 
-So the actual kernel setup entry point is:
+The actual kernel setup entry point is:
 
-```
+```assembly
 // header.S line 292
 .globl _start
 _start:
 ```
 
-Bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to this point, despite the fact that `header.S` starts from `.bstext` section which prints error message:
+The bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to it, despite the fact that `header.S` starts from the `.bstext` section, which prints an error message:
 
 ```
 //
@@ -299,186 +295,187 @@ Bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`)
 .bsdata : { *(.bsdata) }
 ```
 
-So kernel setup entry point is:
+The kernel setup entry point is:
 
 ```assembly
-	.globl _start
+    .globl _start
 _start:
-	.byte 0xeb
-	.byte start_of_setup-1f
+    .byte  0xeb
+    .byte  start_of_setup-1f
 1:
-	//
-	// rest of the header
-	//
+    //
+    // rest of the header
+    //
 ```
 
-Here we can see `jmp` instruction opcode - `0xeb` to the `start_of_setup-1f` point. `Nf` notation means following: `2f` refers to the next local `2:` label. In our case it is label `1` which goes right after jump. It contains rest of setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156) and right after setup header we can see `.entrytext` section which starts at `start_of_setup` label.
+Here we can see a `jmp` instruction opcode (`0xeb`) that jumps to the `start_of_setup-1f` point. In `Nf` notation, `2f` refers to the following local `2:` label; in our case, it is label `1` that is present right after the jump, and it contains the rest of the setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156). Right after the setup header, we see the `.entrytext` section, which starts at the `start_of_setup` label.
 
-Actually it's the first code which starts to execute besides previous jump instruction. After kernel setup got the control from bootloader, first `jmp` instruction is located at `0x200` (first 512 bytes) offset from the start of kernel real mode. This we can read in Linux kernel boot protocol and also see in grub2 source code:
+This is the first code that actually runs (aside from the previous jump instructions, of course). After the kernel setup receives control from the bootloader, the first `jmp` instruction is located at the `0x200` offset from the start of the kernel in real mode, i.e., after the first 512 bytes. This we can both read in the Linux kernel boot protocol and see in the grub2 source code:
 
 ```C
-  state.gs = state.fs = state.es = state.ds = state.ss = segment;
-  state.cs = segment + 0x20;
+segment = grub_linux_real_target >> 4;
+state.gs = state.fs = state.es = state.ds = state.ss = segment;
+state.cs = segment + 0x20;
 ```
 
-It means that segment registers will have following values after kernel setup starts to work:
+This means that segment registers will have the following values after kernel setup starts:
 
 ```
-fs = es = ds = ss = 0x1000
+gs = fs = es = ds = ss = 0x1000
 cs = 0x1020
 ```
 
-for my case when kernel loaded at `0x10000`.
+In my case, the kernel is loaded at `0x10000`.
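
A quick Python check shows that these values are consistent with each other, assuming `X = 0x10000` as in the memory dump above:

```python
>>> X = 0x10000                 # kernel real mode code loaded here (my case)
>>> segment = X >> 4            # 0x1000 - goes into gs, fs, es, ds and ss
>>> hex(segment + 0x20)         # cs, as set by grub2
'0x1020'
>>> hex((segment + 0x20) << 4)  # cs:0 in physical memory = X + 512
'0x10200'
```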
 
-After jump to `start_of_setup`, it needs to do the following things:
+After the jump to `start_of_setup`, the kernel needs to do the following:
 
-* Be sure that all values of all segment registers are equal
-* Setup correct stack if needed
-* Setup [bss](https://en.wikipedia.org/wiki/.bss)
-* Jump to C code at [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c)
+* Make sure that all segment register values are equal
+* Set up a correct stack, if needed
+* Set up [bss](https://en.wikipedia.org/wiki/.bss)
+* Jump to the C code in [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c)
 
-Let's look at implementation.
+Let's look at the implementation.
 
 Segment registers align
 --------------------------------------------------------------------------------
 
-First of all it ensures that `ds` and `es` segment registers point to the same address and disable interrupts with `cli` instruction:
+First of all, the kernel ensures that `ds` and `es` segment registers point to the same address. Next, it clears the direction flag using the `cld` instruction:
 
 ```assembly
-	movw	%ds, %ax
-	movw	%ax, %es
-	cli	
+    movw    %ds, %ax
+    movw    %ax, %es
+    cld
 ```
 
-As I wrote above, grub2 loads kernel setup code at `0x10000` address and `cs` at `0x1020` because execution doesn't start from the start of file, but from:
+As I wrote earlier, grub2 loads the kernel setup code at address `0x10000` and `cs` at `0x1020` because execution doesn't start from the start of the file, but from
 
-```
+```assembly
 _start:
-	.byte 0xeb
-	.byte start_of_setup-1f
+    .byte 0xeb
+    .byte start_of_setup-1f
 ```
 
-`jump`, which is 512 bytes offset from the [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). Also need to align `cs` from `0x10200` to `0x10000` as all other segment registers. After that we setup the stack:
+`jump`, which is at a 512 byte offset from [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). It also needs to align `cs` from `0x10200` to `0x10000`, as well as all other segment registers. After that, we set up the stack:
 
 ```assembly
-	pushw	%ds
-	pushw	$6f
-	lretw
+    pushw   %ds
+    pushw   $6f
+    lretw
 ```
 
-push `ds` value to stack, and address of [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label and execute `lretw` instruction. When we call `lretw`, it loads address of  label `6` to [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and `cs` with value of `ds`. After it we will have `ds` and `cs` with the same values.
+which pushes the value of `ds` to the stack, followed by the address of the [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label, and executes the `lretw` instruction. When the `lretw` instruction is called, it loads the address of label `6` into the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and loads `cs` with the value of `ds`. Afterwards, `ds` and `cs` will have the same values.
 
 Stack Setup
 --------------------------------------------------------------------------------
 
-Actually, almost all of the setup code is preparation for C language environment in the real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking of `ss` register value and making of correct stack if `ss` is wrong:
+Almost all of the setup code is in preparation for the C language environment in real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking the `ss` register value and making a correct stack if `ss` is wrong:
 
 ```assembly
-	movw	%ss, %dx
-	cmpw	%ax, %dx
-	movw	%sp, %dx
-	je	2f
+    movw    %ss, %dx
+    cmpw    %ax, %dx
+    movw    %sp, %dx
+    je      2f
 ```
 
-Generally, it can be 3 different cases:
+This can lead to 3 different scenarios:
 
-* `ss` has valid value 0x10000 (as all other segment registers beside `cs`)
+* `ss` has the valid value 0x10000 (as do all other segment registers besides `cs`)
 * `ss` is invalid and `CAN_USE_HEAP` flag is set     (see below)
 * `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below)
 
-Let's look at all of these cases:
+Let's look at all three of these scenarios in turn:
 
-1. `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
+* `ss` has a correct address (0x10000). In this case, we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
 
-```
-2: 	andw	$~3, %dx
-	jnz	3f
-	movw	$0xfffc, %dx
-3:  movw	%ax, %ss
-	movzwl %dx, %esp
-	sti
+```assembly
+2:  andw    $~3, %dx
+    jnz     3f
+    movw    $0xfffc, %dx
+3:  movw    %ax, %ss
+    movzwl  %dx, %esp
+    sti
 ```
 
-Here we can see aligning of `dx` (contains `sp` given by bootloader) to 4 bytes and checking that it is not zero. If it is zero we put `0xfffc` (4 byte aligned address before maximum segment size - 64 KB) to `dx`. If it is not zero we continue to use `sp` given by bootloader (0xf7f4 in my case). After this we put `ax` value to `ss` which stores correct segment address `0x10000` and set up correct `sp`. After it we have correct stack:
+Here we can see the alignment of `dx` (which contains the `sp` given by the bootloader) to 4 bytes and a check for whether or not it is zero. If it is zero, we put `0xfffc` (a 4 byte aligned address just below the maximum segment size of 64 KB) in `dx`. If it is not zero, we continue to use the `sp` given by the bootloader (0xf7f4 in my case). After this, we put the `ax` value into `ss`, which stores the correct segment address of `0x10000` and sets up a correct `sp`. We now have a correct stack:
 
 ![stack](http://oi58.tinypic.com/16iwcis.jpg)
 
-2. In the second case (`ss` != `ds`), first of all put [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx`. And check `loadflags` header field with `testb` instruction too see if we can use heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
+* In the second scenario (`ss` != `ds`), we first put the value of [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (the address of the end of the setup code) into `dx` and check the `loadflags` header field using the `testb` instruction to see whether we can use the heap. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
 
 ```C
-#define LOADED_HIGH	    (1<<0)
-#define QUIET_FLAG	    (1<<5)
-#define KEEP_SEGMENTS	(1<<6)
-#define CAN_USE_HEAP	(1<<7)
+#define LOADED_HIGH     (1<<0)
+#define QUIET_FLAG      (1<<5)
+#define KEEP_SEGMENTS   (1<<6)
+#define CAN_USE_HEAP    (1<<7)
 ```
 
-And as we can read in the boot protocol:
+and, as we can read in the boot protocol,
 
 ```
-Field name:	loadflags
+Field name: loadflags
 
   This field is a bitmask.
 
   Bit 7 (write): CAN_USE_HEAP
-	Set this bit to 1 to indicate that the value entered in the
-	heap_end_ptr is valid.  If this field is clear, some setup code
-	functionality will be disabled.
+    Set this bit to 1 to indicate that the value entered in the
+    heap_end_ptr is valid.  If this field is clear, some setup code
+    functionality will be disabled.
 ```
 
-If `CAN_USE_HEAP` bit is set, put `heap_end_ptr` to `dx` which points to `_end` and add `STACK_SIZE` (minimal stack size - 512 bytes) to it. After this if `dx` is not carry, jump to `2` (it will not be carry, dx = _end + 512) label as in previous case and make correct stack.
+If the `CAN_USE_HEAP` bit is set, we put `heap_end_ptr` into `dx` (which points to `_end`) and add `STACK_SIZE` (the minimum stack size, 512 bytes) to it. After this, if the addition does not carry (it will not carry, since dx = _end + 512), we jump to label `2` as in the previous case and make a correct stack (a short sketch of this selection logic follows the list below).
 
 ![stack](http://oi62.tinypic.com/dr7b5w.jpg)
 
-3. The last case when `CAN_USE_HEAP` is not set, we just use minimal stack from `_end` to `_end + STACK_SIZE`:
+* When `CAN_USE_HEAP` is not set, we just use a minimal stack from `_end` to `_end + STACK_SIZE`:
 
 ![minimal stack](http://oi60.tinypic.com/28w051y.jpg)
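
Here is a rough Python sketch of the stack-top selection described in the three scenarios above (this is not the kernel code itself; the `_end`, `heap_end_ptr` and `sp` values are only placeholders):

```python
STACK_SIZE   = 512
CAN_USE_HEAP = 1 << 7

def stack_top(ss_is_valid, loadflags, sp, _end, heap_end_ptr):
    if ss_is_valid:                      # scenario 1: keep the sp given by the bootloader
        dx = sp
    elif loadflags & CAN_USE_HEAP:       # scenario 2: stack right after the heap
        dx = heap_end_ptr + STACK_SIZE
    else:                                # scenario 3: minimal stack right after _end
        dx = _end + STACK_SIZE
    dx &= ~3                             # label 2: align to 4 bytes
    return dx if dx else 0xfffc          # if zero, use the top of the 64 KB segment

print(hex(stack_top(True, 0, 0xf7f4, 0x33d4, 0x33d4)))   # '0xf7f4'
```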
 
 BSS Setup
 --------------------------------------------------------------------------------
 
-The last two steps that need to happen before we can jump to the main C code, are that we need to set up the [BSS](https://en.wikipedia.org/wiki/.bss) area, and check the "magic" signature. Firstly, signature checking:
+The last two steps that need to happen before we can jump to the main C code are setting up the [BSS](https://en.wikipedia.org/wiki/.bss) area and checking the "magic" signature. First, signature checking:
 
 ```assembly
-cmpl	$0x5a5aaa55, setup_sig
-jne	setup_bad
+    cmpl    $0x5a5aaa55, setup_sig
+    jne     setup_bad
 ```
 
-This simply consists of comparing the [setup_sig](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L39) against the magic number `0x5a5aaa55`. If they are not equal, a fatal error is reported.
+This simply compares the [setup_sig](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L39) with the magic number `0x5a5aaa55`. If they are not equal, a fatal error is reported.
 
-But if the magic number matches, knowing we have a set of correct segment registers, and a stack, we need only setup the BSS section before jumping into the C code.
+If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
 
-The BSS section is used for storing statically allocated, uninitialized, data. Linux carefully ensures this area of memory is first blanked, using the following code:
+The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first zeroed using the following code:
 
 ```assembly
-	movw	$__bss_start, %di
-	movw	$_end+3, %cx
-	xorl	%eax, %eax
-	subw	%di, %cx
-	shrw	$2, %cx
-	rep; stosl
+    movw    $__bss_start, %di
+    movw    $_end+3, %cx
+    xorl    %eax, %eax
+    subw    %di, %cx
+    shrw    $2, %cx
+    rep; stosl
 ```
 
-First of all the [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address is moved into `di`, and the `_end + 3` address (+3 - aligns to 4 bytes) is moved into `cx`. The `eax` register is cleared (using an `xor` instruction), and the bss section size (`cx`-`di`) is calculated and put into `cx`. Then, `cx` is divided by four (the size of a 'word'), and the `stosl` instruction is repeatedly used, storing the value of `eax` (zero) into the address pointed to by `di`, and automatically increasing `di` by four (this occurs until `cx` reaches zero). The net effect of this code, is that zeros are written through all words in memory from `__bss_start` to `_end`:
+First, the [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address is moved into `di`. Next, the `_end + 3` address (+3 - aligns to 4 bytes) is moved into `cx`. The `eax` register is cleared (using an `xor` instruction), and the bss section size (`cx`-`di`) is calculated and put into `cx`. Then, `cx` is divided by four (the size of a 'word'), and the `stosl` instruction is used repeatedly, storing the value of `eax` (zero) into the address pointed to by `di` and automatically increasing `di` by four, repeating until `cx` reaches zero. The net effect of this code is that zeros are written through all words in memory from `__bss_start` to `_end`:
 
 ![bss](http://oi59.tinypic.com/29m2eyr.jpg)
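
As a small illustration, here is how the `rep; stosl` iteration count works out for made-up `__bss_start` and `_end` values:

```python
>>> bss_start = 0x3000                  # hypothetical __bss_start
>>> end       = 0x35c6                  # hypothetical _end
>>> cx = ((end + 3) - bss_start) >> 2   # number of 4-byte stores
>>> cx, cx * 4                          # stosl iterations and bytes zeroed
(370, 1480)
```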
 
 Jump to main
 --------------------------------------------------------------------------------
 
-That's all, we have the stack, BSS and now we can jump to the `main()` C function:
+That's all - we have the stack and BSS, so we can jump to the `main()` C function:
 
 ```assembly
-	calll main
+    calll main
 ```
 
-The `main()` function is located in [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). What will be there? We will see it in the next part.
+The `main()` function is located in [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). You can read about what this does in the next part.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see first C code which executes in Linux kernel setup, implementation of memory routines as `memset`, `memcpy`, `earlyprintk` implementation and early console initialization and many more.
+This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as `memset`, `memcpy`, `earlyprintk`, early console implementation and initialization, and much more.
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
 
 Links
 --------------------------------------------------------------------------------

+ 139 - 138
Booting/linux-bootstrap-2.md

@@ -4,7 +4,7 @@ Kernel booting process. Part 2.
 First steps in the kernel setup
 --------------------------------------------------------------------------------
 
-We started to dive into linux kernel internals in the previous [part](linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call to the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). 
+We started to dive into linux kernel insides in the previous [part](linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call to the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c). 
 
 In this part we will continue to research the kernel setup code and 
 * see what `protected mode` is,
@@ -41,7 +41,7 @@ As you can read in the previous part, addresses consist of two parts in real mod
 And we can get the physical address if we know these two parts by:
 
 ```
-PhysicalAddress = Segment * 16 + Offset
+PhysicalAddress = Segment Selector * 16 + Offset
 ```
 
 Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called _Segment Descriptor_. The segment descriptors are stored in a data structure called `Global Descriptor Table` (GDT).
@@ -72,7 +72,7 @@ As mentioned above the GDT contains `segment descriptors` which describe memory
 ------------------------------------------------------------
 ```
 
-Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bit 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 16:19. So, the size of Limit is 0-19 i.e 20-bits. Let's take a closer look at it:
+Don't worry, I know it looks a little scary after real mode, but it's easy. For example, LIMIT 15:0 means that bits 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 19:16. So, the size of Limit is bits 0-19, i.e. 20 bits. Let's take a closer look at it:
 
 1. Limit[20-bits] is at 0-15,16-19 bits. It defines `length_of_segment - 1`. It depends on `G`(Granularity) bit.
 
@@ -122,7 +122,7 @@ As we can see the first bit(bit 43) is `0` for a _data_ segment and `1` for a _c
   * if E(bit 42) is 0, expand up other wise expand down. Read more [here](http://www.sudleyplace.com/dpmione/expanddown.html).
   * if W(bit 41)(for Data Segments) is 1, write access is allowed otherwise not. Note that read access is always allowed on data segments.
   * A(bit 40) - Whether the segment is accessed by processor or not.
-  * C(bit 43) is conforming bit(for code selectors). If C is 1, the segment code can be executed from a lower level privilege for e.g user level. If C is 0, it can only be executed from the same privilege level.
+  * C(bit 43) is conforming bit(for code selectors). If C is 1, the segment code can be executed from a lower level privilege e.g. user level. If C is 0, it can only be executed from the same privilege level.
   * R(bit 41)(for code segments). If 1 read access to segment is allowed otherwise not. Write access is never allowed to code segments.
 
 4. DPL[2-bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged.
@@ -135,11 +135,12 @@ As we can see the first bit(bit 43) is `0` for a _data_ segment and `1` for a _c
 
 8. D/B flag(bit 54) - Default/Big flag represents the operand size i.e 16/32 bits. If it is set then 32 bit otherwise 16.
 
-Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure - `Segment Selector`. Each Segment Descriptor has an associated Segment Selector. `Segment Selector` is a 16-bit structure:
+Segment registers contain segment selectors as in real mode. However, in protected mode, a segment selector is handled differently. Each Segment Descriptor has an associated Segment Selector which is a 16-bit structure:
 
 ```
+15              3  2   1  0
 -----------------------------
-|       Index    | TI | RPL |
+|      Index     | TI | RPL |
 -----------------------------
 ```
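
As a small illustration of the bit layout shown above, a selector value can be unpacked like this (the value `0x10` is just an example):

```python
def unpack_selector(selector):
    rpl   = selector & 0b11          # bits 0-1: requested privilege level
    ti    = (selector >> 2) & 0b1    # bit 2: table indicator (0 = GDT, 1 = LDT)
    index = selector >> 3            # bits 3-15: descriptor index
    return index, ti, rpl

print(unpack_selector(0x10))         # (2, 0, 0): third GDT entry, GDT, ring 0
```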
 
@@ -188,53 +189,53 @@ Note that it copies `hdr` with `memcpy` function which is defined in the [copy.S
 
 ```assembly
 GLOBAL(memcpy)
-	pushw	%si
-	pushw	%di
-	movw	%ax, %di
-	movw	%dx, %si
-	pushw	%cx
-	shrw	$2, %cx
-	rep; movsl
-	popw	%cx
-	andw	$3, %cx
-	rep; movsb
-	popw	%di
-	popw	%si
-	retl
+    pushw   %si
+    pushw   %di
+    movw    %ax, %di
+    movw    %dx, %si
+    pushw   %cx
+    shrw    $2, %cx
+    rep; movsl
+    popw    %cx
+    andw    $3, %cx
+    rep; movsb
+    popw    %di
+    popw    %si
+    retl
 ENDPROC(memcpy)
 ```
 
-Yeah, we just moved to C code and now assembly again :) First of all we can see that `memcpy` and other routines which are defined here, start and end with the two macros: `GLOBAL` and `ENDPROC`. `GLOBAL` is described in [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) which defines `globl` directive and the label for it. `ENDPROC` is described in [include/linux/linkage.h](https://github.com/torvalds/linux/blob/master/include/linux/linkage.h) which marks `name` symbol as function name and ends with the size of the `name` symbol.
+Yeah, we just moved to C code and now assembly again :) First of all we can see that `memcpy` and other routines which are defined here, start and end with the two macros: `GLOBAL` and `ENDPROC`. `GLOBAL` is described in [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) which defines `globl` directive and the label for it. `ENDPROC` is described in [include/linux/linkage.h](https://github.com/torvalds/linux/blob/master/include/linux/linkage.h) which marks the `name` symbol as a function name and ends with the size of the `name` symbol.
 
-Implementation of `memcpy` is easy. At first, it pushes values from `si` and `di` registers to the stack because their values will change during the `memcpy`, so it pushes them on the stack to preserve their values. `memcpy` (and other functions in copy.S) use `fastcall` calling conventions. So it gets its incoming parameters from the `ax`, `dx` and `cx` registers.  Calling `memcpy` looks like this:
+Implementation of `memcpy` is easy. At first, it pushes values from the `si` and `di` registers to the stack to preserve their values because they will change during the `memcpy`. `memcpy` (and other functions in copy.S) use `fastcall` calling conventions. So it gets its incoming parameters from the `ax`, `dx` and `cx` registers.  Calling `memcpy` looks like this:
 
 ```c
 memcpy(&boot_params.hdr, &hdr, sizeof hdr);
 ```
 
 So,
-* `ax` will contain the address of the `boot_params.hdr` in bytes
-* `dx` will contain the address of `hdr` in bytes
+* `ax` will contain the address of the `boot_params.hdr`
+* `dx` will contain the address of `hdr`
 * `cx` will contain the size of `hdr` in bytes.
 
-`memcpy` puts the address of `boot_params.hdr` into `si` and saves the size on the stack. After this it shifts to the right on 2 size (or divide on 4) and copies from `si` to `di` by 4 bytes. After this we restore the size of `hdr` again, align it by 4 bytes and copy the rest of the bytes from `si` to `di` byte by byte (if there is more). Restore `si` and `di` values from the stack in the end and after this copying is finished.
+`memcpy` puts the address of `boot_params.hdr` into `di` and saves the size on the stack. After this it shifts the size right by 2 (i.e. divides it by 4) and copies from `si` to `di` 4 bytes at a time. Then we restore the size of `hdr` again, mask it down to the remaining 0-3 bytes and copy the rest from `si` to `di` byte by byte (if there is anything left). Finally the `si` and `di` values are restored from the stack and the copying is finished.
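
As a side note, a rough C equivalent of this word-then-byte copy scheme (just an illustrative sketch for readers less used to assembly, not the kernel's code) could look like this:

```C
#include <stddef.h>

/* Illustrative sketch of the copy.S logic: copy 4-byte words first,
 * then the remaining 0-3 bytes one by one. */
static void *memcpy_sketch(void *dst, const void *src, size_t len)
{
    unsigned int *d4 = dst;
    const unsigned int *s4 = src;
    size_t words = len >> 2;          /* shrw $2, %cx */
    size_t tail  = len & 3;           /* andw $3, %cx */

    while (words--)                   /* rep; movsl */
        *d4++ = *s4++;

    unsigned char *d1 = (unsigned char *)d4;
    const unsigned char *s1 = (const unsigned char *)s4;
    while (tail--)                    /* rep; movsb */
        *d1++ = *s1++;

    return dst;
}
```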
 
 Console initialization
 --------------------------------------------------------------------------------
 
-After the `hdr` is copied into `boot_params.hdr`, the next step is console initialization by calling the `console_init` function which is defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c).
+After `hdr` is copied into `boot_params.hdr`, the next step is console initialization by calling the `console_init` function which is defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c).
 
-It tries to find the `earlyprintk` option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. Value of `earlyprintk` command line option can be one of the:
+It tries to find the `earlyprintk` option in the command line and, if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of the `earlyprintk` command line option can be one of these:
 
-	* serial,0x3f8,115200
-	* serial,ttyS0,115200
-	* ttyS0,115200
+* serial,0x3f8,115200
+* serial,ttyS0,115200
+* ttyS0,115200
 
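Just to make the shape of that option concrete, here is a tiny user-space sketch of splitting such a string into its parts (a hypothetical illustration; the real parsing lives in early_serial_console.c and is done differently):

```C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: split "serial,0x3f8,115200" into its components.
 * The real early_serial_console.c code parses the option differently. */
int main(void)
{
    char opt[] = "serial,0x3f8,115200";

    char *type = strtok(opt, ",");
    char *port = strtok(NULL, ",");
    char *baud = strtok(NULL, ",");

    unsigned long port_addr = strtoul(port, NULL, 0);   /* 0x3f8  */
    unsigned long baud_rate = strtoul(baud, NULL, 10);  /* 115200 */

    printf("type=%s port=0x%lx baud=%lu\n", type, port_addr, baud_rate);
    return 0;
}
```
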
 After serial port initialization we can see the first output:
 
 ```C
 if (cmdline_find_option_bool("debug"))
-		puts("early console in setup code\n");
+    puts("early console in setup code\n");
 ```
 
 The definition of `puts` is in [tty.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c). As we can see it prints character by character in a loop by calling the `putchar` function. Let's look into the `putchar` implementation:
@@ -242,61 +243,61 @@ The definition of `puts` is in [tty.c](https://github.com/torvalds/linux/blob/ma
 ```C
 void __attribute__((section(".inittext"))) putchar(int ch)
 {
-	if (ch == '\n')
-		putchar('\r');
+    if (ch == '\n')
+        putchar('\r');
 
-	bios_putchar(ch);
+    bios_putchar(ch);
 
-	if (early_serial_base != 0)
-		serial_putchar(ch);
+    if (early_serial_base != 0)
+        serial_putchar(ch);
 }
 ```
 
 `__attribute__((section(".inittext")))` means that this code will be in the `.inittext` section. We can find it in the linker file [setup.ld](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L19).
 
-First of all, `put_char` checks for the `\n` symbol and if it is found, prints `\r` before. After that it outputs the character on the VGA screen by calling the BIOS with the `0x10` interrupt call:
+First of all, `putchar` checks for the `\n` symbol and if it is found, prints `\r` before. After that it outputs the character on the VGA screen by calling the BIOS with the `0x10` interrupt call:
 
 ```C
 static void __attribute__((section(".inittext"))) bios_putchar(int ch)
 {
-	struct biosregs ireg;
-
-	initregs(&ireg);
-	ireg.bx = 0x0007;
-	ireg.cx = 0x0001;
-	ireg.ah = 0x0e;
-	ireg.al = ch;
-	intcall(0x10, &ireg, NULL);
+    struct biosregs ireg;
+
+    initregs(&ireg);
+    ireg.bx = 0x0007;
+    ireg.cx = 0x0001;
+    ireg.ah = 0x0e;
+    ireg.al = ch;
+    intcall(0x10, &ireg, NULL);
 }
 ```
 
 Here `initregs` takes the `biosregs` structure and first fills `biosregs` with zeros using the `memset` function and then fills it with register values.
 
 ```C
-	memset(reg, 0, sizeof *reg);
-	reg->eflags |= X86_EFLAGS_CF;
-	reg->ds = ds();
-	reg->es = ds();
-	reg->fs = fs();
-	reg->gs = gs();
+    memset(reg, 0, sizeof *reg);
+    reg->eflags |= X86_EFLAGS_CF;
+    reg->ds = ds();
+    reg->es = ds();
+    reg->fs = fs();
+    reg->gs = gs();
 ```
 
 Let's look at the [memset](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S#L36) implementation:
 
 ```assembly
 GLOBAL(memset)
-	pushw	%di
-	movw	%ax, %di
-	movzbl	%dl, %eax
-	imull	$0x01010101,%eax
-	pushw	%cx
-	shrw	$2, %cx
-	rep; stosl
-	popw	%cx
-	andw	$3, %cx
-	rep; stosb
-	popw	%di
-	retl
+    pushw   %di
+    movw    %ax, %di
+    movzbl  %dl, %eax
+    imull   $0x01010101,%eax
+    pushw   %cx
+    shrw    $2, %cx
+    rep; stosl
+    popw    %cx
+    andw    $3, %cx
+    rep; stosb
+    popw    %di
+    retl
 ENDPROC(memset)
 ```
 
@@ -308,7 +309,7 @@ The next instruction multiplies `eax` with `0x01010101`. It needs to because `me
 
 The rest of the `memset` function does almost the same as `memcpy`.
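
By the way, it is easy to convince yourself why multiplying by `0x01010101` replicates a byte into every byte of a 32-bit value; here is a tiny standalone demo (unrelated to the kernel sources):

```C
#include <stdio.h>

int main(void)
{
    unsigned char ch = 0xAB;

    /* 0xAB * 0x01010101 = 0xABABABAB: the byte ends up in all four byte
     * positions, which is exactly the pattern `rep; stosl` needs. */
    unsigned int pattern = (unsigned int)ch * 0x01010101u;

    printf("0x%08X\n", pattern);    /* prints 0xABABABAB */
    return 0;
}
```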
 
-After that `biosregs` structure is filled with `memset`, `bios_putchar` calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which prints a character. Afterwards it checks if the serial port was initialized or not and writes a character there with [serial_putchar](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c#L30) and `inb/outb` instructions if it was set.
+After the `biosregs` structure is filled with `memset`, `bios_putchar` calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which prints a character. Afterwards it checks if the serial port was initialized or not and writes a character there with [serial_putchar](https://github.com/torvalds/linux/blob/master/arch/x86/boot/tty.c#L30) and `inb/outb` instructions if it was set.
 
 Heap initialization
 --------------------------------------------------------------------------------
@@ -318,20 +319,20 @@ After the stack and bss section were prepared in [header.S](https://github.com/t
 First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L21) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) in the kernel setup header and calculates the end of the stack if this flag was set:
 
 ```C
-	char *stack_end;
+    char *stack_end;
 
-	if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
-		asm("leal %P1(%%esp),%0"
-		    : "=r" (stack_end) : "i" (-STACK_SIZE));
+    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
+        asm("leal %P1(%%esp),%0"
+            : "=r" (stack_end) : "i" (-STACK_SIZE));
 ```
 
 or in other words `stack_end = esp - STACK_SIZE`.
 
 Then there is the `heap_end` calculation:
 ```c
-	heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
+    heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
 ```
-which means `heap_end_ptr` or `_end` + `512`(`0x200h`). And at the last is checked that whether `heap_end` is greater than `stack_end`. If it is then `stack_end` is assigned to `heap_end` to make them equal.
+which means `heap_end_ptr` or `_end` + `512` (`0x200`). The last check is whether `heap_end` is greater than `stack_end`. If it is, then `heap_end` is set to `stack_end` to make them equal.
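
To summarize the calculation, here is a small user-space model of the clamping described above (the addresses are made up and this is only a sketch of the logic, not the code from main.c):

```C
#include <stdio.h>
#include <stdint.h>

/* Toy model: the heap may not grow into the stack, so heap_end is capped
 * at stack_end. All values are hypothetical. */
int main(void)
{
    uintptr_t esp          = 0x9000;            /* pretend stack pointer  */
    uintptr_t stack_size   = 0x400;             /* pretend STACK_SIZE     */
    uintptr_t stack_end    = esp - stack_size;

    uintptr_t heap_end_ptr = 0x8e00;            /* pretend value from hdr */
    uintptr_t heap_end     = heap_end_ptr + 0x200;

    if (heap_end > stack_end)                   /* keep heap below stack  */
        heap_end = stack_end;

    printf("stack_end=0x%lx heap_end=0x%lx\n",
           (unsigned long)stack_end, (unsigned long)heap_end);
    return 0;
}
```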
 
 Now the heap is initialized and we can use it via the `GET_HEAP` method. We will see how it is used and how it is implemented in the next posts.
 
@@ -343,10 +344,10 @@ The next step as we can see is cpu validation by `validate_cpu` from [arch/x86/b
 It calls the [`check_cpu`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpucheck.c#L102) function and passes cpu level and required cpu level to it and checks that the kernel launches on the right cpu level.
 ```c
 check_cpu(&cpu_level, &req_level, &err_flags);
-	if (cpu_level < req_level) {
+if (cpu_level < req_level) {
     ...
-	return -1;
-	}
+    return -1;
+}
 ```
 `check_cpu` checks the cpu's flags, presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) in case of x86_64(64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc.
 
@@ -358,11 +359,11 @@ The next step is memory detection by the `detect_memory` function. `detect_memor
 Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call:
 
 ```assembly
-	initregs(&ireg);
-	ireg.ax  = 0xe820;
-	ireg.cx  = sizeof buf;
-	ireg.edx = SMAP;
-	ireg.di  = (size_t)&buf;
+    initregs(&ireg);
+    ireg.ax  = 0xe820;
+    ireg.cx  = sizeof buf;
+    ireg.edx = SMAP;
+    ireg.di  = (size_t)&buf;
 ```
 
 * `ax` contains the number of the function (0xe820 in our case)
@@ -374,8 +375,8 @@ Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/
 Next is a loop where data about the memory will be collected. It starts from the call of the `0x15` BIOS interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously:
 
 ```C
-	intcall(0x15, &ireg, &oreg);
-	ireg.ebx = oreg.ebx;
+    intcall(0x15, &ireg, &oreg);
+    ireg.ebx = oreg.ebx;
 ```
 
 Ultimately, it does iterations in the loop to collect data from the address allocation table and writes this data into the `e820entry` array:
@@ -401,15 +402,15 @@ Keyboard initialization
 
 The next step is the initialization of the keyboard with the call of the [`keyboard_init()`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L65) function. At first `keyboard_init` initializes registers using the `initregs` function and then calls the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt to get the keyboard status.
 ```c
-	initregs(&ireg);
-	ireg.ah = 0x02;		/* Get keyboard status */
-	intcall(0x16, &ireg, &oreg);
-	boot_params.kbd_status = oreg.al;
+    initregs(&ireg);
+    ireg.ah = 0x02;     /* Get keyboard status */
+    intcall(0x16, &ireg, &oreg);
+    boot_params.kbd_status = oreg.al;
 ```
 After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set repeat rate and delay.
 ```c
-	ireg.ax = 0x0305;	/* Set keyboard repeat rate */
-	intcall(0x16, &ireg, NULL);
+    ireg.ax = 0x0305;   /* Set keyboard repeat rate */
+    intcall(0x16, &ireg, NULL);
 ```
 
 Querying
@@ -422,99 +423,99 @@ The [query_mca](https://github.com/torvalds/linux/blob/master/arch/x86/boot/mca.
 ```c
 int query_mca(void)
 {
-	struct biosregs ireg, oreg;
-	u16 len;
+    struct biosregs ireg, oreg;
+    u16 len;
 
-	initregs(&ireg);
-	ireg.ah = 0xc0;
-	intcall(0x15, &ireg, &oreg);
+    initregs(&ireg);
+    ireg.ah = 0xc0;
+    intcall(0x15, &ireg, &oreg);
 
-	if (oreg.eflags & X86_EFLAGS_CF)
-		return -1;	/* No MCA present */
+    if (oreg.eflags & X86_EFLAGS_CF)
+        return -1;  /* No MCA present */
 
-	set_fs(oreg.es);
-	len = rdfs16(oreg.bx);
+    set_fs(oreg.es);
+    len = rdfs16(oreg.bx);
 
-	if (len > sizeof(boot_params.sys_desc_table))
-		len = sizeof(boot_params.sys_desc_table);
+    if (len > sizeof(boot_params.sys_desc_table))
+        len = sizeof(boot_params.sys_desc_table);
 
-	copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
-	return 0;
+    copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
+    return 0;
 }
 ```
 
-It fills  the `ah` register with `0xc0` and calls the `0x15` BIOS interruption. After the interrupt execution it checks  the [carry flag](http://en.wikipedia.org/wiki/Carry_flag) and if it is set to 1, the BIOS doesn't support (**MCA**)[https://en.wikipedia.org/wiki/Micro_Channel_architecture]. If carry flag is set to 0, `ES:BX` will contain a pointer to the system information table, which looks like this:
+It fills the `ah` register with `0xc0` and calls the `0x15` BIOS interrupt. After the interrupt execution it checks the [carry flag](http://en.wikipedia.org/wiki/Carry_flag) and if it is set to 1, the BIOS doesn't support [**MCA**](https://en.wikipedia.org/wiki/Micro_Channel_architecture). If the carry flag is set to 0, `ES:BX` will contain a pointer to the system information table, which looks like this:
 
 ```
-Offset	Size	Description	)
- 00h	WORD	number of bytes following
- 02h	BYTE	model (see #00515)
- 03h	BYTE	submodel (see #00515)
- 04h	BYTE	BIOS revision: 0 for first release, 1 for 2nd, etc.
- 05h	BYTE	feature byte 1 (see #00510)
- 06h	BYTE	feature byte 2 (see #00511)
- 07h	BYTE	feature byte 3 (see #00512)
- 08h	BYTE	feature byte 4 (see #00513)
- 09h	BYTE	feature byte 5 (see #00514)
+Offset  Size    Description
+ 00h    WORD    number of bytes following
+ 02h    BYTE    model (see #00515)
+ 03h    BYTE    submodel (see #00515)
+ 04h    BYTE    BIOS revision: 0 for first release, 1 for 2nd, etc.
+ 05h    BYTE    feature byte 1 (see #00510)
+ 06h    BYTE    feature byte 2 (see #00511)
+ 07h    BYTE    feature byte 3 (see #00512)
+ 08h    BYTE    feature byte 4 (see #00513)
+ 09h    BYTE    feature byte 5 (see #00514)
 ---AWARD BIOS---
- 0Ah  N BYTEs	AWARD copyright notice
+ 0Ah  N BYTEs   AWARD copyright notice
 ---Phoenix BIOS---
- 0Ah	BYTE	??? (00h)
- 0Bh	BYTE	major version
- 0Ch	BYTE	minor version (BCD)
- 0Dh  4 BYTEs	ASCIZ string "PTL" (Phoenix Technologies Ltd)
+ 0Ah    BYTE    ??? (00h)
+ 0Bh    BYTE    major version
+ 0Ch    BYTE    minor version (BCD)
+ 0Dh  4 BYTEs   ASCIZ string "PTL" (Phoenix Technologies Ltd)
 ---Quadram Quad386---
- 0Ah 17 BYTEs	ASCII signature string "Quadram Quad386XT"
+ 0Ah 17 BYTEs   ASCII signature string "Quadram Quad386XT"
 ---Toshiba (Satellite Pro 435CDS at least)---
- 0Ah  7 BYTEs	signature "TOSHIBA"
- 11h	BYTE	??? (8h)
- 12h	BYTE	??? (E7h) product ID??? (guess)
- 13h  3 BYTEs	"JPN"
+ 0Ah  7 BYTEs   signature "TOSHIBA"
+ 11h    BYTE    ??? (8h)
+ 12h    BYTE    ??? (E7h) product ID??? (guess)
+ 13h  3 BYTEs   "JPN"
  ```
 
-Next we call the `set_fs` routine and pass the value of the `es` register to it. Implementation of `set_fs` is pretty simple:
+Next we call the `set_fs` routine and pass the value of the `es` register to it. The implementation of `set_fs` is pretty simple:
 
 ```c
 static inline void set_fs(u16 seg)
 {
-	asm volatile("movw %0,%%fs" : : "rm" (seg));
+    asm volatile("movw %0,%%fs" : : "rm" (seg));
 }
 ```
 
 This function contains inline assembly which takes the value of the `seg` parameter and puts it into the `fs` register. There are many functions like `set_fs` in [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h), for example `set_gs` for writing the `gs` register, and `fs`, `gs` for reading these registers, etc.
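
The reading counterparts almost certainly follow the same one-instruction pattern; a sketch of what `fs()` and `gs()` look like (check boot.h for the exact definitions) is:

```C
/* Sketch of the read helpers following the same pattern as set_fs above
 * (see arch/x86/boot/boot.h for the real definitions): one inline-asm
 * instruction that copies the segment register into a general purpose
 * register and returns it. */
static inline u16 fs(void)
{
    u16 seg;
    asm volatile("movw %%fs,%0" : "=rm" (seg));
    return seg;
}

static inline u16 gs(void)
{
    u16 seg;
    asm volatile("movw %%gs,%0" : "=rm" (seg));
    return seg;
}
```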
 
-At the end of `query_mca` it just copies the table which pointed to by `es:bx` to the `boot_params.sys_desc_table`.
+At the end of `query_mca` it just copies the table pointed to by `es:bx` to the `boot_params.sys_desc_table`.
 
 The next step is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information by calling the `query_ist` function. First of all it checks the CPU level and if it is correct, calls `0x15` for getting info and saves the result to `boot_params`.
 
-The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` = `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions checks `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface is supported).
+The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interrupt too, but with `ah` = `0x53` to check `APM` installation. After the `0x15` execution, the `query_apm_bios` function checks the `PM` signature (it must be `0x504d`), the carry flag (it must be 0 if `APM` is supported) and the value of the `cx` register (if it's 0x02, the protected mode interface is supported).
 
-Next it calls the `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connecting the 32-bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
+Next it calls `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connecting the 32-bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
 
-Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in configuration file:
+Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in the configuration file:
 
 ```C
 #if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
-	query_apm_bios();
+    query_apm_bios();
 #endif
 ```
 
 The last is the [`query_edd`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c#L122) function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look into the `query_edd` implementation.
 
-First of all it reads the [edd](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt#L1023) option from kernel's command line and if it was set to `off` then `query_edd` just returns.
+First of all it reads the [edd](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt#L1023) option from the kernel's command line and if it was set to `off` then `query_edd` just returns.
 
 If EDD is enabled, `query_edd` goes over BIOS-supported hard disks and queries EDD information in the following loop:
 
 ```C
-	for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
-		if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
-			memcpy(edp, &ei, sizeof ei);
-			edp++;
-			boot_params.eddbuf_entries++;
-		}
-		...
-		...
-		...
+for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
+    if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
+        memcpy(edp, &ei, sizeof ei);
+        edp++;
+        boot_params.eddbuf_entries++;
+    }
+    ...
+    ...
+    ...
 ```
 
 where `0x80` is the first hard drive and the value of `EDD_MBR_SIG_MAX` macro is 16. It collects data into the array of [edd_info](https://github.com/torvalds/linux/blob/master/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored.
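
A heavily simplified sketch of that two-step query, using the `initregs`/`intcall` helpers we have already seen (this only illustrates the flow described above; the actual code in edd.c checks magic values and does more work):

```C
/* Simplified illustration of the flow described above (not the exact
 * arch/x86/boot/edd.c code): first "check extensions present" (ah = 0x41),
 * then "get drive parameters" (ah = 0x48) into a buffer addressed by si. */
static int get_edd_info_sketch(u8 devno, void *buf)
{
    struct biosregs ireg, oreg;

    initregs(&ireg);
    ireg.ah = 0x41;                 /* int 0x13: check EDD extensions */
    ireg.dl = devno;
    intcall(0x13, &ireg, &oreg);
    if (oreg.eflags & X86_EFLAGS_CF)
        return -1;                  /* no EDD for this drive */

    initregs(&ireg);
    ireg.ah = 0x48;                 /* int 0x13: get drive parameters */
    ireg.dl = devno;
    ireg.si = (size_t)buf;          /* the BIOS writes the result here */
    intcall(0x13, &ireg, &oreg);

    return (oreg.eflags & X86_EFLAGS_CF) ? -1 : 0;
}
```
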
@@ -522,11 +523,11 @@ where `0x80` is the first hard drive and the value of `EDD_MBR_SIG_MAX` macro is
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the second part about Linux kernel internals. In the next part we will see video mode setting and the rest of preparations before transition to protected mode and directly transitioning into it.
+This is the end of the second part about Linux kernel insides. In the next part we will see video mode setting and the rest of preparations before transition to protected mode and directly transitioning into it.
 
 If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
 
 Links
 --------------------------------------------------------------------------------

+ 78 - 78
Booting/linux-bootstrap-3.md

@@ -4,14 +4,14 @@ Kernel booting process. Part 3.
 Video mode initialization and transition to protected mode
 --------------------------------------------------------------------------------
 
-This is the third part of the `Kernel booting process` series. In the previous [part](linux-bootstrap-2.md#kernel-booting-process-part-2), we stopped right before the call of the `set_video` routine from the [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L181). In this part, we will see:
+This is the third part of the `Kernel booting process` series. In the previous [part](linux-bootstrap-2.md#kernel-booting-process-part-2), we stopped right before the call of the `set_video` routine from [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L181). In this part, we will see:
 - video mode initialization in the kernel setup code,
-- preparation before switching into the protected mode,
+- preparation before switching into protected mode,
 - transition to protected mode
 
 **NOTE** If you don't know anything about protected mode, you can find some information about it in the previous [part](linux-bootstrap-2.md#protected-mode). Also there are a couple of [links](linux-bootstrap-2.md#links) which can help you.
 
-As I wrote above, we will start from the `set_video` function which defined in the [arch/x86/boot/video.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video.c#L315) source code file. We can see that it starts by first getting the video mode from the `boot_params.hdr` structure:
+As I wrote above, we will start from the `set_video` function which is defined in the [arch/x86/boot/video.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video.c#L315) source code file. We can see that it starts by first getting the video mode from the `boot_params.hdr` structure:
 
 ```C
 u16 mode = boot_params.hdr.vid_mode;
@@ -37,28 +37,28 @@ vga=<mode>
 	line is parsed.
 ```
 
-So we can add `vga` option to the grub or another bootloader configuration file and it will pass this option to the kernel command line. This option can have different values as we can mentioned in the description, for example it can be an integer number `0xFFFD` or `ask`. If you pass `ask` t `vga`, you will see a menu like this:
+So we can add the `vga` option to the grub (or another bootloader) configuration file and it will pass this option to the kernel command line. This option can have different values as mentioned in the description. For example, it can be an integer number `0xFFFD` or `ask`. If you pass `ask` to `vga`, you will see a menu like this:
 
 ![video mode setup menu](http://oi59.tinypic.com/ejcz81.jpg)
 
-which will ask to select a video mode. We will look at it's implementation, but before diving into the implementation we have to look at some other things.
+which will ask you to select a video mode. We will look at its implementation, but before diving into it we have to look at some other things.
 
 Kernel data types
 --------------------------------------------------------------------------------
 
-Earlier we saw definitions of different data types like `u16` etc. in the kernel setup code. Let's look on a couple of data types provided by the kernel:
+Earlier we saw definitions of different data types like `u16` etc. in the kernel setup code. Let's look at a couple of data types provided by the kernel:
 
 
 | Type | char | short | int | long | u8 | u16 | u32 | u64 |
 |------|------|-------|-----|------|----|-----|-----|-----|
 | Size |  1   |   2   |  4  |   8  |  1 |  2  |  4  |  8  |
 
-If you read source code of the kernel, you'll see these very often and so it will be good to remember them.
+If you read the source code of the kernel, you'll see these very often, so it is good to remember them.
 
 Heap API
 --------------------------------------------------------------------------------
 
-After we have `vid_mode` from the `boot_params.hdr` in the `set_video` function we can see call to `RESET_HEAP` function. `RESET_HEAP` is a macro which defined in the [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h#L199). It is defined as:
+After we get `vid_mode` from `boot_params.hdr` in the `set_video` function, we can see the call to `RESET_HEAP`. It is a macro which is defined in [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h#L199) as:
 
 ```C
 #define RESET_HEAP() ((void *)( HEAP = _end ))
@@ -70,20 +70,20 @@ If you have read the second part, you will remember that we initialized the heap
 #define RESET_HEAP()
 ```
 
-As we saw just above it resets the heap by setting the `HEAP` variable equal to `_end`, where `_end` is just `extern char _end[];`
+As we saw just above, it resets the heap by setting the `HEAP` variable equal to `_end`, where `_end` is just `extern char _end[];`
 
-Next is `GET_HEAP` macro:
+Next is the `GET_HEAP` macro:
 
 ```C
 #define GET_HEAP(type, n) \
 	((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
 ```
 
-for heap allocation. It calls internal function `__get_heap` with 3 parameters:
+for heap allocation. It calls the internal function `__get_heap` with 3 parameters:
 
 * the size of a type in bytes, which needs to be allocated
-* `__alignof__(type)` shows how type of variable is aligned
-* `n` tells how many bytes to allocate
+* `__alignof__(type)` shows how variables of this type are aligned
+* `n` tells how many items to allocate
 
 Implementation of `__get_heap` is:
 
@@ -105,7 +105,7 @@ and further we will see its usage, something like:
 saved.data = GET_HEAP(u16, saved.x * saved.y);
 ```
 
-Let's try to understand how `__get_heap` works. We can see here that `HEAP` (which is equal to `_end` after `RESET_HEAP()`) is the address of aligned memory according to `a` parameter. After it we save memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of allocated block and return `tmp` which is start address of allocated memory.
+Let's try to understand how `__get_heap` works. We can see here that `HEAP` (which is equal to `_end` after `RESET_HEAP()`) is the address of aligned memory according to the `a` parameter. After this we save the memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of the allocated block and return `tmp` which is the start address of allocated memory.
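
If the align-then-bump idea still feels abstract, here is a tiny user-space model of it (a toy, not the kernel's `__get_heap`):

```C
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

static char arena[256];
static char *heap = arena;           /* plays the role of HEAP */

/* Toy model of align-then-bump allocation: round `heap` up to the
 * alignment `a`, remember that address, advance `heap` by s*n bytes
 * and return the remembered address. */
static void *get_heap_sketch(size_t s, size_t a, size_t n)
{
    heap = (char *)(((uintptr_t)heap + (a - 1)) & ~(uintptr_t)(a - 1));
    char *tmp = heap;
    heap += s * n;
    return tmp;
}

int main(void)
{
    unsigned short *p = get_heap_sketch(sizeof(unsigned short), 2, 10);
    printf("allocated 10 u16-sized items at %p\n", (void *)p);
    return 0;
}
```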
 
 And the last function is:
 
@@ -118,27 +118,27 @@ static inline bool heap_free(size_t n)
 
 which subtracts the value of `HEAP` from `heap_end` (we calculated it in the previous [part](linux-bootstrap-2.md)) and returns 1 if there is enough free memory for `n`.
 
-That's all. Now we have simple API for heap and can setup video mode.
+That's all. Now we have a simple API for the heap and can set up the video mode.
 
-Setup video mode
+Set up video mode
 --------------------------------------------------------------------------------
 
-Now we can move directly to video mode initialization. We stopped at the `RESET_HEAP()` call in the `set_video` function. Next is the call to  `store_mode_params` which stores video mode parameters in the `boot_params.screen_info` structure which is defined in the [include/uapi/linux/screen_info.h](https://github.com/0xAX/linux/blob/master/include/uapi/linux/screen_info.h).
+Now we can move directly to video mode initialization. We stopped at the `RESET_HEAP()` call in the `set_video` function. Next is the call to  `store_mode_params` which stores video mode parameters in the `boot_params.screen_info` structure which is defined in [include/uapi/linux/screen_info.h](https://github.com/0xAX/linux/blob/master/include/uapi/linux/screen_info.h).
 
-If we will look at `store_mode_params` function, we can see that it starts with the call to `store_cursor_position` function. As you can understand from the function name, it gets information about cursor and stores it.
+If we look at the `store_mode_params` function, we can see that it starts with the call to the `store_cursor_position` function. As you can understand from the function name, it gets information about cursor and stores it.
 
-First of all `store_cursor_position` initializes two variables which has type - `biosregs`, with `AH = 0x3` and calls `0x10` BIOS interruption. After interruption successfully executed, it returns row and column in the `DL` and `DH` registers. Row and column will be stored in the `orig_x` and `orig_y` fields from the the `boot_params.screen_info` structure.
+First of all `store_cursor_position` initializes two variables which have type `biosregs` with `AH = 0x3`, and calls `0x10` BIOS interruption. After the interruption is successfully executed, it returns row and column in the `DL` and `DH` registers. Row and column will be stored in the `orig_x` and `orig_y` fields from the `boot_params.screen_info` structure.
 
-After `store_cursor_position` executed, `store_video_mode` function will be called. It just gets current video mode and stores it in the `boot_params.screen_info.orig_video_mode`. 
+After `store_cursor_position` is executed, the `store_video_mode` function will be called. It just gets the current video mode and stores it in `boot_params.screen_info.orig_video_mode`. 
 
-After this, it checks current video mode and sets the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are for video memory:
+After this, it checks the current video mode and sets the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are for video memory:
 
 ```
 0xB000:0x0000 	32 Kb 	Monochrome Text Video Memory
 0xB800:0x0000 	32 Kb 	Color Text Video Memory
 ```
 
-So we set the `video_segment` variable to `0xB000` if current video mode is MDA, HGC, VGA in monochrome mode or `0xB800` in color mode. After setup of the address of the video segment font size needs to be stored in the `boot_params.screen_info.orig_video_points` with:
+So we set the `video_segment` variable to `0xB000` if the current video mode is MDA, HGC, or VGA in monochrome mode and to `0xB800` if the current video mode is in color mode. After setting up the address of the video segment, font size needs to be stored in `boot_params.screen_info.orig_video_points` with:
 
 ```C
 set_fs(0);
@@ -146,16 +146,16 @@ font_size = rdfs16(0x485);
 boot_params.screen_info.orig_video_points = font_size;
 ```
 
-First of all we put 0 to the `FS` register with `set_fs` function. We already saw functions like `set_fs` in the previous part. They are all defined in the [boot.h](https://github.com/0xAX/linux/blob/master/arch/x86/boot/boot.h). Next we read value which is located at address `0x485` (this memory location is used to get the font size) and save font size in the `boot_params.screen_info.orig_video_points`.
+First of all we put 0 in the `FS` register with the `set_fs` function. We already saw functions like `set_fs` in the previous part. They are all defined in [boot.h](https://github.com/0xAX/linux/blob/master/arch/x86/boot/boot.h). Next we read the value which is located at address `0x485` (this memory location is used to get the font size) and save the font size in `boot_params.screen_info.orig_video_points`.
 
 ```
  x = rdfs16(0x44a);
  y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;
 ```
 
-Next we get amount of columns by `0x44a` and rows by address `0x484` and store them in the `boot_params.screen_info.orig_video_cols` and `boot_params.screen_info.orig_video_lines`. After this, execution of the `store_mode_params` is finished.
+Next we get the amount of columns by address `0x44a` and rows by address `0x484` and store them in `boot_params.screen_info.orig_video_cols` and `boot_params.screen_info.orig_video_lines`. After this, execution of `store_mode_params` is finished.
 
-Next we can see `save_screen` function which just saves screen content to the heap. This function collects all data which we got in the previous functions like rows and columns amount etc. and stores it in the `saved_screen` structure, which is defined as:
+Next we can see the `save_screen` function which just saves screen content to the heap. This function collects all data which we got in the previous functions like rows and columns amount etc. and stores it in the `saved_screen` structure, which is defined as:
 
 ```C
 static struct saved_screen {
@@ -174,7 +174,7 @@ if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
 
 and allocates space in the heap if it is enough and stores `saved_screen` in it.
 
-The next call is `probe_cards(0)` from the [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L33). It goes over all video_cards and collects number of modes provided by the cards. Here is the interesting moment, we can see the loop:
+The next call is `probe_cards(0)` from [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L33). It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting moment, we can see the loop:
 
 ```C
 for (card = video_cards; card < video_cards_end; card++) {
@@ -182,7 +182,7 @@ for (card = video_cards; card < video_cards_end; card++) {
 }
 ```
 
-but `video_cards` not declared anywhere. Answer is simple: Every video mode presented in the x86 kernel setup code has definition like this:
+but `video_cards` is not declared anywhere. The answer is simple: every video mode presented in the x86 kernel setup code has a definition like this:
 
 ```C
 static __videocard video_vga = {
@@ -213,7 +213,7 @@ struct card_info {
 };
 ```
 
-is in the `.videocards` segment. Let's look in the [arch/x86/boot/setup.ld](https://github.com/0xAX/linux/blob/master/arch/x86/boot/setup.ld) linker file, we can see there:
+is in the `.videocards` segment. Let's look in the [arch/x86/boot/setup.ld](https://github.com/0xAX/linux/blob/master/arch/x86/boot/setup.ld) linker script, where we can find:
 
 ```
 	.videocards	: {
@@ -223,13 +223,13 @@ is in the `.videocards` segment. Let's look in the [arch/x86/boot/setup.ld](http
 	}
 ```
 
-It means that `video_cards` is just memory address and all `card_info` structures are placed in this  segment. It means that all `card_info` structures are placed between `video_cards` and `video_cards_end`, so we can use it in a loop to go over all of it.  After `probe_cards` executed we have all structures like `static __videocard video_vga` with filled `nmodes` (number of video modes).
+It means that `video_cards` is just a memory address and all `card_info` structures are placed in this  segment. It means that all `card_info` structures are placed between `video_cards` and `video_cards_end`, so we can use it in a loop to go over all of it.  After `probe_cards` executes we have all structures like `static __videocard video_vga` with filled `nmodes` (number of video modes).
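
The same linker-section trick can be shown in isolation with a generic user-space sketch (made-up names, GCC/GNU ld only; this is not the kernel's actual `__videocard` machinery): objects placed in one named section end up laid out contiguously, and the linker provides start/stop symbols we can iterate between.

```C
#include <stdio.h>

/* Generic illustration of an "array built by the linker": for a section
 * whose name is a valid C identifier, GNU ld automatically provides the
 * __start_<section> and __stop_<section> symbols. */
struct card_demo {
    const char *name;
};

__attribute__((used, section("demo_cards")))
static const struct card_demo vga_card = { "vga" };

__attribute__((used, section("demo_cards")))
static const struct card_demo vesa_card = { "vesa" };

extern const struct card_demo __start_demo_cards[];
extern const struct card_demo __stop_demo_cards[];

int main(void)
{
    for (const struct card_demo *c = __start_demo_cards;
         c < __stop_demo_cards; c++)
        printf("found card: %s\n", c->name);
    return 0;
}
```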
 
-After `probe_cards` execution is finished, we move to the main loop in the `set_video` function. There is infinite loop which tries to setup video mode with the `set_mode` function or prints a menu if we passed `vid_mode=ask` to the kernel command line or video mode is undefined. 
+After `probe_cards` execution is finished, we move to the main loop in the `set_video` function. There is an infinite loop which tries to set up video mode with the `set_mode` function or prints a menu if we passed `vid_mode=ask` to the kernel command line or video mode is undefined. 
 
-The `set_mode` function is defined in the [video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L147) and gets only one parameter, `mode` which is the number of video mode (we got it or from the menu or in the start of the `setup_video`, from kernel setup header). 
+The `set_mode` function is defined in [video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L147) and gets only one parameter, `mode`, which is the number of the video mode (we got it either from the menu or, at the start of `set_video`, from the kernel setup header).
 
-`set_mode` function checks the `mode` and calls `raw_set_mode` function. The `raw_set_mode` calls `set_mode` function for selected card i.e. `card->set_mode(struct mode_info*)`. We can get access to this function from the `card_info` structure, every video mode defines this structure with values filled depending upon the video mode (for example for `vga` it is `video_vga.set_mode` function, see above example of `card_info` structure for `vga`). `video_vga.set_mode` is `vga_set_mode`, which checks the vga mode and calls the respective function:
+The `set_mode` function checks the `mode` and calls the `raw_set_mode` function. The `raw_set_mode` calls the `set_mode` function for the selected card i.e. `card->set_mode(struct mode_info*)`. We can get access to this function from the `card_info` structure. Every video mode defines this structure with values filled depending upon the video mode (for example for `vga` it is the `video_vga.set_mode` function. See above example of `card_info` structure for `vga`). `video_vga.set_mode` is `vga_set_mode`, which checks the vga mode and calls the respective function:
 
 ```C
 static int vga_set_mode(struct mode_info *mode)
@@ -265,24 +265,24 @@ static int vga_set_mode(struct mode_info *mode)
 }
 ```
 
-Every function which setups video mode, just calls `0x10` BIOS interrupt with certain value in the `AH` register.
+Every function which sets up video mode just calls the `0x10` BIOS interrupt with a certain value in the `AH` register.
 
-After we have set video mode, we pass it to the `boot_params.hdr.vid_mode`.
+After we have set video mode, we pass it to `boot_params.hdr.vid_mode`.
 
-Next `vesa_store_edid` is called. This function simply stores the [EDID](https://en.wikipedia.org/wiki/Extended_Display_Identification_Data) (**E**xtended **D**isplay **I**dentification **D**ata) information for kernel use. After this `store_mode_params` is called again. Lastly, if `do_restore` is set, screen is restored to an earlier state.
+Next `vesa_store_edid` is called. This function simply stores the [EDID](https://en.wikipedia.org/wiki/Extended_Display_Identification_Data) (**E**xtended **D**isplay **I**dentification **D**ata) information for kernel use. After this `store_mode_params` is called again. Lastly, if `do_restore` is set, the screen is restored to an earlier state.
 
 After this we have set video mode and now we can switch to the protected mode.
 
 Last preparation before transition into protected mode
 --------------------------------------------------------------------------------
 
-We can see the last function call - `go_to_protected_mode` in the [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L184). As the comment says: `Do the last things and invoke protected mode`, so let's see these last things and switch into the protected mode.
+We can see the last function call - `go_to_protected_mode` - in [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L184). As the comment says: `Do the last things and invoke protected mode`, so let's see these last things and switch into protected mode.
 
-`go_to_protected_mode` defined in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c#L104). It contains some functions which make last preparations before we can jump into protected mode, so let's look on it and try to understand what they do and how it works.
+`go_to_protected_mode` is defined in [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c#L104). It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what they do and how it works.
 
-First is the call to `realmode_switch_hook` function in the `go_to_protected_mode`. This function invokes real mode switch hook if it is present and disables [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). Hooks are used if bootloader runs in a hostile environment. You can read more about hooks in the [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**).
+First is the call to the `realmode_switch_hook` function in `go_to_protected_mode`. This function invokes the real mode switch hook if it is present and disables [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). Hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**).
 
-`readlmode_swtich` hook presents pointer to the 16-bit real mode far subroutine which disables non-maskable interrupts. After `realmode_switch` hook (it isn't present for me) is checked, disabling of Non-Maskable Interrupts(NMI) occurs:
+The `realmode_switch` hook provides a pointer to a 16-bit real mode far subroutine which disables non-maskable interrupts. After the `realmode_switch` hook is checked (it isn't present in my case), the disabling of Non-Maskable Interrupts (NMI) occurs:
 
 ```assembly
 asm volatile("cli");
@@ -290,11 +290,11 @@ outb(0x80, 0x70);	/* Disable NMI */
 io_delay();
 ```
 
-At first there is inline assembly instruction with `cli` instruction which clears the interrupt flag (`IF`). After this, external interrupts are disabled. Next line disables NMI (non-maskable interrupt).
+At first there is an inline assembly instruction with a `cli` instruction which clears the interrupt flag (`IF`). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).
 
-Interrupt is a signal to the CPU which is emitted by hardware or software. After getting signal, CPU suspends current instructions sequence, saves its state and transfers control to the interrupt handler. After interrupt handler has finished it's work, it transfers control to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. It cannot be ignored and is typically used to signal for non-recoverable hardware errors. We will not dive into details of interrupts now, but will discuss it in the next posts.
+An interrupt is a signal to the CPU which is emitted by hardware or software. After getting the signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now, but will discuss them in the next posts.
 
-Let's get back to the code. We can see that second line is writing `0x80` (disabled bit) byte to the `0x70` (CMOS Address register). After that call to the `io_delay` function occurs. `io_delay` causes a small delay and looks like:
+Let's get back to the code. We can see that the second line writes the byte `0x80` (the disable bit) to `0x70` (the CMOS Address register). After that, a call to the `io_delay` function occurs. `io_delay` causes a small delay and looks like:
 
 ```C
 static inline void io_delay(void)
@@ -304,9 +304,9 @@ static inline void io_delay(void)
 }
 ```
 
-Outputting any byte to the port `0x80` should delay exactly 1 microsecond. So we can write any value (value from `AL` register in our case) to the `0x80` port. After this delay `realmode_switch_hook` function has finished execution and we can move to the next function.
+Outputting any byte to port `0x80` should delay exactly 1 microsecond. So we can write any value (the value from the `AL` register in our case) to port `0x80`. After this delay the `realmode_switch_hook` function has finished execution and we can move to the next function.
 
-The next function is `enable_a20`, which enables [A20 line](http://en.wikipedia.org/wiki/A20_line). This function is defined in the [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/a20.c) and it tries to enable A20 gate with different methods. The first is `a20_test_short` function which checks is A20 already enabled or not with `a20_test` function:
+The next function is `enable_a20`, which enables [A20 line](http://en.wikipedia.org/wiki/A20_line). This function is defined in [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/a20.c) and it tries to enable the A20 gate with different methods. The first is the `a20_test_short` function which checks if A20 is already enabled or not with the `a20_test` function:
 
 ```C
 static int a20_test(int loops)
@@ -332,11 +332,11 @@ static int a20_test(int loops)
 }
 ```
 
-First of all we put `0x0000` to the `FS` register and `0xffff` to the `GS` register. Next we read value by address `A20_TEST_ADDR` (it is `0x200`) and put this value into `saved` variable and `ctr`.
+First of all we put `0x0000` in the `FS` register and `0xffff` in the `GS` register. Next we read the value in address `A20_TEST_ADDR` (it is `0x200`) and put this value into the `saved` variable and `ctr`.
 
-Next we write updated `ctr` value into `fs:gs` with `wrfs32` function, then delay for 1ms, and then read the value into the `GS` register by address `A20_TEST_ADDR+0x10`, if it's not zero we already have enabled A20 line. If A20 is disabled, we try to enable it with a different method which you can find in the `a20.c`. For example with call of `0x15` BIOS interrupt with `AH=0x2041` etc.
+Next we write an updated `ctr` value into `fs:A20_TEST_ADDR` with the `wrfs32` function, then delay for 1ms, and then read the value at `gs:A20_TEST_ADDR+0x10`. If it differs from the value we wrote, the A20 line is already enabled. If A20 is disabled, we try to enable it with the different methods which you can find in `a20.c`, for example by calling the `0x15` BIOS interrupt with `AX=0x2401`, etc.
 
-If `enabled_a20` function finished with fail, print an error message and call function `die`. You can remember it from the first source code file where we started - [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S):
+If the `enable_a20` function fails, an error message is printed and the `die` function is called. You may remember it from the first source code file where we started - [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S):
 
 ```assembly
 die:
@@ -345,26 +345,26 @@ die:
 	.size	die, .-die
 ```
 
-After the A20 gate is successfully enabled, `reset_coprocessor` function is called:
+After the A20 gate is successfully enabled, the `reset_coprocessor` function is called:
  ```C
 outb(0, 0xf0);
 outb(0, 0xf1);
 ```
 This function clears the Math Coprocessor by writing `0` to `0xf0` and then resets it by writing `0` to `0xf1`.
 
-After this `mask_all_interrupts` function is called:
+After this, the `mask_all_interrupts` function is called:
 ```C
 outb(0xff, 0xa1);       /* Mask all interrupts on the secondary PIC */
 outb(0xfb, 0x21);       /* Mask all but cascade on the primary PIC */
 ```
 This masks all interrupts on the secondary PIC (Programmable Interrupt Controller) and primary PIC except for IRQ2 on the primary PIC.
 
-And after all of these preparations, we can see actual transition into protected mode.
+And after all of these preparations, we can see the actual transition into protected mode.
 
-Setup Interrupt Descriptor Table
+Set up Interrupt Descriptor Table
 --------------------------------------------------------------------------------
 
-Now we setup the Interrupt Descriptor table (IDT). `setup_idt`:
+Now we set up the Interrupt Descriptor table (IDT). `setup_idt`:
 
 ```C
 static void setup_idt(void)
@@ -374,7 +374,7 @@ static void setup_idt(void)
 }
 ```
 
-which setups the Interrupt Descriptor Table (describes interrupt handlers and etc.). For now IDT is not installed (we will see it later), but now we just load IDT with `lidtl` instruction. `null_idt` contains address and size of IDT, but now they are just zero. `null_idt` is a `gdt_ptr` structure, it as defined as:
+which sets up the Interrupt Descriptor Table (which describes interrupt handlers etc.). For now the IDT is not installed (we will see this later); here we just load the IDT with the `lidtl` instruction. `null_idt` contains the address and size of the IDT, but for now they are just zero. `null_idt` is a `gdt_ptr` structure; it is defined as:
 ```C
 struct gdt_ptr {
 	u16 len;
@@ -382,12 +382,12 @@ struct gdt_ptr {
 } __attribute__((packed));
 ```
 
-where we can see - 16-bit length(`len`) of IDT and 32-bit pointer to it (More details about IDT and interruptions we will see in the next posts). ` __attribute__((packed))` means here that size of `gdt_ptr` minimum as required. So size of the `gdt_ptr` will be 6 bytes here or 48 bits. (Next we will load pointer to the `gdt_ptr` to the `GDTR` register and you might remember from the previous post that it is 48-bits in size).
+where we can see the 16-bit length (`len`) of the IDT and the 32-bit pointer to it (more details about the IDT and interrupts will come in the next posts). `__attribute__((packed))` means that the size of `gdt_ptr` is the minimum required size. So the size of `gdt_ptr` will be 6 bytes here, or 48 bits. (Next we will load the pointer to the `gdt_ptr` into the `GDTR` register; you might remember from the previous post that it is 48 bits in size.)
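
To see the effect of `packed` concretely, here is a tiny standalone comparison (plain user-space C, not kernel code; the padded size may differ on exotic ABIs):

```C
#include <stdio.h>
#include <stdint.h>

struct gdt_ptr_packed {
    uint16_t len;
    uint32_t ptr;
} __attribute__((packed));

struct gdt_ptr_default {
    uint16_t len;
    uint32_t ptr;
};

int main(void)
{
    /* packed: 2 + 4 = 6 bytes; default: usually padded to 8 bytes */
    printf("packed:  %zu bytes\n", sizeof(struct gdt_ptr_packed));
    printf("default: %zu bytes\n", sizeof(struct gdt_ptr_default));
    return 0;
}
```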
 
-Setup Global Descriptor Table
+Set up Global Descriptor Table
 --------------------------------------------------------------------------------
 
-Next is the setup of Global Descriptor Table (GDT). We can see `setup_gdt` function which sets up GDT (you can read about it in the [Kernel booting process. Part 2.](linux-bootstrap-2.md#protected-mode)). There is definition of the `boot_gdt` array in this function, which contains definition of the three segments:
+Next is the setup of the Global Descriptor Table (GDT). We can see the `setup_gdt` function which sets up GDT (you can read about it in the [Kernel booting process. Part 2.](linux-bootstrap-2.md#protected-mode)). There is a definition of the `boot_gdt` array in this function, which contains the definition of the three segments:
 
 ```C
 	static const u64 boot_gdt[] __attribute__((aligned(16))) = {
@@ -397,7 +397,7 @@ Next is the setup of Global Descriptor Table (GDT). We can see `setup_gdt` funct
 	};
 ```
 
-For code, data and TSS (Task State Segment). We will not use task state segment for now, it was added there to make Intel VT happy as we can see in the comment line (if you're interesting you can find commit which describes it - [here](https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31)). Let's look on `boot_gdt`. First of all note that it has `__attribute__((aligned(16)))` attribute. It means that this structure will be aligned by 16 bytes. Let's look at a simple example:
+For code, data and TSS (Task State Segment). We will not use the task state segment for now, it was added there to make Intel VT happy as we can see in the comment line (if you're interested you can find commit which describes it - [here](https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31)). Let's look at `boot_gdt`. First of all note that it has the `__attribute__((aligned(16)))` attribute. It means that this structure will be aligned by 16 bytes. Let's look at a simple example:
 ```C
 #include <stdio.h>
 
@@ -421,7 +421,7 @@ int main(void)
 }
 ```
 
-Technically structure which contains one `int` field, must be 4 bytes, but here `aligned` structure will be 16 bytes:
+Technically a structure which contains one `int` field must be 4 bytes, but here `aligned` structure will be 16 bytes:
 
 ```
 $ gcc test.c -o test && test
@@ -431,13 +431,13 @@ Aligned - 16
 
 `GDT_ENTRY_BOOT_CS` has index 2 here, `GDT_ENTRY_BOOT_DS` is `GDT_ENTRY_BOOT_CS + 1`, etc. It starts from 2, because the first entry is a mandatory null descriptor (index 0) and the second is not used (index 1).
 
-`GDT_ENTRY` is a macro which takes flags, base and limit and builds GDT entry. For example let's look on the code segment entry. `GDT_ENTRY` takes following values:
+`GDT_ENTRY` is a macro which takes flags, base and limit and builds GDT entry. For example let's look at the code segment entry. `GDT_ENTRY` takes following values:
 
 * base  - 0
 * limit - 0xfffff
 * flags - 0xc09b
 
-What does it mean? Segment's base address is 0, limit (size of segment) is - `0xffff` (1 MB). Let's look on flags. It is `0xc09b` and it will be:
+What does this mean? The segment's base address is 0, and the limit (the size of the segment) is `0xfffff` (1 MB). Let's look at the flags. The value is `0xc09b` and it will be:
 
 ```
 1100 0000 1001 1011
@@ -458,23 +458,23 @@ in binary. Let's try to understand what every bit means. We will go through all
 
 You can read more about every bit in the previous [post](linux-bootstrap-2.md) or in the [Intel® 64 and IA-32 Architectures Software Developer's Manuals 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html).
 
-After this we get length of GDT with:
+After this we get the length of the GDT with:
 
 ```C
 gdt.len = sizeof(boot_gdt)-1;
 ```
 
-We get size of `boot_gdt` and subtract 1 (the last valid address in the GDT).
+We get the size of `boot_gdt` and subtract 1 (the last valid address in the GDT).
 
-Next we get pointer to the GDT with:
+Next we get a pointer to the GDT with:
 
 ```C
 gdt.ptr = (u32)&boot_gdt + (ds() << 4);
 ```
 
-Here we just get address of `boot_gdt` and add it to address of data segment left-shifted by 4 bits (remember we're in the real mode now).
+Here we just take the offset of `boot_gdt` and add to it the address of the data segment left-shifted by 4 bits (remember we're in real mode now), which gives the physical address of the GDT.
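
As a small worked example of this real-mode address arithmetic (with made-up segment and offset values):

```C
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Made-up values: a real-mode segment of 0x1000 and an offset of
     * 0x0200. The linear (physical) address is segment * 16 + offset. */
    uint32_t ds_value = 0x1000;
    uint32_t offset   = 0x0200;
    uint32_t physical = (ds_value << 4) + offset;

    printf("physical address = 0x%05X\n", (unsigned int)physical);   /* 0x10200 */
    return 0;
}
```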
 
-Lastly we execute `lgdtl` instruction to load GDT into GDTR register:
+Lastly we execute the `lgdtl` instruction to load the GDT into the GDTR register:
 
 ```C
 asm volatile("lgdtl %0" : : "m" (gdt));
@@ -483,20 +483,20 @@ asm volatile("lgdtl %0" : : "m" (gdt));
 Actual transition into protected mode
 --------------------------------------------------------------------------------
 
-It is the end of `go_to_protected_mode` function. We loaded IDT, GDT, disable interruptions and now can switch CPU into protected mode. The last step we call `protected_mode_jump` function with two parameters:
+This is the end of the `go_to_protected_mode` function. We have loaded the IDT and GDT, disabled interrupts and can now switch the CPU into protected mode. The last step is calling the `protected_mode_jump` function with two parameters:
 
 ```C
 protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));
 ```
 
-which is defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S#L26). It takes two parameters:
+which is defined in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S#L26). It takes two parameters:
 
 * address of protected mode entry point
 * address of `boot_params`
 
-Let's look inside `protected_mode_jump`. As I wrote above, you can find it in the `arch/x86/boot/pmjump.S`. First parameter will be in `eax` register and second is in `edx`.
+Let's look inside `protected_mode_jump`. As I wrote above, you can find it in `arch/x86/boot/pmjump.S`. The first parameter will be in the `eax` register and the second in `edx`.
 
-First of all we put address of `boot_params` in the `esi` register and address of code segment register `cs` (0x1000) in the `bx`. After this we shift `bx` by 4 bits and add address of label `2` to it (we will have physical address of label `2` in the `bx` after it) and jump to label `1`. Next we put data segment and task state segment in the `cs` and `di` registers with:
+First of all we put the address of `boot_params` in the `esi` register and the value of the code segment register `cs` (0x1000) in `bx`. After this we shift `bx` by 4 bits and add the address of label `2` to it (we will have the physical address of label `2` in `bx` after this) and jump to label `1`. Next we put the data segment and task state segment selectors in the `cx` and `di` registers with:
 
 ```assembly
 movw	$__BOOT_DS, %cx
@@ -505,7 +505,7 @@ movw	$__BOOT_TSS, %di
 
 As you can read above `GDT_ENTRY_BOOT_CS` has index 2 and every GDT entry is 8 bytes, so `__BOOT_CS` will be `2 * 8 = 16`, `__BOOT_DS` is 24, etc.
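+
+A small C sketch of this arithmetic (purely illustrative; the `selector()` helper is made up for this example):
+
+```C
+#include <stdio.h>
+
+/* A segment selector with RPL = 0 and TI = 0 is just the GDT index
+ * multiplied by the descriptor size, which is 8 bytes. */
+static unsigned short selector(unsigned short gdt_index)
+{
+	return gdt_index * 8;
+}
+
+int main(void)
+{
+	printf("__BOOT_CS  = %d\n", selector(2));   /* 16 */
+	printf("__BOOT_DS  = %d\n", selector(3));   /* 24 */
+	printf("__BOOT_TSS = %d\n", selector(4));   /* 32 */
+	return 0;
+}
+```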
 
-Next we set `PE` (Protection Enable) bit in the `CR0` control register:
+Next we set the `PE` (Protection Enable) bit in the `CR0` control register:
 
 ```assembly
 movl	%cr0, %edx
@@ -513,7 +513,7 @@ orb	$X86_CR0_PE, %dl
 movl	%edx, %cr0
 ```
 
-and make long jump to the protected mode:
+and make a long jump to protected mode:
 
 ```assembly
 	.byte	0x66, 0xea
@@ -522,7 +522,7 @@ and make long jump to the protected mode:
 ```
 
 where
-* `0x66` is the operand-size prefix which allows to mix 16-bit and 32-bit code,
+* `0x66` is the operand-size prefix which allows us to mix 16-bit and 32-bit code,
 * `0xea` is the jump opcode,
 * `in_pm32` is the segment offset,
 * `__BOOT_CS` is the code segment.
@@ -534,7 +534,7 @@ After this we are finally in the protected mode:
 .section ".text32","ax"
 ```
 
-Let's look at the first steps in the protected mode. First of all we setup data segment with:
+Let's look at the first steps in protected mode. First of all we set up the data segment with:
 
 ```assembly
 movl	%ecx, %ds
@@ -544,7 +544,7 @@ movl	%ecx, %gs
 movl	%ecx, %ss
 ```
 
-If you read with attention, you can remember that we saved `$__BOOT_DS` in the `cx` register. Now we fill with it all segment registers besides `cs` (`cs` is already `__BOOT_CS`). Next we zero out all general purpose registers besides `eax` with:
+If you paid attention, you can remember that we saved `$__BOOT_DS` in the `cx` register. Now we fill all the segment registers besides `cs` with it (`cs` is already `__BOOT_CS`). Next we zero out all general purpose registers besides `eax` with:
 
 ```assembly
 xorl	%ecx, %ecx
@@ -560,18 +560,18 @@ And jump to the 32-bit entry point in the end:
 jmpl	*%eax
 ```
 
-Remember that `eax` contains address of the 32-bit entry (we passed it as first parameter into `protected_mode_jump`).
+Remember that `eax` contains the address of the 32-bit entry point (we passed it as the first parameter to `protected_mode_jump`).
 
-That's all we're in the protected mode and stop at it's entry point. What happens next, we will see in the next part.
+That's all. We're in protected mode and have stopped at its entry point. We will see what happens next in the next part.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the third part about linux kernel internals. In next part we will see first steps in the protected mode and transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode).
+This is the end of the third part about linux kernel insides. In the next part, we will see the first steps in protected mode and the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode).
 
 If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes, please send me a PR with corrections at [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR with corrections at [linux-insides](https://github.com/0xAX/linux-internals).**
 
 Links
 --------------------------------------------------------------------------------

+ 195 - 126
Booting/linux-bootstrap-4.md

@@ -4,23 +4,23 @@ Kernel booting process. Part 4.
 Transition to 64-bit mode
 --------------------------------------------------------------------------------
 
-It is the fourth part of the `Kernel booting process` and we will see first steps in the [protected mode](http://en.wikipedia.org/wiki/Protected_mode), like checking that cpu supports the [long mode](http://en.wikipedia.org/wiki/Long_mode) and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions), [paging](http://en.wikipedia.org/wiki/Paging) and initialization of the page tables and transition to the long mode in in the end of this part.
+This is the fourth part of the `Kernel booting process` where we will see the first steps in [protected mode](http://en.wikipedia.org/wiki/Protected_mode), like checking that the cpu supports [long mode](http://en.wikipedia.org/wiki/Long_mode) and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions), the initialization of the [paging](http://en.wikipedia.org/wiki/Paging) tables and, at the end, the transition to [long mode](https://en.wikipedia.org/wiki/Long_mode).
 
-**NOTE: will be much assembly code in this part, so if you have poor knowledge, read a book about it**
+**NOTE: there will be a lot of assembly code in this part, so if you are not familiar with that, you might want to consult a book about it**
 
-In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md) we stopped at the jump to the 32-bit entry point in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S):
+In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md) we stopped at the jump to the 32-bit entry point in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S):
 
 ```assembly
 jmpl	*%eax
 ```
 
-Remind that `eax` register contains the address of the 32-bit entry point. We can read about this point from the linux kernel x86 boot protocol:
+You will recall that the `eax` register contains the address of the 32-bit entry point. We can read about this in the [linux kernel x86 boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt):
 
 ```
 When using bzImage, the protected-mode kernel was relocated to 0x100000
 ```
 
-And now we can make sure that it is true. Let's look on registers value in 32-bit entry point:
+Let's make sure that it is true by looking at the register values at the 32-bit entry point:
 
 ```
 eax            0x100000	1048576
@@ -41,12 +41,12 @@ fs             0x18	24
 gs             0x18	24
 ```
 
-We can see here that `cs` register contains - `0x10` (as you can remember from the previous part, it is the second index in the Global Descriptor Table), `eip` register is `0x100000` and base address of the all segments include code segment is zero. So we can get physical address, it will be `0:0x100000` or just `0x100000`, as in boot protocol. Now let's start with 32-bit entry point.
+We can see here that the `cs` register contains `0x10` (as you will remember from the previous part, this is the second index in the Global Descriptor Table), the `eip` register is `0x100000` and the base addresses of all segments, including the code segment, are zero. So we can get the physical address: it will be `0:0x100000` or just `0x100000`, as specified by the boot protocol. Now let's start with the 32-bit entry point.
 
 32-bit entry point
 --------------------------------------------------------------------------------
 
-We can find definition of the 32-bit entry point in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
+We can find the definition of the 32-bit entry point in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
 
 ```assembly
 	__HEAD
@@ -58,14 +58,14 @@ ENTRY(startup_32)
 ENDPROC(startup_32)
 ```
 
-First of all why `compressed` directory? Actually `bzimage` is a gzipped `vmlinux + header + kernel setup code`. We saw the kernel setup code in the all of previous parts. So, the main goal of the `head_64.S` is to prepare for entering long mode, enter into it and decompress the kernel. We will see all of these steps besides kernel decompression in this part.
+First of all, why the `compressed` directory? Actually `bzimage` is a gzipped `vmlinux + header + kernel setup code`. We saw the kernel setup code in all of the previous parts. So, the main goal of `head_64.S` is to prepare for entering long mode, enter into it and then decompress the kernel. We will see all of the steps up to kernel decompression in this part.
 
-Also you can note that there are two files in the `arch/x86/boot/compressed` directory:
+There are two files in the `arch/x86/boot/compressed` directory:
 
-* head_32.S
-* head_64.S
+* [head_32.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_32.S)
+* [head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S)
 
-We will see only `head_64.S` because we are learning linux kernel for `x86_64`. `head_32.S` even not compiled in our case. Let's look on the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile), we can see there following target:
+but we will see only `head_64.S` because, as you may remember, this book is only `x86_64` related; `head_32.S` is not used in our case. Let's look at [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile). There we can see the following target:
 
 ```Makefile
 vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
@@ -73,26 +73,26 @@ vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
 	$(obj)/piggy.o $(obj)/cpuflags.o
 ```
 
-Note on `$(obj)/head_$(BITS).o`. It means that compilation of the head_{32,64}.o depends on value of the `$(BITS)`. We can find it in the other Makefile - [arch/x86/kernel/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/Makefile):
+Note `$(obj)/head_$(BITS).o`. This means that we will select which file to link based on what `$(BITS)` is set to, either `head_32.o` or `head_64.o`. `$(BITS)` is defined elsewhere in [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile) based on the `.config` file:
 
 ```Makefile
 ifeq ($(CONFIG_X86_32),y)
-	    BITS := 32
+        BITS := 32
+        ...
         ...
-		...
 else
-		...
-		...
         BITS := 64
+        ...
+        ...
 endif
 ```
 
 Now we know where to start, so let's do it.
 
-Reload the segments if need
+Reload the segments if needed
 --------------------------------------------------------------------------------
 
-As i wrote above, we start in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). First of all we can see before `startup_32` definition:
+As indicated above, we start in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file. First we see the definition of the special section attribute before the `startup_32` definition:
 
 ```assembly
     __HEAD
@@ -100,13 +100,13 @@ As i wrote above, we start in the [arch/x86/boot/compressed/head_64.S](https://g
 ENTRY(startup_32)
 ```
 
-`__HEAD` defined in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) and looks as:
+`__HEAD` is a macro which is defined in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file and expands to the definition of the following section:
 
 ```C
 #define __HEAD		.section	".head.text","ax"
 ```
 
-We can find this section in the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script:
+with the `.head.text` name and `ax` flags. In our case, these flags show us that this section is [executable](https://en.wikipedia.org/wiki/Executable) or, in other words, contains code. We can find the definition of this section in the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script:
 
 ```
 SECTIONS
@@ -119,17 +119,17 @@ SECTIONS
 	}
 ```
 
-Note on `. = 0;`. `.` is a special variable of linker - location counter. Assigning a value to it, is an offset relative to the offset of the segment. As we assign zero to it, we can read from comments:
+If you are not familiar with the syntax of the `GNU LD` linker scripting language, you can find more information in the [documentation](https://sourceware.org/binutils/docs/ld/Scripts.html#Scripts). In short, the `.` symbol is a special linker variable - the location counter. The value assigned to it is an offset relative to the offset of the segment. In our case we assign zero to the location counter. This means that our code is linked to run from the `0` offset in memory. Moreover, we can find this information in the comments:
 
 ```
 Be careful parts of head_64.S assume startup_32 is at address 0.
 ```
 
-Ok, now we know where we are, and now the best time to look inside the `startup_32` function. 
+Ok, now we know where we are, and now is the best time to look inside the `startup_32` function.
 
-In the start of the `startup_32` we can see the `cld` instruction which clears `DF` flag. After this, string operations like `stosb` and other will increment the index registers `esi` or `edi`.
+In the beginning of the `startup_32` function, we can see the `cld` instruction which clears the `DF` bit in the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register. When the direction flag is clear, all string operations like [stos](http://x86.renejeschke.de/html/file_module_x86_id_306.html), [scas](http://x86.renejeschke.de/html/file_module_x86_id_287.html) and others will increment the index registers `esi` or `edi`. We need to clear the direction flag because later we will use string operations to clear space for the page tables, etc.
 
-The Next we can see the check of `KEEP_SEGMENTS` flag from `loadflags`. If you remember we already saw `loadflags` in the `arch/x86/boot/head.S` (there we checked flag `CAN_USE_HEAP`). Now we need to check `KEEP_SEGMENTS` flag. We can find description of this flag in the linux boot protocol:
+After we have cleared the `DF` bit, the next step is to check the `KEEP_SEGMENTS` flag from the `loadflags` kernel setup header field. If you remember, we already saw `loadflags` in the very first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. There we checked the `CAN_USE_HEAP` flag to get the ability to use the heap. Now we need to check the `KEEP_SEGMENTS` flag. This flag is described in the linux [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) documentation:
 
 ```
 Bit 6 (write): KEEP_SEGMENTS
@@ -139,8 +139,8 @@ Bit 6 (write): KEEP_SEGMENTS
     Assume that %cs %ds %ss %es are all set to flat segments with
 	a base of 0 (or the equivalent for their environment).
 ```
-	
-and if `KEEP_SEGMENTS` is not set, we need to set `ds`, `ss` and `es` registers to flat segment with base 0. That we do:
+
+So, if the `KEEP_SEGMENTS` bit is not set in the `loadflags`, we need to reset the `ds`, `ss` and `es` segment registers to a flat segment with base `0`. This is exactly what we do:
 
 ```C
 	testb $(1 << 6), BP_loadflags(%esi)
@@ -153,11 +153,29 @@ and if `KEEP_SEGMENTS` is not set, we need to set `ds`, `ss` and `es` registers
 	movl	%eax, %ss
 ```
 
-remember that `__BOOT_DS` is `0x18` (index of data segment in the Global Descriptor Table). If `KEEP_SEGMENTS` is not set, we jump to the label `1f` or update segment registers with `__BOOT_DS` if this flag is set. 
+Remember that `__BOOT_DS` is `0x18` (the index of the data segment in the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)). If `KEEP_SEGMENTS` is set, we jump to the nearest `1f` label; if it is not set, we update the segment registers with `__BOOT_DS`. It is pretty easy, but here is one interesting point. If you've read the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), you may remember that we already updated these segment registers right after we switched to [protected mode](https://en.wikipedia.org/wiki/Protected_mode) in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). So why do we need to care about the values of the segment registers again? The answer is easy. The Linux kernel also has a 32-bit boot protocol and if a bootloader uses it to load the Linux kernel, all of the code before `startup_32` will be skipped. In this case, `startup_32` will be the first entry point of the Linux kernel right after the bootloader and there are no guarantees that the segment registers will be in a known state.
 
-If you read previous the [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), you can remember that we already updated segment registers in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S), so why we need to set up it again? Actually linux kernel has also 32-bit boot protocol, so `startup_32` can be first function which will be executed right after a bootloader transfers control to the kernel.
+After we have checked the `KEEP_SEGMENTS` flag and put the correct values in the segment registers, the next step is to calculate the difference between where the kernel was loaded and where it was compiled to run. Remember that the linker script contains the following definition: `. = 0` at the start of the `.head.text` section. This means that the code in this section is compiled to run from the `0` address. We can see this in the `objdump` output:
 
-As we checked `KEEP_SEGMENTS` flag and put the correct value to the segment registers, next step is calculate difference between where we loaded and compiled to run (remember that `setup.ld.S` contains `. = 0` at the start of the section):
+```
+arch/x86/boot/compressed/vmlinux:     file format elf64-x86-64
+
+
+Disassembly of section .head.text:
+
+0000000000000000 <startup_32>:
+   0:   fc                      cld
+   1:   f6 86 11 02 00 00 40    testb  $0x40,0x211(%rsi)
+```
+
+The `objdump` util tells us that the address of `startup_32` is `0`, but actually it is not so. Our current goal is to find out where we actually are. It is pretty simple to do in [long mode](https://en.wikipedia.org/wiki/Long_mode), because it supports `rip` relative addressing, but currently we are in [protected mode](https://en.wikipedia.org/wiki/Protected_mode). We will use a common pattern to find the address of `startup_32`: we define a label, make a call to this label and pop the top of the stack into a register:
+
+```assembly
+call label
+label: pop %reg
+```
+
+After this, the register will contain the address of the label. Let's look at the similar code which finds the address of `startup_32` in the Linux kernel:
 
 ```assembly
 	leal	(BP_scratch+4)(%esi), %esp
@@ -166,16 +184,71 @@ As we checked `KEEP_SEGMENTS` flag and put the correct value to the segment regi
 	subl	$1b, %ebp
 ```
 
-Here `esi` register contains address of the [boot_params](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L113) structure. `boot_params` contains special field `scratch` with offset `0x1e4`. We are getting address of the `scratch` field + 4 bytes and put it to the `esp` register (we will use it as stack for these calculations). After this we can see call instruction and `1f` label as operand of it. What does it mean `call`? It means that it pushes `ebp` value in the stack, next `esp` value, next function arguments and return address in the end. After this we pop return address from the stack into `ebp` register (`ebp` will contain return address) and subtract address of the previous label `1`. 
+As you remember from the previous part, the `esi` register contains the address of the [boot_params](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L113) structure which was filled before we moved to protected mode. The `boot_params` structure contains a special field `scratch` with offset `0x1e4`. This four-byte field will be a temporary stack for the `call` instruction. We take the address of the `scratch` field plus `4` bytes and put it in the `esp` register. We add `4` bytes to the base of the `BP_scratch` field because, as just described, it will be a temporary stack and the stack grows from top to bottom on the `x86_64` architecture. So our stack pointer will point to the top of the stack. Next we can see the pattern that I've described above. We make a call to the `1f` label and put the address of this label in the `ebp` register, because the return address is on the top of the stack after the `call` instruction is executed. So, for now we have the address of the `1f` label and it is now easy to get the address of `startup_32`. We just need to subtract the address of the label from the address which we got from the stack:
+
+```
+startup_32 (0x0)     +-----------------------+
+                     |                       |
+                     |                       |
+                     |                       |
+                     |                       |
+                     |                       |
+                     |                       |
+                     |                       |
+                     |                       |
+1f (0x0 + 1f offset) +-----------------------+ %ebp - real physical address
+                     |                       |
+                     |                       |
+                     +-----------------------+
+```
+
+`startup_32` is linked to run at address `0x0` and this means that `1f` has the address `0x0 + offset to 1f`, approximately `0x21` bytes. The `ebp` register contains the real physical address of the `1f` label. So, if we subtract `1f` from `ebp`, we will get the real physical address of `startup_32`. The Linux kernel [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) describes that the base of the protected mode kernel is `0x100000`. We can verify this with [gdb](https://en.wikipedia.org/wiki/GNU_Debugger). Let's start the debugger and put a breakpoint right after the instruction at the `1f` label, at address `0x100022`. If this is correct, we will see the value `0x100021` in the `ebp` register:
 
-After this we have address where we loaded in the `ebp` - `0x100000`.
+```
+$ gdb
+(gdb)$ target remote :1234
+Remote debugging using :1234
+0x0000fff0 in ?? ()
+(gdb)$ br *0x100022
+Breakpoint 1 at 0x100022
+(gdb)$ c
+Continuing.
 
-Now we can setup the stack and verify CPU that it has support of the long mode and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions).
+Breakpoint 1, 0x00100022 in ?? ()
+(gdb)$ i r
+eax            0x18	0x18
+ecx            0x0	0x0
+edx            0x0	0x0
+ebx            0x0	0x0
+esp            0x144a8	0x144a8
+ebp            0x100021	0x100021
+esi            0x142c0	0x142c0
+edi            0x0	0x0
+eip            0x100022	0x100022
+eflags         0x46	[ PF ZF ]
+cs             0x10	0x10
+ss             0x18	0x18
+ds             0x18	0x18
+es             0x18	0x18
+fs             0x18	0x18
+gs             0x18	0x18
+```
+
+If we execute the next instruction, `subl $1b, %ebp`, we will see:
+
+```
+nexti
+...
+ebp            0x100000	0x100000
+...
+```
+
+Ok, that's true. The address of `startup_32` is `0x100000`. Now that we know the address of the `startup_32` label, we can prepare for the transition to [long mode](https://en.wikipedia.org/wiki/Long_mode). Our next goal is to set up the stack and verify that the CPU supports long mode and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions).
 
 Stack setup and CPU verification
 --------------------------------------------------------------------------------
 
-The next we can see assembly code which setups new stack for kernel decompression:
+We could not set up the stack while we did not know the address of the `startup_32` label. We can imagine the stack as an array and the stack pointer register `esp` must point to the end of this array. Of course we can define an array in our code, but we need to know its actual address to configure the stack pointer in a correct way. Let's look at the code:
 
 ```assembly
 	movl	$boot_stack_end, %eax
@@ -183,7 +256,7 @@ The next we can see assembly code which setups new stack for kernel decompressio
 	movl	%eax, %esp
 ```
 
-`boots_stack_end` is in the `.bss` section, we can see definition of it in the end of `head_64.S`:
+The `boot_stack_end` label is defined in the same [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file and is located in the [.bss](https://en.wikipedia.org/wiki/.bss) section:
 
 ```assembly
 	.bss
@@ -195,9 +268,9 @@ boot_stack:
 boot_stack_end:
 ```
 
-First of all we put address of the `boot_stack_end` into `eax` register and add to it value of the `ebp` (remember that `ebp` now contains address where we loaded - `0x100000`). In the end we just put `eax` value into `esp` and that's all, we have correct stack pointer.
+First of all, we put the address of `boot_stack_end` into the `eax` register, so the `eax` register contains the address of `boot_stack_end` where it was linked, which is `0x0 + boot_stack_end`. To get the real address of `boot_stack_end`, we need to add the real address of `startup_32`. As you remember, we found this address above and put it in the `ebp` register. In the end, the `eax` register will contain the real address of `boot_stack_end` and we just need to put it in the stack pointer.
 
-The next step is CPU verification. Need to check that CPU has support of `long mode` and `SSE`:
+After we have set up the stack, the next step is CPU verification. As we are going to transition to `long mode`, we need to check that the CPU supports `long mode` and `SSE`. We will do this by calling the `verify_cpu` function:
 
 ```assembly
 	call	verify_cpu
@@ -205,9 +278,9 @@ The next step is CPU verification. Need to check that CPU has support of `long m
 	jnz	no_longmode
 ```
 
-It just calls `verify_cpu` function from the [arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/verify_cpu.S) which contains a couple of calls of the `cpuid` instruction. `cpuid` is instruction which is used for getting information about processor. In our case it checks long mode and SSE support and returns `0` on success or `1` on fail in the `eax` register. 
+This function is defined in the [arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/verify_cpu.S) assembly file and just contains a couple of calls to the [cpuid](https://en.wikipedia.org/wiki/CPUID) instruction. This instruction is used for getting information about the processor. In our case it checks `long mode` and `SSE` support and returns `0` on success or `1` on failure in the `eax` register.
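+
+As a rough user-space illustration of the same idea (not the kernel's `verify_cpu` code), here is a sketch using GCC's `<cpuid.h>` helper; the bit positions are the documented `cpuid` ones (long mode: leaf `0x80000001`, `edx` bit 29; SSE: leaf `1`, `edx` bit 25):
+
+```C
+#include <cpuid.h>
+#include <stdio.h>
+
+int main(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	int long_mode = 0, sse = 0;
+
+	/* Extended leaf 0x80000001: edx bit 29 is the long mode flag. */
+	if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
+		long_mode = !!(edx & (1 << 29));
+
+	/* Standard leaf 1: edx bit 25 is the SSE flag. */
+	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
+		sse = !!(edx & (1 << 25));
+
+	printf("long mode: %d, sse: %d\n", long_mode, sse);
+	return 0;
+}
+```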
 
-If `eax` is not zero, we jump to the `no_longmode` label which just stops the CPU with `hlt` instruction while any hardware interrupt will not happen.
+If the value of `eax` is not zero, we jump to the `no_longmode` label which just stops the CPU with the `hlt` instruction in an endless loop, as long as no hardware interrupt occurs:
 
 ```assembly
 no_longmode:
@@ -216,12 +289,29 @@ no_longmode:
 	jmp     1b
 ```
 
-We set stack, cheked CPU and now can move on the next step.
+If the value of the `eax` register is zero, everything is ok and we are able to continue.
 
 Calculate relocation address
 --------------------------------------------------------------------------------
 
-The next step is calculating relocation address for decompression if need. We can see following assembly code:
+The next step is calculating the relocation address for decompression, if needed. First we need to know what it means for a kernel to be `relocatable`. We already know that the base address of the 32-bit entry point of the Linux kernel is `0x100000`, but that is a 32-bit entry point. The default base address of the Linux kernel is determined by the value of the `CONFIG_PHYSICAL_START` kernel configuration option. Its default value is `0x1000000` or `16 MB`. The main problem here is that if the Linux kernel crashes, a kernel developer must have a `rescue kernel` for [kdump](https://www.kernel.org/doc/Documentation/kdump/kdump.txt) which is configured to load from a different address. The Linux kernel provides a special configuration option to solve this problem: `CONFIG_RELOCATABLE`. As we can read in the documentation of the Linux kernel:
+
+```
+This builds a kernel image that retains relocation information
+so it can be loaded someplace besides the default 1MB.
+
+Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
+it has been loaded at and the compile time physical address
+(CONFIG_PHYSICAL_START) is used as the minimum location.
+```
+
+In simple terms this means that the Linux kernel with the same configuration can be booted from different addresses. Technically, this is done by compiling the decompressor as [position independent code](https://en.wikipedia.org/wiki/Position-independent_code). If we look at [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile), we will see that the decompressor is indeed compiled with the `-fPIC` flag:
+
+```Makefile
+KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
+```
+
+When we are using position-independent code, an address is obtained by adding the address field of the instruction to the value of the program counter. We can load code which uses such addressing from any address. That's why we had to get the real physical address of `startup_32`. Now let's get back to the Linux kernel code. Our current goal is to calculate an address where we can relocate the kernel for decompression. The calculation of this address depends on the `CONFIG_RELOCATABLE` kernel configuration option. Let's look at the code:
 
 ```assembly
 #ifdef CONFIG_RELOCATABLE
@@ -239,22 +329,7 @@ The next step is calculating relocation address for decompression if need. We ca
 	addl	$z_extract_offset, %ebx
 ```
 
-First of all note on `CONFIG_RELOCATABLE` macro. This configuration option defined in the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) and as we can read from it's description:
-
-```
-This builds a kernel image that retains relocation information
-so it can be loaded someplace besides the default 1MB.
-
-Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
-it has been loaded at and the compile time physical address
-(CONFIG_PHYSICAL_START) is used as the minimum location.
-```
-
-In short words, this code calculates address where to move kernel for decompression put it to `ebx` register if the kernel is relocatable or bzimage will decompress itself above `LOAD_PHYSICAL_ADDR`.
-
-Let's look on the code. If we have `CONFIG_RELOCATABLE=n` in our kernel configuration file, it just puts `LOAD_PHYSICAL_ADDR` to the `ebx` register and adds `z_extract_offset` to `ebx`. As `ebx` is zero for now, it will contain `z_extract_offset`. Now let's try to understand these two values.
-
-`LOAD_PHYSICAL_ADDR` is the macro which defined in the [arch/x86/include/asm/boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/boot.h) and it looks like this:
+Remember that the value of the `ebp` register is the physical address of the `startup_32` label. If the `CONFIG_RELOCATABLE` kernel configuration option is enabled during kernel configuration, we put this address in the `ebx` register, align it to a multiple of `2MB` and compare it with the `LOAD_PHYSICAL_ADDR` value. The `LOAD_PHYSICAL_ADDR` macro is defined in the [arch/x86/include/asm/boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/boot.h) header file and it looks like this:
 
 ```C
 #define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
@@ -262,27 +337,14 @@ Let's look on the code. If we have `CONFIG_RELOCATABLE=n` in our kernel configur
 				& ~(CONFIG_PHYSICAL_ALIGN - 1))
 ```
 
-Here we calculates aligned address where kernel is loaded (`0x100000` or 1 megabyte in our case). `PHYSICAL_ALIGN` is an alignment value to which kernel should be aligned, it ranges from `0x200000` to `0x1000000` for x86_64. With the default values we will get 2 megabytes in the `LOAD_PHYSICAL_ADDR`:
+As we can see, it just expands to the `CONFIG_PHYSICAL_START` value aligned to `CONFIG_PHYSICAL_ALIGN`, which represents the physical address where the kernel will be loaded. After the comparison of `LOAD_PHYSICAL_ADDR` and the value of the `ebx` register, we add the offset from `startup_32` where the compressed kernel image will be decompressed. If the `CONFIG_RELOCATABLE` option is not enabled during kernel configuration, we just put the default address where the kernel should be loaded and add `z_extract_offset` to it.
 
-```python
->>> 0x100000 + (0x200000 - 1) & ~(0x200000 - 1)
-2097152
-```
-
-After that we got alignment unit, we adds `z_extract_offset` (which is `0xe5c000` in my case) to the 2 megabytes. In the end we will get 17154048 byte offset. You can find `z_extract_offset` in the `arch/x86/boot/compressed/piggy.S`. This file generated in compile time by [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) program.
-
-Now let's try to understand the code if `CONFIG_RELOCATABLE` is `y`. 
-
-First of all we put `ebp` value to the `ebx` (remember that `ebp` contains address where we loaded) and `kernel_alignment` field from kernel setup header to the `eax` register. `kernel_alignment`  is a physical address of alignment required for the kernel. Next we do the same as in the previous case (when kernel is not relocatable), but we just use value of the `kernel_alignment` field as align unit and `ebx` (address where we loaded) as base address instead of `CONFIG_PHYSICAL_ALIGN` and `LOAD_PHYSICAL_ADDR`. 
-
-After that we calculated address, we compare it with `LOAD_PHYSICAL_ADDR` and add `z_extract_offset` to it again or put `LOAD_PHYSICAL_ADDR` in the `ebx` if calculated address is less than we need.
-
-After all of this calculation we will have `ebp` which contains address where we loaded and `ebx` with address where to move kernel for decompression.
+After all of these calculations, `ebp` will contain the address where the kernel was loaded and `ebx` the address where the kernel will be moved for decompression.
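+
+The whole calculation can be summarized in a small C sketch (illustrative only; the function and parameter names are invented for this example):
+
+```C
+#include <stdint.h>
+
+/* load_addr is the physical address of startup_32 (kept in ebp), align is
+ * the kernel_alignment field from the setup header. */
+static uint32_t relocation_address(uint32_t load_addr, uint32_t align,
+				   uint32_t load_physical_addr,
+				   uint32_t z_extract_offset,
+				   int relocatable)
+{
+	uint32_t ebx = load_physical_addr;
+
+	if (relocatable) {
+		/* Round the load address up to the required alignment. */
+		ebx = (load_addr + align - 1) & ~(align - 1);
+		if (ebx < load_physical_addr)
+			ebx = load_physical_addr;
+	}
+
+	return ebx + z_extract_offset;   /* where to move the kernel */
+}
+```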
 
 Preparation before entering long mode
 --------------------------------------------------------------------------------
 
-Now we need to do the last preparations before we can see transition to the 64-bit mode. At first we need to update Global Descriptor Table for this:
+When we have the base address where we will relocate the compressed kernel image, we need to do one last step before we can transition to 64-bit mode. First we need to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table):
 
 ```assembly
 	leal	gdt(%ebp), %eax
@@ -290,9 +352,7 @@ Now we need to do the last preparations before we can see transition to the 64-b
 	lgdt	gdt(%ebp)
 ```
 
-Here we put the address from `ebp` with `gdt` offset to `eax` register, next we put this address into `ebp` with offset `gdt+2` and load Global Descriptor Table with the `lgdt` instruction. 
-
-Let's look on Global Descriptor Table definition:
+Here we put the base address from the `ebp` register with the `gdt` offset into the `eax` register. Next we store this address at `gdt+2(%ebp)` and load the `Global Descriptor Table` with the `lgdt` instruction. To understand the magic with the `gdt` offsets we need to look at the definition of the `Global Descriptor Table`. We can find its definition in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
 
 ```assembly
 	.data
@@ -305,11 +365,17 @@ gdt:
 	.quad	0x00cf92000000ffff	/* __KERNEL_DS */
 	.quad	0x0080890000000000	/* TS descriptor */
 	.quad   0x0000000000000000	/* TS continued */
+gdt_end:
 ```
 
-It defined in the same file in the `.data` section. It contains 5 descriptors: null descriptor, for kernel code segment, kernel data segment and two task descriptors. We already loaded GDT in the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), we're doing almost the same here, but descriptors with `CS.L = 1` and `CS.D = 0` for execution in the 64 bit mode.
+We can see that it is located in the `.data` section and contains five descriptors: a `null` descriptor, a kernel code segment, a kernel data segment and two task descriptors. We already loaded the `Global Descriptor Table` in the previous [part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md), and now we're doing almost the same here, but the descriptors are built with `CS.L = 1` and `CS.D = 0` for execution in `64` bit mode. As we can see, the definition of the `gdt` starts with two bytes, `gdt_end - gdt`, which represent the last byte in the `gdt` table, or the table limit. The next four bytes contain the base address of the `gdt`. Remember that the `Global Descriptor Table` is stored in the 48-bit `GDTR` register which consists of two parts:
 
-After we have loaded Global Descriptor Table, we must enable [PAE](http://en.wikipedia.org/wiki/Physical_Address_Extension) mode with putting value of `cr4` register into `eax`, setting 5 bit in it and load it again in the `cr4` :
+* the size (16-bit) of the global descriptor table;
+* the address (32-bit) of the global descriptor table.
+
+So, we put the address of the `gdt` in the `eax` register and then write it at `.long	gdt` or `gdt+2` in our assembly code. From now on we have a formed structure for the `GDTR` register and can load the `Global Descriptor Table` with the `lgdt` instruction.
+
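+The `GDTR` layout described above can be sketched as a packed C structure (illustrative only, mirroring the two bytes of limit followed by four bytes of base):
+
+```C
+#include <stdint.h>
+
+/* 48-bit GDTR: a 16-bit table limit followed by a 32-bit base address. */
+struct gdt_ptr {
+	uint16_t limit;   /* gdt_end - gdt */
+	uint32_t base;    /* physical address of the gdt */
+} __attribute__((packed));
+```
+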
+After we have loaded the `Global Descriptor Table`, we must enable [PAE](http://en.wikipedia.org/wiki/Physical_Address_Extension) mode by putting the value of the `cr4` register into `eax`, setting the 5th bit in it and loading it back into `cr4`:
 
 ```assembly
 	movl	%cr4, %eax
@@ -317,49 +383,49 @@ After we have loaded Global Descriptor Table, we must enable [PAE](http://en.wik
 	movl	%eax, %cr4
 ```
 
-Now we finished almost with all preparations before we can move into 64-bit mode. The last step is to build page tables, but before some information about long mode.
+Now we are almost finished with all preparations before we can move into 64-bit mode. The last step is to build page tables, but before that, here is some information about long mode.
 
 Long mode
 --------------------------------------------------------------------------------
 
-Long mode is the native mode for x86_64 processors. First of all let's look on some difference between `x86_64` and `x86`.
+[Long mode](https://en.wikipedia.org/wiki/Long_mode) is the native mode for [x86_64](https://en.wikipedia.org/wiki/X86-64) processors. First let's look at some differences between `x86_64` and `x86`.
 
-It provides some features as:
+The `64-bit` mode provides features such as:
 
-* New 8 general purpose registers from `r8` to `r15` + all general purpose registers are 64-bit now
-* 64-bit instruction pointer - `RIP`
-* New operating mode - Long mode
-* 64-Bit Addresses and Operands
-* RIP Relative Addressing (we will see example if it in the next parts)
+* 8 new general purpose registers from `r8` to `r15` + all general purpose registers are 64-bit now;
+* 64-bit instruction pointer - `RIP`;
+* New operating mode - Long mode;
+* 64-Bit Addresses and Operands;
+* RIP Relative Addressing (we will see an example of it in the next parts).
 
-Long mode is an extension of legacy protected mode. It consists from two sub-modes:
+Long mode is an extension of legacy protected mode. It consists of two sub-modes:
 
-* 64-bit mode
-* compatibility mode
+* 64-bit mode;
+* compatibility mode.
 
-To switch into 64-bit mode we need to do following things:
+To switch into `64-bit` mode we need to do following things:
 
-* enable PAE (we already did it, see above)
-* build page tables and load the address of top level page table into `cr3` register 
-* enable `EFER.LME`
-* enable paging
+* Enable [PAE](https://en.wikipedia.org/wiki/Physical_Address_Extension);
+* Build page tables and load the address of the top level page table into the `cr3` register;
+* Enable `EFER.LME`;
+* Enable paging.
 
-We already enabled `PAE` with setting the PAE bit in the `cr4` register. Now let's look on paging.
+We already enabled `PAE` by setting the `PAE` bit in the `cr4` control register. Our next goal is to build the structure for [paging](https://en.wikipedia.org/wiki/Paging). We will see this in the next paragraph.
 
-Early page tables initialization
+Early page table initialization
 --------------------------------------------------------------------------------
 
-Before we can move in the 64-bit mode, we need to build page tables, so, let's look on building of early 4G boot page tables. 
+We already know that before we can move into `64-bit` mode, we need to build page tables. So let's look at the building of the early `4G` boot page tables.
 
-**NOTE: I will not describe theory of virtual memory here, if you need to know more about it, see links in the end**
+**NOTE: I will not describe the theory of virtual memory here. If you need to know more about it, see links at the end of this part.**
 
-Linux kernel uses 4-level paging, and generally we build 6 page tables:
+The Linux kernel uses `4-level` paging, and we generally build 6 page tables:
 
-* One PML4 table
-* One PDP  table
-* Four Page Directory tables
+* One `PML4` or `Page Map Level 4` table with one entry;
+* One `PDP` or `Page Directory Pointer` table with four entries;
+* Four Page Directory tables with a total of `2048` entries.
 
-Let's look on the implementation of it. First of all we clear buffer for the page tables in the memory. Every table is 4096 bytes, so we need 24 kilobytes buffer:
+Let's look at the implementation of this. First of all we clear the buffer for the page tables in memory. Every table is `4096` bytes, so we need to clear a `24` kilobyte buffer:
 
 ```assembly
 	leal	pgtable(%ebx), %edi
@@ -368,7 +434,9 @@ Let's look on the implementation of it. First of all we clear buffer for the pag
 	rep	stosl
 ```
 
-We put address which stored in `ebx` (remember that `ebx` contains the address where to relocate kernel for decompression) with `pgtable` offset to the `edi` register. `pgtable` defined in the end of `head_64.S` and looks:
+We put the address of `pgtable` plus `ebx` (remember that `ebx` contains the address to relocate the kernel to for decompression) in the `edi` register, clear the `eax` register and set the `ecx` register to `6144`. The `rep stosl` instruction will write the value of `eax` to the memory pointed to by `edi`, increase the value of the `edi` register by `4` and decrease the value of the `ecx` register by `1`. This operation will be repeated while the value of the `ecx` register is greater than zero. That's why we put `6144` in `ecx`.
+
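+As a rough C sketch (purely illustrative, not kernel code), the `rep stosl` sequence behaves like this:
+
+```C
+#include <stdint.h>
+
+/* Equivalent of: xorl %eax, %eax; movl $6144, %ecx; rep stosl.
+ * It stores 6144 32-bit zeros, i.e. 6144 * 4 = 24576 bytes - enough
+ * for six 4096-byte page tables. */
+static void clear_pgtable_buffer(uint32_t *edi)
+{
+	for (uint32_t ecx = 6144; ecx != 0; ecx--)
+		*edi++ = 0;   /* eax is zero */
+}
+```
+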
+`pgtable` is defined at the end of [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file and is:
 
 ```assembly
 	.section ".pgtable","a",@nobits
@@ -377,9 +445,9 @@ pgtable:
 	.fill 6*4096, 1, 0
 ```
 
-It is in the `.pgtable` section and it size is 24 kilobytes. After we put address to the `edi`, we zero out `eax` register and writes zeros to the buffer with `rep stosl` instruction.
+As we can see, it is located in the `.pgtable` section and its size is `24` kilobytes.
 
-Now we can build top level page table - `PML4` with:
+After we have got the buffer for the `pgtable` structure, we can start to build the top level page table - `PML4` - with:
 
 ```assembly
 	leal	pgtable + 0(%ebx), %edi
@@ -387,9 +455,9 @@ Now we can build top level page table - `PML4` with:
 	movl	%eax, 0(%edi)
 ```
 
-Here we get address which stored in the `ebx` with `pgtable` offset and put it to the `edi`. Next we put this address with offset `0x1007` to the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` to the `edi`. After this manipulations `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`.
+Here again, we put the address of `pgtable` relative to `ebx`, or in other words relative to the address of `startup_32`, in the `edi` register. Next we put this address with offset `0x1007` in the `eax` register. `0x1007` is `4096` bytes, which is the size of the `PML4`, plus `7`. The `7` here represents the flags of the `PML4` entry. In our case, these flags are `PRESENT+RW+USER`. In the end we just write the address of the first `PDP` entry to the `PML4` table.
 
-In the next step we build 4 Page Directory entry in the Page Directory Pointer table, where first entry will be with `0x7` flags and other with `0x8`:
+In the next step we will build four `Page Directory` entries in the `Page Directory Pointer` table with the same `PRESENT+RW+USER` flags:
 
 ```assembly
 	leal	pgtable + 0x1000(%ebx), %edi
@@ -402,11 +470,7 @@ In the next step we build 4 Page Directory entry in the Page Directory Pointer t
 	jnz	1b
 ```
 
-We put base address of the page directory pointer table to the `edi` and address of the first page directory pointer entry to the `eax`. Put `4` to the `ecx` register, it will be counter in the following loop and write the address of the first page directory pointer table entry  to the `edi` register. 
-
-After this `edi` will contain address of the first page directory pointer entry with flags `0x7`. Next we just calculates address of following page directory pointer entries with flags `0x8` and writes their addresses to the `edi`.
-
-The next step is building of `2048` page table entries by 2 megabytes:
+We put the base address of the page directory pointer table, which is at offset `4096` or `0x1000` from the `pgtable` table, in `edi` and the address of the first page directory pointer entry in the `eax` register. We put `4` in the `ecx` register; it will be a counter in the following loop, and we write the address of the first page directory pointer table entry to the `edi` register. After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the addresses of the following page directory pointer entries, where each entry is `8` bytes, and write their addresses to `eax`. The last step of building the paging structure is building the `2048` page table entries with `2-MByte` pages:
 
 ```assembly
 	leal	pgtable + 0x2000(%ebx), %edi
@@ -419,21 +483,26 @@ The next step is building of `2048` page table entries by 2 megabytes:
 	jnz	1b
 ```
 
-Here we do almost the same that in the previous example, just first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all another with `0x8`. In the end we will have 2048 pages by 2 megabytes.
+Here we do almost the same as in the previous example, all entries will have the flags `$0x00000183` - `PRESENT + WRITE + MBZ`. In the end we will have `2048` `2-MByte` pages, or:
+
+```python
+>>> 2048 * 0x00200000
+4294967296
+```
 
-Our early page table structure are done, it maps 4 gigabytes of memory and now we can put address of the high-level page table - `PML4` to the `cr3` control register:
+bytes, or a `4G` page table. We have just finished building our early page table structure which maps `4` gigabytes of memory, and now we can put the address of the high-level page table - `PML4` - in the `cr3` control register:
 
 ```assembly
 	leal	pgtable(%ebx), %eax
 	movl	%eax, %cr3
 ```
 
-That's all now we can see transition to the long mode.
+That's all. All the preparations are finished and now we can see the transition to long mode.
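+
+The whole early page table setup can be summarized with a short C sketch (illustrative only; it mirrors the assembly above and identity-maps the first `4` gigabytes with `2-MByte` pages):
+
+```C
+#include <stdint.h>
+
+#define PRESENT_RW_USER 0x7     /* flags of the PML4 and PDP entries */
+#define PAGE_2M_FLAGS   0x183   /* PRESENT + WRITE + 2-MByte page, as above */
+
+static void build_early_pgtable(uint64_t *pgtable, uint64_t phys_base)
+{
+	uint64_t *pml4 = pgtable;         /* offset 0x0000 */
+	uint64_t *pdp  = pgtable + 512;   /* offset 0x1000 */
+	uint64_t *pd   = pgtable + 1024;  /* offset 0x2000, four tables */
+
+	/* One PML4 entry pointing to the PDP table. */
+	pml4[0] = (phys_base + 0x1000) | PRESENT_RW_USER;
+
+	/* Four PDP entries pointing to the four page directories. */
+	for (int i = 0; i < 4; i++)
+		pdp[i] = (phys_base + 0x2000 + i * 0x1000) | PRESENT_RW_USER;
+
+	/* 2048 page directory entries mapping 2048 * 2MB = 4GB. */
+	for (int i = 0; i < 2048; i++)
+		pd[i] = (uint64_t)i * 0x200000 | PAGE_2M_FLAGS;
+}
+```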
 
-Transition to the long mode
+Transition to the 64-bit mode
 --------------------------------------------------------------------------------
 
-First of all we need to set `EFER.LME` flag in the [MSR](http://en.wikipedia.org/wiki/Model-specific_register) to `0xC0000080`:
+First of all we need to set the `EFER.LME` flag in the [MSR](http://en.wikipedia.org/wiki/Model-specific_register) at address `0xC0000080`:
 
 ```assembly
 	movl	$MSR_EFER, %ecx
@@ -442,31 +511,31 @@ First of all we need to set `EFER.LME` flag in the [MSR](http://en.wikipedia.org
 	wrmsr
 ```
 
-Here we put `MSR_EFER` flag (which defined in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h#L7)) to the `ecx` register and call `rdmsr` instruction which reads [MSR](http://en.wikipedia.org/wiki/Model-specific_register) register. After `rdmsr` executed, we will have result data in the `edx:eax` which depends on `ecx` value. We check  `EFER_LME` bit with `btsl` instruction and write data from `eax` to the `MSR` register with `wrmsr` instruction.
+Here we put the `MSR_EFER` flag (which is defined in [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h#L7)) in the `ecx` register and call the `rdmsr` instruction which reads the [MSR](http://en.wikipedia.org/wiki/Model-specific_register) register. After `rdmsr` executes, we will have the resulting data in `edx:eax` which depends on the `ecx` value. We set the `EFER_LME` bit with the `btsl` instruction and write the data from `eax` back to the `MSR` register with the `wrmsr` instruction.
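+
+Conceptually (a rough sketch, not the kernel's code), the `rdmsr`/`btsl`/`wrmsr` sequence does nothing more than set one bit in the `IA32_EFER` value; `EFER.LME` is bit `8` of that register:
+
+```C
+#include <stdint.h>
+
+#define MSR_EFER     0xC0000080u   /* address of the IA32_EFER MSR */
+#define EFER_LME_BIT 8             /* Long Mode Enable */
+
+/* Takes the value read by rdmsr and returns what wrmsr writes back. */
+static uint64_t set_efer_lme(uint64_t efer)
+{
+	return efer | (1ULL << EFER_LME_BIT);
+}
+```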
 
-In next step we push address of the kernel segment code to the stack (we defined it in the GDT) and put address of the `startup_64` routine to the `eax`.
+In the next step we push the address of the kernel code segment to the stack (we defined it in the GDT) and put the address of the `startup_64` routine in `eax`.
 
 ```assembly
 	pushl	$__KERNEL_CS
 	leal	startup_64(%ebp), %eax
 ```
 
-After this we push this address to the stack and enable paging with setting `PG` and `PE` bits in the `cr0` register:
+After this we push this address to the stack and enable paging by setting `PG` and `PE` bits in the `cr0` register:
 
 ```assembly
 	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 ```
 
-and call:
+and execute:
 
 ```assembly
 lret
 ```
 
-Remember that we pushed address of the `startup_64` function to the stack in the previous step, and after `lret` instruction, CPU extracts address of it and jumps there. 
+instruction. Remember that we pushed the address of the `startup_64` function to the stack in the previous step, and after the `lret` instruction, the CPU extracts the address of it and jumps there.
 
-After all of these steps we're finally in the 64-bit mode:
+After all of these steps we're finally in 64-bit mode:
 
 ```assembly
 	.code64
@@ -482,11 +551,11 @@ That's all!
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the fourth part linux kernel booting process. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
+This is the end of the fourth part of the linux kernel booting process. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
 
 In the next part we will see kernel decompression and much more.
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
 
 Links
 --------------------------------------------------------------------------------

+ 164 - 82
Booting/linux-bootstrap-5.md

@@ -9,7 +9,7 @@ This is the fifth part of the `Kernel booting process` series. We saw transition
 Preparation before kernel decompression
 --------------------------------------------------------------------------------
 
-We stopped right before jump on 64-bit entry point - `startup_64` which located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to the `startup_64` in the `startup_32`:
+We stopped right before the jump on the 64-bit entry point - `startup_64` which is located in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) source code file. We already saw the jump to the `startup_64` in the `startup_32`:
 
 ```assembly
 	pushl	$__KERNEL_CS
@@ -24,7 +24,7 @@ We stopped right before jump on 64-bit entry point - `startup_64` which located
 	lret
 ```
 
-in the previous part, `startup_64` starts to work. Since we loaded the new Global Descriptor Table and there was CPU transition in other mode (64-bit mode in our case), we can see setup of the data segments:
+in the previous part, `startup_64` starts to work. Since we loaded the new Global Descriptor Table and there was a CPU transition to another mode (64-bit mode in our case), we can see the setup of the data segments:
 
 ```assembly
 	.code64
@@ -38,9 +38,9 @@ ENTRY(startup_64)
 	movl	%eax, %gs
 ```
 
-in the beginning of the `startup_64`. All segment registers besides `cs` points now to the `ds` which is `0x18` (if you don't understand why it is `0x18`, read the previous part).
+in the beginning of the `startup_64`. All segment registers besides `cs` now point to `ds` which is `0x18` (if you don't understand why it is `0x18`, read the previous part).
 
-The next step is computation of difference between where kernel was compiled and where it was loaded:
+The next step is the computation of the difference between where the kernel was compiled and where it was loaded:
 
 ```assembly
 #ifdef CONFIG_RELOCATABLE
@@ -58,18 +58,18 @@ The next step is computation of difference between where kernel was compiled and
 	leaq	z_extract_offset(%rbp), %rbx
 ```
 
-`rbp` contains decompressed kernel start address and after this code executed `rbx` register will contain address where to relocate the kernel code for decompression. We already saw code like this in the `startup_32` ( you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because bootloader can use 64-bit boot protocol and `startup_32` just will not be executed in this case.
+`rbp` contains the decompressed kernel start address and after this code executes, the `rbx` register will contain the address to relocate the kernel code to for decompression. We already saw code like this in the `startup_32` (you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol and `startup_32` just will not be executed in this case.
 
-In the next step we can see setup of the stack and reset of flags register:
+In the next step we can see the setup of the stack pointer and the resetting of the flags register:
 
 ```assembly
 	leaq	boot_stack_end(%rbx), %rsp
 
- 	pushq	$0
+	pushq	$0
 	popfq
 ```
 
-As you can see above `rbx` register contains the start address of the decompressing kernel code and we just put this address with `boot_stack_end` offset to the `rsp` register. After this stack will be correct. You can find definition of the `boot_stack_end` in the end of `compressed/head_64.S` file:
+As you can see above, the `rbx` register contains the start address of the kernel decompressor code and we just put this address, with the `boot_stack_end` offset, into the `rsp` register, which holds the pointer to the top of the stack. After this step, the stack will be correct. You can find the definition of `boot_stack_end` at the end of the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
 
 ```assembly
 	.bss
@@ -79,12 +79,11 @@ boot_heap:
 boot_stack:
 	.fill BOOT_STACK_SIZE, 1, 0
 boot_stack_end:
-
 ```
 
-It located in the `.bss` section right before `.pgtable`. You can look at [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) to find it.
+It is located at the end of the `.bss` section, right before `.pgtable`. If you look into the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script, you will find the definitions of `.bss` and `.pgtable` there.
 
-As we set the stack, now we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Let's look on this code:
+Now that the stack is set, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before going into the details, let's look at this assembly code:
 
 ```assembly
 	pushq	%rsi
@@ -98,9 +97,9 @@ As we set the stack, now we can copy the compressed kernel to the address that w
 	popq	%rsi
 ```
 
-First of all we push `rsi` to the stack. We need save value of `rsi`, because this register now stores pointer to the `boot_params` real mode structure (you must remember this structure, we filled it in the start of kernel setup). In the end of this code we'll restore pointer to the `boot_params` into `rsi` again. 
+First of all we push `rsi` to the stack. We need to preserve the value of `rsi`, because this register now stores a pointer to `boot_params`, the real-mode structure that contains booting-related data (you must remember this structure, we filled it at the start of the kernel setup). At the end of this code we'll restore the pointer to `boot_params` into `rsi` again.
 
-The next two `leaq` instructions calculates effective address of the `rip` and `rbx` with `_bss - 8` offset and put it to the `rsi` and `rdi`. Why we calculate this addresses? Actually compressed kernel image located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking on the linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S):
+The next two `leaq` instructions calculate the effective addresses of `rip` and `rbx` with the `_bss - 8` offset and put them into `rsi` and `rdi`. Why do we calculate these addresses? Actually the compressed kernel image is located between this copying code (from `startup_32` to the current code) and the decompression code. You can verify this by looking at the linker script - [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S):
 
 ```
 	. = 0;
@@ -120,7 +119,7 @@ The next two `leaq` instructions calculates effective address of the `rip` and `
 	}
 ```
 
-Note that `.head.text` section contains `startup_32`. You can remember it from the previous part:
+Note that `.head.text` section contains `startup_32`. You may remember it from the previous part:
 
 ```assembly
 	__HEAD
@@ -131,10 +130,9 @@ ENTRY(startup_32)
 ...
 ```
 
-`.text` section contains decompression code:
+The `.text` section contains decompression code:
 
-assembly
-```
+```assembly
 	.text
 relocated:
 ...
@@ -146,15 +144,11 @@ relocated:
 ...
 ```
 
-And `.rodata..compressed` contains compressed kernel image. 
-
-So `rsi` will contain `rip` relative address of the `_bss - 8` and `rdi` will contain relocation relative address of the ``_bss - 8`. As we store these addresses in register, we put the address of `_bss` to the `rcx` register. As you can see in the `vmlinux.lds.S`, it located in the end of all sections with the setup/kernel code. Now we can start to copy data from `rsi` to `rdi` by 8 bytes with `movsq` instruction. 
-
-Note that there is `std` instruction before data copying, it sets `DF` flag and it means that `rsi` and `rdi` will be decremeted or in other words, we will crbxopy bytes in backwards. 
+And `.rodata..compressed` contains the compressed kernel image. So `rsi` will contain the absolute address of `_bss - 8`, and `rdi` will contain the relocation-relative address of `_bss - 8`. As we store these addresses in registers, we put the address of `_bss` into the `rcx` register. As you can see in the `vmlinux.lds.S` linker script, it's located at the end of all sections with the setup/kernel code. Now we can start to copy data from `rsi` to `rdi`, `8` bytes at a time, with the `movsq` instruction.
 
-In the end we clear `DF` flag with `cld` instruction and restore `boot_params` structure to the `rsi`.
+Note that there is an `std` instruction before data copying: it sets the `DF` flag, which means that `rsi` and `rdi` will be decremented. In other words, we will copy the bytes backwards. At the end, we clear the `DF` flag with the `cld` instruction, and restore the pointer to the `boot_params` structure in `rsi`.
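+
+The direction of this copy is easy to model in C. The following is a minimal illustrative sketch (not kernel code) of copying a buffer backwards in 8-byte chunks, which is what `std; rep movsq` effectively achieves when the destination overlaps the source at a higher address:
+
+```C
+#include <stdint.h>
+#include <stddef.h>
+
+/* Copy `qwords` 8-byte words from src to dst, starting from the last
+ * word and moving backwards, so an overlapping higher destination is
+ * not overwritten before it is read. */
+static void copy_backwards(uint64_t *dst, const uint64_t *src, size_t qwords)
+{
+	while (qwords--)
+		dst[qwords] = src[qwords];
+}
+```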
 
-After it we get `.text` section address address and jump to it:
+Now we have the address of the `.text` section after relocation, and we can jump to it:
 
 ```assembly
 	leaq	relocated(%rbx), %rax
@@ -164,7 +158,7 @@ After it we get `.text` section address address and jump to it:
 Last preparation before kernel decompression
 --------------------------------------------------------------------------------
 
-`.text` sections starts with the `relocated` label. For the start there is clearing of the `bss` section with:
+In the previous paragraph we saw that the `.text` section starts with the `relocated` label. The first thing it does is clear the `.bss` section with:
 
 ```assembly
 	xorl	%eax, %eax
@@ -175,9 +169,9 @@ Last preparation before kernel decompression
 	rep	stosq
 ```
 
-Here we just clear `eax`, put RIP relative address of the `_bss` to the `rdi` and `_ebss` to `rcx` and fill it with zeros with `rep stosq` instructions.
+We need to initialize the `.bss` section, because we'll soon jump to [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code. Here we just clear `eax`, put the address of `_bss` in `rdi` and `_ebss` in `rcx`, and fill this region with zeros with the `rep stosq` instruction.
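+
+In C terms this is nothing more than zeroing the region between two linker-provided symbols. A minimal sketch (illustrative only, assuming `_bss` and `_ebss` are visible as symbols):
+
+```C
+#include <string.h>
+
+extern char _bss[], _ebss[];	/* provided by the linker script */
+
+/* Zero everything between _bss and _ebss before any C code relies on it. */
+static void clear_bss(void)
+{
+	memset(_bss, 0, _ebss - _bss);
+}
+```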
 
-In the end we can see the call of the `decompress_kernel` routine:
+At the end, we can see the call to the `decompress_kernel` function:
 
 ```assembly
 	pushq	%rsi
@@ -194,36 +188,48 @@ In the end we can see the call of the `decompress_kernel` routine:
 	popq	%rsi
 ```
 
-Again we save `rsi` with pointer to `boot_params` structure and call `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) with seven arguments. All arguments will be passed through the registers. We finished all preparation and now can look on the kernel decompression.
+Again we set `rdi` to a pointer to the `boot_params` structure and call `decompress_kernel` from [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) with seven arguments:
+
+* `rmode` - pointer to the [boot_params](https://github.com/torvalds/linux/blob/master//arch/x86/include/uapi/asm/bootparam.h#L114) structure which is filled by bootloader or during early kernel initialization;
+* `heap` - pointer to `boot_heap` which represents the start address of the early boot heap;
+* `input_data` - pointer to the start of the compressed kernel or in other words pointer to the `arch/x86/boot/compressed/vmlinux.bin.bz2`;
+* `input_len` - size of the compressed kernel;
+* `output` - start address of the future decompressed kernel;
+* `output_len` - size of decompressed kernel;
+* `run_size` - amount of space needed to run the kernel including `.bss` and `.brk` sections.
+
+All arguments will be passed in accordance with the [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf). We've finished all the preparation and can now look at the kernel decompression.
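+
+For reference, the prototype of `decompress_kernel` in the kernel version described here looks roughly like the following (a sketch; the exact types, such as the kernel-internal `memptr` alias, may differ between versions). Under the System V ABI the first six integer arguments travel in `rdi`, `rsi`, `rdx`, `rcx`, `r8` and `r9`, and the seventh goes on the stack:
+
+```C
+asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
+					     unsigned char *input_data,
+					     unsigned long input_len,
+					     unsigned char *output,
+					     unsigned long output_len,
+					     unsigned long run_size);
+```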
 
 Kernel decompression
 --------------------------------------------------------------------------------
 
-As i wrote above, `decompress_kernel` function is in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. This function starts with the video/console initialization that we saw in the previous parts. This calls need if bootloaded used 32 or 64-bit protocols. After this we store pointers to the start of the free memory and to the end of it:
+As we saw in the previous paragraph, the `decompress_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file and takes seven arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know if we started in [real mode](https://en.wikipedia.org/wiki/Real_mode), whether a bootloader was used, or whether the bootloader used the 32 or 64-bit boot protocol.
+
+After the first initialization steps, we store pointers to the start of the free memory and to the end of it:
 
 ```C
-	free_mem_ptr     = heap;
-	free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
+free_mem_ptr     = heap;
+free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
 ```
 
-where `heap` is the second parameter of the `decompress_kernel` function which we got with:
+where the `heap` is the second parameter of the `decompress_kernel` function which we got in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
 
 ```assembly
 leaq	boot_heap(%rip), %rsi
 ```
 
-As you saw about `boot_heap` defined as:
+As you saw above, the `boot_heap` is defined as:
 
 ```assembly
 boot_heap:
 	.fill BOOT_HEAP_SIZE, 1, 0
 ```
 
-where `BOOT_HEAP_SIZE` is `0x400000` if the kernel compressed with `bzip2` or `0x8000` if not.
+where `BOOT_HEAP_SIZE` is a macro which expands to `0x8000` (or `0x400000` in the case of a `bzip2`-compressed kernel) and represents the size of the heap.
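+
+The selection between the two sizes is done at build time. The definition lives in the kernel's boot headers and looks roughly like this (a sketch; the exact file and formatting may differ between versions):
+
+```C
+#ifdef CONFIG_KERNEL_BZIP2
+#define BOOT_HEAP_SIZE	0x400000
+#else
+#define BOOT_HEAP_SIZE	0x8000
+#endif
+```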
 
-In the next step we call `choose_kernel_location` function from the [arch/x86/boot/compressed/aslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/aslr.c#L298). As we can understand from the function name it chooses memory location where to decompress the kernel image. Let's look on this function.
+After the heap pointers initialization, the next step is the call of the `choose_random_location` function from the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` a location to decompress the compressed kernel image to, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons. Let's open the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c#L425) source code file and look at `choose_random_location`.
 
-At the start `choose_kernel_location` tries to find `kaslr` option in the command line if `CONFIG_HIBERNATION` is set and `nokaslr` option if this configuration option `CONFIG_HIBERNATION` is not set:
+First, `choose_random_location` tries to find the `kaslr` option in the Linux kernel command line if `CONFIG_HIBERNATION` is set, and `nokaslr` otherwise:
 
 ```C
 #ifdef CONFIG_HIBERNATION
@@ -239,14 +245,16 @@ At the start `choose_kernel_location` tries to find `kaslr` option in the comman
 #endif
 ```
 
-If there is no `kaslr` or `nokaslr` in the command line it jumps to `out` label:
+If the `CONFIG_HIBERNATION` kernel configuration option is enabled during kernel configuration and there is no `kaslr` option in the Linux kernel command line, it prints `KASLR disabled by default...` and jumps to the `out` label:
 
 ```C
 out:
 	return (unsigned char *)choice;
 ```
 
-which just returns the `output` parameter which we passed to the `choose_kernel_location` without any changes. Let's try to understand what is it `kaslr`. We can find information about it in the [documentation](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt):
+which just returns the `output` parameter which we passed to the `choose_random_location`, unchanged. If the `CONFIG_HIBERNATION` kernel configuration option is disabled and the `nokaslr` option is in the kernel command line, we jump to `out` again.
+
+For now, let's assume the kernel was configured with randomization enabled and try to understand what `kASLR` is. We can find information about it in the [documentation](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt):
 
 ```
 kaslr/nokaslr [X86]
@@ -258,22 +266,50 @@ kASLR is disabled by default. When kASLR is enabled,
 hibernation will be disabled.
 ```
 
-It means that we can pass `kaslr` option to the kernel's command line and get random address for the decompressed kernel (more about aslr you can read [here](https://en.wikipedia.org/wiki/Address_space_layout_randomization)). 
+It means that we can pass the `kaslr` option to the kernel's command line and get a random address for the decompressed kernel (you can read more about ASLR [here](https://en.wikipedia.org/wiki/Address_space_layout_randomization)). So, our current goal is to find a random address where we can `safely` decompress the Linux kernel. I repeat: `safely`. What does that mean in this context? You may remember that besides the decompressor code and the kernel image itself, there are some unsafe places in memory. For example, the [initrd](https://en.wikipedia.org/wiki/Initrd) image is in memory too, and we must not overlap it with the decompressed kernel.
 
-Let's consider the case when kernel's command line contains `kaslr` option.
+The next function will help us to find a safe place where we can decompress the kernel. This function is `mem_avoid_init`. It is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) and takes four arguments that we already saw in the `decompress_kernel` function:
 
-There is the call of the `mem_avoid_init` function from the same `aslr.c` source code file. This function gets the unsafe memory regions (initrd, kernel command line and etc...). We need to know about this memory regions to not overlap them with the kernel after decompression. For example:
+* `input_data` - pointer to the start of the compressed kernel, or in other words, the pointer to `arch/x86/boot/compressed/vmlinux.bin.bz2`;
+* `input_len` - the size of the compressed kernel;
+* `output` - the start address of the future decompressed kernel;
+* `output_len` - the size of decompressed kernel.
+
+The main point of this function is to fill an array of `mem_vector` structures:
 
 ```C
+#define MEM_AVOID_MAX 5
+
+static struct mem_vector mem_avoid[MEM_AVOID_MAX];
+```
+
+where the `mem_vector` structure contains information about unsafe memory regions:
+
+```C
+struct mem_vector {
+	unsigned long start;
+	unsigned long size;
+};
+```
+
+The implementation of `mem_avoid_init` is pretty simple. Let's look at part of this function:
+
+```C
+	...
+	...
+	...
 	initrd_start  = (u64)real_mode->ext_ramdisk_image << 32;
 	initrd_start |= real_mode->hdr.ramdisk_image;
 	initrd_size  = (u64)real_mode->ext_ramdisk_size << 32;
 	initrd_size |= real_mode->hdr.ramdisk_size;
 	mem_avoid[1].start = initrd_start;
 	mem_avoid[1].size = initrd_size;
+	...
+	...
+	...
 ```
 
-Here we can see calculation of the [initrd](http://en.wikipedia.org/wiki/Initrd) start address and size. `ext_ramdisk_image` is high 32-bits of the `ramdisk_image` field from boot header and `ext_ramdisk_size` is high 32-bits of the `ramdisk_size` field from [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt):
+Here we can see the calculation of the [initrd](http://en.wikipedia.org/wiki/Initrd) start address and size. The `ext_ramdisk_image` is the high `32` bits of the `ramdisk_image` field from the setup header, and `ext_ramdisk_size` is the high `32` bits of the `ramdisk_size` field from the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt):
 
 ```
 Offset	Proto	Name		Meaning
@@ -286,7 +322,7 @@ Offset	Proto	Name		Meaning
 ...
 ```
 
-And `ext_ramdisk_image` and `ext_ramdisk_size` you can find in the [Documentation/x86/zero-page.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/zero-page.txt):
+And `ext_ramdisk_image` and `ext_ramdisk_size` can be found in the [Documentation/x86/zero-page.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/zero-page.txt):
 
 ```
 Offset	Proto	Name		Meaning
@@ -299,31 +335,15 @@ Offset	Proto	Name		Meaning
 ...
 ```
 
-So we're taking `ext_ramdisk_image` and `ext_ramdisk_size`, shifting they left on 32 (now they will contain low 32-bits in the high 32-bit bits) and getting start address of the `initrd` and size of it. After this we store these values in the `mem_avoid` array which defined as:
+So we're taking `ext_ramdisk_image` and `ext_ramdisk_size`, shifting them left by `32` (so that their values end up in the high `32` bits), OR-ing in the low halves, and getting the start address and the size of the `initrd`. After this we store these values in the `mem_avoid` array.
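+
+As a quick worked example (the values below are made up for illustration), combining the two halves looks like this:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+int main(void)
+{
+	/* Illustrative values: high and low 32-bit halves of the initrd address. */
+	uint64_t ext_ramdisk_image = 0x1;		/* high 32 bits */
+	uint64_t ramdisk_image     = 0x7f000000;	/* low 32 bits  */
+
+	uint64_t initrd_start = (ext_ramdisk_image << 32) | ramdisk_image;
+
+	printf("initrd_start = 0x%llx\n", (unsigned long long)initrd_start);
+	/* prints: initrd_start = 0x17f000000 */
+	return 0;
+}
+```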
 
-```C
-#define MEM_AVOID_MAX 5
-static struct mem_vector mem_avoid[MEM_AVOID_MAX];
-```
-
-where `mem_vector` structure is:
-
-```C
-struct mem_vector {
-	unsigned long start;
-	unsigned long size;
-};
-```
-
-The next step after we collected all unsafe memory regions in the `mem_avoid` array will be search of the random address which does not overlap with the unsafe regions with the `find_random_addr` function.
-
-First of all we can see align of the output address in the `find_random_addr` function:
+The next step after we've collected all unsafe memory regions in the `mem_avoid` array will be searching for a random address that does not overlap with the unsafe regions, using the `find_random_addr` function. First of all we can see the alignment of the output address in the `find_random_addr` function:
 
 ```C
 minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
 ```
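+
+The `ALIGN` macro rounds an address up to the given power-of-two alignment. A minimal sketch of the same rounding (illustrative only, assuming a power-of-two `a`):
+
+```C
+#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))
+
+/* For example: ALIGN_UP(0x1000001, 0x200000) == 0x1200000 */
+```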
 
-you can remember `CONFIG_PHYSICAL_ALIGN` configuration option from the previous part. This option provides the value to which kernel should be aligned and it is `0x200000` by default. After that we got aligned output address, we go through the memory and collect regions which are good for decompressed kernel image:
+You may remember the `CONFIG_PHYSICAL_ALIGN` configuration option from the previous part. This option provides the value to which the kernel should be aligned and it is `0x200000` by default. Once we have the aligned output address, we go through the memory regions which we got with the help of the BIOS [e820](https://en.wikipedia.org/wiki/E820) service and collect regions suitable for the decompressed kernel image:
 
 ```C
 for (i = 0; i < real_mode->e820_entries; i++) {
@@ -331,9 +351,7 @@ for (i = 0; i < real_mode->e820_entries; i++) {
 }
 ```
 
-You can remember that we collected `e820_entries` in the second part of the [Kernel booting process part 2](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-2.md#memory-detection).
-
-First of all `process_e820_entry` function does some checks that e820 memory region is not non-RAM, that the start address of the memory region  is not bigger than Maximum allowed `aslr` offset and that memory region is not less than value of kernel alignment:
+Recall that we collected the `e820_entries` in the [second part of the kernel booting process](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-2.md#memory-detection). The `process_e820_entry` function does some checks: that an `e820` memory region is not a `non-RAM` region, that the start address of the memory region is not bigger than the maximum allowed `aslr` offset, and that the memory region is above the minimum load location:
 
 ```C
 struct mem_vector region, img;
@@ -348,14 +366,14 @@ if (entry->addr + entry->size < minimum)
 	return;
 ```
 
-After this, we store e820 memory region start address and the size in the `mem_vector` structure (we saw definition of this structure above):
+After this, we store the start address and the size of an `e820` memory region in the `mem_vector` structure (we saw the definition of this structure above):
 
 ```C
 region.start = entry->addr;
 region.size = entry->size;
 ```
 
-As we store these values, we align the `region.start` as we did it in the `find_random_addr` function and check that we didn't get address that bigger than original memory region:
+Once these values are stored, we align `region.start` as we did in the `find_random_addr` function and check that we didn't get an address outside the original memory region:
 
 ```C
 region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
@@ -364,7 +382,7 @@ if (region.start > entry->addr + entry->size)
 	return;
 ```
 
-Next we get difference between the original address and aligned and check that if the last address in the memory region is bigger than `CONFIG_RANDOMIZE_BASE_MAX_OFFSET`, we reduce the memory region size that end of kernel image will be less than maximum `aslr` offset:
+In the next step, we reduce the size of the memory region to not include rejected regions at the start, and ensure that the last address in the memory region is smaller than `CONFIG_RANDOMIZE_BASE_MAX_OFFSET`, so that the end of the kernel image will be less than the maximum `aslr` offset:
 
 ```C
 region.size -= region.start - entry->addr;
@@ -373,7 +391,7 @@ if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
 		region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
 ```
 
-In the end we go through the all unsafe memory regions and check that this region does not overlap unsafe ares with kernel command line, initrd and etc...:
+Finally, we go through all unsafe memory regions and check that the region does not overlap unsafe areas, such as kernel command line, initrd, etc...:
 
 ```C
 for (img.start = region.start, img.size = image_size ;
@@ -385,13 +403,13 @@ for (img.start = region.start, img.size = image_size ;
 	}
 ```
 
-If memory region does not overlap unsafe regions we call `slots_append` function with the start address of the region. `slots_append` function just collects start addresses of memory regions to the `slots` array:
+If the memory region does not overlap unsafe regions, we call the `slots_append` function with the start address of the region. The `slots_append` function just collects start addresses of memory regions in the `slots` array:
 
 ```C
-	slots[slot_max++] = addr;
+slots[slot_max++] = addr;
 ```
 
-which defined as:
+which is defined as:
 
 ```C
 static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
@@ -399,7 +417,7 @@ static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
 static unsigned long slot_max;
 ```
 
-After `process_e820_entry` will be executed, we will have array of the addresses which are safe for the decompressed kernel. Next we call `slots_fetch_random` function for getting random item from this array:
+After `process_e820_entry` is done, we will have an array of addresses that are safe for the decompressed kernel. Then we call the `slots_fetch_random` function to get a random item from this array:
 
 ```C
 if (slot_max == 0)
@@ -408,17 +426,17 @@ if (slot_max == 0)
 return slots[get_random_long() % slot_max];
 ```
 
-where `get_random_long` function checks different CPU flags as `X86_FEATURE_RDRAND` or `X86_FEATURE_TSC` and chooses method for getting random number (it can be obtain with RDRAND instruction, Time stamp counter, programmable interval timer and etc...). After that we got random address execution of the `choose_kernel_location` is finished.
+where the `get_random_long` function checks different CPU flags, such as `X86_FEATURE_RDRAND` or `X86_FEATURE_TSC`, and chooses a method for getting a random number (it can be the RDRAND instruction, the time stamp counter, the programmable interval timer, etc...). After retrieving the random address, the execution of `choose_random_location` is finished.
 
-Now let's back to the [misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c#L404). After we got address for the kernel image, there need to do some checks to be sure that gotten random address is correctly aligned and address is not wrong. 
+Now let's get back to [misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, some checks need to be done to be sure that the retrieved random address is correctly aligned and not wrong.
 
-After all these checks will see the familiar message:
+After all these checks we will see the familiar message:
 
 ```
 Decompressing Linux... 
 ```
 
-and call `decompress` function which will decompress the kernel. `decompress` function depends on what decompression algorithm was chosen during kernel compilartion:
+and call the `__decompress` function which will decompress the kernel. The `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
 
 ```C
 #ifdef CONFIG_KERNEL_GZIP
@@ -446,7 +464,71 @@ and call `decompress` function which will decompress the kernel. `decompress` fu
 #endif
 ```
 
-After kernel will be decompressed, the last function `handle_relocations` will relocate the kernel to the address that we got from `choose_kernel_location`. After that kernel relocated we return from the `decompress_kernel` to the `head_64.S`. The address of the kernel will be in the `rax` register and we jump on it:
+After the kernel is decompressed, the last two functions are `parse_elf` and `handle_relocations`. The main point of these functions is to move the uncompressed kernel image to its correct place in memory. The fact is that the decompression is done [in-place](https://en.wikipedia.org/wiki/In-place_algorithm), and we still need to move the kernel to the correct address. As we already know, the kernel image is an [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) executable, so the main goal of the `parse_elf` function is to move the loadable segments to the correct address. We can see the loadable segments in the output of the `readelf` program:
+
+```
+readelf -l vmlinux
+
+Elf file type is EXEC (Executable file)
+Entry point 0x1000000
+There are 5 program headers, starting at offset 64
+
+Program Headers:
+  Type           Offset             VirtAddr           PhysAddr
+                 FileSiz            MemSiz              Flags  Align
+  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
+                 0x0000000000893000 0x0000000000893000  R E    200000
+  LOAD           0x0000000000a93000 0xffffffff81893000 0x0000000001893000
+                 0x000000000016d000 0x000000000016d000  RW     200000
+  LOAD           0x0000000000c00000 0x0000000000000000 0x0000000001a00000
+                 0x00000000000152d8 0x00000000000152d8  RW     200000
+  LOAD           0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
+                 0x0000000000138000 0x000000000029b000  RWE    200000
+```
+
+The goal of the `parse_elf` function is to load these segments to the `output` address we got from the `choose_random_location` function. This function starts with checking the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) signature:
+
+```C
+Elf64_Ehdr ehdr;
+Elf64_Phdr *phdrs, *phdr;
+
+memcpy(&ehdr, output, sizeof(ehdr));
+
+if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
+   ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
+   ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
+   ehdr.e_ident[EI_MAG3] != ELFMAG3) {
+   error("Kernel is not a valid ELF file");
+   return;
+}
+```
+
+and if it's not valid, it prints an error message and halts. If we have a valid `ELF` file, we go through all the program headers from the given `ELF` file and copy all loadable segments to their correct addresses in the output buffer:
+
+```C
+	for (i = 0; i < ehdr.e_phnum; i++) {
+		phdr = &phdrs[i];
+
+		switch (phdr->p_type) {
+		case PT_LOAD:
+#ifdef CONFIG_RELOCATABLE
+			dest = output;
+			dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
+#else
+			dest = (void *)(phdr->p_paddr);
+#endif
+			memcpy(dest,
+			       output + phdr->p_offset,
+			       phdr->p_filesz);
+			break;
+		default: /* Ignore other PT_* */ break;
+		}
+	}
+```
+
+That's all. From now on, all loadable segments are in the correct place. The last `handle_relocations` function adjusts addresses in the kernel image, and is called only if the `kASLR` was enabled during kernel configuration.
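+
+Conceptually, adjusting addresses after a randomized load boils down to adding the load delta to every recorded absolute reference. The following is a purely illustrative sketch of that idea (not the kernel's actual code; the names and the relocation table format are assumptions):
+
+```C
+#include <stdint.h>
+#include <stddef.h>
+
+/* Apply the distance between the link-time and the actual load address
+ * to every recorded 32-bit absolute reference in the image. */
+static void apply_relocations(uint8_t *image, const uint32_t *reloc_offsets,
+			      size_t nr_relocs, uint64_t link_addr,
+			      uint64_t load_addr)
+{
+	uint64_t delta = load_addr - link_addr;
+	size_t i;
+
+	for (i = 0; i < nr_relocs; i++) {
+		uint32_t *target = (uint32_t *)(image + reloc_offsets[i]);
+		*target += (uint32_t)delta;
+	}
+}
+```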
+
+After the kernel is relocated, we return from `decompress_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). The address of the kernel will be in the `rax` register and we jump to it:
 
 ```assembly
 jmp	*%rax
@@ -457,13 +539,13 @@ That's all. Now we are in the kernel!
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the fifth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe only updates in this and previous posts), but there will be many posts about other kernel internals. 
+This is the end of the fifth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals. 
 
-Next chapter will be about kernel initialization and we will see the first steps in the linux kernel initialization code.
+Next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
 
-If you will have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
 
 Links
 --------------------------------------------------------------------------------

+ 8 - 5
CONTRIBUTING.md

@@ -3,28 +3,31 @@ Contributing
 
 If you want to contribute to [linux-insides](https://github.com/0xAX/linux-insides), please follow these simple rules:
 
-1. Press fork button:
+1. Press the fork button:
 
     ![fork](http://oi58.tinypic.com/jj2trm.jpg)
 
-2. Clone the repo from your account with:
+2. Clone the repository from your account with:
 
     ```
     git clone git@github.com:your_github_username/linux-insides.git
     ```
 
-3. Create branch with:
+3. Create a new branch with:
 
     ```
     git checkout -b "linux-bootstrap-1-fix"
     ```
+    You can name it however you want.
 
 4. Make your changes.
 
-5. Don't forget to add yourself in `contributors.md`
+5. Don't forget to add yourself in `contributors.md`.
+
+6. Commit and push your changes, then open a pull request on GitHub.
 
 **IMPORTANT**
 
-Please, make the actual changes. While you made your changes, I can merge changes from somebody else and your changes can conflict with `master` branch content. Please rebase on master every time before you're going to push your changes and check that your branch doesn't conflict with `master`.
+Please don't forget to update your fork. While you were making your changes, the content of the `master` branch may have changed because other pull requests were merged, and this can create conflicts. This is why you have to rebase on `master` every time before pushing your changes and check that your branch doesn't have any conflicts with `master`.
 
 Thank you.

+ 1 - 0
Concepts/README.md

@@ -4,3 +4,4 @@ This chapter describes various concepts which are used in the Linux kernel.
 
 * [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
 * [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [The initcall mechanism](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)

+ 36 - 26
Concepts/cpumask.md

@@ -4,7 +4,7 @@ CPU masks
 Introduction
 --------------------------------------------------------------------------------
 
-`Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which are contains API for `Cpumasks` manipulating:
+`Cpumasks` is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which contain the API for `Cpumasks` manipulation are:
 
 * [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h)
 * [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c)
@@ -19,40 +19,50 @@ set_cpu_present(cpu, true);
 set_cpu_possible(cpu, true);
 ```
 
-`set_cpu_possible` is a set of cpu ID's which can be plugged in anytime during the life of that system boot. `cpu_present` represents which CPUs are currently plugged in. `cpu_online` represents subset of the `cpu_present` and indicates CPUs which are available for scheduling. These masks depends on `CONFIG_HOTPLUG_CPU` configuration option and if this option is disabled `possible == present` and `active == online`. Implementation of the all of these functions are very similar. Every function checks the second parameter. If it is `true`, calls `cpumask_set_cpu` or `cpumask_clear_cpu` otherwise.
+Before we consider the implementation of these functions, let's consider all of these masks.
 
-There are two ways for a `cpumask` creation. First is to use `cpumask_t`. It defined as:
+The `cpu_possible` is a set of cpu ID's which can be plugged in anytime during the life of that system boot, or in other words, the mask of possible CPUs contains the maximum number of CPUs which are possible in the system. It will be equal to the value of `NR_CPUS`, which is set statically via the `CONFIG_NR_CPUS` kernel configuration option.
+
+The `cpu_present` mask represents which CPUs are currently plugged in.
+
+The `cpu_online` represents a subset of the `cpu_present` and indicates CPUs which are available for scheduling, or in other words, a bit from this mask tells the kernel whether a processor may be utilized by the Linux kernel.
+
+The last mask is `cpu_active`. Bits of this mask tell the Linux kernel whether a task may be moved to a certain processor.
+
+All of these masks depend on the `CONFIG_HOTPLUG_CPU` configuration option and if this option is disabled `possible == present` and `active == online`. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is `true`, it calls `cpumask_set_cpu`, otherwise it calls `cpumask_clear_cpu`.
+
+There are two ways to create a `cpumask`. The first is to use `cpumask_t`. It is defined as:
 
 ```C
 typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
 ```
 
-It wraps `cpumask` structure which contains one bitmak `bits` field. `DECLARE_BITMAP` macro gets two parameters:
+It wraps the `cpumask` structure which contains one bitmask `bits` field. The `DECLARE_BITMAP` macro gets two parameters:
 
 * bitmap name;
 * number of bits.
 
-and creates an array of `unsigned long` with the give name. It's implementation is pretty easy:
+and creates an array of `unsigned long` with the given name. Its implementation is pretty easy:
 
 ```C
 #define DECLARE_BITMAP(name,bits) \
         unsigned long name[BITS_TO_LONGS(bits)]
 ```
 
-where `BITS_TO_LONG`:
+where `BITS_TO_LONGS`:
 
 ```C
 #define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 ```
 
-As we learning `x86_64` architecture, `unsigned long` is 8-bytes size and our array will contain only one element:
+As we are focusing on the `x86_64` architecture, `unsigned long` is `8` bytes in size and our array will contain only one element:
 
 ```
 (((8) + (8) - 1) / (8)) = 1
 ```
 
-`NR_CPUS` macro presents the number of the CPUs in the system and depends on the `CONFIG_NR_CPUS` macro which defined in the [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) and looks like this:
+`NR_CPUS` macro represents the number of CPUs in the system and depends on the `CONFIG_NR_CPUS` macro which is defined in [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) and looks like this:
 
 ```C
 #ifndef CONFIG_NR_CPUS
@@ -62,7 +72,7 @@ As we learning `x86_64` architecture, `unsigned long` is 8-bytes size and our ar
 #define NR_CPUS         CONFIG_NR_CPUS
 ```
 
-The second way to define cpumask is to use `DECLARE_BITMAP` macro directly and `to_cpumask` macro which convertes given bitmap to the `struct cpumask *`:
+The second way to define cpumask is to use the `DECLARE_BITMAP` macro directly and the `to_cpumask` macro which converts the given bitmap to `struct cpumask *`:
 
 ```C
 #define to_cpumask(bitmap)                                              \
@@ -70,7 +80,7 @@ The second way to define cpumask is to use `DECLARE_BITMAP` macro directly and `
                             : (void *)sizeof(__check_is_bitmap(bitmap))))
 ```
 
-We can see ternary operator operator here which is `true` every time. `__check_is_bitmap` inline function defined as:
+We can see the ternary operator here, which is `true` every time. The `__check_is_bitmap` inline function is defined as:
 
 ```C
 static inline int __check_is_bitmap(const unsigned long *bitmap)
@@ -79,7 +89,7 @@ static inline int __check_is_bitmap(const unsigned long *bitmap)
 }
 ```
 
-And returns `1` every time. We need in it here only for one purpose: In compile time it checks that given `bitmap` is a bitmap, or with another words it checks that given `bitmap` has type - `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting array of `unsigned long` to the `struct cpumask *`.
+And returns `1` every time. We need it here for only one purpose: at compile time it checks that a given `bitmap` is a bitmap, or in other words it checks that a given `bitmap` has type - `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting an array of `unsigned long` to the `struct cpumask *`.
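+
+A short usage sketch of this second way (illustrative only; the `my_cpu_bits` name is made up, but the macros are the ones discussed above):
+
+```C
+static DECLARE_BITMAP(my_cpu_bits, CONFIG_NR_CPUS);
+
+static void example(void)
+{
+	struct cpumask *mask = to_cpumask(my_cpu_bits);
+
+	cpumask_set_cpu(0, mask);	/* mark CPU 0 in our mask */
+}
+```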
 
 cpumask API
 --------------------------------------------------------------------------------
@@ -103,13 +113,13 @@ void set_cpu_online(unsigned int cpu, bool online)
 }
 ```
 
-First of all it checks the second `state` parameter and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depends on it. Here we can see casting to the `struct cpumask *` of the second parameter in the `cpumask_set_cpu`. In our case it is `cpu_online_bits` which is bitmap and defined as:
+First of all it checks the second `state` parameter and calls `cpumask_set_cpu` or `cpumask_clear_cpu` depending on it. Here we can see the casting of the second parameter to `struct cpumask *` in `cpumask_set_cpu`. In our case it is `cpu_online_bits`, which is a bitmap and defined as:
 
 ```C
 static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
 ```
 
-`cpumask_set_cpu` function makes only one call of the `set_bit` function inside:
+The `cpumask_set_cpu` function makes only one call to the `set_bit` function:
 
 ```C
 static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
@@ -118,12 +128,12 @@ static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
 }
 ```
 
-`set_bit` function takes two parameter too, and sets a given bit (first parameter) in the memory (second parameter or `cpu_online_bits` bitmap). We can see here that before `set_bit` will be called, its two parameter will be passed to the
+The `set_bit` function takes two parameters too, and sets a given bit (first parameter) in the memory (second parameter or `cpu_online_bits` bitmap). We can see here that before `set_bit` is called, its two parameters are passed to
 
 * cpumask_check;
 * cpumask_bits.
 
-Let's consider these two macro. First if `cpumask_check` does nothing in our case and just returns given parameter. The second `cpumask_bits` just returns `bits` field from the given `struct cpumask *` structure:
+Let's consider these two macros. The first, `cpumask_check`, does nothing in our case and just returns the given parameter. The second, `cpumask_bits`, just returns the `bits` field from the given `struct cpumask *` structure:
 
 ```C
 #define cpumask_bits(maskp) ((maskp)->bits)
@@ -147,13 +157,13 @@ Now let's look on the `set_bit` implementation:
  }
 ```
 
-This function looks scarry, but it is not so hard as it seems. First of all it passes `nr` or number of the bit to the `IS_IMMEDIATE` macro which just makes call of the GCC internal `__builtin_constant_p` function:
+This function looks scary, but it is not as hard as it seems. First of all it passes `nr`, the number of the bit, to the `IS_IMMEDIATE` macro which just calls the GCC internal `__builtin_constant_p` function:
 
 ```C
 #define IS_IMMEDIATE(nr)    (__builtin_constant_p(nr))
 ```
 
-`__builtin_constant_p` checks that given parameter is known constant at compile-time. As our `cpu` is not compile-time constant, `else` clause will be executed:
+`__builtin_constant_p` checks whether the given parameter is known to be constant at compile time. As our `cpu` is not a compile-time constant, the `else` clause will be executed:
 
 ```C
 asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
@@ -161,9 +171,9 @@ asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
 
 Let's try to understand how it works step by step:
 
-`LOCK_PREFIX` is a x86 `lock` instruction. This instruction tells to the cpu to occupy the system bus while instruction will be executed. This allows to synchronize memory access, preventing simultaneous access of multiple processors (or devices - DMA controller for example) to one memory cell.
+`LOCK_PREFIX` is the x86 `lock` instruction prefix. It tells the cpu to occupy the system bus while the instruction is executed. This allows the CPU to synchronize memory access, preventing simultaneous access of multiple processors (or devices - the DMA controller for example) to one memory cell.
 
-`BITOP_ADDR` casts given parameter to the `(*(volatile long *)` and adds `+m` constraints. `+` means that this operand is bot read and written by the instruction. `m` shows that this is memory operand. `BITOP_ADDR` is defined as:
+`BITOP_ADDR` casts the given parameter to a `volatile long *`, dereferences it, and adds the `+m` constraint. `+` means that this operand is both read and written by the instruction. `m` shows that this is a memory operand. `BITOP_ADDR` is defined as:
 
 ```C
 #define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
@@ -171,23 +181,23 @@ Let's try to understand how it works step by step:
 
 Next is the `memory` clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).
 
-`Ir` - immideate register operand. 
+`Ir` - immediate register operand.
 
 
-`bts` instruction sets given bit in a bit string and stores the value of a given bit in the `CF` flag. So we passed cpu number which is zero in our case and after `set_bit` will be executed, it sets zero bit in the `cpu_online_bits` cpumask. It would mean that the first cpu is online at this moment.
+The `bts` instruction sets a given bit in a bit string and stores the previous value of that bit in the `CF` flag. So we passed the cpu number, which is zero in our case, and after `set_bit` is executed, it sets the zero bit in the `cpu_online_bits` cpumask. It means that the first cpu is online at this moment.
 
-Besides the `set_cpu_*` API, cpumask ofcourse provides another API for cpumasks manipulation. Let's consider it in shoft.
+Besides the `set_cpu_*` API, cpumask of course provides other APIs for cpumask manipulation. Let's consider them briefly.
 
 Additional cpumask API
 --------------------------------------------------------------------------------
 
-cpumask provides the set of macro for getting amount of the CPUs with different state. For example:
+cpumask provides a set of macros for getting the number of CPUs in various states. For example:
 
 ```C
 #define num_online_cpus()	cpumask_weight(cpu_online_mask)
 ```
 
-This macro returns amount of the `online` CPUs. It calls `cpumask_weight` function with the `cpu_online_mask` bitmap (read about about it). `cpumask_wieght` function makes an one call of the `bitmap_wiegt` function with two parameters:
+This macro returns the number of `online` CPUs. It calls the `cpumask_weight` function with the `cpu_online_mask` bitmap (read about it above). The `cpumask_weight` function makes one call to the `bitmap_weight` function with two parameters:
 
 * cpumask bitmap;
 * `nr_cpumask_bits` - which is `NR_CPUS` in our case.
@@ -199,7 +209,7 @@ static inline unsigned int cpumask_weight(const struct cpumask *srcp)
 }
 ```
 
-and calculates amount of the bits in the given bitmap. Besides the `num_online_cpus`, cpumask provides macros for the all CPU states:
+and calculates the number of bits in the given bitmap. Besides the `num_online_cpus`, cpumask provides similar macros for all CPU states (see the short usage sketch after this list):
 
 * num_possible_cpus;
 * num_active_cpus;
@@ -208,7 +218,7 @@ and calculates amount of the bits in the given bitmap. Besides the `num_online_c
 
 and many more.
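+
+A minimal usage sketch of these counters (kernel context assumed; illustrative only):
+
+```C
+static void report_cpus(void)
+{
+	pr_info("%u of %u possible CPUs are online\n",
+		num_online_cpus(), num_possible_cpus());
+}
+```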
 
-Besides that Linux kernel provides following API for the manipulating of `cpumask`:
+Besides that the Linux kernel provides the following API for the manipulation of `cpumask`:
 
 * `for_each_cpu` - iterates over every cpu in a mask;
 * `for_each_cpu_not` - iterates over every cpu in a complemented mask;

+ 395 - 0
Concepts/initcall.md

@@ -0,0 +1,395 @@
+The initcall mechanism
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+As you may understand from the title, this part will cover an interesting and important concept in the Linux kernel which is called - `initcall`. We already saw definitions like these:
+
+```C
+early_param("debug", debug_kernel);
+```
+
+or
+
+```C
+arch_initcall(init_pit_clocksource);
+```
+
+in some parts of the Linux kernel. Before we see how this mechanism is implemented in the Linux kernel, we must understand what it actually is and how the Linux kernel uses it. Definitions like these represent a [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) function which will be called either during initialization of the Linux kernel or right after it. Actually the main point of the `initcall` mechanism is to determine the correct order of initialization for the built-in modules and subsystems. For example, let's look at the following function:
+
+```C
+static int __init nmi_warning_debugfs(void)
+{
+    debugfs_create_u64("nmi_longest_ns", 0644,
+                       arch_debugfs_dir, &nmi_longest_ns);
+    return 0;
+}
+```
+
+from the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file. As we may see, it just creates the `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) file in the `arch_debugfs_dir` directory. Actually, this `debugfs` file may be created only after the `arch_debugfs_dir` directory has been created. Creation of this directory occurs during the architecture-specific initialization of the Linux kernel, in the `arch_kdebugfs_init` function from the [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) source code file. Note that the `arch_kdebugfs_init` function is marked as an `initcall` too:
+
+```C
+arch_initcall(arch_kdebugfs_init);
+```
+
+The Linux kernel calls all architecture-specific `initcalls` before the `fs` related `initcalls`. So, our `nmi_longest_ns` file will be created only after the `arch_debugfs_dir` directory has been created. Actually, the Linux kernel provides eight levels of main `initcalls`:
+
+* `early`;
+* `core`;
+* `postcore`;
+* `arch`;
+* `subsys`;
+* `fs`;
+* `device`;
+* `late`.
+
+All of their names are represented by the `initcall_level_names` array which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
+
+```C
+static char *initcall_level_names[] __initdata = {
+	"early",
+	"core",
+	"postcore",
+	"arch",
+	"subsys",
+	"fs",
+	"device",
+	"late",
+};
+```
+
+All functions which are marked as `initcall` by these identifiers will be called in that order: first the `early initcalls` are called, then the `core initcalls`, and so on. Now that we know a little about the `initcall` mechanism, we can start to dive into the source code of the Linux kernel to see how this mechanism is implemented.
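+
+As a quick usage sketch (the function below is made up for illustration, but the macro is the real one): to have a function run at the `device` level during boot, it is enough to mark it as follows:
+
+```C
+static int __init my_init(void)
+{
+	pr_info("my_init called\n");	/* runs at the "device" initcall level */
+	return 0;
+}
+device_initcall(my_init);
+```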
+
+Implementation of the initcall mechanism in the Linux kernel
+--------------------------------------------------------------------------------
+
+The Linux kernel provides a set of macros from the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file to mark a given function as `initcall`. All of these macros are pretty simple:
+
+```C
+#define early_initcall(fn)		__define_initcall(fn, early)
+#define core_initcall(fn)		__define_initcall(fn, 1)
+#define postcore_initcall(fn)		__define_initcall(fn, 2)
+#define arch_initcall(fn)		__define_initcall(fn, 3)
+#define subsys_initcall(fn)		__define_initcall(fn, 4)
+#define fs_initcall(fn)			__define_initcall(fn, 5)
+#define device_initcall(fn)		__define_initcall(fn, 6)
+#define late_initcall(fn)		__define_initcall(fn, 7)
+```
+
+and as we may see these macros just expand to the call of the `__define_initcall` macro from the same header file. Moreover, the `__define_initcall` macro takes two arguments:
+
+* `fn` - callback function which will be called during the call of `initcalls` of a certain level;
+* `id` - identifier to distinguish the `initcall`, to prevent an error when two identical `initcalls` point to the same handler.
+
+The implementation of the `__define_initcall` macro looks like:
+
+```C
+#define __define_initcall(fn, id) \
+	static initcall_t __initcall_##fn##id __used \
+	__attribute__((__section__(".initcall" #id ".init"))) = fn; \
+	LTO_REFERENCE_INITCALL(__initcall_##fn##id)
+```
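+
+To make this more concrete, here is roughly what the `device_initcall(my_init)` line from the earlier sketch would expand to (ignoring the `LTO_REFERENCE_INITCALL` stub; the exact spelling is an illustration):
+
+```C
+static initcall_t __initcall_my_init6 __used
+	__attribute__((__section__(".initcall6.init"))) = my_init;
+```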
+
+To understand the `__define_initcall` macro, first of all let's look at the `initcall_t` type. This type is defined in the same [header](https://github.com/torvalds/linux/blob/master/include/linux/init.h) file and it represents a pointer to a function which returns an [integer](https://en.wikipedia.org/wiki/Integer) that will be the result of the `initcall`:
+
+```C
+typedef int (*initcall_t)(void);
+```
+
+Now let's return to the `__define_initcall` macro. The [##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) operator concatenates two symbols. In our case, the first line of the `__define_initcall` macro produces the definition of an `__initcall_<function_name><id>` variable, places it in the `.initcall<id>.init` [ELF section](http://www.skyfree.org/linux/references/ELF_Format.pdf) and marks it with the `__used` [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) attribute. If we look in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) header file which represents data for the kernel [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script, we will see that all of the `initcalls` sections will be placed in the `.data` section:
+
+```C
+#define INIT_CALLS					\
+		VMLINUX_SYMBOL(__initcall_start) = .;	\
+		*(.initcallearly.init)					\
+		INIT_CALLS_LEVEL(0)					    \
+		INIT_CALLS_LEVEL(1)					    \
+		INIT_CALLS_LEVEL(2)					    \
+		INIT_CALLS_LEVEL(3)					    \
+		INIT_CALLS_LEVEL(4)					    \
+		INIT_CALLS_LEVEL(5)					    \
+		INIT_CALLS_LEVEL(rootfs)				\
+		INIT_CALLS_LEVEL(6)					    \
+		INIT_CALLS_LEVEL(7)					    \
+		VMLINUX_SYMBOL(__initcall_end) = .;
+
+#define INIT_DATA_SECTION(initsetup_align)	\
+	.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) {	   \
+        ...                                                \
+        INIT_CALLS						                   \
+        ...                                                \
+	}
+
+```
+
+The second attribute - `__used` is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and it expands to the definition of the following `gcc` attribute:
+
+```C
+#define __used   __attribute__((__used__))
+```
+
+which prevents a `variable defined but not used` warning. The last line of the `__define_initcall` macro is:
+
+```C
+LTO_REFERENCE_INITCALL(__initcall_##fn##id)
+```
+
+It depends on the `CONFIG_LTO` kernel configuration option and just provides a stub for compiler [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization):
+
+```
+#ifdef CONFIG_LTO
+#define LTO_REFERENCE_INITCALL(x) \
+        static __used __exit void *reference_##x(void)  \
+        {                                               \
+                return &x;                              \
+        }
+#else
+#define LTO_REFERENCE_INITCALL(x)
+#endif
+```
+
+In order to prevent any problem when there is no reference to a variable in a module, it will be moved to the end of the program. That's all about the `__define_initcall` macro. So, all of the `*_initcall` macros are expanded during compilation of the Linux kernel, all `initcalls` are placed in their sections and will be available from the `.data` section, and the Linux kernel knows where to find a certain `initcall` to call it during the initialization process.
+
+Now that the `initcalls` are in place, let's look at how the Linux kernel calls them. This process starts in the `do_basic_setup` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
+
+```C
+static void __init do_basic_setup(void)
+{
+    ...
+    ...
+    ...
+   	do_initcalls();
+    ...
+    ...
+    ...
+}
+```
+
+which is called during the initialization of the Linux kernel, right after the main initialization steps, like the memory manager related initialization, the `CPU` subsystem and others, have already finished. The `do_initcalls` function just goes through the array of `initcall` levels and calls the `do_initcall_level` function for each level:
+
+```C
+static void __init do_initcalls(void)
+{
+	int level;
+
+	for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
+		do_initcall_level(level);
+}
+```
+
+The `initcall_levels` array is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/init/main.c) and contains pointers to the sections which were defined in the `__define_initcall` macro:
+
+```C
+static initcall_t *initcall_levels[] __initdata = {
+	__initcall0_start,
+	__initcall1_start,
+	__initcall2_start,
+	__initcall3_start,
+	__initcall4_start,
+	__initcall5_start,
+	__initcall6_start,
+	__initcall7_start,
+	__initcall_end,
+};
+```
+
+If you are interested, you can find these sections in the `arch/x86/kernel/vmlinux.lds` linker script which is generated after the Linux kernel compilation:
+
+```
+.init.data : AT(ADDR(.init.data) - 0xffffffff80000000) {
+    ...
+    ...
+    ...
+    ...
+    __initcall_start = .;
+    *(.initcallearly.init)
+    __initcall0_start = .;
+    *(.initcall0.init)
+    *(.initcall0s.init)
+    __initcall1_start = .;
+    ...
+    ...
+}
+```
+
+If you are not familiar with this, you can learn more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.
+
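+By the way, the same linker trick can be reproduced outside of the kernel. The following user-space sketch is my own illustration (it is not kernel code and assumes `gcc` and GNU `ld` on Linux): function pointers are collected in a custom `myinit` section and then walked exactly like `do_initcalls` walks the `.initcall*.init` sections:
+
+```C
+#include <stdio.h>
+
+typedef int (*initcall_t)(void);
+
+/* Put a pointer to fn into a custom "myinit" section. The __used__
+ * attribute keeps the pointer even though nothing references it directly. */
+#define my_initcall(fn)                                         \
+	static initcall_t __initcall_##fn                       \
+	__attribute__((__used__, __section__("myinit"))) = fn
+
+static int hello_init(void) { puts("hello_init"); return 0; }
+static int world_init(void) { puts("world_init"); return 0; }
+
+my_initcall(hello_init);
+my_initcall(world_init);
+
+/* GNU ld generates these boundary symbols for the "myinit" section. */
+extern initcall_t __start_myinit[], __stop_myinit[];
+
+int main(void)
+{
+	for (initcall_t *fn = __start_myinit; fn < __stop_myinit; fn++)
+		(*fn)();
+	return 0;
+}
+```
+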
+As we just saw, the `do_initcall_level` function takes one parameter - the level of the `initcalls` - and does the following two things: first of all this function parses the `initcall_command_line`, which is a copy of the usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) that may contain parameters for modules, with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) source code file, and then calls the `do_one_initcall` function for each entry of the given level:
+
+```C
+for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
+		do_one_initcall(*fn);
+```
+
+The `do_one_initcall` does the main job for us. As we may see, this function takes one parameter which represents an `initcall` callback function and calls the given callback:
+
+```C
+int __init_or_module do_one_initcall(initcall_t fn)
+{
+	int count = preempt_count();
+	int ret;
+	char msgbuf[64];
+
+	if (initcall_blacklisted(fn))
+		return -EPERM;
+
+	if (initcall_debug)
+		ret = do_one_initcall_debug(fn);
+	else
+		ret = fn();
+
+	msgbuf[0] = 0;
+
+	if (preempt_count() != count) {
+		sprintf(msgbuf, "preemption imbalance ");
+		preempt_count_set(count);
+	}
+	if (irqs_disabled()) {
+		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
+		local_irq_enable();
+	}
+	WARN(msgbuf[0], "initcall %pF returned with %s\n", fn, msgbuf);
+
+	return ret;
+}
+```
+
+Let's try to understand what the `do_one_initcall` function does. First of all we save the current value of the [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) counter so that we can check it later and be sure that it is not imbalanced. After this step we can see the call of the `initcall_blacklisted` function which
+goes over the `blacklisted_initcalls` list which stores blacklisted `initcalls` and rejects the given `initcall` if it is located in this list:
+
+```C
+list_for_each_entry(entry, &blacklisted_initcalls, next) {
+	if (!strcmp(fn_name, entry->buf)) {
+		pr_debug("initcall %s blacklisted\n", fn_name);
+		kfree(fn_name);
+		return true;
+	}
+}
+```
+
+The blacklisted `initcalls` are stored in the `blacklisted_initcalls` list and this list is filled during early Linux kernel initialization from the Linux kernel command line.
+
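+A particular `initcall` can be put into this list at boot time via the `initcall_blacklist` kernel command line parameter, which takes a comma-separated list of initcall names; the function names below are only an illustration:
+
+```
+initcall_blacklist=my_driver_init,another_driver_init
+```
+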
+After the blacklisted `initcalls` have been handled, the next part of the code calls the `initcall` directly:
+
+```C
+if (initcall_debug)
+	ret = do_one_initcall_debug(fn);
+else
+	ret = fn();
+```
+
+Depending on the value of the `initcall_debug` variable, the `do_one_initcall_debug` function will call the `initcall`, or this function will do it directly via `fn()`. The `initcall_debug` variable is defined in the [same](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
+
+```C
+bool initcall_debug;
+```
+
+and provides the ability to print some information to the kernel [log buffer](https://en.wikipedia.org/wiki/Dmesg). The value of the variable can be set from the kernel command line via the `initcall_debug` parameter. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) of the Linux kernel command line:
+
+```
+initcall_debug	[KNL] Trace initcalls as they are executed.  Useful
+                      for working out where the kernel is dying during
+                      startup.
+```
+
+And that's true. If we look at the implementation of the `do_one_initcall_debug` function, we will see that it does the same as the `do_one_initcall` function, i.e. the `do_one_initcall_debug` function calls the given `initcall` and prints some information related to its execution (like the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the currently running task, the duration of the execution of the `initcall`, etc.):
+
+```C
+static int __init_or_module do_one_initcall_debug(initcall_t fn)
+{
+	ktime_t calltime, delta, rettime;
+	unsigned long long duration;
+	int ret;
+
+	printk(KERN_DEBUG "calling  %pF @ %i\n", fn, task_pid_nr(current));
+	calltime = ktime_get();
+	ret = fn();
+	rettime = ktime_get();
+	delta = ktime_sub(rettime, calltime);
+	duration = (unsigned long long) ktime_to_ns(delta) >> 10;
+	printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n",
+		 fn, ret, duration);
+
+	return ret;
+}
+```
+
+After an `initcall` was called by one of the `do_one_initcall` or `do_one_initcall_debug` functions, we may see two checks at the end of the `do_one_initcall` function. The first one checks the balance of possible `__preempt_count_add` and `__preempt_count_sub` calls inside of the executed initcall, and if this value is not equal to the previous value of the preemption counter, we add the `preemption imbalance` string to the message buffer and set the correct value of the preemption counter:
+
+```C
+if (preempt_count() != count) {
+	sprintf(msgbuf, "preemption imbalance ");
+	preempt_count_set(count);
+}
+```
+
+Later this error string will be printed. The last check is for the state of the local [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29): if they are disabled, we add the `disabled interrupts` string to our message buffer and enable the `IRQs` for the current processor to prevent the state when `IRQs` were disabled by an `initcall` and were not enabled again:
+
+```C
+if (irqs_disabled()) {
+	strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
+	local_irq_enable();
+}
+```
+
+That's all. In this way the Linux kernel does the initialization of many subsystems in the correct order. From now on, we know what the `initcall` mechanism in the Linux kernel is. In this part, we covered the main general portion of the `initcall` mechanism, but we left out some important concepts. Let's take a short look at these concepts.
+
+First of all, we have missed one level of `initcalls`, namely the `rootfs initcalls`. You can find the definition of the `rootfs_initcall` in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file along with all the similar macros which we saw in this part:
+
+```C
+#define rootfs_initcall(fn)		__define_initcall(fn, rootfs)
+```
+
+As we may understand from the macro's name, its main purpose is to store callbacks which are related to the [rootfs](https://en.wikipedia.org/wiki/Initramfs). Besides this goal, it may be useful for initialization of other things which must happen after the filesystem-level initialization, but before the device-related initialization. For example, the decompression of the [initramfs](https://en.wikipedia.org/wiki/Initramfs) which occurs in the `populate_rootfs` function from the [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) source code file:
+
+```C
+rootfs_initcall(populate_rootfs);
+```
+
+From this place, we may see familiar output:
+
+```
+[    0.199960] Unpacking initramfs...
+```
+
+Besides the `rootfs_initcall` level, there are additional `console_initcall`, `security_initcall` and other secondary `initcall` levels. The last thing that we have missed is the set of the `*_initcall_sync` levels. Almost every `*_initcall` macro that we have seen in this part has a companion macro with the `_sync` suffix:
+
+```C
+#define core_initcall_sync(fn)		__define_initcall(fn, 1s)
+#define postcore_initcall_sync(fn)	__define_initcall(fn, 2s)
+#define arch_initcall_sync(fn)		__define_initcall(fn, 3s)
+#define subsys_initcall_sync(fn)	__define_initcall(fn, 4s)
+#define fs_initcall_sync(fn)		__define_initcall(fn, 5s)
+#define device_initcall_sync(fn)	__define_initcall(fn, 6s)
+#define late_initcall_sync(fn)		__define_initcall(fn, 7s)
+```
+
+The main goal of these additional levels is to wait for the completion of all module-related initialization routines of a certain level.
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+In this part we saw an important mechanism of the Linux kernel which allows it to call functions, which depend on the current state of the Linux kernel, during its initialization.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
+* [debugfs](https://en.wikipedia.org/wiki/Debugfs)
+* [integer type](https://en.wikipedia.org/wiki/Integer)
+* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
+* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
+* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
+* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
+* [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
+* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
+* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
+* [rootfs](https://en.wikipedia.org/wiki/Initramfs)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)

+ 37 - 37
Concepts/per-cpu.md

@@ -1,9 +1,9 @@
 Per-CPU variables
 ================================================================================
 
-Per-CPU variables are one of the kernel features. You can understand what this feature means by reading its name. We can create a variable and each processor core will have its own copy of this variable. We take a closer look on this feature and try to understand how it is implemented and how it works in this part.
+Per-CPU variables are one of the kernel features. You can understand the meaning of this feature by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.
 
-The kernel provides API for creating per-cpu variables - `DEFINE_PER_CPU` macro:
+The kernel provides an API for creating per-cpu variables - the `DEFINE_PER_CPU` macro:
 
 ```C
 #define DEFINE_PER_CPU(type, name) \
@@ -12,13 +12,13 @@ The kernel provides API for creating per-cpu variables - `DEFINE_PER_CPU` macro:
 
 This macro defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) as many other macros for work with per-cpu variables. Now we will see how this feature is implemented.
 
-Take a look at the `DECLARE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variable, for example like this:
+Take a look at the `DECLARE_PER_CPU` definition. We see that it takes 2 parameters: `type` and `name`, so we can use it to create per-cpu variables, for example like this:
 
 ```C
 DEFINE_PER_CPU(int, per_cpu_n)
 ```
 
-We pass the type and the name of our variable. `DEFI_PER_CPU` calls `DEFINE_PER_CPU_SECTION` macro and passes the same two paramaters and empty string to it. Let's look at the definition of the `DEFINE_PER_CPU_SECTION`:
+We pass the type and the name of our variable. `DEFINE_PER_CPU` calls the `DEFINE_PER_CPU_SECTION` macro and passes the same two parameters and empty string to it. Let's look at the definition of the `DEFINE_PER_CPU_SECTION`:
 
 ```C
 #define DEFINE_PER_CPU_SECTION(type, name, sec)    \
@@ -32,35 +32,35 @@ We pass the type and the name of our variable. `DEFI_PER_CPU` calls `DEFINE_PER_
          PER_CPU_ATTRIBUTES
 ```
 
-where section is:
+where `section` is:
 
 ```C
 #define PER_CPU_BASE_SECTION ".data..percpu"
 ```
 
-After all macros are expanded we will get global per-cpu variable:
+After all macros are expanded we will get a global per-cpu variable:
 
 ```C
 __attribute__((section(".data..percpu"))) int per_cpu_n
 ```
 
-It means that we will have `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`:
+It means that we will have a `per_cpu_n` variable in the `.data..percpu` section. We can find this section in the `vmlinux`:
 
 ```
 .data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
               CONTENTS, ALLOC, LOAD, DATA
 ```
 
-Ok, now we know that when we use `DEFINE_PER_CPU` macro, per-cpu variable in the `.data..percpu` section will be created. When the kernel initilizes it calls `setup_per_cpu_areas` function which loads `.data..percpu` section multiply times, one section per CPU.
+Ok, now we know that when we use the `DEFINE_PER_CPU` macro, a per-cpu variable in the `.data..percpu` section will be created. When the kernel initializes it calls the `setup_per_cpu_areas` function which loads the `.data..percpu` section multiple times, one section per CPU.
 
-Let's look on the per-CPU areas initialization process. It start in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) from the call of the `setup_per_cpu_areas` function which defined in the [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c).
+Let's look at the per-CPU areas initialization process. It starts in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) from the call of the `setup_per_cpu_areas` function which is defined in the [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c).
 
 ```C
 pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
         NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
 ```
 
-The `setup_per_cpu_areas` starts from the output information about the Maximum number of CPUs set during kernel configuration with `CONFIG_NR_CPUS` configuration option, actual number of CPUs, `nr_cpumask_bits` is the same that `NR_CPUS` bit for the new `cpumask` operators and number of `NUMA` nodes.
+The `setup_per_cpu_areas` starts by printing information about the maximum number of CPUs set during kernel configuration with the `CONFIG_NR_CPUS` configuration option, the actual number of CPUs, `nr_cpumask_bits` (which is the same as `NR_CPUS` for the new `cpumask` operators) and the number of `NUMA` nodes.
 
 We can see this output in the dmesg:
 
@@ -69,7 +69,7 @@ $ dmesg | grep percpu
 [    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
 ```
 
-In the next step we check `percpu` first chunk allocator. All percpu areas are allocated in chunks. First chunk is used for the static percpu variables. Linux kernel has `percpu_alloc` command line parameters which provides type of the first chunk allocator. We can read about it in the kernel documentation:
+In the next step we check the `percpu` first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has `percpu_alloc` command line parameters which provides the type of the first chunk allocator. We can read about it in the kernel documentation:
 
 ```
 percpu_alloc=	Select which percpu first chunk allocator to use.
@@ -80,21 +80,21 @@ percpu_alloc=	Select which percpu first chunk allocator to use.
 		and performance comparison.
 ```
 
-The [mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) contains handler of this command line option:
+The [mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) contains the handler of this command line option:
 
 ```C
 early_param("percpu_alloc", percpu_alloc_setup);
 ```
 
-Where `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depends on the `percpu_alloc` parameter value. By default first chunk allocator is `auto`:
+Where the `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depending on the `percpu_alloc` parameter value. By default the first chunk allocator is `auto`:
 
 ```C
 enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
 ```
 
-If `percpu_alooc` parameter not given to the kernel command line, the `embed` allocator will be used wich as you can understand embed the first percpu chunk into bootmem with the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is first chunk `page` allocator which maps first chunk with `PAGE_SIZE` pages.
+If the `percpu_alloc` parameter is not given to the kernel command line, the `embed` allocator will be used which embeds the first percpu chunk into bootmem with the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages.
 
-As I wrote about first of all we make a check of the first chunk allocator type in the `setup_per_cpu_areas`. First of all we check that first chunk allocator is not page:
+As I wrote above, first of all we check the type of the first chunk allocator in the `setup_per_cpu_areas`. We check that the first chunk allocator is not page:
 
 ```C
 if (pcpu_chosen_fc != PCPU_FC_PAGE) {
@@ -104,7 +104,7 @@ if (pcpu_chosen_fc != PCPU_FC_PAGE) {
 }
 ```
 
-If it is not `PCPU_FC_PAGE`, we will use `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function:
+If it is not `PCPU_FC_PAGE`, we will use the `embed` allocator and allocate space for the first chunk with the `pcpu_embed_first_chunk` function:
 
 ```C
 rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
@@ -113,16 +113,16 @@ rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
 					    pcpu_fc_alloc, pcpu_fc_free);
 ```
 
-As I wrote above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem. As you can see we pass a couple of parameters to the `pcup_embed_first_chunk`, they are
+As shown above, the `pcpu_embed_first_chunk` function embeds the first percpu chunk into bootmem. We pass a couple of parameters to the `pcpu_embed_first_chunk`. They are as follows:
 
 * `PERCPU_FIRST_CHUNK_RESERVE` - the size of the reserved space for the static `percpu` variables;
-* `dyn_size` - minimum free size for dynamic allocation in byte;
+* `dyn_size` - minimum free size for dynamic allocation in bytes;
 * `atom_size` - all allocations are whole multiples of this and aligned to this parameter;
 * `pcpu_cpu_distance` - callback to determine distance between cpus;
 * `pcpu_fc_alloc` - function to allocate `percpu` page;
 * `pcpu_fc_free` - function to release `percpu` page.
 
-All of this parameters we calculat before the call of the `pcpu_embed_first_chunk`:
+We calculate all of these parameters before the call of the `pcpu_embed_first_chunk`:
 
 ```C
 const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
@@ -134,15 +134,15 @@ size_t atom_size;
 #endif
 ```
 
-If first chunk allocator is `PCPU_FC_PAGE`, we will use the `pcpu_page_first_chunk` instead of the `pcpu_embed_first_chunk`. After that `percpu` areas up, we setup `percpu` offset and its segment for the every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from the arrays to the `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr` and etc...). After the kernel finished the initialization process, we have loaded N `.data..percpu` sections, where N is the number of CPU, and section used by bootstrap processor will contain uninitialized variable created with `DEFINE_PER_CPU` macro.
+If the first chunk allocator is `PCPU_FC_PAGE`, we will use the `pcpu_page_first_chunk` instead of the `pcpu_embed_first_chunk`. After the `percpu` areas are up, we set up the `percpu` offset and its segment for every CPU with the `setup_percpu_segment` function (only for `x86` systems) and move some early data from the arrays to the `percpu` variables (`x86_cpu_to_apicid`, `irq_stack_ptr`, etc.). After the kernel finishes the initialization process, we will have loaded N `.data..percpu` sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain the uninitialized variables created with the `DEFINE_PER_CPU` macro.
 
-The kernel provides API for per-cpu variables manipulating:
+The kernel provides an API for per-cpu variables manipulating:
 
 * get_cpu_var(var)
 * put_cpu_var(var)
 
 
-Let's look at `get_cpu_var` implementation:
+Let's look at the `get_cpu_var` implementation:
 
 ```C
 #define get_cpu_var(var)     \
@@ -152,7 +152,7 @@ Let's look at `get_cpu_var` implementation:
 }))
 ```
 
-Linux kernel is preemptible and accessing a per-cpu variable requires to know which processor kernel running on. So, current code must not be preempted and moved to the another CPU while accessing a per-cpu variable. That's why first of all we can see call of the `preempt_disable` function. After this we can see call of the `this_cpu_ptr` macro, which looks as:
+The Linux kernel is preemptible and accessing a per-cpu variable requires us to know which processor the kernel is running on. So, the current code must not be preempted and moved to another CPU while accessing a per-cpu variable. That's why, first of all we can see a call of the `preempt_disable` function and then a call of the `this_cpu_ptr` macro, which looks like:
 
 ```C
 #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
@@ -164,7 +164,7 @@ and
 #define raw_cpu_ptr(ptr)        per_cpu_ptr(ptr, 0)
 ```
 
-where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (second parameter). After that we got per-cpu variables and made any manipulations on it, we must call `put_cpu_var` macro which enables preemption with call of `preempt_enable` function. So the typical usage of a per-cpu variable is following:
+where `per_cpu_ptr` returns a pointer to the per-cpu variable for the given cpu (second parameter). After we have accessed a per-cpu variable and made modifications to it, we must call the `put_cpu_var` macro which enables preemption again with a call of the `preempt_enable` function. So the typical usage of a per-cpu variable is as follows:
 
 ```C
 get_cpu_var(var);
@@ -174,7 +174,7 @@ get_cpu_var(var);
 put_cpu_var(var);
 ```
 
-Let's look at `per_cpu_ptr` macro:
+Let's look at the `per_cpu_ptr` macro:
 
 ```C
 #define per_cpu_ptr(ptr, cpu)                             \
@@ -184,47 +184,47 @@ Let's look at `per_cpu_ptr` macro:
 })
 ```
 
-As I wrote above, this macro returns per-cpu variable for the given cpu. First of all it calls `__verify_pcpu_ptr`:
+As I wrote above, this macro returns a per-cpu variable for the given cpu. First of all it calls `__verify_pcpu_ptr`:
 
 ```C
 #define __verify_pcpu_ptr(ptr)
 do {
 	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;
-	(void)__vpp_verify; 
+	(void)__vpp_verify;
 } while (0)
 ```
 
+which verifies at compile time that the given `ptr` is of the `const void __percpu *` type.
+which makes the given `ptr` type of `const void __percpu *`,
 
-After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. At first parameter we pass our ptr and sencond we pass cpu number to the `per_cpu_offset` macro which:
+After this we can see the call of the `SHIFT_PERCPU_PTR` macro with two parameters. As the first parameter we pass our `ptr` and as the second parameter we pass the cpu number to the `per_cpu_offset` macro:
 
 ```C
 #define per_cpu_offset(x) (__per_cpu_offset[x])
 ```
 
-expands to getting `x` element from the `__per_cpu_offset` array:
+which expands to getting the `x` element from the `__per_cpu_offset` array:
 
 
 ```C
 extern unsigned long __per_cpu_offset[NR_CPUS];
 ```
 
-where `NR_CPUS` is the number of CPUs. `__per_cpu_offset` array filled with the distances between cpu-variables copies. For example all per-cpu data is `X` bytes size, so if we access `__per_cpu_offset[Y]`, so `X*Y` will be accessed. Let's look at the `SHIFT_PERCPU_PTR` implementation:
+where `NR_CPUS` is the number of CPUs. The `__per_cpu_offset` array is filled with the distances between the cpu-variable copies. For example, if all per-cpu data is `X` bytes in size, then `__per_cpu_offset[Y]` will contain the offset `X*Y`. Let's look at the `SHIFT_PERCPU_PTR` implementation:
 
 ```C
 #define SHIFT_PERCPU_PTR(__p, __offset)                                 \
          RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
 ```
 
-`RELOC_HIDE` just returns offset `(typeof(ptr)) (__ptr + (off))` and it will be pointer of the variable.
+`RELOC_HIDE` just returns `(typeof(ptr)) (__ptr + (off))`, i.e. a pointer to the given CPU's copy of the variable.
 
-That's all! Of course it is not the full API, but the general part. It can be hard for the start, but to understand per-cpu variables feature need to understand mainly [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) magic.
+That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) magic.
 
-Let's again look at the algorithm of getting pointer on per-cpu variable:
+Let's again look at the algorithm of getting a pointer to a per-cpu variable:
 
-* The kernel creates multiply `.data..percpu` sections (ones perc-pu) during initialization process;
-* All variables created with the `DEFINE_PER_CPU` macro will be reloacated to the first section or for CPU0;
+* The kernel creates multiple `.data..percpu` sections (one per CPU) during the initialization process;
+* All variables created with the `DEFINE_PER_CPU` macro will be relocated to the first section, i.e. the one for CPU0;
 * `__per_cpu_offset` array filled with the distance (`BOOT_PERCPU_OFFSET`) between `.data..percpu` sections;
-* When `per_cpu_ptr` called for example for getting pointer on the certain per-cpu variable for the third CPU, `__per_cpu_offset` array will be accessed, where every index points to the certain CPU.
+* When `per_cpu_ptr` is called, for example to get a pointer to a certain per-cpu variable for the third CPU, the `__per_cpu_offset` array will be accessed, where every index points to the required CPU.
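+
+To make this more concrete, here is a small sketch of my own (not from the kernel source; the `my_event_count` name is invented) of how a per-cpu counter might be defined and updated in kernel code using the API described above:
+
+```C
+#include <linux/percpu.h>
+
+/* one counter per CPU */
+static DEFINE_PER_CPU(unsigned long, my_event_count);
+
+static void count_event(void)
+{
+	/* disables preemption and returns this CPU's copy of the variable */
+	get_cpu_var(my_event_count)++;
+	/* enables preemption again */
+	put_cpu_var(my_event_count);
+}
+```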
 
 That's all.

+ 3 - 2
DataStructures/README.md

@@ -1,9 +1,10 @@
 Data Structures in the Linux Kernel
 ========================================================================
 
-Linux kernel provides implementations of a different data structures like linked list, B+ tree, prinority heap and many many more.
+The Linux kernel provides different implementations of data structures like the doubly linked list, B+ tree, priority heap and many more.
 
-This part considers these data structures and algorithms.
+This part considers the following data structures and algorithms:
 
   * [Doubly linked list](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md)
   * [Radix tree](https://github.com/0xAX/linux-insides/blob/master/DataStructures/radix-tree.md)
+  * [Bit arrays](https://github.com/0xAX/linux-insides/blob/master/DataStructures/bitmap.md)

+ 384 - 0
DataStructures/bitmap.md

@@ -0,0 +1,384 @@
+Data Structures in the Linux Kernel
+================================================================================
+
+Bit arrays and bit operations in the Linux kernel
+--------------------------------------------------------------------------------
+
+Besides different [linked](https://en.wikipedia.org/wiki/Linked_data_structure) and [tree](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) based data structures, the Linux kernel provides an [API](https://en.wikipedia.org/wiki/Application_programming_interface) for [bit arrays](https://en.wikipedia.org/wiki/Bit_array) or `bitmaps`. Bit arrays are heavily used in the Linux kernel and the following source code files contain a common `API` for working with such structures:
+
+* [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c)
+* [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h)
+
+Besides these two files, there is also an architecture-specific header file which provides optimized bit operations for a certain architecture. We consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case it will be the:
+
+* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)
+
+header file. As I just wrote above, `bitmaps` are heavily used in the Linux kernel. For example, a `bit array` is used to store the set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (you can read more about this in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part), a `bit array` stores the set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during the initialization of the Linux kernel, etc.
+
+So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start.
+
+Declaration of bit array
+================================================================================
+
+Before we look at the `API` for bitmap manipulation, we must know how to declare one in the Linux kernel. There are two common methods to declare your own bit array. The first simple way to declare a bit array is to define an array of `unsigned long`. For example:
+
+```C
+unsigned long my_bitmap[8]
+```
+
+The second way is to use the `DECLARE_BITMAP` macro which is defined in the [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h) header file:
+
+```C
+#define DECLARE_BITMAP(name,bits) \
+    unsigned long name[BITS_TO_LONGS(bits)]
+```
+
+We can see that `DECLARE_BITMAP` macro takes two parameters:
+
+* `name` - name of bitmap;
+* `bits` - amount of bits in bitmap;
+
+and just expands to the definition of an `unsigned long` array with `BITS_TO_LONGS(bits)` elements, where the `BITS_TO_LONGS` macro converts a given number of bits to the number of `longs`, or in other words it calculates how many `8` byte elements are needed for `bits`:
+
+```C
+#define BITS_PER_BYTE           8
+#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
+#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
+```
+
+So, for example `DECLARE_BITMAP(my_bitmap, 64)` will produce:
+
+```python
+>>> (((64) + (64) - 1) / (64))
+1
+```
+
+and:
+
+```C
+unsigned long my_bitmap[1];
+```
+
+Now that we are able to declare a bit array, we can start using it.
+
+Architecture-specific bit operations
+================================================================================
+
+We already saw above a couple of source code and header files which provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) for the manipulation of bit arrays. The most important and widely used API of bit arrays is architecture-specific and, as we already know, is located in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file.
+
+First of all let's look at the two most important functions:
+
+* `set_bit`;
+* `clear_bit`.
+
+I think that there is no need to explain what these functions do. This must already be clear from their names. Let's look at their implementation. If you look into the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, you will note that each of these functions is represented by two variants: [atomic](https://en.wikipedia.org/wiki/Linearizability) and not. Before we start to dive into the implementations of these functions, first of all we must know a little about `atomic` operations.
+
+In simple words, atomic operations guarantee that two or more operations will not be performed on the same data concurrently. The `x86` architecture provides a set of atomic instructions, for example the [xchg](http://x86.renejeschke.de/html/file_module_x86_id_328.html) instruction, the [cmpxchg](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction, etc. Besides the atomic instructions, some non-atomic instructions can be made atomic with the help of the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction. This is enough to know about atomic operations for now, so we can begin to consider the implementation of the `set_bit` and `clear_bit` functions.
+
+First of all, let's consider the `non-atomic` variants of these functions. The names of the non-atomic `set_bit` and `clear_bit` start with a double underscore. As we already know, all of these functions are defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and the first function is `__set_bit`:
+
+```C
+static inline void __set_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
+}
+```
+
+As we can see it takes two arguments:
+
+* `nr` - the number of the bit in the bit array.
+* `addr` - the address of the bit array where we need to set the bit.
+
+Note that the `addr` parameter is defined with the `volatile` keyword which tells the compiler that the value at the given address may be changed. The implementation of `__set_bit` is pretty easy. As we can see, it just contains one line of [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler) code. In our case we are using the [bts](http://x86.renejeschke.de/html/file_module_x86_id_25.html) instruction which selects the bit specified by the first operand (`nr` in our case) from the bit array, stores the value of the selected bit in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) flag of the flags register and sets this bit.
+
+Note that we can see the usage of `nr`, but `addr` does not appear here. You might already guess that the secret is in `ADDR`. The `ADDR` is a macro which is defined in the same header file and expands to an output operand which contains the value at the given address and the `+m` constraint:
+
+```C
+#define ADDR				BITOP_ADDR(addr)
+#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
+```
+
+Besides the `+m`, we can see other constraints in the `__set_bit` function. Let's look at them and try to understand what they mean:
+
+* `+m` - represents a memory operand, where `+` tells that the given operand will be both an input and an output operand;
+* `I` - represents an integer constant;
+* `r` - represents a register operand.
+
+Besides these constraints, we also can see the `memory` keyword which tells the compiler that this code will change values in memory. That's all. Now let's look at the same function, but in its `atomic` variant. It looks more complex than its `non-atomic` variant:
+
+```C
+static __always_inline void
+set_bit(long nr, volatile unsigned long *addr)
+{
+	if (IS_IMMEDIATE(nr)) {
+		asm volatile(LOCK_PREFIX "orb %1,%0"
+			: CONST_MASK_ADDR(nr, addr)
+			: "iq" ((u8)CONST_MASK(nr))
+			: "memory");
+	} else {
+		asm volatile(LOCK_PREFIX "bts %1,%0"
+			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
+	}
+}
+```
+
+First of all note that this function takes the same set of parameters as `__set_bit`, but it is additionally marked with the `__always_inline` attribute. The `__always_inline` is a macro which is defined in [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) and just expands to the `always_inline` attribute:
+
+```C
+#define __always_inline inline __attribute__((always_inline))
+```
+
+which means that this function will always be inlined to reduce the size of the Linux kernel image. Now let's try to understand the implementation of the `set_bit` function. First of all we check the given bit number at the beginning of the `set_bit` function. The `IS_IMMEDIATE` macro is defined in the same [header](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) file and expands to the call of the builtin [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) function:
+
+```C
+#define IS_IMMEDIATE(nr)		(__builtin_constant_p(nr))
+```
+
+The `__builtin_constant_p` builtin function returns `1` if the given parameter is known to be constant at compile-time and returns `0` otherwise. We do not need to use the slow `bts` instruction to set a bit if the given bit number is a compile-time constant. We can just apply a [bitwise or](https://en.wikipedia.org/wiki/Bitwise_operation#OR) to the byte at the given address which contains the given bit, using a mask where only the wanted bit is `1` and the other bits are zero. If the given bit number is not a constant known at compile-time, we do the same as we did in the `__set_bit` function. The `CONST_MASK_ADDR` macro:
+
+```C
+#define CONST_MASK_ADDR(nr, addr)	BITOP_ADDR((void *)(addr) + ((nr)>>3))
+```
+
+expands to the given address with an offset to the byte which contains the given bit. For example, say we have the address `0x1000` and the bit number `0x9`. As `0x9` means `one byte + one bit`, our address will be `addr + 1`:
+
+```python
+>>> hex(0x1000 + (0x9 >> 3))
+'0x1001'
+```
+
+The `CONST_MASK` macro represents the given bit number as a byte where only the bit corresponding to the given number is `1` and the other bits are `0`:
+
+```C
+#define CONST_MASK(nr)			(1 << ((nr) & 7))
+```
+
+```python
+>>> bin(1 << (0x9 & 7))
+'0b10'
+```
+
+In the end we just apply a bitwise `or` of these values. So, for example, if the value stored at our address is `0x4097` and we need to set the `0x9` bit:
+
+```python
+>>> bin(0x4097)
+'0b100000010010111'
+>>> bin((0x4097 >> 0x9) | (1 << (0x9 & 7)))
+'0b100010'
+```
+
+the `ninth` bit will be set.
+
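+To double-check this arithmetic, here is a small user-space sketch of my own (not kernel code) which computes the byte offset and the mask for a given bit number in the same way as `CONST_MASK_ADDR` and `CONST_MASK`:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+	unsigned long nr = 0x9;               /* bit number             */
+	unsigned long byte_offset = nr >> 3;  /* which byte holds it    */
+	unsigned char mask = 1 << (nr & 7);   /* bit position in a byte */
+
+	/* prints: bit 9 lives in byte 1, mask 0x2 */
+	printf("bit %lu lives in byte %lu, mask 0x%x\n",
+	       nr, byte_offset, mask);
+	return 0;
+}
+```
+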
+Note that all of these operations are marked with `LOCK_PREFIX` which expands to the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction prefix and guarantees the atomicity of the operation.
+
+As we already know, besides the `set_bit` and `__set_bit` operations, the Linux kernel provides two inverse functions to clear a bit in atomic and non-atomic context. They are `clear_bit` and `__clear_bit`. Both of these functions are defined in the same [header file](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) and take the same set of arguments. But not only the arguments are similar. Generally these functions are very similar to `set_bit` and `__set_bit`. Let's look at the implementation of the non-atomic `__clear_bit` function:
+
+```C
+static inline void __clear_bit(long nr, volatile unsigned long *addr)
+{
+	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
+}
+```
+
+Yes. As we see, it takes the same set of arguments and contains a very similar block of inline assembler. It just uses the [btr](http://x86.renejeschke.de/html/file_module_x86_id_24.html) instruction instead of `bts`. As we can understand from the function's name, it clears the given bit at the given address. The `btr` instruction acts like `bts`: this instruction also selects the given bit which is specified in the first operand, stores its value in the `CF` flag of the flags register and clears this bit in the given bit array which is specified by the second operand.
+
+The atomic variant of the `__clear_bit` is `clear_bit`:
+
+```C
+static __always_inline void
+clear_bit(long nr, volatile unsigned long *addr)
+{
+	if (IS_IMMEDIATE(nr)) {
+		asm volatile(LOCK_PREFIX "andb %1,%0"
+			: CONST_MASK_ADDR(nr, addr)
+			: "iq" ((u8)~CONST_MASK(nr)));
+	} else {
+		asm volatile(LOCK_PREFIX "btr %1,%0"
+			: BITOP_ADDR(addr)
+			: "Ir" (nr));
+	}
+}
+```
+
+and as we can see it is very similar to `set_bit` and just contains two differences. The first difference: it uses the `btr` instruction to clear the bit where `set_bit` uses the `bts` instruction to set it. The second difference: it uses a negated mask and the `and` instruction to clear the bit in the given byte where `set_bit` uses the `or` instruction.
+
+That's all. Now we can set and clear a bit in any bit array, and we can go on to the other operations on bitmasks.
+
+The most widely used operations on bit arrays in the Linux kernel are setting and clearing a bit. But besides these operations, it is useful to perform additional operations on a bit array. Yet another widely used operation in the Linux kernel is to know whether a given bit in a bit array is set or not. We can achieve this with the help of the `test_bit` macro. This macro is defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and expands to the call of `constant_test_bit` or `variable_test_bit` depending on the bit number:
+
+```C
+#define test_bit(nr, addr)			\
+	(__builtin_constant_p((nr))                 \
+	 ? constant_test_bit((nr), (addr))	        \
+	 : variable_test_bit((nr), (addr)))
+```
+
+So, if `nr` is known to be a compile-time constant, `test_bit` will be expanded to the call of the `constant_test_bit` function, and to `variable_test_bit` otherwise. Now let's look at the implementations of these functions. Let's start with `variable_test_bit`:
+
+```C
+static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
+{
+	int oldbit;
+
+	asm volatile("bt %2,%1\n\t"
+		     "sbb %0,%0"
+		     : "=r" (oldbit)
+		     : "m" (*(unsigned long *)addr), "Ir" (nr));
+
+	return oldbit;
+}
+```
+
+The `variable_test_bit` function takes a similar set of arguments as `set_bit` and the other functions take. We also may see inline assembly code here which executes the [bt](http://x86.renejeschke.de/html/file_module_x86_id_22.html) and [sbb](http://x86.renejeschke.de/html/file_module_x86_id_286.html) instructions. The `bt` or `bit test` instruction selects the given bit, which is specified by the first operand, from the bit array which is specified by the second operand and stores its value in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) flag of the flags register. The second instruction, `sbb`, subtracts the first operand from the second and also subtracts the value of `CF`. So, here we write the value of the given bit from the given bit array to the `CF` flag of the flags register and execute the `sbb` instruction which calculates `00000000 - CF` and writes the result to `oldbit`.
+
+The `constant_test_bit` function does the same as we saw in the `set_bit`:
+
+```C
+static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
+{
+	return ((1UL << (nr & (BITS_PER_LONG-1))) &
+		(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
+}
+```
+
+It generates an `unsigned long` value where only the given bit is set (similar to what we saw in `CONST_MASK`) and applies a bitwise [and](https://en.wikipedia.org/wiki/Bitwise_operation#AND) to the element of the array which contains the given bit.
+
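+If the inline assembler makes the semantics hard to follow, the following user-space sketch of my own (written in portable C, without the atomicity guarantees of the kernel versions) computes the same results as `__set_bit`, `__clear_bit` and `test_bit`:
+
+```C
+#include <stdio.h>
+#include <limits.h>
+
+#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
+
+static void my_set_bit(long nr, unsigned long *addr)
+{
+	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
+}
+
+static void my_clear_bit(long nr, unsigned long *addr)
+{
+	addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
+}
+
+static int my_test_bit(long nr, const unsigned long *addr)
+{
+	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
+}
+
+int main(void)
+{
+	unsigned long bitmap[2] = { 0 };	/* 128 bits, all clear */
+
+	my_set_bit(9, bitmap);
+	my_set_bit(70, bitmap);
+	my_clear_bit(9, bitmap);
+
+	/* prints: bit 9 = 0, bit 70 = 1 */
+	printf("bit 9 = %d, bit 70 = %d\n",
+	       my_test_bit(9, bitmap), my_test_bit(70, bitmap));
+	return 0;
+}
+```
+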
+The next widely used bit array related operation is to change a bit in a bit array. The Linux kernel provides two helpers for this:
+
+* `__change_bit`;
+* `change_bit`.
+
+As you can already guess, these two variants are atomic and non-atomic, as for example `set_bit` and `__set_bit`. To start, let's look at the implementation of the `__change_bit` function:
+
+```C
+static inline void __change_bit(long nr, volatile unsigned long *addr)
+{
+    asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
+}
+```
+
+Pretty easy, isn't it? The implementation of `__change_bit` is the same as `__set_bit`, but instead of the `bts` instruction we are using [btc](http://x86.renejeschke.de/html/file_module_x86_id_23.html). This instruction selects the given bit from the given bit array, stores its value in `CF` and changes its value by applying the complement operation. So, a bit with value `1` will become `0` and vice versa:
+
+```python
+>>> int(not 1)
+0
+>>> int(not 0)
+1
+```
+
+The atomic version of the `__change_bit` is the `change_bit` function:
+
+```C
+static inline void change_bit(long nr, volatile unsigned long *addr)
+{
+	if (IS_IMMEDIATE(nr)) {
+		asm volatile(LOCK_PREFIX "xorb %1,%0"
+			: CONST_MASK_ADDR(nr, addr)
+			: "iq" ((u8)CONST_MASK(nr)));
+	} else {
+		asm volatile(LOCK_PREFIX "btc %1,%0"
+			: BITOP_ADDR(addr)
+			: "Ir" (nr));
+	}
+}
+```
+
+It is similar to the `set_bit` function, but also has two differences. The first difference is the `xor` operation instead of `or` and the second is `btc` instead of `bts`.
+
+At this point we know the most important architecture-specific operations on bit arrays. Time to look at the generic bitmap API.
+
+Common bit operations
+================================================================================
+
+Besides the architecture-specific API from the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, the Linux kernel provides a common API for the manipulation of bit arrays. As we know from the beginning of this part, we can find it in the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and additionally in the [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c) source code file. But before these source code files, let's look into the [include/linux/bitops.h](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h) header file which provides a set of useful macros. Let's look at some of them.
+
+First of all let's look at following four macros:
+
+* `for_each_set_bit`
+* `for_each_set_bit_from`
+* `for_each_clear_bit`
+* `for_each_clear_bit_from`
+
+All of these macros provide an iterator over a certain set of bits in a bit array. The first macro iterates over bits which are set, the second does the same, but starts from a certain bit. The last two macros do the same, but iterate over clear bits. Let's look at the implementation of the `for_each_set_bit` macro:
+
+```C
+#define for_each_set_bit(bit, addr, size) \
+	for ((bit) = find_first_bit((addr), (size));		\
+	     (bit) < (size);					\
+	     (bit) = find_next_bit((addr), (size), (bit) + 1))
+```
+
+As we may see, it takes three arguments and expands to a loop which starts from the first set bit, returned as the result of the `find_first_bit` function, and continues to the following set bits while the bit number is less than the given size.
+
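+As a usage illustration, iterating over all set bits of a bitmap declared with `DECLARE_BITMAP` might look like this kernel-style sketch (the `my_bitmap` and `print_set_bits` names are invented for illustration):
+
+```C
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+
+#define MY_NBITS 64
+
+static DECLARE_BITMAP(my_bitmap, MY_NBITS);
+
+static void print_set_bits(void)
+{
+	unsigned int bit;
+
+	/* visits only the bits which are currently set to 1 */
+	for_each_set_bit(bit, my_bitmap, MY_NBITS)
+		pr_info("bit %u is set\n", bit);
+}
+```
+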
+Besides these four macros, the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) also provides an API for the rotation of `64-bit` or `32-bit` values, etc.
+
+The next [header](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) file provides an API for the manipulation of bit arrays. For example, it provides the following two functions:
+
+* `bitmap_zero`;
+* `bitmap_fill`.
+
+to clear a bit array and to fill it with `1` respectively. Let's look at the implementation of the `bitmap_zero` function:
+
+```C
+static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
+{
+	if (small_const_nbits(nbits))
+		*dst = 0UL;
+	else {
+		unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
+		memset(dst, 0, len);
+	}
+}
+```
+
+First of all we can see the check for `nbits`. The `small_const_nbits` is a macro which is defined in the same header [file](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) and looks like:
+
+```C
+#define small_const_nbits(nbits) \
+	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
+```
+
+As we may see, it checks that `nbits` is a constant known at compile time and that its value does not exceed `BITS_PER_LONG` or `64`. If the number of bits does not exceed the amount of bits in a `long` value, we can just set it to zero. Otherwise we need to calculate how many `long` values are needed to fill our bit array and fill it with [memset](http://man7.org/linux/man-pages/man3/memset.3.html).
+
+The implementation of the `bitmap_fill` function is similar to the implementation of the `bitmap_zero` function, except that we fill the given bit array with `0xff` values or `0b11111111`:
+
+```C
+static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
+{
+	unsigned int nlongs = BITS_TO_LONGS(nbits);
+	if (!small_const_nbits(nbits)) {
+		unsigned int len = (nlongs - 1) * sizeof(unsigned long);
+		memset(dst, 0xff,  len);
+	}
+	dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits);
+}
+```
+
+Besides the `bitmap_fill` and `bitmap_zero` functions, the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file provides `bitmap_copy` which is similar to `bitmap_zero`, but just uses [memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html) instead of [memset](http://man7.org/linux/man-pages/man3/memset.3.html). Also it provides bitwise operations for bit arrays like `bitmap_and`, `bitmap_or`, `bitmap_xor`, etc. We will not consider the implementation of these functions because it is easy to understand them if you understood everything from this part. Anyway, if you are interested in how these functions are implemented, you may open the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and start to research.
+
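+As a short usage illustration (again a kernel-style sketch with invented names), these helpers are typically combined like this:
+
+```C
+#include <linux/bitmap.h>
+
+#define MY_NBITS 128
+
+static DECLARE_BITMAP(src, MY_NBITS);
+static DECLARE_BITMAP(dst, MY_NBITS);
+
+static void bitmap_example(void)
+{
+	bitmap_zero(src, MY_NBITS);          /* all bits cleared        */
+	bitmap_fill(src, MY_NBITS);          /* all bits set to 1       */
+	bitmap_copy(dst, src, MY_NBITS);     /* dst now equals src      */
+	bitmap_and(dst, dst, src, MY_NBITS); /* bitwise and of the two  */
+}
+```
+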
+That's all.
+
+Links
+================================================================================
+
+* [bitmap](https://en.wikipedia.org/wiki/Bit_array)
+* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
+* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) 
+* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
+* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
+* [xchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_328.html)
+* [cmpxchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
+* [lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
+* [bts instruction](http://x86.renejeschke.de/html/file_module_x86_id_25.html)
+* [btr instruction](http://x86.renejeschke.de/html/file_module_x86_id_24.html)
+* [bt instruction](http://x86.renejeschke.de/html/file_module_x86_id_22.html)
+* [sbb instruction](http://x86.renejeschke.de/html/file_module_x86_id_286.html)
+* [btc instruction](http://x86.renejeschke.de/html/file_module_x86_id_23.html)
+* [man memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html) 
+* [man memset](http://man7.org/linux/man-pages/man3/memset.3.html)
+* [CF](https://en.wikipedia.org/wiki/FLAGS_register)
+* [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler)
+* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)

+ 27 - 25
DataStructures/dlist.md

@@ -4,9 +4,9 @@ Data Structures in the Linux Kernel
 Doubly linked list
 --------------------------------------------------------------------------------
 
-Linux kernel provides its own doubly linked list implementation which you can find in the [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h). We will start `Data Structures in the Linux kernel` from the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to [search](http://lxr.free-electrons.com/ident?i=list_head)
+Linux kernel provides its own implementation of doubly linked list, which you can find in the [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h). We will start `Data Structures in the Linux kernel` from the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to [search](http://lxr.free-electrons.com/ident?i=list_head)
 
-First of all let's look on the main structure:
+First of all, let's look on the main structure in the [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h):
 
 ```C
 struct list_head {
@@ -14,7 +14,7 @@ struct list_head {
 };
 ```
 
-You can note that it is different from many lists implementations which you could see. For example this doubly linked list structure from the [glib](http://www.gnu.org/software/libc/):
+You can note that it is different from many implementations of the doubly linked list which you have seen. For example, this doubly linked list structure from the [glib](http://www.gnu.org/software/libc/) library looks like:
 
 ```C
 struct GList {
@@ -24,7 +24,7 @@ struct GList {
 };
 ```
 
-Usually a linked list structure contains a pointer to the item. Linux kernel implementation of the list does not. So the main question is - `where does the list store the data?`. The actual implementation of lists in the kernel is - `Intrusive list`. An intrusive linked list does not contain data in its nodes - A node just contains pointers to the next and previous node and list nodes part of the data that are added to the list. This makes the data structure generic, so it does not care about entry data type anymore.
+Usually a linked list structure contains a pointer to the item. The implementation of the linked list in the Linux kernel does not. So the main question is - `where does the list store the data?`. The actual implementation of the linked list in the kernel is an `intrusive list`. An intrusive linked list does not contain data in its nodes - a node just contains pointers to the next and previous nodes, and the list nodes are embedded in the data structures that are added to the list. This makes the data structure generic, so it does not care about the entry data type anymore.
 
 For example:
 
@@ -35,13 +35,13 @@ struct nmi_desc {
 };
 ```
 
-Let's look at some examples, how `list_head` is used in the kernel. As I already wrote about, there are many, really many different places where lists are used in the kernel. Let's look for example in miscellaneous character drivers. Misc character drivers API from the [drivers/char/misc.c](https://github.com/torvalds/linux/blob/master/drivers/char/misc.c) for writing small drivers for handling simple hardware or virtual devices. This drivers share major number:
+Let's look at some examples to understand how `list_head` is used in the kernel. As I already wrote, there are many, really many different places where lists are used in the kernel. Let's look for an example in the miscellaneous character drivers. The misc character drivers API from [drivers/char/misc.c](https://github.com/torvalds/linux/blob/master/drivers/char/misc.c) is used for writing small drivers for handling simple hardware or virtual devices. Those drivers share the same major number:
 
 ```C
 #define MISC_MAJOR              10
 ```
 
-but has own minor number. For example you can see it with:
+but have their own minor number. For example you can see it with:
 
 ```
 ls -l /dev |  grep 10
@@ -67,7 +67,7 @@ crw-------   1 root root     10,  63 Mar 21 12:01 vga_arbiter
 crw-------   1 root root     10, 137 Mar 21 12:01 vhci
 ```
 
-Now let's look how lists are used in the misc device drivers. First of all let's look on `miscdevice` structure:
+Now let's have a close look at how lists are used in the misc device drivers. First of all, let's look on `miscdevice` structure:
 
 ```C
 struct miscdevice
@@ -83,20 +83,20 @@ struct miscdevice
 };
 ```
 
-We can see the fourth field in the `miscdevice` structure - `list` which is list of registered devices. In the beginning of the source code file we can see definition of the:
+We can see the fourth field in the `miscdevice` structure - `list` - which is a list of registered devices. At the beginning of the source code file we can see the definition of `misc_list`:
 
 ```C
 static LIST_HEAD(misc_list);
 ```
 
-which expands to definition of the variables with `list_head` type:
+which expands to the definition of a variable of the `list_head` type:
 
 ```C
 #define LIST_HEAD(name) \
 	struct list_head name = LIST_HEAD_INIT(name)
 ```
 
-and initializes it with the `LIST_HEAD_INIT` macro which set previous and next entries:
+and initializes it with the `LIST_HEAD_INIT` macro, which sets the previous and next entries to the address of the variable `name`:
 
 ```C
 #define LIST_HEAD_INIT(name) { &(name), &(name) }
@@ -108,7 +108,7 @@ Now let's look on the `misc_register` function which registers a miscellaneous d
 INIT_LIST_HEAD(&misc->list);
 ```
 
-which does the same that `LIST_HEAD_INIT` macro:
+which does the same as the `LIST_HEAD_INIT` macro:
 
 ```C
 static inline void INIT_LIST_HEAD(struct list_head *list)
@@ -118,13 +118,13 @@ static inline void INIT_LIST_HEAD(struct list_head *list)
 }
 ```
 
-In the next step after device created with the `device_create` function we add it to the miscellaneous devices list with:
+In the next step after a device is created by the `device_create` function, we add it to the miscellaneous devices list with:
 
 ```
 list_add(&misc->list, &misc_list);
 ```
 
-Kernel `list.h` provides this API for the addition of new entry to the list. Let's look on it's implementation:
+Kernel `list.h` provides this API for the addition of a new entry to the list. Let's look at its implementation:
 
 ```C
 static inline void list_add(struct list_head *new, struct list_head *head)
@@ -135,8 +135,8 @@ static inline void list_add(struct list_head *new, struct list_head *head)
 
 It just calls internal function `__list_add` with the 3 given parameters:
 
-* new  - new entry;
-* head - list head after which will be inserted new item;
+* new  - new entry.
+* head - list head after which the new item will be inserted.
 * head->next - next item after list head.
 
 Implementation of the `__list_add` is pretty simple:
@@ -153,9 +153,9 @@ static inline void __list_add(struct list_head *new,
 }
 ```
 
-Here we set new item between `prev` and `next`. So `misc` list which we defined at the start with the `LIST_HEAD_INIT` macro will contain previous and next pointers to the `miscdevice->list`.
+Here we add a new item between `prev` and `next`. So the `misc_list` list which we defined at the start with the `LIST_HEAD_INIT` macro will contain previous and next pointers to the `miscdevice->list`.
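+
+To see these pointer manipulations outside the kernel, here is a minimal user-space sketch (the structure and function names are made up for illustration) that mimics `LIST_HEAD_INIT`, `__list_add` and a forward traversal:
+
+```C
+#include <stdio.h>
+
+struct list_node {
+	struct list_node *next, *prev;
+};
+
+/* like LIST_HEAD_INIT: an empty list points to itself */
+#define NODE_INIT(name) { &(name), &(name) }
+
+/* the same steps as the kernel's __list_add */
+static void node_add(struct list_node *new, struct list_node *prev, struct list_node *next)
+{
+	next->prev = new;
+	new->next = next;
+	new->prev = prev;
+	prev->next = new;
+}
+
+int main(void) {
+	struct list_node head = NODE_INIT(head);
+	struct list_node a, b;
+
+	node_add(&a, &head, head.next);  /* list: head -> a */
+	node_add(&b, &head, head.next);  /* list: head -> b -> a */
+
+	int count = 0;
+	for (struct list_node *p = head.next; p != &head; p = p->next)
+		count++;
+	printf("%d\n", count);           /* prints 2 */
+	return 0;
+}
+```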
 
-There is still only one question how to get list's entry. There is special special macro for this point:
+There is still one question: how to get the list's entry. There is a special macro for this:
 
 ```C
 #define list_entry(ptr, type, member) \
@@ -166,7 +166,7 @@ which gets three parameters:
 
 * ptr - the structure list_head pointer;
 * type - structure type;
-* member - the name of the list_head within the struct; 
+* member - the name of the list_head within the structure;
 
 For example:
 
@@ -174,14 +174,14 @@ For example:
 const struct miscdevice *p = list_entry(v, struct miscdevice, list)
 ```
 
-After this we can access to the any `miscdevice` field with `p->minor` or `p->name` and etc... Let's look on the `list_entry` implementation:
+After this we can access any `miscdevice` field with `p->minor`, `p->name`, etc. Let's look at the `list_entry` implementation:
 
 ```C
 #define list_entry(ptr, type, member) \
 	container_of(ptr, type, member)
 ```
 
-As we can see it just calls `container_of` macro with the same arguments. For the first look `container_of` looks strange:
+As we can see it just calls the `container_of` macro with the same arguments. At first sight, `container_of` looks strange:
 
 ```C
 #define container_of(ptr, type, member) ({                      \
@@ -189,7 +189,7 @@ As we can see it just calls `container_of` macro with the same arguments. For th
     (type *)( (char *)__mptr - offsetof(type,member) );})
 ```
 
-First of all you can note that it consists from two expressions in curly brackets. Compiler will evaluate the whole block in the curly braces and use the value of the last expression.
+First of all you can note that it consists of two expressions in curly brackets. The compiler will evaluate the whole block in the curly braces and use the value of the last expression.
 
 For example:
 
@@ -205,7 +205,7 @@ int main() {
 
 will print `2`.
 
-The next point is `typeof`, it's simple. As you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the `container_of` macro, the strangest thing for me was the zero in the `((type *)0)` expression. Actually this pointer magic calculates the offset of the given field from the address of the structure, but as we have `0` here, it will be just a zero offset alongwith the field width. Let's look at a simple example:
+The next point is `typeof`, it's simple. As you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the `container_of` macro, the strangest thing I found was the zero in the `((type *)0)` expression. Actually this pointer magic calculates the offset of the given field from the address of the structure, but as the base address is `0` here, the resulting address is just the offset of the field. Let's look at a simple example:
 
 ```C
 #include <stdio.h>
@@ -224,15 +224,15 @@ int main() {
 
 will print `0x5`.
 
-The next offsetof macro calculates offset from the beginning of the structure to the given structure's field. Its implementation is very similar to the previous code:
+The next `offsetof` macro calculates the offset from the beginning of the structure to the given structure's field. Its implementation is very similar to the previous code:
 
 ```C
 #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
 ```
 
-Let's summarize all about `container_of` macro. `container_of` macro returns address of the structure by the given address of the structure's field with `list_head` type, the name of the structure field with `list_head` type and type of the container structure. At the first line this macro declares the `__mptr` pointer which points to the field of the structure that `ptr` points to and assigns it to the `ptr`. Now `ptr` and `__mptr` point to the same address. Technically we don't need this line but its useful for type checking. First line ensures that that given structure (`type` parameter) has a member called `member`. In the second line it calculates offset of the field from the structure with the `offsetof` macro and subtracts it from the structure address. That's all.
+Let's summarize all about the `container_of` macro. The `container_of` macro returns the address of the containing structure, given the address of the structure's field with the `list_head` type, the name of that field within the structure and the type of the container structure. At the first line this macro declares the `__mptr` pointer which points to the field of the structure that `ptr` points to and assigns `ptr` to it. Now `ptr` and `__mptr` point to the same address. Technically we don't need this line but it's useful for type checking. The first line ensures that the given structure (`type` parameter) has a member called `member`. In the second line it calculates the offset of the field from the start of the structure with the `offsetof` macro and subtracts it from the field's address, which gives the address of the containing structure. That's all.
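+
+To make the `container_of` mechanics more tangible, here is a minimal user-space sketch (not kernel code; the structure and variable names are invented for the example) that embeds a list node in a structure and recovers the containing structure from a pointer to that node:
+
+```C
+#include <stdio.h>
+#include <stddef.h>
+
+/* a minimal stand-in for struct list_head */
+struct list_node {
+	struct list_node *next, *prev;
+};
+
+/* user data with the list node embedded in it, like miscdevice->list */
+struct my_device {
+	int minor;
+	const char *name;
+	struct list_node list;
+};
+
+/* the same idea as the kernel's container_of macro */
+#define my_container_of(ptr, type, member) \
+	((type *)((char *)(ptr) - offsetof(type, member)))
+
+int main(void) {
+	struct my_device dev = { .minor = 42, .name = "demo", .list = { NULL, NULL } };
+	struct list_node *node = &dev.list;  /* what a list traversal would give us */
+	struct my_device *back = my_container_of(node, struct my_device, list);
+
+	printf("%d %s\n", back->minor, back->name);  /* prints: 42 demo */
+	return 0;
+}
+```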
 
-Of course `list_add` and `list_entry` is not only functions which provides `<linux/list.h>`. Implementation of the doubly linked list provides the following API:
+Of course `list_add` and `list_entry` are not the only functions which `<linux/list.h>` provides. The implementation of the doubly linked list provides the following API:
 
 * list_add
 * list_add_tail
@@ -243,5 +243,7 @@ Of course `list_add` and `list_entry` is not only functions which provides `<lin
 * list_empty
 * list_cut_position
 * list_splice
+* list_for_each
+* list_for_each_entry
 
 and many more.

+ 22 - 16
DataStructures/radix-tree.md

@@ -9,7 +9,7 @@ As you already know linux kernel provides many different libraries and functions
 * [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h)
 * [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c)
 
-Lets talk about what is `radix tree`. Radix tree is a `compressed trie` where [trie](http://en.wikipedia.org/wiki/Trie) is a data structure which implements interface of an associative array and allows to store values as `key-value`. The keys are usually strings, but any other data type can be used as well. Trie is different from any `n-tree` in its nodes. Nodes of a trie do not store keys, instead, a node of a trie stores single character labels. The key which is related to a given node is derived by traversing from the root of the tree to this node. For example:
+Let's talk about what a `radix tree` is. A radix tree is a `compressed trie`, where a [trie](http://en.wikipedia.org/wiki/Trie) is a data structure which implements the interface of an associative array and allows values to be stored as `key-value` pairs. The keys are usually strings, but any data type can be used. A trie is different from an `n-tree` because of its nodes. Nodes of a trie do not store keys; instead, a node of a trie stores single character labels. The key related to a given node is derived by traversing from the root of the tree to this node. For example:
 
 
 ```
@@ -41,9 +41,9 @@ Lets talk about what is `radix tree`. Radix tree is a `compressed trie` where [t
                             +-----------+
 ```
 
-So in this example, we can see the `trie` with keys, `go` and `cat`. The compressed trie or `radix tree` differs from `trie`, such that all intermediates nodes which have only one child are removed.
+So in this example, we can see the `trie` with the keys `go` and `cat`. The compressed trie or `radix tree` differs from a `trie` in that all intermediate nodes which have only one child are removed.
 
-Radix tree in linux kernel is the datastructure which maps values to the integer key. It is represented by the following structures from the file [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h):
+The radix tree in the linux kernel is a data structure which maps values to integer keys. It is represented by the following structures from the file [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h):
 
 ```C
 struct radix_tree_root {
@@ -56,14 +56,20 @@ struct radix_tree_root {
 This structure presents the root of a radix tree and contains three fields:
 
 * `height`   - height of the tree;
-* `gfp_mask` - tells how memory allocations are to be performed;
+* `gfp_mask` - tells how memory allocations will be performed;
 * `rnode`    - pointer to the child node.
 
-The first structure we will discuss is `gfp_mask`:
+The first field we will discuss is `gfp_mask`:
 
-Low-level kernel memory allocation functions take a set of flags as - `gfp_mask`, which describes how that allocation is to be performed. These `GFP_` flags which control the allocation process can have following values, (`GF_NOIO` flag) be sleep and wait for memory, (`__GFP_HIGHMEM` flag) is high memory can be used, (`GFP_ATOMIC` flag) is allocation process high-priority and can't sleep etc.
+Low-level kernel memory allocation functions take a set of flags as `gfp_mask`, which describes how that allocation is to be performed. These `GFP_` flags which control the allocation process can have the following values:
 
-The next structure is `rnode`:
+* `GFP_NOIO` - can sleep and wait for memory;
+* `__GFP_HIGHMEM` - high memory can be used;
+* `GFP_ATOMIC` - allocation process is high-priority and can't sleep;
+
+etc.
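+
+As a hedged illustration of how such flags are typically passed to an allocator (the helper names here are invented; `kmalloc`, `GFP_KERNEL` and `GFP_ATOMIC` are the usual kernel allocation API):
+
+```C
+#include <linux/slab.h>
+#include <linux/gfp.h>
+
+/* process context: the allocation may sleep and wait for memory */
+static void *alloc_in_process_context(size_t size)
+{
+	return kmalloc(size, GFP_KERNEL);
+}
+
+/* interrupt context: high-priority, must not sleep */
+static void *alloc_in_interrupt_context(size_t size)
+{
+	return kmalloc(size, GFP_ATOMIC);
+}
+```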
+
+The next field is `rnode`:
 
 ```C
 struct radix_tree_node {
@@ -83,7 +89,7 @@ struct radix_tree_node {
 };
 ```
 
-This structure contains information about the offset in a parent and height from the bottom, count of the child nodes and fields for accessing and freeing a node. The fields are described below:
+This structure contains information about the offset in the parent and the height from the bottom, the count of the child nodes, and fields for accessing and freeing a node. These fields are described below:
 
 * `path` - offset in parent & height from the bottom;
 * `count` - count of the child nodes;
@@ -92,14 +98,14 @@ This structure contains information about the offset in a parent and height from
 * `rcu_head` - used for freeing a node;
 * `private_list` - used by the user of a tree;
 
-The two last fields of the `radix_tree_node` - `tags` and `slots` are important and interesting. Every node can contains a set of slots which are store pointers to the data. Empty slots in the linux kernel radix tree implementation store `NULL`. Radix tree in the linux kernel also supports tags which are associated with the `tags` fields in the `radix_tree_node` structure. Tags allow to set individual bits on records which are stored in the radix tree.
+The last two fields of the `radix_tree_node` - `tags` and `slots` - are important and interesting. Every node can contain a set of slots which store pointers to the data. Empty slots in the linux kernel radix tree implementation store `NULL`. Radix trees in the linux kernel also support tags which are associated with the `tags` field in the `radix_tree_node` structure. Tags allow individual bits to be set on records which are stored in the radix tree.
 
-Now we know about radix tree structure, time to look on its API.
+Now that we know about the radix tree structure, it is time to look at its API.
 
 Linux kernel radix tree API
 ---------------------------------------------------------------------------------
 
-We start from the datastructure intialization. There are two ways to initialize new radix tree. The first is to use `RADIX_TREE` macro:
+We start from the data structure initialization. There are two ways to initialize a new radix tree. The first is to use the `RADIX_TREE` macro:
 
 ```C
 RADIX_TREE(name, gfp_mask);
@@ -138,12 +144,12 @@ do {                                 \
 } while (0)
 ```
 
-makes the same initialziation with default values as it does `RADIX_TREE_INIT` macro.
+makes the same initialization with default values as the `RADIX_TREE_INIT` macro does.
 
-The next are two functions for the inserting and deleting records to/from a radix tree:
+Next are two functions for inserting and deleting records to/from a radix tree:
 
 * `radix_tree_insert`;
-* `radix_tree_delete`.
+* `radix_tree_delete`;
 
 The first `radix_tree_insert` function takes three parameters:
 
@@ -164,7 +170,7 @@ The first `radix_tree_lookup` function takes two parameters:
 * root of a radix tree;
 * index key;
 
-This function tries to find the given key in the tree and returns associated record with this key. The second `radix_tree_gang_lookup` function have the following signature
+This function tries to find the given key in the tree and returns the record associated with this key. The second `radix_tree_gang_lookup` function has the following signature:
 
 ```C
 unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
@@ -173,7 +179,7 @@ unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
                                     unsigned int max_items);
 ```
 
-and returns number of records, sorted by the keys, starting from the first index. Number of the returned records will be not greater than `max_items` value.
+and returns the number of records sorted by the keys, starting from the first index. The number of returned records will not be greater than the `max_items` value.
 
 And the last `radix_tree_lookup_slot` function will return the slot which will contain the data.
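+
+As a quick, hedged illustration of this API (the tree name, the `my_item` structure and the helper functions are invented for the example; only `RADIX_TREE`, `radix_tree_insert`, `radix_tree_lookup` and `radix_tree_delete` are the kernel calls described above), a piece of kernel code might use a radix tree like this:
+
+```C
+#include <linux/radix-tree.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+static RADIX_TREE(my_tree, GFP_KERNEL);  /* declare and initialize the root */
+
+struct my_item {
+	unsigned long id;
+};
+
+static int my_item_add(unsigned long key)
+{
+	struct my_item *item = kmalloc(sizeof(*item), GFP_KERNEL);
+
+	if (!item)
+		return -ENOMEM;
+	item->id = key;
+	/* store the pointer under the integer key */
+	return radix_tree_insert(&my_tree, key, item);
+}
+
+static struct my_item *my_item_find(unsigned long key)
+{
+	/* returns the stored pointer or NULL if the key is not present */
+	return radix_tree_lookup(&my_tree, key);
+}
+
+static void my_item_del(unsigned long key)
+{
+	/* radix_tree_delete returns the removed pointer or NULL */
+	kfree(radix_tree_delete(&my_tree, key));
+}
+```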
 

+ 2 - 2
Initialization/README.md

@@ -1,8 +1,8 @@
 # Kernel initialization process
 
-You will find here a couple of posts which describe the full cycle of kernel initialization from its first steps after the kernel has decompressed to the start of the first process run by the kernel itself.
+You will find here a couple of posts which describe the full cycle of kernel initialization from its first step after the kernel has been decompressed to the start of the first process run by the kernel itself.
 
-*Note* That there will not be description of the all kernel initialization steps. Here will be only generic kernel part, without interrupts handling, ACPI, and many other parts. All parts which I'll miss, will be described in other chapters.
+*Note* that there will not be a description of all the kernel initialization steps. Only the generic kernel part will be covered here, without interrupt handling, ACPI, and many other parts. All the parts which are skipped here will be described in other chapters.
 
 * [First steps after kernel decompression](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md) - describes first steps in the kernel.
 * [Early interrupt and exception handling](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) - describes early interrupts initialization and early page fault handler.

+ 161 - 64
Initialization/linux-initialization-1.md

@@ -4,20 +4,20 @@ Kernel initialization. Part 1.
 First steps in the kernel code
 --------------------------------------------------------------------------------
 
-In the previous post (`Kernel booting process. Part 5.`) - [Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) we stopped at the [jump](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) on the decompressed kernel:
+The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) was the last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, and now we start diving into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All the previous parts described the work of the Linux kernel setup code, which prepares everything before the first bytes of the Linux kernel code are executed. From now on we are in the kernel, and all parts of this chapter are devoted to the initialization process of the kernel before it launches the process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel starts the first `init` process. Hopefully we will see all of these preparations in this big chapter. We will start from the kernel entry point, which is located in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), and move further and further. We will see the first preparations like early page table initialization, the switch to a new descriptor in kernel space and many more, before the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) is called.
+
+In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
 
 ```assembly
 jmp	*%rax
 ```
 
-and now we are in the kernel. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) will be called.
-
-So let's start.
+At this moment the `rax` register contains the address of the Linux kernel entry point, which was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. So, our last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after it starts.
 
 First steps in the kernel
 --------------------------------------------------------------------------------
 
-Okay, we got address of the kernel from the `decompress_kernel` function into `rax` register and just jumped there. Decompressed kernel code starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into the `rax` register and just jumped there. As we already know, the entry point of the decompressed kernel image is in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file, and at the beginning of it we can see the following definitions:
 
 ```assembly
 	__HEAD
@@ -29,7 +29,7 @@ startup_64:
 	...
 ```
 
-We can see definition of the `startup_64` routine and it defined in the `__HEAD` section, which is just:
+We can see the definition of the `startup_64` routine, placed in the `__HEAD` section. `__HEAD` is just a macro which expands to the definition of the executable `.head.text` section:
 
 ```C
 #define __HEAD		.section	".head.text","ax"
@@ -46,13 +46,13 @@ We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](htt
 } :text = 0x9090
 ```
 
-We can understand default virtual and physical addresses from the linker script. Note that address of the `_text` is location counter which is defined as:
+Besides the definition of the `.text` section, we can understand the default virtual and physical addresses from the linker script. Note that the address of `_text` is the location counter, which is defined as:
 
 ```
 . = __START_KERNEL;
 ```
 
-for `x86_64`. We can find definition of the `__START_KERNEL` macro in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h):
+for [x86_64](https://en.wikipedia.org/wiki/X86-64). The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and is represented by the sum of the base virtual address of the kernel mapping and the physical start:
 
 ```C
 #define __START_KERNEL	(__START_KERNEL_map + __PHYSICAL_START)
@@ -60,10 +60,10 @@ for `x86_64`. We can find definition of the `__START_KERNEL` macro in the [arch/
 #define __PHYSICAL_START  ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
 ```
 
-Here we can see that `__START_KERNEL` is the sum of the `__START_KERNEL_map` (which is `0xffffffff80000000`, see post about [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)) and `__PHYSICAL_START`. Where `__PHYSICAL_START` is aligned value of the `CONFIG_PHYSICAL_START`. So if you will not use [kASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) and will not change `CONFIG_PHYSICAL_START` in the configuration addresses will be following:
+Or in other words:
 
-* Physical address - `0x1000000`;
-* Virtual address  - `0xffffffff81000000`.
+* Base physical address of the Linux kernel - `0x1000000`;
+* Base virtual address of the Linux kernel - `0xffffffff81000000`.
 
 Now we know the default physical and virtual addresses of the `startup_64` routine, but to know the actual addresses we must calculate them with the following code:
 
@@ -72,18 +72,22 @@ Now we know default physical and virtual addresses of the `startup_64` routine,
 	subq	$_text - __START_KERNEL_map, %rbp
 ```
 
-Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled address of the `_text` is `0xffffffff81000000` and `__START_KERNEL_map` contains `0xffffffff81000000`, so `rbp` will contain physical address of the `text` - `0x1000000` after this calculation. We need to calculate it because kernel can't be run on the default address, but now we know the actual physical address.
+Yes, it is defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate the delta between `0x1000000` and where we are actually loaded. Here we just put the `rip-relative` address into the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that the compiled virtual address of `_text` is `0xffffffff81000000` and its physical address is `0x1000000`. The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code we will get the following expression:
+
+```
+rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
+```
+
+So, after the calculation, `rbp` will contain `0`, which represents the difference between the address where we are actually loaded and where the code was compiled to run. In our case `zero` means that the Linux kernel was loaded at the default address and [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled.
 
-In the next step we checks that this address is aligned with:
+After we got the address of `startup_64`, we need to check that this address is correctly aligned. We will do it with the following code:
 
 ```assembly
-	movq	%rbp, %rax
-	andl	$~PMD_PAGE_MASK, %eax
-	testl	%eax, %eax
+	testl	$~PMD_PAGE_MASK, %ebp
 	jnz	bad_address
 ```
 
-Here we just put address to the `%rax` and test first bit. `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and defined as:
+Here we just compare the low part of the `rbp` register with the complemented value of `PMD_PAGE_MASK`. `PMD_PAGE_MASK` indicates the mask for the `Page middle directory` (read about it in [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)) and is defined as:
 
 ```C
 #define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))
@@ -92,9 +96,9 @@ Here we just put address to the `%rax` and test first bit. `PMD_PAGE_MASK` indic
 #define PMD_SHIFT       21
 ```
 
-As we can easily calculate, `PMD_PAGE_SIZE` is 2 megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for 2 megabytes, we jump to `bad_address` label.
+As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use the standard formula for checking alignment and if the `text` address is not aligned to `2` megabytes, we jump to the `bad_address` label.
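+
+The check above is the usual `addr & (alignment - 1)` idiom. As a small illustration (plain user-space C, not kernel code; the address value is only an assumption for the demo), the same test looks like this:
+
+```C
+#include <stdio.h>
+
+#define PMD_PAGE_SIZE (1UL << 21)            /* 2 megabytes */
+#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE - 1))
+
+int main(void) {
+	unsigned long addr = 0x1000000;      /* default physical address of _text */
+
+	/* non-zero low bits mean the address is not 2 MB aligned */
+	if (addr & ~PMD_PAGE_MASK)
+		printf("bad_address\n");
+	else
+		printf("aligned\n");         /* 0x1000000 is 16 MB, so it is aligned */
+	return 0;
+}
+```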
 
-After this we check address that it is not too large:
+After this we check that the address is not too large by checking its highest `18` bits:
 
 ```assembly
 	leaq	_text(%rip), %rax
@@ -102,7 +106,7 @@ After this we check address that it is not too large:
 	jnz	bad_address
 ```
 
-Address most not be greater than 46-bits:
+The address must not be greater than `46`-bits:
 
 ```C
 #define MAX_PHYSMEM_BITS       46
@@ -113,7 +117,7 @@ Okay, we did some early checks and now we can move on.
 Fix base addresses of page tables
 --------------------------------------------------------------------------------
 
-The first step before we started to setup identity paging, need to correct following addresses:
+The first step before we start to set up identity paging is to fix up the following addresses:
 
 ```assembly
 	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
@@ -122,7 +126,7 @@ The first step before we started to setup identity paging, need to correct follo
 	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
 ```
 
-Here we need to correct `early_level4_pgt` and other addresses of the page table directories, because as I wrote above, kernel can't be run at the default `0x1000000` address. `rbp` register contains actual address so we add to the `early_level4_pgt`, `level3_kernel_pgt` and  `level2_fixmap_pgt`. Let's try to understand what these labels means. First of all let's look on their definition:
+All of `early_level4_pgt`, `level3_kernel_pgt` and the other addresses may be wrong if `startup_64` is not at the default `0x1000000` address. The `rbp` register contains the delta, so we add it to certain entries of `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition:
 
 ```assembly
 NEXT_PAGE(early_level4_pgt)
@@ -147,29 +151,25 @@ NEXT_PAGE(level1_fixmap_pgt)
 	.fill	512,8,0
 ```
 
-Looks hard, but it is not true.
-
-First of all let's look on the `early_level4_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use first 511 `early_level4_pgt` entries. And after this we can see `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look on `_PAGE_TABLE`, it is just page entry access rights:
+Looks hard, but it isn't. First of all let's look at the `early_level4_pgt`. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first `511` entries. And after this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know, `__START_KERNEL_map` is the base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get the physical address of the `level3_kernel_pgt`. Now let's look at `_PAGE_TABLE`, it is just page entry access rights:
 
 ```C
 #define _PAGE_TABLE     (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                          _PAGE_ACCESSED | _PAGE_DIRTY)
 ```
 
-more about it, you can read in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) post.
+You can read more about it in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) part.
 
-`level3_kernel_pgt` - stores entries which map kernel space. At the start of it's definition, we can see that it filled with zeros `L3_START_KERNEL` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After it we can see definition of two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has:
+The `level3_kernel_pgt` stores two entries which map kernel space. At the start of its definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains the `__START_KERNEL_map` address, and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. The first is simple - it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has:
 
 ```C
 #define _KERNPG_TABLE   (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                          _PAGE_DIRTY)
 ```
 
-access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space.
-
-The next `level2_kernel_pgt` calls `PDMS` macro which creates 512 megabytes from the `__START_KERNEL_map` for kernel text (after these 512 megabytes will be modules memory space).
+access rights. The second - `level2_fixmap_pgt` - holds virtual addresses which can refer to any physical addresses, even in kernel space. It is represented by one `level2_fixmap_pgt` entry and a `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. The next `level2_kernel_pgt` calls the `PMDS` macro which creates `512` megabytes from `__START_KERNEL_map` for the kernel `.text` (after these `512` megabytes comes the modules memory space).
 
-Now we know Let's back to our code which is in the beginning of the section. Remember that `rbp` contains actual physical address of the `_text` section. We just add this address to the base address of the page tables, that they'll have correct addresses:
+Now, after we have seen the definitions of these symbols, let's get back to the code described at the beginning of the section. Remember that the `rbp` register contains the delta between the address of the `startup_64` symbol obtained during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and its actual address. So, at this moment, we just need to add this delta to the base addresses of some page table entries so that they have correct addresses. In our case these entries are:
 
 ```assembly
 	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
@@ -178,7 +178,7 @@ Now we know Let's back to our code which is in the beginning of the section. Rem
 	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
 ```
 
-At the first line we add `rbp` to the `early_level4_pgt`, at the second line we add `rbp` to the `level2_kernel_pgt`, at the third line we add `rbp` to the `level2_fixmap_pgt` and add `rbp` to the `level1_fixmap_pgt`.
+namely the last entry of the `early_level4_pgt`, which is `level3_kernel_pgt`, the last two entries of the `level3_kernel_pgt`, which are `level2_kernel_pgt` and `level2_fixmap_pgt`, and the five hundred seventh entry of the `level2_fixmap_pgt`, which is the `level1_fixmap_pgt` page directory.
 
 After all of this we will have:
 
@@ -187,22 +187,22 @@ early_level4_pgt[511] -> level3_kernel_pgt[0]
 level3_kernel_pgt[510] -> level2_kernel_pgt[0]
 level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
 level2_kernel_pgt[0]   -> 512 MB kernel mapping
-level2_fixmap_pgt[506] -> level1_fixmap_pgt 
+level2_fixmap_pgt[507] -> level1_fixmap_pgt
 ```
 
-As we corrected base addresses of the page tables, we can start to build it.
+Note that we didn't fix up the base address of the `early_level4_pgt` and some of the other page table directories, because we will see this during the building/filling of the structures for these page tables. As we have corrected the base addresses of the page tables, we can start to build them.
 
 Identity mapping setup
 --------------------------------------------------------------------------------
 
-Now we can see set up the identity mapping early page tables. Identity Mapped Paging is a virtual addresses which are mapped to physical addresses that have the same value, `1 : 1`. Let's look on it in details. First of all we get the `rip-relative` address of the `_text` and `_early_level4_pgt` and put they into `rdi` and `rbx` registers:
+Now we can see the setup of the identity mapping in the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. First of all we get the `rip-relative` addresses of `_text` and `early_level4_pgt` and put them into the `rdi` and `rbx` registers:
 
 ```assembly
 	leaq	_text(%rip), %rdi
 	leaq	early_level4_pgt(%rip), %rbx
 ```
 
-After this we store physical address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`: 
+After this we store address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`:
 
 ```assembly
 	movq	%rdi, %rax
@@ -221,7 +221,7 @@ where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global dir
 #define PMD_SHIFT       21
 ```
 
-After this we put the address of the first `level3_kernel_pgt` to the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries.
+After this we put the address of the first `level3_kernel_pgt` in the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries.
 
 After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`)  to the `rax`. And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`:
 
@@ -249,7 +249,7 @@ In the next step we write addresses of the page middle directory entries to the
 	jne	1b
 ```
 
-Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contaitns address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward.
+Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contains address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward.
 
 In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`:
 
@@ -259,12 +259,12 @@ In the next step we correct `phys_base` physical address with `rbp` (contains ph
 	jmp 1f
 ```
 
-where `phys_base` mathes the first entry of the `level2_kernel_pgt` which is 512 MB kernel mapping.
+where `phys_base` matches the first entry of the `level2_kernel_pgt` which is `512` MB kernel mapping.
 
-Last preparations 
+Last preparation before jumping to the kernel entry point
 --------------------------------------------------------------------------------
 
-After that we jumped to the label `1` we enable `PAE`, `PGE` (Paging Global Extension) and put the physical address of the `phys_base` (see above) to the `rax` register and fill `cr3` register with it:
+After we jump to the label `1` we enable `PAE` and `PGE` (Paging Global Extension), put the physical address of `phys_base` (see above) into the `rax` register and fill the `cr3` register with it:
 
 ```assembly
 1:
@@ -275,7 +275,7 @@ After that we jumped to the label `1` we enable `PAE`, `PGE` (Paging Global Exte
 	movq	%rax, %cr3
 ```
 
-In the next step we check that CPU support [NX](http://en.wikipedia.org/wiki/NX_bit) bit with:
+In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX_bit) bit with:
 
 ```assembly
 	movl	$0x80000001, %eax
@@ -283,7 +283,7 @@ In the next step we check that CPU support [NX](http://en.wikipedia.org/wiki/NX_
 	movl	%edx,%edi
 ```
 
-We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`.
+We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting the extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`.
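+
+If you want to poke at the same information from user space, here is a hedged sketch using GCC's `<cpuid.h>` helper (the leaf number `0x80000001` and bit `20` of `edx` are the values discussed in this section):
+
+```C
+#include <stdio.h>
+#include <cpuid.h>
+
+int main(void) {
+	unsigned int eax, ebx, ecx, edx;
+
+	/* ask for the extended processor info and feature bits */
+	if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (edx & (1u << 20)))
+		printf("NX bit is supported\n");
+	else
+		printf("NX bit is not supported\n");
+	return 0;
+}
+```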
 
 Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register.
 
@@ -309,7 +309,7 @@ The result will be in the `edx:eax`. General view of the `EFER` is following:
  --------------------------------------------------------------------------------
 ```
 
-We will not see all fields in details here, but we will learn about this and other `MSRs` in the special part about. As we read `EFER` to the `edx:eax`, we checks `_EFER_SCE` or zero bit which is `System Call Extensions` with `btsl` instruction and set it to one. By the setting `SCE` bit we enable `SYSCALL` and `SYSRET` instructions. In the next step we check 20th bit in the `edi`, remember that this register stores result of the `cpuid` (see above). If `20` bit is set (`NX` bit) we just write `EFER_SCE` to the model specific register. 
+We will not see all the fields in detail here, but we will learn about this and other `MSRs` in a special part about them. As we read `EFER` into `edx:eax`, we check the `_EFER_SCE` or zero bit, which is `System Call Extensions`, with the `btsl` instruction and set it to one. By setting the `SCE` bit we enable the `SYSCALL` and `SYSRET` instructions. In the next step we check the 20th bit in `edi`; remember that this register stores the result of `cpuid` (see above). If bit `20` is set (the `NX` bit) we just write `EFER_SCE` to the model specific register.
 
 ```assembly
 	btsl	$_EFER_SCE, %eax
@@ -320,15 +320,113 @@ We will not see all fields in details here, but we will learn about this and oth
 1:	wrmsr
 ```
 
-If `NX` bit is supported we enable `_EFER_NX`  and write it too, with the `wrmsr` instruction.
+If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX`  and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely:
+
+* `X86_CR0_PE` - system is in protected mode;
+* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
+* `X86_CR0_ET` - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;
+* `X86_CR0_NE` - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;
+* `X86_CR0_WP` - when set, the CPU can't write to read-only pages when privilege level is 0;
+* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
+* `X86_CR0_PG` - enable paging.
+
+by executing the following assembly code:
+
+```assembly
+#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
+			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
+			 X86_CR0_PG)
+movl	$CR0_STATE, %eax
+movq	%rax, %cr0
+```
+
+We already know that to run any code, and even more so [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to set up a stack. As always, we do it by setting the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:
+
+```assembly
+movq stack_start(%rip), %rsp
+pushq $0
+popfq
+```
+
+The most interesting thing here is `stack_start`. It is defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:
+
+```assembly
+GLOBAL(stack_start)
+.quad  init_thread_union+THREAD_SIZE-8
+```
+
+The `GLOBAL` macro is already familiar to us. It is defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/linkage.h) header file and expands to a global symbol definition:
+
+```C
+#define GLOBAL(name)    \
+         .globl name;           \
+         name:
+```
+
+The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on the value of the `KASAN_STACK_ORDER` macro:
+
+```C
+#define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
+#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
+```
+
+We consider the case when [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and `PAGE_SIZE` is `4096` bytes. So `THREAD_SIZE` expands to `16` kilobytes and represents the size of the stack of a thread. Why `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have a parent [process](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and a child process have different stacks, and a new kernel stack is allocated for each new process. In the Linux kernel this stack is represented by a [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
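+
+As a quick sanity check of this arithmetic (a plain user-space snippet with the values assumed above - kasan disabled, `4096`-byte pages):
+
+```C
+#include <stdio.h>
+
+#define KASAN_STACK_ORDER 0                       /* kasan disabled */
+#define PAGE_SIZE 4096UL
+#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
+#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
+
+int main(void) {
+	printf("%lu KB\n", THREAD_SIZE / 1024);   /* prints: 16 KB */
+	return 0;
+}
+```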
+
+And as we can see, `init_thread_union` is represented by the `thread_union` union, which is defined as:
+
+```C
+union thread_union {
+         struct thread_info thread_info;
+         unsigned long stack[THREAD_SIZE/sizeof(long)];
+};
+```
+
+and `init_thread_union` looks like:
+
+```C
+union thread_union init_thread_union __init_task_data =
+	{ INIT_THREAD_INFO(init_task) };
+```
+
+Where the `INIT_THREAD_INFO` macro takes the `task_struct` structure, which represents a process descriptor in the Linux kernel, and does some basic initialization of the `thread_info` structure:
+
+```C
+#define INIT_THREAD_INFO(tsk)		\
+{                                               \
+	.task		= &tsk,                         \
+	.flags		= 0,                            \
+	.cpu		= 0,                            \
+	.addr_limit	= KERNEL_DS,                    \
+}
+```
+
+So, the `thread_union` contains low-level information about a process together with the process's stack, and is placed at the bottom of the stack:
+
+```
++-----------------------+
+|                       |
+|                       |
+|                       |
+|     Kernel stack      |
+|                       |
+|                       |
+|                       |
+|-----------------------|
+|                       |
+|  struct thread_info   |
+|                       |
++-----------------------+
+```
+
+Note that we reserve `8` bytes at the top of the stack. This is necessary to guard against illegal access of the next page of memory.
 
-In the next step we need to update Global Descriptor table with `lgdt` instruction:
+After the early boot stack is set, we need to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction:
 
 ```assembly
 lgdt	early_gdt_descr(%rip)
 ```
 
-where Global Descriptor table defined as:
+where the `early_gdt_descr` is defined as:
 
 ```assembly
 early_gdt_descr:
@@ -337,13 +435,13 @@ early_gdt_descr_base:
 	.quad	INIT_PER_CPU_VAR(gdt_page)
 ```
 
-We need to reload Global Descriptor Table because now kernel works in the userspace addresses, but soon kernel will work in it's own space. Now let's look on `early_gdt_descr` definition. Global Descriptor Table contains 32 entries:
+We need to reload the `Global Descriptor Table` because now the kernel works in the low userspace addresses, but soon the kernel will work in its own space. Now let's look at the definition of `early_gdt_descr`. The Global Descriptor Table contains `32` entries:
 
 ```C
 #define GDT_ENTRIES 32
 ```
 
-for kernel code, data, thread local storage segments and etc... it's simple. Now let's look on the `early_gdt_descr_base`. First of `gdt_page` defined as:
+for kernel code, data, thread local storage segments, etc. It's simple. Now let's look at the `early_gdt_descr_base`. First of all, `gdt_page` is defined as:
 
 ```C
 struct gdt_page {
@@ -351,7 +449,7 @@ struct gdt_page {
 } __attribute__((aligned(PAGE_SIZE)));
 ```
 
-in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structures which defined as:
+in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field, `gdt`, which is an array of `desc_struct` structures, defined as:
 
 ```C
 struct desc_struct {
@@ -370,13 +468,13 @@ struct desc_struct {
  } __attribute__((packed));
 ```
 
-and presents familiar to us GDT descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is 4096 bytes. It means that `gdt` will occupy one page. Now let's try to understand what is it `INIT_PER_CPU_VAR`. `INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter:
+and presents the `GDT` descriptor which is familiar to us. Also we can note that the `gdt_page` structure is aligned to `PAGE_SIZE`, which is `4096` bytes. It means that `gdt` will occupy one page. Now let's try to understand what `INIT_PER_CPU_VAR` is. `INIT_PER_CPU_VAR` is a macro which is defined in [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concatenates `init_per_cpu__` with the given parameter:
 
 ```C
 #define INIT_PER_CPU_VAR(var) init_per_cpu__##var
 ```
 
-After this we have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
+After the `INIT_PER_CPU_VAR` macro is expanded, we will have `init_per_cpu__gdt_page`. We can see it in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
 
 ```
 #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
@@ -385,7 +483,7 @@ INIT_PER_CPU(gdt_page);
 
 As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and the `INIT_PER_CPU` macro from the linker script is expanded, we will get the offset from `__per_cpu_load`. After these calculations, we will have the correct base address of the new GDT.
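+
+The `##` token pasting that builds these `init_per_cpu__*` names can be illustrated with a tiny user-space example (the variable and its value are invented purely for the demo):
+
+```C
+#include <stdio.h>
+
+#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
+
+int init_per_cpu__gdt_page = 42;  /* hypothetical variable, just for the demo */
+
+int main(void) {
+	/* INIT_PER_CPU_VAR(gdt_page) expands to the identifier init_per_cpu__gdt_page */
+	printf("%d\n", INIT_PER_CPU_VAR(gdt_page));  /* prints 42 */
+	return 0;
+}
+```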
 
-Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create `per-CPU` variable, each CPU will have will have it's own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.
+Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. There are many advantages for variables of this type, like the absence of locks, because each CPU works with its own copy of the variable, etc. So every core of a multiprocessor system will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread running on that core. You can read about `per-CPU` variables in detail in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.
 
 As we loaded new Global Descriptor Table, we reload segments as we did it every time:
 
@@ -398,7 +496,7 @@ As we loaded new Global Descriptor Table, we reload segments as we did it every
 	movl %eax,%gs
 ```
 
-After all of these steps we set up `gs` register that it post to the `irqstack` (we will see information about it in the next parts):
+After all of these steps we set up the `gs` register so that it points to the `irqstack`, which represents a special stack where [interrupts](https://en.wikipedia.org/wiki/Interrupt) will be handled:
 
 ```assembly
 	movl	$MSR_GS_BASE,%ecx
@@ -413,7 +511,7 @@ where `MSR_GS_BASE` is:
 #define MSR_GS_BASE             0xc0000101
 ```
 
-We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressation in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to Model specific registers. So we can see above `0xc0000101` is a `gs.base` MSR address. 
+We need to put `MSR_GS_BASE` in the `ecx` register and load the data from `eax` and `edx` (which point to `initial_gs`) with the `wrmsr` instruction. We don't use the `cs`, `fs`, `ds` and `ss` segment registers for addressing in 64-bit mode, but the `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw in real mode for `cs`) and this part contains a descriptor which is mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above that `0xc0000101` is the `gs.base` MSR address. When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurs, there is no kernel stack at the entry point, so the value of `MSR_GS_BASE` will store the address of the interrupt stack.
 
 In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with:
 
@@ -425,10 +523,9 @@ In the next step we put the address of the real mode bootparam structure to the
 	lretq
 ```
 
-Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` defined in the same source code file and looks:
+Here we put the address of `initial_code` in `rax` and push a fake address, `__KERNEL_CS` and the address of `initial_code` onto the stack. After this we can see the `lretq` instruction, which means that the return address will be extracted from the stack (it is now the address of `initial_code`) and jumped to. `initial_code` is defined in the same source code file and looks like:
 
 ```assembly
-	__REFDATA
 	.balign	8
 	GLOBAL(initial_code)
 	.quad	x86_64_start_kernel
@@ -437,7 +534,7 @@ Here we put the address of the `initial_code` to the `rax` and push fake address
 	...
 ```
 
-As we can see `initial_code` contains address of the `x86_64_start_kernel`, which defined in the [arch/x86/kerne/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:
+As we can see, `initial_code` contains the address of `x86_64_start_kernel`, which is defined in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:
 
 ```C
 asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
@@ -475,7 +572,7 @@ There are checks for different things like virtual addresses of modules space is
 #define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
 ```
 
-Let's try to understand this trick works. Let's take for example first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!conditions` is the same that `condition != 0`. So it means if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in the `!!(condition)` or zero if not. After `2*!!(condition)` we will get or `2` or `0`. In the end of calculations we can get two different behaviors:
+Let's try to understand how this trick works. Let's take for example the first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!(condition)` is the same as `condition != 0`. So it means that if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in `!!(condition)`, or zero if not. After `2*!!(condition)` we will get either `2` or `0`. At the end of the calculation we can get two different behaviors (a minimal user-space sketch of the same trick follows the list):
 
 * We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map` will be in our case);
 * No compilation errors.
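+
+Here is that sketch - a user-space re-creation of the check (the `MY_BUILD_BUG_ON` name is invented; the kernel's macro is the `BUILD_BUG_ON` shown above):
+
+```C
+#include <stdio.h>
+
+#define MY_BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
+
+int main(void) {
+	MY_BUILD_BUG_ON(sizeof(int) > sizeof(long));  /* false -> char[1], compiles fine */
+	/* MY_BUILD_BUG_ON(1); */                     /* true -> char[-1], compilation error */
+	printf("compiled fine\n");
+	return 0;
+}
+```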
@@ -493,24 +590,24 @@ next_early_pgt = 0;
 write_cr3(__pa_nodebug(early_level4_pgt));
 ```
 
-soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_level4_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to:
+Soon we will build new page tables. Here we can see that we go through all the Page Global Directory entries (`PTRS_PER_PGD` is `512`) in the loop and make them zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write the physical address of the `early_level4_pgt` to `cr3`. `__pa_nodebug` is a macro which will be expanded to:
 
 ```C
 ((unsigned long)(x) - __START_KERNEL_map + phys_base)
 ```
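
To make the translation concrete, here is a small standalone sketch with example values (`__START_KERNEL_map` is `0xffffffff80000000` on `x86_64`; the `phys_base` and the page table address below are made up for this example):

```C
#include <stdio.h>

#define __START_KERNEL_map 0xffffffff80000000UL

/* assumed to be zero only for this example; the real value is set at boot */
static unsigned long phys_base;

#define __pa_nodebug(x) ((unsigned long)(x) - __START_KERNEL_map + phys_base)

int main(void)
{
	/* pretend this is the kernel virtual address of early_level4_pgt */
	unsigned long early_level4_pgt = 0xffffffff81e00000UL;

	/* 0xffffffff81e00000 - 0xffffffff80000000 + 0 = 0x1e00000 */
	printf("physical address: 0x%lx\n", __pa_nodebug(early_level4_pgt));
	return 0;
}
```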
 
-After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big theme so we will see it in the next part.
+After this we clear the `.bss` section from `__bss_start` to `__bss_stop` and the next step will be the setup of the early `IDT` handlers, but it is a big topic, so we will see it in the next part.
 
 Conclusion
 --------------------------------------------------------------------------------
 
 This is the end of the first part about linux kernel initialization.
 
-If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new).
+If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
 
-In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and many many more.
+In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and a lot more.
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-insides).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 32 - 32
Initialization/linux-initialization-10.md

@@ -4,7 +4,7 @@ Kernel initialization. Part 10.
 End of the linux kernel initialization process
 ================================================================================
 
-This is tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish with it.
+This is the tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it.
 
 After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we can see the following code:
 
@@ -14,7 +14,7 @@ After the call of the `acpi_early_init` function from the [init/main.c](https://
 #endif
 ```
 
-Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of `31:16` bits of the `esp` register during returning to 16-bit stack. First of all we install `espfix` page upper directory into the kernel page directory in the `init_espfix_bs`:
+Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of bits `31:16` of the `esp` register when returning to a 16-bit stack. First of all we install the `espfix` page upper directory into the kernel page directory in the `init_espfix_bsp`:
 
 ```C
 pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
@@ -29,7 +29,7 @@ Where `ESPFIX_BASE_ADDR` is:
 #define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
 ```
 
-Also we can find it in the [Documentation/arch/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
+Also we can find it in the [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
 
 ```
 ... unused hole ...
@@ -37,7 +37,7 @@ ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ```
 
-After we've filled page global directory with the `espfix` pud, the next step is call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` the current CPU. After the `init_espfix_bsp` finished to work, we can see the call of the `thread_info_cache_init` function which defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates cache for the `thread_info` if its size is less than `PAGE_SIZE`:
+After we've filled the page global directory with the `espfix` pud, the next step is the call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` for the current CPU. After the `init_espfix_bsp` has finished its work, we can see the call of the `thread_info_cache_init` function which is defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates cache for the `thread_info` if `THREAD_SIZE` is less than `PAGE_SIZE`:
 
 ```C
 # if THREAD_SIZE >= PAGE_SIZE
@@ -56,7 +56,7 @@ void thread_info_cache_init(void)
 #endif
 ```
 
-As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates space for the credentials (like `uid`, `gid` and etc...):
+As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates cache for the credentials (like `uid`, `gid`, etc.):
 
 ```C
 void __init cred_init(void)
@@ -66,7 +66,7 @@ void __init cred_init(void)
 }
 ```
 
-more about credentials you can read in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). Next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates space for the `task_struct`. Let's look on the implementation of the `fork_init`. First of all we can see definitions of the `ARCH_MIN_TASKALIGN` macro and creation of a slab where task_structs will be allocated:
+You can read more about credentials in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). The next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates cache for the `task_struct`. Let's look at the implementation of the `fork_init`. First of all we can see the definition of the `ARCH_MIN_TASKALIGN` macro and the creation of a slab where task_structs will be allocated:
 
 ```C
 #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
@@ -97,7 +97,7 @@ void arch_task_cache_init(void)
 }
 ```
 
-The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates space for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate default maximum number of threads with the:
+The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates cache for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate default maximum number of threads with the:
 
 ```C
 set_max_threads(MAX_THREADS);
@@ -110,7 +110,7 @@ where default maximum number of threads is:
 #define MAX_THREADS     FUTEX_TID_MASK
 ```
 
-In the end of the `fork_init` function we initalize [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler:
+In the end of the `fork_init` function we initialize [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler:
 
 ```C
 init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
@@ -128,7 +128,7 @@ struct rlimit {
 };
 ```
 
-structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resource is the `RLIMIT_NPROC` which is the maximum number of process that use can own and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see it in the:
+structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resource is the `RLIMIT_NPROC` which is the maximum number of processes that a user can own and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see them with:
 
 ```C
 cat /proc/self/limits
@@ -168,7 +168,7 @@ After this we allocate `SLAB` cache for the important `vm_area_struct` which use
 vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 ```
 
-Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. This macro defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to the `kmem_cache_create` call:
+Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. This macro is defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to the `kmem_cache_create` call:
 
 ```C
 #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
@@ -176,16 +176,16 @@ Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. Th
                 (__flags), NULL)
 ```
 
-The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look on `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` to the size of the given structure, but `kmem_cache_create` uses given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initalizes virtual memory area `SLAB` and the second function initializes `SLAB` for namespaces.
+The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look at the `__alignof__` operator. The `KMEM_CACHE` macro aligns the `SLAB` to the alignment of the given structure, while `kmem_cache_create` uses the given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes the virtual memory area `SLAB` and the second function initializes the `SLAB` for namespaces.
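
To illustrate the difference, here is a rough sketch for a hypothetical structure (the structure, cache and function names are made up; the flags follow the `vm_area_struct` example above):

```C
#include <linux/slab.h>

struct my_entry {
	unsigned long	key;
	void		*data;
};

static struct kmem_cache *my_entry_cachep;

static void __init my_entry_cache_init(void)
{
	/* with the KMEM_CACHE macro the name, size and alignment are all
	 * derived from the structure itself
	 */
	my_entry_cachep = KMEM_CACHE(my_entry, SLAB_PANIC);

	/* a roughly equivalent open-coded call, where the alignment is
	 * whatever value we pass explicitly
	 */
	my_entry_cachep = kmem_cache_create("my_entry",
					    sizeof(struct my_entry),
					    __alignof__(struct my_entry),
					    SLAB_PANIC, NULL);
}
```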
 
-The next function after the `proc_caches_init` is `buffer_init`. This function defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocate cache for the `buffer_head`. The `buffer_head` is a special structure which defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and used for managing buffers. In the start of the `bufer_init` function we allocate cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did it in the previous functions. And calcuate the maximum size of the buffers in memory with:
+The next function after the `proc_caches_init` is `buffer_init`. This function is defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocates a cache for the `buffer_head`. The `buffer_head` is a special structure which is defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and is used for managing buffers. In the start of the `buffer_init` function we allocate a cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did in the previous functions. And calculate the maximum size of the buffers in memory with:
 
 ```C
 nrpages = (nr_free_buffer_pages() * 10) / 100;
 max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
 ```
 
-which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points and etc... More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to when set.
+which will be equal to `10%` of `ZONE_NORMAL` (all RAM above `4GB` on `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in a separate part. After this we can see the `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represent queues of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after being loaded into memory.
 
 Creation of the root for the procfs
 --------------------------------------------------------------------------------
@@ -198,7 +198,7 @@ err = register_filesystem(&proc_fs_type);
                 return;
 ```
 
-As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in the our system, we call the `proc_self_init` function from the TO[fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c) and this function allocates `inode` number for the `self` (`/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which setups the `/proc/thread-self` directory which contains information about current thread. After this we create `/proc/self/mounts` symllink which will contains mount points with the call of the
+As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in our system, we call the `proc_self_init` function from the [fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c) and this function allocates an `inode` number for the `self` (the `/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which sets up the `/proc/thread-self` directory which contains information about the current thread. After this we create the `/proc/self/mounts` symlink which will contain mount points, with the call of the
 
 ```C
 proc_symlink("mounts", NULL, "self/mounts");
@@ -230,14 +230,14 @@ and a couple of directories depends on the different configuration options:
 
 In the end of the `proc_root_init` we call the `proc_sys_init` function which creates `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl).
 
-It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I missed it, because they are not so important for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize diferent security stuff, `check_bugs` - makes fix up of the some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem and etc... Many of these parts and subsystems will be described in the other chapters.
+It is the end of the `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I skipped them, because they are not important for the generic kernel initialization stuff and depend only on different kernel configurations. They are `taskstats_init_early` which exports per-task statistics to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize different security stuff, `check_bugs` - fixes some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem, etc. Many of these parts and subsystems will be described in the other chapters.
 
-That's all. Finally we passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
+That's all. Finally we have passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
 
 First steps after the start_kernel
 --------------------------------------------------------------------------------
 
-The `rest_init` function defined in the same source code file as `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). In the beginning of the `rest_init` we can see call of the two following functions:
+The `rest_init` function is defined in the same source code file as `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). In the beginning of the `rest_init` we can see call of the two following functions:
 
 ```C
 	rcu_scheduler_starting();
@@ -257,14 +257,14 @@ Here the `kernel_thread` function (defined in the [kernel/fork.c](https://github
 * Parameter for the `kernel_init` function;
 * Flags.
 
-We will not dive into details about `kernel_thread` implementation (we will see it in the chapter which will describe scheduler, just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create new kernel thread with `kernel_thread` function, parent and child of the thread will use shared information about a filesystem and it will start to execute `kernel_init` function. A kernel thread differs from an user thread that it runs in a kernel mode. So with these two `kernel_thread` calls we create two new kernel threads with the `PID = 1` for `init` process and `PID = 2` for `kthread`. We already know what is `init` process. Let's look on the `kthread`. It is special kernel thread which allows to `init` and different parts of the kernel to create another kernel threads. We can see it in the output of the `ps` util:
+We will not dive into details about the `kernel_thread` implementation (we will see it in the chapter which describes the scheduler; for now we just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create a new kernel thread with the `kernel_thread` function, the parent and child of the thread will share information about the filesystem and it will start to execute the `kernel_init` function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads: one with `PID = 1` for the `init` process and one with `PID = 2` for `kthreadd`. We already know what the `init` process is. Let's look at `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel to create other kernel threads. We can see it in the output of the `ps` util:
 
 ```C
-$ ps -ef | grep kthradd
-alex     12866  4767  0 18:26 pts/0    00:00:00 grep kthradd
+$ ps -ef | grep kthread
+root         2     0  0 Jan11 ?        00:00:00 [kthreadd]
 ```
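
For reference, a rough sketch of how `rest_init` spawns these two threads (loosely based on [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c); the exact clone flags are an assumption and differ between kernel versions):

```C
int pid;

/* PID = 1: will become the init process after it executes kernel_init() */
kernel_thread(kernel_init, NULL, CLONE_FS);

/* PID = 2: kthreadd, which creates kernel threads on behalf of other code */
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
```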
 
-Let's postpone `kernel_init` and `kthreadd` for now and will go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code:
+Let's postpone `kernel_init` and `kthreadd` for now and go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code:
 
 ```C
 	rcu_read_lock();
@@ -272,7 +272,7 @@ Let's postpone `kernel_init` and `kthreadd` for now and will go ahead in the `re
 	rcu_read_unlock();
 ```
     
-The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for the `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread`). In the next step we call `complete` function
+The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns the pointer to the `task_struct` for the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread`). In the next step we call the `complete` function:
 
 ```C
 complete(&kthreadd_done);
@@ -291,7 +291,7 @@ where `DECLARE_COMPLETION` macro defined as:
          struct completion work = COMPLETION_INITIALIZER(work)
 ```
 
-and expands to the definition of the `completion` structure. This structure defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and presents `completions` concept. Completions are a code synchronization mechanism which is provide race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the `complete` structure and we did it with the `DECLARE_COMPLETION`. The second is call of the `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`:
+and expands to the definition of the `completion` structure. This structure is defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and presents the `completions` concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some other thread to reach a certain point or a specific state. Using completions consists of three parts: the first is the definition of a `completion` structure, which we did with the `DECLARE_COMPLETION`; the second is a call of the `wait_for_completion` - after this call the calling thread will not continue to execute until another thread calls the `complete` function, which is the third part. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`:
 
 ```C
 wait_for_completion(&kthreadd_done);
@@ -314,7 +314,7 @@ void init_idle_bootup_task(struct task_struct *idle)
 }
 ```
 
-where `idle` class is a low priority tasks and tasks can be run only when the processor doesn't have to run anything besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is usage of the idle CPU cycles. When there are no one process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work and checks that there is not active task to switch: 
+where `idle` class is a low task priority and tasks can be run only when the processor doesn't have anything to run besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work but just checks if there is an active task to switch to: 
 
 ```C
 static void cpu_idle_loop(void)
@@ -338,7 +338,7 @@ More about it will be in the chapter about scheduler. So for this moment the `st
 wait_for_completion(&kthreadd_done);
 ```
 
-After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that already system is running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialization of the `SMP` with the `smp_init` and initialization of the [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize scheduler with the `sched_init_smp`.
+After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that system is already running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialize `SMP` with the `smp_init` and initialize [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize scheduler with the `sched_init_smp`.
 
 After this we can see the call of the following functions - `do_basic_setup`. Before we will call the `do_basic_setup` function, our kernel already initialized for this moment. As comment says:
 
@@ -346,7 +346,7 @@ After this we can see the call of the following functions - `do_basic_setup`. Be
 Now we can finally start doing some real work..
 ```
 
-The `do_basic_setup` will reinitialize [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialization of the `khelper` - which is a kernel thread which used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize `drivers` subsystem, enable the user-mode helper `workqueue`  and make post-early call of the `initcalls`. We can see openinng of the `dev/console` and dup twice file descriptors from `0` to `2` after the `do_basic_setup`:
+The `do_basic_setup` will reinitialize [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialize the `khelper` - which is a kernel thread used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize the `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early calls of the `initcalls`. After the `do_basic_setup` we can see the opening of `/dev/console` and two `dup` calls, so that the file descriptors `0`, `1` and `2` all refer to it:
 
 
 ```C
@@ -406,7 +406,7 @@ return do_execve(getname_kernel(init_filename),
 	(const char __user *const __user *)envp_init);
 ```
 
-The `do_execve` function defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs program with the given file name and arguments. If we did not pass `rdinit=` option to the kernel command line, kernel starts to check the `execute_command` which is equal to value of the `init=` kernel command line parameter:
+The `do_execve` function is declared in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs a program with the given file name and arguments. If we did not pass the `rdinit=` option to the kernel command line, the kernel starts to check the `execute_command` which is equal to the value of the `init=` kernel command line parameter:
 
 ```C
 	if (execute_command) {
@@ -418,7 +418,7 @@ The `do_execve` function defined in the [include/linux/sched.h](https://github.c
 	}
 ```
 
-If we did not pass `init=` kernel command line parameter too, kernel tries to run one of the following executable files: 
+If we did not pass `init=` kernel command line parameter either, kernel tries to run one of the following executable files: 
 
 ```C
 if (!try_to_run_init_process("/sbin/init") ||
@@ -428,7 +428,7 @@ if (!try_to_run_init_process("/sbin/init") ||
 	return 0;
 ```
 
-In other way we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic):
+Otherwise we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic):
 
 ```C
 panic("No working init found.  Try passing init= option to kernel. "
@@ -440,11 +440,11 @@ That's all! Linux kernel initialization process is finished!
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). And it is not only `tenth` part, but this is the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I missed details about different subsystem of the kernel, for example I almost did not cover linux kernel scheduler or we did not see almost anything about interrupts and exceptions handling and etc... From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting.
+It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but also the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in our system. I skipped details about different subsystems of the kernel, for example I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 71 - 44
Initialization/linux-initialization-2.md

@@ -4,13 +4,15 @@ Kernel initialization. Part 2.
 Early interrupt and exception handling
 --------------------------------------------------------------------------------
 
-In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. We continue in this part and will know more about interrupt and exception handling.
+In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting up the early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish the early preparation before the main kernel code starts to work.
+
+We already started this preparation in the previous [first](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will learn more about interrupt and exception handling.
 
 Remember that we stopped before following loop:
 
 ```C
- 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
-		set_intr_gate(i, early_idt_handlers[i]);
+for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
+	set_intr_gate(i, early_idt_handler_array[i]);
 ```
 
 from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) source code file. But before we start to sort out this code, we need to know about interrupts and handlers.
@@ -18,7 +20,7 @@ from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/maste
 Some theory
 --------------------------------------------------------------------------------
 
-Interrupt is an event caused by software or hardware to the CPU. On interrupt, CPU stops the current task and transfer control to the interrupt handler, which handles interruption and transfer control back to the previously stopped task. We can split interrupts on three types:
+An interrupt is an event caused by software or hardware to the CPU, for example a user has pressed a key on the keyboard. On an interrupt, the CPU stops the current task and transfers control to a special routine which is called an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). An interrupt handler handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:
 
 * Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
 * Hardware interrupts - when a hardware event happens, for example a button is pressed on a keyboard;
@@ -34,7 +36,7 @@ CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will
 
 ```
 ----------------------------------------------------------------------------------------------
-|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
+|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
 ----------------------------------------------------------------------------------------------
 |0     | #DE    |Divide Error        |Fault|NO        |DIV and IDIV                          |
 |---------------------------------------------------------------------------------------------
@@ -52,7 +54,7 @@ CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will
 |---------------------------------------------------------------------------------------------
 |7     | #NM    |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
 |---------------------------------------------------------------------------------------------
-|8     | #DF    |Double Fault        |Abort|YES       |Ant instrctions which can generate NMI|
+|8     | #DF    |Double Fault        |Abort|YES       |An instruction which can generate NMI |
 |---------------------------------------------------------------------------------------------
 |9     | ---    |Reserved            |Fault|NO        |                                      |
 |---------------------------------------------------------------------------------------------
@@ -82,7 +84,7 @@ CPU uses vector number as an index in the `Interrupt Descriptor Table` (we will
 ----------------------------------------------------------------------------------------------
 ```
 
-To react on interrupt CPU uses special structure - Interrupt Descriptor Table or IDT. IDT is an array of 8-byte descriptors like Global Descriptor Table, but IDT entries are called `gates`. CPU multiplies vector number on 8 to find index of the IDT entry. But in 64-bit mode IDT is an array of 16-byte descriptors and CPU multiplies vector number on 16 to find index of the entry in the IDT. We remember from the previous part that CPU uses special `GDTR` register to locate Global Descriptor Table, so CPU uses special register `IDTR` for Interrupt Descriptor Table and `lidt` instruuction for loading base address of the table into this register.
+To react to an interrupt the CPU uses a special structure - the Interrupt Descriptor Table or IDT. The IDT is an array of 8-byte descriptors like the Global Descriptor Table, but IDT entries are called `gates`. The CPU multiplies the vector number by 8 to find the index of the IDT entry. But in 64-bit mode the IDT is an array of 16-byte descriptors and the CPU multiplies the vector number by 16 to find the index of the entry in the IDT. We remember from the previous part that the CPU uses the special `GDTR` register to locate the Global Descriptor Table, so the CPU uses the special register `IDTR` for the Interrupt Descriptor Table and the `lidt` instruction for loading the base address of the table into this register.
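
As a tiny worked example of this indexing (the base address is made up):

```C
#include <stdio.h>

int main(void)
{
	/* a made-up IDT base address, as it would be loaded into IDTR by lidt */
	unsigned long idt_base = 0xffffffffff578000UL;
	unsigned int vector = 14;	/* #PF - Page Fault */

	/* in 64-bit mode every descriptor is 16 bytes wide */
	unsigned long desc_addr = idt_base + vector * 16;

	printf("descriptor for vector %u lives at 0x%lx\n", vector, desc_addr);
	return 0;
}
```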
 
 64-bit mode IDT entry has following structure:
 
@@ -115,11 +117,11 @@ To react on interrupt CPU uses special structure - Interrupt Descriptor Table or
 
 Where:
 
-* Offset - is offset to entry point of an interrupt handler;
-* DPL -    Descriptor Privilege Level;
-* P -      Segment Present flag;
-* Segment selector - a code segment selector in GDT or LDT
-* IST -    provides ability to switch to a new stack for interrupts handling.
+* `Offset` - is offset to entry point of an interrupt handler;
+* `DPL` -    Descriptor Privilege Level;
+* `P` -      Segment Present flag;
+* `Segment selector` - a code segment selector in GDT or LDT
+* `IST` -    provides ability to switch to a new stack for interrupts handling.
 
 And the last `Type` field describes type of the `IDT` entry. There are three different kinds of handlers for interrupts:
 
@@ -129,9 +131,7 @@ And the last `Type` field describes type of the `IDT` entry. There are three dif
 
 Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how the CPU handles the `IF` flag. If the interrupt handler was accessed through an interrupt gate, the CPU clears the `IF` flag to prevent other interrupts while the current interrupt handler executes. After the current interrupt handler finishes, the CPU sets the `IF` flag again with the `iret` instruction.
 
-Other bits reserved and must be 0.
-
-Now let's look how CPU handles interrupts:
+Other bits in the interrupt gate are reserved and must be 0. Now let's look at how the CPU handles interrupts:
 
 * The CPU saves the flags register, `CS`, and the instruction pointer on the stack.
 * If the interrupt causes an error code (like `#PF` for example), the CPU saves the error code on the stack after the instruction pointer;
@@ -146,26 +146,31 @@ We stopped at the following point:
 
 ```C
 for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
-	set_intr_gate(i, early_idt_handlers[i]);
+	set_intr_gate(i, early_idt_handler_array[i]);
 ```
 
 Here we call `set_intr_gate` in the loop, which takes two parameters:
 
-* Number of an interrupt;
+* Number of an interrupt or `vector number`;
 * Address of the idt handler.
 
-and inserts an interrupt gate in the nth `IDT` entry. First of all let's look on the `early_idt_handlers`. It is an array which contains address of the first 32 interrupt handlers:
+and inserts an interrupt gate into the `IDT` table which is referenced by the `idt_descr` descriptor. First of all let's look at the `early_idt_handler_array` array. It is an array which is defined in the [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) header file and contains the addresses of the first `32` exception handlers:
 
 ```C
-extern const char early_idt_handlers[NUM_EXCEPTION_VECTORS][2+2+5];
+#define EARLY_IDT_HANDLER_SIZE   9
+#define NUM_EXCEPTION_VECTORS	32
+
+extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
 ```
 
-We're filling only first 32 IDT entries because all of the early setup runs with interrupts disabled, so there is no need to set up early exception handlers for vectors greater than 32. `early_idt_handlers` contains generic idt handlers and we can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), we will look it soon.
+The `early_idt_handler_array` is a `288` byte array which contains the exception entry points every nine bytes. Every nine bytes of this array consist of an optional two-byte instruction for pushing a dummy error code if an exception does not provide one, a two-byte instruction for pushing the vector number onto the stack and five bytes of a `jump` to the common exception handler code.
+
+As we can see, we're filling only the first `32` `IDT` entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors greater than `32`. The `early_idt_handler_array` array contains generic idt handlers and we can find its definition in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file. For now we will skip it, but will look at it soon. Before this, let's look at the implementation of the `set_intr_gate` macro.
 
-Now let's look on `set_intr_gate` implementation:
+The `set_intr_gate` macro is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) header file and looks like this:
 
 ```C
-#define set_intr_gate(n, addr)                                           \
+#define set_intr_gate(n, addr)                         \
          do {                                                            \
                  BUG_ON((unsigned)n > 0xFF);                             \
                  _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,        \
@@ -175,8 +180,7 @@ Now let's look on `set_intr_gate` implementation:
          } while (0)
 ```
 
-First of all it checks with that passed interrupt number is not greater than `255` with `BUG_ON` macro. We need to do this check because we can have only 256 interrupts. After this it calls `_set_gate` which writes address of an interrupt gate to the `IDT`:
-
+First of all it checks that the passed interrupt number is not greater than `255` with the `BUG_ON` macro. We need this check because we can have only `256` interrupts. After this, it makes a call of the `_set_gate` function which writes the address of an interrupt gate to the `IDT`:
 
 ```C
 static inline void _set_gate(int gate, unsigned type, void *addr,
@@ -208,7 +212,7 @@ static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
 }
 ```
 
-As mentioned above we fill gate descriptor in this function. We fill three parts of the address of the interrupt handler with the address which we got in the main loop (address of the interrupt handler entry point). We are using three following macro to split address on three parts:
+As I mentioned above, we fill the gate descriptor in this function. We fill three parts of it with the address of the interrupt handler which we got in the main loop (the address of the interrupt handler entry point). We are using the following three macros to split the address into three parts:
 
 ```C
 #define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
@@ -216,7 +220,7 @@ As mentioned above we fill gate descriptor in this function. We fill three parts
 #define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
 ```
 
-With the first `PTR_LOW` macro we get the first 2 bytes of the address, with the second `PTR_MIDDLE` we get the second 2 bytes of the address and with the third `PTR_HIGH` macro we get the last 4 bytes of the address. Next we setup the segment selector for interrupt handler, it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) with zeros. And we set `GAT_INTERRUPT` type in the end. 
+With the first `PTR_LOW` macro we get the first `2` bytes of the address, with the second `PTR_MIDDLE` we get the second `2` bytes of the address and with the third `PTR_HIGH` macro we get the last `4` bytes of the address. Next we set up the segment selector for the interrupt handler; it will be our kernel code segment - `__KERNEL_CS`. In the next step we fill the `Interrupt Stack Table` and `Descriptor Privilege Level` (highest privilege level) with zeros. And we set the `GATE_INTERRUPT` type in the end.
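
A small standalone sketch of the split with a made-up handler address (`PTR_LOW` and `PTR_HIGH` are copied from above; `PTR_MIDDLE` follows the same pattern):

```C
#include <stdio.h>

#define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

int main(void)
{
	/* a made-up address of an interrupt handler entry point */
	unsigned long long handler = 0xffffffff81fe5000ULL;

	printf("low:    0x%04llx\n", PTR_LOW(handler));		/* 0x5000     */
	printf("middle: 0x%04llx\n", PTR_MIDDLE(handler));	/* 0x81fe     */
	printf("high:   0x%08llx\n", PTR_HIGH(handler));	/* 0xffffffff */
	return 0;
}
```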
 
 Now we have filled IDT entry and we can call `native_write_idt_entry` function which just copies filled `IDT` entry to the `IDT`:
 
@@ -227,7 +231,7 @@ static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_
 }
 ```
 
-After that main loop will finished, we will have filled `idt_table` array of `gate_desc` structures and we can load `IDT` with:
+After the main loop has finished, we will have a filled `idt_table` array of `gate_desc` structures and we can load the `Interrupt Descriptor Table` with the call of the:
 
 ```C
 load_idt((const struct desc_ptr *)&idt_descr);
@@ -245,32 +249,52 @@ and `load_idt` just executes `lidt` instruction:
 asm volatile("lidt %0"::"m" (*dtr));
 ```
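
For context, the `idt_descr` passed to `load_idt` above is not a table itself but a small size/base pair; roughly (a sketch of the definitions in the arch/x86 headers of that era):

```C
struct desc_ptr {
	unsigned short size;	/* size of the IDT in bytes, minus one */
	unsigned long address;	/* linear base address of idt_table */
} __attribute__((packed));

struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
```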
 
-You can note that there are calls of the `_trace_*` functions in the `_set_gate` and other functions. These functions fills `IDT` gates in the same manner that `_set_gate` but with one difference. These functions use `trace_idt_table` Interrupt Descriptor Table instead of `idt_table` for tracepoints (we will cover this theme in the another part).
+You can note that there are calls of the `_trace_*` functions in the `_set_gate` and other functions. These functions fill `IDT` gates in the same manner as `_set_gate`, but with one difference: they use the `trace_idt_table` `Interrupt Descriptor Table` instead of `idt_table` for tracepoints (we will cover this theme in another part).
 
-Okay, now we have filled and loaded Interrupt Descriptor Table, we know how the CPU acts during interrupt. So now time to deal with interrupts handlers.
+Okay, now we have filled and loaded the `Interrupt Descriptor Table` and we know how the CPU acts during an interrupt. So now it is time to deal with interrupt handlers.
 
 Early interrupts handlers
 --------------------------------------------------------------------------------
 
-As you can read above, we filled `IDT` with the address of the `early_idt_handlers`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S):
+As you can read above, we filled `IDT` with the address of the `early_idt_handler_array`. We can find it in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly file:
 
 ```assembly
-	.globl early_idt_handlers
+	.globl early_idt_handler_array
 early_idt_handlers:
 	i = 0
 	.rept NUM_EXCEPTION_VECTORS
 	.if (EXCEPTION_ERRCODE_MASK >> i) & 1
-	ASM_NOP2
-	.else
 	pushq $0
 	.endif
 	pushq $i
-	jmp early_idt_handler
+	jmp early_idt_handler_common
 	i = i + 1
+	.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
 	.endr
 ```
 
-We can see here, interrupt handlers generation for the first 32 exceptions. We check here, if exception has error code then we do nothing, if exception does not return error code, we push zero to the stack. We do it for that would stack was uniform. After that we push exception number on the stack and jump on the `early_idt_handler` which is generic interrupt handler for now. As i wrote above, CPU pushes flag register, `CS` and `RIP` on the stack. So before `early_idt_handler` will be executed, stack will contain following data:
+We can see here the generation of the interrupt handlers for the first `32` exceptions. We check here: if an exception provides an error code then we do nothing, if an exception does not provide an error code, we push zero onto the stack. We do it so that the stack layout is uniform. After that we push the exception number onto the stack and jump to `early_idt_handler_common` which is the generic exception handler for now. As we may see above, every nine bytes of the `early_idt_handler_array` array consist of an optional push of an error code, a push of the `vector number` and a jump instruction. We can see it in the output of the `objdump` util:
+
+```
+$ objdump -D vmlinux
+...
+...
+...
+ffffffff81fe5000 <early_idt_handler_array>:
+ffffffff81fe5000:       6a 00                   pushq  $0x0
+ffffffff81fe5002:       6a 00                   pushq  $0x0
+ffffffff81fe5004:       e9 17 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
+ffffffff81fe5009:       6a 00                   pushq  $0x0
+ffffffff81fe500b:       6a 01                   pushq  $0x1
+ffffffff81fe500d:       e9 0e 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
+ffffffff81fe5012:       6a 00                   pushq  $0x0
+ffffffff81fe5014:       6a 02                   pushq  $0x2
+...
+...
+...
+```
+
+As I wrote above, the CPU pushes the flags register, `CS` and `RIP` onto the stack. So before `early_idt_handler_common` is executed, the stack will contain the following data:
 
 ```
 |--------------------|
@@ -281,11 +305,11 @@ We can see here, interrupt handlers generation for the first 32 exceptions. We c
 |--------------------|
 ```
 
-Now let's look on the `early_idt_handler` implementation. It locates in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343). First of all we can see check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt), we no need to handle it, so just ignore they in the `early_idt_handler`:
+Now let's look at the `early_idt_handler_common` implementation. It is located in the same [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L343) assembly file and first of all we can see a check for [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). We don't need to handle it, so we just ignore it in the `early_idt_handler_common`:
 
 ```assembly
 	cmpl $2,(%rsp)
-	je is_nmi
+	je .Lis_nmi
 ```
 
 where `is_nmi`:
@@ -296,7 +320,7 @@ is_nmi:
 	INTERRUPT_RETURN
 ```
 
-we drop error code and vector number from the stack and call `INTERRUPT_RETURN` which is just `iretq`. As we checked the vector number and it is not `NMI`, we check `early_recursion_flag` to prevent recursion in the `early_idt_handler` and if it's correct we save general registers on the stack:
+drops the error code and the vector number from the stack and calls `INTERRUPT_RETURN` which just expands to the `iretq` instruction. If the vector number is not `NMI`, we check `early_recursion_flag` to prevent recursion in the `early_idt_handler_common` and if it's ok we save the general registers on the stack:
 
 ```assembly
 	pushq %rax
@@ -310,16 +334,16 @@ we drop error code and vector number from the stack and call `INTERRUPT_RETURN`
 	pushq %r11
 ```
 
-we need to do it to prevent wrong values in it when we return from the interrupt handler. After this we check segment selector in the stack:
+We need to do this to avoid wrong register values when we return from the interrupt handler. After this we check the segment selector on the stack:
 
 ```assembly
 	cmpl $__KERNEL_CS,96(%rsp)
 	jne 11f
 ```
 
-it must be equal to the kernel code segment and if it is not we jump on label `11` which prints `PANIC` message and makes stack dump.
+which must be equal to the kernel code segment; if it is not, we jump to label `11`, which prints a `PANIC` message and makes a stack dump.
 
-After code segment was checked, we check the vector number, and if it is `#PF`, we put value from the `cr2` to the `rdi` register and call `early_make_pgtable` (well see it soon):
+After the code segment has been checked, we check the vector number, and if it is `#PF` or [Page Fault](https://en.wikipedia.org/wiki/Page_fault), we put the value from the `cr2` register into the `rdi` register and call `early_make_pgtable` (we'll see it soon):
 
 ```assembly
 	cmpl $14,72(%rsp)
@@ -351,7 +375,7 @@ It is the end of the first interrupt handler. Note that it is very early interru
 Page fault handling
 --------------------------------------------------------------------------------
 
-In the previous paragraph we saw first early interrupt handler which checks interrupt number for page fault and calls `early_make_pgtable` for building new page tables if it is. We need to have `#PF` handler in this step because there are plans to add ability to load kernel above 4G and make access to `boot_params` structure above the 4G.
+In the previous paragraph we saw the first early interrupt handler, which checks the interrupt number for a page fault and, if it is one, calls `early_make_pgtable` to build new page tables. We need to have a `#PF` handler at this step because there are plans to add the ability to load the kernel above `4G` and to access the `boot_params` structure above `4G`.
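
Before we look at the real function, here is a self-contained toy model of the underlying idea: on a page fault we walk a 4-level table, allocate any missing intermediate tables from a small pre-reserved pool and install a `2M` mapping for the faulting address. All names, flag values and sizes below are invented for illustration and are not the kernel's `early_make_pgtable`:

```C
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ENTRIES          512
#define EARLY_PGT_PAGES  8

/* a small pool of page-table pages "reserved" ahead of time,
 * page-aligned so that the low bits of an entry can carry flags */
static _Alignas(4096) uint64_t early_pgts[EARLY_PGT_PAGES][ENTRIES];
static unsigned int next_early_pgt;

static uint64_t *alloc_early_pgt(void)
{
	return memset(early_pgts[next_early_pgt++], 0, sizeof(early_pgts[0]));
}

static uint64_t *table_of(uint64_t entry)
{
	return (uint64_t *)(uintptr_t)(entry & ~0xfffULL); /* strip flag bits */
}

/* install a 2M mapping for vaddr -> paddr, creating missing levels on demand */
static void toy_make_pgtable(uint64_t *pgd, uint64_t vaddr, uint64_t paddr)
{
	unsigned int pgd_i = (vaddr >> 39) & 0x1ff;
	unsigned int pud_i = (vaddr >> 30) & 0x1ff;
	unsigned int pmd_i = (vaddr >> 21) & 0x1ff;
	uint64_t *pud, *pmd;

	if (!pgd[pgd_i])
		pgd[pgd_i] = (uint64_t)(uintptr_t)alloc_early_pgt() | 0x3; /* present | rw */
	pud = table_of(pgd[pgd_i]);

	if (!pud[pud_i])
		pud[pud_i] = (uint64_t)(uintptr_t)alloc_early_pgt() | 0x3;
	pmd = table_of(pud[pud_i]);

	pmd[pmd_i] = (paddr & ~0x1fffffULL) | 0x83; /* present | rw | 2M page */
}

int main(void)
{
	static _Alignas(4096) uint64_t pgd[ENTRIES];
	uint64_t vaddr = 0xffff880000100000ULL;

	toy_make_pgtable(pgd, vaddr, 0x100000ULL);
	printf("pgd index %u now points to a pud table\n",
	       (unsigned int)((vaddr >> 39) & 0x1ff));
	return 0;
}
```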
 
You can find the implementation of `early_make_pgtable` in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c); it takes one parameter - the address from the `cr2` register which caused the Page Fault. Let's look at it:
 
@@ -455,9 +479,9 @@ After page fault handler finished it's work and as result our `early_level4_pgt`
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the second part about linux kernel internals. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see all steps before kernel entry point - `start_kernel` function.
+This is the end of the second part about linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see all steps before the kernel entry point - the `start_kernel` function.
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------
@@ -465,4 +489,7 @@ Links
 * [GNU assembly .rept](https://sourceware.org/binutils/docs-2.23/as/Rept.html)
 * [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
 * [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
+* [Page table](https://en.wikipedia.org/wiki/Page_table)
+* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
+* [Page Fault](https://en.wikipedia.org/wiki/Page_fault)
 * [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)

+ 15 - 15
Initialization/linux-initialization-3.md

@@ -4,7 +4,7 @@ Kernel initialization. Part 3.
 Last preparations before the kernel entry point
 --------------------------------------------------------------------------------
 
-This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we will see call of the `start_kernel` function, we must do some preparations. So let's continue.
+This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling, and we will continue to dive into the linux kernel initialization process in the current part. Our next point is the 'kernel entry point' - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a particular architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.
 
 boot_params again
 --------------------------------------------------------------------------------
@@ -63,7 +63,7 @@ cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
 return cmd_line_ptr;
 ```
 
-which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check that we got `cmd_line_pty`, getting its virtual address and copy it to the `boot_command_line` which is just an array of bytes:
+which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, get its virtual address and copy it to `boot_command_line`, which is just an array of bytes:
 
 ```C
 extern char __initdata boot_command_line[];
@@ -71,18 +71,18 @@ extern char __initdata boot_command_line[];
 
 After this we will have copied kernel command line and `boot_params` structure. In the next step we can see call of the `load_ucode_bsp` function which loads processor microcode, but we will not see it here.
 
-After microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints `Kernel Alive` string. But you'll never see this output because `early_printk` is not initilized yet. It is a minor bug in the kernel and i sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code.
+After the microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) - so you will see it in the mainline soon. So you can skip this code.
 
 Move on init pages
 --------------------------------------------------------------------------------
 
-In the next step as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
+In the next step, as we have copied the `boot_params` structure, we need to move from the early page tables to the page tables for the initialization process. We already set up the early page tables for the switchover (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)), dropped all of them in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only the kernel high mapping. After this we call:
 
 ```C
 	clear_page(init_level4_pgt);
 ```
 
-function and pass `init_level4_pgt` which defined also in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks:
+function and pass `init_level4_pgt`, which is also defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks like:
 
 ```assembly
 NEXT_PAGE(init_level4_pgt)
@@ -93,7 +93,7 @@ NEXT_PAGE(init_level4_pgt)
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 ```
 
-which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss. `clear_page` function defined in the [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S) let look on this function:
+which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss. The `clear_page` function is defined in [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S); let's look at this function:
 
 ```assembly
 ENTRY(clear_page)
@@ -121,14 +121,14 @@ ENTRY(clear_page)
 	ENDPROC(clear_page)
 ```
 
-As you can understart from the function name it clears or fills with zeros page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` which are expands to GNU assembly directives:
+As you can understand from the function name, it clears or fills page tables with zeros. First of all note that this function starts with `CFI_STARTPROC` and `CFI_ENDPROC` which expand to GNU assembly directives:
 
 ```C
 #define CFI_STARTPROC           .cfi_startproc
 #define CFI_ENDPROC             .cfi_endproc
 ```
 
-and used for debugging. After `CFI_STARTPROC` macro we zero out `eax` register and put 64 to the `ecx` (it will be counter). Next we can see loop which starts with the `.Lloop` label and it starts from the `ecx` decrement. After it we put zero from the `rax` register to the `rdi` which contains the base address of the `init_level4_pgt` now and do the same procedure seven times but every time move `rdi` offset on 8. After this we will have first 64 bytes of the `init_level4_pgt` filled with zeros. In the next step we put the address of the `init_level4_pgt` with 64-bytes offset to the `rdi` again and repeat all operations which `ecx` is not zero. In the end we will have `init_level4_pgt` filled with zeros.
+and are used for debugging. After the `CFI_STARTPROC` macro we zero out the `eax` register and put 64 into `ecx` (it will be a counter). Next we can see the loop which starts with the `.Lloop` label, and it starts with the `ecx` decrement. After it we store the zero from `rax` at the address in `rdi`, which now contains the base address of `init_level4_pgt`, and do seven more stores at offsets growing by 8 each time. After this the first 64 bytes of `init_level4_pgt` will be filled with zeros. In the next step we advance `rdi` by 64 bytes and repeat these operations until `ecx` reaches zero. In the end `init_level4_pgt` will be filled with zeros.
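
If the assembly is hard to follow, here is a rough standalone C sketch of what the loop does (an illustration only, not the kernel's code):

```C
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* rough C equivalent of the clear_page loop above:
 * 64 iterations, each zeroing 64 bytes (eight 8-byte stores) */
static void clear_page_sketch(void *page)
{
	uint64_t *p = page;

	for (size_t i = 0; i < PAGE_SIZE / 64; i++) { /* ecx counts down from 64 */
		for (size_t j = 0; j < 8; j++)        /* the eight movq stores */
			p[j] = 0;
		p += 8;                               /* leaq 64(%rdi),%rdi */
	}
}

int main(void)
{
	static uint64_t page[PAGE_SIZE / sizeof(uint64_t)] = { [0] = 0xdeadbeef };

	clear_page_sketch(page);
	return (int)page[0]; /* 0 after clearing */
}
```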
 
 As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to kernel high mapping with the:
 
@@ -163,16 +163,16 @@ You can see that it is the last function before we are in the kernel entry point
 Last step before kernel entry point
 --------------------------------------------------------------------------------
 
-First of all we can see in the `x86_64_start_reservations` function check for `boot_params.hdr.version`:
+First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`:
 
 ```C
 if (!boot_params.hdr.version)
 	copy_bootdata(__va(real_mode_data));
 ```
 
-and if it is not we call again `copy_bootdata` function with the virtual address of the `real_mode_data` (read about about it's implementation).
+and if it is zero we call the `copy_bootdata` function again with the virtual address of `real_mode_data` (its implementation was described above).
 
-In the next step we can see the call of the `reserve_ebda_region` function which defined in the [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves memory block for th `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area located in the top of conventional memory and contains data about ports, disk parameters and etc...
+In the next step we can see the call of the `reserve_ebda_region` function which is defined in [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves a memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.
 
Let's look at the `reserve_ebda_region` function. It starts by checking whether paravirtualization is enabled or not:
 
@@ -324,7 +324,7 @@ struct memblock memblock __initdata_memblock = {
 };
 ```
 
-We will not dive into detail of this varaible, but we will see all details about it in the parts about memory manager. Just note that `memblock` variable defined with the `__initdata_memblock` which is:
+We will not dive into the details of this variable here, but we will see all the details about it in the parts about the memory manager. Just note that the `memblock` variable is defined with `__initdata_memblock` which is:
 
 ```C
 #define __initdata_memblock __meminitdata
@@ -344,7 +344,7 @@ After debugging lines were printed next is the call of the following function:
 memblock_add_range(_rgn, base, size, nid, flags);
 ```
 
-which adds new memory block region into the `.meminit.data` section. As we do not initlieze `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags:
+which adds a new memory block region into the `.meminit.data` section. As `_rgn` is not initialized but just contains `&memblock.reserved`, we simply fill the passed `_rgn` with the base address of the extended BIOS data area region, the size of this region and the flags:
 
 ```C
 if (type->regions[0].size == 0) {
@@ -416,11 +416,11 @@ That's all for this part.
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the third part about linux kernel internals. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process.
+It is the end of the third part about linux kernel insides. In the next part we will see the first initialization steps in the kernel entry point - the `start_kernel` function. It will be the first step before we see the launch of the first `init` process.
 
 If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 11 - 11
Initialization/linux-initialization-4.md

@@ -6,7 +6,7 @@ Kernel entry point
 
 If you have read the previous part - [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md), you can remember that we finished all pre-initialization stuff and stopped right before the call to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). The `start_kernel` is the entry of the generic and architecture independent kernel code, although we will return to the `arch/` folder many times. If you look inside of the `start_kernel` function, you will see that this function is very big. For this moment it contains about `86` calls of functions. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start to do it. This part and all the next which will be in the [Kernel initialization process](https://github.com/0xAX/linux-insides/blob/master/Initialization/README.md) chapter will cover it.
 
-The main purpose of the `start_kernel` to finish kernel initialization process and launch the first `init` process. Before the first process will be started, the `start_kernel` must do many things such as: to enable [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), to initialize processor id, to enable early [cgroups](http://en.wikipedia.org/wiki/Cgroups) subsystem, to setup per-cpu areas, to initialize different caches in [vfs](http://en.wikipedia.org/wiki/Virtual_file_system), to initialize memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps we will see the launch of the first `init` process in the last part of this chapter. So much kernel code awaits us, let's start.
+The main purpose of `start_kernel` is to finish the kernel initialization process and launch the first `init` process. Before the first process is started, `start_kernel` must do many things such as: enable the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialize the processor id, enable the early [cgroups](http://en.wikipedia.org/wiki/Cgroups) subsystem, set up per-cpu areas, initialize different caches in the [vfs](http://en.wikipedia.org/wiki/Virtual_file_system), initialize the memory manager, rcu, vmalloc, the scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first `init` process in the last part of this chapter. So much kernel code awaits us, let's start.
 
 **NOTE: All parts from this big chapter `Linux Kernel initialization process` will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.**
 
@@ -19,7 +19,7 @@ As I wrote above, the `start_kernel` function is defined in the [init/main.c](ht
 #define __init      __section(.init.text) __cold notrace
 ```
 
-After the initialization process will be finished, the kernel will release these sections with a call to the `free_initmem` function. Note also that `__init` is defined with two attributes: `__cold` and `notrace`. The purpose of the first `cold` attribute is to mark that the function is rarely used and the compiler must optimize this function for size. The second `notrace` is defined as:
+After the initialization process has finished, the kernel will release these sections with a call to the `free_initmem` function. Note also that `__init` is defined with two attributes: `__cold` and `notrace`. The purpose of the first, `cold`, is to mark that the function is rarely used and the compiler must optimize this function for size. The second, `notrace`, is defined as:
 
 ```C
 #define notrace __attribute__((no_instrument_function))
@@ -76,7 +76,7 @@ union thread_union {
 };
 ```
 
-Every process has its own stack and it is 16 killobytes or 4 page frames. in `x86_64`. We can note that it is defined as array of `unsigned long`. The next field of the `thread_union` is - `thread_info` defined as:
+Every process has its own stack and it is 16 kilobytes or 4 page frames in `x86_64`. We can note that it is defined as an array of `unsigned long`. The next field of the `thread_union` is `thread_info`, defined as:
 
 ```C
 struct thread_info {
@@ -94,7 +94,7 @@ struct thread_info {
 };
 ```
 
-and occupies 52 bytes. The `thread_info` structure contains architecture-specific information on the thread. We know that on `x86_64` the stack grows down and `thread_union.thread_info` is stored at the bottom of the stack in our case. So the process stack is 16 killobytes and `thread_info` is at the bottom. The remaining thread_size will be `16 killobytes - 62 bytes = 16332 bytes`. Note that `thread_unioun` represented as the [union](http://en.wikipedia.org/wiki/Union_type) and not structure, it means that `thread_info` and stack share the memory space.
+and occupies 52 bytes. The `thread_info` structure contains architecture-specific information about the thread. We know that on `x86_64` the stack grows down and `thread_union.thread_info` is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and `thread_info` is at the bottom. The remaining thread size will be `16 kilobytes - 52 bytes = 16332 bytes`. Note that `thread_union` is represented as a [union](http://en.wikipedia.org/wiki/Union_type) and not a structure; it means that `thread_info` and the stack share the memory space.
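
The union trick is easy to reproduce in a standalone sketch (simplified; the real `thread_info` has more fields and the kernel uses its own types):

```C
#include <stdio.h>

#define THREAD_SIZE (16 * 1024)

/* a simplified stand-in for struct thread_info */
struct thread_info_sketch {
	unsigned long flags;
	int preempt_count;
	int cpu;
};

/* thread_info and the kernel stack share one 16 kilobyte area:
 * thread_info sits at the bottom, the stack grows down towards it */
union thread_union_sketch {
	struct thread_info_sketch thread_info;
	unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

int main(void)
{
	union thread_union_sketch tu;

	printf("whole area:     %zu bytes\n", sizeof(tu));
	printf("thread_info:    %zu bytes\n", sizeof(tu.thread_info));
	printf("room for stack: %zu bytes\n", sizeof(tu) - sizeof(tu.thread_info));
	return 0;
}
```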
 
 Schematically it can be represented as follows:
 
@@ -117,7 +117,7 @@ Schematically it can be represented as follows:
 
 http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct
 
-So the `INIT_TASK` macro fills these `task_struct's` fields and many many more. As I already wrote about, I will not describe all the fields and values in the `INIT_TASK` macro but we will see them soon.
+So the `INIT_TASK` macro fills these `task_struct's` fields and many many more. As I already wrote above, I will not describe all the fields and values in the `INIT_TASK` macro but we will see them soon.
 
Now let's go back to the `set_task_stack_end_magic` function. This function is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c#L297) and sets a [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow) on the `init` process stack to detect stack overflow.
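
The idea itself is simple and can be shown with a standalone sketch (the constant and names below are illustrative, not a quote of the kernel function):

```C
#include <stdio.h>

#define STACK_WORDS            (16 * 1024 / sizeof(unsigned long))
#define STACK_END_MAGIC_SKETCH 0x57AC6E9DUL

static unsigned long init_stack_sketch[STACK_WORDS];

int main(void)
{
	/* write the magic value into the lowest usable word of the stack area
	 * (in the kernel it is the word just above thread_info) ... */
	init_stack_sketch[0] = STACK_END_MAGIC_SKETCH;

	/* ... so that later code can detect that the stack grew down too far */
	if (init_stack_sketch[0] != STACK_END_MAGIC_SKETCH)
		puts("stack overflow detected");
	else
		puts("stack end canary is intact");

	return 0;
}
```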
 
@@ -162,7 +162,7 @@ void __init __weak smp_setup_processor_id(void)
 
 as it not implemented for all architectures, but some such as [s390](http://en.wikipedia.org/wiki/IBM_ESA/390) and [arm64](http://en.wikipedia.org/wiki/ARM_architecture#64.2F32-bit_architecture).
 
-The next function in `start_kernel` is `debug_objects_early_init`. Implementation of this function is almost the same as `lockdep_init`, but fills hashes for object debugging. As I wrote about, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.
+The next function in `start_kernel` is `debug_objects_early_init`. Implementation of this function is almost the same as `lockdep_init`, but fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.
 
 After the `debug_object_early_init` function we can see the call of the `boot_init_stack_canary` function which fills `task_struct->canary` with the canary value for the `-fstack-protector` gcc feature. This function depends on the `CONFIG_CC_STACKPROTECTOR` configuration option and if this option is disabled, `boot_init_stack_canary` does nothing, otherwise it generates random numbers based on random pool and the [TSC](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
 
@@ -284,7 +284,7 @@ For example let's look at `set_cpu_possible`. As we passed `true` as the second
 cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
 ```
 
-will be called. First of all let's try to understand the `to_cpu_mask` macro. This macro casts a bitmap to a `struct cpumask *`. CPU masks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. CPU mask presented by the `cpu_mask` structure:
+will be called. First of all let's try to understand the `to_cpumask` macro. This macro casts a bitmap to a `struct cpumask *`. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the `cpumask` structure:
 
 ```C
 typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
@@ -344,7 +344,7 @@ Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubu
 Architecture-dependent parts of initialization
 ---------------------------------------------------------------------------------
 
-The next step is architecture-specific initializations. The Linux kernel does it with the call of the `setup_arch` function. This is a very big function like `start_kernel` and we do not have time to consider all of its implementation in this part. Here we'll only start to do it and continue in the next part. As it is `architecture-specific`, we need to go again to the `arch/` directory. The `setup_arch` function defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and takes only one argument - address of the kernel command line.
+The next step is architecture-specific initialization. The Linux kernel does it with the call of the `setup_arch` function. This is a very big function like `start_kernel` and we do not have time to consider all of its implementation in this part. Here we'll only start to do it and continue in the next part. As it is `architecture-specific`, we need to go again to the `arch/` directory. The `setup_arch` function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and takes only one argument - the address of the kernel command line.
 
 This function starts from the reserving memory block for the kernel `_text` and `_data` which starts from the `_text` symbol (you can remember it from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S#L46)) and ends before `__bss_stop`. We are using `memblock` for the reserving of memory block:
 
@@ -432,11 +432,11 @@ memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
 Conclusion
 ---------------------------------------------------------------------------------
 
-It is the end of the fourth part about the Linux kernel initialization process. We started to dive in the kernel generic code from the `start_kernel` function in this part and stopped on the architecture-specific initializations in the `setup_arch`. In the next part we will continue with architecture-dependent initialization steps.
+It is the end of the fourth part about the Linux kernel initialization process. We started to dive into the kernel generic code from the `start_kernel` function in this part and stopped at the architecture-specific initialization in `setup_arch`. In the next part we will continue with the architecture-dependent initialization steps.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 34 - 34
Initialization/linux-initialization-5.md

@@ -1,10 +1,10 @@
 Kernel initialization. Part 5.
 ================================================================================
 
-Continue of architecture-specific initializations
+Continuation of architecture-specific initialization
 ================================================================================
 
-In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will miss functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You can remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions:
+In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of architecture-specific stuff in the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we have reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), the next step is `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related to it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes the gates for the debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts. If you don't know anything about interrupts, you can read about them in [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In the `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts and exceptions:
 
 ```
 ----------------------------------------------------------------------------------------------
@@ -14,7 +14,7 @@ In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initializat
 ----------------------------------------------------------------------------------------------
 ```
 
-Debug interrupt `#DB` is the primary means of invoking debuggers. `early_trap_init` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This functions sets `#DB` and `#BP` handlers and reloads [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table):
+The debug interrupt `#DB` is the primary method of invoking debuggers. `early_trap_init` is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This function sets the `#DB` and `#BP` handlers and reloads the [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table):
 
 ```C
 void __init early_trap_init(void)
@@ -25,11 +25,11 @@ void __init early_trap_init(void)
 }
 ```
 
-We already saw implementation of the `set_intr_gate` in the previous part about interrupts. Here are two similar functions `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these two functions take two parameters:
+We already saw the implementation of `set_intr_gate` in the previous part about interrupts. Here are two similar functions: `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these functions take three parameters:
 
 * number of the interrupt;
 * base address of the interrupt/exception handler;
+* the third parameter is the `Interrupt Stack Table`. `IST` is a new mechanism in `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has its own kernel stack which is 16 kilobytes. While a thread is in user space, its kernel stack is empty except for `thread_info` (read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides a feature which allows switching to a new `special` stack during events such as a non-maskable interrupt and so on. The name of this feature is `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to a dedicated stack. In our case this is `DEBUG_STACK` (see the toy sketch right after this list).
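
Here is a toy model of that stack switch (illustration only; this is not how the hardware or the kernel code is actually written):

```C
#include <stdio.h>

#define IST_ENTRIES 7
#define STACK_SIZE  4096

/* simplified stand-in for the per-CPU TSS: seven dedicated stack tops */
struct tss_sketch {
	unsigned long ist[IST_ENTRIES];
};

static unsigned char debug_stack[STACK_SIZE];

int main(void)
{
	struct tss_sketch tss = { { 0 } };
	unsigned int gate_ist = 1;           /* e.g. the DEBUG_STACK slot */
	unsigned long rsp = 0x7fffdeadbeef;  /* stack of the interrupted context */

	tss.ist[0] = (unsigned long)(debug_stack + STACK_SIZE);

	/* a non-zero IST index in the gate descriptor makes the CPU switch
	 * to the dedicated stack before invoking the handler */
	if (gate_ist)
		rsp = tss.ist[gate_ist - 1];

	printf("handler runs with rsp = %#lx\n", rsp);
	return 0;
}
```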
+* third parameter is - `Interrupt Stack Table`. `IST` is a new mechanism in the `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has own kernel stack which is 16 kilobytes. While a thread in user space, kernel stack is empty except `thread_info` (read about it previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. All about these stack you can read in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides feature which allows to switch to a new `special` stack for during any events as non-maskable interrupt and etc... And the name of this feature is - `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to the dedicated stack. In our case this is `DEBUG_STACK`.
 
 `set_intr_gate_ist` and `set_system_intr_gate_ist` work by the same principle as `set_intr_gate` with only one difference. Both of these functions checks
 interrupt number and call `_set_gate` inside:
@@ -39,14 +39,14 @@ BUG_ON((unsigned)n > 0xFF);
 _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
 ```
 
-as `set_intr_gate` does this. But `set_intr_gate` calls `_set_gate` with [dpl](http://en.wikipedia.org/wiki/Privilege_level) - 0, and ist - 0, but `set_intr_gate_ist` and `set_system_intr_gate_ist` sets `ist` as `DEBUG_STACK` and `set_system_intr_gate_ist` sets `dpl` as `0x3` which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, then hardware automatically sets the new stack pointer based on the IST value, then invokes the interrupt handler. All of the special kernel stacks will be setted in the `cpu_init` function (we will see it later).
+as `set_intr_gate` does. But `set_intr_gate` calls `_set_gate` with [dpl](http://en.wikipedia.org/wiki/Privilege_level) - 0 and ist - 0, while `set_intr_gate_ist` and `set_system_intr_gate_ist` set `ist` to `DEBUG_STACK` and `set_system_intr_gate_ist` sets `dpl` to `0x3` which is the lowest privilege level. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the `cpu_init` function (we will see it later).
 
As the `#DB` and `#BP` gates are written to the `idt_descr`, we reload the `IDT` table with `load_idt` which just calls the `lidt` instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in doing so. It is very interesting to delve into the linux kernel source code, so we will see how the `debug` handler is implemented in this part, and understanding how other interrupt handlers are implemented will be your task.
 
 #DB handler
 --------------------------------------------------------------------------------
 
-As you can read above, we passed address of the `#DB` handler as `&debug` in the `set_intr_gate_ist`. [lxr.free-electorns.com](http://lxr.free-electrons.com/ident) is a great resource for searching identificators in the linux kernel source code, but unfortunately you will not find `debug` handler with it. All of you can find, it is `debug` definition in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/traps.h):
+As you can read above, we passed the address of the `#DB` handler as `&debug` to `set_intr_gate_ist`. [lxr.free-electrons.com](http://lxr.free-electrons.com/ident) is a great resource for searching identifiers in the linux kernel source code, but unfortunately you will not find the `debug` handler with it. All you can find is the declaration of `debug` in [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/traps.h):
 
 ```C
 asmlinkage void debug(void);
@@ -108,9 +108,9 @@ The next two macro from the `idtentry` implementation are:
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
 ```
 
-First `ASM_CLAC` macro depends on `CONFIG_X86_SMAP` configuration option and need for security resason, more about it you can read [here](https://lwn.net/Articles/517475/). The second `PARAVIRT_ADJUST_EXCEPTION_FRAME` macro is for handling handle Xen-type-exceptions (this chapter about kernel initializations and we will not consider virtualization stuff here).
+The first `ASM_CLAC` macro depends on the `CONFIG_X86_SMAP` configuration option and is needed for security reasons; you can read more about it [here](https://lwn.net/Articles/517475/). The second `PARAVIRT_ADJUST_EXCEPTION_FRAME` macro is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).
 
-The next piece of code checks is interrupt has error code or not and pushes `$-1` which is `0xffffffffffffffff` on `x86_64` on the stack if not:
+The next piece of code checks whether the interrupt has an error code or not and pushes `$-1`, which is `0xffffffffffffffff` on `x86_64`, onto the stack if it does not:
 
 ```assembly
 	.ifeq \has_error_code
@@ -118,7 +118,7 @@ The next piece of code checks is interrupt has error code or not and pushes `$-1
 	.endif
 ```
 
-We need to do it as `dummy` error code for stack consistency for all interrupts. In the next step we subscract from the stack pointer `$ORIG_RAX-R15`:
+We need this as a `dummy` error code for stack consistency across all interrupts. In the next step we subtract `$ORIG_RAX-R15` from the stack pointer:
 
 ```assembly
 	subq $ORIG_RAX-R15, %rsp
@@ -144,19 +144,19 @@ Here we checks first and second bits in the `CS`. You can remember that `CS` reg
 1:	ret
 ```
 
-In the next steps we put `pt_regs` pointer to the `rdi`, save error code in the `rsi` if it is and call interrupt handler which is - `do_debug` in our case from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug` like other handlers takes two parameters:
+In the next steps we put the `pt_regs` pointer into `rdi`, save the error code in `rsi` if there is one and call the interrupt handler, which in our case is `do_debug` from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug`, like other handlers, takes two parameters:
 
 * pt_regs - is a structure which presents set of CPU registers which are saved in the process' memory region;
 * error code - error code of interrupt.
 
After the interrupt handler finishes its work, it calls `paranoid_exit` which restores the stack, switches to userspace if the interrupt came from there and calls `iret`. That's all. Of course it is not all :), but we will go deeper in the separate chapter about interrupts.
 
-This is general view of the `idtentry` macro for `#DB` interrupt. All interrupts are similar on this implementation and defined with idtentry too. After `early_trap_init` finished its work, the next function is `early_cpu_init`. This function defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about a CPU and its vendor.
+This is a general view of the `idtentry` macro for the `#DB` interrupt. All interrupts are similar to this implementation and are defined with `idtentry` too. After `early_trap_init` finishes its work, the next function is `early_cpu_init`. This function is defined in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about the CPU and its vendor.
 
 Early ioremap initialization
 --------------------------------------------------------------------------------
 
-The next step is initialization of early `ioremap`. In general there are two ways to comminicate with devices:
+The next step is initialization of early `ioremap`. In general there are two ways to communicate with devices:
 
 * I/O Ports;
 * Device memory.
@@ -170,14 +170,14 @@ pmd_t *pmd;
 BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
 ```
 
-`fixmap` - is fixed virtual address mappings which extends from `FIXADDR_START` to `FIXADDR_TOP`. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check `early_ioremap_init` makes a call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). `early_ioremap_setup` fills `slot_virt` arry of the `unsigned long` with virtual addresses with 512 temporary boot-time fix-mappings:
+`fixmap` is a set of fixed virtual address mappings which extend from `FIXADDR_START` to `FIXADDR_TOP`. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, `early_ioremap_init` makes a call to the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). `early_ioremap_setup` fills the `slot_virt` array of `unsigned long` with the virtual addresses of the 512 temporary boot-time fix-mappings:
 
 ```C
 for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
     slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
 ```
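
The `__fix_to_virt` part is just arithmetic: every fixmap slot gets a compile-time-known virtual address, counting down one page per index from the top of the fixmap area. Here is a standalone sketch (the top address below is made up for illustration):

```C
#include <stdio.h>

#define PAGE_SHIFT        12
#define FIXADDR_TOP_FAKE  0xffffffffff5ff000UL /* illustrative value only */

/* mirror of the "top minus index-times-page-size" calculation */
#define fix_to_virt_sketch(idx) (FIXADDR_TOP_FAKE - ((unsigned long)(idx) << PAGE_SHIFT))

int main(void)
{
	for (unsigned int i = 0; i < 4; i++)
		printf("slot %u -> %#lx\n", i, fix_to_virt_sketch(i));
	return 0;
}
```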
 
-After this we get page middle directory entry for the `FIX_BTMAP_BEGIN` and put to the `pmd` variable, fills with zeros `bm_pte` which is boot time page tables and call `pmd_populate_kernel` function for setting given page table entry in the given page middle directory:
+After this we get the page middle directory entry for `FIX_BTMAP_BEGIN` and put it into the `pmd` variable, fill `bm_pte` (the boot time page tables) with zeros and call the `pmd_populate_kernel` function to set the given page table entry in the given page middle directory:
 
 ```C
 pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
@@ -185,7 +185,7 @@ memset(bm_pte, 0, sizeof(bm_pte));
 pmd_populate_kernel(&init_mm, pmd, bm_pte);
 ```
 
-That's all for this. If you feeling missunderstanding, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.
+That's all for this. If you are feeling puzzled, don't worry. There is a special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.
 
 Obtaining major and minor numbers for the root device
 --------------------------------------------------------------------------------
@@ -208,7 +208,7 @@ Protocol:	ALL
   deprecated, use the "root=" option on the command line instead.
 ```
 
-Now let's try understand what is it `old_decode_dev`. Actually it just calls `MKDEV` inside which generates `dev_t` from the give major and minor numbers. It's implementation pretty easy:
+Now let's try to understand what `old_decode_dev` does. Actually it just calls `MKDEV` inside, which generates a `dev_t` from the given major and minor numbers. Its implementation is pretty simple:
 
 ```C
 static inline dev_t old_decode_dev(u16 val)
@@ -217,7 +217,7 @@ static inline dev_t old_decode_dev(u16 val)
 }
 ```
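
For example, the old-style 16-bit value `0x0801` (major `8`, minor `1`) decodes like this; the snippet below is a standalone sketch using plain C types rather than the kernel's `dev_t` and `MKDEV`:

```C
#include <stdio.h>
#include <stdint.h>

#define MINORBITS_SKETCH 20

/* pack major/minor the way the kernel-internal layout (described just below) does */
#define MKDEV_SKETCH(major, minor) (((uint32_t)(major) << MINORBITS_SKETCH) | (minor))

static uint32_t old_decode_dev_sketch(uint16_t val)
{
	return MKDEV_SKETCH((val >> 8) & 255, val & 255);
}

int main(void)
{
	uint32_t dev = old_decode_dev_sketch(0x0801);

	printf("major: %u, minor: %u\n",
	       dev >> MINORBITS_SKETCH, dev & ((1u << MINORBITS_SKETCH) - 1));
	/* prints: major: 8, minor: 1 */
	return 0;
}
```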
 
-where `dev_t` is a kernel data type to present major/minor number pair.  But what's the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bit for major number and 8 bit for minor number. But there is problem with this way: 256 major numbers and 256 minor numbers are possible. So 16-bit integer was replaced with 32-bit integer where 12 bits reserved for major number and 20 bits for minor. You can see this in the `new_decode_dev` implementation:
+where `dev_t` is a kernel data type representing a major/minor number pair. But what's with the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced by a 32-bit integer where 12 bits are reserved for the major number and 20 bits for the minor. You can see this in the `new_decode_dev` implementation:
 
 ```C
 static inline dev_t new_decode_dev(u32 dev)
@@ -248,7 +248,7 @@ The next point is the setup of the memory map with the call of the `setup_memory
 	bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
 ```
 
-All of these parameters we got during boot time and stored in the `boot_params` structure. After this we need to setup the end of the I/O memory. As you know the one of the main purposes of the kernel is resource management. And one of the resource is a memory. As we already know there are two ways to communicate with devices are I/O ports and device memory. All information about registered resources available through:
+We got all of these parameters during boot time and they are stored in the `boot_params` structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of these resources is memory. As we already know, the two ways to communicate with devices are I/O ports and device memory. All information about registered resources is available through:
 
 * /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
 * /proc/iomem   - provides current map of the system's memory for each physical device.
@@ -315,15 +315,15 @@ EXPORT_SYMBOL(iomem_resource);
 
 TODO EXPORT_SYMBOL
 
-`iomem_resource` defines root addresses range for io memory with `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As i wrote about our current point is setup the end address of the `iomem`. We will do it with:
+`iomem_resource` defines the root address range for io memory with the `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as its flags. As I wrote above, our current task is to set up the end address of the `iomem`. We will do it with:
 
 ```C
 iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
 ```
 
-Here we shift `1` on `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is `cpuinfo_x86` structure which we filled during execution of the `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it presents maximum bits amount of the maximum physical address in the system. Note also that `iomem_resource` passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking or in another words it makes a symbol accessible to dynamically loaded modules.
+Here we shift `1` left by `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is a `cpuinfo_x86` structure which we filled during execution of `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it represents the number of bits in the maximum physical address supported by the system. Note also that `iomem_resource` is passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.
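
As a quick sanity check of that expression, here is the same arithmetic in a standalone snippet (assuming, for illustration, a CPU that reports `36` physical address bits):

```C
#include <stdio.h>

int main(void)
{
	unsigned int x86_phys_bits = 36; /* illustrative value */
	unsigned long long end = (1ULL << x86_phys_bits) - 1;

	/* prints: iomem_resource.end = 0xfffffffff (64 GB address space) */
	printf("iomem_resource.end = %#llx (%llu GB address space)\n",
	       end, (end + 1) >> 30);
	return 0;
}
```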
 
-As we set the end address of the root `iomem` resource address range, as I wrote about the next step will be setup of the memory map. It will be produced with the call of the `setup_memory_map` function:
+After we set the end address of the root `iomem` resource address range, as I wrote above, the next step is the setup of the memory map. It is produced with the call of the `setup_memory_map` function:
 
 ```C
 void __init setup_memory_map(void)
@@ -337,7 +337,7 @@ void __init setup_memory_map(void)
 }
 ```
 
-First of all we call look here the call of the `x86_init.resources.memory_setup`. `x86_init` is a `x86_init_ops` structure which presents platform specific setup functions as resources initializtion, pci initialization and etc... Initiaization of the `x86_init` is in the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c). I will not give here the full description because it is very long, but only one part which interests us for now:
+First of all we see here the call of `x86_init.resources.memory_setup`. `x86_init` is a `x86_init_ops` structure which holds platform specific setup functions such as resources initialization, pci initialization and so on. The initialization of `x86_init` is in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c). I will not give the full description here because it is very long, only the one part which interests us for now:
 
 ```C
 struct x86_init_ops x86_init __initdata = {
@@ -352,7 +352,7 @@ struct x86_init_ops x86_init __initdata = {
 }
 ```
 
-As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions collect, print of all regions with printk. You can find this print if you execute `dmesg` command, you must see something like this:
+As we can see here, the `memory_setup` field is `default_machine_specific_memory_setup`, where we get the number of [e820](http://en.wikipedia.org/wiki/E820) entries which we collected at [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill the `e820map` structure with the memory regions. After all regions are collected, they are printed with printk. You can find this output if you execute the `dmesg` command; you should see something like this:
 
 ```
 [    0.000000] e820: BIOS-provided physical RAM map:
@@ -390,7 +390,7 @@ Protocol:	2.09+
   parameters passing mechanism.
 ```
 
-It used for storing setup information for different types as device tree blob, EFI setup data and etc... In the second step we copy BIOS EDD informantion from the `boot_params` structure that we collected in the [arch/x86/boot/edd.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c) to the `edd` structure:
+It is used for storing setup information of different types such as the device tree blob, EFI setup data and so on. In the second step we copy the BIOS EDD information from the `boot_params` structure, which we collected in [arch/x86/boot/edd.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c), to the `edd` structure:
 
 ```C
 static inline void __init copy_edd(void)
@@ -406,7 +406,7 @@ static inline void __init copy_edd(void)
 Memory descriptor initialization
 --------------------------------------------------------------------------------
 
-The next step is initialization of the memory descriptor of the init process. As you already can know every process has own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
+The next step is the initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is presented with a special data structure called the `memory descriptor`. In the linux kernel source code the memory descriptor is presented with the `mm_struct` structure. `mm_struct` contains many different fields related to the process address space, such as the start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and so on. This structure is defined in [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has its own memory descriptor, the `task_struct` structure contains it in the `mm` and `active_mm` fields. And our first `init` process has it too. You may remember that we saw part of the initialization of the init `task_struct` with the `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
 
 ```C
 #define INIT_TASK(tsk)  \
@@ -420,7 +420,7 @@ The next step is initialization of the memory descriptor of the init process. As
 }
 ```
 
-`mm` points to the process address space and `active_mm` points to the active address space if process has no own as kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill memory descriptor of the initial process: 
+`mm` points to the process address space and `active_mm` points to the active address space if the process has no address space of its own, as is the case with kernel threads (you can read more about it in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill in the memory descriptor of the initial process:
 
 ```C
 	init_mm.start_code = (unsigned long) _text;
@@ -429,7 +429,7 @@ The next step is initialization of the memory descriptor of the init process. As
 	init_mm.brk = _brk_end;
 ```
 
-with the kernel's text, data and brk. `init_mm` is memory descriptor of the initial process and defined as:
+with the kernel's text, data and brk. `init_mm` is the memory descriptor of the initial process and is defined as:
 
 ```C
 struct mm_struct init_mm = {
@@ -444,7 +444,7 @@ struct mm_struct init_mm = {
 };
 ```
 
-where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is address space users, `mm_count` is primary usage counter and `mmap_sem` is memory area semaphore. After that we setup memory descriptor of the initiali process, next step is initialization of the intel Memory Protection Extensions with `mpx_mm_init`. The next step after it is initialization of the code/data/bss resources with:
+where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is the number of address space users, `mm_count` is the primary usage counter and `mmap_sem` is the memory area semaphore. After we set up the memory descriptor of the initial process, the next step is initialization of the Intel Memory Protection Extensions with `mpx_mm_init`. The next step is initialization of the code/data/bss resources with:
 
 ```C
 	code_resource.start = __pa_symbol(_text);
@@ -455,7 +455,7 @@ where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointe
 	bss_resource.end = __pa_symbol(__bss_stop)-1;
 ```
 
-We already know a little about `resource` structure (read above). Here we fills code/data/bss resources with the physical addresses of they. You can see it in the `/proc/iomem` output:
+We already know a little about the `resource` structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in the `/proc/iomem` output:
 
 ```C
 00100000-be825fff : System RAM
@@ -464,7 +464,7 @@ We already know a little about `resource` structure (read above). Here we fills
   01a11000-01ac3fff : Kernel bss
 ```
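
If you want to check this on your own machine, the kernel code/data/bss resource lines can be pulled straight out of `/proc/iomem`. The following small user-space helper is not from the book or the kernel, just an illustration; note that on many kernels you need to be root to see the real addresses:

```C
/* Prints the "Kernel code/data/bss" resource lines from /proc/iomem. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/iomem", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	while (fgets(line, sizeof(line), f))
		if (strstr(line, "Kernel"))
			fputs(line, stdout);

	fclose(f);
	return 0;
}
```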
 
-All of these structures defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization:
+All of these structures are defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization:
 
 ```C
 static struct resource code_resource = {
@@ -490,11 +490,11 @@ void x86_configure_nx(void)
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecutre-specific stuff. It was long part, but we not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover full of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameteres, early dump of the pci devices, direct Media Interface scanning and many many more.
+This is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the `setup_arch` function which performs initialization of architecture-specific stuff. It was a long part, but we have not finished with it. As I already wrote, `setup_arch` is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap, etc. Don't worry if they are unclear to you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameters, early dump of the PCI devices, Desktop Management Interface scanning and many many more.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 32 - 34
Initialization/linux-initialization-6.md

@@ -1,10 +1,10 @@
 Kernel initialization. Part 6.
 ================================================================================
 
-Architecture-specific initializations, again...
+Architecture-specific initialization, again...
 ================================================================================
 
-In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different some services depends on give parameters (all kernel command line parameters you can find in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You can remember how we setup `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already can note calls like this:
+In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and finished on the `x86_configure_nx` function which sets the `_PAGE_NX` flag depending on support of the [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before, the `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and as you can understand from its name, this function parses the kernel command line and sets up different services depending on the given parameters (you can find all kernel command line parameters in [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt)). You may remember how we set up `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). At that early stage we looked for kernel parameters and their values with the `cmdline_find_option` function and the `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cmdline.c). Now we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you have probably already noticed calls like this:
 
 ```C
 early_param("gbpages", parse_direct_gbpages_on);
@@ -13,7 +13,7 @@ early_param("gbpages", parse_direct_gbpages_on);
 `early_param` macro takes two parameters:
 
 * command line parameter name;
-* function which will be called if given parameter passed.
+* function which will be called if the given parameter is passed.
 
 and defined as:
 
@@ -48,7 +48,7 @@ and contains three fields:
 
 * name of the kernel parameter;
 * function which setups something depend on parameter;
-* field determinies is parameter early (1) or not (0).
+* field that determines whether the parameter is early (1) or not (0) (a minimal registration sketch follows this list).
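
To make the mechanism concrete, here is a minimal sketch of how such an early parameter could be registered. The parameter name `my_feature` and its handler are made up for illustration and are not real kernel code; only the `early_param` registration pattern itself is what the kernel actually uses. The handler gets the text after `my_feature=` from the kernel command line and returns `0` on success:

```C
/* Hypothetical example, not an actual kernel parameter. */
static bool my_feature_enabled __initdata;

static int __init parse_my_feature(char *arg)
{
	if (arg && !strcmp(arg, "on"))
		my_feature_enabled = true;
	return 0;
}
early_param("my_feature", parse_my_feature);
```

Because of the `.init.setup` placement described just below, this registration ends up between `__setup_start` and `__setup_end`, where `do_early_param` can find it.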
 
 Note that the `__setup_param` macro is defined with the `__section(.init.setup)` attribute. It means that all the registered `obs_kernel_param` structures will be placed in the `.init.setup` section; moreover, as we can see in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h), they will be placed between `__setup_start` and `__setup_end`:
 
@@ -78,7 +78,7 @@ void __init parse_early_param(void)
 }
 ```
 
-The `parse_early_param` function defines two static variables. First `done` check that `parse_early_param` already called and the second is temporary storage for kernel command line. After this we copy `boot_command_line` to the temporary commad line which we just defined and call the `parse_early_options` function from the the same source code `main.c` file. `parse_early_options` calls the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/) where `parse_args` parses given command line and calls `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) goes from the ` __setup_start` to `__setup_end`, and calls the function from the `obs_kernel_param` if a parameter is early. After this all services which are depend on early command line parameters were setup and the next call after the `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set `NX-bit` with the `x86_configure_nx`. The next `x86_report_nx` function the [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about the `NX`. Note that we call `x86_report_nx` not right after the `x86_configure_nx`, but after the call of the `parse_early_param`. The answer is simple: we call it after the `parse_early_param` because the kernel support `noexec` parameter:
+The `parse_early_param` function defines two static variables. The first, `done`, checks whether `parse_early_param` was already called and the second is temporary storage for the kernel command line. After this we copy `boot_command_line` to the temporary command line which we just defined and call the `parse_early_options` function from the same `main.c` source code file. `parse_early_options` calls the `parse_args` function from [kernel/params.c](https://github.com/torvalds/linux/blob/master/) where `parse_args` parses the given command line and calls the `do_early_param` function. This [function](https://github.com/torvalds/linux/blob/master/init/main.c#L413) goes from ` __setup_start` to `__setup_end`, and calls the function from the `obs_kernel_param` if a parameter is early. After this all services which depend on early command line parameters are set up, and the next call after `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set the `NX-bit` with `x86_configure_nx`. The next `x86_report_nx` function from [arch/x86/mm/setup_nx.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/setup_nx.c) just prints information about the `NX`. Note that we call `x86_report_nx` not right after `x86_configure_nx`, but after the call of `parse_early_param`. The answer is simple: we call it after `parse_early_param` because the kernel supports the `noexec` parameter:
 
 ```
 noexec		[X86]
@@ -97,7 +97,7 @@ After this we can see call of the:
 	memblock_x86_reserve_range_setup_data();
 ```
 
-function. This function defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)).
+function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserves the memory block for the `setup_data` (you can read more about `setup_data` in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)).
 
 In the next step we can see following conditional statement:
 
@@ -110,7 +110,7 @@ In the next step we can see following conditional statement:
 	}
 ```
 
-The first `acpi_mps_check` function from the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) depends on `CONFIG_X86_LOCAL_APIC` and `CNOFIG_x86_MPPARSE` configuration options:
+The first `acpi_mps_check` function from [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options:
 
 ```C
 int __init acpi_mps_check(void)
@@ -128,9 +128,7 @@ int __init acpi_mps_check(void)
 }
 ```
 
-It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` which means that
-
-we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clears `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
+It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not set, `acpi_mps_check` prints a warning message if one of the command line options `acpi=off`, `acpi=noirq` or `pci=noacpi` is passed to the kernel. If `acpi_mps_check` returns `1` it means that we disable the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear the `X86_FEATURE_APIC` bit of the current CPU with the `setup_clear_cpu_cap` macro (you can read more about CPU masks in [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
 
 Early PCI dump
 --------------------------------------------------------------------------------
@@ -144,13 +142,13 @@ In the next step we make a dump of the [PCI](http://en.wikipedia.org/wiki/Conven
 #endif
 ```
 
-`pci_early_dump_regs` variable defined in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter: `pci=earlydump`. We can find defition of this parameter in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch):
+The `pci_early_dump_regs` variable is defined in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c) and its value depends on the kernel command line parameter `pci=earlydump`. We can find the definition of this parameter in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/drivers/pci/pci.c):
 
 ```C
 early_param("pci", pci_setup);
 ```
 
-`pci_setup` function gets the string after the `pci=` and analyzes it. This function calls `pcibios_setup` which defined as `__weak` in the [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/arch) and every architecture defines the same function which overrides `__weak` analog. For example `x86_64` architecture-depened version is in the [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c):
+The `pci_setup` function gets the string after `pci=` and analyzes it. This function calls `pcibios_setup` which is defined as `__weak` in [drivers/pci/pci.c](https://github.com/torvalds/linux/blob/master/drivers/pci/pci.c) and every architecture defines its own function which overrides the `__weak` analog. For example the `x86_64` architecture-dependent version is in [arch/x86/pci/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/pci/common.c):
 
 ```C
 char *__init pcibios_setup(char *str) {
@@ -190,7 +188,7 @@ for (bus = 0; bus < 256; bus++) {
 
 and read the `pci` config with the `read_pci_config` function.
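
The early PCI access goes through the legacy `0xCF8`/`0xCFC` port pair: the bus/device/function/register offset are packed into an address that is written to `0xCF8`, and the dword is then read back from `0xCFC`. Here is a simplified user-space sketch of this mechanism, not the literal kernel implementation; it needs root (`iopl`) and x86 Linux, and exists only to illustrate the handshake:

```C
#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>

static uint32_t read_pci_config(uint8_t bus, uint8_t slot, uint8_t func, uint8_t offset)
{
	uint32_t addr = 0x80000000u |        /* enable bit */
			(bus  << 16) |
			(slot << 11) |
			(func << 8)  |
			(offset & 0xfc);     /* dword-aligned register offset */

	outl(addr, 0xcf8);                   /* select the configuration register */
	return inl(0xcfc);                   /* read the 32-bit value */
}

int main(void)
{
	if (iopl(3)) {
		perror("iopl");
		return 1;
	}

	/* Vendor/device IDs of bus 0, device 0, function 0 (usually the host bridge). */
	printf("%08x\n", read_pci_config(0, 0, 0, 0));
	return 0;
}
```

The early dump loop shown above simply performs such reads for every bus/slot/function combination and prints the registers of every function that responds.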
 
-That's all. We will no go deep in the `pci` details, but will see more details in the special `Drivers/PCI` part.
+That's all. We will not go deep into the `pci` details here, but will see more in the special `Drivers/PCI` part.
 
 Finish with memory parsing
 --------------------------------------------------------------------------------
@@ -210,14 +208,14 @@ After the `early_dump_pci_devices`, there are a couple of function related with
 	early_reserve_e820_mpc_new();
 ```
 
-Let's look on it. As you can see the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range` which adds new regions to the `e820map` with the given type which is `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing` which sanitazes `e820map` with the `sanitize_e820_map` function. Besides this two functions we can see a couple of functions related to the [e820](http://en.wikipedia.org/wiki/E820). You can see it in the listing which is above. `e820_add_kernel_range` function takes the physical address of the kernel start and end:
+Let's look at it. As you can see the first function is `e820_reserve_setup_data`. This function does almost the same as `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range` which adds new regions to the `e820map` with the given type, which is `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing` which sanitizes the `e820map` with the `sanitize_e820_map` function. Besides these two functions we can see a couple of functions related to [e820](http://en.wikipedia.org/wiki/E820). You can see them in the listing above. The `e820_add_kernel_range` function takes the physical address of the kernel start and end:
 
 ```C
 u64 start = __pa_symbol(_text);
 u64 size = __pa_symbol(_end) - start;
 ```
 
-checks that `.text` `.data` and `.bss` marked as `E820RAM` in the `e820map` and prints the warning message if not. The next function `trm_bios_range` update first 4096 bytes in `e820Map` as `E820_RESERVED` and sanitizes it again with the call of the `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has an unique number - `Page frame number`  and `e820_end_of_ram_pfn` function returns the maximum with the call of the `e820_end_pfn`:
+checks that `.text`, `.data` and `.bss` are marked as `E820_RAM` in the `e820map` and prints a warning message if not. The next function `trim_bios_range` updates the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes it again with the call of `sanitize_e820_map`. After this we get the last page frame number with the call of the `e820_end_of_ram_pfn` function. Every memory page has a unique number - `Page frame number` - and the `e820_end_of_ram_pfn` function returns the maximum with the call of `e820_end_pfn`:
 
 ```C
 unsigned long __init e820_end_of_ram_pfn(void)
@@ -226,7 +224,7 @@ unsigned long __init e820_end_of_ram_pfn(void)
 }
 ```
 
-where `e820_end_pfn` takes maximum page frame number on the certain architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In the `e820_end_pfn` we go through the all `e820` slots and check that `e820` entry has `E820_RAM` or `E820_PRAM` type because we calcluate page frame numbers only for these types, gets the base address and end address of the page frame number for the current `e820` entry and makes some checks for these addresses:
+where `e820_end_pfn` takes the maximum page frame number for the certain architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In `e820_end_pfn` we go through all `e820` slots and check that the `e820` entry has the `E820_RAM` or `E820_PRAM` type because we calculate page frame numbers only for these types, get the base address and end address of the page frame number for the current `e820` entry and make some checks for these addresses:
 
 ```C
 for (i = 0; i < e820.nr_map; i++) {
@@ -260,7 +258,7 @@ for (i = 0; i < e820.nr_map; i++) {
 	return last_pfn;
 ```
 
-After this we check that `last_pfn` which we got in the loop is not greater that maximum page frame number for the certain architecture (`x86_64` in our case), print inofmration about last page frame number and return it. We can see the `last_pfn` in the `dmesg` output:
+After this we check that `last_pfn` which we got in the loop is not greater than the maximum page frame number for the certain architecture (`x86_64` in our case), print information about the last page frame number and return it. We can see the `last_pfn` in the `dmesg` output:
 
 ```
 ...
@@ -268,7 +266,7 @@ After this we check that `last_pfn` which we got in the loop is not greater that
 ...
 ```
 
-After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn` which is the biggest page frame number in the `low memory` or bellow first `4` gigabytes. If installed more than 4 gigabytes of RAM, `max_low_pfn` will be result of the `e820_end_of_low_ram_pfn` function which does the same `e820_end_of_ram_pfn` but with 4 gigabytes limit, in other way `max_low_pfn` will be the same as `max_pfn`:
+After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn` which is the biggest page frame number in the `low memory` or below the first `4` gigabytes. If more than 4 gigabytes of RAM is installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit, otherwise `max_low_pfn` will be the same as `max_pfn`:
 
 ```C
 if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
@@ -279,7 +277,7 @@ else
 high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
 ```
 
-Next we calculate `high_memory` (defines the upper bound on direct map memory) with `__va` macro which returns a virtual address by the given physical.
+Next we calculate `high_memory` (which defines the upper bound on direct mapped memory) with the `__va` macro which returns a virtual address for the given physical address.
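
The page frame arithmetic used in these checks is easy to reproduce on its own. The following is a small user-space sketch (not kernel code) showing where the `4` gigabyte boundary `1UL << (32 - PAGE_SHIFT)` lands and how a physical address maps to a page frame number, assuming the usual `4` KiB pages; the example address is arbitrary:

```C
#include <stdio.h>

#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
	/* PFN of the first byte above 4 GiB: the max_low_pfn cutoff. */
	unsigned long four_gib_pfn = 1UL << (32 - PAGE_SHIFT);

	/* An arbitrary example physical address. */
	unsigned long phys = 0xbe825123UL;

	printf("4 GiB boundary pfn : %#lx\n", four_gib_pfn);
	printf("pfn of %#lx   : %#lx\n", phys, phys >> PAGE_SHIFT);
	printf("offset in the page : %#lx\n", phys & (PAGE_SIZE - 1));

	return 0;
}
```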
 
 DMI scanning 
 -------------------------------------------------------------------------------
@@ -291,7 +289,7 @@ dmi_scan_machine();
 dmi_memdev_walk();
 ```
 
-First is `dmi_scan_machine` defined in the [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts informantion. There are two ways specified to gain access to the `SMBIOS` table: get the pointer to the `SMBIOS` table from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)'s configuration table and scanning the physycal memory between `0xF0000` and `0x10000` addresses. Let's look on the second approach. `dmi_scan_machine` function remaps memory between `0xf0000` and `0x10000` with the `dmi_early_remap` which just expands to the `early_ioremap`:
+First is `dmi_scan_machine` defined in [drivers/firmware/dmi_scan.c](https://github.com/torvalds/linux/blob/master/drivers/firmware/dmi_scan.c). This function goes through the [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) structures and extracts information. There are two specified ways to gain access to the `SMBIOS` table: getting the pointer to the `SMBIOS` table from the [EFI](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)'s configuration table, or scanning the `0x10000` bytes of physical memory starting at `0xF0000`. Let's look at the second approach. The `dmi_scan_machine` function remaps this memory area with `dmi_early_remap` which just expands to `early_ioremap`:
 
 ```C
 void __init dmi_scan_machine(void)
@@ -321,14 +319,14 @@ for (q = p; q < p + 0x10000; q += 16) {
 }
 ```
 
-`_SM_` string must be between `000F0000h` and `0x000FFFFF`. Here we copy 16 bytes to the `buf` with `memcpy_fromio` which is the same `memcpy` and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that first 4 bytes is `_SM_` string, get `SMBIOS` version and gets `_DMI_` attributes as `DMI` structure table length, table address and etc... After one of these function will finish to execute, you will see the result of it in the `dmesg` output:
+The `_SM_` string must be located between `0x000F0000` and `0x000FFFFF`. Here we copy 16 bytes to `buf` with `memcpy_fromio` which is the same as `memcpy`, and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that the first 4 bytes are the `_SM_` string, get the `SMBIOS` version and get the `_DMI_` attributes such as the `DMI` structure table length, table address, etc. After one of these functions finishes, you will see the result in the `dmesg` output:
 
 ```
 [    0.000000] SMBIOS 2.7 present.
 [    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
 ```
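
The scanning loop shown above is simple enough to model in user space. Here is a small standalone sketch (an assumed layout, not kernel code) of the 16-byte-step signature search; the fake anchor offset is made up for the demonstration:

```C
#include <stdio.h>
#include <string.h>

/* Pretend this buffer is the remapped 0xF0000 - 0xFFFFF BIOS area. */
static unsigned char bios_area[0x10000];

int main(void)
{
	/* Place a fake "_SM_" anchor on a 16-byte boundary, as the firmware would. */
	memcpy(&bios_area[0x7d0], "_SM_", 4);

	for (size_t off = 0; off < sizeof(bios_area); off += 16) {
		if (memcmp(&bios_area[off], "_SM_", 4) == 0) {
			printf("SMBIOS entry point candidate at offset %#zx\n", off);
			break;
		}
	}

	return 0;
}
```

In the real kernel code each candidate found this way is then handed to `dmi_smbios3_present`/`dmi_present`, which validate it further and extract the table address.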
 
-In the end of the `dmi_scan_machine`, we unmap the previously remaped memory:
+At the end of `dmi_scan_machine`, we unmap the previously remapped memory:
 
 ```C
 dmi_early_unmap(p, 0x10000);
@@ -381,7 +379,7 @@ static inline void find_smp_config(void)
 }
 ```
 
-inside. `x86_init.mpparse.find_smp_config` is a `default_find_smp_config` function from the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In the `default_find_smp_config` function we are scanning a couple of memory regions for `SMP` config and return if they are not:
+inside. `x86_init.mpparse.find_smp_config` is the `default_find_smp_config` function from [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). In the `default_find_smp_config` function we scan a couple of memory regions for the `SMP` config and return if they are found:
 
 ```C
 if (smp_scan_config(0x0, 0x400) ||
@@ -414,7 +412,7 @@ struct mpf_intel {
 };
 ```
 
-As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. And operating system must have access to this information about the multiprocessor configuration and `mpf_intel` stores the physical address (look at second parameter) of the multiprocessor configuration table. So, `smp_scan_config` going in a loop through the given memory range and tries to find `MP floating pointer structure` there. It checks that current byte points to the `SMP` signature, checks checksum, checks that `mpf->specification` is 1 (it must be `1` or `4` by specification) in the loop:
+As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and `mpf_intel` stores the physical address (look at the second parameter) of the multiprocessor configuration table. So, `smp_scan_config` goes in a loop through the given memory range and tries to find the `MP floating pointer structure` there. In the loop it checks that the current byte points to the `SMP` signature, checks the checksum, and checks that `mpf->specification` is `1` or `4` (it must be `1` or `4` by the specification):
 
 ```C
 while (length > 0) {
@@ -432,12 +430,12 @@ if ((*bp == SMP_MAGIC_IDENT) &&
 }
 ```
 
-reserves given memory block if search is successful with `memblock_reserve` and reserves physical address of the multiprocessor configuration table. All documentation about this you can find in the - [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). More details you can read in the special part about `SMP`.
+reserves the given memory block with `memblock_reserve` if the search is successful and reserves the physical address of the multiprocessor configuration table. You can find all the documentation about this in the [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf). You can read more details in the special part about `SMP`.
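
The checksum check mentioned above is just a byte sum over the structure: all bytes of the MP floating pointer structure, including the checksum field itself, must add up to zero modulo 256. A small sketch of such a check (not the literal kernel helper) could look like this:

```C
/* Returns 1 if the byte sum over the structure is zero modulo 256,
 * which is how the MP floating pointer structure is validated. */
static int mpf_checksum_ok(const unsigned char *p, int len)
{
	unsigned char sum = 0;

	while (len--)
		sum += *p++;

	return sum == 0;
}
```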
 
 Additional early memory initialization routines
 --------------------------------------------------------------------------------
 
-In the next step of the `setup_arch` we can see the call of the `early_alloc_pgt_buf` function which allocates the page table buffer for early stage. The page table buffer will be place in the `brk` area. Let's look on its implementation:
+In the next step of the `setup_arch` we can see the call of the `early_alloc_pgt_buf` function which allocates the page table buffer for the early stage. The page table buffer will be placed in the `brk` area. Let's look at its implementation:
 
 ```C
 void  __init early_alloc_pgt_buf(void)
@@ -453,7 +451,7 @@ void  __init early_alloc_pgt_buf(void)
 }
 ```
 
-First of all it get the size of the page table buffer, it will be `INIT_PGT_BUF_SIZE` which is `(6 * PAGE_SIZE)` in the current linux kernel 4.0. As we got the size of the page table buffer, we call `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the linux kernel linker script `brk` in memory right after the [BSS](http://en.wikipedia.org/wiki/.bss):
+First of all it gets the size of the page table buffer, which will be `INIT_PGT_BUF_SIZE` or `(6 * PAGE_SIZE)` in the current linux kernel 4.0. As we got the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the linux kernel linker script, `brk` is in memory right after the [BSS](http://en.wikipedia.org/wiki/.bss):
 
 ```C
 	. = ALIGN(PAGE_SIZE);
@@ -469,7 +467,7 @@ Or we can find it with `readelf` util:
 
 ![brk area](http://oi61.tinypic.com/71lkeu.jpg)
 
-After that we got physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step as we got page table buffer, we reserve memory block for the brk are with the `reserve_brk` function:
+After we got the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the end of the page table buffer. In the next step, as we got the page table buffer, we reserve the memory block for the `brk` area with the `reserve_brk` function:
 
 ```C
 static void __init reserve_brk(void)
@@ -482,7 +480,7 @@ static void __init reserve_brk(void)
 }
 ```
 
-Note that in the end of the `reserve_brk`, we set `brk_start` to zero, because after this we will not allocate it anymore. The next step after reserving memory block for the `brk`, we need to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remeber that kernel mapping is `__START_KERNEL_map` and `_end - _text` or `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. In the start of the `clean_high_map` we define these parameters:
+Note that in the end of `reserve_brk`, we set `brk_start` to zero, because after this we will not allocate from it anymore. As the next step after reserving the memory block for the `brk`, we need to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remember that the kernel mapping is `__START_KERNEL_map` and `_end - _text`, or `level2_kernel_pgt` maps the kernel `_text`, `data` and `bss`. At the start of `cleanup_highmap` we define these parameters:
 
 ```C
 unsigned long vaddr = __START_KERNEL_map;
@@ -517,18 +515,18 @@ MEMBLOCK configuration:
  reserved[0x2]	[0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
 ```
 
-The rest functions after the `memblock_x86_fill` are: `early_reserve_e820_mpc_new` alocates additional slots in the `e820map` for MultiProcessor Specification table, `reserve_real_mode` - reserves low memory from `0x0` to 1 megabyte for the trampoline to the real mode (for rebootin and etc...), `trim_platform_memory_ranges` - trims certain memory regions started from `0x20050000`, `0x20110000` and etc... these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with these regions, `trim_low_memory_range` reserves the first 4 killobytes page in `memblock`, `init_mem_mapping` function reconstructs direct memory mapping and setups the direct mapping of the physical memory at `PAGE_OFFSET`, `early_trap_pf_init` setups `#PF` handler (we will look on it in the chapter about interrupts) and `setup_real_mode` function setups trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code.
+The remaining functions after `memblock_x86_fill` are: `early_reserve_e820_mpc_new` allocates additional slots in the `e820map` for the MultiProcessor Specification table, `reserve_real_mode` reserves low memory from `0x0` to 1 megabyte for the trampoline to real mode (for rebooting, etc.), `trim_platform_memory_ranges` trims certain memory regions starting from `0x20050000`, `0x20110000`, etc. (these regions must be excluded because [Sandy Bridge](http://en.wikipedia.org/wiki/Sandy_Bridge) has problems with them), `trim_low_memory_range` reserves the first 4 kilobyte page in `memblock`, the `init_mem_mapping` function reconstructs direct memory mapping and sets up the direct mapping of the physical memory at `PAGE_OFFSET`, `early_trap_pf_init` sets up the `#PF` handler (we will look at it in the chapter about interrupts) and the `setup_real_mode` function sets up the trampoline to the [real mode](http://en.wikipedia.org/wiki/Real_mode) code.
 
-That's all. You can note that this part will not cover all functions which are in the `setup_arch` (like `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initalization and etc...). As I already wrote many times, `setup_arch` is big, and linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important,... but you can say something like: each line of code is important. Yes, it's true, but I missed they anyway, because I think that it is not real to cover full linux kernel. Anyway we will often return to the idea that we have already seen, and if something will be unfamiliar, we will cover this theme.
+That's all. Note that this part does not cover all of the functions which are in `setup_arch` (like `early_gart_iommu_check`, [mtrr](http://en.wikipedia.org/wiki/Memory_type_range_register) initialization, etc.). As I already wrote many times, `setup_arch` is big, and the linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I skipped them anyway, because I think that it is not realistic to cover the full linux kernel. Anyway we will often return to the ideas that we have already seen, and if something is unfamiliar, we will cover this theme.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the sixth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function again It was long part, but we not finished with it. Yes, `setup_arch` is big, hope that next part will be last about this function.
+This is the end of the sixth part about the linux kernel initialization process. In this part we continued to dive into the `setup_arch` function again and it was a long part, but we are not finished with it. Yes, `setup_arch` is big; hopefully the next part will be the last one about this function.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 34 - 34
Initialization/linux-initialization-7.md

@@ -1,12 +1,12 @@
 Kernel initialization. Part 7.
 ================================================================================
 
-The End of the architecture-specific initializations, almost...
+The End of the architecture-specific initialization, almost...
 ================================================================================
 
-This is the seventh parth of the Linux Kernel initialization process which covers internals of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/master/kernel/printk/printk.c).
+This is the seventh part of the Linux Kernel initialization process which covers the insides of the `setup_arch` function from [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L861). As you may know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) devices and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set the limit of the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to all mapped pages, we can see the call of the `setup_log_buf` function from [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/master/kernel/printk/printk.c).
 
-The `setup_log_buf` function setups kernel cyclic buffer which length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of the `CONFIG_LOG_BUF_SHIFT` it can be between `12` and `21`. In the internals, buffer defined as array of chars:
+The `setup_log_buf` function sets up the kernel cyclic buffer and its length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of `CONFIG_LOG_BUF_SHIFT`, it can be between `12` and `21`. Internally, the buffer is defined as an array of chars:
 
 ```C
 #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
@@ -14,7 +14,7 @@ static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
 static char *log_buf = __log_buf;
 ```
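
To get a feel for the allowed sizes, here is a small user-space sketch (not kernel code) printing the buffer size implied by each valid `CONFIG_LOG_BUF_SHIFT` value:

```C
#include <stdio.h>

int main(void)
{
	/* CONFIG_LOG_BUF_SHIFT may be between 12 and 21, so the static
	 * __log_buf ranges from 4 KiB up to 2 MiB. */
	for (int shift = 12; shift <= 21; shift++)
		printf("CONFIG_LOG_BUF_SHIFT=%d -> %lu KiB\n",
		       shift, (1UL << shift) / 1024);

	return 0;
}
```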
 
-Now let's look on the implementation of th `setup_log_buf` function. It starts with check that current buffer is empty (It must be empty, because we just setup it) and another check that it is early setup. If setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function which increase size of the buffer for every CPU:
+Now let's look at the implementation of the `setup_log_buf` function. It starts with a check that the current buffer is empty (it must be empty, because we just set it up) and another check that it is an early setup. If the setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function which increases the size of the buffer for every CPU:
 
 ```C
 if (log_buf != __log_buf)
@@ -30,9 +30,9 @@ We will not research `log_buf_add_cpu` function, because as you can see in the `
 setup_log_buf(1);
 ```
 
-where `1` means that is is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return.
+where `1` means that it is an early setup. In the next step we check the `new_log_buf_len` variable which is the updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function, or just return.
 
-As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) to the down into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` was provided by a bootloader. All the same as we saw it in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater that this area with:
+As the kernel log buffer is ready, the next function is `reserve_initrd`. You may remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we have reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move the [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts with the definition of the base address and end address of the `initrd` and checks that the `initrd` is provided by a bootloader. It is all the same as what we saw in `early_reserve_initrd`. But instead of reserving a place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:
 
 ```C
 mapped_size = memblock_mem_size(max_pfn_mapped);
@@ -42,13 +42,13 @@ if (ramdisk_size >= (mapped_size>>1))
 	      ramdisk_size, mapped_size>>1);
 ```
           
-You can see here that we call `memblock_mem_size` function and pass the `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest direct mapped page frame number. If you do not remember what is it `page frame number`, explanation is simple: First `12` bits of the virtual address represent offset in the physical page or page frame. If we will shift right virtual address on `12`, we'll discard offset part and will get `Page Frame Number`. In the `memblock_mem_size` we go through the all memblock `mem` (not reserved) regions and calculates size of the mapped pages amount and return it to the `mapped_size` variable (see code above). As we got amount of the direct mapped memory, we check that size of the `initrd` is not greater than mapped pages. If it is greater we just call `panic` which halts the system and prints popular [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output:
+You can see here that we call the `memblock_mem_size` function and pass `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest direct mapped page frame number. If you do not remember what a `page frame number` is, the explanation is simple: the first `12` bits of the virtual address represent the offset in the physical page or page frame. If we right-shift the virtual address by `12` bits, we discard the offset part and get the `Page Frame Number`. In `memblock_mem_size` we go through all memblock `mem` (not reserved) regions, calculate the size of the mapped pages and return it in the `mapped_size` variable (see the code above). As we got the amount of direct mapped memory, we check that the size of the `initrd` is not greater than the mapped pages. If it is greater we just call `panic` which halts the system and prints the famous [Kernel panic](http://en.wikipedia.org/wiki/Kernel_panic) message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output:
 
 ```C
 [0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]
 ```
 
-and relocate `initrd` to the direct mapping area with the `relocate_initrd` function. In the start of the `relocate_initrd` function we try to find free area with the `memblock_find_in_range` function:
+and relocate the `initrd` to the direct mapping area with the `relocate_initrd` function. At the start of the `relocate_initrd` function we try to find a free area with the `memblock_find_in_range` function:
 
 ```C
 relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);
@@ -58,7 +58,7 @@ if (!relocated_ramdisk)
 	       ramdisk_size);
 ```
 
-The `memblock_find_in_range` function tries to find free area in a given range, in our case from `0` to the maximum mapped physical address and size must equal to the aligned size of the `initrd`. If we didn't find area with the given size, we call `panic` again. If all is good, we start to relocated RAM disk to the down of the directly mapped meory in the next step.
+The `memblock_find_in_range` function tries to find a free area in a given range, in our case from `0` to the maximum mapped physical address, and the size must be equal to the aligned size of the `initrd`. If we didn't find an area with the given size, we call `panic` again. If all is good, we start to relocate the RAM disk down into directly mapped memory in the next step.
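
The `area_size` passed to `memblock_find_in_range` in the snippet above is just the ramdisk size rounded up to a page boundary. A tiny user-space sketch (not kernel code) of that rounding, assuming `4` KiB pages and an arbitrary example size:

```C
#include <stdio.h>

#define PAGE_SIZE     4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long ramdisk_size = 9816543UL;   /* an arbitrary example size in bytes */

	/* Rounds up to the next multiple of PAGE_SIZE. */
	printf("area_size = %lu\n", PAGE_ALIGN(ramdisk_size));

	return 0;
}
```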
 
 At the end of the `reserve_initrd` function, we free the memblock memory occupied by the ramdisk with the call of:
 
@@ -66,9 +66,9 @@ In the end of the `reserve_initrd` function, we free memblock memory which occup
 memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
 ```
 
-After we relocated `initrd` ramdisk image, the next function is `vsmp_init` from the [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsmp_64.c). This function initializes support of the `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover non-related `x86_64` initialization parts (for example as the current or `ACPI` and etc...). So we will miss implementation of this for now and will back to it in the part which will cover techniques of parallel computing.
+After we have relocated the `initrd` ramdisk image, the next function is `vsmp_init` from [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsmp_64.c). This function initializes support of `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover the parts of the `x86_64` initialization that are not related to the main flow (such as this one, `ACPI`, etc.). So we will skip the implementation of this for now and come back to it in the part which covers techniques of parallel computing.
 
-The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c). This function allows to override default default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation:
+The next function is `io_delay_init` from [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/io_delay.c). This function allows overriding the default I/O delay port `0x80`. We already saw the I/O delay in the [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look at the `io_delay_init` implementation:
 
 ```C
 void __init io_delay_init(void)
@@ -127,7 +127,7 @@ The next functions are `acpi_boot_table_init`, `early_acpi_boot_init` and `initm
 Allocate area for DMA
 --------------------------------------------------------------------------------
 
-In the next step we need to allocate area for the [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) with the `dma_contiguous_reserve` function which defined in the [drivers/base/dma-contiguous.c](https://github.com/torvalds/linux/blob/master/drivers/base/dma-contiguous.c). `DMA` area is a special mode when devices comminicate with memory without CPU. Note that we pass one parameter - `max_pfn_mapped << PAGE_SHIFT`, to the `dma_contiguous_reserve` function and as you can understand from this expression, this is limit of the reserved memory. Let's look on the implementation of this function. It starts from the definition of the following variables:
+In the next step we need to allocate an area for [Direct memory access](http://en.wikipedia.org/wiki/Direct_memory_access) with the `dma_contiguous_reserve` function which is defined in [drivers/base/dma-contiguous.c](https://github.com/torvalds/linux/blob/master/drivers/base/dma-contiguous.c). `DMA` is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter - `max_pfn_mapped << PAGE_SHIFT` - to the `dma_contiguous_reserve` function and, as you can understand from this expression, this is the limit of the reserved memory. Let's look at the implementation of this function. It starts with the definition of the following variables:
 
 ```C
 phys_addr_t selected_size = 0;
@@ -178,18 +178,18 @@ As we calculated the size of the reserved area, we reserve area with the call of
 ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
 ```
 
-function. The `cma_declare_contiguous` reserves contiguous area from the given base address and with given size. After we reserved area for the `DMA`, next function is the `memblock_find_dma_reserve`. As you can understand from its name, this function counts the reserved pages in the `DMA` area. This part will not cover all details of the `CMA` and `DMA`, because they are big. We will see much more details in the special part in the Linux Kernel Memory management which covers contiguous memory allocators and areas.
+function. The `cma_declare_contiguous` function reserves a contiguous area from the given base address with the given size. After we have reserved the area for the `DMA`, the next function is `memblock_find_dma_reserve`. As you can understand from its name, this function counts the reserved pages in the `DMA` area. This part will not cover all the details of `CMA` and `DMA`, because they are big topics. We will see much more details in the special part of the Linux Kernel Memory management which covers contiguous memory allocators and areas.
 
 Initialization of the sparse memory
 --------------------------------------------------------------------------------
 
-The next step is the call of the function - `x86_init.paging.pagetable_init`. If you will try to find this function in the linux kernel source code, in the end of your search, you will see the following macro:
+The next step is the call of the function - `x86_init.paging.pagetable_init`. If you try to find this function in the linux kernel source code, at the end of your search you will see the following macro:
 
 ```C
 #define native_pagetable_init        paging_init
 ```
 
-which expands as you can see to the call of the `paging_init` function from the [arch/x86/mm/init_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init_64.c). The `paging_init` function initializes sparse memory and zone sizes. First of all what's zones and what is it `Sparsemem`. The `Sparsemem` is a special foundation in the linux kernen memory manager which used to split memory area to the different memory banks in the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. Let's look on the implementation of the `paginig_init` function:
+which, as you can see, expands to the call of the `paging_init` function from [arch/x86/mm/init_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init_64.c). The `paging_init` function initializes sparse memory and zone sizes. First of all, what are zones and what is `Sparsemem`? `Sparsemem` is a special foundation in the linux kernel memory manager which is used to split a memory area into different memory banks in [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. Let's look at the implementation of the `paging_init` function:
 
 ```C
 void __init paging_init(void)
@@ -205,7 +205,7 @@ void __init paging_init(void)
 }
 ```
 
-As you can see there is call of the `sparse_memory_present_with_active_regions` function which records a memory area for every `NUMA` node to the array of the `mem_section` structure which contains a pointer to the structure of the array of `struct page`. The next `sparse_init` function allocates non-linear `mem_section` and `mem_map`. In the next step we clear state of the movable memory nodes and initialize sizes of zones. Every `NUMA` node is devided into a number of pieces which are called - `zones`. So, `zone_sizes_init` function from the [arch/x86/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init.c) initializes size of zones.
+As you can see there is a call of the `sparse_memory_present_with_active_regions` function which records a memory area for every `NUMA` node to the array of `mem_section` structures which contain a pointer to the structure of the array of `struct page`. The next `sparse_init` function allocates the non-linear `mem_section` and `mem_map`. In the next step we clear the state of the movable memory nodes and initialize the sizes of zones. Every `NUMA` node is divided into a number of pieces which are called `zones`. So, the `zone_sizes_init` function from [arch/x86/mm/init.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/init.c) initializes the sizes of the zones.
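
For orientation, the zone types that `zone_sizes_init` deals with look roughly like the following simplified sketch of the `zone_type` enum; the real definition in include/linux/mmzone.h is wrapped in configuration options such as `CONFIG_ZONE_DMA` and `CONFIG_HIGHMEM`, so this is only an approximation:

```C
/* Simplified sketch; see include/linux/mmzone.h for the real,
 * config-dependent definition. */
enum zone_type {
	ZONE_DMA,	/* low memory usable by legacy DMA devices */
	ZONE_DMA32,	/* memory below 4 GB for 32-bit capable DMA */
	ZONE_NORMAL,	/* normally addressable pages */
	ZONE_MOVABLE,	/* pages that may be migrated or offlined */
	__MAX_NR_ZONES
};
```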
 
 Again, this part and the next parts do not cover this topic in full detail. There will be a special part about `NUMA`.
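 
 Just to give a rough intuition of the sparse memory model, here is a tiny self-contained sketch (the structure and field names are simplified stand-ins, not the kernel's definitions): memory is split into sections, every section owns its own array of `struct page`, and a page frame number is resolved through its section:
 
 ```C
 #include <stdio.h>
 
 #define SECTIONS          4
 #define PAGES_PER_SECTION 8
 
 struct page { unsigned long flags; };
 
 /* toy stand-in for the kernel's mem_section */
 struct toy_mem_section {
 	struct page *section_mem_map;
 };
 
 int main(void)
 {
 	static struct page pages[SECTIONS][PAGES_PER_SECTION];
 	struct toy_mem_section sections[SECTIONS];
 
 	for (int i = 0; i < SECTIONS; i++)
 		sections[i].section_mem_map = pages[i];
 
 	/* a pfn -> struct page lookup goes through the owning section */
 	unsigned long pfn = 13;
 	struct page *p = &sections[pfn / PAGES_PER_SECTION]
 			  .section_mem_map[pfn % PAGES_PER_SECTION];
 
 	printf("pfn %lu -> struct page at %p\n", pfn, (void *)p);
 	return 0;
 }
 ```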
 
@@ -222,7 +222,7 @@ if (boot_cpu_data.cpuid_level >= 0) {
 }
 ```
 
-The next function which you can see is `map_vsyscal` from the [arch/x86/kernel/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_64.c). This function maps memory space for [vsyscalls](https://lwn.net/Articles/446528/) and depends on `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually `vsyscall` is a special segment which provides fast access to the certain system calls like `getcpu` and etc... Let's look on implementation of this function:
+The next function which you can see is `map_vsyscall` from the [arch/x86/kernel/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_64.c). This function maps memory space for [vsyscalls](https://lwn.net/Articles/446528/) and depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option. Actually `vsyscall` is a special segment which provides fast access to certain system calls like `getcpu`, etc. Let's look at the implementation of this function:
 
 ```C
 void __init map_vsyscall(void)
@@ -241,7 +241,7 @@ void __init map_vsyscall(void)
 }
 ```
 
-In the beginning of the `map_vsyscal` we can see definition of two variables. The first is extern valirable `__vsyscall_page`. As variable extern, it defined somewhere in other source code file. Actually we can see definition of the `__vsyscall_page` in the [arch/x86/kernel/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_emu_64.S). The `__vsyscall_page` symbol points to the aligned calls of the `vsyscalls` as `gettimeofday` and etc...:
+In the beginning of the `map_vsyscall` we can see the definition of two variables. The first is the extern variable `__vsyscall_page`. As an extern variable, it is defined somewhere in another source code file. Actually we can see the definition of the `__vsyscall_page` in the [arch/x86/kernel/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vsyscall_emu_64.S). The `__vsyscall_page` symbol points to the aligned calls of the `vsyscalls` such as `gettimeofday`, etc.:
 
 ```assembly
 	.globl __vsyscall_page
@@ -262,7 +262,7 @@ __vsyscall_page:
     ...
 ```
 
-The second variable is `physaddr_vsyscall` which just stores physical address of the `__vsyscall_page` symbol. In the next step we check the `vsyscall_mode` variable, and if it is not equal to `NONE` which is `EMULATE` by default:
+The second variable is `physaddr_vsyscall` which just stores the physical address of the `__vsyscall_page` symbol. In the next step we check the `vsyscall_mode` variable, which is `EMULATE` by default, and if it is not equal to `NONE`:
 
 ```C
 static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
@@ -289,26 +289,26 @@ void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
 }
 ```
 
-Here we can see that `native_set_fixmap` makes value of `Page Table Entry` from the given physical address (physical address of the `__vsyscall_page` symbol in our case) and calls internal function - `__native_set_fixmap`. Internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that given index is not greated than end of the fix-mapped addresses. After this we set page table entry with the call of the `set_pte_vaddr` function and increase count of the fix-mapped addresses. And in the end of the `map_vsyscall` we check that virtual address of the `VSYSCALL_PAGE` (which is first index in the `fixed_addresses`) is not greater than `VSYSCALL_ADDR` which is `-10UL << 20` or `ffffffffff600000` with the `BUILD_BUG_ON` macro:
+Here we can see that `native_set_fixmap` makes a `Page Table Entry` value from the given physical address (the physical address of the `__vsyscall_page` symbol in our case) and calls the internal function - `__native_set_fixmap`. The internal function gets the virtual address of the given `fixed_addresses` index (`VSYSCALL_PAGE` in our case) and checks that the given index is not greater than the end of the fix-mapped addresses. After this we set the page table entry with the call of the `set_pte_vaddr` function and increase the count of the fix-mapped addresses. And in the end of the `map_vsyscall` we check with the `BUILD_BUG_ON` macro that the virtual address of the `VSYSCALL_PAGE` (which is the first index in the `fixed_addresses`) is equal to `VSYSCALL_ADDR` which is `-10UL << 20` or `ffffffffff600000`:
 
 ```C
 BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                      (unsigned long)VSYSCALL_ADDR);
 ```
 
-Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). More about `vsyscalls` we will see in the `vsyscalls and vdso` part.
+Now the `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`. If you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.
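 
 As a rough illustration of how a fix-mapped index turns into a virtual address, here is a tiny sketch of the "top minus index times page size" idea. The constants are assumed for this example only and should not be read as the kernel's real fixmap layout:
 
 ```C
 #include <stdio.h>
 
 #define PAGE_SHIFT  12
 #define FIXADDR_TOP 0xffffffffff600000UL   /* assumed top for this example */
 
 /* toy version of the fix_to_virt() idea: fixed slots grow downwards */
 static unsigned long toy_fix_to_virt(unsigned int idx)
 {
 	return FIXADDR_TOP - ((unsigned long)idx << PAGE_SHIFT);
 }
 
 int main(void)
 {
 	/* index 0 maps right at the top of the fix-mapped area */
 	printf("idx 0 -> %#lx\n", toy_fix_to_virt(0));
 	printf("idx 1 -> %#lx\n", toy_fix_to_virt(1));
 	return 0;
 }
 ```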
 
 Getting the SMP configuration
 --------------------------------------------------------------------------------
 
-You can remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configurtaion if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function:
+You may remember how we searched for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check the `smp_found_config` variable which we set in the `smp_scan_config` function (read about it in the previous part) and call the `get_smp_config` function:
 
 ```C
 if (smp_found_config)
 	get_smp_config();
 ```
 
-The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). This function defines pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks:
+The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks:
 
 ```C
 struct mpf_intel *mpf = mpf_found;
@@ -320,12 +320,12 @@ if (acpi_lapic && early)
    return;
 ```
 
-Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check check that it is early. And as we did this checks, we start to read the `SMP` configuration. As we finished to read it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPUs `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
+Here we check that the multiprocessor configuration was found in the `smp_scan_config` function, otherwise we just return from the function. The next check is `acpi_lapic` and `early`. As we have done these checks, we start to read the `SMP` configuration. When we have finished reading it, the next step is the `prefill_possible_map` function which makes a preliminary filling of the possible CPUs `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
 
 The rest of the setup_arch
 --------------------------------------------------------------------------------
 
-Here we are getting to the end of the `setup_arch` function. The rest function of course make important stuff, but details about these stuff will not will not be included in this part. We will just take a short look on these functions, because although they are important as I wrote above, but they cover non-generic kernel features related with the `NUMA`, `SMP`, `ACPI` and `APICs` and etc... First of all, the next call of the `init_apic_mappings` function. As we can understand this function sets the address of the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The next is `x86_io_apic_ops.init` and this function initializes I/O APIC. Please note that all details related with `APIC`, we will see in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like `DMA`, `TIMER`, `FPU` and etc..., with the call of the `x86_init.resources.reserve_resources` function. Following is `mcheck_init` function initializes `Machine check Exception` and the last is `register_refined_jiffies` which registers [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) (There will be separate chapter about timers in the kernel).
+Here we are getting to the end of the `setup_arch` function. The remaining functions are of course important, but details about them will not be included in this part. We will just take a short look at these functions, because although they are important as I wrote above, they cover non-generic kernel features related to `NUMA`, `SMP`, `ACPI`, `APICs`, etc. First of all, there is the call of the `init_apic_mappings` function. As we can understand, this function sets the address of the local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The next is `x86_io_apic_ops.init` and this function initializes the I/O APIC. Please note that we will see all details related to the `APIC` in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like `DMA`, `TIMER`, `FPU`, etc., with the call of the `x86_init.resources.reserve_resources` function. Following that, the `mcheck_init` function initializes `Machine check Exception` handling and the last is `register_refined_jiffies` which registers a [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) (there will be a separate chapter about timers in the kernel).
 
 So that's all. Finally we have finished with the big `setup_arch` function in this part. Of course, as I already wrote many times, we did not see the full details of this function, but do not worry about it. We will come back to this function more than once from different chapters to understand how different platform-dependent parts are initialized.
 
@@ -334,7 +334,7 @@ That's all, and now we can back to the `start_kernel` from the `setup_arch`.
 Back to the main.c
 ================================================================================
 
-As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As you can remember or even you saw yourself, `start_kernel` function is very big too as the `setup_arch`. So the couple of the next part will be dedicated to the learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask]((http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)) pointer to the memory descriptor `cpumask`. We can look on its implementation:
+As I wrote above, we have finished with the `setup_arch` function and now we can return to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As you may remember or saw yourself, the `start_kernel` function is as big as the `setup_arch`. So the next couple of parts will be dedicated to learning about this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) pointer to the memory descriptor `cpumask`. We can look at its implementation:
 
 ```C
 static inline void mm_init_cpumask(struct mm_struct *mm)
@@ -346,7 +346,7 @@ static inline void mm_init_cpumask(struct mm_struct *mm)
 }
 ```
 
-As you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we passed memory descriptor of the init process to the `mm_init_cpumask` and here depend on `CONFIG_CPUMASK_OFFSTACK` configuration option we set or clear [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) switch `cpumask`.
+As you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we pass the memory descriptor of the init process to the `mm_init_cpumask` and, depending on the `CONFIG_CPUMASK_OFFSTACK` configuration option, we clear the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) switch `cpumask`.
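 
 The idea is simple: when the cpumask is embedded in the memory descriptor (no off-stack allocation), initializing it just means zeroing those bits. A minimal sketch with made-up names (this is not the kernel's `mm_struct`):
 
 ```C
 #include <stdio.h>
 #include <string.h>
 
 #define NR_CPUS 64
 
 struct toy_mm {
 	/* ... other memory-descriptor fields ... */
 	unsigned long cpu_bitmap[NR_CPUS / (8 * sizeof(unsigned long))];
 };
 
 static void toy_mm_init_cpumask(struct toy_mm *mm)
 {
 	/* clear the embedded cpumask of the memory descriptor */
 	memset(mm->cpu_bitmap, 0, sizeof(mm->cpu_bitmap));
 }
 
 int main(void)
 {
 	struct toy_mm init_mm = { .cpu_bitmap = { ~0UL } };
 
 	toy_mm_init_cpumask(&init_mm);
 	printf("first word after clear: %#lx\n", init_mm.cpu_bitmap[0]);
 	return 0;
 }
 ```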
 
 In the next step we can see the call of the following function:
 
@@ -360,7 +360,7 @@ This function takes pointer to the kernel command line allocates a couple of buf
 * `initcall_command_line` - will contain the boot command line and will be used in the `do_initcall_level`;
 * `static_command_line` - will contain the command line for parameter parsing.
 
-We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid` which allocates boot memory block with `memblock_reserve` if [slab](http://en.wikipedia.org/wiki/Slab_allocation) is not available or uses `kzalloc_node` (more about it will be in the linux memory management chapter). The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (physicall address of the `(PAGE_OFFSET + 0x1000000)` value) and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of the `memblock.current_limit`) as minimum address of the memory egion and maximum address of the memory region.
+We will allocate space with the `memblock_virt_alloc` function. This function calls `memblock_virt_alloc_try_nid` which allocates a boot memory block with `memblock_reserve` if [slab](http://en.wikipedia.org/wiki/Slab_allocation) is not available or uses `kzalloc_node` (more about it will be in the linux memory management chapter). The `memblock_virt_alloc` uses `BOOTMEM_LOW_LIMIT` (the physical address of the `(PAGE_OFFSET + 0x1000000)` value) and `BOOTMEM_ALLOC_ACCESSIBLE` (equal to the current value of the `memblock.current_limit`) as the minimum and maximum addresses of the memory region.
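 
 To get a feeling for what "allocate between a low limit and the current limit" means, here is a tiny bump-allocator sketch. The limits and the helper name are invented for the example and have nothing to do with the real `memblock` implementation:
 
 ```C
 #include <stdio.h>
 #include <stddef.h>
 #include <stdint.h>
 
 #define BOOT_LOW_LIMIT     0x01000000UL   /* made-up lower bound */
 #define BOOT_CURRENT_LIMIT 0x02000000UL   /* made-up upper bound */
 
 static uintptr_t next_free = BOOT_LOW_LIMIT;
 
 /* hand out aligned ranges between the two limits, never freeing anything */
 static uintptr_t boot_alloc(size_t size, size_t align)
 {
 	uintptr_t addr = (next_free + align - 1) & ~(uintptr_t)(align - 1);
 
 	if (addr + size > BOOT_CURRENT_LIMIT)
 		return 0;                 /* out of boot memory */
 	next_free = addr + size;
 	return addr;
 }
 
 int main(void)
 {
 	printf("cmdline buffer at %#lx\n", (unsigned long)boot_alloc(4096, 64));
 	return 0;
 }
 ```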
 
 Let's look at the implementation of the `setup_command_line`:
 
@@ -377,7 +377,7 @@ static void __init setup_command_line(char *command_line)
  }
  ```
 
-Here we can see that we allocate space for the three buffers which will contain kernel command line for the different purposes (read above). And as we allocated space, we storing `boot_comand_line` in the `saved_command_line` and `command_line` (kernel command line from the `setup_arch` to the `static_command_line`).
+Here we can see that we allocate space for three buffers which will contain the kernel command line for different purposes (read above). As we have allocated the space, we store `boot_command_line` in the `saved_command_line` and `command_line` (the kernel command line from the `setup_arch`) in the `static_command_line`.
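 
 In other words, it is just a pair of allocations followed by two string copies. A hedged, user-space sketch of the same idea (buffer names and sizes are illustrative only):
 
 ```C
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 
 static char boot_command_line[] = "root=/dev/sda1 ro quiet";
 static char *saved_command_line;
 static char *static_command_line;
 
 /* toy version of the idea behind setup_command_line(): allocate and copy */
 static void toy_setup_command_line(const char *command_line)
 {
 	saved_command_line  = malloc(strlen(boot_command_line) + 1);
 	static_command_line = malloc(strlen(command_line) + 1);
 
 	strcpy(saved_command_line, boot_command_line);
 	strcpy(static_command_line, command_line);
 }
 
 int main(void)
 {
 	toy_setup_command_line(boot_command_line);
 	printf("saved:  %s\nstatic: %s\n", saved_command_line, static_command_line);
 	return 0;
 }
 ```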
 
 The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function sets `nr_cpu_ids` (the number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describing the [cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) concept). Let's look at its implementation:
 
@@ -395,7 +395,7 @@ Here `nr_cpu_ids` represents number of CPUs, `NR_CPUS` represents the maximum nu
 Actually we need to call this function, because `NR_CPUS` can be greater than the actual number of CPUs in your computer. Here we can see that we call the `find_last_bit` function and pass two parameters to it:
 
 * `cpu_possible_mask` bits;
-* maximim number of CPUS.
+* maximum number of CPUs.
 
 In the `setup_arch` we can find the call of the `prefill_possible_map` function which calculates and writes to the `cpu_possible_mask` the actual number of CPUs. We call the `find_last_bit` function which takes the address and the maximum size to search and returns the bit number of the last set bit. We passed the `cpu_possible_mask` bits and the maximum number of CPUs. First of all the `find_last_bit` function splits the given `unsigned long` address into [words](http://en.wikipedia.org/wiki/Word_%28computer_architecture%29):
 
@@ -451,18 +451,18 @@ Here we put the last word to the `tmp` variable and check that `tmp` contains at
 return size;
 ```
 
-After this `nr_cpu_ids` will contain the correct amount of the avaliable CPUs.
+After this `nr_cpu_ids` will contain the correct number of available CPUs.
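 
 Putting it together, a self-contained re-implementation of the idea (a sketch, not the kernel's optimized code) shows why `nr_cpu_ids` ends up as "index of the last possible CPU plus one":
 
 ```C
 #include <stdio.h>
 
 #define BITS_PER_LONG (8 * sizeof(unsigned long))
 
 /* sketch of find_last_bit(): return the index of the highest set bit
  * below `size`, or `size` itself if no bit is set */
 static unsigned long toy_find_last_bit(const unsigned long *addr,
                                        unsigned long size)
 {
 	unsigned long idx = (size + BITS_PER_LONG - 1) / BITS_PER_LONG;
 
 	while (idx--) {
 		unsigned long word = addr[idx];
 
 		/* mask off bits above `size` in a partial last word */
 		if ((idx + 1) * BITS_PER_LONG > size)
 			word &= (1UL << (size % BITS_PER_LONG)) - 1;
 
 		if (word) {
 			unsigned long bit = BITS_PER_LONG - 1;
 			while (!(word & (1UL << bit)))
 				bit--;
 			return idx * BITS_PER_LONG + bit;
 		}
 	}
 	return size;
 }
 
 int main(void)
 {
 	unsigned long cpu_possible[2] = { 0 };
 
 	cpu_possible[0] |= 0xfUL;      /* CPUs 0-3 are possible */
 	printf("nr_cpu_ids = %lu\n",
 	       toy_find_last_bit(cpu_possible, sizeof(cpu_possible) * 8) + 1);
 	return 0;
 }
 ```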
 
 That's all.
 
 Conclusion
 ================================================================================
 
-It is the end of the seventh part about the linux kernel initialization process. In this part, finally we have finsihed with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn generic kernel code from the `start_kernel` and will continue our way to the first `init` process.
+It is the end of the seventh part about the linux kernel initialization process. In this part, finally we have finished with the `setup_arch` function and returned to the `start_kernel` function. In the next part we will continue to learn generic kernel code from the `start_kernel` and will continue our way to the first `init` process.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 ================================================================================

+ 36 - 38
Initialization/linux-initialization-8.md

@@ -4,7 +4,7 @@ Kernel initialization. Part 8.
 Scheduler initialization
 ================================================================================
 
-This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process and we stopped on the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function setups areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After `percpu` areas up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing):
+This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process and we stopped at the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we start to learn the initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function sets up areas for the `percpu` variables; more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After the `percpu` areas are up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing):
 
 ```C
 static inline void smp_prepare_boot_cpu(void)
@@ -25,7 +25,7 @@ void __init native_smp_prepare_boot_cpu(void)
 }
 ```
 
-The `native_smp_prepare_boot_cpu` function gets the number of the current CPU (which is Bootstrap processor and its `id` is zero) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we alread saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
+The `native_smp_prepare_boot_cpu` function gets the id of the current CPU (which is the bootstrap processor and whose `id` is zero) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we have got the processor `id` number, we reload the [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
 
 ```C
 void switch_to_new_gdt(int cpu)
@@ -54,7 +54,7 @@ static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
 }
 ```
 
-The `get_cpu_gdt_table` uses `per_cpu` macro for getting `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). You can ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we alread saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S):
+The `get_cpu_gdt_table` uses the `per_cpu` macro to get the `gdt_page` percpu variable for the given CPU number (the bootstrap processor with `id` - 0 in our case). You may ask the following question: if we can access the `gdt_page` percpu variable, where was it defined? Actually we already saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you may remember that we saw the definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S):
 
 ```assembly
 early_gdt_descr:
@@ -86,7 +86,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
     ...
 ```
 
-more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got address and size of the `GDT` descriptor we case reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
+more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we have got the address and the size of the `GDT` descriptor, we reload the `GDT` with the `load_gdt` which just executes the `lgdt` instruction, and load the `percpu_segment` with the following function:
 
 ```C
 void load_percpu_segment(int cpu) {
@@ -103,19 +103,19 @@ cpumask_set_cpu(me, cpu_callout_mask);
 per_cpu(cpu_state, me) = CPU_ONLINE;
 ```
 
-So, what is it `cpu_callout_mask` bitmap... As we initialized bootstrap processor (procesoor which is booted the first on `x86`) the other processors in a multiprocessor system are known as `secondary processors`. Linux kernel uses two following bitmasks:
+So, what is the `cpu_callout_mask` bitmap? As we have initialized the bootstrap processor (the processor which boots first on `x86`), the other processors in a multiprocessor system are known as `secondary processors`. The Linux kernel uses the following two bitmasks:
 
 * `cpu_callout_mask`
 * `cpu_callin_mask`
 
-After bootstrap processor initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff before and check the `cpu_callout_mask` on the boostrap processor bit. Only after the bootstrap processor filled the `cpu_callout_mask` this secondary processor, it will continue the rest of its initialization. After that the certain processor will finish its initialization process, the processor sets bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, this processor repeats the same procedure for initialization of the rest of a secondary processors. In a short words it works as i described, but more details we will see in the chapter about `SMP`.
+After the bootstrap processor is initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff beforehand and check the `cpu_callout_mask` for the bit set by the bootstrap processor. Only after the bootstrap processor has set this secondary processor's bit in the `cpu_callout_mask` will that secondary processor continue the rest of its initialization. After a certain processor finishes its initialization process, it sets its bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, this processor repeats the same procedure for initialization of one of the remaining secondary processors. In short words it works as I described, but we will see more details in the chapter about `SMP`.
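 
 A compressed, sequential sketch of this handshake may help (the helper names are made up for the example; the real code naturally runs on different CPUs in parallel):
 
 ```C
 #include <stdio.h>
 
 #define MAX_CPUS 4
 
 static unsigned long cpu_callout_mask;
 static unsigned long cpu_callin_mask;
 
 static void mask_set(unsigned long *mask, int cpu) { *mask |= 1UL << cpu; }
 static int  mask_test(unsigned long mask, int cpu) { return (mask >> cpu) & 1; }
 
 int main(void)
 {
 	/* CPU 0 (the bootstrap processor) lets the secondaries in one by one */
 	for (int cpu = 1; cpu < MAX_CPUS; cpu++) {
 		mask_set(&cpu_callout_mask, cpu);   /* "you may initialize now" */
 
 		/* a secondary CPU would spin until its callout bit appears,
 		 * finish its own setup and then announce itself: */
 		if (mask_test(cpu_callout_mask, cpu))
 			mask_set(&cpu_callin_mask, cpu);
 
 		/* the BSP waits for the callin bit before handling the next CPU */
 		printf("cpu %d called in: %d\n", cpu, mask_test(cpu_callin_mask, cpu));
 	}
 	return 0;
 }
 ```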
         
 That's all. We did all `SMP` boot preparation.
 
 Build zonelists
 -----------------------------------------------------------------------
 
-In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. What are zones and what's order we will understand now. For the start let's see how linux kernel considers physical memory. Physical memory may be arranged into banks which are called - `nodes`. If you has no hardware with support for `NUMA`, you will see only one node:
+In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. What zones are and what this order is we will understand soon. To start, let's see how the linux kernel considers physical memory. Physical memory is split into banks which are called - `nodes`. If you have no hardware support for `NUMA`, you will see only one node:
 
 ```
 $ cat /sys/devices/system/node/node0/numastat 
@@ -127,7 +127,7 @@ local_node 72452442
 other_node 0
 ```
 
-Every `node` presented by the `struct pglist data` in the linux kernel. Each node devided into a number of special blocks which are called - `zones`. Every zone presented by the `zone struct` in the linux kernel and has one of the type:
+Every `node` is presented by the `struct pglist_data` in the linux kernel. Each node is divided into a number of special blocks which are called - `zones`. Every zone is presented by the `zone struct` in the linux kernel and has one of the following types:
 
 * `ZONE_DMA` - 0-16M;
 * `ZONE_DMA32` - used for 32 bit devices that can only do DMA areas below 4G;
@@ -135,7 +135,7 @@ Every `node` presented by the `struct pglist data` in the linux kernel. Each nod
 * `ZONE_HIGHMEM` - absent on the `x86_64`;
 * `ZONE_MOVABLE` - zone which contains movable pages.
 
-which are presented by the `zone_type` enum. Information about zones we can get with the:
+which are presented by the `zone_type` enum. We can get information about zones with:
 
 ```
 $ cat /proc/zoneinfo
@@ -159,12 +159,12 @@ Node 0, zone   Normal
         ...
 ```
 
-As I wrote above all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure defined in the [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part.
+As I wrote above all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure is defined in the [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part.
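 
 The fallback idea behind a `zonelist` can be shown with a toy example (the zone names match the list above, everything else is invented for illustration): try the preferred zone first and walk down to "lower" zones when it cannot satisfy the request:
 
 ```C
 #include <stdio.h>
 
 enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES };
 
 static const char *zone_names[NR_ZONES] = { "DMA", "DMA32", "Normal" };
 static long free_pages[NR_ZONES] = { 16, 0, 0 };  /* only DMA has memory left */
 
 static int alloc_page_from(enum zone_type preferred)
 {
 	/* walk from the preferred zone down, like a zonelist would */
 	for (int z = preferred; z >= 0; z--) {
 		if (free_pages[z] > 0) {
 			free_pages[z]--;
 			printf("allocated from ZONE_%s\n", zone_names[z]);
 			return 0;
 		}
 	}
 	return -1;  /* no zone can satisfy the request */
 }
 
 int main(void)
 {
 	return alloc_page_from(ZONE_NORMAL);
 }
 ```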
 
 The rest of the stuff before scheduler initialization
 --------------------------------------------------------------------------------
 
-Before we will start to dive into linux kernel scheduler initialization process we must to do a couple of things. The fisrt thing is the `page_alloc_init` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty easy:
+Before we start to dive into the linux kernel scheduler initialization process, we must do a couple of things. The first thing is the `page_alloc_init` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty easy:
 
 ```C
 void __init page_alloc_init(void)
@@ -180,7 +180,7 @@ After this we can see the kernel command line in the initialization output:
 
 ![kernel command line](http://oi58.tinypic.com/2m7vz10.jpg)
 
-And a couple of functions as `parse_early_param` and `parse_args` which are handles linux kernel command line. You can remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need in the call of the second function `parse_args` to parse and handle non-early command line arguments.
+And a couple of functions such as `parse_early_param` and `parse_args` which handle the linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why do we call it again? The answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architectures call this function. And we need to call the second function `parse_args` to parse and handle non-early command line arguments.
 
 In the next step we can see the call of the `jump_label_init` from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/master/kernel/jump_label.c) which initializes [jump labels](https://lwn.net/Articles/412072/).
 
@@ -189,13 +189,13 @@ After this we can see the call of the `setup_log_buf` function which setups the
 PID hash initialization
 --------------------------------------------------------------------------------
 
-The next is `pidhash_init` function. As you know an each process has assigned unique number which called - `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` centered around the two special data structures: `struct pid` and `struct upid`. First structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances stored in the special hash table:
+The next is the `pidhash_init` function. As you know, each process is assigned a unique number which is called - `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` is centered around two special data structures: `struct pid` and `struct upid`. The first structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances are stored in a special hash table:
 
 ```C
 static struct hlist_head *pid_hash;
 ```
 
-This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash. In the start of the `pidhash_init` function we can see the call of the `alloc_large_system_hash`:
+This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash table. At the start of the `pidhash_init` function we can see the call of the `alloc_large_system_hash`:
 
 ```C
 pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
@@ -217,9 +217,9 @@ $ dmesg | grep hash
 ...
 ```
 
-That's all. The rest of the stuff before scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it will be in the chapter which will describe virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table,`, and `trap_init` initializies trap handlers (morea about last two function we will know in the separate chapter about interrupts).
+That's all. The rest of the stuff before scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it will be in the chapter which describes the virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table`, and `trap_init` initializes trap handlers (more about the last two functions we will learn in the separate chapter about interrupts).
 
-The last step before the scheduler initialization is initialization of the memory manager with the `mm_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As we can see, the `mm_init` function initializes different part of the linux kernel memory manager:
+The last step before the scheduler initialization is initialization of the memory manager with the `mm_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As we can see, the `mm_init` function initializes different parts of the linux kernel memory manager:
 
 ```C
 page_ext_init_flatmem();
@@ -230,14 +230,14 @@ pgtable_init();
 vmalloc_init();
 ```
 
-The first is `page_ext_init_flatmem` depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initilizes the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter.
+The first is `page_ext_init_flatmem` which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initializes the `page->ptl` kernel cache, the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of them in the [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter.
 
 That's all. Now we can look on the `scheduler`.
 
 Scheduler initialization
 --------------------------------------------------------------------------------
 
-And now we came to the main purpose of this part - initialization of the task scheduler. I want to say again as I did it already many times, you will not see the full explanation of the scheduler here, there will be special chapter about this. Ok, next point is the `sched_init` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) and as we can understand from the function's name, it initializes scheduler. Let's start to dive in this function and try to understand how the scheduler initialized. At the start of the `sched_init` function we can see the following code:
+And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again, as I already have many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about this. Ok, the next point is the `sched_init` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) and as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the `sched_init` function we can see the following code:
 
 ```C
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -253,7 +253,7 @@ First of all we can see two configuration options here:
 * `CONFIG_FAIR_GROUP_SCHED`
 * `CONFIG_RT_GROUP_SCHED`
 
-Both of this options provide two different planning models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` used a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` processor time, where `n` is the number of the runnable processes. The scheduler uses the special set of rules used. These rules determine when and how to select a new process to run and they are called `scheduling policy`. The Completely Fair Scheduler supports following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. The `SCHED_NORMAL` is used for the most normal applications, the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value, the `SCHED_BATCH` used for the 100% non-interactive tasks and the `SCHED_IDLE` runs tasks only when the processor has not to run anything besides this task. The `real-time` policies are also supported for the time-critial applications: `SCHED_FIFO` and `SCHED_RR`. If you've read something about the Linux kernel scheduler, you can know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them. 
+Both of these options provide two different planning models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` processor time, where `n` is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run and they are called the `scheduling policy`. The Completely Fair Scheduler supports the following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. The `SCHED_NORMAL` is used for most normal applications, the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value, the `SCHED_BATCH` is used for 100% non-interactive tasks and the `SCHED_IDLE` runs tasks only when the processor has no task to run besides this task. The `real-time` policies are also supported for time-critical applications: `SCHED_FIFO` and `SCHED_RR`. If you've read something about the Linux kernel scheduler, you may know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without knowing too much about them.
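 
 The "ideal `1/n`" idea can be shown with a tiny toy model. This is only an illustration of the concept, not the kernel's CFS, which uses a weighted virtual runtime and a red-black tree: always run the task that has received the least CPU time so far:
 
 ```C
 #include <stdio.h>
 
 struct toy_task { const char *name; double vruntime; };
 
 int main(void)
 {
 	struct toy_task tasks[] = { {"A", 0}, {"B", 0}, {"C", 0} };
 	const int n = 3, slice_ms = 10;
 
 	for (int tick = 0; tick < 9; tick++) {
 		/* pick the task with the smallest accumulated runtime, like
 		 * picking the leftmost node of the CFS red-black tree */
 		int next = 0;
 		for (int i = 1; i < n; i++)
 			if (tasks[i].vruntime < tasks[next].vruntime)
 				next = i;
 
 		tasks[next].vruntime += slice_ms;
 		printf("tick %d: run %s\n", tick, tasks[next].name);
 	}
 	return 0;
 }
 ```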
 
 
 Now let's get back to our code and look at the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The scheduler operates on an individual task. These options allow scheduling of group tasks (more about it you can read in the [CFS group scheduling](http://lwn.net/Articles/240474/)). We can see that we assign to the `alloc_size` variable, which represents the size to allocate for the `sched_entity` and `cfs_rq` based on the number of processors, the value of the `2 * nr_cpu_ids * sizeof(void **)` expression, and allocate it with `kzalloc`:
@@ -271,7 +271,7 @@ ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
         
 ```
 
-The `sched_entity` is struture which defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and used by the scheduler to keep track of process accounting. The `cfs_rq` presents [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space with size `alloc_size` for the run queue and scheduler entity of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task group related information:
+The `sched_entity` is a structure which is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and is used by the scheduler to keep track of process accounting. The `cfs_rq` represents a [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space with size `alloc_size` for the run queue and scheduler entity of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task group related information:
 
 ```C
 struct task_group {
@@ -284,7 +284,7 @@ struct task_group {
 }
 ```
 
-The root task group is the task group which belongs every task in system. As we allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (`cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable:
+The root task group is the task group to which every task in the system belongs. As we have allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (the `cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable:
 
 ```C
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -310,7 +310,7 @@ init_dl_bandwidth(&def_dl_bandwidth,
                   global_rt_period(), global_rt_runtime());
 ```
 
-we initialize bandwidth management for the `SCHED_DEADLINE` real-time tasks. These functions initializes `rt_bandwidth` and `dl_bandwidth` structures which are store information about maximum `deadline` bandwith of the system. For example, let's look on the implementation of the `init_rt_bandwidth` function:
+we initialize bandwidth management for the `SCHED_DEADLINE` real-time tasks. These functions initialize the `rt_bandwidth` and `dl_bandwidth` structures which store information about the maximum `deadline` bandwidth of the system. For example, let's look at the implementation of the `init_rt_bandwidth` function:
 
 ```C
 void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
@@ -332,7 +332,7 @@ It takes three parameters:
 * `period` - period over which real-time task bandwidth enforcement is measured in `us`;
 * `runtime` - part of the period that we allow tasks to run in `us`.
 
-As `period` and `runtime` we pass result of the `global_rt_period` and `global_rt_runtime` functions. Which are `1s` second and and `0.95s` by default. The `rt_bandwidth` structure defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks:
+As `period` and `runtime` we pass the results of the `global_rt_period` and `global_rt_runtime` functions, which are `1s` and `0.95s` by default. The `rt_bandwidth` structure is defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks like:
 
 ```C
 struct rt_bandwidth {
@@ -348,7 +348,7 @@ As you can see, it contains `runtime` and `period` and also two following fields
 * `rt_runtime_lock` - [spinlock](http://en.wikipedia.org/wiki/Spinlock) for the `rt_time` protection;
 * `rt_period_timer` - [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) for unthrottling of real-time tasks.
 
-So, in the `init_rt_bandwidth` we initialize `rt_bandwidth` period and runtime with the given parameters, initialize the spinlock and high-resolution time. In the next step, depends on the enabled [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), we make initialization of the root domain:
+So, in the `init_rt_bandwidth` we initialize the `rt_bandwidth` period and runtime with the given parameters, and initialize the spinlock and the high-resolution timer. In the next step, depending on whether [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) is enabled, we make initialization of the root domain:
 
 ```C
 #ifdef CONFIG_SMP
@@ -356,7 +356,7 @@ So, in the `init_rt_bandwidth` we initialize `rt_bandwidth` period and runtime w
 #endif
 ```
 
-The real-time scheduler requires global resources to make scheduling decision. But unfortenatelly scalability bottlenecks appear as the number of CPUs increase. The concept of root domains was introduced for improving scalability. The linux kernel provides special mechanism for assigning a set of CPUs and memory nodes to a set of task and it is called - `cpuset`. If a `cpuset` contains non-overlapping with other `cpuset` CPUs, it is `exclusive cpuset`. Each exclusive cpuset defines an isolated domain or `root domain` of CPUs partitioned from other cpusets or CPUs. A `root domain` presented by the `struct root_domain` from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about scheduling about real-time scheduler.
+The real-time scheduler requires global resources to make scheduling decisions. But unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of root domains was introduced to improve scalability. The linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks and it is called - `cpuset`. If a `cpuset` contains CPUs that do not overlap with any other `cpuset`, it is an `exclusive cpuset`. Each exclusive cpuset defines an isolated domain or `root domain` of CPUs partitioned from other cpusets or CPUs. A `root domain` is presented by the `struct root_domain` from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables; all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about the real-time scheduler.
 
 After `root domain` initialization, we make initialization of the bandwidth for the real-time tasks of the root task group as we did it above: 
 
@@ -367,7 +367,7 @@ After `root domain` initialization, we make initialization of the bandwidth for
 #endif
 ```
 
-In the next step, depends on the `CONFIG_CGROUP_SCHED` kernel configuration option we initialze the `siblings` and `children` lists of the root task group. As we can read from the documentation, the `CONFIG_CGROUP_SCHED` is:
+In the next step, depending on the `CONFIG_CGROUP_SCHED` kernel configuration option, we initialize the `siblings` and `children` lists of the root task group. As we can read from the documentation, the `CONFIG_CGROUP_SCHED` is:
 
 ```
 This option allows you to create arbitrary task groups using the "cgroup" pseudo
@@ -387,7 +387,7 @@ As we finished with the lists initialization, we can see the call of the `autogr
 
 which initializes automatic process group scheduling.
 
-After this we are going through the all `possible` cpu (you can remember that `possible` CPUs store in the `cpu_possible_mask` bitmap of possible CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu:
+After this we go through all `possible` CPUs (you may remember that `possible` CPUs are stored in the `cpu_possible_mask` bitmap of CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu:
 
 ```C
 for_each_possible_cpu(i) {
@@ -397,7 +397,7 @@ for_each_possible_cpu(i) {
     ...
 ```
 
-Each processor has its own locking and individual runqueue. All runnalble tasks are stored in an active array and indexed according to its priority. When a process consumes its time slice, it is moved to an expired array. All of these arras are stored in the special structure which names is `runqueu`. As there are no global lock and runqueu, we are going through the all possible CPUs and initialize runqueue for the every cpu. The `runque` is presented by the `rq` structure in the linux kernel which defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h).
+Each processor has its own locking and individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure which is called the `runqueue`. As there is no global lock and runqueue, we go through all possible CPUs and initialize a runqueue for every cpu. The `runqueue` is presented by the `rq` structure in the linux kernel which is defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h).
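 
 The per-CPU layout itself is simple to picture; here is a minimal sketch of the idea (field and helper names are simplified and are not the kernel's definitions):
 
 ```C
 #include <stdio.h>
 
 #define NR_CPUS 4
 
 /* toy stand-in for struct rq: each CPU gets its own lock and counters */
 struct toy_rq {
 	int lock;                /* placeholder for the per-runqueue spinlock */
 	unsigned int nr_running;
 	unsigned long cpu_load[5];
 };
 
 static struct toy_rq runqueues[NR_CPUS];
 
 #define cpu_rq(cpu) (&runqueues[(cpu)])
 
 int main(void)
 {
 	for (int i = 0; i < NR_CPUS; i++) {
 		struct toy_rq *rq = cpu_rq(i);
 
 		rq->lock = 0;
 		rq->nr_running = 0;
 		for (int j = 0; j < 5; j++)
 			rq->cpu_load[j] = 0;
 	}
 	printf("initialized %d runqueues\n", NR_CPUS);
 	return 0;
 }
 ```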
 
 ```C
 rq = cpu_rq(i);
@@ -411,7 +411,7 @@ init_dl_rq(&rq->dl);
 rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 ```
 
-Here we get the runque for the every CPU with the `cpu_rq` macto which returns `runqueues` percpu variable and start to initialize it with runqueu lock, number of running tasks, `calc_load` relative fields (`calc_load_active` and `calc_load_update`) which are used in the reckoning of a CPU load and initialization of the completely fair, real-time and deadline related fields in a runqueue. After this we initialize `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable which determines the number of time ticks (cycles), since the system boot:
+Here we get the runqueue for every CPU with the `cpu_rq` macro which returns the `runqueues` percpu variable and start to initialize it with the runqueue lock, the number of running tasks, the `calc_load` related fields (`calc_load_active` and `calc_load_update`) which are used in the reckoning of the CPU load, and initialization of the completely fair, real-time and deadline related fields in the runqueue. After this we initialize the `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable which determines the number of time ticks (cycles) since the system boot:
 
 ```C
 for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -420,14 +420,14 @@ for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
 rq->last_load_update_tick = jiffies;
 ```
 
-where `cpu_load` keeps history of runqueue loads in the past, for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill `runqueue` fields which are related to the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover they in this part. And in the end of the loop we initialize high-resolution timer for the give `runqueue` and set the `iowait` (more about it in the separate part about scheduler) number:
+where `cpu_load` keeps the history of runqueue loads in the past; for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill the `runqueue` fields which are related to [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover them in this part. And in the end of the loop we initialize the high-resolution timer for the given `runqueue` and set the `iowait` (more about it in the separate part about the scheduler) number:
 
 ```C
 init_rq_hrtick(rq);
 atomic_set(&rq->nr_iowait, 0);
 ```
 
-Now we came out from the `for_each_possible_cpu` loop and the next we need to set load weight for the `init` task with the `set_load_weight` function.  Weight of process is calculated through its dynamic priority which is static priority + scheduling class of the process. After this we increase memory usage counter of the memory descriptor of the `init` process and set scheduler class for the current process:
+Now we come out from the `for_each_possible_cpu` loop and next we need to set the load weight for the `init` task with the `set_load_weight` function. The weight of a process is calculated through its dynamic priority which is the static priority + the scheduling class of the process. After this we increase the memory usage counter of the memory descriptor of the `init` process and set the scheduler class for the current process:
 
 ```C
 atomic_inc(&init_mm.mm_count);
@@ -447,18 +447,16 @@ So, the `init` process will be run, when there will be no other candidates (as i
 scheduler_running = 1;
 ```
 
-That's all. Linux kernel scheduler is initialized. Of course, we missed many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu and etc...) works in the linux kernel , but we took a short look on the scheduler initialization process. All other details we will look in the separate part which will be fully dedicated to the scheduler. 
+That's all. The Linux kernel scheduler is initialized. Of course, we have skipped many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu, etc.) work in the linux kernel, but we took a short look at the scheduler initialization process. We will look at all other details in a separate part which will be fully dedicated to the scheduler.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the eighth part about the linux kernel initialization process. In this part, we looked on the initialization process of the scheduler and we will continue in the next part to dive in the linux kernel initialization process and will see initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many more.
+It is the end of the eighth part about the linux kernel initialization process. In this part, we looked at the initialization process of the scheduler and in the next part we will continue to dive into the linux kernel initialization process and will see the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many other things.
 
-and other initialization stuff in the next part.
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
-
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------
@@ -467,7 +465,7 @@ Links
 * [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt)
 * [spinlock](http://en.wikipedia.org/wiki/Spinlock)
 * [Run queue](http://en.wikipedia.org/wiki/Run_queue)
-* [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
+* [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
 * [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29)
 * [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system)
 * [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)

+ 35 - 35
Initialization/linux-initialization-9.md

@@ -4,12 +4,12 @@ Kernel initialization. Part 9.
 RCU initialization
 ================================================================================
 
-This is ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after the `sched_init` is the call of the `preempt_disablepreempt_disable`. There are two macros:
+This is the ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive into the linux kernel initialization process and the main purpose of this part will be to learn about the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros:
 
 * `preempt_disable`
 * `preempt_enable`
 
-for preemption disabling and enabling. First of all let's try to understand what is it `preempt` in the context of an operating system kernel. In a simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one `init` process for the early boot time and we no need to stop it before we will call `cpu_idle` function. The `preempt_disable` macro defined in the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This maco implemeted as:
+for disabling and enabling preemption. First of all let's try to understand what `preempt` is in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to preempt the current task in order to run a task with a higher priority. Here we need to disable preemption because we will have only one `init` process during early boot time and we don't need to stop it before we call the `cpu_idle` function. The `preempt_disable` macro is defined in [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This macro is implemented as:
 
 ```C
 #define preempt_disable() \
@@ -25,7 +25,7 @@ and if `CONFIG_PREEMPT_COUNT` is not set just:
 #define preempt_disable()                       barrier()
 ```
 
-Let's look on it. First of all we can see one difference between these macro implementations. The `preempt_disable` with `CONFIG_PREEMPT_COUNT` contains the call of the `preempt_count_inc`. There is special `percpu` variable which stores the number of held locks and `preempt_disable` calls:
+Let's take a look at it. First of all we can see one difference between these macro implementations. The `preempt_disable` with `CONFIG_PREEMPT_COUNT` set contains the call of the `preempt_count_inc`. There is a special `percpu` variable which stores the number of held locks and `preempt_disable` calls:
 
 ```C
 DECLARE_PER_CPU(int, __preempt_count);
@@ -38,7 +38,7 @@ In the first implementation of the `preempt_disable` we increment this `__preemp
 #define preempt_count_add(val)  __preempt_count_add(val)
 ```
 
-where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count` and th next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need in the oportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider simple example:
+where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count` in our case; more about `percpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count` and as the next step we can see the call of the `barrier` macro in both macros. The `barrier` macro inserts an optimization barrier. On processors with the `x86_64` architecture, independent memory access operations can be performed in any order. That's why we need a way to tell the compiler and the processor to preserve the ordering of certain operations. This mechanism is the memory barrier. Let's consider a simple example:
 
 ```C
 preempt_disable();
@@ -71,7 +71,7 @@ That's all. Preemption is disabled and we can go ahead.
 Initialization of the integer ID management
 --------------------------------------------------------------------------------
 
-In the next step we can see the call of the `idr_init_cache` function which defined in the [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library used in a various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id.
+In the next step we can see the call of the `idr_init_cache` function which is defined in [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library is used in various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id.
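+
+To get a feel for this API before we look at its initialization, here is a small hypothetical usage sketch (the `idr_*` functions are real, while the `my_object` structure and both helpers are made up for illustration):
+
+```C
+struct my_object {
+	int id;
+	/* ... payload ... */
+};
+
+static DEFINE_IDR(my_idr);
+
+/* bind a new unique integer ID to the given object */
+int my_object_register(struct my_object *obj)
+{
+	int id = idr_alloc(&my_idr, obj, 0, 0, GFP_KERNEL);
+
+	if (id < 0)
+		return id;	/* -ENOMEM, -ENOSPC, ... */
+
+	obj->id = id;
+	return 0;
+}
+
+/* look an object up by its ID */
+struct my_object *my_object_lookup(int id)
+{
+	return idr_find(&my_idr, id);
+}
+```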
 
 Let's look on the implementation of the `idr_init_cache` function:
 
@@ -83,7 +83,7 @@ void __init idr_init_cache(void)
 }
 ```
 
-Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case, as we are using `kmem_cache_t` it will be used the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can seee we pass five parameters to the `kmem_cache_create`:
+Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485). This function creates generalized caches again using the `kmem_cache_alloc` (we will see more about caches in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case we are using `kmem_cache_t`, which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and which `kmem_cache_create` creates. As you can see we pass five parameters to the `kmem_cache_create`:
 
 * name of the cache;
 * size of the object to store in cache;
@@ -91,13 +91,13 @@ Here we can see the call of the `kmem_cache_create`. We already called the `kmem
 * flags;
 * constructor for the objects.
 
-and it will create `kmem_cache` for the integer IDs. Integer `IDs` is commonly used pattern for the to map set of integer IDs to the set of pointers. We can see usage of the integer IDs for example in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. For example [drivers/i2c/i2c-core.c]((https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core) which presentes the core of the `i2c` subsystem defines `ID` for the `i2c` adapter with the `DEFINE_IDR` macro:
+and it will create a `kmem_cache` for the integer IDs. Integer `IDs` are a commonly used pattern to map a set of integer IDs to a set of pointers. We can see the usage of integer IDs for example in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. [drivers/i2c/i2c-core.c](https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core.c), which represents the core of the `i2c` subsystem, defines an `ID` for the `i2c` adapter with the `DEFINE_IDR` macro:
 
 ```C
 static DEFINE_IDR(i2c_adapter_idr);
 ```
 
-and than it uses it for the declaration of the `i2c` adapter:
+and then uses it for the declaration of the `i2c` adapter:
 
 ```C
 static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
@@ -127,11 +127,11 @@ The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initializa
 
 In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general.
 
-`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.
+`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. In its early days the linux kernel provided support and an environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock. Nowadays the linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and others. One of these mechanisms is the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of `RCU` is simple. For example, we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of it and make all changes in the copy. At the same time all other users of the data structure use the old version of it. Next, we need to choose a safe moment when the original version of the data structure has no users and update it with the modified copy.
 
-Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` executed in the [critical section](http://en.wikipedia.org/wiki/Critical_section). Everytime when data reader joins to the critical section, it calls the `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in state which called - `quiescent state`. Every moment when every thread was in the `quiescent state` called - `grace period`. If a thread wants to remove element from the data structure, this occurs in two steps. First steps is `removal` - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits while it will be finsihed. From this moment, the removed element is available to the thread-readers. After the `grace perioud` will be finished, the second step of the element removal will be started, it just removes element from the physical memory.
+Of course this description of `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in `RCU` are executed in a [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time a data reader enters the critical section, it calls `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If a thread is not in the critical section, it is in a state called the `quiescent state`. A period during which every thread has been in the `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal` - the element is atomically removed from the data structure, but the physical memory is not released. After this the thread-writer announces the removal and waits until the grace period finishes; during this time the removed element is still available to thread-readers which obtained a reference to it earlier. After the `grace period` has finished, the second step of the element removal starts: the element is removed from the physical memory.
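+
+To make this terminology a bit more concrete, here is a small hypothetical usage sketch of the `RCU` API (the `rcu_*` functions are real, while the `foo` structure and both helpers are made up for illustration; error handling is omitted):
+
+```C
+struct foo {
+	int data;
+};
+
+static struct foo __rcu *gp;	/* pointer protected by RCU */
+
+/* reader side: runs inside an RCU read-side critical section */
+int read_foo(void)
+{
+	struct foo *p;
+	int val = -1;
+
+	rcu_read_lock();
+	p = rcu_dereference(gp);	/* safely fetch the RCU-protected pointer */
+	if (p)
+		val = p->data;
+	rcu_read_unlock();
+
+	return val;
+}
+
+/* writer side: publish a new copy and free the old one after a grace period */
+void update_foo(int new_data)
+{
+	struct foo *new_p = kmalloc(sizeof(*new_p), GFP_KERNEL);
+	struct foo *old_p = rcu_dereference_protected(gp, 1);
+
+	new_p->data = new_data;
+	rcu_assign_pointer(gp, new_p);	/* readers entering from now on see the new copy */
+	synchronize_rcu();		/* wait until all pre-existing readers are done */
+	kfree(old_p);			/* now it is safe to release the old version */
+}
+```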
 
-There a couple implementations of the `RCU`. Old `RCU` called classic, the new implemetation called `tree` RCU. As you already can undrestand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about the `RCU` in general in the separate chapter about synchronization primitives, but now let's look on the `rcu_init` implementation from the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c):
+There are a couple of implementations of `RCU`. The old `RCU` is called classic, the new implementation is called `tree` RCU. As you may already understand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another one is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about `RCU` in general in the separate chapter about synchronization primitives, but now let's look at the `rcu_init` implementation from [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c):
 
 ```C
 void __init rcu_init(void)
@@ -169,13 +169,13 @@ static void __init rcu_bootup_announce(void)
 }
 ```
 
-It just prints information about the `RCU` with the `pr_info` function and `rcu_bootup_announce_oddness` which uses `pr_info` too, for printing different information about the current `RCU` configuration which depends on different kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT` and etc... In the next step, we can see the call of the `rcu_init_geometry` function. This function defined in the same source code file and computes the node tree geometry depends on amount of CPUs. Actually `RCU` provides scalability with extremely low internal to RCU lock contention. What if a data structure will be read from the different CPUs? `RCU` API provides the `rcu_state` structure wihch presents RCU global state including node hierarchy. Hierachy presented by the:
+It just prints information about the `RCU` with the `pr_info` function and `rcu_bootup_announce_oddness` which uses `pr_info` too, for printing different information about the current `RCU` configuration which depends on different kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT`, etc. In the next step, we can see the call of the `rcu_init_geometry` function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. Actually `RCU` provides scalability with extremely low internal RCU lock contention. What if a data structure is read from different CPUs? The `RCU` API provides the `rcu_state` structure which represents the RCU global state including the node hierarchy. The hierarchy is represented by the:
 
 ```
 struct rcu_node node[NUM_RCU_NODES];
 ```
 
-array of structures. As we can read in the comment which is above definition of this structure:
+array of structures. As we can read in the comment above this definition:
 
 ```
 The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
@@ -186,7 +186,7 @@ determined by the number of CPUs and by CONFIG_RCU_FANOUT.
 Small systems will have a "hierarchy" consisting of a single rcu_node.
 ```
 
-The `rcu_node` structure defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about current grace period, is grace period completed or not, CPUs or groups that need to switch in order for current grace period to proceed and etc... Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures embedded into a linear array in the `rcu_state` structure and represeted as a tree with the root in the zero element and it covers all CPUs. As you can see the number of the rcu nodes determined by the `NUM_RCU_NODES` which depends on number of available CPUs:
+The `rcu_node` structure is defined in [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about the current grace period, whether the grace period is completed or not, CPUs or groups that need to switch in order for the current grace period to proceed, etc. Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures are embedded into a linear array in the `rcu_state` structure and are represented as a tree whose root is the first element and which covers all CPUs. As you can see, the number of rcu nodes is determined by `NUM_RCU_NODES` which depends on the number of available CPUs:
 
 ```C
 #define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
@@ -254,7 +254,7 @@ static ulong jiffies_till_first_fqs = ULONG_MAX;
 static ulong jiffies_till_next_fqs = ULONG_MAX;
 ```
 
-In the next step of the `rcu_init_geometry`, we check that `rcu_fanout_leaf` didn't chage (it has the same value as `CONFIG_RCU_FANOUT_LEAF` in compile-time) and equal to the value of the `CONFIG_RCU_FANOUT_LEAF` configuration option, we just return:
+In the next step of `rcu_init_geometry`, we check whether `rcu_fanout_leaf` hasn't changed (i.e. it still has the same value as `CONFIG_RCU_FANOUT_LEAF` at compile-time) and is equal to the value of the `CONFIG_RCU_FANOUT_LEAF` configuration option; in this case we just return:
 
 ```C
 if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
@@ -262,7 +262,7 @@ if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
     return;
 ```
 
-After this we need to compute the number of nodes that can be handled an `rcu_node` tree with the given number of levels:
+After this we need to compute the number of nodes that an `rcu_node` tree can handle with the given number of levels:
 
 ```C
 rcu_capacity[0] = 1;
@@ -271,9 +271,9 @@ for (i = 2; i <= MAX_RCU_LVLS; i++)
     rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
 ```
 
-And in the last step we calcluate the number of rcu_nodes at each level of the tree in the [loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094).
+And in the last step we calculate the number of rcu_nodes at each level of the tree in the [loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094).
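+
+For example, assuming a leaf fanout of 16 and an interior fanout of 64 (hypothetical values - the real ones come from `CONFIG_RCU_FANOUT_LEAF` and `CONFIG_RCU_FANOUT`), a machine with 64 possible CPUs would need four leaf `rcu_node` structures (each covering 16 CPUs) plus one root node, i.e. a two-level tree with five nodes in total.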
 
-As we calculated geometry of the `rcu_node` tree, we need to back to the `rcu_init` function and next step we need to initialize two `rcu_state` structures with the `rcu_init_one` function:
+Now that we have calculated the geometry of the `rcu_node` tree, we need to go back to the `rcu_init` function, where the next step is to initialize two `rcu_state` structures with the `rcu_init_one` function:
 
 ```C
 rcu_init_one(&rcu_bh_state, &rcu_bh_data);
@@ -292,13 +292,13 @@ extern struct rcu_state rcu_bh_state;
 DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
 ```
 
-About this states you can read [here](http://lwn.net/Articles/264090/). As I wrote above we need to initialize `rcu_state` structures and `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of the ` __rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same that previous functions - initialization of the `rcu_preempt_state` structure with the `rcu_init_one` function which has `rcu_state` type. After this, in the `rcu_init`, we can see the call of the:
+You can read about these states [here](http://lwn.net/Articles/264090/). As I wrote above, we need to initialize the `rcu_state` structures and the `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of the `__rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same as the previous functions - initializes the `rcu_preempt_state` structure (which has the `rcu_state` type) with the `rcu_init_one` function. After this, in the `rcu_init`, we can see the call of the:
 
 ```C
 open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 ```
 
-function. This function registers a handler of the `pending interrupt`. Pending interrupt or `softirq` supposes that part of actions cab be delayed for later execution when the system will be less loaded. Pending interrupts represeted by the following structure:
+function. This function registers a handler of the `pending interrupt`. A pending interrupt or `softirq` means that part of the work can be deferred for later execution when the system is less loaded. Pending interrupts are represented by the following structure:
 
 ```C
 struct softirq_action
@@ -307,7 +307,7 @@ struct softirq_action
 };
 ```
 
-which defined in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - handler of an interrupt. You can know about `softirqs` in the your system with the:
+which is defined in [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - the handler of an interrupt. You can check the `softirqs` in your system with:
 
 ```
 $ cat /proc/softirqs
@@ -338,7 +338,7 @@ void open_softirq(int nr, void (*action)(struct softirq_action *))
 }
 ```
 
-In our case the interrupt handler is - `rcu_process_callbacks` which defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU`  core processing for the current CPU. After we registered `softirq` interrupt for the `RCU`, we can see the following code:
+In our case the interrupt handler is `rcu_process_callbacks` which is defined in [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU` core processing for the current CPU. After we have registered the `softirq` interrupt for the `RCU`, we can see the following code:
 
 ```C
 cpu_notifier(rcu_cpu_notify, 0);
@@ -347,7 +347,7 @@ for_each_online_cpu(cpu)
     rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
 ```
 
-Here we can see registration of the `cpu` notifier which needs in sysmtems which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and we will not dive into details about this theme. The last function in the `rcu_init` is the `rcu_early_boot_tests`:
+Here we can see the registration of the `cpu` notifier which is needed on systems which support [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt); we will not dive into details about this topic. The last function in the `rcu_init` is the `rcu_early_boot_tests`:
 
 ```C
 void rcu_early_boot_tests(void)
@@ -370,15 +370,15 @@ That's all. We saw initialization process of the `RCU` subsystem. As I wrote abo
 Rest of the initialization process
 --------------------------------------------------------------------------------
 
-Ok, we already passed the main theme of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work in the initialization time, but we will not dive into deep details around this function by different reasons. Some reasons not to dive into details are following:
+Ok, we have already passed the main topic of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this part we will see a couple of functions which are called at initialization time, but we will not dive into the deep details of these functions for different reasons. Some of the reasons are the following:
 
-* They are not very important for the generic kernel initialization process and can depend on the different kernel configuration;
-* They have the character of debugging and not important too for now;
+* They are not very important for the generic kernel initialization process and depend on the kernel configuration;
+* They are mostly debugging-related and not important for now;
 * We will see many of this stuff in the separate parts/chapters.
 
-After we initilized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the - `trace_init` function. As you can understand from its name, this function initialize [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. More about linux kernel trace system you can read - [here](http://elinux.org/Kernel_Trace_Systems).
+After we initialized `RCU`, the next step which you can see in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `trace_init` function. As you can understand from its name, this function initializes the [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about the linux kernel trace system [here](http://elinux.org/Kernel_Trace_Systems).
 
-After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and more about it you can read in the part about [Radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.md).
+After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html).
 
 In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are:
 
@@ -386,9 +386,9 @@ In the next step we can see the functions which are related to the `interrupts h
 * `init_IRQ`
 * `softirq_init`
 
-We will see explanation about this functions and their implementation in the special part about interrupts and exceptions handling. After this many different functions (like `init_timers`, `hrtimers_init`, `time_init` and etc...) which are related to different timing and timers stuff. More about these function we will see in the chapter about timers.
+We will see an explanation of these functions and their implementation in the special part about interrupts and exceptions handling. After this, many different functions are called (like `init_timers`, `hrtimers_init`, `time_init`, etc.) which are related to different timing and timer stuff. We will see more about these functions in the chapter about timers.
 
-The next couple of functions related with the [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event-init` (will be separate chapter about perf), initialization of the `profiling` with the `profile_init`. After this we enable `irq` with the call of the:
+The next couple of functions are related to the [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event_init` (there will be a separate chapter about perf), and the initialization of `profiling` with the `profile_init`. After this we enable `irq` with the call of the:
 
 ```C
 local_irq_enable();
@@ -398,18 +398,18 @@ which expands to the `sti` instruction and making post initialization of the [SL
 
 After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/master/drivers/tty/tty_io.c).
 
-After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leack [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`.
+After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`.
 
-This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went thorugh the many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I wrote already many times, we will see details of implementations, but in the other parts or other chapters.
+This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went through many functions but did not dive into the details of their implementations. Do not worry if you do not know anything about this stuff, or if you know something but do not understand all of it. As I already wrote many times, we will see the details of the implementations in other parts or other chapters.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file and will see that start of the first process.
+It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked at the initialization process of the `RCU` subsystem. In the next part we will continue to dive into the linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file and will see the start of the first process.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 0 - 0
KernelStructures/.gitkeep


+ 7 - 0
KernelStructures/README.md

@@ -0,0 +1,7 @@
+# Internal `system` structures of the Linux kernel
+
+This is not a usual chapter of `linux-insides`. As you may understand from the title, it mostly describes
+the internal `system` structures of the Linux kernel, like the `Interrupt Descriptor Table`, the `Global
+Descriptor Table` and many more.
+
+Most of the information is taken from the official [Intel](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) manuals.

+ 190 - 0
KernelStructures/idt.md

@@ -0,0 +1,190 @@
+Interrupt Descriptor Table (IDT)
+================================================================================
+
+There are three general sources of interrupts and exceptions:
+
+* Exceptions - sync;
+* Software interrupts - sync;
+* External interrupts - async.
+
+Types of Exceptions:
+
+* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. The saved `%rip` points to the faulting instruction;
+* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The saved `%rip` points to the instruction following the trapping instruction;
+* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.
+
+`Maskable` interrupts trigger the interrupt-handling mechanism only when RFLAGS.IF=1. Otherwise they are held pending for as long as the RFLAGS.IF bit is cleared to 0.
+
+`Nonmaskable` interrupts (NMI) are unaffected by the value of the RFLAGS.IF bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.
+
+Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
+256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions. They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:
+
+```
+/* Interrupts/Exceptions */
+enum {
+	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
+	X86_TRAP_DB,		/*  1, Debug */
+	X86_TRAP_NMI,		/*  2, Non-maskable Interrupt */
+	X86_TRAP_BP,		/*  3, Breakpoint */
+	X86_TRAP_OF,		/*  4, Overflow */
+	X86_TRAP_BR,		/*  5, Bound Range Exceeded */
+	X86_TRAP_UD,		/*  6, Invalid Opcode */
+	X86_TRAP_NM,		/*  7, Device Not Available */
+	X86_TRAP_DF,		/*  8, Double Fault */
+	X86_TRAP_OLD_MF,	/*  9, Coprocessor Segment Overrun */
+	X86_TRAP_TS,		/* 10, Invalid TSS */
+	X86_TRAP_NP,		/* 11, Segment Not Present */
+	X86_TRAP_SS,		/* 12, Stack Segment Fault */
+	X86_TRAP_GP,		/* 13, General Protection Fault */
+	X86_TRAP_PF,		/* 14, Page Fault */
+	X86_TRAP_SPURIOUS,	/* 15, Spurious Interrupt */
+	X86_TRAP_MF,		/* 16, x87 Floating-Point Exception */
+	X86_TRAP_AC,		/* 17, Alignment Check */
+	X86_TRAP_MC,		/* 18, Machine Check */
+	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
+	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
+};
+```
+
+Error Codes
+--------------------------------------------------------------------------------
+
+The processor exception-handling mechanism reports error and status information for some exceptions using an error code. The error code is pushed onto the stack by the exception-mechanism during the control transfer into the exception handler. The error code has two formats:
+
+* most error-reporting exceptions format;
+* page fault format.
+
+Here is the format of the selector error code:
+
+```
+31                           16 15                                  3   2   1   0
++-------------------------------------------------------------------------------+
+|                              |                                    | T | I | E |
+|           Reserved           |             Selector Index         | - | D | X |
+|                              |                                    | I | T | T |
++-------------------------------------------------------------------------------+
+```
+
+Where:
+
+* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
+* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
+* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`.
+* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits.
+
+The Page-Fault Error Code format is:
+
+```
+31                                                              4   3   2   1   0
++-------------------------------------------------------------------------------+
+|                                                         |     | R | U | R | - |
+|                       Reserved                          | I/D | S | - | - | P |
+|                                                         |     | V | S | W | - |
++-------------------------------------------------------------------------------+
+```
+
+Where:
+
+* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
+* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
+* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault. If this bit is set to 1, an access in user mode (CPL=3) caused the page fault;
+* `R/W` - If this bit is cleared to 0, the access that caused the page fault is a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
+* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation.
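+
+Just as an illustration, the bits described above could be decoded with a helper like this (a hypothetical sketch with made-up macro names, not the kernel's own code):
+
+```C
+#define PF_P	(1 << 0)	/* protection violation (otherwise not-present page) */
+#define PF_RW	(1 << 1)	/* write access (otherwise read) */
+#define PF_US	(1 << 2)	/* user-mode access (otherwise supervisor mode) */
+#define PF_RSVD	(1 << 3)	/* reserved bit set in a page-translation-table entry */
+#define PF_ID	(1 << 4)	/* instruction fetch */
+
+static void print_page_fault_reason(unsigned long error_code)
+{
+	pr_info("page fault: %s %s access in %s mode%s%s\n",
+		error_code & PF_P    ? "protection-violation" : "not-present",
+		error_code & PF_RW   ? "write"                 : "read",
+		error_code & PF_US   ? "user"                  : "supervisor",
+		error_code & PF_RSVD ? ", reserved bit set"    : "",
+		error_code & PF_ID   ? ", instruction fetch"   : "");
+}
+```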
+
+Interrupt Control Transfers
+--------------------------------------------------------------------------------
+
+The IDT may contain any of three kinds of gate descriptors:
+
+* `Task Gate` - contains the segment selector for a TSS for an exception and/or interrupt handler task;
+* `Interrupt Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment;
+* `Trap Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment.
+
+General format of gates is:
+
+```
+127                                                                             96
++-------------------------------------------------------------------------------+
+|                                                                               |
+|                                Reserved                                       |
+|                                                                               |
++--------------------------------------------------------------------------------
+95                                                                              64
++-------------------------------------------------------------------------------+
+|                                                                               |
+|                               Offset 63..32                                   |
+|                                                                               |
++-------------------------------------------------------------------------------+
+63                               48 47      46  44   42    39             34    32
++-------------------------------------------------------------------------------+
+|                                  |       |  D  |   |     |      |   |   |     |
+|       Offset 31..16              |   P   |  P  | 0 |Type |0 0 0 | 0 | 0 | IST |
+|                                  |       |  L  |   |     |      |   |   |     |
+ -------------------------------------------------------------------------------+
+31                                   16 15                                      0
++-------------------------------------------------------------------------------+
+|                                      |                                        |
+|          Segment Selector            |                 Offset 15..0           |
+|                                      |                                        |
++-------------------------------------------------------------------------------+
+```
+
+Where:
+
+* `Selector` - Segment Selector for destination code segment;
+* `Offset` - Offset to handler procedure entry point;
+* `DPL` - Descriptor Privilege Level;
+* `P` - Segment Present flag;
+* `IST` - Interrupt Stack Table;
+* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor.
+
+An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`):
+
+```C
+struct gate_struct64 {
+	u16 offset_low;
+	u16 segment;
+	unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
+	u16 offset_middle;
+	u32 offset_high;
+	u32 zero1;
+} __attribute__((packed));
+```
+
+which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file.
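+
+To see how the fields map to the gate layout pictured above, here is a simplified hypothetical helper (not the kernel's own code) that fills a `gate_struct64`, splitting the 64-bit handler address across the three offset fields:
+
+```C
+static void set_gate_sketch(struct gate_struct64 *gate, unsigned long handler,
+			    u16 selector, unsigned type, unsigned dpl, unsigned ist)
+{
+	gate->offset_low    = handler & 0xffff;			/* offset bits 15..0 */
+	gate->offset_middle = (handler >> 16) & 0xffff;		/* offset bits 31..16 */
+	gate->offset_high   = handler >> 32;			/* offset bits 63..32 */
+	gate->segment       = selector;				/* destination code segment selector */
+	gate->ist           = ist;				/* Interrupt Stack Table index */
+	gate->type          = type;				/* e.g. interrupt gate or trap gate */
+	gate->dpl           = dpl;				/* descriptor privilege level */
+	gate->p             = 1;				/* segment present */
+	gate->zero0         = 0;
+	gate->zero1         = 0;
+}
+```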
+
+A task gate descriptor does not contain an `IST` field and its format differs from interrupt/trap gates:
+
+```C
+struct ldttss_desc64 {
+	u16 limit0;
+	u16 base0;
+	unsigned base1 : 8, type : 5, dpl : 2, p : 1;
+	unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
+	u32 base3;
+	u32 zero1;
+} __attribute__((packed));
+```
+
+Exceptions During a Task Switch
+--------------------------------------------------------------------------------
+
+An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism.
+
+**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.**
+
+Nonmaskable interrupt
+--------------------------------------------------------------------------------
+
+**TODO**
+
+API
+--------------------------------------------------------------------------------
+
+**TODO**
+
+Interrupt Stack Table
+--------------------------------------------------------------------------------
+
+**TODO**

+ 11 - 0
LINKS.md

@@ -12,6 +12,11 @@ Protected mode
 
 * [64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
 
+Memory management in the Linux kernel
+--------------------------------------
+
+* [Notes on the linux kernel VM subsystem by @lorenzo-stoakes](https://github.com/lorenzo-stoakes/linux-vm-notes)
+
 Serial programming
 ------------------------
 
@@ -44,3 +49,9 @@ Other architectures
 ------------------------
 
 * [PowerPC and Linux Kernel Inside](http://www.systemcomputing.org/ppc/)
+
+Useful links
+------------------------
+
+* [Linux x86 Program Start Up](http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html)
+* [Memory Layout in Program Execution (32 bits)](http://fgiasson.com/articles/memorylayout.txt)

+ 489 - 0
Misc/contribute.md

@@ -0,0 +1,489 @@
+Linux kernel development
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+As you may already know, I started a series of [blog posts](http://0xax.github.io/categories/assembly/) about assembler programming for the `x86_64` architecture last year. I had never written a line of low-level code before that, except for a couple of toy `Hello World` examples in university. It was a long time ago and, as I already said, I didn't write low-level code at all. Some time ago I became interested in such things. I understood that I could write programs, but didn't actually understand how my program was arranged.
+
+After writing some assembler code I began to understand, **approximately**, how my program looks after compilation. But anyway, I didn't understand many other things. For example: what occurs when the `syscall` instruction is executed in my assembler code, what occurs when the `printf` function starts to work, or how my program can talk to other computers over the network. The [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me answers to my questions and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I was interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but now my knowledge about the Linux kernel and the processes around it is much better.
+
+I'm writing this part nine and a half months after I started to learn from the source code of the Linux kernel and published the first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. Now it contains forty parts and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know, the Linux kernel is a very huge piece of code and it is easy to forget what this or that part of the Linux kernel means and how it implements something. But soon the [linux-insides](https://github.com/0xAX/linux-insides) repo became popular and after nine months it has `9096` stars:
+
+![github](http://s2.postimg.org/jjb3s4frt/stars.png)
+
+It seems that people are interested in the insides of the Linux kernel. Besides this, in all the time that I have been writing `linux-insides`, I have received many questions from different people about how to begin contributing to the Linux kernel. Generally people are interested in contributing to open source projects and the Linux kernel is not an exception:
+
+![google-linux](http://s4.postimg.org/yg9z5zx0d/google_linux.png)
+
+So, it seems that people are interested in the Linux kernel development process. I thought it would be strange if a book about the Linux kernel did not contain a part describing how to take part in Linux kernel development and that's why I decided to write it. You will not find information about why you should be interested in contributing to the Linux kernel in this part. But if you are interested in how to start with Linux kernel development, this part is for you.
+
+Let's start.
+
+How to start with Linux kernel
+---------------------------------------------------------------------------------
+
+First of all, let's see how to get, build, and run the Linux kernel. You can run your custom build of the Linux kernel in two ways:
+
+* Run the Linux kernel on a virtual machine;
+* Run the Linux kernel on real hardware.
+
+I'll provide descriptions for both methods. Before we start doing anything with the Linux kernel, we need to get it. There are a couple of ways to do this depending on your purpose. If you just want to update the current version of the Linux kernel on your computer, you can use the instructions specific to your Linux [distro](https://en.wikipedia.org/wiki/Linux_distribution).
+
+In the first case you just need to download a new version of the Linux kernel with the [package manager](https://en.wikipedia.org/wiki/Package_manager). For example, to upgrade the version of the Linux kernel to `4.1` for [Ubuntu (Vivid Vervet)](http://releases.ubuntu.com/15.04/), you will just need to execute the following commands:
+
+```
+$ sudo add-apt-repository ppa:kernel-ppa/ppa
+$ sudo apt-get update
+```
+
+After this execute this command:
+
+```
+$ apt-cache showpkg linux-headers
+```
+
+and choose the version of the Linux kernel you are interested in. Finally, execute the next command and replace `${version}` with the version that you chose in the output of the previous command:
+
+```
+$ sudo apt-get install linux-headers-${version} linux-headers-${version}-generic linux-image-${version}-generic --fix-missing
+```
+
+and reboot your system. After the reboot you will see the new kernel in the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) menu.
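+
+After the reboot you can also check which kernel is actually running, for example with:
+
+```
+$ uname -r
+```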
+
+On the other hand, if you are interested in Linux kernel development, you will need to get the source code of the Linux kernel. You can find it on the [kernel.org](https://kernel.org/) website and download an archive with the Linux kernel source code. Actually the Linux kernel development process is fully built around the `git` [version control system](https://en.wikipedia.org/wiki/Version_control). So you can get it with `git` from `kernel.org`:
+
+```
+$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+```
+
+I don't know about you, but I prefer `github`. There is a [mirror](https://github.com/torvalds/linux) of the Linux kernel mainline repository, so you can clone it with:
+
+```
+$ git clone git@github.com:torvalds/linux.git
+```
+
+I use my own [fork](https://github.com/0xAX/linux) for development and when I want to pull updates from the main repository I just execute the following command:
+
+```
+$ git checkout master
+$ git pull upstream master
+```
+
+Note that the remote name of the main repository is `upstream`. To add a new remote with the main Linux repository you can execute:
+
+```
+git remote add upstream git@github.com:torvalds/linux.git
+```
+
+After this you will have two remotes:
+
+```
+~/dev/linux (master) $ git remote -v
+origin	git@github.com:0xAX/linux.git (fetch)
+origin	git@github.com:0xAX/linux.git (push)
+upstream	https://github.com/torvalds/linux.git (fetch)
+upstream	https://github.com/torvalds/linux.git (push)
+```
+
+One is for your fork (`origin`) and the second is for the main repository (`upstream`).
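+
+To propagate those updates to your own fork on GitHub, a one-line sketch (assuming the fork remote is called `origin`, as shown above):
+
+```
+$ git push origin master
+```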
+
+Now that we have a local copy of the Linux kernel source code, we need to configure and build it. The Linux kernel can be configured in different ways. The simplest way is to just copy the configuration file of the already installed kernel that is located in the `/boot` directory:
+
+```
+$ sudo cp /boot/config-$(uname -r) ~/dev/linux/.config
+```
+
+If your current Linux kernel was built with the support for access to the `/proc/config.gz` file, you can copy your actual kernel configuration file with this command:
+
+```
+$ cat /proc/config.gz | gunzip > ~/dev/linux/.config
+```
+
+If you are not satisfied with the standard kernel configuration that is provided by the maintainers of your distro, you can configure the Linux kernel manually. There are a couple of ways to do it. The Linux kernel root [Makefile](https://github.com/torvalds/linux/blob/master/Makefile) provides a set of targets that allows you to configure it. For example `menuconfig` provides a menu-driven interface for the kernel configuration:
+
+![menuconfig](http://s21.postimg.org/zcz48p7yf/menucnonfig.png)
+
+The `defconfig` argument generates the default kernel configuration file for the current architecture, for example [x86_64 defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig). You can pass the `ARCH` command line argument to `make` to build `defconfig` for the given architecture:
+
+```
+$ make ARCH=arm64 defconfig
+```
+
+The `allnoconfig`, `allyesconfig` and `allmodconfig` arguments allow you to generate a new configuration file where all options will be disabled, enabled, and enabled as modules respectively. The `nconfig` command line argument provides an `ncurses`-based menu program to configure the Linux kernel:
+
+![nconfig](http://s29.postimg.org/hpghikp4n/nconfig.png)
+
+There is even `randconfig` to generate a random Linux kernel configuration file. I will not write about how to configure the Linux kernel or which options to enable because it makes no sense to do so for two reasons: first, I do not know your hardware, and second, if you know your hardware, the only remaining task is to find out how to use the configuration programs, and all of them are pretty simple to use.
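+
+For quick reference, all of these are ordinary `make` targets run from the root of the kernel source tree (a sketch; `make help` prints the full list of available targets):
+
+```
+$ make menuconfig      # menu-driven (ncurses) configuration
+$ make nconfig         # alternative ncurses interface
+$ make defconfig       # default configuration for the current architecture
+$ make allnoconfig     # all options disabled
+$ make allyesconfig    # all options enabled
+$ make allmodconfig    # all options enabled as modules where possible
+$ make randconfig      # random configuration
+```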
+
+OK, we now have the Linux kernel source code configured. The next step is the compilation of the Linux kernel. The simplest way to compile the Linux kernel is to just execute:
+
+```
+$ make
+scripts/kconfig/conf  --silentoldconfig Kconfig
+#
+# configuration written to .config
+#
+  CHK     include/config/kernel.release
+  UPD     include/config/kernel.release
+  CHK     include/generated/uapi/linux/version.h
+  CHK     include/generated/utsrelease.h
+  ...
+  ...
+  ...
+  OBJCOPY arch/x86/boot/vmlinux.bin
+  AS      arch/x86/boot/header.o
+  LD      arch/x86/boot/setup.elf
+  OBJCOPY arch/x86/boot/setup.bin
+  BUILD   arch/x86/boot/bzImage
+  Setup is 15740 bytes (padded to 15872 bytes).
+System is 4342 kB
+CRC 82703414
+Kernel: arch/x86/boot/bzImage is ready  (#73)
+```
+
+To increase the speed of kernel compilation you can pass `-jN` command line argument to `make`, where `N` specifies the number of commands to run simultaneously:
+
+```
+$ make -j8
+```
+
+If you want to build the Linux kernel for an architecture that differs from your current one, the simplest way to do it is to pass two arguments:
+
+* the `ARCH` command line argument and the name of the target architecture;
+* the `CROSS_COMPILE` command line argument and the cross-compiler tool prefix.
+
+For example, if we want to compile the Linux kernel for [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features) with the default kernel configuration file, we need to execute the following commands:
+
+```
+$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
+$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
+```
+
+As a result of compilation we can see the compressed kernel - `arch/x86/boot/bzImage` (for an `x86_64` build). Now that we have compiled the kernel, we can either install it on our computer or just run it in an emulator.
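+
+To double-check what was just built, a couple of quick commands may help (a sketch; the exact output depends on your configuration):
+
+```
+$ make -s kernelrelease          # prints the version string of the built kernel
+$ file arch/x86/boot/bzImage     # should report a Linux kernel bzImage
+```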
+
+Installing Linux kernel
+--------------------------------------------------------------------------------
+
+As I already wrote, we will consider two ways to launch the new kernel: the first is to install and run the new version of the Linux kernel on real hardware, and the second is to launch the Linux kernel on a virtual machine. In the previous paragraph we saw how to build the Linux kernel from source code and as a result we got a compressed image:
+
+```
+...
+...
+...
+Kernel: arch/x86/boot/bzImage is ready  (#73)
+```
+
+After we have got the [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage) we need to install the `headers` and `modules` of the new Linux kernel with:
+
+```
+$ sudo make headers_install
+$ sudo make modules_install
+```
+
+and then the kernel itself:
+
+```
+$ sudo make install
+```
+
+From this moment we have installed the new version of the Linux kernel and now we must tell the `bootloader` about it. Of course we can add it manually by editing the `/boot/grub2/grub.cfg` configuration file, but I prefer to use a script for this purpose. I'm using two different Linux distros: Fedora and Ubuntu, which have two different ways to update the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) configuration file. I'm using the following script for this purpose:
+
+```shell
+#!/bin/bash
+
+source "term-colors"
+
+DISTRIBUTIVE=$(cat /etc/*-release | grep NAME | head -1 | sed -n -e 's/NAME\=//p')
+echo -e "Distributive: ${Green}${DISTRIBUTIVE}${Color_Off}"
+
+if [[ "$DISTRIBUTIVE" == "Fedora" ]] ;
+then
+    su -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
+else
+    sudo update-grub
+fi
+
+echo "${Green}Done.${Color_Off}"
+```
+
+This is the last step of the new Linux kernel installation and after this you can reboot your computer and select the new version of the kernel during boot.
+
+The second case is to launch the new Linux kernel in a virtual machine. I prefer [qemu](https://en.wikipedia.org/wiki/QEMU). First of all we need to build an initial ramdisk - [initrd](https://en.wikipedia.org/wiki/Initrd) - for this. The `initrd` is a temporary root file system that is used by the Linux kernel during the initialization process while other filesystems are not mounted. We can build an `initrd` with the following commands.
+
+First of all we need to download [busybox](https://en.wikipedia.org/wiki/BusyBox) and run `menuconfig` for its configuration:
+
+```shell
+$ mkdir initrd
+$ cd initrd
+$ curl http://busybox.net/downloads/busybox-1.23.2.tar.bz2 | tar xjf -
+$ cd busybox-1.23.2/
+$ make menuconfig
+$ make -j4
+```
+
+`busybox` is a single executable file - `/bin/busybox` - that contains a set of standard tools like [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities). In the `busybox` menu we need to enable the `Build BusyBox as a static binary (no shared libs)` option:
+
+![busysbox menu](http://s18.postimg.org/sj92uoweh/busybox.png)
+
+We can find this menu under:
+
+```
+Busybox Settings
+--> Build Options
+```
+
+After this we exit from the `busybox` configuration menu and execute the following commands to build and install it:
+
+```
+$ make -j4
+$ sudo make install
+```
+
+Now that `busybox` is installed, we can begin building our `initrd`. To do this, we go to the previous `initrd` directory and:
+
+```
+$ cd ..
+$ mkdir -p initramfs
+$ cd initramfs
+$ mkdir -pv {bin,sbin,etc,proc,sys,usr/{bin,sbin}}
+$ cp -av ../busybox-1.23.2/_install/* .
+```
+
+The last command copies the `busybox` files to the `bin`, `sbin` and other directories. Now we need to create an executable `init` file that will be executed as the first process in the system. My `init` file just mounts the [procfs](https://en.wikipedia.org/wiki/Procfs) and [sysfs](https://en.wikipedia.org/wiki/Sysfs) filesystems and executes a shell:
+
+```shell
+#!/bin/sh
+
+mount -t proc none /proc
+mount -t sysfs none /sys
+
+exec /bin/sh
+```
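+
+Note that the `init` file must be placed in the root of the `initramfs` directory and must be executable - a minimal sketch (assuming the file above was saved as `init`):
+
+```
+$ chmod +x init
+```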
+
+Now we can create an archive that will be our `initrd`:
+
+```
+$ find . -print0 | cpio --null -ov --format=newc | gzip -9 > ~/dev/initrd_x86_64.gz
+```
+
+We can now run our kernel in the virtual machine. As I already wrote I prefer [qemu](https://en.wikipedia.org/wiki/QEMU) for this. We can run our kernel with the following command:
+
+```
+$ qemu-system-x86_64 -snapshot -m 8GB -serial stdio -kernel ~/dev/linux/arch/x86_64/boot/bzImage -initrd ~/dev/initrd_x86_64.gz -append "root=/dev/sda1 ignore_loglevel"
+```
+
+![qemu](http://s22.postimg.org/b8ttyigup/qemu.png)
+
+From now on we can run the Linux kernel in the virtual machine, and this means that we can begin to change and test the kernel.
+
+Consider using [ivandavidov/minimal](https://github.com/ivandavidov/minimal) to automate the process of generating an initrd.
+
+Getting started with the Linux Kernel Development
+---------------------------------------------------------------------------------
+
+The main point of this paragraph is to answer two questions: what to do and what not to do before sending your first patch to the Linux kernel. Please do not confuse this `to do` with `todo`. I have no answer for what you can fix in the Linux kernel. I just want to tell you about my workflow when experimenting with the Linux kernel source code.
+
+First of all I pull the latest updates from Linus's repo with the following commands:
+
+```
+$ git checkout master
+$ git pull upstream master
+```
+
+After this my local repository with the Linux kernel source code is synced with the [mainline](https://github.com/torvalds/linux) repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you about where to start and what to do in the Linux kernel. But the best place for newbies is the `staging` tree, in other words the set of drivers from [drivers/staging](https://github.com/torvalds/linux/tree/master/drivers/staging). The maintainer of the `staging` tree is [Greg Kroah-Hartman](https://en.wikipedia.org/wiki/Greg_Kroah-Hartman) and the `staging` tree is the place where a trivial patch can be accepted. Let's look at a simple example that describes how to generate a patch, check it, and send it to the [Linux kernel mailing list](https://lkml.org/).
+
+If we look in the driver for the [Digi International EPCA PCI](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) based devices, we will see the `dgap_sindex` function on line 295:
+
+```C
+static char *dgap_sindex(char *string, char *group)
+{
+	char *ptr;
+
+	if (!string || !group)
+		return NULL;
+
+	for (; *string; string++) {
+		for (ptr = group; *ptr; ptr++) {
+			if (*ptr == *string)
+				return string;
+		}
+	}
+
+	return NULL;
+}
+```
+
+This function looks for a match of any character in the group and returns that position. While researching the source code of the Linux kernel, I noticed that the [lib/string.c](https://github.com/torvalds/linux/blob/master/lib/string.c#L473) source code file contains the implementation of the `strpbrk` function that does the same thing as `dgap_sindex`. It is not a good idea to use a custom implementation of a function that already exists, so we can remove the `dgap_sindex` function from the [drivers/staging/dgap/dgap.c](https://github.com/torvalds/linux/blob/master/drivers/staging/dgap/dgap.c) source code file and use `strpbrk` instead.
+
+First of all let's create a new `git` branch based on the current master that is synced with the Linux kernel mainline repo:
+
+```
+$ git checkout -b "dgap-remove-dgap_sindex"
+```
+
+And now we can replace `dgap_sindex` with `strpbrk`. After we have made all the changes we need to recompile the Linux kernel, or just the [dgap](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) directory. Do not forget to enable this driver in the kernel configuration. You can find it in:
+
+```
+Device Drivers
+--> Staging drivers
+----> Digi EPCA PCI products
+```
+
+![dgap menu](http://s4.postimg.org/d3pozpge5/digi.png)
+
+Now it is time to make a commit. I'm using the following combination for this:
+
+```
+$ git add .
+$ git commit -s -v
+```
+
+After the last command, an editor chosen from the `$GIT_EDITOR` or `$EDITOR` environment variable will be opened. The `-s` command line argument adds a `Signed-off-by` line from the committer at the end of the commit log message. You can find this line at the end of each commit message, for example - [00cc1633](https://github.com/torvalds/linux/commit/00cc1633816de8c95f337608a1ea64e228faf771). The main point of this line is to track who made a change. The `-v` option shows a unified diff between the HEAD commit and what would be committed at the bottom of the commit message. It is not necessary, but sometimes very useful. A couple of words about the commit message: a commit message consists of two parts.
+
+The first part is on the first line and contains a short description of the changes. It starts with the `[PATCH]` prefix followed by a subsystem, driver or architecture name and, after a `:` symbol, a short description. In our case it will be something like this:
+
+```
+[PATCH] staging/dgap: Use strpbrk() instead of dgap_sindex()
+```
+
+After the short description there is usually an empty line and then the full description of the commit. In our case it will be:
+
+```
+The <linux/string.h> provides strpbrk() function that does the same that the
+dgap_sindex(). Let's use already defined function instead of writing custom.
+```
+
+And finally the `Signed-off-by` line at the end of the commit message. Note that each line of a commit message must not be longer than `80` characters and the commit message must describe your changes in detail. Do not just write a commit message like `Custom function removed`; you need to describe what you did and why. The patch reviewers must know what they are reviewing. Besides this, commit messages written this way are very helpful. Each time we can't understand something, we can use [git blame](http://git-scm.com/docs/git-blame) to read the description of the changes.
+
+After we have committed the changes, it is time to generate a patch. We can do it with the `format-patch` command:
+
+```
+$ git format-patch master
+0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
+```
+
+We've passed the name of the branch (`master` in this case) to the `format-patch` command, which will generate a patch with the last changes that are in the `dgap-remove-dgap_sindex` branch and not in the `master` branch. As you can note, the `format-patch` command generates a file that contains the last changes and whose name is based on the commit's short description. If you want to generate a patch with a custom name, you can use the `--stdout` option:
+
+```
+$ git format-patch master --stdout > dgap-patch-1.patch
+```
+
+The last step after we have generated our patch is to send it to the Linux kernel mailing list. Of course, you can use any email client, but `git` provides a special command for this: `git send-email`. Before you send your patch, you need to know where to send it. Yes, you can just send it to the Linux kernel mailing list address, which is `linux-kernel@vger.kernel.org`, but it is very likely that the patch will be ignored because of the large flow of messages. The better choice is to send the patch to the maintainers of the subsystem where you have made changes. To find the names of these maintainers, use the `get_maintainer.pl` script. All you need to do is pass the file or directory where you wrote code:
+
+```
+$ ./scripts/get_maintainer.pl -f drivers/staging/dgap/dgap.c
+Lidza Louina <lidza.louina@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
+Mark Hounschell <markh@compro.net> (maintainer:DIGI EPCA PCI PRODUCTS)
+Daeseok Youn <daeseok.youn@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
+Greg Kroah-Hartman <gregkh@linuxfoundation.org> (supporter:STAGING SUBSYSTEM)
+driverdev-devel@linuxdriverproject.org (open list:DIGI EPCA PCI PRODUCTS)
+devel@driverdev.osuosl.org (open list:STAGING SUBSYSTEM)
+linux-kernel@vger.kernel.org (open list)
+```
+
+You will see a set of names and related emails. Now we can send our patch with:
+
+```
+$ git send-email --to "Lidza Louina <lidza.louina@gmail.com>" \
+  --cc "Mark Hounschell <markh@compro.net>"                   \
+  --cc "Daeseok Youn <daeseok.youn@gmail.com>"                \
+  --cc "Greg Kroah-Hartman <gregkh@linuxfoundation.org>"      \
+  --cc "driverdev-devel@linuxdriverproject.org"               \
+  --cc "devel@driverdev.osuosl.org"                           \
+  --cc "linux-kernel@vger.kernel.org"
+```
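+
+Instead of copying the addresses by hand, `git send-email` can also ask `get_maintainer.pl` itself - a sketch (assuming the patch file name generated above; the `--cc-cmd` option runs the given command for each patch to collect `Cc` addresses):
+
+```
+$ git send-email --to "Lidza Louina <lidza.louina@gmail.com>"    \
+  --cc-cmd='./scripts/get_maintainer.pl --norolestats'           \
+  0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
+```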
+
+That's all. The patch is sent and now you only have to wait for feedback from the Linux kernel developers. After you send a patch and a maintainer accepts it, you will find it in the maintainer's repository (for example [patch](https://git.kernel.org/cgit/linux/kernel/git/gregkh/staging.git/commit/?h=staging-testing&id=b9f7f1d0846f15585b8af64435b6b706b25a5c0b) that you saw in this part) and after some time the maintainer will send a pull request to Linus and you will see your patch in the mainline repository.
+
+That's all.
+
+Some advice
+--------------------------------------------------------------------------------
+
+At the end of this part I want to give you some advice about what to do and what not to do during development of the Linux kernel:
+
+* Think, Think, Think. And think again before you decide to send a patch.
+
+* Each time you change something in the Linux kernel source code - compile it. After any change. Again and again. Nobody likes changes that don't even compile.
+
+* The Linux kernel has a coding style [guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle) and you need to comply with it. There is a great script which can help you check your changes - [scripts/checkpatch.pl](https://github.com/torvalds/linux/blob/master/scripts/checkpatch.pl). Just pass a source code file with changes to it and you will see:
+
+```
+$ ./scripts/checkpatch.pl -f drivers/staging/dgap/dgap.c
+WARNING: Block comments use * on subsequent lines
+#94: FILE: drivers/staging/dgap/dgap.c:94:
++/*
++     SUPPORTED PRODUCTS
+
+CHECK: spaces preferred around that '|' (ctx:VxV)
+#143: FILE: drivers/staging/dgap/dgap.c:143:
++	{ PPCM,        PCI_DEV_XEM_NAME,     64, (T_PCXM|T_PCLITE|T_PCIBUS) },
+
+```
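+
+You can also run `checkpatch.pl` directly on a generated patch file instead of on a source file - a sketch (assuming the patch name from the `format-patch` example above):
+
+```
+$ ./scripts/checkpatch.pl 0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
+```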
+
+You can also see problematic places with the help of `git diff`:
+
+![git diff](http://oi60.tinypic.com/2u91rgn.jpg)
+
+* [Linus doesn't accept github pull requests](https://github.com/torvalds/linux/pull/17#issuecomment-5654674)
+
+* If your change consists of several different and unrelated changes, you need to split it into separate commits. The `git format-patch` command will generate a patch for each commit, and the subject of each patch will contain an `N/M` numbering where `N` is the number of the patch in the series. If you are planning to send a series of patches it will be helpful to pass the `--cover-letter` option to the `git format-patch` command. This will generate an additional file that will contain the cover letter that you can use to describe what your patchset changes. It is also a good idea to use the `--in-reply-to` option in the `git send-email` command. This option allows you to send your patch series in reply to your cover message. The structure of your patch series will look like this to a maintainer:
+
+```
+|--> cover letter
+  |----> patch_1
+  |----> patch_2
+```
+
+You need to pass the `message-id` of the cover letter as an argument of the `--in-reply-to` option; you can find it in the output of `git send-email`.
+
+It's important that your email is in [plain text](https://en.wikipedia.org/wiki/Plain_text) format. Generally, `send-email` and `format-patch` are very useful during development, so look at the documentation for these commands - [git send-email](http://git-scm.com/docs/git-send-email) and [git format-patch](http://git-scm.com/docs/git-format-patch) - and you'll find more useful options.
+
+* Do not be surprised if you do not get an immediate answer after you send your patch. Maintainers can be very busy.
+
+* The [scripts](https://github.com/torvalds/linux/tree/master/scripts) directory contains many different useful scripts that are related to Linux kernel development. We already saw two scripts from this directory: the `checkpatch.pl` and the `get_maintainer.pl` scripts. Outside of those scripts, you can find the [stackusage](https://github.com/torvalds/linux/blob/master/scripts/stackusage) script that will print usage of the stack, [extract-vmlinux](https://github.com/torvalds/linux/blob/master/scripts/extract-vmlinux) for extracting an uncompressed kernel image, and many others. Outside of the `scripts` directory you can find some very useful [scripts](https://github.com/lorenzo-stoakes/kernel-scripts) by [Lorenzo Stoakes](https://twitter.com/ljsloz) for kernel development.
+
+* Subscribe to the Linux kernel mailing list. There is a large number of emails every day on `lkml`, but it is very useful to read them and understand things such as the current state of the Linux kernel. Besides `lkml` there is a [set](http://vger.kernel.org/vger-lists.html) of mailing lists which are related to the different Linux kernel subsystems.
+
+* If your patch is not accepted the first time and you receive feedback from Linux kernel developers, make your changes and resend the patch with the `[PATCH vN]` prefix (where `N` is the patch version number). For example:
+
+```
+[PATCH v2] staging/dgap: Use strpbrk() instead of dgap_sindex()
+```
+
+Also it must contain a changelog that describes all changes from previous patch versions. Of course, this is not an exhaustive list of requirements for Linux kernel development, but some of the most important items were addressed.
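+
+`git` can add this version prefix for you - a sketch using the `--reroll-count` (`-v`) option of `format-patch`:
+
+```
+$ git format-patch -v 2 master
+v2-0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
+```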
+
+Happy Hacking!
+
+Conclusion
+--------------------------------------------------------------------------------
+
+I hope this will help others join the Linux kernel community!
+If you have any questions or suggestions, write me at [email](kuleshovmail@gmail.com) or ping [me](https://twitter.com/0xAX) on twitter.
+
+Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.
+
+Links
+--------------------------------------------------------------------------------
+
+* [blog posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
+* [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler)
+* [distro](https://en.wikipedia.org/wiki/Linux_distribution)
+* [package manager](https://en.wikipedia.org/wiki/Package_manager)
+* [grub](https://en.wikipedia.org/wiki/GNU_GRUB)
+* [kernel.org](https://kernel.org/)
+* [version control system](https://en.wikipedia.org/wiki/Version_control)
+* [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features)
+* [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage)
+* [qemu](https://en.wikipedia.org/wiki/QEMU)
+* [initrd](https://en.wikipedia.org/wiki/Initrd)
+* [busybox](https://en.wikipedia.org/wiki/BusyBox)
+* [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities)
+* [procfs](https://en.wikipedia.org/wiki/Procfs)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [Linux kernel mail listing archive](https://lkml.org/)
+* [Linux kernel coding style guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle)
+* [How to Get Your Change Into the Linux Kernel](https://github.com/torvalds/linux/blob/master/Documentation/SubmittingPatches)
+* [Linux Kernel Newbies](http://kernelnewbies.org/)
+* [plain text](https://en.wikipedia.org/wiki/Plain_text)

+ 6 - 6
Misc/how_kernel_compiled.md

@@ -129,7 +129,7 @@ SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
 				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )
 ```
 
-As you can see, it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about machine, operating system and architecture. As it gets the output of `uname`, it parses the ouput and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the certain architecture and `hfr-arch` that provides the directory for the header files:
+As you can see, it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about machine, operating system and architecture. As it gets the output of `uname`, it parses the output and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the certain architecture and `hfr-arch` that provides the directory for the header files:
 
 ```Makefile
 ifeq ($(ARCH),i386)
@@ -166,7 +166,7 @@ HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-p
 HOSTCXXFLAGS = -O2
 ```
 
-Next we get to the `CC` variable that represents compiler too, so why do we need the `HOST*` variables? `CC` is the target compiler that will be used during kernel compilation, but `HOSTCC` will be used during compilation of the set of the `host` programs (we will see it soon). After this we can see the definition of `KBUILD_MODULES` and `KBUILD_BUILTIN` variables that are used to determine what to compile (kernel, modules or both):
+Next we get to the `CC` variable that represents compiler too, so why do we need the `HOST*` variables? `CC` is the target compiler that will be used during kernel compilation, but `HOSTCC` will be used during compilation of the set of the `host` programs (we will see it soon). After this we can see the definition of `KBUILD_MODULES` and `KBUILD_BUILTIN` variables that are used to determine what to compile (modules, kernel, or both):
 
 ```Makefile
 KBUILD_MODULES :=
@@ -318,7 +318,7 @@ archscripts: scripts_basic
 
 We can see that it depends on the `scripts_basic` target from the top [Makefile](https://github.com/torvalds/linux/blob/master/Makefile). At the first we can see the `scripts_basic` target that executes make for the [scripts/basic](https://github.com/torvalds/linux/blob/master/scripts/basic/Makefile) makefile:
 
-```Maklefile
+```Makefile
 scripts_basic:
 	$(Q)$(MAKE) $(build)=scripts/basic
 ```
@@ -550,14 +550,14 @@ The first is `voffset.h` generated by the `sed` script that gets two addresses f
 #define VO__text 0xffffffff81000000
 ```
 
-They are start and end of the kernel. The second is `zoffset.h` depens on the `vmlinux` target from the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
+They are the start and the end of the kernel. The second is `zoffset.h`, which depends on the `vmlinux` target from the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
 
 ```Makefile
 $(obj)/zoffset.h: $(obj)/compressed/vmlinux FORCE
 	$(call if_changed,zoffset)
 ```
 
-The `$(obj)/compressed/vmlinux` target depends on the `vmlinux-objs-y` that compiles source code files from the [arch/x86/boot/compressed](https://github.com/torvalds/linux/tree/master/arch/x86/boot/compressed) directory and generates `vmlinux.bin`, `vmlinux.bin.bz2`, and compiles programm - `mkpiggy`. We can see this in the output:
+The `$(obj)/compressed/vmlinux` target depends on the `vmlinux-objs-y` that compiles source code files from the [arch/x86/boot/compressed](https://github.com/torvalds/linux/tree/master/arch/x86/boot/compressed) directory and generates `vmlinux.bin`, `vmlinux.bin.bz2`, and compiles program - `mkpiggy`. We can see this in the output:
 
 ```Makefile
   LDS     arch/x86/boot/compressed/vmlinux.lds
@@ -570,7 +570,7 @@ The `$(obj)/compressed/vmlinux` target depends on the `vmlinux-objs-y` that comp
   HOSTCC  arch/x86/boot/compressed/mkpiggy
 ```
 
-Where `vmlinux.bin` is the `vmlinux` file with debuging information and comments stripped and the `vmlinux.bin.bz2` compressed `vmlinux.bin.all` + `u32` size of `vmlinux.bin.all`. The `vmlinux.bin.all` is `vmlinux.bin + vmlinux.relocs`, where `vmlinux.relocs` is the `vmlinux` that was handled by the `relocs` program (see above). As we got these files, the `piggy.S` assembly files will be generated with the `mkpiggy` program and compiled:
+Where `vmlinux.bin` is the `vmlinux` file with debugging information and comments stripped and the `vmlinux.bin.bz2` compressed `vmlinux.bin.all` + `u32` size of `vmlinux.bin.all`. The `vmlinux.bin.all` is `vmlinux.bin + vmlinux.relocs`, where `vmlinux.relocs` is the `vmlinux` that was handled by the `relocs` program (see above). As we got these files, the `piggy.S` assembly files will be generated with the `mkpiggy` program and compiled:
 
 ```Makefile
   MKPIGGY arch/x86/boot/compressed/piggy.S

+ 32 - 28
Misc/linkers.md

@@ -3,11 +3,11 @@ Introduction
 
 During the writing of the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.
 
-If we open the `Linker` page on wikipidia, we will see following definition:
+If we open the `Linker` page on Wikipedia, we will see following definition:
 
 >In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.
 
-If you've written at least one program on C in your life, you will have seen files with the `*.o` extension. These files are [object files](https://en.wikipedia.org/wiki/Object_file). Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of its own functions and data. The main purpose of the linker is collect/handle the code and data of each object file, turning it into the the final executable file or library. In this post we will try to go through all aspects of this process. Let's start.
+If you've written at least one program on C in your life, you will have seen files with the `*.o` extension. These files are [object files](https://en.wikipedia.org/wiki/Object_file). Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of its own functions and data. The main purpose of the linker is collect/handle the code and data of each object file, turning it into the final executable file or library. In this post we will try to go through all aspects of this process. Let's start.
 
 Linking process
 ---------------
@@ -38,7 +38,7 @@ The `lib.c` file contains:
 
 ```C
 int factorial(int base) {
-	int res = 1, i = 1;
+	int res,i = 1;
 	
 	if (base == 0) {
 		return 1;
@@ -118,11 +118,11 @@ $ objdump -S -r main.o
 
 ...
   14:	e8 00 00 00 00       	callq  19 <main+0x19>
-			15: R_X86_64_PC32	factorial-0x4
+  15: R_X86_64_PC32	               factorial-0x4
   19:	89 c6                	mov    %eax,%esi
 ...
   25:	e8 00 00 00 00       	callq  2a <main+0x2a>
-			26: R_X86_64_PC32	printf-0x4
+  26:   R_X86_64_PC32	               printf-0x4
   2a:	b8 00 00 00 00       	mov    $0x0,%eax
 ...
 ```
@@ -136,7 +136,7 @@ Relocation is the process of connecting symbolic references with symbolic defini
 
 ```
   14:	e8 00 00 00 00       	callq  19 <main+0x19>
-			15: R_X86_64_PC32	factorial-0x4
+  15:   R_X86_64_PC32	               factorial-0x4
   19:	89 c6                	mov    %eax,%esi
 ```
 
@@ -301,7 +301,7 @@ call __libc_start_main
 Here we pass address of the entry point to the `.init` and `.fini` section that contain code that starts to execute when the program is ran and the code that executes when program terminates. And in the end we see the call of the `main` function from our program. These three symbols are defined in the [csu/elf-init.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/elf-init.c;hb=1d4bbc54bd4f7d85d774871341b49f4357af1fb7) source code file. The following two object files:
 
 * `crtn.o`;
-* `crtn.i`.
+* `crti.o`.
 
 define the function prologs/epilogs for the .init and .fini sections (with the `_init` and `_fini` symbols respectively).
 
@@ -331,7 +331,7 @@ $ ld \
 -o factorial
 ```
 
-And anyway we will get the same errors. Now we need to pass `-lc` option to the `ld`. This option will search for the standard library in the paths present in the `$LD_LIBRARY_PATH` enviroment variable. Let's try to link again wit the `-lc` option:
+And anyway we will get the same errors. Now we need to pass the `-lc` option to `ld`. This option will search for the standard library in the paths present in the `$LD_LIBRARY_PATH` environment variable. Let's try to link again with the `-lc` option:
 
 ```
 $ ld \
@@ -489,28 +489,32 @@ With the linker language we can control:
 * addresses of sections;
 * etc...
 
-Commands written in the linker control language are usually placed in a file called linker script. We can pass it to `ld` with the `-T` command line option. The main command in a linker script is the `SECTIONS` command. Each linker script must contain this command and it determines the `map` of the output file. The special variable `.` contains current position of the output. Let's write simple assembly program andi we will look at how we can use a linker script to control linking of this program. We will take a hello world program for this example:
+Commands written in the linker control language are usually placed in a file called linker script. We can pass it to `ld` with the `-T` command line option. The main command in a linker script is the `SECTIONS` command. Each linker script must contain this command and it determines the `map` of the output file. The special variable `.` contains current position of the output. Let's write a simple assembly program and we will look at how we can use a linker script to control linking of this program. We will take a hello world program for this example:
 
 ```assembly
-section .data
-	msg	db "hello, world!",`\n`
-section .text
-	global	_start
+.data
+        msg:    .ascii  "hello, world!\n"
+
+.text
+
+.global _start
+
 _start:
-	mov	rax, 1
-	mov	rdi, 1
-	mov	rsi, msg
-	mov	rdx, 14
-	syscall
-	mov	rax, 60
-	mov	rdi, 0
-	syscall
+        mov    $1,%rax
+        mov    $1,%rdi
+        mov    $msg,%rsi
+        mov    $14,%rdx
+        syscall
+
+        mov    $60,%rax
+        mov    $0,%rdi
+        syscall
 ```
 
 We can compile and link it with the following commands:
 
 ```
-$ nasm -f elf64 -o hello.o hello.asm
+$ as -o hello.o hello.asm
 $ ld -o hello hello.o
 ```
 
@@ -538,16 +542,16 @@ SECTIONS
 }
 ```
 
-On the first three lines you can see a comment written in `C` style. After it the `OUTPUT` and the `OUTPUT_FORMAT` commands specifiy the name of our executable file and its format. The next command, `INPUT`, specfies the input file to the `ld` linker. Then, we can see the main `SECTIONS` command, which, as I already wrote, must be present in every linker script. The `SECTIONS` command represents the set and order of the sections which will be in the output file. At the beginning of the `SECTIONS` command we can see following line `. = 0x200000`. I already wrote above that `.` command points to the current position of the output. This line says that the code should be loaded at address `0x200000` and the line `. = 0x400000` says that data section should be loaded at address `0x400000`. The second line after the `. = 0x200000` defines `.text` as an output section. We can see `*(.text)` expression inside it. The `*` symbol is wildcard that matches any file name. In other words, the `*(.text)` expression says all `.text` input sections in all input files. We can rewrite it as `hello.o(.text)` for our example. After the following location counter `. = 0x400000`, we can see definition of the data section.
+On the first three lines you can see a comment written in `C` style. After it the `OUTPUT` and the `OUTPUT_FORMAT` commands specify the name of our executable file and its format. The next command, `INPUT`, specifies the input file to the `ld` linker. Then, we can see the main `SECTIONS` command, which, as I already wrote, must be present in every linker script. The `SECTIONS` command represents the set and order of the sections which will be in the output file. At the beginning of the `SECTIONS` command we can see following line `. = 0x200000`. I already wrote above that `.` command points to the current position of the output. This line says that the code should be loaded at address `0x200000` and the line `. = 0x400000` says that data section should be loaded at address `0x400000`. The second line after the `. = 0x200000` defines `.text` as an output section. We can see `*(.text)` expression inside it. The `*` symbol is wildcard that matches any file name. In other words, the `*(.text)` expression says all `.text` input sections in all input files. We can rewrite it as `hello.o(.text)` for our example. After the following location counter `. = 0x400000`, we can see definition of the data section.
 
-We can compile and link it with the:
+We can compile and link it with the following command:
 
 ```
-$ nasm  -f elf64 -o hello.o hello.S && ld -T linker.script && ./hello
+$ as -o hello.o hello.S && ld -T linker.script && ./hello
 hello, world!
 ```
 
-If we will look insidei it with the `objdump` util, we can see that `.text` section starts from the address `0x200000` and the `.data` sections starts from the address `0x400000`:
+If we look inside it with the `objdump` util, we can see that `.text` section starts from the address `0x200000` and the `.data` sections starts from the address `0x400000`:
 
 ```
 $ objdump -D hello
@@ -555,7 +559,7 @@ $ objdump -D hello
 Disassembly of section .text:
 
 0000000000200000 <_start>:
-  200000:	b8 01 00 00 00       	mov    $0x1,%eax
+  200000:	48 c7 c0 01 00 00 00 	mov    $0x1,%rax
   ...
 
 Disassembly of section .data:
@@ -627,7 +631,7 @@ Please note that English is not my first language, and I am really sorry for any
 Links
 -----------------
 
-* [Book about Linux kernel internals](http://0xax.gitbooks.io/linux-insides/content/)
+* [Book about Linux kernel insides](http://0xax.gitbooks.io/linux-insides/content/)
 * [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29)
 * [object files](https://en.wikipedia.org/wiki/Object_file)
 * [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)

+ 486 - 0
Misc/program_startup.md

@@ -0,0 +1,486 @@
+Program startup process in userspace
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+Although [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) describes mostly Linux kernel related stuff, I have decided to write this part, which is mostly related to userspace.
+
+There is already a fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the [system calls](https://en.wikipedia.org/wiki/System_call) chapter which describes what the Linux kernel does when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine, from the userspace perspective.
+
+I don't know about you, but I learned at university that a `C` program starts executing from the function called `main`. And that's partly true. Every time we start writing a new program, we start it with the following lines of code:
+
+```C
+int main(int argc, char *argv[]) {
+	// Entry point is here
+}
+```
+
+But if you are interested in low-level programming, you may already know that the `main` function isn't the actual entry point of a program. We can make sure of this by looking at this simple program:
+
+```C
+int main(int argc, char *argv[]) {
+	return 0;
+}
+```
+
+in a debugger. Let's compile it and run it in [gdb](https://www.gnu.org/software/gdb/):
+
+```
+$ gcc -ggdb program.c -o program
+$ gdb ./program
+The target architecture is assumed to be i386:x86-64:intel
+Reading symbols from ./program...done.
+```
+
+Let's execute the gdb `info` subcommand with the `files` argument. The `info files` command prints information about debugging targets and the memory areas occupied by different sections.
+
+```
+(gdb) info files
+Symbols from "/home/alex/program".
+Local exec file:
+	`/home/alex/program', file type elf64-x86-64.
+	Entry point: 0x400430
+	0x0000000000400238 - 0x0000000000400254 is .interp
+	0x0000000000400254 - 0x0000000000400274 is .note.ABI-tag
+	0x0000000000400274 - 0x0000000000400298 is .note.gnu.build-id
+	0x0000000000400298 - 0x00000000004002b4 is .gnu.hash
+	0x00000000004002b8 - 0x0000000000400318 is .dynsym
+	0x0000000000400318 - 0x0000000000400357 is .dynstr
+	0x0000000000400358 - 0x0000000000400360 is .gnu.version
+	0x0000000000400360 - 0x0000000000400380 is .gnu.version_r
+	0x0000000000400380 - 0x0000000000400398 is .rela.dyn
+	0x0000000000400398 - 0x00000000004003c8 is .rela.plt
+	0x00000000004003c8 - 0x00000000004003e2 is .init
+	0x00000000004003f0 - 0x0000000000400420 is .plt
+	0x0000000000400420 - 0x0000000000400428 is .plt.got
+	0x0000000000400430 - 0x00000000004005e2 is .text
+	0x00000000004005e4 - 0x00000000004005ed is .fini
+	0x00000000004005f0 - 0x0000000000400610 is .rodata
+	0x0000000000400610 - 0x0000000000400644 is .eh_frame_hdr
+	0x0000000000400648 - 0x000000000040073c is .eh_frame
+	0x0000000000600e10 - 0x0000000000600e18 is .init_array
+	0x0000000000600e18 - 0x0000000000600e20 is .fini_array
+	0x0000000000600e20 - 0x0000000000600e28 is .jcr
+	0x0000000000600e28 - 0x0000000000600ff8 is .dynamic
+	0x0000000000600ff8 - 0x0000000000601000 is .got
+	0x0000000000601000 - 0x0000000000601028 is .got.plt
+	0x0000000000601028 - 0x0000000000601034 is .data
+	0x0000000000601034 - 0x0000000000601038 is .bss
+```
+
+Note the `Entry point: 0x400430` line. Now we know the actual address of the entry point of our program. Let's put a breakpoint at this address, run our program and see what happens:
+
+```
+(gdb) break *0x400430
+Breakpoint 1 at 0x400430
+(gdb) run
+Starting program: /home/alex/program 
+
+Breakpoint 1, 0x0000000000400430 in _start ()
+```
+
+Interesting. We don't see the execution of the `main` function here, but we see that another function is called. This function is `_start` and, as the debugger shows us, it is the actual entry point of our program. Where does this function come from? Who calls `main` and when is it called? I will try to answer all of these questions in this post.
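+
+By the way, we don't need a debugger to see the entry point - a quick sketch with binutils (the address and symbol should match what gdb printed above):
+
+```
+$ readelf -h ./program | grep Entry
+  Entry point address:               0x400430
+$ nm ./program | grep ' _start'
+0000000000400430 T _start
+```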
+
+How the kernel starts a new program
+--------------------------------------------------------------------------------
+
+First of all, let's take the following simple `C` program:
+
+```C
+// program.c
+
+#include <stdlib.h>
+#include <stdio.h>
+
+static int x = 1;
+
+int y = 2;
+
+int main(int argc, char *argv[]) {
+	int z = 3;
+
+	printf("x + y + z = %d\n", x + y + z);
+
+	return EXIT_SUCCESS;
+}
+```
+
+We can be sure that this program works as we expect. Let's compile it:
+
+```
+$ gcc -Wall program.c -o sum
+```
+
+and run:
+
+```
+./sum
+x + y + z = 6
+```
+
+Ok, everything looks pretty good for now. You may already know that there is a special family of [system calls](https://en.wikipedia.org/wiki/System_call) - the [exec*](http://man7.org/linux/man-pages/man3/execl.3.html) system calls. As we may read in the man page:
+
+> The exec() family of functions replaces the current process image with a new process image.
+
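+We can watch this happening from userspace with [strace](https://en.wikipedia.org/wiki/Strace) - a small sketch (output abbreviated; the environment pointer and exact formatting differ between systems):
+
+```
+$ strace -e trace=execve ./sum
+execve("./sum", ["./sum"], ...) = 0
+x + y + z = 6
++++ exited with 0 +++
+```
+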
+If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that, for example, the [execve](http://linux.die.net/man/2/execve) system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c#L1859) source code file and looks like:
+
+```C
+SYSCALL_DEFINE3(execve,
+		const char __user *, filename,
+		const char __user *const __user *, argv,
+		const char __user *const __user *, envp)
+{
+	return do_execve(getname(filename), argv, envp);
+}
+```
+
+It takes the executable file name, a set of command line arguments and a set of environment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about it [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html). In short, the `do_execve` function does many checks, like that `filename` is valid, that the limit of launched processes in our system is not exceeded, etc. After all of these checks, this function parses our executable file, which is represented in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates a memory descriptor for the newly executed executable file and fills it with the appropriate values like the areas for the stack, heap, etc. When the setup of the new binary image is done, the `start_thread` function will execute the setup of the new process. This function is architecture-specific and, for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, its definition is located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/process_64.c#L231) source code file.
+
+The `start_thread` function sets new values for the [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and the program execution address. From this point, the new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) is done, control is returned to userspace with new register values and the new executable starts to execute.
+
+That's all from the kernel side. The Linux kernel prepares the binary image for execution, and its execution starts right after the context switch returns control to userspace. But this does not answer questions like where `_start` comes from. Let's try to answer these questions in the next paragraph.
+
+How a program starts in userspace
+--------------------------------------------------------------------------------
+
+In the previous paragraph we saw how an executable file is prepared for running by the Linux kernel. Let's look at the same thing, but from the userspace side. We already know that the entry point of each program is its `_start` function. But where does this function come from? It may come from a library. But if you remember correctly, we didn't link our program with any libraries during compilation:
+
+```
+$ gcc -Wall program.c -o sum
+```
+
+You may guess that `_start` comes from the [standard library](https://en.wikipedia.org/wiki/Standard_library), and that's true. If we try to compile our program again and pass the `-v` option to gcc, which enables `verbose mode`, we will see a long output. The full output is not interesting for us, so let's look at the following steps. First of all our program will be compiled with `cc1`:
+
+```
+$ gcc -v -ggdb program.c -o sum
+...
+...
+...
+/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/cc1 -quiet -v program.c -quiet -dumpbase program.c -mtune=generic -march=x86-64 -auxbase test -ggdb -version -o /tmp/ccvUWZkF.s
+...
+...
+...
+```
+
+The `cc1` compiler will compile our `C` source code and produce the assembly file `/tmp/ccvUWZkF.s`. After this we may see that our assembly file is assembled into an object file with the `GNU as` assembler:
+
+```
+$ gcc -v -ggdb program.c -o sum
+...
+...
+...
+as -v --64 -o /tmp/cc79wZSU.o /tmp/ccvUWZkF.s
+...
+...
+...
+```
+
+And in the end our object file will be linked with `collect2`:
+
+```
+$ gcc -v -ggdb program.c -o sum
+...
+...
+...
+/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.1.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccLEGYra.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o test /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L. -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../.. /tmp/cc79wZSU.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crtn.o
+...
+...
+...
+```
+
+Yes, we see a long set of command line options which are passed to the linker. Let's go another way. We know that our program depends on the standard C library:
+
+```
+~$ ldd program
+	linux-vdso.so.1 (0x00007ffc9afd2000)
+	libc.so.6 => /lib64/libc.so.6 (0x00007f56b389b000)
+	/lib64/ld-linux-x86-64.so.2 (0x0000556198231000)
+```
+
+as we use some functions from there, like `printf`. But not only those. That's why we will get an error if we pass the `-nostdlib` option to the compiler:
+
+```
+~$ gcc -nostdlib program.c -o program
+/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 000000000040017c
+/tmp/cc02msGW.o: In function `main':
+/home/alex/program.c:11: undefined reference to `printf'
+collect2: error: ld returned 1 exit status
+```
+
+Besides other errors, we also see that the `_start` symbol is undefined. So now we are sure that the `_start` function comes from the standard library. But even if we link it with the standard library, it will still not build successfully:
+
+```
+~$ gcc -nostdlib -lc -ggdb program.c -o program
+/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400350
+```
+
+Ok, the compiler no longer complains about undefined references to standard library functions, as we linked our program with `/usr/lib64/libc.so.6`, but the `_start` symbol still isn't resolved. Let's return to the verbose output of `gcc` and look at the parameters of `collect2`. The first important thing that we may see is that our program is linked not only with the standard library, but also with some object files. The first object file is `/lib64/crt1.o`, and if we look inside this object file with the `objdump` util, we will see the `_start` symbol:
+
+```
+$ objdump -d /lib64/crt1.o 
+
+/lib64/crt1.o:     file format elf64-x86-64
+
+
+Disassembly of section .text:
+
+0000000000000000 <_start>:
+   0:	31 ed                	xor    %ebp,%ebp
+   2:	49 89 d1             	mov    %rdx,%r9
+   5:	5e                   	pop    %rsi
+   6:	48 89 e2             	mov    %rsp,%rdx
+   9:	48 83 e4 f0          	and    $0xfffffffffffffff0,%rsp
+   d:	50                   	push   %rax
+   e:	54                   	push   %rsp
+   f:	49 c7 c0 00 00 00 00 	mov    $0x0,%r8
+  16:	48 c7 c1 00 00 00 00 	mov    $0x0,%rcx
+  1d:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
+  24:	e8 00 00 00 00       	callq  29 <_start+0x29>
+  29:	f4                   	hlt    
+```
+
+As `crt1.o` is a relocatable object file that has not been linked yet, we see only stubs (zeroed operands) here instead of real addresses. Let's look at the source code of the `_start` function. As this function is architecture-specific, the implementation of `_start` is located in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file.
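+
+Those zeroed operands are placeholders that the linker fills in via relocations. A quick way to see which symbols `crt1.o` actually refers to - a sketch (the path and the exact relocation types vary between distributions):
+
+```
+$ objdump -r /lib64/crt1.o    # lists relocations for main, __libc_csu_init, __libc_csu_fini, __libc_start_main, ...
+```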
+
+`_start` begins by clearing the `%ebp` register, as the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf) suggests:
+
+```assembly
+xorl %ebp, %ebp
+```
+
+And after this we put the address of the termination function into the `%r9` register:
+
+```assembly
+mov %RDX_LP, %R9_LP
+```
+
+As described in the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:
+
+> After the dynamic linker has built the process image and performed the relocations, each shared object
+> gets the opportunity to execute some initialization code.
+> ...
+> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
+> mechanism after the base process begins its termination sequence.
+
+So we need to put the address of the termination function into the `%r9` register, as it will later be passed to `__libc_start_main` as its sixth argument. Note that initially the address of the termination function is located in the `%rdx` register. Other registers besides `%rdx` and `%rsp` contain unspecified values. Actually, the main point of the `_start` function is to call `__libc_start_main`, so the next actions are preparations for calling this function.
+
+The signature of the `__libc_start_main` function is located in the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file. Let's look at it:
+
+```C
+STATIC int LIBC_START_MAIN (int (*main) (int, char **, char **),
+ 			                int argc,
+			                char **argv,
+ 			                __typeof (main) init,
+			                void (*fini) (void),
+			                void (*rtld_fini) (void),
+			                void *stack_end)
+```
+
+It takes the address of the `main` function of a program, `argc` and `argv`. The `init` and `fini` functions are the constructor and destructor of the program. The `rtld_fini` function is a termination function which will be called after the program exits, to terminate and free the dynamic section. The last parameter of `__libc_start_main` is a pointer to the program's stack. Before we can call the `__libc_start_main` function, all of these parameters must be prepared and passed to it. Let's return to the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and continue to see what happens before the `__libc_start_main` function is called from there.
+
+Everything we need for the `__libc_start_main` function we can get from the stack. When `_start` is called, our stack looks like this:
+
+```
++-----------------+
+|       NULL      |
++-----------------+ 
+|       envp      |
++-----------------+ 
+|       NULL      |
++------------------
+|       argv      | <- %rsp
++------------------
+|       argc      |
++-----------------+ 
+```
+
+In the next step, after we have cleared the `%ebp` register and saved the address of the termination function in the `%r9` register, we pop an element from the stack into the `%rsi` register, so after this `%rsp` will point to the `argv` array and `%rsi` will contain the count of command line arguments passed to the program:
+
+```
++-----------------+
+|       NULL      |
++-----------------+ 
+|       envp      |
++-----------------+ 
+|       NULL      |
++------------------
+|       argv      | <- %rsp
++-----------------+
+```
+
+And after this we may move the address of the `argv` array into the `%rdx` register:
+
+```assembly
+popq %rsi
+mov %RSP_LP, %RDX_LP
+```
+
+From this moment we have `argc` and `argv`. We still need to put the pointers to the constructor and destructor into the appropriate registers and pass a pointer to the stack. In the first three of the following lines we align the stack to a `16`-byte boundary as suggested by the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf) and push `%rax` which contains garbage:
+
+```assembly
+and  $~15, %RSP_LP
+pushq %rax
+
+pushq %rsp
+mov $__libc_csu_fini, %R8_LP
+mov $__libc_csu_init, %RCX_LP
+mov $main, %RDI_LP
+```
+
+After aligning the stack we push the address of the stack, move the addresses of the destructor and constructor into the `%r8` and `%rcx` registers, and move the address of the `main` symbol into `%rdi`. From this moment we may call the `__libc_start_main` function from [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD).
+
+Before we look at the `__libc_start_main` function, let's add `/lib64/crt1.o` and try to compile our program again:
+
+```
+$ gcc -nostdlib /lib64/crt1.o -lc -ggdb program.c -o program
+/lib64/crt1.o: In function `_start':
+(.text+0x12): undefined reference to `__libc_csu_fini'
+/lib64/crt1.o: In function `_start':
+(.text+0x19): undefined reference to `__libc_csu_init'
+collect2: error: ld returned 1 exit status
+```
+
+Now we see another error: both the `__libc_csu_fini` and `__libc_csu_init` functions are not found. We know that the addresses of both of these functions are passed to `__libc_start_main` as parameters and that these functions are the constructor and destructor of our program. But what do `constructor` and `destructor` mean in terms of a `C` program? We already saw the quote from the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:
+
+> After the dynamic linker has built the process image and performed the relocations, each shared object
+> gets the opportunity to execute some initialization code.
+> ...
+> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
+> mechanism after the base process begins its termination sequence.
+
+So the linker, besides the usual sections like `.text`, `.data` and others, creates two special sections:
+
+* `.init`
+* `.fini`
+
+We can find them with the `readelf` util:
+
+```
+~$ readelf -e test | grep init
+  [11] .init             PROGBITS         00000000004003c8  000003c8
+
+~$ readelf -e test | grep fini
+  [15] .fini             PROGBITS         0000000000400504  00000504
+```
+
+Both of these sections will be placed at the start and end of the binary image and contain routines which are called the constructor and destructor respectively. The main point of these routines is to do some initialization/finalization, like initialization of global variables such as [errno](http://man7.org/linux/man-pages/man3/errno.3.html), allocation and deallocation of memory for system routines and so on, before the actual code of a program is executed.
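+
+As an aside (this example is mine, not part of the files we are inspecting), `gcc` also lets us register our own constructor and destructor routines through function attributes; on modern toolchains these typically end up in the `.init_array`/`.fini_array` sections rather than `.init`/`.fini`, but the idea is the same - they run before `main` and after it finishes:
+
+```C
+#include <stdio.h>
+
+/* runs before main() */
+__attribute__((constructor))
+static void my_ctor(void)
+{
+	printf("constructor\n");
+}
+
+/* runs after main() returns or exit() is called */
+__attribute__((destructor))
+static void my_dtor(void)
+{
+	printf("destructor\n");
+}
+
+int main(void)
+{
+	printf("main\n");
+	return 0;
+}
+```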
+
+As you may understand from the names of these functions, they will be called before `main` and after the `main` function finishes. The definitions of the `.init` and `.fini` sections are located in the `/lib64/crti.o` and if we add this object file:
+
+```
+$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o  -lc -ggdb program.c -o program
+```
+
+we will not get any errors. But let's try to run our program and see what happens:
+
+```
+$ ./program
+Segmentation fault (core dumped)
+```
+
+Yeah, we got a segmentation fault. Let's look inside of the `/lib64/crti.o` with the `objdump` util:
+
+```
+~$ objdump -D /lib64/crti.o
+
+/lib64/crti.o:     file format elf64-x86-64
+
+
+Disassembly of section .init:
+
+0000000000000000 <_init>:
+   0:	48 83 ec 08          	sub    $0x8,%rsp
+   4:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # b <_init+0xb>
+   b:	48 85 c0             	test   %rax,%rax
+   e:	74 05                	je     15 <_init+0x15>
+  10:	e8 00 00 00 00       	callq  15 <_init+0x15>
+
+Disassembly of section .fini:
+
+0000000000000000 <_fini>:
+   0:	48 83 ec 08          	sub    $0x8,%rsp
+```
+
+As I wrote above, the `/lib64/crti.o` object file contains definitions of the `.init` and `.fini` sections, but we can also see here a stub of a function. Let's look at the source code which is placed in the [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) source code file:
+
+```assembly
+	.section .init,"ax",@progbits
+	.p2align 2
+	.globl _init
+	.type _init, @function
+_init:
+	subq $8, %rsp
+	movq PREINIT_FUNCTION@GOTPCREL(%rip), %rax
+	testq %rax, %rax
+	je .Lno_weak_fn
+	call *%rax
+.Lno_weak_fn:
+	call PREINIT_FUNCTION
+```
+
+It contains the definition of the `.init` section, and the assembly code does `16`-byte stack alignment. Next we move the address of the `PREINIT_FUNCTION` to `%rax` and if it is zero we don't call it:
+
+```
+00000000004003c8 <_init>:
+  4003c8:       48 83 ec 08             sub    $0x8,%rsp
+  4003cc:       48 8b 05 25 0c 20 00    mov    0x200c25(%rip),%rax        # 600ff8 <_DYNAMIC+0x1d0>
+  4003d3:       48 85 c0                test   %rax,%rax
+  4003d6:       74 05                   je     4003dd <_init+0x15>
+  4003d8:       e8 43 00 00 00          callq  400420 <__libc_start_main@plt+0x10>
+  4003dd:       48 83 c4 08             add    $0x8,%rsp
+  4003e1:       c3                      retq
+```
+
+where the `PREINIT_FUNCTION` is the `__gmon_start__` which does setup for profiling. You may note that we have no return instruction in the [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD). Actually that's why we got the segmentation fault. The epilogues of `_init` and `_fini` are placed in the [sysdeps/x86_64/crtn.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crtn.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) assembly file:
+
+```assembly
+.section .init,"ax",@progbits
+addq $8, %rsp
+ret
+
+.section .fini,"ax",@progbits
+addq $8, %rsp
+ret
+```
+
+and if we add it to the compilation, our program will be successfully compiled and run!
+
+```
+~$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o /lib64/crtn.o  -lc -ggdb program.c -o program
+
+~$ ./program
+x + y + z = 6
+```
+
+Conclusion
+--------------------------------------------------------------------------------
+
+Now let's return to the `_start` function and try to go through the full chain of calls before the `main` of our program is called.
+
+The `_start` is always placed at the beginning of the `.text` section in our programs by the linker which uses the default `ld` script:
+
+```
+~$ ld --verbose | grep ENTRY
+ENTRY(_start)
+```
+
+The `_start` function is defined in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and does preparation like getting `argc/argv` from the stack, stack preparation and so on, before the `__libc_start_main` function is called. The `__libc_start_main` function from the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file does the registration of the constructor and destructor of the application which will be called before `main` and after it, starts up threading, does some security related actions like setting the stack canary if needed, calls initialization related routines and in the end it calls the `main` function of our application and exits with its result:
+
+```C
+result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
+exit (result);
+```
+
+That's all.
+
+Links
+--------------------------------------------------------------------------------
+
+* [system call](https://en.wikipedia.org/wiki/System_call)
+* [gdb](https://www.gnu.org/software/gdb/)
+* [execve](http://linux.die.net/man/2/execve)
+* [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation)
+* [context switch](https://en.wikipedia.org/wiki/Context_switch)
+* [System V ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf)

+ 10 - 4
README.md

@@ -1,18 +1,24 @@
 linux-insides
 ===============
 
-A series of posts about the linux kernel and its insides.
+A book-in-progress about the linux kernel and its insides.
 
-**The goal is simple** - to share my modest knowledge about the internals of the linux kernel and help people who are interested in linux kernel internals, and other low-level subject matter.
+**The goal is simple** - to share my modest knowledge about the insides of the linux kernel and help people who are interested in linux kernel insides, and other low-level subject matter.
 
-**Questions/Suggestions**: Feel free about any questions or suggestions by pinging me at twitter [@0xAX](https://twitter.com/0xAX), adding an [issue](https://github.com/0xAX/linux-internals/issues/new) or just drop me an [email](mailto:anotherworldofworld@gmail.com).
+**Questions/Suggestions**: Feel free about any questions or suggestions by pinging me at twitter [@0xAX](https://twitter.com/0xAX), adding an [issue](https://github.com/0xAX/linux-insides/issues/new) or just drop me an [email](mailto:anotherworldofworld@gmail.com).
 
 Support
 -------
 
 **Support** If you like `linux-insides` you can support me with: 
 
-[![Flattr linux-insides](https://img.shields.io/badge/donate-flattr-green.svg)](https://flattr.com/submit/auto?user_id=0xAX&url=https://github.com/0xAX/linux-insides/&title=linux-insed) [![Support at gratipay](http://img.shields.io/gratipay/0xAX.svg)](https://gratipay.com/0xAX/) [![Support with bitcoin](https://img.shields.io/badge/donate-bitcoin-green.svg)](https://www.coinbase.com/checkouts/0bfa452a41cf52c0b3f99500b4f31685) [![Support via gitbook](https://img.shields.io/badge/donate-gitbook-green.svg)](https://gumroad.com/l/gitbook_54c9232c1db1670300055523?wanted=true)
+[![Flattr linux-insides](https://img.shields.io/badge/donate-flattr-green.svg)](https://flattr.com/submit/auto?user_id=0xAX&url=https://github.com/0xAX/linux-insides/&title=linux-insed) [![Support at gratipay](https://img.shields.io/gratipay/0xAX.svg)](https://gratipay.com/~0xAX/) [![Support with bitcoin](https://img.shields.io/badge/donate-bitcoin-green.svg)](https://www.coinbase.com/checkouts/0bfa452a41cf52c0b3f99500b4f31685) [![Support via gitbook](https://img.shields.io/badge/donate-gitbook-green.svg)](https://gumroad.com/l/gitbook_54c9232c1db1670300055523?wanted=true) [![Join the chat at https://gitter.im/0xAX/linux-insides](https://badges.gitter.im/0xAX/linux-insides.svg)](https://gitter.im/0xAX/linux-insides?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
+
+On other languages
+-------------------
+
+  * [Chinese](https://github.com/MintCN/linux-insides-zh)
+  * [Spanish](https://github.com/leolas95/linux-insides)
 
 LICENSE
 -------------

+ 31 - 3
SUMMARY.md

@@ -28,29 +28,57 @@
     * [Initialization of external hardware interrupts structures](interrupts/interrupts-8.md)
     * [Softirq, Tasklets and Workqueues](interrupts/interrupts-9.md)
     * [Last part](interrupts/interrupts-10.md)
+* [System calls](SysCall/README.md)
+    * [Introduction to system calls](SysCall/syscall-1.md)
+    * [How the Linux kernel handles a system call](SysCall/syscall-2.md)
+    * [vsyscall and vDSO](SysCall/syscall-3.md)
+    * [How the Linux kernel runs a program](SysCall/syscall-4.md)
+* [Timers and time management](Timers/README.md)
+    * [Introduction](Timers/timers-1.md)
+    * [Clocksource framework](Timers/timers-2.md)
+    * [The tick broadcast framework and dyntick](Timers/timers-3.md)
+    * [Introduction to timers](Timers/timers-4.md)
+    * [Clockevents framework](Timers/timers-5.md)
+    * [x86 related clock sources](Timers/timers-6.md)
+    * [Time related system calls](Timers/timers-7.md)
+* [Synchronization primitives](SyncPrim/README.md)
+    * [Introduction to spinlocks](SyncPrim/sync-1.md)
+    * [Queued spinlocks](SyncPrim/sync-2.md)
+    * [Semaphores](SyncPrim/sync-3.md)
+    * [Mutex](SyncPrim/sync-4.md)
+    * [Reader/Writer semaphores](SyncPrim/sync-5.md)
+    * [SeqLock](SyncPrim/sync-6.md)
+    * [RCU]()
+    * [Lockdep]()
 * [Memory management](mm/README.md)
     * [Memblock](mm/linux-mm-1.md)
     * [Fixmaps and ioremap](mm/linux-mm-2.md)
-* [System calls](SysCall/README.md)
-    * [Introduction to system calls](SysCall/syscall-1.md)
+    * [kmemcheck](mm/mm-3.md)
 * [SMP]()
 * [Concepts](Concepts/README.md)
     * [Per-CPU variables](Concepts/per-cpu.md)
     * [Cpumasks](Concepts/cpumask.md)
+    * [The initcall mechanism](Concepts/initcall.md)
 * [Data Structures in the Linux Kernel](DataStructures/README.md)
     * [Doubly linked list](DataStructures/dlist.md)
     * [Radix tree](DataStructures/radix-tree.md)
+    * [Bit arrays](DataStructures/bitmap.md)
 * [Theory](Theory/README.md)
     * [Paging](Theory/Paging.md)
     * [Elf64](Theory/ELF.md)
+    * [Inline assembly](Theory/asm.md)
     * [CPUID]()
     * [MSR]()
 * [Initial ram disk]()
    * [initrd]()
 * [Misc](Misc/README.md)
-    * [How kernel compiled](Misc/how_kernel_compiled.md)
+    * [How the kernel is compiled](Misc/how_kernel_compiled.md)
     * [Linkers](Misc/linkers.md)
+    * [Linux kernel development](Misc/contribute.md)
+    * [Program startup process in userspace](Misc/program_startup.md)
     * [Write and Submit your first Linux kernel Patch]()
     * [Data types in the kernel]()
+* [KernelStructures](KernelStructures/README.md)
+    * [IDT](KernelStructures/idt.md)    
 * [Useful links](LINKS.md)
 * [Contributors](contributors.md)

+ 10 - 0
SyncPrim/README.md

@@ -0,0 +1,10 @@
+# Synchronization primitives in the Linux kernel.
+
+This chapter describes synchronization primitives in the Linux kernel.
+
+* [Introduction to spinlocks](http://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) - the first part of this chapter describes implementation of spinlock mechanism in the Linux kernel.
+* [Queued spinlocks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) - the second part describes another type of spinlocks - queued spinlocks.
+* [Semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) - this part describes implementation of `semaphore` synchronization primitive in the Linux kernel.
+* [Mutual exclusion](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html) - this part describes the `mutex` synchronization primitive in the Linux kernel.
+* [Reader/Writer semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) - this part describes special type of semaphores - `reader/writer` semaphores.
+* [Sequential locks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-6.html) - this part describes sequential locks in the Linux kernel.

+ 433 - 0
SyncPrim/sync-1.md

@@ -0,0 +1,433 @@
+Synchronization primitives in the Linux kernel. Part 1.
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This part opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. Timers and time management related stuff was described in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html). Now it is time to move on. As you may understand from the part's title, this chapter will describe [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel.
+
+As always, before we consider anything synchronization related, we will try to understand what a `synchronization primitive` is in general. Actually, a synchronization primitive is a software mechanism which provides the ability for two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads to not execute the same segment of code simultaneously. For example, let's look at the following piece of code:
+
+```C
+mutex_lock(&clocksource_mutex);
+...
+...
+...
+clocksource_enqueue(cs);
+clocksource_enqueue_watchdog(cs);
+clocksource_select();
+...
+...
+...
+mutex_unlock(&clocksource_mutex);
+```
+
+from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function which adds the given [clocksource](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) to the clock sources list. This function produces different operations on a list with registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list of registered clocksources - `clocksource_list`. Note that these lines of code are wrapped in two functions: `mutex_lock` and `mutex_unlock`, which take one parameter - the `clocksource_mutex` in our case.
+
+These functions represent locking and unlocking based on the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` is executed, it allows us to prevent the situation when two or more threads execute this code while the `mutex_unlock` has not yet been executed by the process-owner of the mutex. In other words, we prevent parallel operations on the `clocksource_list`. Why do we need a `mutex` here? What if two parallel processes try to register a clock source? As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after a clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system):
+
+```C
+static void clocksource_enqueue(struct clocksource *cs)
+{
+	struct list_head *entry = &clocksource_list;
+	struct clocksource *tmp;
+
+	list_for_each_entry(tmp, &clocksource_list, list)
+		if (tmp->rating >= cs->rating)
+			entry = &tmp->list;
+	list_add(&cs->list, entry);
+}
+```
+
+If two parallel processes try to do it simultaneously, both processes may find the same `entry` and a [race condition](https://en.wikipedia.org/wiki/Race_condition) may occur; in other words, the second process which executes `list_add` will overwrite the clock source added by the first one.
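+
+The following sketch (an illustration only, not real kernel code) shows one possible interleaving in which the second `list_add` undoes the work of the first:
+
+```C
+/*
+ *   CPU 0                                 CPU 1
+ *   -----                                 -----
+ *   entry = <position found in the list>
+ *                                         entry = <the same position>
+ *   list_add(&cs0->list, entry);
+ *                                         list_add(&cs1->list, entry);
+ *
+ * If CPU 1 read entry->next before CPU 0 updated it, the second list_add()
+ * rewrites entry->next and the clock source inserted by CPU 0 is lost.
+ */
+```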
+
+Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we go through the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) or other chapters again, or if we look at the Linux kernel source code in general, we will meet many places like this. We will not consider how `mutex` is implemented in the Linux kernel here. Actually, the Linux kernel provides a set of different synchronization primitives like:
+
+* `mutex`;
+* `semaphores`; 
+* `seqlocks`;
+* `atomic operations`;
+* etc.
+
+We will start this chapter from the `spinlock`.
+
+Spinlocks in the Linux kernel.
+--------------------------------------------------------------------------------
+
+The `spinlock` is a low-level synchronization mechanism which, in simple words, represents a variable which can be in two states:
+
+* `acquired`;
+* `released`.
+
+Each process which wants to acquire a `spinlock` must write the value which represents the `spinlock acquired` state to this variable, and write the `spinlock released` state back when it releases the lock. If a process tries to execute code which is protected by a `spinlock`, it will spin until the process which holds this lock releases it. In this case all related operations must be [atomic](https://en.wikipedia.org/wiki/Linearizability) to prevent [race conditions](https://en.wikipedia.org/wiki/Race_condition). The `spinlock` is represented by the `spinlock_t` type in the Linux kernel. If we look at the Linux kernel code, we will see that this type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used. The `spinlock_t` is defined as:
+
+```C
+typedef struct spinlock {
+        union {
+              struct raw_spinlock rlock;
+ 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
+                struct {
+                        u8 __padding[LOCK_PADSIZE];
+                        struct lockdep_map dep_map;
+                };
+#endif
+        };
+} spinlock_t;
+```
+
+and located in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We may see that its implementation depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option. We will skip this now, because all debugging related stuff will be at the end of this part. So, if the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option is disabled, the `spinlock_t` contains a [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with one field which is a `raw_spinlock`:
+
+```C
+typedef struct spinlock {
+        union {
+              struct raw_spinlock rlock;
+        };
+} spinlock_t;
+```
+
+The `raw_spinlock` structure is defined in the [same](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and represents the implementation of a `normal` spinlock. Let's look at how the `raw_spinlock` structure is defined:
+
+```C
+typedef struct raw_spinlock {
+        arch_spinlock_t raw_lock;
+#ifdef CONFIG_GENERIC_LOCKBREAK
+        unsigned int break_lock;
+#endif
+} raw_spinlock_t;
+```
+
+where the `arch_spinlock_t` represents the architecture-specific `spinlock` implementation and the `break_lock` field holds the value `1` when one processor starts to wait while the lock is held on another processor on [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) systems. This helps prevent locking for a long time. As we consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the `arch_spinlock_t` is defined in the [arch/x86/include/asm/spinlock_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock_types.h) header file and looks like:
+
+```C
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else
+typedef struct arch_spinlock {
+        union {
+                __ticketpair_t head_tail;
+                struct __raw_tickets {
+                        __ticket_t head, tail;
+                } tickets;
+        };
+} arch_spinlock_t;
+```
+
+As we may see, the definition of the `arch_spinlock` structure depends on the value of the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option. When this configuration option is enabled, the Linux kernel supports `spinlocks` with a queue. This is a special type of `spinlocks` which, instead of the `acquired` and `released` [atomic](https://en.wikipedia.org/wiki/Linearizability) values, uses `atomic` operations on a `queue`. If the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled, the `arch_spinlock_t` will be represented by the following structure:
+
+```C
+typedef struct qspinlock {
+	atomic_t	val;
+} arch_spinlock_t;
+```
+
+from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file.
+
+We will not stop on these structures for now; before we consider both the `arch_spinlock` and the `qspinlock`, let's look at the operations on a spinlock. The Linux kernel provides the following main operations on a `spinlock` (a short usage sketch follows the list):
+
+* `spin_lock_init` - produces initialization of the given `spinlock`;
+* `spin_lock` - acquires given `spinlock`;
+* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquire given `spinlock`.
+* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on local processor and preserve/not preserve previous interrupt state in the `flags`;
+* `spin_unlock` - releases given `spinlock`;
+* `spin_unlock_bh` - releases given `spinlock` and enables software interrupts;
+* `spin_is_locked` - returns the state of the given `spinlock`;
+* and etc.
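+
+A minimal, hypothetical sketch of how this API is typically used in kernel code follows; the `my_lock`/`my_counter` names are made up for illustration:
+
+```C
+#include <linux/spinlock.h>
+
+static DEFINE_SPINLOCK(my_lock);   /* statically initialized spinlock */
+static unsigned long my_counter;
+
+void my_increment(void)
+{
+	unsigned long flags;
+
+	/* disable interrupts on the local processor and acquire the lock */
+	spin_lock_irqsave(&my_lock, flags);
+	my_counter++;
+	/* release the lock and restore the previous interrupt state */
+	spin_unlock_irqrestore(&my_lock, flags);
+}
+```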
+
+Let's look at the implementation of the `spin_lock_init` macro. As I already wrote, this and other macros are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and the `spin_lock_init` macro looks like:
+
+```C
+#define spin_lock_init(_lock)		\
+do {							                \
+	spinlock_check(_lock);				        \
+	raw_spin_lock_init(&(_lock)->rlock);		\
+} while (0)
+```
+
+As we may see, the `spin_lock_init` macro takes a `spinlock` and executes two operations: check the given `spinlock` and execute the `raw_spin_lock_init`. The implementation of the `spinlock_check` is pretty easy, this function just returns the `raw_spinlock_t` of the given `spinlock` to be sure that we got exactly a `normal` raw spinlock:
+
+```C
+static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
+{
+	return &lock->rlock;
+}
+```
+
+The `raw_spin_lock_init` macro:
+
+```C
+# define raw_spin_lock_init(lock)		\
+do {                                                  \
+    *(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock);         \
+} while (0)                                           \
+```
+
+assigns the value of the `__RAW_SPIN_LOCK_UNLOCKED` with the given `spinlock` to the given `raw_spinlock_t`. As we may understand from the name of the `__RAW_SPIN_LOCK_UNLOCKED` macro, this macro does the initialization of the given `spinlock` and sets it to the `released` state. This macro is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and expands to the following macros:
+
+```C
+#define __RAW_SPIN_LOCK_UNLOCKED(lockname)      \
+         (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)
+
+#define __RAW_SPIN_LOCK_INITIALIZER(lockname)   \
+         {                                                      \
+             .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,             \
+             SPIN_DEBUG_INIT(lockname)                          \
+             SPIN_DEP_MAP_INIT(lockname)                        \
+         }
+```
+
+As I already wrote above, we will not consider stuff which is related to the debugging of synchronization primitives. In this case we will not consider the `SPIN_DEBUG_INIT` and the `SPIN_DEP_MAP_INIT` macros. So the `__RAW_SPIN_LOCK_UNLOCKED` macro will be expanded to:
+
+```C
+*(&(_lock)->rlock) = __ARCH_SPIN_LOCK_UNLOCKED;
+```
+
+where the `__ARCH_SPIN_LOCK_UNLOCKED` is:
+
+```C
+#define __ARCH_SPIN_LOCK_UNLOCKED       { { 0 } }
+```
+
+and:
+
+```C
+#define __ARCH_SPIN_LOCK_UNLOCKED       { ATOMIC_INIT(0) }
+```
+
+for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled. So, after the expansion of the `spin_lock_init` macro, a given `spinlock` will be initialized and its state will be `unlocked`.
+
+From this moment we know how to initialize a `spinlock`, now let's consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulation of `spinlocks`. The first is:
+
+```C
+static __always_inline void spin_lock(spinlock_t *lock)
+{
+	raw_spin_lock(&lock->rlock);
+}
+```
+
+function which allows us to `acquire` a spinlock. The `raw_spin_lock` macro is defined in the same header file and expands to the call of the `_raw_spin_lock` function:
+
+```C
+#define raw_spin_lock(lock)	_raw_spin_lock(lock)
+```
+
+As we may see in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, definition of the `_raw_spin_lock` macro depends on the `CONFIG_SMP` kernel configuration parameter:
+
+```C
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+# include <linux/spinlock_api_smp.h>
+#else
+# include <linux/spinlock_api_up.h>
+#endif
+```
+
+So, if the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) is enabled in the Linux kernel, the `_raw_spin_lock` macro is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) header file and looks like:
+
+```C
+#define _raw_spin_lock(lock) __raw_spin_lock(lock)
+```
+
+The `__raw_spin_lock` function looks:
+
+```C
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
+{
+        preempt_disable();
+        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+}
+```
+
+As you may see, first of all we disable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) by calling the `preempt_disable` macro from [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (more about this you may read in the ninth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again:
+
+```C
+static inline void __raw_spin_unlock(raw_spinlock_t *lock)
+{
+        ...
+        ...
+        ...
+        preempt_enable();
+}
+```
+
+We need to do this because while a process is spinning on a lock, other processes must be prevented from preempting the process which acquired the lock. The `spin_acquire` macro, through a chain of other macros, expands to the call of the:
+
+```C
+#define spin_acquire(l, s, t, i)                lock_acquire_exclusive(l, s, t, NULL, i)
+#define lock_acquire_exclusive(l, s, t, n, i)           lock_acquire(l, s, t, 0, 1, n, i)
+```
+
+`lock_acquire` function:
+
+```C
+void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
+                  int trylock, int read, int check,
+                  struct lockdep_map *nest_lock, unsigned long ip)
+{
+         unsigned long flags;
+
+         if (unlikely(current->lockdep_recursion))
+                return;
+ 
+         raw_local_irq_save(flags);
+         check_flags(flags);
+ 
+         current->lockdep_recursion = 1;
+         trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
+         __lock_acquire(lock, subclass, trylock, read, check,
+                        irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
+         current->lockdep_recursion = 0;
+         raw_local_irq_restore(flags);
+}
+```
+
+As I wrote above, we will not consider stuff here which is related to debugging or tracing. The main point of the `lock_acquire` function is to disable hardware interrupts by calling the `raw_local_irq_save` macro, because the given spinlock might be acquired with enabled hardware interrupts. In this way the process will not be preempted. Note that at the end of the `lock_acquire` function we enable hardware interrupts again with the help of the `raw_local_irq_restore` macro. As you may already guess, the main work is done in the `__lock_acquire` function which is defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c) source code file.
+
+The `__lock_acquire` function looks big. We will try to understand what this function does, but not in this part. Actually this function is mostly related to the Linux kernel [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) and it is not the topic of this part. If we return to the definition of the `__raw_spin_lock` function, we will see that it contains the following at the end:
+
+```C
+LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+```
+
+The `LOCK_CONTENDED` macro is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/master/include/linux/lockdep.h) header file and just calls the given function with the given `spinlock`:
+
+```C
+#define LOCK_CONTENDED(_lock, try, lock) \
+         lock(_lock)
+```
+
+In our case, the `lock` is the `do_raw_spin_lock` function from the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and the `_lock` is the given `raw_spinlock_t`:
+
+```C
+static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
+{
+        __acquire(lock);
+         arch_spin_lock(&lock->raw_lock);
+}
+```
+
+The `__acquire` here is just a [sparse](https://en.wikipedia.org/wiki/Sparse) related macro and we are not interested in it at this moment. The location of the definition of the `arch_spin_lock` function depends on two things: the first is the architecture of the system and the second is whether we use `queued spinlocks` or not. In our case we consider only the `x86_64` architecture, so the definition of the `arch_spin_lock` is represented as a macro from the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file:
+
+```C
+#define arch_spin_lock(l)               queued_spin_lock(l)
+```
+
+if we are using `queued spinlocks`. Otherwise, the `arch_spin_lock` function is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) header file. For now we will consider only the `normal spinlock`; information related to `queued spinlocks` we will see later. Let's look again at the definition of the `arch_spinlock` structure, to understand the implementation of the `arch_spin_lock` function:
+
+```C
+typedef struct arch_spinlock {
+         union {
+                __ticketpair_t head_tail;
+                struct __raw_tickets {
+                        __ticket_t head, tail;
+                } tickets;
+        };
+} arch_spinlock_t;
+```
+
+This variant of `spinlock` is called a `ticket spinlock`. As we may see, it consists of two parts. Every time a process wants to hold the `spinlock`, it increments the `tail` by one. If the `tail` is not equal to the `head`, the process will spin until the values of these variables become equal. Let's look at the implementation of the `arch_spin_lock` function:
+
+```C
+static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+        register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
+
+        inc = xadd(&lock->tickets, inc);
+
+        if (likely(inc.head == inc.tail))
+                goto out;
+
+        for (;;) {
+                 unsigned count = SPIN_THRESHOLD;
+
+                 do {
+                       inc.head = READ_ONCE(lock->tickets.head);
+                       if (__tickets_equal(inc.head, inc.tail))
+                                goto clear_slowpath;
+                        cpu_relax();
+                 } while (--count);
+                 __ticket_lock_spinning(lock, inc.tail);
+         }
+clear_slowpath:
+        __ticket_check_and_clear_slowpath(lock, inc.head);
+out:
+        barrier();
+}
+```
+
+At the beginning of the `arch_spin_lock` function we can see the initialization of the `__raw_tickets` structure with `tail` set to `1`:
+
+```C
+#define __TICKET_LOCK_INC       1
+```
+
+In the next line we execute the [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) operation on `inc` and `lock->tickets`. After this operation the `inc` will store the value of the `tickets` of the given `lock` and `tickets.tail` will be increased by `inc` or `1`. The `tail` value was increased by `1` which means that one process started to try to hold the lock. In the next step we check that `head` and `tail` have the same value. If these values are equal, this means that nobody holds the lock and we go to the `out` label. At the end of the `arch_spin_lock` function we may see the `barrier` macro which represents a `barrier instruction` which guarantees that the compiler will not change the order of operations that access memory (more about memory barriers you can read in the kernel [documentation](https://www.kernel.org/doc/Documentation/memory-barriers.txt)).
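+
+To make the `xadd` step concrete, here is a pseudo-C sketch of its semantics (an illustration only; the real instruction performs this as one atomic step):
+
+```C
+/* exchange-and-add: add `inc` to `*ptr` and return the previous value */
+static unsigned int xadd_sketch(unsigned int *ptr, unsigned int inc)
+{
+	unsigned int old = *ptr;
+
+	*ptr = old + inc;
+	return old;
+}
+```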
+
+If one process holds the lock and a second process starts to execute the `arch_spin_lock` function, the `head` will not be equal to the `tail`, because the `tail` will be greater than the `head` by `1`. In this way, the process will end up in the loop. The `head` and `tail` values are compared at each loop iteration. If these values are not equal, `cpu_relax` will be called, which is just a [NOP](https://en.wikipedia.org/wiki/NOP) instruction:
+
+
+```C
+#define cpu_relax()     asm volatile("rep; nop")
+```
+
+and the next iteration of the loop will be started. If these values become equal, this means that the process which held this lock released it and the next process may acquire the lock.
+
+The `spin_unlock` operation goes through all the same macros/functions as `spin_lock`, of course with the `unlock` prefix. In the end the `arch_spin_unlock` function will be called. If we look at the implementation of the `arch_spin_unlock` function, we will see that it increases the `head` of the `lock tickets` list:
+
+```C
+__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+```
+
+In the combination of the `spin_lock` and the `spin_unlock` we get a kind of queue where the `head` contains an index number which maps to the currently executing process which holds the lock, and the `tail` contains an index number which maps to the last process which tried to hold the lock:
+
+```
+     +-------+       +-------+
+     |       |       |       |
+head |   7   | - - - |   7   | tail
+     |       |       |       |
+     +-------+       +-------+
+                         |
+                     +-------+
+                     |       |
+                     |   8   |
+                     |       |
+                     +-------+
+                         |
+                     +-------+
+                     |       |
+                     |   9   |
+                     |       |
+                     +-------+
+```
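+
+To summarize the idea, here is a simplified user-space sketch of a ticket lock (my illustration, not the kernel's implementation): each locker atomically takes a ticket from the `tail` and spins until the `head` reaches its ticket:
+
+```C
+#include <stdatomic.h>
+
+struct ticket_lock {
+	atomic_uint head;
+	atomic_uint tail;
+};
+
+static void ticket_lock(struct ticket_lock *lock)
+{
+	/* atomically take the next ticket (like the xadd above) */
+	unsigned int ticket = atomic_fetch_add(&lock->tail, 1);
+
+	/* spin until our ticket is being served */
+	while (atomic_load(&lock->head) != ticket)
+		;	/* a cpu_relax()-like pause could go here */
+}
+
+static void ticket_unlock(struct ticket_lock *lock)
+{
+	/* serve the next waiter */
+	atomic_fetch_add(&lock->head, 1);
+}
+```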
+
+That's all for now. We didn't cover the `spinlock` API in full in this part, but I think that the main idea behind this concept must be clear now.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This concludes the first part covering synchronization primitives in the Linux kernel. In this part, we met the first synchronization primitive provided by the Linux kernel - the `spinlock`. In the next part we will continue to dive into this interesting theme and will see other `synchronization` related stuff.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing)
+* [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
+* [Clocksource framework](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
+* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
+* [Race condition](https://en.wikipedia.org/wiki/Race_condition)
+* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
+* [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64) 
+* [Interrupts](https://en.wikipedia.org/wiki/Interrupt)
+* [Preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) 
+* [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
+* [Sparse](https://en.wikipedia.org/wiki/Sparse)
+* [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
+* [NOP](https://en.wikipedia.org/wiki/NOP)
+* [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
+* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html)

+ 487 - 0
SyncPrim/sync-2.md

@@ -0,0 +1,487 @@
+Synchronization primitives in the Linux kernel. Part 2.
+================================================================================
+
+Queued Spinlocks
+--------------------------------------------------------------------------------
+
+This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met the first one - the [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive in this part. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what this concept represents.
+
+We saw [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html):
+
+* `spin_lock_init` - produces initialization of the given `spinlock`;
+* `spin_lock` - acquires given `spinlock`;
+* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquire given `spinlock`.
+* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on local processor and preserve/not preserve previous interrupt state in the `flags`;
+* `spin_unlock` - releases given `spinlock`;
+* `spin_unlock_bh` - releases given `spinlock` and enables software interrupts;
+* `spin_is_locked` - returns the state of the given `spinlock`;
+* and etc.
+
+And we know that all of these macros, which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, will be expanded to calls of functions with the `arch_spin_.*` prefix from the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at this header file with attention, we will see that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock` and so on) are defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:
+
+```C
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#else
+static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+    ...
+    ...
+    ...
+}
+...
+...
+...
+#endif
+```
+
+This means that the [arch/x86/include/asm/qspinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h) header file provides its own implementation of these functions. Actually they are macros and they are located in another header file. This header file is [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126). If we look into this header file, we will find the definitions of these macros:
+
+```C
+#define arch_spin_is_locked(l)          queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)       queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)     queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)               queued_spin_lock(l)
+#define arch_spin_trylock(l)            queued_spin_trylock(l)
+#define arch_spin_unlock(l)             queued_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)      queued_spin_lock(l)
+#define arch_spin_unlock_wait(l)        queued_spin_unlock_wait(l)
+```
+
+Before we consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, let's take a look at the theoretical part at first.
+
+Introduction to queued spinlocks
+-------------------------------------------------------------------------------
+
+Queued spinlocks are a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is a replacement for the standard `spinlocks`. At least this is true for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at the following kernel configuration file - [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks), we will see the following configuration entries:
+
+```
+config ARCH_USE_QUEUED_SPINLOCKS
+	bool
+
+config QUEUED_SPINLOCKS
+	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
+	depends on SMP
+```
+
+This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option will be enabled by default if the `ARCH_USE_QUEUED_SPINLOCKS` is enabled. We may see that the `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64` specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):
+
+```
+config X86
+    ...
+    ...
+    ...
+    select ARCH_USE_QUEUED_SPINLOCKS
+    ...
+    ...
+    ...
+```
+
+Before we start to consider what the queued spinlock concept is, let's look at other types of `spinlocks`. For a start let's consider how a `normal` spinlock is implemented. Usually, the implementation of a `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. The principle of work of this instruction is pretty simple. This instruction writes a value to a memory location and returns the old value from this memory location. Both of these operations are done atomically, i.e. this instruction is non-interruptible. So if the first thread starts to execute this instruction, the second thread will wait until the first processor finishes. A basic lock can be built on top of this mechanism. Schematically it may look like this:
+
+```C
+int lock(lock)
+{
+    while (test_and_set(lock) == 1)
+        ;
+    return 0;
+}
+
+int unlock(lock)
+{
+    lock=0;
+
+    return lock;
+}
+```
+
+The first thread will execute the `test_and_set` which will set the `lock` to `1`. When the second thread calls the `lock` function, it will spin in the `while` loop until the first thread calls the `unlock` function and the `lock` becomes equal to `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair and a thread from one processor may have a long waiting time, even if it called `lock` before other threads which are waiting for the free lock too. The second problem is that all threads which want to acquire a lock must execute many `atomic` operations like `test_and_set` on a variable which is in shared memory. This leads to cache invalidation, as the cache of a processor will store `lock=1`, but the value of the `lock` in memory may no longer be `1` after a thread releases this lock.
+
+In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the second type of spinlock implementation - the `ticket spinlock`. This approach solves the first problem and may guarantee the order of threads which want to acquire a lock, but it still has the second problem.
+
+The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. The `queued spinlocks` allow each processor to use its own memory location to spin on. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at the implementation of the `queued spinlocks` in the Linux kernel, we will try to understand what an `MCS` lock is.
+
+The basic idea of the `MCS` lock is, as I already wrote in the previous paragraph, that a thread spins on a local variable and each processor in the system has its own copy of this variable. In other words this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.
+
+When the first thread wants to acquire a lock, it registers itself in the `queue`, or in other words it will be added to the special `queue` and will acquire the lock, because it is free for now. When the second thread wants to acquire the same lock before the first thread releases it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread. From this moment, the second thread will wait until the first thread releases its lock and notifies the `next` thread about this event. The first thread will be deleted from the `queue` and the second thread will be the owner of the lock.
+
+Schematically we can represent it like:
+
+Empty queue:
+
+```
++---------+
+|         |
+|  Queue  |
+|         |
++---------+
+```
+
+First thread tries to acquire a lock:
+
+```
++---------+     +----------------------------+
+|         |     |                            |
+|  Queue  |---->| First thread acquired lock |
+|         |     |                            |
++---------+     +----------------------------+
+```
+
+Second thread tries to acquire a lock:
+
+```
++---------+     +----------------------------------------+     +-------------------------+
+|         |     |                                        |     |                         |
+|  Queue  |---->|  Second thread waits for first thread  |<----| First thread holds lock |
+|         |     |                                        |     |                         |
++---------+     +----------------------------------------+     +-------------------------+
+```
+
+Or the pseudocode:
+
+```C
+void lock(...)
+{
+    lock.next = NULL;
+    ancestor = put_lock_to_queue_and_return_ancestor(queue, lock);
+
+    // if we have an ancestor, the lock is already acquired and we
+    // need to wait until it is released
+    if (ancestor)
+    {
+        lock.is_locked = true;
+        ancestor.next = lock;
+
+        while (lock.is_locked == true)
+            ;
+    }
+
+    // otherwise we are the owner of the lock and may exit
+}
+
+void unlock(...)
+{
+    // do we need to notify somebody or are we alone in the
+    // queue?
+    if (lock.next != NULL) {
+        // the while loop from the lock() function will be
+        // finished
+        lock.next.is_locked = false;
+        // delete ourself from the queue and exit
+        ...
+        ...
+        ...
+        return;
+    }
+
+    // So, there is no next thread in the queue to notify about
+    // the lock releasing event. Just put `0` to the lock,
+    // delete ourselves from the queue and exit.
+}
+```
+
+The idea is simple, but the implementation of `queued spinlocks` is much more complex than this pseudocode. As I already wrote above, the `queued spinlock` mechanism is planned to be a replacement for `ticket spinlocks` in the Linux kernel. But as you may remember, the usual `spinlock` fits into a `32-bit` [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29), while the `MCS` based lock does not fit into this size. As you may know, the `spinlock_t` type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used in the Linux kernel. In this case we would have to rewrite a significant part of the Linux kernel, and this is unacceptable. Besides this, some kernel structures which contain a spinlock for protection can't grow. But anyway, the implementation of the `queued spinlocks` in the Linux kernel is based on this concept with some modifications which allow it to fit into `32` bits.
+
+That's all about the theory of the `queued spinlocks`, now let's consider how this mechanism is implemented in the Linux kernel. The implementation of the `queued spinlocks` looks more complex and tangled than the implementation of `ticket spinlocks`, but studying it with attention will lead to success.
+
+API of queued spinlocks
+-------------------------------------------------------------------------------
+
+Now we know a little about `queued spinlocks` from the theoretical side, time to see the implementation of this mechanism in the Linux kernel. As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) header file provides a set of macros which represent the API for spinlock acquiring, releasing and so on:
+
+```C
+#define arch_spin_is_locked(l)          queued_spin_is_locked(l)
+#define arch_spin_is_contended(l)       queued_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)     queued_spin_value_unlocked(l)
+#define arch_spin_lock(l)               queued_spin_lock(l)
+#define arch_spin_trylock(l)            queued_spin_trylock(l)
+#define arch_spin_unlock(l)             queued_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)      queued_spin_lock(l)
+#define arch_spin_unlock_wait(l)        queued_spin_unlock_wait(l)
+```
+
+All of these macros expand to the call of functions from the same header file. Additionally, we saw the `qspinlock` structure from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file which represents a queued spinlock in the Linux kernel:
+
+```C
+typedef struct qspinlock {
+	atomic_t	val;
+} arch_spinlock_t;
+```
+
+As we may see, the `qspinlock` structure contains only one field - `val`. This field represents the state of a given `spinlock`. This `4` byte field consists of the following four parts:
+
+* `0-7` - locked byte;
+* `8` - pending bit;
+* `16-17` - a two bit index which represents the entry of the `per-cpu` array of the `MCS` lock (we will see it soon);
+* `18-31` - contains the number of the processor which is at the tail of the queue.
+
+and bits `9-15` are not used.
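+
+Here is a sketch (an illustration only; the kernel defines similar masks and offsets in the same [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file) of how these parts could be extracted from the `32-bit` `val`:
+
+```C
+static inline unsigned int q_locked(unsigned int val)
+{
+	return val & 0xff;		/* bits 0-7: locked byte */
+}
+
+static inline unsigned int q_pending(unsigned int val)
+{
+	return (val >> 8) & 0x1;	/* bit 8: pending bit */
+}
+
+static inline unsigned int q_tail_idx(unsigned int val)
+{
+	return (val >> 16) & 0x3;	/* bits 16-17: per-cpu node index */
+}
+
+static inline unsigned int q_tail_cpu(unsigned int val)
+{
+	return val >> 18;		/* bits 18-31: tail processor number */
+}
+```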
+
+As we already know, each processor in the system has own copy of the lock. The lock is represented by the following structure:
+
+```C
+struct mcs_spinlock {
+       struct mcs_spinlock *next;
+       int locked;
+       int count;
+};
+```
+
+from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field represents a pointer to the next thread in the `queue`. The second field represents the state of the current thread in the `queue`, where `1` means the `lock` is already acquired and `0` otherwise. And the last field of the `mcs_spinlock` structure represents nested locks. To understand what a nested lock is, imagine a situation when a thread has acquired a lock but was interrupted by a hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt) and an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take a lock too. For this case, each processor has not just a copy of the `mcs_spinlock` structure but an array of these structures:
+
+```C
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+```
+
+This array allows making four attempts at lock acquisition for events in the following four contexts:
+
+* normal task context;
+* hardware interrupt context;
+* software interrupt context;
+* non-maskable interrupt context.
+
+Now let's return to the `qspinlock` structure and the `API` of the `queued spinlocks`. Before we move on to consider the `API` of `queued spinlocks`, notice that the `val` field of the `qspinlock` structure has the type `atomic_t` which represents an atomic variable, i.e. a one-operation-at-a-time variable. So, all operations with this field will be [atomic](https://en.wikipedia.org/wiki/Linearizability). For example, let's look at the API for reading the value of `val`:
+
+```C
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val);
+}
+```
+
+Ok, now we know the data structures which represent a queued spinlock in the Linux kernel and now it is time to look at the implementation of the `main` function from the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface).
+
+```C
+#define arch_spin_lock(l)               queued_spin_lock(l)
+```
+
+Yes, this function is `queued_spin_lock`. As we may understand from the function's name, it allows a thread to acquire a lock. This function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and its implementation looks like:
+
+```C
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
+{
+        u32 val;
+
+        val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
+        if (likely(val == 0))
+                 return;
+        queued_spin_lock_slowpath(lock, val);
+}
+```
+
+Looks pretty easy, except for the `queued_spin_lock_slowpath` function. We may see that the `queued_spin_lock` function takes only one parameter. In our case this parameter represents the `queued spinlock` which will be locked. Let's consider the situation where the `queue` with locks is empty for now and the first thread wants to acquire the lock. As we may see, the `queued_spin_lock` function starts with the call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes an atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction which compares the value of the second parameter (zero in our case) with the value of the first parameter (the current state of the given spinlock) and, if they are identical, stores the value of the `_Q_LOCKED_VAL` in the memory location which is pointed to by `&lock->val` and returns the initial value from this memory location.
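+
+Here is a pseudo-C sketch of the compare-and-exchange semantics described above (an illustration only; the real instruction does all of this in one atomic step):
+
+```C
+static unsigned int cmpxchg_sketch(unsigned int *ptr, unsigned int old,
+                                   unsigned int new)
+{
+	unsigned int prev = *ptr;
+
+	if (prev == old)
+		*ptr = new;	/* swap only if the lock value is the expected one */
+
+	return prev;		/* prev == old means the swap happened */
+}
+```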
+
+The `atomic_cmpxchg_acquire` macro is defined in the [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) header file and expands to the call of the `atomic_cmpxchg` function:
+
+```C
+#define  atomic_cmpxchg_acquire         atomic_cmpxchg
+```
+
+which is architecture specific. We consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case this header file will be [arch/x86/include/asm/atomic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/atomic.h) and the implementation of the `atomic_cmpxchg` function just returns the result of the `cmpxchg` macro:
+
+```C
+static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
+{
+        return cmpxchg(&v->counter, old, new);
+}
+```
+
+This macro is defined in the [arch/x86/include/asm/cmpxchg.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/cmpxchg.h) header file and looks:
+
+```C
+#define cmpxchg(ptr, old, new) \
+    __cmpxchg(ptr, old, new, sizeof(*(ptr)))
+
+#define __cmpxchg(ptr, old, new, size) \
+    __raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
+```
+
+As we may see, the `cmpxchg` macro expands to the `__cmpxchg` macro with almost the same set of parameters. The new additional parameter is the size of the atomic value. The `__cmpxchg` macro adds the `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro, where `LOCK_PREFIX` is just the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction prefix. After all, the `__raw_cmpxchg` macro does all the job for us:
+
+```C
+#define __raw_cmpxchg(ptr, old, new, size, lock) \
+({
+    ...
+    ...
+    ...
+    volatile u32 *__ptr = (volatile u32 *)(ptr);            \
+    asm volatile(lock "cmpxchgl %2,%1"                      \
+                 : "=a" (__ret), "+m" (*__ptr)              \
+                 : "r" (__new), "0" (__old)                 \
+                 : "memory");                               \
+    ...
+    ...
+    ...
+})
+```
+
+After the `atomic_cmpxchg_acquire` macro is executed, it returns the previous value of the memory location. So far only one thread has tried to acquire the lock, so `val` will be zero and we will return from the `queued_spin_lock` function:
+
+```C
+val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
+if (likely(val == 0))
+    return;
+```
+
+From this moment, our first thread holds the lock. Notice that this behavior differs from the behavior which was described in the `MCS` algorithm. The thread acquired the lock, but we didn't add it to the `queue`. As I already wrote, the implementation of the `queued spinlocks` concept in the Linux kernel is based on the `MCS` algorithm, but at the same time it has some differences like this one for optimization purposes.
+
+So the first thread has acquired the lock and now let's consider the case when a second thread tries to acquire the same lock. The second thread will start from the same `queued_spin_lock` function, but `lock->val` will contain `1` or `_Q_LOCKED_VAL`, because the first thread already holds the lock. So, in this case the `queued_spin_lock_slowpath` function will be called. The `queued_spin_lock_slowpath` function is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file and starts with the following checks:
+
+```C
+void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+	if (pv_enabled())
+	    goto queue;
+
+    if (virt_spin_lock(lock))
+		return;
+
+    ...
+    ...
+    ...
+}
+```
+
+which check the state of the `pvqspinlock`. The `pvqspinlock` is a `queued spinlock` in a [paravirtualized](https://en.wikipedia.org/wiki/Paravirtualization) environment. As this chapter is related only to synchronization primitives in the Linux kernel, we skip these and other parts which are not directly related to the topic of this chapter. After these checks we compare our value which represents the lock with the value of the `_Q_PENDING_VAL` macro and do nothing while this is true:
+
+```C
+if (val == _Q_PENDING_VAL) {
+	while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
+		cpu_relax();
+}
+```
+
+where `cpu_relax` is just a [NOP](https://en.wikipedia.org/wiki/NOP) instruction. Above, we saw that the lock contains a `pending` bit. This bit is set when a thread wants to acquire a lock which is already held by another thread while the `queue` is still empty. In this case the `pending` bit is set and the `queue` is not touched. This is done for optimization, because there is no need for the unnecessary latency which would be caused by the cache invalidation of touching its own `mcs_spinlock` array.
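+
+To make the following discussion easier to follow, here is a rough sketch of how the 32-bit lock word is laid out (for the `NR_CPUS < 16K` case, as described in the comments of [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h)):
+
+```C
+/*
+ * Rough layout of the qspinlock value (NR_CPUS < 16K case):
+ *
+ *  bits  0- 7: locked byte
+ *  bit      8: pending
+ *  bits  9-15: not used
+ *  bits 16-17: tail index (entry of the per-cpu mcs_nodes array)
+ *  bits 18-31: tail cpu (+1)
+ */
+```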
+
+At the next step we enter into the following loop:
+
+```C
+for (;;) {
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	new = _Q_LOCKED_VAL;
+	if (val == new)
+		new |= _Q_PENDING_VAL;
+
+	old = atomic_cmpxchg_acquire(&lock->val, val, new);
+	if (old == val)
+		break;
+
+	val = old;
+}
+```
+
+The first `if` clause here checks whether the state of the lock (`val`) has anything set besides the locked byte, i.e. whether the lock is in the pending state or the tail is not empty. This means that the first thread already acquired the lock, a second thread tried to acquire it too and is now in the pending state, so we need to start to build the queue. We will consider this situation a little later. In our case, the first thread holds the lock and the second thread tries to acquire it too. After this check we create a new lock value in the locked state and compare it with the state of the previous lock. As you remember, `val` contains the state of `&lock->val`, which, after the second thread calls the `atomic_cmpxchg_acquire` macro, is equal to `1`. The `new` and `val` values are equal, so we set the pending bit in the new lock value. After this we need to check the value of `&lock->val` again, because the first thread may have released the lock before this moment. If the first thread has not released the lock yet, the value of `old` will be equal to the value of `val` (because `atomic_cmpxchg_acquire` returns the value from the memory location pointed to by `lock->val`, and now it is `1`) and we will exit from the loop. After exiting the loop, we wait until the first thread releases the lock, then clear the pending bit, acquire the lock and return:
+
+```C
+smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
+clear_pending_set_locked(lock);
+return;
+```
+
+Notice that we have not touched the `queue` yet. We don't need it here, because for two threads it just leads to unnecessary latency for memory access. In the other case, when the lock is already in the locked and pending state (i.e. `lock->val` contains `_Q_LOCKED_VAL | _Q_PENDING_VAL`), we start to build the `queue`. We start to build the `queue` by getting the local copy of the `mcs_nodes` array of the processor which executes the thread:
+
+```C
+node = this_cpu_ptr(&mcs_nodes[0]);
+idx = node->count++;
+tail = encode_tail(smp_processor_id(), idx);
+```
+
+Additionally we calculate `tail`, which will indicate the tail of the `queue`, and `idx`, which represents an index into the `mcs_nodes` array. After this we set `node` to point to the correct entry of the `mcs_nodes` array, set `locked` to zero because this thread hasn't acquired the lock yet and `next` to `NULL` because we don't know anything about other `queue` entries:
+
+```C
+node += idx;
+node->locked = 0;
+node->next = NULL;
+```
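+
+The `encode_tail` helper that was used a moment ago packs the processor number and the index into the `mcs_nodes` array into the `tail` value. Roughly (a simplified sketch of the helper from [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c), debug checks omitted), it looks like this:
+
+```C
+static inline u32 encode_tail(int cpu, int idx)
+{
+	u32 tail;
+
+	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+	tail |= idx << _Q_TAIL_IDX_OFFSET;      /* idx is assumed to be < 4 */
+
+	return tail;
+}
+```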
+
+We have already touched the `per-cpu` copy of the queue for the processor which executes the current thread which wants to acquire the lock; this means that the owner of the lock may have released it in the meantime. So we may try to acquire the lock again with a call of the `queued_spin_trylock` function.
+
+```C
+if (queued_spin_trylock(lock))
+		goto release;
+```
+
+The `queued_spin_trylock` function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and does almost the same thing as the `queued_spin_lock` function:
+
+```C
+static __always_inline int queued_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->val) &&
+	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+		return 1;
+	return 0;
+}
+```
+
+If the lock was successfully acquired we jump to the `release` label to release a node of the `queue`:
+
+```C
+release:
+	this_cpu_dec(mcs_nodes[0].count);
+```
+
+because we don't need it anymore as the lock is acquired. If the `queued_spin_trylock` was unsuccessful, we update the tail of the queue:
+
+```C
+old = xchg_tail(lock, tail);
+```
+
+and retrieve the previous tail. The next step is to check whether the `queue` is not empty. In that case we need to link the previous entry with the new one:
+
+```C
+if (old & _Q_TAIL_MASK) {
+	prev = decode_tail(old);
+	WRITE_ONCE(prev->next, node);
+
+    arch_mcs_spin_lock_contended(&node->locked);
+}
+```
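+
+The `decode_tail` helper used here does the inverse of `encode_tail`. Roughly (a simplified sketch of the helper from [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c)), it converts the encoded tail value back into a pointer to the corresponding per-cpu `mcs_nodes` entry:
+
+```C
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+	return per_cpu_ptr(&mcs_nodes[idx], cpu);
+}
+```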
+
+After the queue entries are linked, we start to wait until we reach the head of the queue. Once we have reached it, we need to check for a new node which might have been added during this wait:
+
+```C
+next = READ_ONCE(node->next);
+if (next)
+	prefetchw(next);
+```
+
+If a new node was added, we prefetch the cache line from the memory pointed to by the next queue entry with the [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) instruction. We preload this pointer now for optimization purposes. We have just become the head of the queue, which means that there is an upcoming `MCS` unlock operation and the next entry will be touched.
+
+Yes, from this moment we are at the head of the `queue`. But before we are able to acquire the lock, we need to wait for at least two events: the current owner of the lock releases it and the second thread with the `pending` bit acquires the lock too:
+
+```C
+smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
+```
+
+After both threads have released the lock, the head of the `queue` will hold the lock. In the end we just need to update the tail of the `queue` and remove the current head from it.
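+
+For completeness, releasing a `queued spinlock` is much simpler than acquiring it. The generic `queued_spin_unlock` from [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) roughly just clears the locked byte (a simplified sketch):
+
+```C
+static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	/* release the lock by clearing the locked byte atomically */
+	smp_mb__before_atomic();
+	atomic_sub(_Q_LOCKED_VAL, &lock->val);
+}
+```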
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel. 
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) 
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
+* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
+* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
+* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html) 
+* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
+* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
+* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)

+ 354 - 0
SyncPrim/sync-3.md

@@ -0,0 +1,354 @@
+Synchronization primitives in the Linux kernel. Part 3.
+================================================================================
+
+Semaphores
+--------------------------------------------------------------------------------
+
+This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the previous part we saw a special type of [spinlock](https://en.wikipedia.org/wiki/Spinlock) - the `queued spinlock`. That [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) was the last one which describes `spinlock` related stuff, so we need to move on.
+
+The next [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) after the `spinlock` which we will see in this part is the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). We will start from the theoretical side and learn what a `semaphore` is, and only after this we will see how it is implemented in the Linux kernel, as we did in the previous part.
+
+So, let's start.
+
+Introduction to the semaphores in the Linux kernel
+--------------------------------------------------------------------------------
+
+So, what is a `semaphore`? As you may guess, a `semaphore` is yet another mechanism for the support of thread or process synchronization. The Linux kernel already provides an implementation of one synchronization mechanism - `spinlocks` - so why do we need yet another one? To answer this question we need to know the details of both of these mechanisms. We are already familiar with `spinlocks`, so let's start from this mechanism.
+
+The main idea behind the `spinlock` concept is a lock which will be held for a very short time. We can't sleep while a lock is held by a process or thread, because other processes are waiting for us. A [context switch](https://en.wikipedia.org/wiki/Context_switch) is not allowed because [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) is disabled to avoid [deadlocks](https://en.wikipedia.org/wiki/Deadlock).
+
+In this way, the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) is a good solution for locks which may be held for a long time. On the other hand, this mechanism is not optimal for locks that are held for a short time. To understand this, we need to know what a `semaphore` is.
+
+Like a usual synchronization primitive, a `semaphore` is based on a variable. This variable may be incremented or decremented and its state represents the ability to acquire the lock. Notice that the value of the variable is not limited to `0` and `1`. There are two types of `semaphores`:
+
+* `binary semaphore`;
+* `normal semaphore`.
+
+In the first case, the value of the `semaphore` may be only `1` or `0`. In the second case, the value of the `semaphore` may be any non-negative number. If the value of a `semaphore` is greater than `1` it is called a `counting semaphore` and it allows more than one process to acquire the lock. This allows us to keep track of available resources, while a `spinlock` allows only one task to hold the lock. Besides all of this, one more important thing is that a `semaphore` allows sleeping. Moreover, when a process waits for a lock which is held by another process, the [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29) may switch to another process.
+
+Semaphore API
+--------------------------------------------------------------------------------
+
+So, we know a little about `semaphores` from the theoretical side, let's look at their implementation in the Linux kernel. The whole `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) header file.
+
+We may see that the `semaphore` mechanism is represented by the following structure:
+
+```C
+struct semaphore {
+	raw_spinlock_t		lock;
+	unsigned int		count;
+	struct list_head	wait_list;
+};
+```
+
+in the Linux kernel. The `semaphore` structure consists of three fields:
+
+* `lock` - `spinlock` for a `semaphore` data protection;
+* `count` - the number of available resources;
+* `wait_list` - list of processes which are waiting to acquire a lock.
+
+Before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `semaphore` mechanism in the Linux kernel, we need to know how to initialize a `semaphore`. Actually, the Linux kernel provides two approaches to initialize a given `semaphore` structure:
+
+* `statically`;
+* `dynamically`.
+
+Let's look at the first approach. We are able to initialize a `semaphore` statically with the `DEFINE_SEMAPHORE` macro:
+
+```C
+#define DEFINE_SEMAPHORE(name)  \
+         struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
+```
+
+As we may see, the `DEFINE_SEMAPHORE` macro provides the ability to initialize only a `binary` semaphore. The `DEFINE_SEMAPHORE` macro expands to the definition of a `semaphore` structure which is initialized with the `__SEMAPHORE_INITIALIZER` macro. Let's look at the implementation of this macro:
+
+```C
+#define __SEMAPHORE_INITIALIZER(name, n)              \
+{                                                                       \
+        .lock           = __RAW_SPIN_LOCK_UNLOCKED((name).lock),        \
+        .count          = n,                                            \
+        .wait_list      = LIST_HEAD_INIT((name).wait_list),             \
+}
+```
+
+The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and does initialization of the fields of this structure. First of all we initialize a `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) parts, the `__RAW_SPIN_LOCK_UNLOCKED` is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro which just expands to zero or unlocked state:
+
+```C
+#define __ARCH_SPIN_LOCK_UNLOCKED       { { 0 } }
+```
+
+The last two fields of the `semaphore` structure, `count` and `wait_list`, are initialized with the given number of available resources and an empty [list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).
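+
+So, if a statically defined `counting` semaphore is needed, it may be initialized directly with the `__SEMAPHORE_INITIALIZER` macro. For example, a hypothetical semaphore which guards four identical resources could be defined like this (an illustrative sketch, not code from the kernel source):
+
+```C
+/* a hypothetical counting semaphore with four available resources */
+static struct semaphore four_resources_sem =
+        __SEMAPHORE_INITIALIZER(four_resources_sem, 4);
+```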
+
+The second way to initialize a `semaphore` structure is to pass the `semaphore` and number of available resources to the `sema_init` function which is defined in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) header file:
+
+```C
+static inline void sema_init(struct semaphore *sem, int val)
+{
+       static struct lock_class_key __key;
+       *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
+       lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
+}
+```
+
+Let's consider the implementation of this function. It looks pretty easy and actually does almost the same thing. This function executes initialization of the given `semaphore` with the `__SEMAPHORE_INITIALIZER` macro which we just saw. As I already wrote in the previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will skip the stuff which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel.
+
+So, now that we are able to initialize a `semaphore`, let's look at how to lock and unlock it. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `semaphores`:
+
+```
+void down(struct semaphore *sem);
+void up(struct semaphore *sem);
+int  down_interruptible(struct semaphore *sem);
+int  down_killable(struct semaphore *sem);
+int  down_trylock(struct semaphore *sem);
+int  down_timeout(struct semaphore *sem, long jiffies);
+```
+
+The first two functions, `down` and `up`, are for acquiring and releasing the given `semaphore`. The `down_interruptible` function tries to acquire a `semaphore`. If the attempt is successful, the `count` field of the given `semaphore` will be decremented and the lock will be acquired; otherwise the task will be switched to the blocked state, or in other words the `TASK_INTERRUPTIBLE` flag will be set. The `TASK_INTERRUPTIBLE` flag means that the process may be returned to a runnable state by a [signal](https://en.wikipedia.org/wiki/Unix_signal).
+
+The `down_killable` function does the same as the `down_interruptible` function, but sets the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by the kill signal.
+
+The `down_trylock` function is similar to the `spin_trylock` function. This function tries to acquire a lock and exits if this operation was unsuccessful. In this case the process which wants to acquire a lock will not wait. The last function, `down_timeout`, also tries to acquire a lock, but its wait will be interrupted when the given timeout expires. Additionally, you may notice that the timeout is given in [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html).
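+
+Just to illustrate how this `API` is usually used, here is a small hypothetical sketch (the `my_sem` semaphore and the surrounding functions are made up for the example):
+
+```C
+static struct semaphore my_sem;
+
+void init_example(void)
+{
+	/* one available resource, i.e. a binary semaphore */
+	sema_init(&my_sem, 1);
+}
+
+int do_something(void)
+{
+	/* sleep until the semaphore is acquired or a signal arrives */
+	if (down_interruptible(&my_sem))
+		return -EINTR;
+
+	/* ... critical section protected by my_sem ... */
+
+	/* release the semaphore and wake up a possible waiter */
+	up(&my_sem);
+	return 0;
+}
+```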
+
+We just saw the definitions of the `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface). We will start from the `down` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file. Let's look at its implementation:
+
+```C
+void down(struct semaphore *sem)
+{
+        unsigned long flags;
+
+        raw_spin_lock_irqsave(&sem->lock, flags);
+        if (likely(sem->count > 0))
+                sem->count--;
+        else
+                __down(sem);
+        raw_spin_unlock_irqrestore(&sem->lock, flags);
+}
+EXPORT_SYMBOL(down);
+```
+
+We may see the definition of the `flags` variable at the beginning of the `down` function. This variable will be passed to the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and protect the counter of the given `semaphore` here. Actually both of these macros do the same thing as the `spin_lock` and `spin_unlock` macros, but additionally they save/restore the current value of the interrupt flags and disable [interrupts](https://en.wikipedia.org/wiki/Interrupt).
+
+As you may already guess, the main work is done between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros in the `down` function. We compare the value of the `semaphore` counter with zero and if it is bigger than zero, we may decrement this counter. This means that we have acquired the lock. Otherwise the counter is zero, which means that all available resources are already taken and we need to wait to acquire this lock. As we may see, the `__down` function will be called in this case.
+
+The `__down` function is defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file and its implementation looks like this:
+
+```C
+static noinline void __sched __down(struct semaphore *sem)
+{
+        __down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+}
+```
+
+The `__down` function just calls the `__down_common` function with three parameters:
+
+* `semaphore`;
+* `flag` - for the task;
+* `timeout` - maximum timeout to wait `semaphore`.
+
+Before we consider the implementation of the `__down_common` function, notice that the implementations of the `down_interruptible`, `down_killable` and `down_timeout` functions are based on `__down_common` too:
+
+```C
+static noinline int __sched __down_interruptible(struct semaphore *sem)
+{
+        return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+}
+```
+
+The `__down_killable`:
+
+```C
+static noinline int __sched __down_killable(struct semaphore *sem)
+{
+        return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT);
+}
+```
+
+And the `__down_timeout`:
+
+```C
+static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
+{
+        return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
+}
+```
+
+Now let's look at the implementation of the `__down_common` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file too and starts from the definition of the two following local variables:
+
+```C
+struct task_struct *task = current;
+struct semaphore_waiter waiter;
+```
+
+The first represents current task for the local processor which wants to acquire a lock. The `current` is a macro which is defined in the [arch/x86/include/asm/current.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/current.h) header file:
+
+```C
+#define current get_current()
+```
+
+Where the `get_current` function returns value of the `current_task` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable:
+
+```C
+DECLARE_PER_CPU(struct task_struct *, current_task);
+
+static __always_inline struct task_struct *get_current(void)
+{
+        return this_cpu_read_stable(current_task);
+}
+```
+
+The second variable, `waiter`, represents an entry of the `semaphore.wait_list` list:
+
+```C
+struct semaphore_waiter {
+        struct list_head list;
+        struct task_struct *task;
+        bool up;
+};
+```
+
+After the definition of these variables, we add the current task to the `wait_list` and fill the `waiter` fields:
+
+```C
+list_add_tail(&waiter.list, &sem->wait_list);
+waiter.task = task;
+waiter.up = false;
+```
+
+In the next step we enter the following infinite loop:
+
+```C
+for (;;) {
+        if (signal_pending_state(state, task))
+            goto interrupted;
+
+        if (unlikely(timeout <= 0))
+            goto timed_out;
+
+        __set_task_state(task, state);
+
+        raw_spin_unlock_irq(&sem->lock);
+        timeout = schedule_timeout(timeout);
+        raw_spin_lock_irq(&sem->lock);
+
+        if (waiter.up)
+            return 0;
+}
+```
+
+In the previous piece of code we set `waiter.up` to `false`. So, a task will spin in this loop until `up` is set to `true`. This loop starts from the check whether the current task is in the `pending` state, or in other words whether the flags of this task contain the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` flag. As I already wrote above, a task may be interrupted by a [signal](https://en.wikipedia.org/wiki/Unix_signal) while waiting for the ability to acquire a lock. The `signal_pending_state` function is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) source code file and looks like this:
+
+```C
+static inline int signal_pending_state(long state, struct task_struct *p)
+{
+         if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
+                 return 0;
+         if (!signal_pending(p))
+                 return 0;
+ 
+         return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
+}
+```
+
+We check that the `state` [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29) contains the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` bits and exit if the bitmask does not contain them. At the next step we check that the given task has a pending signal and exit if there is none. In the end we just check the `TASK_INTERRUPTIBLE` bit in the `state` bitmask again, or whether a fatal signal like [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL) is pending. So, if our task has a pending signal, we will jump to the `interrupted` label:
+
+```C
+interrupted:
+    list_del(&waiter.list);
+    return -EINTR;
+```
+
+where we delete the task from the list of lock waiters and return the `-EINTR` [error code](https://en.wikipedia.org/wiki/Errno.h). If a task has no pending signal, we check the given timeout and if it is less than or equal to zero:
+
+```C
+if (unlikely(timeout <= 0))
+    goto timed_out;
+```
+
+we jump to the `timed_out` label:
+
+```C
+timed_out:
+    list_del(&waiter.list);
+    return -ETIME;
+```
+
+where we do almost the same as for the `interrupted` label. We delete the task from the list of lock waiters, but return the `-ETIME` error code. If a task has no pending signal and the given timeout has not expired yet, the given `state` will be set for the current task:
+
+```C
+__set_task_state(task, state);
+```
+
+and call the `schedule_timeout` function:
+
+```C
+raw_spin_unlock_irq(&sem->lock);
+timeout = schedule_timeout(timeout);
+raw_spin_lock_irq(&sem->lock);
+```
+
+which is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file. The `schedule_timeout` function makes the current task sleep until the given timeout expires.
+
+That is all about the `__down_common` function. A task which wants to acquire a lock which is already held by another task will spin in the infinite loop until it is interrupted by a signal, the given timeout expires or the task which holds the lock releases it. Now let's look at the implementation of the `up` function.
+
+The `up` function is defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file as the `down` function. As we already know, the main purpose of this function is to release a lock. It looks like this:
+
+```C
+void up(struct semaphore *sem)
+{
+        unsigned long flags;
+
+        raw_spin_lock_irqsave(&sem->lock, flags);
+        if (likely(list_empty(&sem->wait_list)))
+                sem->count++;
+        else
+                __up(sem);
+        raw_spin_unlock_irqrestore(&sem->lock, flags);
+}
+EXPORT_SYMBOL(up);
+```
+
+It looks almost the same as the `down` function. There are only two differences here. First of all, we increment the counter of the `semaphore` if the list of waiters is empty. Otherwise we call the `__up` function from the same source code file. If the list of waiters is not empty we need to allow the first task from the list to acquire the lock:
+
+```C
+static noinline void __sched __up(struct semaphore *sem)
+{
+        struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+                                                struct semaphore_waiter, list);
+        list_del(&waiter->list);
+        waiter->up = true;
+        wake_up_process(waiter->task);
+}
+```
+
+Here we take the first task from the list of waiters, delete it from the list and set its `waiter->up` field to `true`. From this point the infinite loop in the `__down_common` function will stop. The `wake_up_process` function is called at the end of the `__up` function. As you remember, we called the `schedule_timeout` function in the infinite loop of the `__down_common` function. The `schedule_timeout` function makes the current task sleep until the given timeout expires. So, as our process may be sleeping right now, we need to wake it up. That's why we call the `wake_up_process` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) source code file.
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the third part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the two previous parts we already met the first synchronization primitive provided by the Linux kernel - the `spinlock`, which is implemented as a `ticket spinlock` and used for locks held for a very short time. In this part we saw yet another synchronization primitive - the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29), which is used for locks held for a long time, as it leads to a [context switch](https://en.wikipedia.org/wiki/Context_switch). In the next part we will continue to dive into synchronization primitives in the Linux kernel and will see the next synchronization primitive - the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [spinlocks](https://en.wikipedia.org/wiki/Spinlock)
+* [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
+* [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
+* [context switch](https://en.wikipedia.org/wiki/Context_switch)
+* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
+* [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
+* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
+* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
+* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
+* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
+* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
+* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
+* [errno](https://en.wikipedia.org/wiki/Errno.h)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html)

+ 440 - 0
SyncPrim/sync-4.md

@@ -0,0 +1,440 @@
+Synchronization primitives in the Linux kernel. Part 4.
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering different types of [spinlocks](https://en.wikipedia.org/wiki/Spinlock) and the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive. We will continue to learn about [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and consider yet another one called the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion), which stands for `MUTual EXclusion`.
+
+As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `mutexes`.
+
+So, let's start.
+
+Concept of `mutex`
+--------------------------------------------------------------------------------
+
+We are already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html). It is represented by the:
+
+```C
+struct semaphore {
+	raw_spinlock_t		lock;
+	unsigned int		count;
+	struct list_head	wait_list;
+};
+```
+
+structure which holds information about the state of a [lock](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) and a list of lock waiters. Depending on the value of the `count` field, a `semaphore` can provide access to a resource for more than one process wishing to use it. The [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) concept is very similar to the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) concept, but it has some differences. The main difference between the `semaphore` and `mutex` synchronization primitives is that a `mutex` has stricter semantics. Unlike a `semaphore`, only one [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may hold a `mutex` at a time and only the `owner` of a `mutex` may release or unlock it. An additional difference is in the implementation of the `lock` [API](https://en.wikipedia.org/wiki/Application_programming_interface). The `semaphore` synchronization primitive forces rescheduling of processes which are in the waiters list. The implementation of the `mutex` lock `API` allows this situation, and as a result expensive [context switches](https://en.wikipedia.org/wiki/Context_switch), to be avoided.
+
+The `mutex` synchronization primitive represented by the following:
+
+```C
+struct mutex {
+        atomic_t                count;
+        spinlock_t              wait_lock;
+        struct list_head        wait_list;
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
+        struct task_struct      *owner;
+#endif
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+        struct optimistic_spin_queue osq;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
+        void                    *magic;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+        struct lockdep_map      dep_map;
+#endif
+};
+```
+
+structure in the Linux kernel. This structure is defined in the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file and contains a set of fields similar to the `semaphore` structure. The first field of the `mutex` structure is `count`. The value of this field represents the state of a `mutex`. When the value of the `count` field is `1`, a `mutex` is in the `unlocked` state. When the value of the `count` field is `zero`, a `mutex` is in the `locked` state. Additionally, the value of the `count` field may be `negative`. In this case a `mutex` is in the `locked` state and has possible waiters.
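+
+To summarize the meaning of the `count` field (this is just a summary of what was said above, not code from the kernel source):
+
+```C
+/*
+ * mutex->count:
+ *
+ *   1  - the mutex is unlocked;
+ *   0  - the mutex is locked, no waiters;
+ *  <0  - the mutex is locked, possible waiters in the wait queue.
+ */
+```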
+
+The next two fields of the `mutex` structure, `wait_lock` and `wait_list`, are a [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the protection of a `wait queue` and the list of waiters which represents this `wait queue` for a certain lock. As you may notice, here the similarity of the `mutex` and `semaphore` structures ends. The remaining fields of the `mutex` structure depend on different configuration options of the Linux kernel.
+
+The first of them, the `owner` field, represents the [process](https://en.wikipedia.org/wiki/Process_%28computing%29) which acquired the lock. As we may see, the existence of this field in the `mutex` structure depends on the `CONFIG_DEBUG_MUTEXES` or `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration options. The main point of this field and the next `osq` field is the support of `optimistic spinning`, which we will see later. The last two fields, `magic` and `dep_map`, are used only in [debugging](https://en.wikipedia.org/wiki/Debugging) mode. The `magic` field stores `mutex` related information for debugging and the `dep_map` field is for the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel.
+
+Now, after we have considered the `mutex` structure, we may consider how this synchronization primitive works in the Linux kernel. As you may guess, a process which wants to acquire a lock must decrement the value of `mutex->count` if possible. And if a process wants to release a lock, it must increment the same value. That's true, but as you may also guess, it is not so simple in the Linux kernel.
+
+Actually, when a process tries to acquire a `mutex`, there are three possible paths:
+
+* `fastpath`;
+* `midpath`;
+* `slowpath`.
+
+which may be taken, depending on the current state of the `mutex`. The first path, the `fastpath`, is the fastest, as you may understand from its name. Everything is easy in this case. Nobody has acquired the `mutex`, so the value of the `count` field of the `mutex` structure may be directly decremented. In the case of unlocking a `mutex`, the algorithm is the same: a process just increments the value of the `count` field of the `mutex` structure. Of course, all of these operations must be [atomic](https://en.wikipedia.org/wiki/Linearizability).
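+
+Conceptually the `fastpath` may be thought of as something like the following C-level sketch (purely illustrative - the real kernel code uses architecture-specific inline assembly, which we will see below):
+
+```C
+/* illustrative sketch of the fastpath idea, not the kernel's actual code */
+static inline bool mutex_fastpath_sketch(atomic_t *count)
+{
+	/* success when the decremented value is not negative,
+	 * which mirrors the `jns` check in the assembly version */
+	return atomic_dec_return(count) >= 0;
+}
+```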
+
+Yes, this looks pretty easy. But what happens if a process wants to acquire a `mutex` which is already acquired by another process? In this case, control will be transferred to the second path - the `midpath`. The `midpath`, or `optimistic spinning`, tries to [spin](https://en.wikipedia.org/wiki/Spinlock) with the already familiar [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) while the lock owner is running. This path will be executed only if there are no other processes ready to run that have higher priority. This path is called `optimistic` because the waiting task will not be put to sleep and rescheduled. This allows expensive [context switches](https://en.wikipedia.org/wiki/Context_switch) to be avoided.
+
+In the last case, when the `fastpath` and `midpath` may not be executed, the last path - the `slowpath` - will be executed. This path acts like a [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) lock. If the lock can not be acquired by a process, this process will be added to the `wait queue`, whose entries are represented by the following:
+
+```C
+struct mutex_waiter {
+        struct list_head        list;
+        struct task_struct      *task;
+#ifdef CONFIG_DEBUG_MUTEXES
+        void                    *magic;
+#endif
+};
+```
+
+structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file, and it will sleep. Before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for the manipulation of `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file:
+
+```C
+struct semaphore_waiter {
+        struct list_head list;
+        struct task_struct *task;
+        bool up;
+};
+```
+
+It also contains `list` and `task` fields which represent an entry of the mutex wait queue. The one difference here is that the `mutex_waiter` structure does not contain an `up` field, but contains the `magic` field which depends on the `CONFIG_DEBUG_MUTEXES` kernel configuration option and is used to store `mutex` related information for debugging purposes.
+
+Now we know what a `mutex` is and how it is represented in the Linux kernel. In this case, we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for the manipulation of `mutexes`.
+
+Mutex API
+--------------------------------------------------------------------------------
+
+Ok, in the previous paragraph we learned what the `mutex` synchronization primitive is and saw the `mutex` structure which represents a `mutex` in the Linux kernel. Now it's time to consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) for the manipulation of mutexes. Description of the `mutex` API is located in the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file. As always, before we consider how to acquire and release a `mutex`, we need to know how to initialize it.
+
+There are two approaches to initialize a `mutex`. The first is to do it statically. For this purpose the Linux kernel provides the following:
+
+```C
+#define DEFINE_MUTEX(mutexname) \
+        struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
+```
+
+macro. Let's consider the implementation of this macro. As we may see, the `DEFINE_MUTEX` macro takes a name for the `mutex` and expands to the definition of a new `mutex` structure. Additionally the new `mutex` structure gets initialized with the `__MUTEX_INITIALIZER` macro. Let's look at the implementation of the `__MUTEX_INITIALIZER`:
+
+```C
+#define __MUTEX_INITIALIZER(lockname)         \
+{                                                             \
+       .count = ATOMIC_INIT(1),                               \
+       .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock), \
+       .wait_list = LIST_HEAD_INIT(lockname.wait_list)        \
+}
+```
+
+This macro is defined in the [same](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file and, as we may understand, it initializes the fields of the `mutex` structure with initial values. The `count` field gets initialized with `1`, which represents the `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) gets initialized to the unlocked state and the last field, `wait_list`, to an empty [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).
+
+The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file. Actually, the `__mutex_init` function is rarely called directly. Instead of the `__mutex_init`, the:
+
+```C
+# define mutex_init(mutex)                \
+do {                                                    \
+        static struct lock_class_key __key;             \
+                                                        \
+        __mutex_init((mutex), #mutex, &__key);          \
+} while (0)
+```
+
+macro is used. We may see that the `mutex_init` macro just defines the `lock_class_key` and calls the `__mutex_init` function. Let's look at the implementation of this function:
+
+```C
+void
+__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
+{
+        atomic_set(&lock->count, 1);
+        spin_lock_init(&lock->wait_lock);
+        INIT_LIST_HEAD(&lock->wait_list);
+        mutex_clear_owner(lock);
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+        osq_lock_init(&lock->osq);
+#endif
+        debug_mutex_init(lock, name, key);
+}
+```
+
+As we may see the `__mutex_init` function takes three arguments:
+
+* `lock` - a mutex itself;
+* `name` - name of mutex for debugging purpose;
+* `key`  - key for [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt).
+
+At the beginning of the `__mutex_init` function, we may see the initialization of the `mutex` state. We set it to the `unlocked` state with the `atomic_set` function, which atomically sets the given variable to the given value. After this we may see the initialization of the `spinlock` which will protect the `wait queue` of the `mutex` to the unlocked state, and the initialization of the `wait queue` of the `mutex` itself. After this we clear the owner of the `lock` and initialize the optimistic queue with a call of the `osq_lock_init` function from the [include/linux/osq_lock.h](https://github.com/torvalds/linux/blob/master/include/linux/osq_lock.h) header file. This function just sets the tail of the optimistic queue to the unlocked state:
+
+```C
+static inline void osq_lock_init(struct optimistic_spin_queue *lock)
+{
+        atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
+}
+```
+
+At the end of the `__mutex_init` function we may see the call of the `debug_mutex_init` function, but as I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will not consider debugging related stuff in this chapter.
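+
+Before we move on to the `lock`/`unlock` API itself, here is a small hypothetical usage sketch combining both initialization approaches (the names here are made up for the example):
+
+```C
+/* static initialization */
+static DEFINE_MUTEX(config_mutex);
+
+/* dynamic initialization */
+static struct mutex data_mutex;
+
+void init_example(void)
+{
+	mutex_init(&data_mutex);
+}
+
+void update_data(void)
+{
+	mutex_lock(&data_mutex);
+	/* ... critical section; only the owner may call mutex_unlock() ... */
+	mutex_unlock(&data_mutex);
+}
+```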
+
+After the `mutex` structure is initialized, we may go ahead and look at the `lock` and `unlock` API of the `mutex` synchronization primitive. The implementations of the `mutex_lock` and `mutex_unlock` functions are located in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file. First of all let's start from the implementation of `mutex_lock`. It looks like this:
+
+```C
+void __sched mutex_lock(struct mutex *lock)
+{
+        might_sleep();
+        __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+        mutex_set_owner(lock);
+}
+```
+
+We may see the call of the `might_sleep` macro from the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h) header file at the beginning of the `mutex_lock` function. The implementation of this macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option; if this option is enabled, the macro just prints a stack trace if it was executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. It is a helper for debugging purposes. Otherwise this macro does nothing.
+
+After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and, as we consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the implementation of `__mutex_fastpath_lock` is located in the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/mutex_64.h) header file. As we may understand from its name, this function will try to acquire the lock via the fast path, or in other words it will try to decrement the value of the `count` field of the given mutex.
+
+The implementation of the `__mutex_fastpath_lock` function consists of two parts. The first part is an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html) statement. Let's look at it:
+
+```C
+asm_volatile_goto(LOCK_PREFIX "   decl %0\n"
+                              "   jns %l[exit]\n"
+                              : : "m" (v->counter)
+                              : "memory", "cc"
+                              : exit);
+```
+
+First of all, let's pay attention to the `asm_volatile_goto`. This macro is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and just expands to the two inline assembly statements:
+
+```C
+#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
+```
+
+The first assembly statement contains the `goto` specifier and the second, empty inline assembly statement is a [barrier](https://en.wikipedia.org/wiki/Memory_barrier). Now let's return to our inline assembly statement. As we may see, it starts with the `LOCK_PREFIX` macro, which just expands to the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction prefix:
+
+```C
+#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
+```
+
+As we already know from the previous parts, this instruction allows the prefixed instruction to be executed [atomically](https://en.wikipedia.org/wiki/Linearizability). So, in the first step of our assembly statement we try to decrement the value of the given `mutex->counter`. In the next step the [jns](http://unixwiz.net/techtips/x86-jumps.html) instruction will jump to the `exit` label if the value of the decremented `mutex->counter` is not negative. The `exit` label is the second part of the `__mutex_fastpath_lock` function and it just points to the exit from this function:
+
+```C
+exit:
+        return;
+```
+
+For this moment the implementation of the `__mutex_fastpath_lock` function looks pretty easy. But the value of `mutex->counter` may be negative after the decrement. In this case the:
+
+```C
+fail_fn(v);
+```
+
+will be called after our inline assembly statement. The `fail_fn` is the second parameter of the `__mutex_fastpath_lock` function and represents a pointer to the function which implements the `midpath/slowpath` ways to acquire the given lock. In our case the `fail_fn` is the `__mutex_lock_slowpath` function. Before we look at the implementation of the `__mutex_lock_slowpath` function, let's finish with the implementation of the `mutex_lock` function. In the simplest case, the lock will be acquired successfully by a process and the `__mutex_fastpath_lock` will be finished. In this case, we just call the
+
+```C
+mutex_set_owner(lock);
+```
+
+at the end of `mutex_lock`. The `mutex_set_owner` function is defined in the [kernel/locking/mutex.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.h) header file and just sets the owner of the lock to the current process:
+
+```C
+static inline void mutex_set_owner(struct mutex *lock)
+{
+        lock->owner = current;
+}
+```
+
+Now, let's consider the situation when a process which wants to acquire a lock is unable to do it, because another process has already acquired the same lock. We already know that the `__mutex_lock_slowpath` function will be called in this case. Let's consider the implementation of this function. This function is defined in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file and starts by obtaining the proper mutex from the mutex state given by the `__mutex_fastpath_lock` with the `container_of` macro:
+
+```C
+__visible void __sched
+__mutex_lock_slowpath(atomic_t *lock_count)
+{
+        struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+        __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
+                            NULL, _RET_IP_, NULL, 0);
+}
+```
+
+and calls the `__mutex_lock_common` function with the obtained `mutex`. The `__mutex_lock_common` function starts by disabling [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) until rescheduling:
+
+```C
+preempt_disable();
+```
+
+After this comes the stage of optimistic spinning. As we already know, this stage depends on the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option. If this option is disabled, we skip this stage and move to the last path - the `slowpath` of `mutex` acquisition:
+
+```C
+if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+        preempt_enable();
+        return 0;
+}
+```
+
+First of all the `mutex_optimistic_spin` function checks that we don't need to reschedule, or in other words that there are no other tasks ready to run that have higher priority. If this check was successful we need to update the `MCS` lock wait queue with the current spin. In this way only one spinner can compete for the mutex at one time:
+
+```C
+osq_lock(&lock->osq)
+```
+
+In the next step we start to spin in the following loop:
+
+```C
+while (true) {
+    owner = READ_ONCE(lock->owner);
+
+    if (owner && !mutex_spin_on_owner(lock, owner))
+        break;
+
+    if (mutex_try_to_acquire(lock)) {
+        lock_acquired(&lock->dep_map, ip);
+
+        mutex_set_owner(lock);
+        osq_unlock(&lock->osq);
+        return true;
+    }
+}
+```
+
+and try to acquire the lock. First of all we try to get the current owner. If the owner exists (it may not exist in the case when a process has already released the mutex), we wait for it in the `mutex_spin_on_owner` function until the owner releases the lock. If a new task with higher priority appears during the wait for the lock owner, we break out of the loop and go to sleep. Otherwise, the process may have already released the lock, so we try to acquire the lock with `mutex_try_to_acquire`. If this operation finishes successfully, we set the new owner for the given mutex, remove ourselves from the `MCS` wait queue and exit from the `mutex_optimistic_spin` function. At this point the lock is acquired by the process, so we enable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) and exit from the `__mutex_lock_common` function:
+
+```C
+if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+    preempt_enable();
+    return 0;
+}
+
+```
+
+That's all for this case.
+
+In the other case everything may not be so successful. For example, a new task may appear while we are spinning in the loop of `mutex_optimistic_spin`, or we may not even get to this loop if there were task(s) with higher priority before it. Or, finally, the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option may be disabled. In this case the `mutex_optimistic_spin` function will do nothing:
+
+```C
+#ifndef CONFIG_MUTEX_SPIN_ON_OWNER
+static bool mutex_optimistic_spin(struct mutex *lock,
+                                  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+{
+    return false;
+}
+#endif
+```
+
+In all of these cases, the `__mutex_lock_common` function will act like a `semaphore`. We try to acquire the lock again because the owner of the lock might have already released it before this time:
+
+```C
+if (!mutex_is_locked(lock) &&
+   (atomic_xchg_acquire(&lock->count, 0) == 1))
+      goto skip_wait;
+```
+
+In the failure case the process which wants to acquire the lock will be added to the waiters list:
+
+```C
+list_add_tail(&waiter.list, &lock->wait_list);
+waiter.task = task;
+```
+
+In the successful case we update the owner of the lock, enable preemption and exit from the `__mutex_lock_common` function:
+
+```C
+skip_wait:
+        mutex_set_owner(lock);
+        preempt_enable();
+        return 0; 
+```
+
+In this case the lock is acquired. If we can't acquire the lock for now, we enter the following loop:
+
+```C
+for (;;) {
+
+    if (atomic_read(&lock->count) >= 0 && (atomic_xchg_acquire(&lock->count, -1) == 1))
+        break;
+
+    if (unlikely(signal_pending_state(state, task))) {
+        ret = -EINTR;
+        goto err;
+    } 
+
+    __set_task_state(task, state);
+
+     schedule_preempt_disabled();
+}
+```
+
+where we try to acquire the lock again and exit if this operation was successful. Yes, we try to acquire the lock again right after the unsuccessful try before the loop. We need to do this to make sure that we get a wakeup once the lock is unlocked; besides this, it allows us to acquire the lock after sleeping. Otherwise, we check the current process for pending [signals](https://en.wikipedia.org/wiki/Unix_signal) and exit if the process was interrupted by a `signal` while waiting for the lock. If at the end of the loop we still did not acquire the lock, we set the task state to `TASK_UNINTERRUPTIBLE` and go to sleep with a call of the `schedule_preempt_disabled` function.
+
+That's all. We have considered all three possible paths through which a process may pass when it wants to acquire a lock. Now let's consider how `mutex_unlock` is implemented. When `mutex_unlock` is called by a process which wants to release a lock, the `__mutex_fastpath_unlock` function from the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/mutex_64.h) header file will be called:
+
+```C
+void __sched mutex_unlock(struct mutex *lock)
+{
+    __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+}
+```
+
+Implementation of the `__mutex_fastpath_unlock` function is very similar to the implementation of the `__mutex_fastpath_lock` function:
+
+```C
+static inline void __mutex_fastpath_unlock(atomic_t *v,
+                                           void (*fail_fn)(atomic_t *))
+{
+       asm_volatile_goto(LOCK_PREFIX "   incl %0\n"
+                         "   jg %l[exit]\n"
+                         : : "m" (v->counter)
+                         : "memory", "cc"
+                         : exit);
+       fail_fn(v);
+exit:
+       return;
+}
+```
+
+Actually, there is only one difference: we increment the value of `mutex->count`, so it represents the `unlocked` state after this operation. If the `mutex` was released but there is something in the `wait queue`, we need to update it. In this case the `fail_fn` function will be called, which is `__mutex_unlock_slowpath`. The `__mutex_unlock_slowpath` function just gets the correct `mutex` instance from the given `mutex->count` and calls the `__mutex_unlock_common_slowpath` function:
+
+```C
+__mutex_unlock_slowpath(atomic_t *lock_count)
+{
+      struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+      __mutex_unlock_common_slowpath(lock, 1);
+}
+```
+
+In the `__mutex_unlock_common_slowpath` function we will get the first entry from the wait queue, if the wait queue is not empty, and wake up the related process:
+
+```C
+if (!list_empty(&lock->wait_list)) {
+    struct mutex_waiter *waiter =
+           list_entry(lock->wait_list.next, struct mutex_waiter, list); 
+                wake_up_process(waiter->task);
+}
+```
+
+After this, the mutex is released by the previous process and may be acquired by another process from the wait queue.
+
+That's all. We have considered the main `API` for manipulating `mutexes`: `mutex_lock` and `mutex_unlock`. Besides this, the Linux kernel provides the following API:
+
+* `mutex_lock_interruptible`;
+* `mutex_lock_killable`;
+* `mutex_trylock`.
+
+and corresponding `unlock` variants. This part will not describe this `API`, because it is similar to the corresponding `API` of `semaphores`. You may read more about it in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html).
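+
+To make this API more concrete, here is a minimal usage sketch (the `my_device_*` names are made up for illustration; only the `mutex` calls are the real kernel API):
+
+```C
+#include <linux/mutex.h>
+#include <linux/errno.h>
+
+static DEFINE_MUTEX(my_device_lock);   /* statically initialized mutex */
+static int my_device_counter;          /* data protected by the mutex  */
+
+/* may sleep, so it must not be called from atomic context */
+static void my_device_inc(void)
+{
+        mutex_lock(&my_device_lock);
+        my_device_counter++;
+        mutex_unlock(&my_device_lock);
+}
+
+/* interruptible variant: returns -EINTR if a signal interrupted the wait */
+static int my_device_inc_interruptible(void)
+{
+        if (mutex_lock_interruptible(&my_device_lock))
+                return -EINTR;
+        my_device_counter++;
+        mutex_unlock(&my_device_lock);
+        return 0;
+}
+```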
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the fourth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive which is called `mutex`. From the theoretical side, this synchronization primitive is very similar to a [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). Actually, a `mutex` represents a binary semaphore, but its implementation differs from the implementation of a `semaphore` in the Linux kernel. In the next part we will continue to dive into synchronization primitives in the Linux kernel.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
+* [Spinlock](https://en.wikipedia.org/wiki/Spinlock)
+* [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
+* [Synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface) 
+* [Locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29)
+* [Context switches](https://en.wikipedia.org/wiki/Context_switch)
+* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
+* [Atomic](https://en.wikipedia.org/wiki/Linearizability)
+* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
+* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
+* [Memory barrier](https://en.wikipedia.org/wiki/Memory_barrier)
+* [Lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
+* [JNS instruction](http://unixwiz.net/techtips/x86-jumps.html)
+* [Preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
+* [Unix signals](https://en.wikipedia.org/wiki/Unix_signal)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html)

+ 433 - 0
SyncPrim/sync-5.md

@@ -0,0 +1,433 @@
+Synchronization primitives in the Linux kernel. Part 5.
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel, and in the previous parts we finished considering different types of synchronization primitives: [spinlocks](https://en.wikipedia.org/wiki/Spinlock), [semaphores](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) and [mutexes](https://en.wikipedia.org/wiki/Mutual_exclusion). We will continue to learn about [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and start to consider a special type of synchronization primitive - the [readers–writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock).
+
+The first synchronization primitive of this type is already familiar to us - the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), before we consider the implementation of `reader/writer semaphores` in the Linux kernel, we will start from the theoretical side and try to understand what the difference is between `reader/writer semaphores` and `normal semaphores`.
+
+So, let's start.
+
+Reader/Writer semaphore
+--------------------------------------------------------------------------------
+
+Actually there are two types of operations that may be performed on data: we may read data and we may change data. Two fundamental operations - `read` and `write`. Usually (but not always), the `read` operation is performed more often than the `write` operation. In this case, it would be logical to lock data in such a way that several processes may read it at the same time, on the condition that no one changes it. The [readers/writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) allows us to get exactly this kind of lock.
+
+When a process wants to write something into data, all other `writer` and `reader` processes will be blocked until the process which acquired the lock releases it. When a process reads data, other processes which want to read the same data will not be blocked and will be able to do so. As you may guess, the implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We are already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the third [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) of this chapter. From the theoretical side everything looks pretty simple. Let's look at how a `reader/writer semaphore` is represented in the Linux kernel.
+
+The `semaphore` is represented by the:
+
+```C
+struct semaphore {
+	raw_spinlock_t		lock;
+	unsigned int		count;
+	struct list_head	wait_list;
+};
+```
+
+structure. If you look in the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/master/include/linux/rwsem.h) header file, you will find the definition of the `rw_semaphore` structure, which represents a `reader/writer semaphore` in the Linux kernel. Let's look at the definition of this structure:
+
+```C
+#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
+#include <linux/rwsem-spinlock.h>
+#else
+struct rw_semaphore {
+        long count;
+        struct list_head wait_list;
+        raw_spinlock_t wait_lock;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+        struct optimistic_spin_queue osq;
+        struct task_struct *owner;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+        struct lockdep_map      dep_map;
+#endif
+};
+#endif
+```
+
+Before we consider the fields of the `rw_semaphore` structure, notice that its declaration depends on the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration option. This option is disabled for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture by default. We can make sure of this by looking at the corresponding kernel configuration file. In our case, this configuration file is [arch/x86/um/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/um/Kconfig):
+
+```
+config RWSEM_XCHGADD_ALGORITHM
+	def_bool 64BIT
+
+config RWSEM_GENERIC_SPINLOCK
+	def_bool !RWSEM_XCHGADD_ALGORITHM
+```
+
+So, as this [book](https://0xax.gitbooks.io/linux-insides/content) describes only [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related stuff, we will skip the case when the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration option is enabled and consider the definition of the `rw_semaphore` structure only from the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/master/include/linux/rwsem.h) header file.
+
+If we take a look at the definition of the `rw_semaphore` structure, we will notice that the first three fields are the same as in the `semaphore` structure. It contains the `count` field which represents the amount of available resources, the `wait_list` field which represents a [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html) of processes which are waiting to acquire the lock, and the `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) which protects this list. Notice that the `rw_semaphore.count` field has `long` type, unlike the same field in the `semaphore` structure.
+
+The `count` field of a `rw_semaphore` structure may have the following values:
+
+* `0x0000000000000000` - `reader/writer semaphore` is in unlocked state and no one is waiting for a lock;
+* `0x000000000000000X` - `X` readers are active or attempting to acquire a lock and no writer waiting;
+* `0xffffffff0000000X` - may represent different cases. The first is - `X` readers are active or attempting to acquire a lock with waiters for the lock. The second is - one writer attempting a lock, no waiters for the lock. And the last - one writer is active and no waiters for the lock;
+* `0xffffffff00000001` - may represent two different cases. The first is - one reader is active or attempting to acquire a lock and there are waiters for the lock. The second is - one writer is active or attempting to acquire a lock and there are no waiters for the lock;
+* `0xffffffff00000000` - represents the situation when there are readers or writers queued, but none of them is active or in the process of acquiring a lock;
+* `0xfffffffe00000001` - a writer is active or attempting to acquire a lock and waiters are in queue.
+
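+To make this encoding a bit less abstract, here is a small illustration built from the bias constants that we will meet later in this part. The exact value of `RWSEM_ACTIVE_MASK` shown here is an assumption based on the x86_64 `arch/x86/include/asm/rwsem.h` of that time, so treat the whole block as an illustration rather than real kernel code:
+
+```C
+#define RWSEM_UNLOCKED_VALUE     0x0000000000000000L
+#define RWSEM_ACTIVE_BIAS        0x0000000000000001L
+#define RWSEM_ACTIVE_MASK        0x00000000ffffffffL
+#define RWSEM_WAITING_BIAS       (-RWSEM_ACTIVE_MASK-1)   /* 0xffffffff00000000 */
+#define RWSEM_ACTIVE_READ_BIAS   RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
+/*
+ * Some example transitions of rw_semaphore->count:
+ *
+ *   unlocked, nobody waiting:                    0x0000000000000000
+ *   one reader takes the lock (+READ_BIAS):      0x0000000000000001
+ *   one writer takes the lock (+WRITE_BIAS):     0xffffffff00000001
+ *   writer holds the lock, one waiter is queued
+ *   (WAITING_BIAS added once for the queue):     0xfffffffe00000001
+ */
+```
+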
+So, besides the `count` field, all of these fields are similar to the fields of the `semaphore` structure. The last three fields depend on two configuration options of the Linux kernel: `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two of them may be familiar to us from the declaration of the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) structure in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). The `osq` field represents the [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) spinner for `optimistic spinning` and the `owner` field represents the process which is the current owner of the lock.
+
+The last field of the `rw_semaphore` structure - `dep_map` - is debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter.
+
+That's all. Now we know a little about what a `reader/writer lock` is in general and a `reader/writer semaphore` in particular. Additionally, we saw how a `reader/writer semaphore` is represented in the Linux kernel. So, we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulating `reader/writer semaphores`.
+
+Reader/Writer semaphore API
+--------------------------------------------------------------------------------
+
+So, now that we know a little about `reader/writer semaphores` from the theoretical side, let's look at their implementation in the Linux kernel. All `reader/writer semaphore` related [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/master/include/linux/rwsem.h) header file.
+
+As always, before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `reader/writer semaphore` mechanism in the Linux kernel, we need to know how to initialize the `rw_semaphore` structure. As we already saw in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), all [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) may be initialized in two ways:
+
+* `statically`;
+* `dynamically`.
+
+And the `reader/writer semaphore` is not an exception. First of all, let's take a look at the first approach. We may initialize a `rw_semaphore` structure with the help of the `DECLARE_RWSEM` macro at compile time. This macro is defined in the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/master/include/linux/rwsem.h) header file and looks like this:
+
+```C
+#define DECLARE_RWSEM(name) \
+        struct rw_semaphore name = __RWSEM_INITIALIZER(name)
+```
+
+As we may see, the `DECLARE_RWSEM` macro just expands to the definition of a `rw_semaphore` structure with the given name. Additionally, the new `rw_semaphore` structure is initialized with the value of the `__RWSEM_INITIALIZER` macro:
+
+```C
+#define __RWSEM_INITIALIZER(name)              \
+{                                                              \
+        .count = RWSEM_UNLOCKED_VALUE,                         \
+        .wait_list = LIST_HEAD_INIT((name).wait_list),         \
+        .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)  \
+         __RWSEM_OPT_INIT(name)                                \
+         __RWSEM_DEP_MAP_INIT(name)
+} 
+```
+
+and expands to the initialization of the fields of the `rw_semaphore` structure. First of all we initialize the `count` field of the `rw_semaphore` structure to the `unlocked` state with the `RWSEM_UNLOCKED_VALUE` macro from the [arch/x86/include/asm/rwsem.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/rwsem.h) architecture-specific header file:
+
+```C
+#define RWSEM_UNLOCKED_VALUE            0x00000000L
+```
+
+After this we initialize the list of lock waiters with an empty linked list, and the [spinlock](https://en.wikipedia.org/wiki/Spinlock) that protects this list with the `unlocked` state too. The `__RWSEM_OPT_INIT` macro depends on the state of the `CONFIG_RWSEM_SPIN_ON_OWNER` kernel configuration option and, if this option is enabled, it expands to the initialization of the `osq` and `owner` fields of the `rw_semaphore` structure. As we already saw above, the `CONFIG_RWSEM_SPIN_ON_OWNER` kernel configuration option is enabled by default for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so let's take a look at the definition of the `__RWSEM_OPT_INIT` macro:
+
+```C
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+    #define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#else
+    #define __RWSEM_OPT_INIT(lockname)
+#endif
+```
+
+As we may see, the `__RWSEM_OPT_INIT` macro initializes the [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) with the `unlocked` state and the initial `owner` of the lock with `NULL`. From this moment, the `rw_semaphore` structure is initialized at compile time and may be used for data protection.
+
+The second way to initialize a `rw_semaphore` structure is `dynamically`, using the `init_rwsem` macro from the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/master/include/linux/rwsem.h) header file. This macro declares an instance of `lock_class_key`, which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, and calls the `__init_rwsem` function with the given `reader/writer semaphore`:
+
+```C
+#define init_rwsem(sem)                         \
+do {                                                            \
+        static struct lock_class_key __key;                     \
+                                                                \
+        __init_rwsem((sem), #sem, &__key);                      \
+} while (0)
+```
+
+If you search for the definition of the `__init_rwsem` function, you will notice that there are a couple of source code files which contain it. As you may guess, sometimes we need to initialize the additional fields of the `rw_semaphore` structure, like `osq` and `owner`, and sometimes not. All of this depends on certain kernel configuration options. If we look at the [kernel/locking/Makefile](https://github.com/torvalds/linux/blob/master/kernel/locking/Makefile) makefile, we will see the following lines:
+
+```Makefile
+obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
+obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
+```
+
+As we already know, the Linux kernel for the `x86_64` architecture has the `CONFIG_RWSEM_XCHGADD_ALGORITHM` kernel configuration option enabled by default:
+
+```
+config RWSEM_XCHGADD_ALGORITHM
+	def_bool 64BIT
+```
+
+in the [arch/x86/um/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/um/Kconfig) kernel configuration file. In this case, the implementation of the `__init_rwsem` function which is relevant for us is located in the [kernel/locking/rwsem-xadd.c](https://github.com/torvalds/linux/blob/master/kernel/locking/rwsem-xadd.c) source code file. Let's take a look at this function:
+
+```C
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+                    struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+        debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+        lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+        sem->count = RWSEM_UNLOCKED_VALUE;
+        raw_spin_lock_init(&sem->wait_lock);
+        INIT_LIST_HEAD(&sem->wait_list);
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+        sem->owner = NULL;
+        osq_lock_init(&sem->osq);
+#endif
+}
+```
+
+We see here almost the same as in the `__RWSEM_INITIALIZER` macro, with the difference that all of this is executed at [runtime](https://en.wikipedia.org/wiki/Run_time_%28program_lifecycle_phase%29).
+
+So, now that we are able to initialize a `reader/writer semaphore`, let's look at the `lock` and `unlock` API. The Linux kernel provides the following primary [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `reader/writer semaphores`:
+
+* `void down_read(struct rw_semaphore *sem)` - lock for reading;
+* `int down_read_trylock(struct rw_semaphore *sem)` - try lock for reading;
+* `void down_write(struct rw_semaphore *sem)` - lock for writing;
+* `int down_write_trylock(struct rw_semaphore *sem)` - try lock for writing;
+* `void up_read(struct rw_semaphore *sem)` - release a read lock;
+* `void up_write(struct rw_semaphore *sem)` - release a write lock;
+
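+Before we dive into the implementation, here is a minimal usage sketch of this API (the `my_config_*` names are invented for illustration; only the `rw_semaphore` calls are the real kernel API):
+
+```C
+#include <linux/rwsem.h>
+
+static DECLARE_RWSEM(my_config_sem);   /* statically initialized rw_semaphore */
+static int my_config_value;            /* data protected by the semaphore     */
+
+/* many readers may hold the lock at the same time */
+static int my_config_read(void)
+{
+        int value;
+
+        down_read(&my_config_sem);
+        value = my_config_value;
+        up_read(&my_config_sem);
+
+        return value;
+}
+
+/* a writer gets exclusive access */
+static void my_config_write(int new_value)
+{
+        down_write(&my_config_sem);
+        my_config_value = new_value;
+        up_write(&my_config_sem);
+}
+```
+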
+Let's start, as always, with locking. First of all, let's consider the implementation of the `down_write` function which acquires a lock for `write`. This function is defined in the [kernel/locking/rwsem.c](https://github.com/torvalds/linux/blob/master/kernel/locking/rwsem.c) source code file and starts with a call of the `might_sleep` macro from the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h) header file:
+
+```C
+void __sched down_write(struct rw_semaphore *sem)
+{
+        might_sleep();
+        rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
+
+        LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+        rwsem_set_owner(sem);
+}
+```
+
+We already met the `might_sleep` macro in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). In short, the implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and, if this option is enabled, this macro just prints a stack trace if it was executed in an [atomic](https://en.wikipedia.org/wiki/Linearizability) context. As this macro is mostly for debugging purposes, we will skip it and go ahead. Additionally, we will skip the next macro of the `down_write` function - `rwsem_acquire` - which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, because that is a topic for another part.
+
+The only two things that remain in the `down_write` function are the call of the `LOCK_CONTENDED` macro, which is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/master/include/linux/lockdep.h) header file, and the setting of the lock owner with the `rwsem_set_owner` function, which sets the owner to the currently running process:
+
+```C
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+        sem->owner = current;
+}
+```
+
+As you may already guess, the `LOCK_CONTENDED` macro does all the job for us. Let's look at the implementation of the `LOCK_CONTENDED` macro:
+
+```C
+#define LOCK_CONTENDED(_lock, try, lock) \
+        lock(_lock)
+```
+
+As we may see, it just calls the `lock` function, which is the third parameter of the `LOCK_CONTENDED` macro, with the given `rw_semaphore`. In our case the third parameter of the `LOCK_CONTENDED` macro is the `__down_write` function, which is an architecture-specific function located in the [arch/x86/include/asm/rwsem.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/rwsem.h) header file. Let's look at the implementation of the `__down_write` function:
+
+```C
+static inline void __down_write(struct rw_semaphore *sem)
+{
+        __down_write_nested(sem, 0);
+}
+```
+
+which just executes a call of the `__down_write_nested` function from the same source code file. Let's take a look at the implementation of the `__down_write_nested` function:
+
+```C
+static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
+{
+        long tmp;
+
+        asm volatile("# beginning down_write\n\t"
+                     LOCK_PREFIX "  xadd      %1,(%2)\n\t"
+                     "  test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
+                     "  jz        1f\n"
+                     "  call call_rwsem_down_write_failed\n"
+                     "1:\n"
+                     "# ending down_write"
+                     : "+m" (sem->count), "=d" (tmp)
+                     : "a" (sem), "1" (RWSEM_ACTIVE_WRITE_BIAS)
+                     : "memory", "cc");
+}
+```
+
+As with other synchronization primitives which we saw in this chapter, the `lock/unlock` functions usually consist only of an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html) statement. As we may see, in our case the same is true for the `__down_write_nested` function. Let's try to understand what this function does. The first line of our assembly statement is just a comment, let's skip it. The second line contains `LOCK_PREFIX`, which will be expanded to the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction as we already know. The next [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) instruction executes `add` and `exchange` operations. In other words, the `xadd` instruction adds the value of `RWSEM_ACTIVE_WRITE_BIAS`:
+
+```C
+#define RWSEM_ACTIVE_WRITE_BIAS         (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
+#define RWSEM_WAITING_BIAS              (-RWSEM_ACTIVE_MASK-1)
+#define RWSEM_ACTIVE_BIAS               0x00000001L
+```
+
+or `0xffffffff00000001` to the `count` of the given `reader/writer semaphore` and returns its previous value. After this we check the active mask in `rw_semaphore->count`. If it was zero before, this means that there was no writer before us, so we have acquired the lock. Otherwise we call the `call_rwsem_down_write_failed` function from the [arch/x86/lib/rwsem.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/rwsem.S) assembly file. The `call_rwsem_down_write_failed` function just calls the `rwsem_down_write_failed` function from the [kernel/locking/rwsem-xadd.c](https://github.com/torvalds/linux/blob/master/kernel/locking/rwsem-xadd.c) source code file, saving the general purpose registers beforehand:
+
+```assembly
+ENTRY(call_rwsem_down_write_failed)
+	FRAME_BEGIN
+	save_common_regs
+	movq %rax,%rdi
+	call rwsem_down_write_failed
+	restore_common_regs
+	FRAME_END
+	ret
+ENDPROC(call_rwsem_down_write_failed)
+```
+
+The `rwsem_down_write_failed` function starts from the [atomic](https://en.wikipedia.org/wiki/Linearizability) update of the `count` value:
+
+```C
+ __visible
+struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+    count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
+    ...
+    ...
+    ...
+}
+```
+
+with the `-RWSEM_ACTIVE_WRITE_BIAS` value. The `rwsem_atomic_update` function is defined in the [arch/x86/include/asm/rwsem.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/rwsem.h) header file and implements the exchange and add logic:
+
+```C
+static inline long rwsem_atomic_update(long delta, struct rw_semaphore *sem)
+{
+        return delta + xadd(&sem->count, delta);
+}
+```
+
+This function atomically adds the given delta to the `count` and returns the sum of the given `delta` and the old value of the `count` field. In our case we undo the write bias from the `count`, as we did not acquire the lock. After this step we try to do `optimistic spinning` by calling the `rwsem_optimistic_spin` function:
+
+```C
+if (rwsem_optimistic_spin(sem))
+      return sem;
+```
+
+We will skip the implementation of the `rwsem_optimistic_spin` function, as it is similar to the `mutex_optimistic_spin` function which we saw in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). In short, in the `rwsem_optimistic_spin` function we check that there are no other tasks ready to run that have a higher priority; if so, the process is added to the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) `waitqueue` and spins in a loop until the lock can be acquired. If `optimistic spinning` is disabled, the process will be added to the
+
+```C
+waiter.task = current;
+waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+if (list_empty(&sem->wait_list))
+    waiting = false;
+
+list_add_tail(&waiter.list, &sem->wait_list);
+```
+
+waiters list, marked as waiting for write, and will wait until it successfully acquires the lock. If the waiters list was empty before we added the process, we also update the value of `rw_semaphore->count` with `RWSEM_WAITING_BIAS`:
+
+```C
+count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+```
+
+With this we mark in `rw_semaphore->count` that the lock is already taken and that one `writer` exists/waits which wants to acquire the lock. Otherwise we try to wake the `reader` processes that were queued in the `wait queue` before this `writer` process, provided there are no active readers. At the end of `rwsem_down_write_failed`, a `writer` process which did not acquire the lock goes to sleep in the following loop:
+
+```C
+while (true) {
+    if (rwsem_try_write_lock(count, sem))
+        break;
+    raw_spin_unlock_irq(&sem->wait_lock);
+    do {
+        schedule();
+        set_current_state(TASK_UNINTERRUPTIBLE);
+    } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
+    raw_spin_lock_irq(&sem->wait_lock);
+}
+```
+
+I will skip the explanation of this loop as we have already met similar functionality in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html).
+
+That's all. From this moment, whether our `writer` process acquires the lock depends on the value of the `rw_semaphore->count` field. Now, if we look at the implementation of the `down_read` function, which acquires a lock for reading, we will see actions similar to those we saw in the `down_write` function. This function calls various debugging and lock validator related functions/macros:
+
+```C
+void __sched down_read(struct rw_semaphore *sem)
+{
+        might_sleep();
+        rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+        LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
+}
+```
+
+and does all the job in the `__down_read` function. The `__down_read` function consists of an inline assembly statement:
+
+```C
+static inline void __down_read(struct rw_semaphore *sem)
+{
+         asm volatile("# beginning down_read\n\t"
+                     LOCK_PREFIX _ASM_INC "(%1)\n\t"
+                     "  jns        1f\n"
+                     "  call call_rwsem_down_read_failed\n"
+                     "1:\n\t"
+                     "# ending down_read\n\t"
+                     : "+m" (sem->count)
+                     : "a" (sem)
+                     : "memory", "cc");
+}
+```
+
+which increments the value of the given `rw_semaphore->count` and calls `call_rwsem_down_read_failed` if this value is negative. Otherwise we jump to the label `1:` and exit; after this, the `read` lock is successfully acquired. Notice that we check the sign of the `count` value because it may be negative: as you may remember, the most significant [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29) of `rw_semaphore->count` contains the negated number of active writers.
+
+Let's consider the case when a process wants to acquire a lock for a `read` operation, but the lock is already taken. In this case the `call_rwsem_down_read_failed` function from the [arch/x86/lib/rwsem.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/rwsem.S) assembly file will be called. If you look at the implementation of this function, you will notice that it does the same as the `call_rwsem_down_write_failed` function, except that it calls the `rwsem_down_read_failed` function instead of `rwsem_down_write_failed`. Now let's consider the implementation of the `rwsem_down_read_failed` function. It starts by adding the process to the `wait queue` and updating the value of `rw_semaphore->count`:
+
+```C
+long adjustment = -RWSEM_ACTIVE_READ_BIAS;
+
+waiter.task = tsk;
+waiter.type = RWSEM_WAITING_FOR_READ;
+
+if (list_empty(&sem->wait_list))
+    adjustment += RWSEM_WAITING_BIAS;
+list_add_tail(&waiter.list, &sem->wait_list);
+
+count = rwsem_atomic_update(adjustment, sem);
+```
+
+Notice that if the `wait queue` was empty before, we additionally mark `rw_semaphore->count` as having waiters; in either case we undo the `read` bias. At the next step we check whether there are no active locks and we are the first in the `wait queue`: in that case we need to wake up the queued processes so that they may join the currently active `reader` processes. Otherwise we go to sleep until the lock can be acquired.
+
+That's all. Now we know how `reader` and `writer` processes behave in different cases during lock acquisition. Let's take a short look at the `unlock` operations. The `up_read` and `up_write` functions allow us to release a `reader` or `writer` lock. First of all, let's take a look at the implementation of the `up_write` function which is defined in the [kernel/locking/rwsem.c](https://github.com/torvalds/linux/blob/master/kernel/locking/rwsem.c) source code file:
+
+```C
+void up_write(struct rw_semaphore *sem)
+{
+        rwsem_release(&sem->dep_map, 1, _RET_IP_);
+
+        rwsem_clear_owner(sem);
+        __up_write(sem);
+}
+```
+
+First of all it calls the `rwsem_release` macro, which is related to the lock validator of the Linux kernel, so we will skip it for now. The next line calls the `rwsem_clear_owner` function which, as you may understand from its name, just clears the `owner` field of the given `rw_semaphore`:
+
+```C
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	sem->owner = NULL;
+}
+```
+
+The `__up_write` function does all the job of unlocking the lock. The `__up_write` function is architecture-specific, so in our case it is located in the [arch/x86/include/asm/rwsem.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/rwsem.h) header file. If we take a look at the implementation of this function, we will see that it does almost the same as the `__down_write` function, but in reverse. Instead of adding `RWSEM_ACTIVE_WRITE_BIAS` to the `count`, we subtract the same value and check the `sign` of the previous value.
+
+If the previous value of `rw_semaphore->count` is not negative, the writer process has released the lock and it may now be acquired by someone else. Otherwise `rw_semaphore->count` contains a negative value, which means that there is at least one waiter in the wait queue. In this case the `call_rwsem_wake` function will be called. This function acts like the similar functions which we already saw above: it stores the general purpose registers on the stack to preserve them and calls the `rwsem_wake` function.
+
+First of all, the `rwsem_wake` function checks if a spinner is present; in that case the spinner will just acquire the lock which has just been released by the lock owner. Otherwise there must be someone in the `wait queue`, and we need to wake up either the `writer` process, if it is at the top of the `wait queue`, or all `reader` processes. The `up_read` function, which releases a `reader` lock, acts in a way similar to `up_write`, but with a little difference: instead of subtracting `RWSEM_ACTIVE_WRITE_BIAS` from `rw_semaphore->count`, it subtracts `1` from it, because the less significant word of the `count` contains the number of active locks. After this it checks the `sign` of the `count` and, like `__up_write`, calls `rwsem_wake` if the `count` is negative; otherwise the lock is successfully released.
+
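+The read-unlock path can be sketched in plain C roughly as follows. This is only an illustration of the algorithm described above, not the actual architecture-specific code, which is `LOCK`-prefixed inline assembly in `arch/x86/include/asm/rwsem.h`:
+
+```C
+/* Illustrative sketch only: the real __up_read is inline assembly and
+ * performs the whole update atomically with a LOCK-prefixed instruction. */
+static void up_read_sketch(struct rw_semaphore *sem)
+{
+        long new_count;
+
+        /* drop one active reader from the less significant word of count */
+        new_count = --sem->count;
+
+        /* a negative result means somebody is queued and may need a wakeup */
+        if (new_count < 0)
+                rwsem_wake(sem);
+}
+```
+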
+That's all. We have considered the main API for manipulating a `reader/writer semaphore`: `up_read/up_write` and `down_read/down_write`. The Linux kernel provides an additional API besides these functions, but I will not consider the implementation of those functions in this part, because it is similar to what we have seen here, except for a few subtleties.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the fifth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a special type of `semaphore` - the `readers/writer` semaphore - which provides access to data for multiple processes to read or for one process to write. In the next part we will continue to dive into synchronization primitives in the Linux kernel.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [Synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
+* [Readers/Writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
+* [Spinlocks](https://en.wikipedia.org/wiki/Spinlock)
+* [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
+* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
+* [x86_64 architecture](https://en.wikipedia.org/wiki/X86-64)
+* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
+* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
+* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
+* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
+* [XADD instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
+* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html)

+ 352 - 0
SyncPrim/sync-6.md

@@ -0,0 +1,352 @@
+Synchronization primitives in the Linux kernel. Part 6.
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science)) in the Linux kernel, and in the previous parts we finished considering the different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn about synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock`, or `sequential lock`.
+
+We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but requires an exclusive lock for writing or modifying data. As we may guess, this may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire the lock as long as at least one reader process which acquired the lock still holds it. So, when contention is high, a writer process which wants to acquire the lock may have to wait for a long time.
+
+The `seqlock` synchronization primitive can help solve this problem.
+
+As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.
+
+So, let's start.
+
+Sequential lock
+--------------------------------------------------------------------------------
+
+So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually, `sequential locks` were introduced in the Linux kernel in the 2.6.x series. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources for readers. Since the heart of the `sequential lock` synchronization primitive is a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html), `sequential locks` work best in situations where the protected resources are small and simple. Additionally, write access must be rare and also should be fast.
+
+The operation of this synchronization primitive is based on a sequence counter. Actually, a `sequential lock` allows free access to a resource for readers, but each reader must check for conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of `sequential locks` is simple: each writer which acquires the sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html). When this writer finishes, it increments the counter of the sequential lock again and releases the acquired spinlock to give access to other writers.
+
+Read-only access works on the following principle: the reader gets the value of the `sequential lock` counter before entering the [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of the critical section. If the values are equal, this means that there were no writers during this period. If the values are not equal, this means that a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that the read of the protected data must be repeated.
+
+That's all. As we may see, the principle of operation of `sequential locks` is simple:
+
+```C
+unsigned int seq_counter_value;
+
+do {
+    seq_counter_value = get_seq_counter_val(&the_lock);
+    //
+    // do as we want here
+    //
+} while (__retry__);
+```
+
+Actually, the Linux kernel does not provide a `get_seq_counter_val()` function; it is just a stub here, as is `__retry__`. As I already wrote above, we will see the actual [API](https://en.wikipedia.org/wiki/Application_programming_interface) for this in the next paragraph of this part.
+
+Ok, now we know what the `seqlock` synchronization primitive is and how it works conceptually. So, we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulating synchronization primitives of this type.
+
+Sequential lock API
+--------------------------------------------------------------------------------
+
+So, now that we know a little about the `sequential lock` synchronization primitive from the theoretical side, let's look at its implementation in the Linux kernel. All of the `sequential lock` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
+
+First of all we may see that the `sequential lock` mechanism is represented by the following type:
+
+```C
+typedef struct {
+	struct seqcount seqcount;
+	spinlock_t lock;
+} seqlock_t;
+```
+
+As we may see, `seqlock_t` provides two fields. These fields represent the sequential lock counter, which we described above, and a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which will protect the data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type, which is the following structure:
+
+```C
+typedef struct seqcount {
+	unsigned sequence;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+} seqcount_t;
+```
+
+which holds the counter of a sequential lock and a [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
+
+As always in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
+
+We saw in the previous parts that the Linux kernel often provides two approaches to initialize a given synchronization primitive. The situation is the same with the `seqlock_t` structure. These approaches allow us to initialize a `seqlock_t` in the two following ways:
+
+* `statically`;
+* `dynamically`.
+
+Let's look at the first approach. We are able to initialize a `seqlock_t` statically with the `DEFINE_SEQLOCK` macro:
+
+```C
+#define DEFINE_SEQLOCK(x) \
+		seqlock_t x = __SEQLOCK_UNLOCKED(x)
+```
+
+which is defined in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file. As we may see, the `DEFINE_SEQLOCK` macro takes one argument and expands to the definition and initialization of the `seqlock_t` structure. Initialization occurs with the help of the `__SEQLOCK_UNLOCKED` macro which is defined in the same source code file. Let's look at the implementation of this macro:
+
+```C
+#define __SEQLOCK_UNLOCKED(lockname)			\
+	{						\
+		.seqcount = SEQCNT_ZERO(lockname),	\
+		.lock =	__SPIN_LOCK_UNLOCKED(lockname)	\
+	}
+```
+
+As we may see, the `__SEQLOCK_UNLOCKED` macro initializes the fields of the given `seqlock_t` structure. The first field, `seqcount`, is initialized with the `SEQCNT_ZERO` macro which expands to:
+
+```C
+#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)}
+```
+
+So we just initialize the counter of the given sequential lock to zero; additionally we can see [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related initialization, which depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option:
+
+```C
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+# define SEQCOUNT_DEP_MAP_INIT(lockname) \
+    .dep_map = { .name = #lockname } \
+    ...
+    ...
+    ...
+#else
+# define SEQCOUNT_DEP_MAP_INIT(lockname)
+    ...
+    ...
+    ...
+#endif
+```
+
+As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part, so for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here, as it just initializes a [raw spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
+
+We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, but dynamically. We can initialize a sequential lock with the `seqlock_init` macro, which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
+
+Let's look at the implementation of this macro:
+
+```C
+#define seqlock_init(x)					\
+	do {						\
+		seqcount_init(&(x)->seqcount);		\
+		spin_lock_init(&(x)->lock);		\
+	} while (0)
+```
+
+As we may see, `seqlock_init` expands into two macros. The first macro, `seqcount_init`, takes the counter of the given sequential lock and expands to a call of the `__seqcount_init` function:
+
+```C
+# define seqcount_init(s)				\
+	do {						\
+		static struct lock_class_key __key;	\
+		__seqcount_init((s), #s, &__key);	\
+	} while (0)
+```
+
+from the same header file. This function
+
+```C
+static inline void __seqcount_init(seqcount_t *s, const char *name,
+					  struct lock_class_key *key)
+{
+    lockdep_init_map(&s->dep_map, name, key, 0);
+    s->sequence = 0;
+}
+```
+
+just initializes the counter of the given `seqcount_t` to zero. The second call from the `seqlock_init` macro is a call of the `spin_lock_init` macro, which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
+
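+For illustration, here is how both initialization approaches might look side by side (the `my_seqlock`, `my_dynamic_seqlock` and `my_init` names are made up; the macros are the real kernel API):
+
+```C
+#include <linux/seqlock.h>
+
+/* static initialization at compile time */
+static DEFINE_SEQLOCK(my_seqlock);
+
+/* dynamic initialization at runtime, for example in an init function */
+static seqlock_t my_dynamic_seqlock;
+
+static void my_init(void)
+{
+        seqlock_init(&my_dynamic_seqlock);
+}
+```
+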
+So, now that we know how to initialize a `sequential lock`, let's look at how to use it. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
+
+```C
+static inline unsigned read_seqbegin(const seqlock_t *sl);
+static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start);
+static inline void write_seqlock(seqlock_t *sl);
+static inline void write_sequnlock(seqlock_t *sl);
+static inline void write_seqlock_irq(seqlock_t *sl);
+static inline void write_sequnlock_irq(seqlock_t *sl);
+static inline void read_seqlock_excl(seqlock_t *sl);
+static inline void read_sequnlock_excl(seqlock_t *sl);
+```
+
+and others. Before we move on to considering the implementation of this [API](https://en.wikipedia.org/wiki/Application_programming_interface), we must know that there are actually two types of readers. The first type of reader never blocks a writer process; in this case the writer does not wait for readers. The second type of reader takes a lock; in this case, the locking reader blocks the writer, which has to wait until the reader releases its lock.
+
+First of all let's consider the first type of readers. The `read_seqbegin` function begins a seq-read [critical section](https://en.wikipedia.org/wiki/Critical_section).
+
+As we may see, this function just returns the value of the `read_seqcount_begin` function:
+
+```C
+static inline unsigned read_seqbegin(const seqlock_t *sl)
+{
+	return read_seqcount_begin(&sl->seqcount);
+}
+```
+
+In its turn the `read_seqcount_begin` function calls the `raw_read_seqcount_begin` function:
+
+```C
+static inline unsigned read_seqcount_begin(const seqcount_t *s)
+{
+	return raw_read_seqcount_begin(s);
+}
+```
+
+which, in the end, just reads and returns the value of the `sequential lock` counter:
+
+```C
+static inline unsigned raw_read_seqcount(const seqcount_t *s)
+{
+	unsigned ret = READ_ONCE(s->sequence);
+	smp_rmb();
+	return ret;
+}
+```
+
+After we have obtained the initial value of the given `sequential lock` counter and done some work, we know from the previous paragraph that we need to compare it with the current value of the counter of the same `sequential lock` before we exit the critical section. We can achieve this by calling the `read_seqretry` function. This function takes a `sequential lock` and the start value of the counter and, through a chain of functions:
+
+```C
+static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
+{
+	return read_seqcount_retry(&sl->seqcount, start);
+}
+
+static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
+{
+	smp_rmb();
+	return __read_seqcount_retry(s, start);
+}
+```
+
+it calls the `__read_seqcount_retry` function:
+
+```C
+static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
+{
+	return unlikely(s->sequence != start);
+}
+```
+
+which just compares the value of the counter of the given `sequential lock` with its initial value. If the values differ, or if the initial value of the counter obtained from the `read_seqbegin()` function was odd - meaning a writer was in the middle of updating the data when our reader began to act - the data may be in an inconsistent state, so we need to try to read it again.
+
+This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies` on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
+
+```C
+u64 get_jiffies_64(void)
+{
+	unsigned long seq;
+	u64 ret;
+
+	do {
+		seq = read_seqbegin(&jiffies_lock);
+		ret = jiffies_64;
+	} while (read_seqretry(&jiffies_lock, seq));
+	return ret;
+}
+```
+
+Here we just read the value of the counter of the `jiffies_lock` sequential lock and then write the value of the `jiffies_64` system variable into `ret`. Since there is a `do/while` loop here, the body of the loop is executed at least once. After the body of the loop has been executed, we read and compare the current value of the counter of `jiffies_lock` with the initial value. If these values are not equal, execution of the loop is repeated; otherwise `get_jiffies_64` returns its result in `ret`.
+
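+For completeness, the writer side of the same pattern - the code that updates `jiffies_64` - can be sketched roughly as follows (this is an illustration of the `write_seqlock`/`write_sequnlock` pairing, not the exact kernel timer code):
+
+```C
+/* Illustrative sketch: the real update of jiffies_64 lives in the kernel
+ * timer code; the point here is the locking pattern around the data. */
+static void do_timer_sketch(unsigned long ticks)
+{
+        write_seqlock(&jiffies_lock);
+        jiffies_64 += ticks;
+        write_sequnlock(&jiffies_lock);
+}
+```
+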
+We just saw the first type of readers, which do not block the writer and other readers. Let's consider the second type. It does not update the value of the `sequential lock` counter, but just takes the `spinlock`:
+
+```C
+static inline void read_seqlock_excl(seqlock_t *sl)
+{
+	spin_lock(&sl->lock);
+}
+```
+
+So, no other reader or writer can access the protected data. When the reader finishes, the lock must be released with the:
+
+```C
+static inline void read_sequnlock_excl(seqlock_t *sl)
+{
+	spin_unlock(&sl->lock);
+}
+```
+
+function.
+
+Now we know how a `sequential lock` works for readers. Let's consider how a writer acts when it wants to acquire a `sequential lock` to modify data. To acquire a `sequential lock`, a writer should use the `write_seqlock` function. If we look at the implementation of this function:
+
+```C
+static inline void write_seqlock(seqlock_t *sl)
+{
+	spin_lock(&sl->lock);
+	write_seqcount_begin(&sl->seqcount);
+}
+```
+
+we will see that it acquires the `spinlock` to prevent access from other writers and calls the `write_seqcount_begin` function. This function just increments the value of the `sequential lock` counter:
+
+```C
+static inline void raw_write_seqcount_begin(seqcount_t *s)
+{
+	s->sequence++;
+	smp_wmb();
+}
+```
+
+When a writer process finishes modifying data, the `write_sequnlock` function must be called to release the lock and give access to other writers or readers. Let's consider the implementation of the `write_sequnlock` function. It looks pretty simple:
+
+```C
+static inline void write_sequnlock(seqlock_t *sl)
+{
+	write_seqcount_end(&sl->seqcount);
+	spin_unlock(&sl->lock);
+}
+```
+
+First of all it just calls the `write_seqcount_end` function to increment the value of the counter of the `sequential` lock again:
+
+```C
+static inline void raw_write_seqcount_end(seqcount_t *s)
+{
+	smp_wmb();
+	s->sequence++;
+}
+```
+
+and in the end we just call the `spin_unlock` macro to give access to other readers or writers.
+
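+Putting the reader and writer sides together, a minimal usage sketch of a `seqlock_t` protecting a small structure might look like this (the `my_stats` names are invented for illustration; the locking calls are the real kernel API):
+
+```C
+#include <linux/seqlock.h>
+
+static DEFINE_SEQLOCK(my_stats_lock);
+
+struct my_stats {
+        u64 packets;
+        u64 bytes;
+};
+
+static struct my_stats my_stats;
+
+/* writer: rare and short, protected by the spinlock and the counter */
+static void my_stats_update(u64 bytes)
+{
+        write_seqlock(&my_stats_lock);
+        my_stats.packets++;
+        my_stats.bytes += bytes;
+        write_sequnlock(&my_stats_lock);
+}
+
+/* lock-free reader: retries while a writer was active during the read */
+static struct my_stats my_stats_read(void)
+{
+        struct my_stats snapshot;
+        unsigned int seq;
+
+        do {
+                seq = read_seqbegin(&my_stats_lock);
+                snapshot = my_stats;
+        } while (read_seqretry(&my_stats_lock, seq));
+
+        return snapshot;
+}
+```
+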
+That's all about the `sequential lock` mechanism in the Linux kernel. Of course, we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part, but all other functions are based on the ones described here. For example, the Linux kernel also provides some macros/functions that make the `sequential lock` mechanism safe to use in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) and [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) context: `write_seqlock_irq` and `write_sequnlock_irq`:
+
+```C
+static inline void write_seqlock_irq(seqlock_t *sl)
+{
+	spin_lock_irq(&sl->lock);
+	write_seqcount_begin(&sl->seqcount);
+}
+
+static inline void write_sequnlock_irq(seqlock_t *sl)
+{
+	write_seqcount_end(&sl->seqcount);
+	spin_unlock_irq(&sl->lock);
+}
+```
+
+As we may see, these functions differ only in the spinlock primitives they use. They call `spin_lock_irq` and `spin_unlock_irq` instead of `spin_lock` and `spin_unlock`.
+
+Or, for example, the `write_seqlock_irqsave` and `write_sequnlock_irqrestore` functions, which are the same but use the `spin_lock_irqsave` and `spin_unlock_irqrestore` macros so they can be used in [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)) handlers.
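+
+For example, a writer that must also be safe against interrupt handlers on the local processor could look like the following sketch (the `my_seqlock` and `protected_data` names are hypothetical):
+
+```C
+static DEFINE_SEQLOCK(my_seqlock);
+static unsigned long protected_data;
+
+static void update_protected_data(unsigned long value)
+{
+	unsigned long flags;
+
+	/* disable interrupts on the local CPU and remember their previous state */
+	write_seqlock_irqsave(&my_seqlock, flags);
+	protected_data = value;
+	/* restore the previous interrupt state */
+	write_sequnlock_irqrestore(&my_seqlock, flags);
+}
+```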
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the sixth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive which is called a `sequential lock`. From the theoretical side, this synchronization primitive is very similar to a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitive, but allows us to avoid the `writer-starvation` issue.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science))
+* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) 
+* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
+* [critical section](https://en.wikipedia.org/wiki/Critical_section)
+* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
+* [debugging](https://en.wikipedia.org/wiki/Debugging)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
+* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
+* [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
+* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)

+ 4 - 2
SysCall/README.md

@@ -1,6 +1,8 @@
 # System calls
 
-This chapter describes the `system call` concept in the linux kernel. You will see here a
-couple of posts which describe the full cycle of the kernel loading process:
+This chapter describes the `system call` concept in the linux kernel.
 
 * [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is introduction to the `system call` concept in the Linux kernel.
+* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from a userspace application.
+* [vsyscall and vDSO](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) - the third part describes the `vsyscall` and `vDSO` concepts.
+* [How the Linux kernel runs a program](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) - this part describes the startup process of a program.

+ 26 - 26
SysCall/syscall-1.md

@@ -4,14 +4,14 @@ System calls in the Linux kernel. Part 1.
 Introduction
 --------------------------------------------------------------------------------
 
-This post opens up a new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace, we will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
+This post opens up a new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book, and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace. We will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
 
-Before we start to dive into the implementation of the system calls related stuff in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.
+Before we dive into Linux system call implementation, it is good to know some theory about system calls. Let's do it in the following paragraph.
 
 System call. What is it?
 --------------------------------------------------------------------------------
 
-A system call is just a userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, starts to listen for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create a directory, or even to finish its work, a program uses a system call. In another words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function that is placed in the kernel space and a user program can ask the kernel to do something via this function.
+A system call is just a userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start listening for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create a directory, or even finish its work, it uses a system call. In other words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) kernel space function that user space programs call to handle some request.
 
 The Linux kernel provides a set of these functions and each architecture provides its own set. For example: the [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and the [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls. Ok, a system call is just a function. Let's look on a simple `Hello world` example that's written in the assembly programming language:
 
@@ -120,7 +120,7 @@ _exit(0)                                = ?
 +++ exited with 0 +++
 ```
 
-In the first line of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass the parameter through the general purpose registers in our example. The order of the registers is not not accidental. The order of the registers is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
+In the first line of the `strace` output, we can see the [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are the system calls that we have used in our program: `write` and `exit`. Note that we pass the parameters through the general purpose registers in our example. The order of the registers is not accidental. The order of the registers is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreements for the `x86_64` architecture are explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In general, the argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
 
 * `rdi`;
 * `rsi`;
@@ -131,7 +131,7 @@ In the first line of the `strace` output, we can see [execve](https://github.com
 
 for the first six parameters of a function. If a function has more than six arguments, other parameters will be placed on the stack.
 
-We do not use system calls in our code directly, but anyway our program uses it when we want to print something, check access to a file or just write or read something to it.
+We do not use system calls in our code directly, but our program uses it when we want to print something, check access to a file or just write or read something to it.
 
 For example:
 
@@ -152,7 +152,7 @@ int main(int argc, char **argv)
 }
 ```
 
+There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read`, `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that are defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but use [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason for this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility for performing system calls with the correct set of parameters and makes different checks before it calls the given system call. Let's compile our program with the following command:
+There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read` `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but using [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility to perform system calls with the correct set parameters and makes different checks before it will call the given system call. Let's compile our program with the following command:
 
 ```
 $ gcc test.c -o test
@@ -178,13 +178,13 @@ The `ltrace` util displays a set of userspace calls of a program. The `fopen` fu
 write@SYS(1, "Hello World!\n\n", 14) = 14
 ```
 
-Yes, system calls are ubiquitous. Each program needs to open/write/read file, network connection, allocation of memory and many other things that can be provide only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special file in a format: `/proc/pid/systemcall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, first pid that is [systemd](https://en.wikipedia.org/wiki/Systemd) for me uses:
+Yes, system calls are ubiquitous. Each program needs to open/write/read files, work with network connections, allocate memory and do many other things that can be provided only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special files in the format `/proc/pid/syscall` that expose the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, which is [systemd](https://en.wikipedia.org/wiki/Systemd) for me:
 
 ```
 $ sudo cat /proc/1/comm
 systemd
 
-$ sudo cat /proc/1/syscall 
+$ sudo cat /proc/1/syscall
 232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
 ```
 
@@ -197,18 +197,18 @@ $ ps ax | grep emacs
 $ sudo cat /proc/2093/comm
 emacs
 
-$ sudo cat /proc/2093/syscall 
+$ sudo cat /proc/2093/syscall
 270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
 ```
 
 the system call with the number `270` which is [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279) system call that allows `emacs` to monitor multiple file descriptors.
 
-Now we know a little about system call, what is it and why do we need in it. So let's look on the `write` system that our program used.
+Now we know a little about system calls: what they are and why we need them. So let's look at the `write` system call that our program used.
 
 Implementation of write system call
 --------------------------------------------------------------------------------
 
-Let's look on the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:
+Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:
 
 ```C
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
@@ -229,7 +229,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 }
 ```
 
-First of all about the `SYSCALL_DEFINE3` macro. This macro defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look on this macro:
+First of all, the `SYSCALL_DEFINE3` macro is defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look at this macro:
 
 ```C
 #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
@@ -244,7 +244,7 @@ As we can see the `SYSCALL_DEFINE3` macro takes `name` parameter which will repr
 * `SYSCALL_METADATA`;
 * `__SYSCALL_DEFINEx`.
 
-Implementation of the first macro `SYSCALL_METADATA` depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As we can understand from the name of this option, it allows to enable tracer to catch the syscall entry and exit events. If this kernel configration option is enabled, the `SYSCALL_METADATA` macro executes initialization of the `syscall_metadata` structure that defined in the [include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) header file and contains different useful fields as name of a system call, number of a system call in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), number of parameters of a system call, list of parameter types and etc:
+Implementation of the first macro `SYSCALL_METADATA` depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As we can understand from the name of this option, it allows a tracer to catch the syscall entry and exit events. If this kernel configuration option is enabled, the `SYSCALL_METADATA` macro executes initialization of the `syscall_metadata` structure that is defined in the [include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) header file and contains different useful fields such as the name of a system call, the number of a system call in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), the number of parameters of a system call, the list of parameter types and so on:
 
 ```C
 #define SYSCALL_METADATA(sname, nb, ...)                             \
@@ -296,13 +296,13 @@ The second macro `__SYSCALL_DEFINEx` expands to the definition of the five follo
         static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
 ```
 
-The first `sys##name` is definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines call input parameter system type and the parameter name, because the macro definition is unable to determine the parameter types. And the `__MAP` macro applyes `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions that are generated by the `__SYSCALL_DEFINEx` macro are need to protect from the [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029) and we will not dive into details about this here. Ok, as result of the `SYSCALL_DEFINE3` macro, we will have:
+The first `sys##name` is the definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines the call input parameter system type and the parameter name, because the macro definition is unable to determine the parameter types. And the `__MAP` macro applies the `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions that are generated by the `__SYSCALL_DEFINEx` macro are needed to protect from the [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029) and we will not dive into details about this here. Ok, as a result of the `SYSCALL_DEFINE3` macro, we will have:
 
 ```C
-asmlinkage long sys_write(unsigned int fd, const char __user * filename, size_t count);
+asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
 ```
 
-Now we know a little about system calls definition and we can back to the implementation of the `write` system call. Let's look on the implementation of this system call again:
+Now we know a little about the system call's definition and we can go back to the implementation of the `write` system call. Let's look at the implementation of this system call again:
 
 ```C
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
@@ -323,13 +323,13 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 }
 ```
 
-As we already know and can see on the code, it takes three arguments:
+As we already know and can see from the code, it takes three arguments:
 
 * `fd`    - file descriptor;
 * `buf`   - buffer to write;
 * `count` - length of buffer to write.
 
+and writes data from a buffer declared by the user to a given device or a file. Note that the second parameter, `buf`, is defined with the `__user` attribute. The main purpose of this attribute is for checking the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It is defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about useful meta-information related to our `sys_write` system call, so let's try to understand how this system call is implemented. As we can see, it starts with the definition of the `f` variable of the `fd` structure type, which represents a file descriptor in the Linux kernel, and we put the result of the call of the `fdget_pos` function into it. The `fdget_pos` function is defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just expands to the call of the `__to_fd` function:
+and writes data from a buffer declared by the user to a given device or a file. Note that the second parameter `buf`, defined with the `__user` attribute. The main purpose of this attribute is for checking the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It is defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about useful meta-information related to our `sys_write` system call, let's try to understand how this system call is implemented. As we can see it starts from the definition of the `f` structure that has `fd` structure type that represent file descriptor in the Linux kernel and we put the result of the call of the `fdget_pos` function. The `fdget_pos` function defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just expands the call of the `__to_fd` function:
 
 ```C
 static inline struct fd fdget_pos(int fd)
@@ -338,7 +338,7 @@ static inline struct fd fdget_pos(int fd)
 }
 ```
 
-The main purpose of the `fdget_pos` is convert given file descriptor which is just number to the `fd` strucutre. Through the long chain of function calls, the `fdget_pos` function get the file descriptor table of the current process or in another words `current->files` and tries to find correspnding file descriptor number there. As we got `fd` structure for the given file descriptor number, we check it and return if it does not exist. In other way we get the current position in the file with the call of the `file_pos_read` function that just returns `f_pos` field of the our file:
+The main purpose of the `fdget_pos` is to convert the given file descriptor, which is just a number, to the `fd` structure. Through the long chain of function calls, the `fdget_pos` function gets the file descriptor table of the current process, `current->files`, and tries to find a corresponding file descriptor number there. Once we have the `fd` structure for the given file descriptor number, we check it and return if it does not exist. We get the current position in the file with the call of the `file_pos_read` function that just returns the `f_pos` field of our file:
 
 ```C
 static inline loff_t file_pos_read(struct file *file)
@@ -347,14 +347,14 @@ static inline loff_t file_pos_read(struct file *file)
 }
 ```
 
-and call the `vfs_write` function. The `vfs_write` function defined the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does main work for us - writes given buffer to the given file starting from the given position. We will not dive into details about the `vfs_write` function, because this function is weakly related to the `system call` concept but mostly about [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept which we will see in another chapter. As the `vfs_write` has finished its work, we check the result of it and if it was finished successfully we change the position in the file with the `file_pos_write` function:
+and call the `vfs_write` function. The `vfs_write` function is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does the main work for us - it writes the given buffer to the given file starting from the given position. We will not dive into details about the `vfs_write` function, because this function is weakly related to the `system call` concept and mostly about the [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept, which we will see in another chapter. After the `vfs_write` has finished its work, we check the result and if it finished successfully we change the position in the file with the `file_pos_write` function:
 
 ```C
 if (ret >= 0)
 	file_pos_write(f.file, pos);
 ```
 
-that just updates `f_pos` with the given position of the give file:
+that just updates `f_pos` with the given position in the given file:
 
 ```C
 static inline void file_pos_write(struct file *file, loff_t pos)
@@ -363,24 +363,24 @@ static inline void file_pos_write(struct file *file, loff_t pos)
 }
 ```
 
-In the end of the our `write` system call handler, we can see call of the following function:
+At the end of our `write` system call handler, we can see the call of the following function:
 
 ```C
 fdput_pos(f);
 ```
 
+which unlocks the `f_pos_lock` mutex that protects the file position during concurrent writes from threads that share the file descriptor.
+unlocks the `f_pos_lock` mutex that protects file position during concurrent writes from threads that share file descriptor.
 
 That's all.
 
-Just now, we partly saw implementation one of system calls that provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call in this part, because as I already wrote above, we will see only system calls related stuff in this chapter and will not see other stuff related to the other subsystem as [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) and etc.
+We have seen the partial implementation of one system call provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call, because as I mentioned above, we will see only system calls related stuff in this chapter and will not see other stuff related to other subsystems, such as [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system).
 
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the first part about system calls concept in the Linux kernel. We saw theory about this concept in this part and in the next part we will continue to dive into this topic and start to touch Linux kernel code which is related to the system calls.
+This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls.
 
-If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new).
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
 
 **Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 

+ 409 - 0
SysCall/syscall-2.md

@@ -0,0 +1,409 @@
+System calls in the Linux kernel. Part 2.
+================================================================================
+
+How does the Linux kernel handle a system call 
+--------------------------------------------------------------------------------
+
+The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
+In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
+
+A user application does not usually make a system call directly. We did not write the `Hello world!` program like this:
+
+```C
+int main(int argc, char **argv)
+{
+	...
+	...
+	...
+	sys_write(fd1, buf, strlen(buf));
+	...
+	...
+}
+```
+
+We can use something similar with the help of [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library) and it will look something like this:
+
+```C
+#include <unistd.h>
+
+int main(int argc, char **argv)
+{
+	...
+	...
+	...
+	write(fd1, buf, strlen(buf));
+	...
+	...
+}
+```
+
+But anyway, `write` is not a direct system call and not a kernel function. An application must fill general purpose registers with the correct values in the correct order and use the `syscall` instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the `syscall` instruction is met by the processor.
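+
+Just to illustrate what such a wrapper ultimately boils down to on `x86_64`, here is a minimal, hypothetical user-space sketch that invokes the `write` system call directly with inline assembly (this is only an illustration and is not taken from the kernel or the C library):
+
+```C
+#include <string.h>
+
+int main(void)
+{
+	const char *buf = "Hello World!\n";
+	long ret;
+
+	/* rax = 1 (the number of the write system call on x86_64),
+	 * rdi = file descriptor, rsi = buffer, rdx = length */
+	asm volatile ("syscall"
+		      : "=a" (ret)
+		      : "a" (1L), "D" (1L), "S" (buf), "d" (strlen(buf))
+		      : "rcx", "r11", "memory");
+
+	return ret < 0 ? 1 : 0;
+}
+```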
+
+Initialization of the system calls table
+--------------------------------------------------------------------------------
+
+From the previous part we know that system call concept is very similar to an interrupt. Furthermore, system calls are implemented as software interrupts. So, when the processor handles a `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that will react on an exception) are placed in the kernel code. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array in the Linux kernel which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation:
+
+```C
+asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
+	[0 ... __NR_syscall_max] = &sys_ni_syscall,
+    #include <asm/syscalls_64.h>
+};
+```
+
+As we can see, the `sys_call_table` is an array of `__NR_syscall_max + 1` size where the `__NR_syscall_max` macro represents the maximum number of system calls for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so for our case the `__NR_syscall_max` is `322` and this is the correct number at the time of writing (current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation - `include/generated/asm-offsets.h`:
+
+```C
+#define __NR_syscall_max 322
+```
+
+There will be the same number of system calls in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. There are two important topics here; the type of the `sys_call_table` array, and the initialization of elements in this array. First of all, the type. The `sys_call_ptr_t` represents a pointer to a system call table entry. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a function pointer that returns nothing and does not take arguments:
+
+```C
+typedef void (*sys_call_ptr_t)(void);
+```
+
+The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to the `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of the `sys_call_table` array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize storage of the pointers to the system call handlers, it is populated later on. Implementation of the `sys_ni_syscall` is pretty easy, it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html) or `-ENOSYS` in our case:
+
+```C
+asmlinkage long sys_ni_syscall(void)
+{
+	return -ENOSYS;
+}
+```
+
+The `-ENOSYS` error tells us that:
+
+```
+ENOSYS          Function not implemented (POSIX.1)
+```
+
+Also a note on `...` in the initialization of the `sys_call_table`. We can do it with a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler extension called - [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As you can see, we include the `asm/syscalls_64.h` header at the end of the array. This header file is generated by the special script at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) and generates our header file from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` contains definitions of the following macros:
+
+```C
+__SYSCALL_COMMON(0, sys_read, sys_read)
+__SYSCALL_COMMON(1, sys_write, sys_write)
+__SYSCALL_COMMON(2, sys_open, sys_open)
+__SYSCALL_COMMON(3, sys_close, sys_close)
+__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
+...
+...
+...
+```
+
+The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro which expands to the function definition:
+
+```C
+#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
+#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
+```
+
+So, after this, our `sys_call_table` takes the following form:
+
+```C
+asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
+	[0 ... __NR_syscall_max] = &sys_ni_syscall,
+	[0] = sys_read,
+	[1] = sys_write,
+	[2] = sys_open,
+	...
+	...
+	...
+};
+```
+
+After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions.
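+
+The `[first ... last] = value` range syntax and the overriding of individual elements can be demonstrated with a small user-space program - just an illustration of the [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) extension, not kernel code:
+
+```C
+#include <stdio.h>
+
+/* every element defaults to "not implemented", later designators override it */
+static const char *table[6] = {
+	[0 ... 5] = "not implemented",
+	[0] = "read",
+	[1] = "write",
+	[2] = "open",
+};
+
+int main(void)
+{
+	for (int i = 0; i < 6; i++)
+		printf("%d: %s\n", i, table[i]);
+
+	return 0;
+}
+```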
+
+At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving user space registers, switching to a new stack and many more tasks before it calls an interrupt handler. The situation is the same with system call handling. But before the Linux kernel can start these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this initialization. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
+
+Initialization of the system call entry
+--------------------------------------------------------------------------------
+
+When a system call occurs in the system, where are the first bytes of code that starts to handle it? As we can read in the Intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
+
+```
+SYSCALL invokes an OS system-call handler at privilege level 0.
+It does so by loading RIP from the IA32_LSTAR MSR
+```
+
+it means that we need to put the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exception handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file which, besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.
+
+This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all fills two model specific registers:
+
+```C
+wrmsrl(MSR_STAR,  ((u64)__USER32_CS)<<48  | ((u64)__KERNEL_CS)<<32);
+wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
+```
+
+The first model specific register - `MSR_STAR` contains `63:48` bits of the user code segment. These bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` contains `47:32` bits from the kernel code that will be used as the base selector for `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation performed before a system call handler will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter.
+
+After we have set the entry point for system calls, we need to set the following model specific registers:
+
+* `MSR_CSTAR` - target `rip` for the compatibility mode callers;
+* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction;
+* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction;
+* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction.
+
+The values of these model specific registers depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for system calls in compatibility mode:
+
+```C
+wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
+```
+
+and with the kernel code segment, put zero to the stack pointer and write the address of the `entry_SYSENTER_compat` symbol to the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter):
+
+```C
+wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
+wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+```
+
+Otherwise, if the `CONFIG_IA32_EMULATION` kernel configuration option is disabled, we write the address of the `ignore_sysret` symbol to the `MSR_CSTAR`:
+
+```C
+wrmsrl(MSR_CSTAR, ignore_sysret);
+```
+
+that is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns `-ENOSYS` error code:
+
+```assembly
+ENTRY(ignore_sysret)
+	mov	$-ENOSYS, %eax
+	sysret
+END(ignore_sysret)
+```
+
+Now we need to fill `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, `MSR_IA32_SYSENTER_EIP` model specific registers as we did in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled. In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `MSR_IA32_SYSENTER_CS` model specific register:
+
+```C
+wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+```
+
+You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel.
+
+At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register:
+
+```C
+wrmsrl(MSR_SYSCALL_MASK,
+	   X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
+	   X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
+```
+
+Flags that are set in this mask will be cleared in the flags register when the `syscall` instruction is executed. That's all, this is the end of the `syscall_init` function and it means that the system call entry is ready to work. Now we can see what will occur when a user application executes the `syscall` instruction.
+
+Preparation before system call handler will be called
+--------------------------------------------------------------------------------
+
+As I already wrote, before a system call or an interrupt handler will be called by the Linux kernel we need to do some preparations. The `idtentry` macro performs the preparations required before an exception handler will be executed, the `interrupt` macro performs the preparations required before an interrupt handler will be called and the `entry_SYSCALL_64` will do the preparations required before a system call handler will be executed.
+
+The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S)  assembly file and starts from the following macro:
+
+```assembly
+SWAPGS_UNSAFE_STACK
+```
+
+This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:
+
+```C
+#define SWAPGS_UNSAFE_STACK	swapgs
+```
+
+which exchanges the current `GS` base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register, so that the kernel's per-cpu area becomes accessible. After this we save the old stack pointer in the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and set the stack pointer to point to the top of the stack for the current processor:
+
+```assembly
+movq	%rsp, PER_CPU_VAR(rsp_scratch)
+movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+```
+
+In the next step we push the stack segment and the old stack pointer to the stack:
+
+```assembly
+pushq	$__USER_DS
+pushq	PER_CPU_VAR(rsp_scratch)
+```
+
+After this we enable interrupts, because interrupts are `off` on entry and save the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and from `r12` to `r15`), flags, `-ENOSYS` for the non-implemented system call and code segment register on the stack:
+
+```assembly
+ENABLE_INTERRUPTS(CLBR_NONE)
+
+pushq	%r11
+pushq	$__USER_CS
+pushq	%rcx
+pushq	%rax
+pushq	%rdi
+pushq	%rsi
+pushq	%rdx
+pushq	%rcx
+pushq	$-ENOSYS
+pushq	%r8
+pushq	%r9
+pushq	%r10
+pushq	%r11
+sub	$(6*8), %rsp
+```
+
+When a system call occurs from the user's application, general purpose registers have the following state:
+
+* `rax` - contains system call number; 
+* `rcx` - contains return address to the user space;
+* `r11` - contains register flags;
+* `rdi` - contains first argument of a system call handler;
+* `rsi` - contains second argument of a system call handler;
+* `rdx` - contains third argument of a system call handler;
+* `r10` - contains fourth argument of a system call handler;
+* `r8`  - contains fifth argument of a system call handler;
+* `r9`  - contains sixth argument of a system call handler;
+
+Other general purpose registers (such as `rbp`, `rbx` and `r12` through `r15`) are callee-preserved in the [C ABI](http://www.x86-64.org/documentation/abi.pdf). So we push the register flags on the top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, a dummy error code for the non-implemented system call and the other arguments on the stack.
+
+In the next step we check the `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`:
+
+```assembly
+testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+jnz	tracesys
+```
+
+The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides a set of thread information flags that are related to system call tracing:
+
+```C
+#define _TIF_WORK_SYSCALL_ENTRY \
+    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
+    _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |     \
+    _TIF_NOHZ)
+```
+
+We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter devoted to debugging and tracing techniques in the Linux kernel. After the `tracesys` label, the next label is `entry_SYSCALL_64_fastpath`. In `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file:
+
+```C
+# ifdef CONFIG_X86_X32_ABI
+#  define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
+# else
+#  define __SYSCALL_MASK (~0)
+# endif
+```
+
+where the `__X32_SYSCALL_BIT` is
+
+```C
+#define __X32_SYSCALL_BIT	0x40000000
+```
+
+As we can see the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.
+
+So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is disabled we compare the value of the `rax` register to the maximum syscall number (`__NR_syscall_max`), alternatively if the `CONFIG_X86_X32_ABI` is enabled we mask the `eax` register with the `__X32_SYSCALL_BIT` and do the same comparison: 
+
+```assembly
+#if __SYSCALL_MASK == ~0
+	cmpq	$__NR_syscall_max, %rax
+#else
+	andl	$__SYSCALL_MASK, %eax
+	cmpl	$__NR_syscall_max, %eax
+#endif
+```
+
+After this we check the result of the last comparison with the `ja` instruction that executes if `CF` and `ZF` flags are zero:
+
+```assembly
+ja	1f
+```
+
+and if we have the correct system call for this, we move the fourth argument from the `r10` to the `rcx` to keep [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) compliant and execute the `call` instruction with the address of a system call handler: 
+
+```assembly
+movq	%r10, %rcx
+call	*sys_call_table(, %rax, 8)
+```
+
+Note, the `sys_call_table` is the array that we saw above in this part. As we already know, the `rax` general purpose register contains the number of a system call and each element of the `sys_call_table` is 8 bytes. So we use the `*sys_call_table(, %rax, 8)` notation to find the correct offset in the `sys_call_table` array for the given system call handler.
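+
+In C terms this indexed call is just indexing an array of function pointers and calling through the selected entry. A tiny user-space sketch of the same idea (all names here are hypothetical and unrelated to the kernel):
+
+```C
+#include <stdio.h>
+
+typedef long (*handler_t)(void);
+
+static long demo_read(void)  { return 100; }
+static long demo_write(void) { return 200; }
+
+/* each entry is an 8-byte function pointer, so entry n lives at offset n * 8 */
+static const handler_t table[] = { demo_read, demo_write };
+
+int main(void)
+{
+	long nr = 1;               /* pretend rax contains 1 */
+	long ret = table[nr]();    /* what *sys_call_table(, %rax, 8) does */
+
+	printf("handler returned %ld\n", ret);
+	return 0;
+}
+```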
+
+That's all. We did all the required preparations and the handler was called for the given system call, for example `sys_read`, `sys_write` or any other system call handler that is defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.
+
+Exit from a system call
+--------------------------------------------------------------------------------
+
+After a system call handler finishes its work, we will return back to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after where we have called the system call handler:
+
+```assembly
+call	*sys_call_table(, %rax, 8)
+```
+
+The next step after we've returned from a system call handler is to put the return value of the system call handler onto the stack. We know that a system call returns the result to the user program in the general purpose `rax` register, so we move its value onto the stack after the system call handler has finished its work:
+
+```assembly
+movq	%rax, RAX(%rsp)
+```
+
+into the `RAX` slot on the stack.
+
+After this we can see the call of the `LOCKDEP_SYS_EXIT` macro from the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h):
+
+```assembly
+LOCKDEP_SYS_EXIT
+```
+
+The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of the `entry_SYSCALL_64` function we restore all general purpose registers besides `rcx` and `r11`, because the `rcx` register must contain the return address to the application that called the system call and the `r11` register contains the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all general purpose registers are restored, we fill `rcx` with the return address, the `r11` register with the flags and `rsp` with the old stack pointer:
+
+```assembly
+RESTORE_C_REGS_EXCEPT_RCX_R11
+
+movq	RIP(%rsp), %rcx
+movq	EFLAGS(%rsp), %r11
+movq	RSP(%rsp), %rsp
+
+USERGS_SYSRET64
+```
+
+In the end we just call the `USERGS_SYSRET64` macro that expands to the call of the `swapgs` instruction which exchanges again the user `GS` and kernel `GS` and the `sysretq` instruction which executes on exit from a system call handler:
+
+```C
+#define USERGS_SYSRET64				\
+	swapgs;	           				\
+	sysretq;
+```
+
+Now we know what occurs when a user application calls a system call. The full path of this process is as follows:
+
+* The user application contains code that fills general purpose registers with the values (the system call number and the arguments of this system call);
+* The processor switches from user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`;
+* `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, the old stack and code segment, flags and so on, on the stack;
+* `entry_SYSCALL_64` checks the system call number in the `rax` register, looks the system call handler up in the `sys_call_table` and calls it, if the number of the system call is correct;
+* If the system call number is not correct, jump to the exit from the system call;
+* After the system call handler finishes its work, the general purpose registers, old stack, flags and return address are restored and `entry_SYSCALL_64` exits with the `sysretq` instruction.
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [system call](https://en.wikipedia.org/wiki/System_call)
+* [write](http://man7.org/linux/man-pages/man2/write.2.html)
+* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
+* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt)
+* [typedef](https://en.wikipedia.org/wiki/Typedef)
+* [errno](http://man7.org/linux/man-pages/man3/errno.3.html)
+* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
+* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
+* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
+* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
+* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
+* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
+* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
+* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
+* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
+* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
+* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)

+ 403 - 0
SysCall/syscall-3.md

@@ -0,0 +1,403 @@
+System calls in the Linux kernel. Part 3.
+================================================================================
+
+vsyscalls and vDSO
+--------------------------------------------------------------------------------
+
+This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we saw the preparations that are done after a system call is issued by a userspace application and the process of handling a system call. In this part we will look at two concepts that are very close to the system call concept; they are called `vsyscall` and `vdso`.
+
+We already know what a `system call` is. It is a special routine in the Linux kernel which a userspace application asks to do privileged tasks, like reading from or writing to a file, opening a socket and so on. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping back into userspace after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` - are designed to speed up this process for certain system calls and in this part we will try to understand how these mechanisms work.
+
+Introduction to vsyscalls
+--------------------------------------------------------------------------------
+
+The `vsyscall` or `virtual system call` is the first and oldest mechanism in the Linux kernel that is designed to accelerate execution of certain system calls. The principle of work of the `vsyscall` concept is simple. The Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel [documentation](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
+
+```
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+```
+
+or:
+
+```
+~$ sudo cat /proc/1/maps | grep vsyscall
+ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
+```
+
+Thanks to this mapping, these system calls can be executed in userspace, which means there is no [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization from the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).
+
+Note that the implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:
+
+```C
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+extern void map_vsyscall(void);
+#else
+static inline void map_vsyscall(void) {}
+#endif
+```
+
+As we can read in the help text, the `CONFIG_X86_VSYSCALL_EMULATION` configuration option: `Enable vsyscall emulation`. Why emulate `vsyscall`? Actually, the `vsyscall` mechanism is a legacy [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) that is deprecated for security reasons: virtual system calls have fixed addresses, meaning that the `vsyscall` page is at the same location every time, and the location of this page is determined in the `map_vsyscall` function. Let's look at the implementation of this function:
+
+```C
+void __init map_vsyscall(void)
+{
+    extern char __vsyscall_page;
+    unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
+	...
+	...
+	...
+}
+```
+
+As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
+
+```
+ffffffff81881000 D __vsyscall_page
+```
+
+It is placed in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains calls of the following three system calls:
+
+* `gettimeofday`;
+* `time`;
+* `getcpu`.
+
+Or:
+
+```assembly
+__vsyscall_page:
+	mov $__NR_gettimeofday, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_time, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_getcpu, %rax
+	syscall
+	ret
+```
+
+Let's go back to the implementation of the `map_vsyscall` function; we will return to the implementation of `__vsyscall_page` later. After receiving the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:
+
+```C
+if (vsyscall_mode != NONE)
+	__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
+                 vsyscall_mode == NATIVE
+                             ? PAGE_KERNEL_VSYSCALL
+                             : PAGE_KERNEL_VVAR);
+```
+
+The `__set_fixmap` macro takes three arguments. The first is an index into the `fixed_addresses` [enum](https://en.wikipedia.org/wiki/Enumerated_type); in our case `VSYSCALL_PAGE` is the first element of the `fixed_addresses` enum for the `x86_64` architecture:
+
+```C
+enum fixed_addresses {
+...
+...
+...
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
+#endif
+...
+...
+...
+```
+
+It is equal to `511`. The second argument is the physical address of the page that has to be mapped and the third argument contains the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable: they will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and `PAGE_KERNEL_VVAR` otherwise. Both macros (`PAGE_KERNEL_VSYSCALL` and `PAGE_KERNEL_VVAR`) will be expanded to the following flags:
+
+```C
+#define __PAGE_KERNEL_VSYSCALL          (__PAGE_KERNEL_RX | _PAGE_USER)
+#define __PAGE_KERNEL_VVAR              (__PAGE_KERNEL_RO | _PAGE_USER)
+```
+
+which represent the access rights to the `vsyscall` page. Both contain the `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The other flag depends on the value of the `vsyscall_mode` variable. The first set of flags (`__PAGE_KERNEL_VSYSCALL`) is used when `vsyscall_mode` is `NATIVE`; in this case virtual system calls are executed as native `syscall` instructions. Otherwise, when the `vsyscall_mode` variable is `EMULATE`, the vsyscall page gets the `PAGE_KERNEL_VVAR` rights; in this case virtual system calls are turned into traps and emulated reasonably. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function:
+
+```C
+static int __init vsyscall_setup(char *str)
+{
+	if (str) {
+		if (!strcmp("emulate", str))
+			vsyscall_mode = EMULATE;
+		else if (!strcmp("native", str))
+			vsyscall_mode = NATIVE;
+		else if (!strcmp("none", str))
+			vsyscall_mode = NONE;
+		else
+			return -EINVAL;
+
+		return 0;
+	}
+
+	return -EINVAL;
+}
+```
+
+which will be called during early kernel parameter parsing:
+
+```C
+early_param("vsyscall", vsyscall_setup);
+```
+
+You can read more about the `early_param` macro in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes the process of the initialization of the Linux kernel.
+
+At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro:
+
+```C
+BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
+             (unsigned long)VSYSCALL_ADDR);
+```
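+
+For reference, the generic `__fix_to_virt` helper essentially computes `FIXADDR_TOP - (index << PAGE_SHIFT)`, so this check is just the `VSYSCALL_PAGE` definition we saw above, applied in reverse. A rough sketch of the arithmetic (assuming that generic definition):
+
+```C
+/*
+ * VSYSCALL_PAGE == (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT, so:
+ *
+ * __fix_to_virt(VSYSCALL_PAGE) == FIXADDR_TOP - (VSYSCALL_PAGE << PAGE_SHIFT)
+ *                              == FIXADDR_TOP - (FIXADDR_TOP - VSYSCALL_ADDR)
+ *                              == VSYSCALL_ADDR
+ *
+ * as long as FIXADDR_TOP - VSYSCALL_ADDR is page aligned.
+ */
+```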
+
+That's all: the `vsyscall` page is set up. The result of all the above is the following: if we pass the `vsyscall=native` parameter to the kernel command line, virtual system calls will be handled as native `syscall` instructions in [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S). [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the virtual system call handlers. Note that the virtual system call handlers are aligned to `1024` (or `0x400`) bytes:
+
+```assembly
+__vsyscall_page:
+	mov $__NR_gettimeofday, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_time, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_getcpu, %rax
+	syscall
+	ret
+```
+
+And the start address of the `vsyscall` page is always `ffffffffff600000`. So, [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of all the virtual system call handlers. You can find the definitions of these addresses in the `glibc` source code:
+
+```C
+#define VSYSCALL_ADDR_vgettimeofday   0xffffffffff600000
+#define VSYSCALL_ADDR_vtime 	      0xffffffffff600400
+#define VSYSCALL_ADDR_vgetcpu	      0xffffffffff600800
+```
+
+All virtual system call requests will fall into the `__vsyscall_page` + `VSYSCALL_ADDR_vsyscall_name` offset, where the handler puts the number of the virtual system call into the `rax` general purpose [register](https://en.wikipedia.org/wiki/Processor_register) and the native x86_64 `syscall` instruction is executed.
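+
+To make this more concrete, here is a minimal userspace sketch (not from the kernel or glibc sources) that calls the `time` virtual system call directly through its fixed `VSYSCALL_ADDR_vtime` address; on a kernel booted with `vsyscall=none` it would crash with a segmentation fault instead of printing a value:
+
+```C
+#include <stdio.h>
+#include <time.h>
+
+int main(void)
+{
+	/* VSYSCALL_ADDR_vtime: the fixed address of the vsyscall time() handler. */
+	time_t (*vtime)(time_t *) = (time_t (*)(time_t *))0xffffffffff600400;
+
+	time_t t = vtime(NULL);
+
+	printf("seconds since the Epoch: %ld\n", (long)t);
+	return 0;
+}
+```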
+
+In the second case, if we pass the `vsyscall=emulate` parameter to the kernel command line, an attempt to execute a virtual system call handler will cause a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Remember that in this case the `vsyscall` page has `__PAGE_KERNEL_VVAR` access rights that forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason of the last page fault, and one of the possible reasons is that a virtual system call was invoked while the `vsyscall` mode is `emulate`. In this case the `vsyscall` will be handled by the `emulate_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.
+
+The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints an error and sends a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) signal if the number is invalid:
+
+```C
+...
+...
+...
+vsyscall_nr = addr_to_vsyscall_nr(address);
+if (vsyscall_nr < 0) {
+	warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...");
+	goto sigsegv;
+}
+...
+...
+...
+sigsegv:
+	force_sig(SIGSEGV, current);
+	return true;
+```
+
+After it has checked the number of the virtual system call, it does some other checks, like `access_ok` violations, and executes the corresponding system call function depending on the number of the virtual system call:
+
+```C
+switch (vsyscall_nr) {
+	case 0:
+		ret = sys_gettimeofday(
+			(struct timeval __user *)regs->di,
+			(struct timezone __user *)regs->si);
+		break;
+	...
+	...
+	...
+}
+```
+
+In the end we put the result of the `sys_gettimeofday` or another virtual system call handler into the `ax` general purpose register, as we do with normal system calls, restore the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and add `8` bytes to the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) register. This operation emulates the `ret` instruction.
+
+```C
+	regs->ax = ret;
+
+do_ret:
+	regs->ip = caller;
+	regs->sp += 8;
+	return true;
+```		
+
+That's all. Now let's look at the more modern concept - `vDSO`.
+
+Introduction to vDSO
+--------------------------------------------------------------------------------
+
+As I already wrote above, `vsyscall` is an obsolete concept that has been replaced by the `vDSO` or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` memory pages are mapped into each process in a shared object [form](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), while `vsyscall` is static in memory and has the same address every time. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications are linked with this virtual library automatically. For example:
+
+```
+~$ ldd /bin/uname
+	linux-vdso.so.1 (0x00007ffe014b7000)
+	libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
+	/lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
+```
+
+Or:
+
+```
+~$ sudo cat /proc/1/maps | grep vdso
+7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0       [vdso]
+```
+
+Here we can see that the [uname](https://en.wikipedia.org/wiki/Uname) util is linked with three libraries:
+
+* `linux-vdso.so.1`;
+* `libc.so.6`;
+* `ld-linux-x86-64.so.2`.
+
+The first provides the `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (you can read more about this in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves the limitations of the `vsyscall` mechanism. The implementation of the `vDSO` is similar to that of `vsyscall`.
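+
+The practical effect of the `vDSO` is easy to observe. For example, a small test program like the following (a sketch, not from the book's sources) calls `gettimeofday`; if you run it under `strace`, you will typically not see a `gettimeofday` system call at all, because glibc dispatches the call to `__vdso_gettimeofday` in the mapped `[vdso]` page:
+
+```C
+#include <stdio.h>
+#include <sys/time.h>
+
+int main(void)
+{
+	struct timeval tv;
+
+	/* With a working vDSO, glibc usually calls __vdso_gettimeofday here
+	 * instead of entering the kernel with a real system call. */
+	gettimeofday(&tv, NULL);
+
+	printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
+	return 0;
+}
+```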
+
+Initialization of the `vDSO` occurs in the `init_vdso` function that is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the 64-bit `vDSO` image and, depending on the `CONFIG_X86_X32_ABI` kernel configuration option, of the `x32` one:
+
+```C
+static int __init init_vdso(void)
+{
+	init_vdso_image(&vdso_image_64);
+
+#ifdef CONFIG_X86_X32_ABI
+	init_vdso_image(&vdso_image_x32);
+#endif	
+```
+
+Both calls initialize a `vdso_image` structure. This structure is defined in two generated source code files: [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and `arch/x86/entry/vdso/vdso-image-x32.c`. These source code files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from different source code files and represent different ways to enter a system call, like `int 0x80`, `sysenter`, etc. The full set of images depends on the kernel configuration.
+
+For example for the `x86_64` Linux kernel it will contain `vdso_image_64`:
+
+```C
+#ifdef CONFIG_X86_64
+extern const struct vdso_image vdso_image_64;
+#endif
+```
+
+But when the `x32` ABI is enabled - `vdso_image_x32`:
+
+```C
+#ifdef CONFIG_X86_X32
+extern const struct vdso_image vdso_image_x32;
+#endif
+```
+
+If our kernel is configured for the `x86` architecture, or for `x86_64` with compatibility mode, we will be able to enter a system call with the `int 0x80` interrupt; if compatibility mode is enabled, a system call can additionally be entered with the native `syscall` instruction, and in either case with the `sysenter` instruction:
+
+```C
+#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
+  extern const struct vdso_image vdso_image_32_int80;
+#ifdef CONFIG_COMPAT
+  extern const struct vdso_image vdso_image_32_syscall;
+#endif
+ extern const struct vdso_image vdso_image_32_sysenter;
+#endif
+```
+
+As we can understand from the name of the `vdso_image` structure, it represents an image of the `vDSO` for a certain mode of system call entry. This structure contains information about the size in bytes of the `vDSO` area (which is always a multiple of `PAGE_SIZE`, or `4096` bytes), a pointer to the text mapping, the position and length of the `alternatives` (sets of instructions with better alternatives for certain types of processor), etc. For example `vdso_image_64` looks like this:
+
+```C
+const struct vdso_image vdso_image_64 = {
+	.data = raw_data,
+	.size = 8192,
+	.text_mapping = {
+		.name = "[vdso]",
+		.pages = pages,
+	},
+	.alt = 3145,
+	.alt_len = 26,
+	.sym_vvar_start = -8192,
+	.sym_vvar_page = -8192,
+	.sym_hpet_page = -4096,
+};
+```
+
+Here `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupy `2` pages:
+
+```C
+static struct page *pages[2];
+```
+
+or 8 Kilobytes.
+
+The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. First of all this function calculates the number of pages and initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts the given address to a `page` structure:
+
+```C
+void __init init_vdso_image(const struct vdso_image *image)
+{
+	int i;
+	int npages = (image->size) / PAGE_SIZE;
+
+	for (i = 0; i < npages; i++)
+		image->text_mapping.pages[i] =
+			virt_to_page(image->data + i*PAGE_SIZE);
+	...
+	...
+	...
+}
+```
+
+The `init_vdso` function is passed to the `subsys_initcall` macro, which adds it to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
+
+```C
+subsys_initcall(init_vdso);
+```
+
+Ok, we just saw the initialization of the `vDSO` and of the `page` structures that are related to the memory pages containing the `vDSO` system calls. But where are these pages mapped to? Actually they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:
+
+```C
+int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+{
+	if (!vdso64_enabled)
+		return 0;
+
+	return map_vdso(&vdso_image_64, true);
+}
+```
+
+The `map_vdso` function is defined in the same source code file and maps pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main difference between the `vsyscall` and the `vDSO` concepts is that `vsyscall` has a static address of `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` is loaded dynamically and implements four of them:
+
+* `__vdso_clock_gettime`;
+* `__vdso_getcpu`;
+* `__vdso_gettimeofday`;
+* `__vdso_time`.
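+
+From userspace, the address at which the kernel mapped the `vDSO` for the current process can be obtained from the auxiliary vector. A minimal sketch (the `AT_SYSINFO_EHDR` entry points to the ELF header of the `vDSO`):
+
+```C
+#include <elf.h>
+#include <stdio.h>
+#include <sys/auxv.h>
+
+int main(void)
+{
+	/* AT_SYSINFO_EHDR holds the base address of the vDSO mapping. */
+	unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
+
+	if (vdso)
+		printf("[vdso] is mapped at: 0x%lx\n", vdso);
+	else
+		printf("no vDSO mapping found\n");
+
+	return 0;
+}
+```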
+
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the third part about the system call concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the preparations that the Linux kernel makes before a system call is handled and the exit from a system call handler. In this part we continued to dive into the stuff related to the system call concept and learned about two concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.
+
+After all of these three parts, we know almost everything related to system calls: we know what a system call is and why user applications need them. We also know what occurs when a user application invokes a system call and how the kernel handles it.
+
+The next part will be the last one in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs a program.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [x86_64 memory map](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [context switching](https://en.wikipedia.org/wiki/Context_switch)
+* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
+* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
+* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
+* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
+* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
+* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
+* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
+* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
+* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
+* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
+* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
+* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
+* [uname](https://en.wikipedia.org/wiki/Uname)
+* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
+* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)

+ 430 - 0
SysCall/syscall-4.md

@@ -0,0 +1,430 @@
+System calls in the Linux kernel. Part 4.
+================================================================================
+
+How does the Linux kernel run a program
+--------------------------------------------------------------------------------
+
+This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and, as I wrote in the conclusion of the [previous](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) part, this part will be the last one in this chapter. In the previous part we stopped at two new concepts:
+
+* `vsyscall`;
+* `vDSO`;
+
+that are related and very similar to the system call concept.
+
+As you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.
+
+How do we launch our programs?
+--------------------------------------------------------------------------------
+
+There are many different ways to launch an application from a user's perspective. For example, we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter: the Linux kernel handles the launch of an application regardless of how we launch it.
+
+In this part we will consider the case where we launch an application from the shell. As you know, the standard way to launch an application from the shell is the following: we launch a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator) application, write the name of the program and optionally pass arguments to it, for example:
+
+![ls shell](http://s14.postimg.org/d6jgidc7l/Screenshot_from_2015_09_07_17_31_55.png)
+
+Let's consider what occurs when we launch an application from the shell: what the shell does when we write the program name, what the Linux kernel does, etc. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel, so we will mostly look at the Linux kernel internals in this part. We will not consider in detail what the shell does and we will not consider complex cases, for example subshells, etc.
+
+My default shell is [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29), so I will consider how the bash shell launches a program. So let's start. The `bash` shell, like any program written in the [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) programming language, starts from the [main](https://en.wikipedia.org/wiki/Entry_point) function. If you look at the source code of the `bash` shell, you will find the `main` function in the [shell.c](https://github.com/bminor/bash/blob/master/shell.c#L357) source code file. This function does many different things before the main thread loop of `bash` starts to work. For example this function:
+
+* checks and tries to open `/dev/tty`;
+* checks whether the shell is running in debug mode;
+* parses command line arguments;
+* reads shell environment;
+* loads `.bashrc`, `.profile` and other configuration files;
+* and many many more.
+
+After all of these operations we can see the call of the `reader_loop` function. This function is defined in the [eval.c](https://github.com/bminor/bash/blob/master/eval.c#L67) source code file and represents the main thread loop, or in other words it reads and executes commands. When the `reader_loop` function has made all checks and read the given program name and arguments, it calls the `execute_command` function from the [execute_cmd.c](https://github.com/bminor/bash/blob/master/execute_cmd.c#L378) source code file. The `execute_command` function, through the following chain of function calls:
+
+```
+execute_command
+--> execute_command_internal
+----> execute_simple_command
+------> execute_disk_command
+--------> shell_execve
+```
+
+performs different checks, for example whether we need to start a `subshell`, whether it is a builtin `bash` function or not, etc. As I already wrote above, we will not consider all the details of things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:
+
+```C
+execve (command, args, env);
+```
+
+The `execve` system call has the following signature:
+
+```
+int execve(const char *filename, char *const argv [], char *const envp[]);   
+```
+
+and executes a program with the given filename, arguments and [environment variables](https://en.wikipedia.org/wiki/Environment_variable). In our case this system call is the first one, for example:
+
+```
+$ strace ls
+execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0
+
+$ strace echo
+execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0
+
+$ strace uname
+execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
+```
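+
+Of course, a program can also invoke this system call itself, without any shell involved. A minimal sketch (with error handling reduced to the bare minimum) might look like this:
+
+```C
+#include <stdio.h>
+#include <unistd.h>
+
+int main(void)
+{
+	char *argv[] = { "ls", "-l", NULL };
+	char *envp[] = { "HOME=/", NULL };
+
+	/* On success execve() does not return: this process image
+	 * is replaced by /bin/ls. */
+	execve("/bin/ls", argv, envp);
+
+	perror("execve"); /* reached only if execve() failed */
+	return 1;
+}
+```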
+
+So, a user application (`bash` in our case) calls the system call and, as we already know, the next step is the Linux kernel.
+
+execve system call
+--------------------------------------------------------------------------------
+
+We saw the preparations before a system call invoked by a user application and what happens after a system call handler finishes its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, it takes three arguments:
+
+```
+SYSCALL_DEFINE3(execve,
+		const char __user *, filename,
+		const char __user *const __user *, argv,
+		const char __user *const __user *, envp)
+{
+	return do_execve(getname(filename), argv, envp);
+}
+```
+
+The implementation of `execve` is pretty simple here; as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:
+
+* Initializes two pointers to the userspace data with the given arguments and environment variables;
+* Returns the result of `do_execveat_common`.
+
+We can see its implementation:
+
+```C
+struct user_arg_ptr argv = { .ptr.native = __argv };
+struct user_arg_ptr envp = { .ptr.native = __envp };
+return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+```
+
+The `do_execveat_common` function does the main work - it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor of the directory containing our application; in our case it is `AT_FDCWD`, which means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is a set of flags; in our case we passed `0` to `do_execveat_common`. We will check these flags in a later step, so we will see them again.
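+
+These two extra arguments are not only an internal detail: the newer `execveat` system call exposes them to userspace directly. Its signature, as described in the man pages, looks like the following, so `execve(path, argv, envp)` is essentially `execveat(AT_FDCWD, path, argv, envp, 0)`:
+
+```C
+int execveat(int dirfd, const char *pathname,
+             char *const argv[], char *const envp[],
+             int flags);
+```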
+
+First of all the `do_execveat_common` function checks the `filename` pointer and returns an error if it is invalid. After this we check the flags of the current process to make sure that the limit of running processes is not exceeded:
+
+```C
+if (IS_ERR(filename))
+	return PTR_ERR(filename);
+
+if ((current->flags & PF_NPROC_EXCEEDED) &&
+	atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+	retval = -EAGAIN;
+	goto out_ret;
+}
+
+current->flags &= ~PF_NPROC_EXCEEDED;
+```
+
+If these two checks were successful, we clear the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent a failure of `execve`. In the next step we call the `unshare_files` function, which is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and unshares the files of the current task, and we check its result:
+
+```C
+retval = unshare_files(&displaced);
+if (retval)
+	goto out_ret;
+```
+
+We need to call this function to eliminate a potential leak of the execve'd binary's [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). In the next step we start the preparation of the `bprm`, which is represented by the `struct linux_binprm` structure (defined in the [include/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h) header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries. For example, it contains the `vma` field, which has the `vm_area_struct` type and represents a single memory area over a contiguous interval in a given address space where our application will be loaded, the `mm` field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.
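+
+An abridged sketch of this structure (only the fields mentioned in this part; the real definition contains more) looks something like this:
+
+```C
+struct linux_binprm {
+	char buf[BINPRM_BUF_SIZE];   /* first bytes of the binary, used for format detection */
+	struct vm_area_struct *vma;  /* temporary stack area for the new program */
+	struct mm_struct *mm;        /* memory descriptor of the new program */
+	unsigned long p;             /* current top of memory */
+	int argc, envc;              /* number of arguments and environment variables */
+	const char *filename;        /* name of the binary as seen by procps */
+	const char *interp;          /* name of the binary really executed */
+	struct file *file;           /* opened executable file */
+	struct cred *cred;           /* new credentials */
+	/* ... */
+};
+```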
+
+First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:
+
+```C
+bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+if (!bprm)
+	goto out_files;
+```
+
+After this we start to prepare the `binprm` credentials with a call of the `prepare_bprm_creds` function:
+
+```C
+retval = prepare_bprm_creds(bprm);
+if (retval)
+	goto out_free;
+
+check_unsafe_exec(bprm);
+current->in_execve = 1;
+```
+
+Initialization of the `binprm` credentials is, in other words, initialization of the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID) of the task, the real [gid](https://en.wikipedia.org/wiki/Group_identifier) of the task, the `uid` and `gid` for [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) operations, etc. In the next step, as the preparation of the `bprm` credentials is done, we check that we can safely execute a program with the call of the `check_unsafe_exec` function and set the current process to the `in_execve` state.
+
+After all of these operations we call the `do_open_execat` function, which checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in `flags`), searches for and opens the executable file on disk, checks that we are not loading a binary from a `noexec` mount point (we need to avoid executing binaries from filesystems that do not contain executable binaries, like [proc](https://en.wikipedia.org/wiki/Procfs) or [sysfs](https://en.wikipedia.org/wiki/Sysfs)), initializes the `file` structure and returns a pointer to this structure. After this we can see the call of `sched_exec`:
+
+```C
+file = do_open_execat(fd, filename, flags);
+retval = PTR_ERR(file);
+if (IS_ERR(file))
+	goto out_unmark;
+
+sched_exec();
+```
+
+The `sched_exec` function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.
+
+After this we need to look at the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the given executable binary. We check whether the name of our binary file starts with the `/` symbol, or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, in other words whether the file descriptor is `AT_FDCWD` (read above about this).
+
+If either of these checks succeeds, we set the binary parameter filename:
+
+```C
+bprm->file = file;
+
+if (fd == AT_FDCWD || filename->name[0] == '/') {
+	bprm->filename = filename->name;
+}
+```
+
+Otherwise, we set the binary parameter filename to `/dev/fd/%d` or `/dev/fd/%d/%s`, depending on whether the filename of the given executable binary is empty or not. This means that we will execute the file to which the file descriptor refers:
+
+```C
+} else {
+	if (filename->name[0] == '\0')
+		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
+	else
+		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
+		                    fd, filename->name);
+	if (!pathbuf) {
+		retval = -ENOMEM;
+		goto out_unmark;
+	}
+
+	bprm->filename = pathbuf;
+}
+
+bprm->interp = bprm->filename;
+```
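+
+Roughly speaking, this `/dev/fd/%d` case is what happens when userspace executes a program by file descriptor instead of by path, for example via the glibc `fexecve` function (which on modern kernels is implemented on top of `execveat` with an empty pathname). A small sketch:
+
+```C
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <stdio.h>
+#include <unistd.h>
+
+int main(void)
+{
+	char *argv[] = { "ls", NULL };
+	char *envp[] = { NULL };
+
+	int fd = open("/bin/ls", O_RDONLY | O_CLOEXEC);
+	if (fd < 0) {
+		perror("open");
+		return 1;
+	}
+
+	/* Execute the file the descriptor refers to; no pathname is passed. */
+	fexecve(fd, argv, envp);
+
+	perror("fexecve"); /* reached only if fexecve() failed */
+	return 1;
+}
+```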
+
+Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there; later it will be updated with the real name of the program interpreter, depending on the binary format of the program. You could read above that we have already prepared the `cred` for the `linux_binprm`. The next step is the initialization of the other fields of the `linux_binprm`. First of all we call the `bprm_mm_init` function and pass the `bprm` to it:
+
+```C
+retval = bprm_mm_init(bprm);
+if (retval)
+	goto out_unmark;
+```
+
+The `bprm_mm_init` function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor, or in other words it initializes the `mm_struct` structure. This structure is defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function because we do not yet know many important things related to the Linux kernel memory manager; we just need to know that this function initializes the `mm_struct` and populates it with a temporary stack `vm_area_struct`.
+
+After this we calculate the number of command line arguments that were passed to our executable binary and the number of environment variables, and set them in `bprm->argc` and `bprm->envc` respectively:
+
+```C
+bprm->argc = count(argv, MAX_ARG_STRINGS);
+if ((retval = bprm->argc) < 0)
+	goto out;
+
+bprm->envc = count(envp, MAX_ARG_STRINGS);
+if ((retval = bprm->envc) < 0)
+	goto out;
+```
+
+As you can see, we do these operations with the help of the `count` function, which is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and calculates the number of strings in the `argv` array. The `MAX_ARG_STRINGS` macro is defined in the [include/uapi/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h) header file and, as we can understand from the macro's name, it represents the maximum number of strings that may be passed to the `execve` system call. The value of `MAX_ARG_STRINGS` is:
+
+```C
+#define MAX_ARG_STRINGS 0x7FFFFFFF
+```
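+
+A simplified sketch of the `count` function (ignoring the compat case and signal handling) might look like the following: it just walks the userspace array of pointers until the terminating `NULL` and returns `-E2BIG` if the `max` limit is exceeded:
+
+```C
+static int count(struct user_arg_ptr argv, int max)
+{
+	int i = 0;
+
+	if (argv.ptr.native != NULL) {
+		for (;;) {
+			const char __user *p = get_user_arg_ptr(argv, i);
+
+			if (!p)
+				break;          /* NULL terminator - stop counting */
+			if (IS_ERR(p))
+				return -EFAULT; /* could not read the pointer from userspace */
+			if (i >= max)
+				return -E2BIG;  /* too many strings */
+			++i;
+		}
+	}
+
+	return i;
+}
+```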
+
+After we have calculated the number of command line arguments and environment variables, we call the `prepare_binprm` function. We already called a function with a similar name before this moment - `prepare_bprm_creds` - and we remember that that function initializes the `cred` structure in the `linux_binprm`. Now the `prepare_binprm` function:
+
+```C
+retval = prepare_binprm(bprm);
+if (retval < 0)
+	goto out;
+```
+
+fills the `linux_binprm` structure with the `uid` from the [inode](https://en.wikipedia.org/wiki/Inode) and reads the first `128` bytes of the binary executable file. We read only the first `128` bytes because we need to check the type of our executable; we will read the rest of the executable file in a later step. After the preparation of the `linux_binprm` structure we copy the filename of the executable binary file, the command line arguments and the environment variables to the `linux_binprm` with the calls of the `copy_strings_kernel` and `copy_strings` functions:
+
+```C
+retval = copy_strings_kernel(1, &bprm->filename, bprm);
+if (retval < 0)
+	goto out;
+
+retval = copy_strings(bprm->envc, envp, bprm);
+if (retval < 0)
+	goto out;
+
+retval = copy_strings(bprm->argc, argv, bprm);
+if (retval < 0)
+	goto out;
+```
+
+And we save the pointer to the top of the new program's stack that we set up in the `bprm_mm_init` function:
+
+```C
+bprm->exec = bprm->p;
+```
+
+The top of the stack will contain the program filename, and we store this position in the `exec` field of the `linux_binprm` structure.
+
+Now that we have a filled `linux_binprm` structure, we call the `exec_binprm` function:
+
+```C
+retval = exec_binprm(bprm);
+if (retval < 0)
+	goto out;
+```
+
+First of all, in `exec_binprm` we store the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the current task and the `pid` as seen from the [namespace](https://en.wikipedia.org/wiki/Cgroups) of the current task's parent:
+
+```C
+old_pid = current->pid;
+rcu_read_lock();
+old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
+rcu_read_unlock();
+```
+
+and call the:
+
+```C
+search_binary_handler(bprm);
+```
+
+function. This function goes through the list of handlers for the different binary formats. Currently the Linux kernel supports the following binary formats:
+
+* `binfmt_script` - support for interpreted scripts that start with a [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
+* `binfmt_misc` - support for different binary formats, according to the runtime configuration of the Linux kernel;
+* `binfmt_elf` - support for the [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
+* `binfmt_aout` - support for the [a.out](https://en.wikipedia.org/wiki/A.out) format;
+* `binfmt_flat` - support for the [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
+* `binfmt_elf_fdpic` - support for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) binaries;
+* `binfmt_em86` - support for Intel [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries running on [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) machines.
+
+So, the `search_binary_handler` function tries to call the `load_binary` function of each handler, passing the `linux_binprm` to it. If a binary handler supports the given executable file format, it starts to prepare the executable binary for execution:
+
+```C
+int search_binary_handler(struct linux_binprm *bprm)
+{
+	...
+	...
+	...
+	list_for_each_entry(fmt, &formats, lh) {
+		retval = fmt->load_binary(bprm);
+		if (retval < 0 && !bprm->mm) {
+			force_sigsegv(SIGSEGV, current);
+			return retval;
+		}
+	}
+	
+	return retval;	
+```
+
+The `load_binary` function, for example for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries, checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_binprm` buffer (remember that we read the first `128` bytes of the executable binary file) and exits if it is not an `elf` binary:
+
+```C
+static int load_elf_binary(struct linux_binprm *bprm)
+{
+	...
+	...
+	...
+	loc->elf_ex = *((struct elfhdr *)bprm->buf);
+
+	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
+		goto out;
+```
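+
+The magic number that is compared here is simply the four bytes `0x7f`, `'E'`, `'L'`, `'F'` at the very beginning of the file. A small userspace sketch that performs the same check, using the `ELFMAG` and `SELFMAG` constants from `<elf.h>`:
+
+```C
+#include <elf.h>
+#include <stdio.h>
+#include <string.h>
+
+int main(int argc, char **argv)
+{
+	unsigned char ident[EI_NIDENT];
+	FILE *f;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: %s <file>\n", argv[0]);
+		return 1;
+	}
+
+	f = fopen(argv[1], "rb");
+	if (!f || fread(ident, 1, sizeof(ident), f) != sizeof(ident)) {
+		perror(argv[1]);
+		return 1;
+	}
+
+	/* The same comparison load_elf_binary does on bprm->buf. */
+	if (memcmp(ident, ELFMAG, SELFMAG) == 0)
+		printf("%s is an ELF file\n", argv[1]);
+	else
+		printf("%s is not an ELF file\n", argv[1]);
+
+	fclose(f);
+	return 0;
+}
+```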
+
+If the given executable file is in the `elf` format, `load_elf_binary` continues to execute. The `load_elf_binary` function does many different things to prepare the executable file for execution. For example, it checks the architecture and the type of the executable file:
+
+```C
+if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
+	goto out;
+if (!elf_check_arch(&loc->elf_ex))
+	goto out;
+```
+
+and exits if the architecture is wrong or if the executable file is neither an executable nor a shared object. Then it tries to load the `program header table`:
+
+```C
+elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
+if (!elf_phdata)
+	goto out;
+```
+
+which describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many, many other things to prepare the executable file for execution.
+
+At the end of the execution of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:
+
+```C
+	start_thread(regs, elf_entry, bprm->p);
+	retval = 0;
+out:
+	kfree(loc);
+out_ret:
+	return retval;
+```
+
+These arguments are:
+
+* Set of [registers](https://en.wikipedia.org/wiki/Processor_register) for the new task;
+* Address of the entry point of the new task;
+* Address of the top of the stack for the new task.
+
+As we might guess from the function's name, it starts a new thread, but that is not really the case. The `start_thread` function just prepares the new task's registers so that it is ready to run. Let's look at the implementation of this function:
+
+```C
+void
+start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
+{
+        start_thread_common(regs, new_ip, new_sp,
+                            __USER_CS, __USER_DS, 0);
+}
+```
+
+As we can see, the `start_thread` function just calls the `start_thread_common` function, which will do everything for us:
+
+```C
+static void
+start_thread_common(struct pt_regs *regs, unsigned long new_ip,
+                    unsigned long new_sp,
+                    unsigned int _cs, unsigned int _ss, unsigned int _ds)
+{
+        loadsegment(fs, 0);
+        loadsegment(es, _ds);
+        loadsegment(ds, _ds);
+        load_gs_index(0);
+        regs->ip                = new_ip;
+        regs->sp                = new_sp;
+        regs->cs                = _cs;
+        regs->ss                = _ss;
+        regs->flags             = X86_EFLAGS_IF;
+        force_iret();
+}
+```
+
+The `start_thread_common` function fills the `fs` segment register with zero and `es` and `ds` with the value of the data segment register. After this we set new values for the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter), the `cs` segment, etc. At the end of the `start_thread_common` function we can see the `force_iret` macro, which forces a system call return via the `iret` instruction. Ok, we have prepared the new thread to run in userspace and now we can return from `exec_binprm`, which puts us back in `do_execveat_common`. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before and return.
+
+After we return from the `execve` system call handler, execution of our program starts. We can do this because all of the context-related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the calling process; instead, the code, data and other segments of the caller process are simply overwritten by the segments of the new program. The exit from our application will happen through the `exit` system call.
+
+That's all. From this point our program will be executed.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the fourth and last part about the system call concept in the Linux kernel. We covered almost everything related to the `system call` concept in these four parts. We started from understanding the `system call` concept: we learned what it is and why user applications need it. Then we saw how Linux handles a system call from a user application. We met two concepts that are similar to the `system call`, the `vsyscall` and the `vDSO`, and finally we saw how the Linux kernel runs a user program.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [System call](https://en.wikipedia.org/wiki/System_call)
+* [shell](https://en.wikipedia.org/wiki/Unix_shell)
+* [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29)
+* [entry point](https://en.wikipedia.org/wiki/Entry_point) 
+* [C](https://en.wikipedia.org/wiki/C_%28programming_language%29)
+* [environment variables](https://en.wikipedia.org/wiki/Environment_variable)
+* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
+* [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID)
+* [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
+* [procfs](https://en.wikipedia.org/wiki/Procfs)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [inode](https://en.wikipedia.org/wiki/Inode)
+* [pid](https://en.wikipedia.org/wiki/Process_identifier)
+* [namespace](https://en.wikipedia.org/wiki/Cgroups)
+* [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29)
+* [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
+* [a.out](https://en.wikipedia.org/wiki/A.out)
+* [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) 
+* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
+* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
+* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
+* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
+* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
+* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
+* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)

+ 33 - 30
Theory/ELF.md

@@ -1,32 +1,32 @@
 Executable and Linkable Format
 ================================================================================
 
-ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries, and core dumps. Linux, as well as, many other UNIX-like operating systems uses this format. Let's look on the structure of ELF-64 File Format and some defintions in the linux kernel source code related with it.
+ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries and core dumps. Linux and many UNIX-like operating systems use this format. Let's look at the structure of the ELF-64 Object File Format and some definitions in the linux kernel source code related to it.
 
-An ELF file consists of the following parts:
+An ELF object file consists of the following parts:
 
-* ELF header - describes the main characteristics of the object file: type, CPU architecture, virtual address of the entry point, size and offset of the remaining parts, etc...;
-* Program header table - lists the available segments and their attributes. Program header table needs loaders for placing sections of this file as virtual memory segments;
-* Section header table - contains the description of sections.
+* ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset of the remaining parts, etc...;
+* Program header table - lists the available segments and their attributes. Loaders need the program header table to place sections of the file into virtual memory segments;
+* Section header table - contains the description of the sections.
 
-Now let's look closer on these components.
+Now let's have a closer look at these components.
 
 **ELF header**
 
-It's located in the beginning of the object file. Its main point is to locate all other parts of the object file. ELF header contains following fields:
+The ELF header is located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The ELF header contains the following fields:
 
-* ELF identification - array of bytes which helps identify this file as an ELF file and also provides information about general object file characteristics;
-* Object file type - identifies the object file type. This field can describe whether this file is a relocatable file or executable file, etc...;
+* ELF identification - array of bytes which helps identify the file as an ELF object file and also provides information about general object file characteristics;
+* Object file type - identifies the object file type. This field can describe whether the ELF file is a relocatable object file, an executable file, etc...;
 * Target architecture;
 * Version of the object file format;
 * Virtual address of the program entry point;
 * File offset of the program header table;
 * File offset of the section header table;
-* Size of the ELF header;
-* Size of the program header table entry;
+* Size of an ELF header;
+* Size of a program header table entry;
 * and other fields...
 
-You can find `elf64_hdr` structure which presents ELF64 header in the linux kernel source code:
+You can find the `elf64_hdr` structure which represents the ELF64 header in the linux kernel source code:
 
 ```C
 typedef struct elf64_hdr {
@@ -47,11 +47,11 @@ typedef struct elf64_hdr {
 } Elf64_Ehdr;
 ```
 
-This structure defines in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h)
+This structure is defined in [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L220).
 
 **Sections**
 
-All data is stored in sections in an Elf file. Sections are identified by index in the section header table. Section header contains following fields:
+All data is stored in sections in an ELF object file. Sections are identified by an index in the section header table. The section header contains the following fields:
 
 * Section name;
 * Section type;
@@ -64,7 +64,7 @@ All data is stored in sections in an Elf file. Sections are identified by index
 * Address alignment boundary;
 * Size of entries, if section has table;
 
-And presented with the following `elf64_shdr` structure in the linux kernel source code:
+It is represented by the following `elf64_shdr` structure in the linux kernel:
 
 ```C
 typedef struct elf64_shdr {
@@ -81,9 +81,11 @@ typedef struct elf64_shdr {
 } Elf64_Shdr;
 ```
 
+[elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L312)
+
 **Program header table**
 
-All sections are grouped into segments in an executable file or shared library. Program header table is an array of structures which describe every segment. It looks like:
+All sections are grouped into segments in an executable or shared object file. The program header table is an array of structures which describe every segment. It looks like this:
 
 ```C
 typedef struct elf64_phdr {
@@ -98,14 +100,16 @@ typedef struct elf64_phdr {
 } Elf64_Phdr;
 ```
 
-`elf64_phdr` structure defines in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h).
+in the linux kernel source code.
+
+The `elf64_phdr` structure is defined in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L254).
 
-And ELF file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's look on the `vmlinux`.
+The ELF object file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's take a look at the `vmlinux` ELF object.
 
 vmlinux
 --------------------------------------------------------------------------------
 
-`vmlinux` is an ELF file too. So we can look at it with the `readelf` util. First of all, let's look on the elf header of vmlinux:
+`vmlinux` is also an ELF object file. We can take a look at it with the `readelf` util. First of all let's look at its header:
 
 ```
 $ readelf -h  vmlinux
@@ -131,15 +135,15 @@ ELF Header:
   Section header string table index: 70
 ```
 
-Here we can see that `vmlinux` is 64-bit executable file.
+Here we can see that `vmlinux` is a 64-bit executable file.
 
-We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
+We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt#L21):
 
 ```
 ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
 ```
 
-So we can find it in the `vmlinux` with:
+We can then look this address up in the `vmlinux` ELF object with:
 
 ```
 $ readelf -s vmlinux | grep ffffffff81000000
@@ -148,9 +152,9 @@ $ readelf -s vmlinux | grep ffffffff81000000
  90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64
 ```
 
-Note that ,the address of `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000`. Now I'll explain why.
+Note that the address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000` and now I'll explain why.
 
-We can see the following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
+We can see the following definition in [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
 
 ```
     . = __START_KERNEL;
@@ -172,13 +176,12 @@ Where `__START_KERNEL` is:
 #define __START_KERNEL		(__START_KERNEL_map + __PHYSICAL_START)
 ```
 
-`__START_KERNEL_map` is the value from documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`.
+`__START_KERNEL_map` is the value from the documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`.
 
-At last we can get program headers from `vmlinux` with the following command:
+Finally, we can get the program headers from `vmlinux` with the following command:
 
 ```
-
-$ readelf -l vmlinux
+$ readelf -l vmlinux
 
 Elf file type is EXEC (Executable file)
 Entry point 0x1000000
@@ -210,6 +213,6 @@ Program Headers:
 		  .smp_locks .data_nosave .bss .brk
 ```
 
-Here we can see five segments with sections list. All of these sections you can find in the generated linker script at - `arch/x86/kernel/vmlinux.lds`.
+Here we can see five segments with their section lists. You can find all of these sections in the generated linker script at `arch/x86/kernel/vmlinux.lds`.
 
-That's all. Of course it's not a full description of ELF(Executable and	Linkable Format), but if you are interested in it, you can find documentation - [here](http://www.uclibc.org/docs/elf-64-gen.pdf)
+That's all. Of course it's not a full description of ELF (Executable and Linkable Format), but if you want to know more, you can find the documentation [here](http://www.uclibc.org/docs/elf-64-gen.pdf).

+ 39 - 40
Theory/Paging.md

@@ -4,19 +4,19 @@ Paging
 Introduction
 --------------------------------------------------------------------------------
 
-In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many others things, before we can see how the kernel runs the first init process.
+In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
 
 Yeah, there will be many different things, but many many and once again many work with **memory**.
 
-In my view, memory management is one of the most complex part of the linux kernel and in system programming in general. This is why before we proceed with the kernel initialization stuff, we need to get acquainted with paging.
+In my view, memory management is one of the most complex parts of the Linux kernel and of system programming in general. This is why we need to get acquainted with paging before we proceed with the kernel initialization stuff.
 
-`Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode when physical addresses are calculated by shifting a segment register by four and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now that we are in 64-bit mode, will see paging.
+`Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode when physical addresses are calculated by shifting a segment register by four and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now we will see paging in 64-bit mode.
 
 As the Intel manual says:
 
 > Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed.
 
-So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the linux kernel for, but we will not go into too much details (at least in this post).
+So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the Linux kernel, but we will not go into too much detail (at least in this post).
 
 Enabling paging
 --------------------------------------------------------------------------------
@@ -33,7 +33,7 @@ We will only explain the last mode here. To enable the `IA-32e paging` paging mo
 * set the `CR4.PAE` bit;
 * set the `IA32_EFER.LME` bit.
 
-We already saw where those this bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
+We already saw where those bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
 
 ```assembly
 movl	$(X86_CR0_PG | X86_CR0_PE), %eax
@@ -52,14 +52,14 @@ wrmsr
 Paging structures
 --------------------------------------------------------------------------------
 
-Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is `4096` bytes for the `x86_64` linux kernel. To perform the linear address translation to a physical address special structures are used. Every structure is `4096` bytes size and contains `512` entries (this only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and the linux kernel uses 4 level of paging in the `x86_64` architecture. The CPU uses a part of the linear address to identify the entry in another paging structure which is at the lower level or physical memory region (`page frame`) or physical address in this region (`page offset`). The address of the top level paging structure located in the `cr3` register. We already saw this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
+Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or external storage. This fixed size is `4096` bytes for the `x86_64` Linux kernel. To perform the translation from linear address to physical address, special structures are used. Every structure is `4096` bytes and contains `512` entries (this is only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and the Linux kernel uses 4 levels of paging in the `x86_64` architecture. The CPU uses a part of a linear address to identify the entry in another paging structure which is at the lower level, a physical memory region (`page frame`) or a physical address in this region (`page offset`). The address of the top level paging structure is located in the `cr3` register. We have already seen this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
 
 ```assembly
 leal	pgtable(%ebx), %eax
 movl	%eax, %cr3
 ```
 
-We built the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the linux kernel. `cr3` is 64-bit register and has the following structure:
+We built the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the Linux kernel. `cr3` is a 64-bit register and has the following structure:
 
 ```
 63                  52 51                                                        32
@@ -78,24 +78,24 @@ We built the page table structures and put the address of the top-level structur
 
 These fields have the following meanings:
 
-* Bits 2:0 - ignored;
-* Bits 51:12 - stores the address of the top level paging structure;
-* Bit 3 and 4 - PWT or Page-Level Writethrough and PCD or Page-level cache disable indicate. These bits control the way the page or Page Table is handled by the hardware cache; 
-* Reserved - reserved must be 0;
 * Bits 63:52 - reserved must be 0.
+* Bits 51:12 - stores the address of the top level paging structure;
+* Reserved   - reserved must be 0;
+* Bits 4 : 3 - PWT or Page-Level Writethrough and PCD or Page-level Cache Disable. These bits control the way the page or Page Table is handled by the hardware cache;
+* Bits 2 : 0 - ignored;
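+
+To make these fields concrete, here is a small userspace sketch (the `cr3` value is made up and this is not kernel code) that pulls the fields described above out of a raw `cr3` value:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+int main(void)
+{
+        uint64_t cr3  = 0x000000000a3c4018ULL;        /* hypothetical cr3 value */
+        uint64_t pml4 = cr3 & 0x000ffffffffff000ULL;  /* bits 51:12 - address of the top level structure */
+        unsigned pwt  = (cr3 >> 3) & 1;               /* bit 3 - Page-Level Writethrough */
+        unsigned pcd  = (cr3 >> 4) & 1;               /* bit 4 - Page-level Cache Disable */
+
+        printf("PML4 at %#llx, PWT=%u, PCD=%u\n",
+               (unsigned long long)pml4, pwt, pcd);
+        return 0;
+}
+```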
 
-The linear address translation address is following:
+The linear address translation proceeds as follows:
 
-* A given linear address arrives to the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of memory bus. 
-* 64-bit linear address splits on some parts. Only low 48 bits are significant, it means that `2^48` or 256 TBytes of linear-address space may be accessed at any given time.
+* A given linear address arrives at the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of the memory bus.
+* The 64-bit linear address is split into several parts. Only the low 48 bits are significant, which means that `2^48` or 256 TBytes of linear-address space may be accessed at any given time.
 * The `cr3` register stores the address of the top level (level-4) paging structure.
-* `47:39` bits of the given linear address stores an index into the paging structure level-4, `38:30` bits stores index into the paging structure level-3, `29:21` bits stores an index into the paging structure level-2, `20:12` bits stores an index into the paging structure level-1 and `11:0` bits provide the byte offset into the physical page.
+* `47:39` bits of the given linear address store an index into the paging structure level-4, `38:30` bits store an index into the paging structure level-3, `29:21` bits store an index into the paging structure level-2, `20:12` bits store an index into the paging structure level-1 and `11:0` bits provide the offset into the physical page in bytes.
 
 Schematically, we can imagine it like this:
 
 ![4-level paging](http://oi58.tinypic.com/207mb0x.jpg)
 
-Every access to a linear address is either a supervisor-mode access or a user-mode access. This access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level otherwise, otherwise it is a user mode access level. For example, the top level page table entry contains access bits and has the following structure:
+Every access to a linear address is either a supervisor-mode access or a user-mode access. This access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level, otherwise it is a user mode access level. For example, the top level page table entry contains access bits and has the following structure:
 
 ```
 63  62                  52 51                                                    32
@@ -114,31 +114,31 @@ Every access to a linear address is either a supervisor-mode access or a user-mo
 
 Where:
 
-* 63 bit - N/X bit (No Execute Bit) - presents ability to execute the code from physical pages mapped by the table entry;
+* 63 bit - N/X bit (No Execute Bit) which controls the ability to execute code from the physical pages mapped by the table entry;
 * 62:52 bits - ignored by CPU, used by system software;
 * 51:12 bits - stores physical address of the lower level paging structure;
-* 12:9  bits - ignored by CPU;
+* 11: 9 bits - ignored by CPU;
 * MBZ - must be zero bits;
 * Ignored bits;
 * A - accessed bit indicates whether the physical page or page structure was accessed;
 * PWT and PCD used for cache;
-* U/S - user/supervisor bit controls user access to the all physical pages mapped by this table entry;
-* R/W - read/write bit controls read/write access to the all physical pages mapped by this table entry;
+* U/S - user/supervisor bit controls user access to all the physical pages mapped by this table entry;
+* R/W - read/write bit controls read/write access to all the physical pages mapped by this table entry;
 * P - present bit. This bit indicates whether the page table or physical page is loaded into primary memory or not.
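+
+As a rough illustration (again with a made-up entry value, not kernel code), the access related bits of such an entry could be extracted like this:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+int main(void)
+{
+        uint64_t entry = 0x000000000b53d067ULL;          /* hypothetical table entry */
+        uint64_t next  = entry & 0x000ffffffffff000ULL;  /* bits 51:12 - lower level paging structure */
+        unsigned nx = (entry >> 63) & 1;                 /* N/X bit */
+        unsigned us = (entry >> 2) & 1;                  /* U/S bit */
+        unsigned rw = (entry >> 1) & 1;                  /* R/W bit */
+        unsigned p  = entry & 1;                         /* present bit */
+
+        printf("next=%#llx NX=%u U/S=%u R/W=%u P=%u\n",
+               (unsigned long long)next, nx, us, rw, p);
+        return 0;
+}
+```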
 
-Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the linux kernel.
+Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the Linux kernel.
 
-Paging structures in the linux kernel
+Paging structures in the Linux kernel
 --------------------------------------------------------------------------------
 
-As we've seen, the linux kernel in `x86_64` uses 4-level page tables. Their names are:
+As we've seen, the Linux kernel in `x86_64` uses 4-level page tables. Their names are:
 
 * Page Global Directory
 * Page Upper  Directory
 * Page Middle Directory
 * Page Table Entry
 
-After you've compiled and installed the linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example:
+After you've compiled and installed the Linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example:
 
 ```
 $ grep "start_kernel" System.map
@@ -146,14 +146,14 @@ ffffffff81efe497 T x86_64_start_kernel
 ffffffff81efeaa2 T start_kernel
 ```
 
-We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` size, but it's too large, that's why a smaller address space is used, only 48-bits wide. So we have a situation where the physical address space is limited to 48 bits, but addressing still performed with 64 bit pointers. How is this problem solved? Look at this diagram:
+We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` wide, but it's too large, that's why a smaller address space is used, only 48-bits wide. So we have a situation where the address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram:
 
 ```
 0xffffffffffffffff  +-----------+
                     |           |
                     |           | Kernelspace
                     |           |
- 0xffff800000000000 +-----------+
+0xffff800000000000  +-----------+
                     |           |
                     |           |
                     |   hole    |
@@ -166,7 +166,7 @@ We can see `0xffffffff81efe497` here. I doubt you really have that much RAM inst
 0x0000000000000000  +-----------+
 ```
 
-This solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` can be either only zeroes or only ones. Note that the virtual address space is split in 2 parts:
+This solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` can be either only zeroes or only ones. Note that the virtual address space is split into 2 parts:
 
 * Kernel space
 * Userspace
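+
+As a small illustration of this sign extension rule, here is a hedged userspace sketch (the helper name is just for illustration) that checks whether a 64-bit address is canonical, i.e. whether bits `63:48` are copies of bit `47`:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+/* an address is canonical when bits 63:47 are either all zeroes or all ones */
+static int is_canonical(uint64_t addr)
+{
+        uint64_t upper = addr >> 47;
+        return upper == 0 || upper == 0x1ffff;
+}
+
+int main(void)
+{
+        printf("%d\n", is_canonical(0xffffffff81000000ULL)); /* 1 - kernel space  */
+        printf("%d\n", is_canonical(0x00007fffffffe000ULL)); /* 1 - userspace     */
+        printf("%d\n", is_canonical(0x0000800000000000ULL)); /* 0 - non-canonical */
+        return 0;
+}
+```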
@@ -201,13 +201,13 @@ We can see here the memory map for user space, kernel space and the non-canonica
 
 Previously this guard hole and `__PAGE_OFFSET` were from `0xffff800000000000` to `0xffff80ffffffffff` to prevent access to the non-canonical area, but they were later extended by 3 bits for the hypervisor.
 
-Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of the all physical memory. After the memory space which maps all physical addresses, the guard hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about it in other parts of this book) and the start of the kernel text mapping from the physical address - `0`. We can find the definition of this address in the same file as the `__PAGE_OFFSET`:
+Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of all the physical memory. After the memory space which maps all the physical addresses comes the guard hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about it in other parts of this book) and the start of the kernel text mapping from the physical address - `0`. We can find the definition of this address in the same file as the `__PAGE_OFFSET`:
 
 ```C
 #define __START_KERNEL_map      _AC(0xffffffff80000000, UL)
 ```
 
-Usually kernel's `.text` start here with the `CONFIG_PHYSICAL_START` offset. We saw it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md):
+Usually kernel's `.text` starts here with the `CONFIG_PHYSICAL_START` offset. We have seen it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md):
 
 ```
 readelf -s vmlinux | grep ffffffff81000000
@@ -216,11 +216,11 @@ readelf -s vmlinux | grep ffffffff81000000
  90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64
 ```
 
-Here i checked `vmlinux` with the `CONFIG_PHYSICAL_START` is `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and offset - `0x1000000`, the resulted virtual address will be `0xffffffff80000000 + 1000000 = 0xffffffff81000000`.
+Here I checked a `vmlinux` built with `CONFIG_PHYSICAL_START` set to `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and the offset - `0x1000000`; the resulting virtual address will be `0xffffffff80000000 + 0x1000000 = 0xffffffff81000000`.
 
-After the kernel `.text` region there is the virtual memory region for kernel modules, `vsyscalls` and an unused hole of 2 megabytes.
+After the kernel `.text` region there is the virtual memory region for kernel modules, `vsyscalls` and an unused hole of 2 megabytes.
 
-We've seen how the kernel's virtual memory map is laid out and how a virtual address is translated into a physical one. Let's take for example following address:
+We've seen how the virtual memory map in the kernel is laid out and how a virtual address is translated into a physical one. Let's take the following address as an example:
 
 ```
 0xffffffff81000000
@@ -236,21 +236,20 @@ In binary it will be:
 This virtual address is split into parts as described above:
 
 * `63:48` - bits not used;
-* `47:39` - bits of the given linear address stores an index into the paging structure level-4; 
-* `38:30` - bits stores index into the paging structure level-3;
-* `29:21` - bits stores an index into the paging structure level-2;
-* `20:12` - bits stores an index into the paging structure level-1;
-* `11:0`  - bits provide the byte offset into the physical page.
+* `47:39` - bits store an index into the paging structure level-4;
+* `38:30` - bits store an index into the paging structure level-3;
+* `29:21` - bits store an index into the paging structure level-2;
+* `20:12` - bits store an index into the paging structure level-1;
+* `11:0`  - bits provide the offset into the physical page in bytes.
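+
+To make this split concrete, here is a small userspace sketch (the variable names are just for illustration) that extracts these indices from the address above:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+int main(void)
+{
+        uint64_t addr = 0xffffffff81000000ULL;
+
+        /* each paging level is indexed by 9 bits, a page is 4096 bytes */
+        unsigned level4 = (addr >> 39) & 0x1ff;  /* bits 47:39 */
+        unsigned level3 = (addr >> 30) & 0x1ff;  /* bits 38:30 */
+        unsigned level2 = (addr >> 21) & 0x1ff;  /* bits 29:21 */
+        unsigned level1 = (addr >> 12) & 0x1ff;  /* bits 20:12 */
+        unsigned offset = addr & 0xfff;          /* bits 11:0  */
+
+        printf("level4=%u level3=%u level2=%u level1=%u offset=%u\n",
+               level4, level3, level2, level1, offset);
+        return 0;
+}
+```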
 
 That is all. Now you know a little about the theory of `paging` and we can move on to the kernel source code and see the first initialization steps.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the linux kernel builds paging structures and works with them.
-
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the Linux kernel builds paging structures and works with them.
 
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 1 - 0
Theory/README.md

@@ -4,3 +4,4 @@ This chapter describes various theoretical concepts and concepts which are not d
 
 * [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
 * [Elf64 format](http://0xax.gitbooks.io/linux-insides/content/Theory/ELF.html)
+* [Inline assembly](http://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)

+ 441 - 0
Theory/asm.md

@@ -0,0 +1,441 @@
+Inline assembly
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+While reading source code in the [Linux kernel](https://github.com/torvalds/linux), I often see statements like this:
+
+```C
+__asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK));
+```
+
+Yes, this is [inline assembly](https://en.wikipedia.org/wiki/Inline_assembler) or in other words assembler code which is integrated in a high level programming language. In this case the high level programming language is [C](https://en.wikipedia.org/wiki/C_%28programming_language%29). Yes, the `C` programming language is not very high-level, but still.
+
+If you are familiar with the [assembly](https://en.wikipedia.org/wiki/Assembly_language) programming language, you may notice that `inline assembly` is not very different from normal assembler. Moreover, the special form of inline assembly which is called `basic form` is exactly the same. For example:
+
+```C
+__asm__("movq %rax, %rsp");
+```
+
+or
+
+```C
+__asm__("hlt");
+```
+
+You might see the same code (of course without the `__asm__` prefix) in plain assembly code. Yes, this is very similar, but not as simple as it might seem at first glance. Actually, [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) supports two forms of inline assembly statements:
+
+* `basic`;
+* `extended`.
+
+The basic form consists of only two things: the `__asm__` keyword and the string with valid assembler instructions. For example it may look something like this:
+
+```C
+__asm__("movq    $3, %rax\t\n"
+        "movq    %rsi, %rdi");
+```
+
+The `asm` keyword may be used in place of `__asm__`, however `__asm__` is portable whereas the `asm` keyword is a `GNU` [extension](https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html). In further examples I will only use the `__asm__` variant.
+
+If you know assembly programming language this looks pretty familiar. The main problem is in the second form of inline assembly statements - `extended`. This form allows us to pass parameters to an assembly statement, perform [jumps](https://en.wikipedia.org/wiki/Branch_%28computer_science%29) etc. Does not sound difficult, but requires knowledge of special rules in addition to knowledge of the assembly language. Every time I see yet another piece of inline assembly code in the Linux kernel, I need to refer to the official [documentation](https://gcc.gnu.org/onlinedocs/) of `GCC` to remember how a particular `qualifier` behaves or what the meaning of `=&r` is for example.
+
+I've decided to write this part to consolidate my knowledge related to the inline assembly, as inline assembly statements are quite common in the Linux kernel and we may see them in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) parts sometimes. I thought that it would be useful if we have a special part which contains information on more important aspects of the inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to put everything in one place.
+
+**Note: This part will not provide a guide to assembly programming. It is not intended to teach you to write programs in assembler or to explain what one or another assembler instruction means. It is just a little memo for extended asm.**
+
+Introduction to extended inline assembly
+--------------------------------------------------------------------------------
+
+So, let's start. As I already mentioned above, the `basic` assembly statement consists of the `asm` or `__asm__` keyword and set of assembly instructions. This form is in no way different from "normal" assembly. The most interesting part is inline assembler with operands, or `extended` assembler. An extended assembly statement looks more complicated and consists of more than two parts:
+
+```assembly
+__asm__ [volatile] [goto] (AssemblerTemplate
+                           [ : OutputOperands ]
+                           [ : InputOperands  ]
+                           [ : Clobbers       ]
+                           [ : GotoLabels     ]);
+```
+
+All parameters which are marked with square brackets are optional. You may notice that if we skip the optional parameters and the modifiers `volatile` and `goto` we obtain the `basic` form.
+
+Let's start to consider this in order. The first optional `qualifier` is `volatile`. This specifier tells the compiler that an assembly statement may produce `side effects`. In this case we need to prevent compiler optimizations related to the given assembly statement. In simple terms the `volatile` specifier instructs the compiler not to modify the statement and place it exactly where it was in the original code. As an example let's look at the following function from the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h):
+
+```C
+static inline void native_load_gdt(const struct desc_ptr *dtr)
+{
+	asm volatile("lgdt %0"::"m" (*dtr));
+}
+```
+
+Here we see the `native_load_gdt` function which loads a base address from the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `GDTR` register with the `lgdt` instruction. This assembly statement is marked with `volatile` qualifier. It is very important that the compiler does not change the original place of this assembly statement in the resulting code. Otherwise the `GDTR` register may contain wrong address for the `Global Descriptor Table` or the address may be correct, but the structure has not been filled yet. This can lead to an exception being generated, preventing the kernel from booting correctly.
+
+The second optional `qualifier` is the `goto`. This qualifier tells the compiler that the given assembly statement may perform a jump to one of the labels which are listed in the `GotoLabels`. For example:
+
+```C
+__asm__ goto("jmp %l[label]" : : : : label);
+```
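+
+As a slightly fuller, compilable sketch (assuming a `GCC` version that supports `asm goto`), the following program jumps over a line of `C` code to a label:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        /* unconditionally jump to the C label `out` */
+        __asm__ goto("jmp %l[out]" : : : : out);
+        printf("this line is skipped\n");
+out:
+        printf("after the jump\n");
+        return 0;
+}
+```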
+
+Since we finished with these two qualifiers, let's look at the main part of an assembly statement body. As we have seen above, the main part of an assembly statement consists of the following four parts:
+
+* set of assembly instructions;
+* output parameters;
+* input parameters;
+* clobbers.
+
+The first represents a string which contains a set of valid assembly instructions which may be separated by the `\t\n` sequence. Names of processor [registers](https://en.wikipedia.org/wiki/Processor_register) must be prefixed with the `%%` sequence in `extended` form and other symbols like immediates must start with the `$` symbol. The `OutputOperands` and `InputOperands` are comma-separated lists of [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) variables which may be provided with "constraints" and the `Clobbers` is a list of registers or other values which are modified by the assembler instructions from the `AssemblerTemplate` beyond those listed in the `OutputOperands`. Before we dive into the examples we have to know a little bit about `constraints`. A constraint is a string which specifies placement of an operand. For example the value of an operand may be written to a processor register or read from memory etc.
+
+Consider the following simple example:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        int a = 5;
+        int b = 10;
+        int sum = 0;
+
+        __asm__("addl %1,%2" : "=r" (sum) : "r" (a), "0" (b));
+        printf("a + b = %d\n", sum);
+        return 0;
+}
+```
+
+Let's compile and run it to be sure that it works as expected:
+
+```
+$ gcc test.c -o test
+./test
+a + b = 15
+```
+
+Ok, great. It works. Now let's look at this example in detail. Here we see a simple `C` program which calculates the sum of two variables placing the result into the `sum` variable and in the end we print the result. This example consists of three parts. The first is the assembly statement with the [add](http://x86.renejeschke.de/html/file_module_x86_id_5.html) instruction. It adds the value of the source operand together with the value of the destination operand and stores the result in the destination operand. In our case:
+
+```assembly
+addl %1, %2
+```
+
+will be expanded to the:
+
+```assembly
+addl a, b
+```
+
+Variables and expressions which are listed in the `OutputOperands` and `InputOperands` may be matched in the `AssemblerTemplate`. An input/output operand is designated as `%N` where the `N` is the number of the operand from left to right beginning from `zero`. The second part of our assembly statement is located after the first `:` symbol and contains the definition of the output value:
+
+```assembly
+"=r" (sum)
+```
+
+Notice that the `sum` is marked with two special symbols: `=r`. This is the first constraint that we have encountered. The actual constraint here is only `r` itself. The `=` symbol is a `modifier` which denotes an output value. This tells the compiler that the previous value will be discarded and replaced by the new data. Besides the `=` modifier, `GCC` provides support for the following three modifiers:
+
+* `+` - an operand is read and written by an instruction;
+* `&` - output register shouldn't overlap an input register and should be used only for output;
+* `%` - tells the compiler that operands may be [commutative](https://en.wikipedia.org/wiki/Commutative_property).
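+
+For example, a minimal sketch of the `+` modifier, where the operand is both read and written by the statement:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        int value = 10;
+
+        /* "+r": value is read, incremented by 5 and written back */
+        __asm__("addl $5, %0" : "+r" (value));
+        printf("value = %d\n", value);  /* prints 15 */
+        return 0;
+}
+```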
+
+Now let's go back to the `r` constraint. As I mentioned above, a constraint denotes the placement of an operand. The `r` symbol means a value will be stored in one of the [general purpose registers](https://en.wikipedia.org/wiki/Processor_register). The last part of our assembly statement:
+
+```assembly
+"r" (a), "0" (b)
+```
+
+These are input operands - variables `a` and `b`. We already know what the `r` constraint does. Now we can have a look at the constraint for the variable `b`. The `0` or any other digit from `1` to `9` is called a "matching constraint". With this a single operand can be used for multiple roles. The value of the constraint is the source operand index. In our case `0` will match `sum`. If we look at the assembly output of our program
+
+```assembly
+0000000000400400 <main>:
+  400401:       ba 05 00 00 00          mov    $0x5,%edx
+  400406:       b8 0a 00 00 00          mov    $0xa,%eax
+  40040b:       01 d0                   add    %edx,%eax
+```
+
+we see that only two general purpose registers are used: `%edx` and `%eax`. This way the `%eax` register is used for storing the value of `b` as well as storing the result of the calculation. We have looked at input and output parameters of an inline assembly statement. Before we move on to other constraints supported by `gcc`, there is one remaining part of the inline assembly statement we have not discussed yet - `clobbers`.
+
+Clobbers
+--------------------------------------------------------------------------------
+
+As mentioned above, the "clobbered" part should contain a comma-separated list of registers whose content will be modified by the assembler code. This is useful if our assembly expression needs additional registers for calculation. If we add clobbered registers to the inline assembly statement, the compiler takes this into account and the registers in question will not be simultaneously used by the compiler.
+
+Consider the example from before, but we will add an additional, simple assembler instruction:
+
+```C
+__asm__("movq $100, %%rdx\t\n"
+        "addl %1,%2" : "=r" (sum) : "r" (a), "0" (b));
+```
+
+If we look at the assembly output
+
+```assembly
+0000000000400400 <main>:
+  400400:       ba 05 00 00 00          mov    $0x5,%edx
+  400405:       b8 0a 00 00 00          mov    $0xa,%eax
+  40040a:       48 c7 c2 64 00 00 00    mov    $0x64,%rdx
+  400411:       01 d0                   add    %edx,%eax
+```
+
+we see that the `%edx` register is overwritten with `0x64` or `100` and the result will be `115` instead of `15`. Now if we add the `%rdx` register to the list of "clobbered" register
+
+```C
+__asm__("movq $100, %%rdx\t\n"
+        "addl %1,%2" : "=r" (sum) : "r" (a), "0" (b) : "%rdx");
+```
+
+and look at the assembler output again
+
+```assembly
+0000000000400400 <main>:
+  400400:       b9 05 00 00 00          mov    $0x5,%ecx
+  400405:       b8 0a 00 00 00          mov    $0xa,%eax
+  40040a:       48 c7 c2 64 00 00 00    mov    $0x64,%rdx
+  400411:       01 c8                   add    %ecx,%eax
+```
+
+the `%ecx` register will be used for `sum` calculation, preserving the intended semantics of the program. Besides general purpose registers, we may pass two special specifiers. They are:
+
+* `cc`;
+* `memory`.
+
+The first - `cc` indicates that an assembler code modifies [flags](https://en.wikipedia.org/wiki/FLAGS_register) register. This is typically used if the assembly within contains arithmetic or logic instructions.
+
+```C
+__asm__("incq %0" ::""(variable): "cc");
+```
+
+The second `memory` specifier tells the compiler that the given inline assembly statement executes read/write operations on memory not specified by operands in the output list. This prevents the compiler from keeping memory values loaded and cached in registers. Let's take a look at the following example:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        int a[3] = {10,20,30};
+        int b = 5;
+        
+        __asm__ volatile("incl %0" :: "m" (a[0]));
+        printf("a[0] - b = %d\n", a[0] - b);
+        return 0;
+}
+```
+
+This example may be artificial, but it illustrates the main idea. Here we have an array of integers and one integer variable. The example is pretty simple, we take the first element of `a` and increment its value. After this we subtract the value of `b` from the  first element of `a`. In the end we print the result. If we compile and run this simple example the result may surprise you.
+
+```
+~$ gcc -O3  test.c -o test
+~$ ./test
+a[0] - b = 5
+```
+
+The result is `5` here, but why? We incremented `a[0]` and subtracted b, so the result should be `6` here. If we have a look at the assembler output for this example
+
+```assembly
+00000000004004f6 <main>:
+  4004f6:       c7 44 24 f0 0a 00 00    movl   $0xa,-0x10(%rsp)
+  4004fd:       00 
+  4004fe:       c7 44 24 f4 14 00 00    movl   $0x14,-0xc(%rsp)
+  400505:       00 
+  400506:       c7 44 24 f8 1e 00 00    movl   $0x1e,-0x8(%rsp)
+  40050d:       00 
+  40050e:       ff 44 24 f0             incl   -0x10(%rsp)
+  400512:       b8 05 00 00 00          mov    $0x5,%eax
+```
+
+we see that the first element of the `a` contains the value `0xa` (`10`). The last two lines of code are the actual calculations. We see our increment instruction with `incl` but then just a move of `5` to the `%eax` register. This looks strange. The problem is we have passed the `-O3` flag to `gcc`, so the compiler did some constant folding and propagation to determine the result of `a[0] - 5` at compile time and reduced it to a `mov` with a constant `5` at runtime.
+
+Let's now add `memory` to the clobbers list
+
+```C
+__asm__ volatile("incl %0" :: "m" (a[0]) : "memory");
+```
+
+and the new result of running this is
+
+```
+~$ gcc -O3  test.c -o test
+~$ ./test
+a[0] - b = 6
+```
+
+Now the result is correct. If we look at the assembly output again
+
+```assembly
+00000000004004f6 <main>:
+  4004f6:       c7 44 24 f0 0a 00 00    movl   $0xa,-0x10(%rsp)
+  4004fd:       00 
+  4004fe:       c7 44 24 f4 14 00 00    movl   $0x14,-0xc(%rsp)
+  400505:       00 
+  400506:       c7 44 24 f8 1e 00 00    movl   $0x1e,-0x8(%rsp)
+  40050d:       00 
+  40050e:       ff 44 24 f0             incl   -0x10(%rsp)
+  400512:       8b 44 24 f0             mov    -0x10(%rsp),%eax
+  400516:       83 e8 05                sub    $0x5,%eax
+  400519:       c3                      retq
+```
+
+we will see one difference here, which is in the following piece of code:
+
+```assembly
+  400512:       8b 44 24 f0             mov    -0x10(%rsp),%eax
+  400516:       83 e8 05                sub    $0x5,%eax
+```
+
+Instead of constant folding, `GCC` now preserves calculations in the assembly and places the value of `a[0]` in the `%eax` register afterwards. In the end it just subtracts the constant value of `b`. Besides the `memory` specifier, we also see a new constraint here - `m`. This constraint tells the compiler to use the address of `a[0]`, instead of its value. So, now we are finished with `clobbers` and we may continue by looking at other constraints supported by `GCC` besides `r` and `m` which we have already seen.
+
+Constraints
+---------------------------------------------------------------------------------
+
+Now that we are finished with all three parts of an inline assembly statement, let's return to constraints. We already saw some constraints in the previous parts, like `r` which represents a `register` operand, `m` which represents a memory operand and `0-9` which represent a reused, indexed operand. Besides these `GCC` provides support for other constraints. For example the `i` constraint represents an `immediate` integer operand with known value.
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        int a = 0;
+
+        __asm__("movl %1, %0" : "=r"(a) : "i"(100));
+        printf("a = %d\n", a);
+        return 0;
+}
+```
+
+The result is:
+
+```
+~$ gcc test.c -o test
+~$ ./test
+a = 100
+```
+
+Or for example `I` which represents an immediate 32-bit integer. The difference between `i` and `I` is that `i` is general, whereas `I` is strictly specified to 32-bit integer data. For example if you try to compile the following
+
+```C
+int test_asm(int nr)
+{
+        unsigned long a = 0;
+
+        __asm__("movq %1, %0" : "=r"(a) : "I"(0xffffffffffff));
+        return a;
+}
+```
+
+you will get an error
+
+```
+$ gcc -O3 test.c -o test
+test.c: In function ‘test_asm’:
+test.c:7:9: warning: asm operand 1 probably doesn’t match constraints
+         __asm__("movq %1, %0" : "=r"(a) : "I"(0xffffffffffff));
+         ^
+test.c:7:9: error: impossible constraint in ‘asm’
+```
+
+when at the same time
+
+```C
+int test_asm(int nr)
+{
+        unsigned long a = 0;
+
+        __asm__("movq %1, %0" : "=r"(a) : "i"(0xffffffffffff));
+        return a;
+}
+```
+
+works perfectly.
+
+```
+~$ gcc -O3 test.c -o test
+~$ echo $?
+0
+```
+
+`GCC` also supports `J`, `K`, `N` constraints for integer constants in the range of 0-63 bits, signed 8-bit integer constants and unsigned 8-bit integer constants respectively. The `o` constraint represents a memory operand with an `offsetable` memory address. For example:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+        static unsigned long arr[3] = {0, 1, 2};
+        static unsigned long element;
+        
+        __asm__ volatile("movq 16+%1, %0" : "=r"(element) : "o"(arr));
+        printf("%d\n", element);
+        return 0;
+}
+```
+
+The result, as expected:
+
+```
+~$ gcc -O3 test.c -o test
+~$ ./test
+2
+```
+
+All of these constraints may be combined (so long as they do not conflict). In this case the compiler will choose the best one for a certain situation. For example:
+
+```C
+#include <stdio.h>
+
+int a = 1;
+
+int main(void)
+{
+        int b;
+        __asm__ ("movl %1,%0" : "=r"(b) : "r"(a));
+        return b;
+}
+```
+
+will use a memory operand.
+
+```assembly
+0000000000400400 <main>:
+  400400:       8b 05 26 0c 20 00       mov    0x200c26(%rip),%eax        # 60102c <a>
+```
+
+That's about all of the commonly used constraints in inline assembly statements. You can find more in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints).
+
+Architecture specific constraints
+--------------------------------------------------------------------------------
+
+Before we finish, let's look at the set of special constraints. These constraints are architecture specific and as this book is specific to the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, we will look at constraints related to it. First of all the set of `a` ... `d` and also `S` and `D` constraints represent [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers. In this case the `a` constraint corresponds to the `%al`, `%ax`, `%eax` or `%rax` register depending on instruction size. The `S` and `D` constraints are the `%si` and `%di` registers respectively. For example let's take our previous example. We can see in its assembly output that the value of the `a` variable is stored in the `%eax` register. Now let's look at the assembly output of the same assembly, but with another constraint:
+
+```C
+#include <stdio.h>
+
+int a = 1;
+
+int main(void)
+{
+        int b;
+        __asm__ ("movl %1,%0" : "=r"(b) : "d"(a));
+        return b;
+}
+```
+
+Now we see that the value of the `a` variable will be stored in the `%edx` register:
+
+```assembly
+0000000000400400 <main>:
+  400400:       8b 15 26 0c 20 00       mov    0x200c26(%rip),%edx        # 60102c <a>
+```
+
+The `f` and `t` constraints represent any floating point stack register - `%st` and the top of the floating point stack respectively. The `u` constraint represents the second value from the top of the floating point stack.
+
+That's all. You may find more details about [x86_64](https://en.wikipedia.org/wiki/X86-64) and general constraints in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints).
+
+Links
+--------------------------------------------------------------------------------
+
+* [Linux kernel source code](https://github.com/torvalds/linux)
+* [assembly programming language](https://en.wikipedia.org/wiki/Assembly_language) 
+* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
+* [GNU extension](https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html)
+* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
+* [Processor registers](https://en.wikipedia.org/wiki/Processor_register)
+* [add instruction](http://x86.renejeschke.de/html/file_module_x86_id_5.html)
+* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [constraints](https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints)

+ 11 - 0
Timers/README.md

@@ -0,0 +1,11 @@
+# Timers and time management
+
+This chapter describes timers and time management related concepts in the linux kernel.
+
+* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) - this part is an introduction to timers in the Linux kernel.
+* [Introduction to the clocksource framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-2.md) - this part describes the `clocksource` framework in the Linux kernel.
+* [The tick broadcast framework and dyntick](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-3.md) - this part describes the tick broadcast framework and the dyntick concept.
+* [Introduction to timers](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-4.md) - this part describes timers in the Linux kernel.
+* [Introduction to the clockevents framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-5.md) - this part describes yet another clock/time management related framework - `clockevents`.
+* [x86 related clock sources](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-6.md) - this part describes `x86_64` related clock sources.
+* [Time related system calls in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-7.md) - this part describes time related system calls.

+ 436 - 0
Timers/timers-1.md

@@ -0,0 +1,436 @@
+Timers and time management in the Linux kernel. Part 1.
+================================================================================
+
+Introduction
+--------------------------------------------------------------------------------
+
+This is yet another post that opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) was the last part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers and time management in general are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation for example, knowing the current time, scheduling asynchronous functions, scheduling the next event interrupt and many many more.
+
+So, we will start to learn the implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did this in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.
+
+Let's start.
+
+Initialization of non-standard PC hardware clock
+--------------------------------------------------------------------------------
+
+After the Linux kernel was decompressed (you can read more about this in the [Kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) part) the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting the [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value, we can see the call of the `setup_arch` function.
+
+As you may remember this function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file and prepares/initializes architecture-specific stuff (for example it reserves space for the [bss](https://en.wikipedia.org/wiki/.bss) section, reserves space for the [initrd](https://en.wikipedia.org/wiki/Initrd), parses the kernel command line and many many other things). Besides this, we can find some time management related functions there.
+
+The first is:
+
+```C
+x86_init.timers.wallclock_init();
+```
+
+We already saw the `x86_init` structure in the chapter that describes initialization of the Linux kernel. This structure contains pointers to the default setup functions for the different platforms like [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms), [Intel CE4100](http://www.wpgholdings.com/epaper/US/newsRelease_20091215/255874.pdf) etc. The `x86_init` structure is defined in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c#L36) and as you can see it assumes standard PC hardware by default.
+
+As we can see, the `x86_init` structure has the `x86_init_ops` type that provides a set of functions for platform specific setup like reserving standard resources, platform specific memory setup, initialization of interrupt handlers etc. This structure looks like:
+
+```C
+struct x86_init_ops {
+	struct x86_init_resources       resources;
+	struct x86_init_mpparse         mpparse;
+	struct x86_init_irqs            irqs;
+	struct x86_init_oem             oem;
+	struct x86_init_paging          paging;
+	struct x86_init_timers          timers;
+	struct x86_init_iommu           iommu;
+	struct x86_init_pci             pci;
+};
+```
+
+We can note the `timers` field that has the `x86_init_timers` type and as we can understand by its name - this field is related to time management and timers. The `x86_init_timers` contains four fields which are all pointers to functions that return [void](https://en.wikipedia.org/wiki/Void_type):
+
+* `setup_percpu_clockev` - set up the per cpu clock event device for the boot cpu;
+* `tsc_pre_init` - platform function called before [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) init;
+* `timer_init` - initialize the platform timer;
+* `wallclock_init` - initialize the wallclock device.
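+
+Roughly, this structure is just four function pointers. A sketch based on the list above (the exact definition lives in [arch/x86/include/asm/x86_init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/x86_init.h) and may differ between kernel versions):
+
+```C
+/* sketch: four callbacks, as described in the list above */
+struct x86_init_timers {
+	void (*setup_percpu_clockev)(void);
+	void (*tsc_pre_init)(void);
+	void (*timer_init)(void);
+	void (*wallclock_init)(void);
+};
+```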
+
+So, as we already know, in our case the `wallclock_init` executes initialization of the wallclock device. If we look at the `x86_init` structure, we will see that `wallclock_init` points to `x86_init_noop`:
+
+```C
+struct x86_init_ops x86_init __initdata = {
+	...
+	...
+	...
+	.timers = {
+		.wallclock_init		    = x86_init_noop,
+	},
+	...
+	...
+	...
+}
+```
+
+Where the `x86_init_noop` is just a function that does nothing:
+
+```C
+void __cpuinit x86_init_noop(void) { }
+```
+
+for the standard PC hardware. Actually, the `wallclock_init` function is used in the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform. Initialization of `x86_init.timers.wallclock_init` is located in the [arch/x86/platform/intel-mid/intel-mid.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel-mid.c) source code file, in the `x86_intel_mid_early_setup` function:
+
+```C
+void __init x86_intel_mid_early_setup(void)
+{
+	...
+	...
+	...
+	x86_init.timers.wallclock_init = intel_mid_rtc_init;
+	...
+	...
+	...
+}
+```
+
+Implementation of the `intel_mid_rtc_init` function is in the [arch/x86/platform/intel-mid/intel_mid_vrtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel_mid_vrtc.c) source code file and looks pretty easy. First of all, this function parses the [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface) M-Real-Time-Clock table to collect such devices into the `sfi_mrtc_array` array and to initialize the `set_time` and `get_time` functions:
+
+```C
+void __init intel_mid_rtc_init(void)
+{
+	unsigned long vrtc_paddr;
+
+	sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);
+
+	vrtc_paddr = sfi_mrtc_array[0].phys_addr;
+	if (!sfi_mrtc_num || !vrtc_paddr)
+		return;
+
+	vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
+								vrtc_paddr);
+
+	x86_platform.get_wallclock = vrtc_get_time;
+	x86_platform.set_wallclock = vrtc_set_mmss;
+}
+```
+
+That's all, after this a device based on `Intel MID` will be able to get the time from the hardware clock. As I already wrote, the standard PC [x86_64](https://en.wikipedia.org/wiki/X86-64) platform uses `x86_init_noop` here and just does nothing during the call of this function. We just saw initialization of the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) for the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) architecture and now it is time to return to the general `x86_64` architecture and look at the time management related stuff there.
+
+Acquainted with jiffies
+--------------------------------------------------------------------------------
+
+If we return to the `setup_arch` function which, as you remember, is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file, we will see the next call of a time management related function:
+
+```C
+register_refined_jiffies(CLOCK_TICK_RATE);
+```
+
+Before we look at the implementation of this function, we must know about the [jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29). As we can read on wikipedia:
+
+```
+Jiffy is an informal term for any unspecified short period of time
+```
+
+This definition is very similar to the `jiffy` in the Linux kernel. There is a global variable named `jiffies` which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:
+
+```C
+extern unsigned long volatile __jiffy_data jiffies;
+```
+
+during the initialization process. This global variable is increased on each timer interrupt. Besides this, near the `jiffies` variable we can see the definition of a similar variable
+
+```C
+extern u64 jiffies_64;
+```
+
+Actually only one of these variables is in use in the Linux kernel, and it depends on the processor type. For [x86_64](https://en.wikipedia.org/wiki/X86-64) the `u64` variant will be used and for [x86](https://en.wikipedia.org/wiki/X86) the `unsigned long` one. We will see this if we look at the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) linker script:
+
+```
+#ifdef CONFIG_X86_32
+...
+jiffies = jiffies_64;
+...
+#else
+...
+jiffies_64 = jiffies;
+...
+#endif
+```
+
+In the case of `x86_32` the `jiffies` will be the lower `32` bits of the `jiffies_64` variable. Schematically, we can imagine it as follows:
+
+```
+                    jiffies_64
++-----------------------------------------------------+
+|                       |                             |
+|                       |                             |
+|                       |       jiffies on `x86_32`   |
+|                       |                             |
+|                       |                             |
++-----------------------------------------------------+
+63                     31                             0
+```
+
+Now we know a little theory about `jiffies` and we can return to our function. There is no architecture-specific implementation for our function - `register_refined_jiffies`. This function is located in the generic kernel code - the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file. The main point of `register_refined_jiffies` is registration of the jiffy `clocksource`. Before we look at the implementation of the `register_refined_jiffies` function, we must know what a `clocksource` is. As we can read in the comments:
+
+```
+The `clocksource` is hardware abstraction for a free-running counter.
+```
+
+I'm not sure about you, but that description didn't give me a good understanding of the `clocksource` concept. Let's try to understand what it is, but we will not go too deep because this topic will be described in a separate part in much more detail. The main point of the `clocksource` is a timekeeping abstraction or in very simple words - it provides a time value to the kernel. We already know about the `jiffies` interface that represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and increased on each timer interrupt. The Linux kernel can use `jiffies` for time measurement. So why do we need a separate concept like the `clocksource`? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for measuring time intervals is hardware-dependent.
+
+For example `x86` has an on-chip 64-bit counter that is called the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) and its frequency can be equal to the processor frequency. Or for example the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) that consists of a `64-bit` counter of at least `10 MHz` frequency. Two different timers and they are both for `x86`. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the `clocksource` concept to solve the problem.
+
+The clocksource concept is represented by the `clocksource` structure in the Linux kernel. This structure is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains a couple of fields that describe a time counter. For example it contains the `name` field which is the name of a counter, the `flags` field that describes different properties of a counter, pointers to the `suspend` and `resume` functions, and many more.
+
+Let's look at the `clocksource` structure for jiffies that is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file:
+
+```C
+static struct clocksource clocksource_jiffies = {
+	.name		= "jiffies",
+	.rating		= 1,
+	.read		= jiffies_read,
+	.mask		= 0xffffffff,
+	.mult		= NSEC_PER_JIFFY << JIFFIES_SHIFT,
+	.shift		= JIFFIES_SHIFT,
+	.max_cycles	= 10,
+};
+```
+
+We can see the definition of the default name here - `jiffies`. The next is the `rating` field, which allows the best registered clock source for the specified hardware to be chosen by the clock source management code. The `rating` may have the following values:
+
+* `1-99`    - Only available for bootup and testing purposes;
+* `100-199` - Functional for real use, but not desired.
+* `200-299` - A correct and usable clocksource.
+* `300-399` - A reasonably fast and accurate clocksource.
+* `400-499` - The ideal clocksource. A must-use where available;
+
+For example the rating of the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) is `300`, but the rating of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is `250`. The next field is `read` - a pointer to the function that allows reading the clocksource's cycle value, or in other words it just returns the `jiffies` variable with the `cycle_t` type:
+
+```C
+static cycle_t jiffies_read(struct clocksource *cs)
+{
+        return (cycle_t) jiffies;
+}
+```
+
+which is just a 64-bit unsigned type:
+
+```C
+typedef u64 cycle_t;
+```
+
+The next field is the `mask` value, which ensures that subtractions between counter values from non-`64-bit` counters do not need special overflow logic. In our case the mask is `0xffffffff` and it is `32` bits. This means that `jiffy` wraps around to zero after `42` seconds:
+
+```python
+>>> 0xffffffff
+4294967295
+# 42 nanoseconds
+>>> 42 * pow(10, -9)
+4.2000000000000006e-08
+# 43 nanoseconds
+>>> 43 * pow(10, -9)
+4.3e-08
+```
+
+The next two fields `mult` and `shift` are used to convert the clocksource's period to nanoseconds per cycle. When the kernel calls the `clocksource.read` function, this function returns a value in `machine` time units represented with the `cycle_t` data type that we saw just now. To convert this return value to [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) we need these two fields: `mult` and `shift`. The `clocksource` framework provides the `clocksource_cyc2ns` function that will do it for us with the following expression:
+
+```C
+((u64) cycles * mult) >> shift;
+```
+
+As we can see, the `mult` field is equal to:
+
+```C
+NSEC_PER_JIFFY << JIFFIES_SHIFT
+
+#define NSEC_PER_JIFFY  ((NSEC_PER_SEC+HZ/2)/HZ)
+#define NSEC_PER_SEC    1000000000L
+```
+
+by default, and the `shift` is
+
+```C
+#if HZ < 34
+  #define JIFFIES_SHIFT   6
+#elif HZ < 67
+  #define JIFFIES_SHIFT   7
+#else
+  #define JIFFIES_SHIFT   8
+#endif
+```
+
+The `jiffies` clock source uses the `NSEC_PER_JIFFY` multiplier to specify the nanoseconds per cycle ratio. Note that the values of `JIFFIES_SHIFT` and `NSEC_PER_JIFFY` depend on the `HZ` value. `HZ` represents the frequency of the system timer. This macro is defined in [include/asm-generic/param.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/param.h) and depends on the `CONFIG_HZ` kernel configuration option. The value of `HZ` differs for each supported architecture, but for `x86` it's defined like:
+
+```C
+#define HZ		CONFIG_HZ
+```
+
+Where `CONFIG_HZ` can be one of the following values:
+
+![HZ](http://s9.postimg.org/xy85r3jrj/image.png)
+
+This means that in our case the timer interrupt frequency is `250 Hz`, or in other words the timer interrupt occurs `250` times per second, i.e. one timer interrupt every `4 ms`.
+
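+Just to make the `mult` and `shift` arithmetic more concrete, here is a small user-space sketch (it is not kernel code, and it assumes `HZ=250` as in the example above) that repeats the conversion the clocksource framework performs:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+#define HZ              250
+#define NSEC_PER_SEC    1000000000L
+#define NSEC_PER_JIFFY  ((NSEC_PER_SEC + HZ/2) / HZ)
+#define JIFFIES_SHIFT   8
+
+int main(void)
+{
+        uint32_t mult  = NSEC_PER_JIFFY << JIFFIES_SHIFT;
+        uint32_t shift = JIFFIES_SHIFT;
+        uint64_t cycles = 1;    /* one timer tick */
+
+        /* the same expression as clocksource_cyc2ns() */
+        uint64_t ns = ((uint64_t)cycles * mult) >> shift;
+
+        /* prints: mult=1024000000 shift=8 -> 4000000 ns per tick */
+        printf("mult=%u shift=%u -> %llu ns per tick\n",
+               mult, shift, (unsigned long long)ns);
+        return 0;
+}
+```
+
+So for `HZ=250` one tick is converted to `4000000` nanoseconds, which is exactly the `4 ms` mentioned above.
+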
+The last field that we can see in the definition of the `clocksource_jiffies` structure is `max_cycles`, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.
+
+Ok, we just saw the definition of the `clocksource_jiffies` structure, and we also know a little about `jiffies` and `clocksource`. Now it is time to get back to the implementation of our function. In the beginning of this part we stopped at the call of the:
+
+```C
+register_refined_jiffies(CLOCK_TICK_RATE);
+```
+
+function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file.
+
+As I already wrote, the main purpose of the `register_refined_jiffies` function is to register the `refined_jiffies` clocksource. We already saw that the `clocksource_jiffies` structure represents the standard `jiffies` clock source. Now, if you look in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file, you will find yet another clock source definition:
+
+```C
+struct clocksource refined_jiffies;
+```
+
+There is one difference between `refined_jiffies` and `clocksource_jiffies`: the standard `jiffies` based clock source is the lowest common denominator clock source which should function on all systems. As we already know, the `jiffies` global variable is increased during each timer interrupt. This means that the standard `jiffies` based clock source has the same resolution as the timer interrupt frequency. From this we can understand that the standard `jiffies` based clock source may suffer from inaccuracies. The `refined_jiffies` uses `CLOCK_TICK_RATE` as the basis of the `jiffies` shift.
+
+Let's look at the implementation of this function. First of all we can see that the `refined_jiffies` clock source is based on the `clocksource_jiffies` structure:
+
+```C
+int register_refined_jiffies(long cycles_per_second)
+{
+	u64 nsec_per_tick, shift_hz;
+	long cycles_per_tick;
+
+	refined_jiffies = clocksource_jiffies;
+	refined_jiffies.name = "refined-jiffies";
+	refined_jiffies.rating++;
+	...
+	...
+	...
+```
+
+Here we can see that we update the name of the `refined_jiffies` to `refined-jiffies` and increase the rating of this structure. As you remember, the `clocksource_jiffies` has rating `1`, so our `refined_jiffies` clocksource will have rating `2`. This means that the clock source management code will prefer `refined_jiffies` over the standard `jiffies` clock source.
+
+In the next step we need to calculate the number of cycles per tick:
+
+```C
+cycles_per_tick = (cycles_per_second + HZ/2)/HZ;
+```
+
+Note that we used the `NSEC_PER_SEC` macro as the base of the standard `jiffies` multiplier. Here we are using `cycles_per_second`, which is the first parameter of the `register_refined_jiffies` function. We've passed the `CLOCK_TICK_RATE` macro to the `register_refined_jiffies` function. This macro is defined in the [arch/x86/include/asm/timex.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/timex.h) header file and expands to:
+
+```C
+#define CLOCK_TICK_RATE         PIT_TICK_RATE
+```
+
+where the `PIT_TICK_RATE` macro expands to the frequency of the [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) programmable interval timer:
+
+```C
+#define PIT_TICK_RATE 1193182ul
+```
+
+After this we calculate `shift_hz` for the `register_refined_jiffies` function; it will store `hz << 8`, or in other words the frequency of the system timer. We shift `cycles_per_second`, i.e. the frequency of the programmable interval timer, left by `8` in order to get extra accuracy:
+
+```C
+shift_hz = (u64)cycles_per_second << 8;
+shift_hz += cycles_per_tick/2;
+do_div(shift_hz, cycles_per_tick);
+```
+
+In the next step we calculate the number of nanoseconds per tick, shifting `NSEC_PER_SEC` left by `8` too, as we did with `shift_hz`, and doing the same calculation as before:
+
+```C
+nsec_per_tick = (u64)NSEC_PER_SEC << 8;
+nsec_per_tick += (u32)shift_hz/2;
+do_div(nsec_per_tick, (u32)shift_hz);
+```
+
+The last step is the calculation of the new `mult` value for the `refined_jiffies` clock source from the just computed `nsec_per_tick`:
+
+```C
+refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
+```
+
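+To see where these numbers lead, here is a small user-space sketch (again, not kernel code; it assumes `HZ=250` and uses plain `64-bit` division instead of `do_div`) that repeats the calculations above for the `PIT_TICK_RATE` frequency:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+#define HZ             250
+#define NSEC_PER_SEC   1000000000ULL
+#define JIFFIES_SHIFT  8
+#define PIT_TICK_RATE  1193182UL        /* CLOCK_TICK_RATE on x86 */
+
+int main(void)
+{
+        uint64_t cycles_per_second = PIT_TICK_RATE;
+        uint64_t cycles_per_tick = (cycles_per_second + HZ/2) / HZ;
+
+        /* frequency of the system timer shifted left by 8 for extra accuracy */
+        uint64_t shift_hz = (cycles_per_second << 8) + cycles_per_tick / 2;
+        shift_hz /= cycles_per_tick;
+
+        /* nanoseconds per tick, also scaled by 2^8 */
+        uint64_t nsec_per_tick = (NSEC_PER_SEC << 8) + (uint32_t)shift_hz / 2;
+        nsec_per_tick /= (uint32_t)shift_hz;
+
+        uint32_t mult = (uint32_t)nsec_per_tick << JIFFIES_SHIFT;
+
+        printf("cycles_per_tick=%llu shift_hz=%llu nsec_per_tick=%llu mult=%u\n",
+               (unsigned long long)cycles_per_tick,
+               (unsigned long long)shift_hz,
+               (unsigned long long)nsec_per_tick, mult);
+        return 0;
+}
+```
+
+With these inputs the refined multiplier differs slightly from the standard `jiffies` one, which reflects the fact that the real tick length produced by the `i8253/i8254` timer is not exactly `1/HZ` seconds.
+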
+At the end of the `register_refined_jiffies` function we register the new clock source with the `__clocksource_register` function, which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file, and return:
+
+```C
+__clocksource_register(&refined_jiffies);
+return 0;
+```
+
+The clock source management code provides the API for clock source registration and selection. As we can see, clock sources are registered by calling the `__clocksource_register` function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system using the `clocksource.rating` field which we already saw when we examined the `clocksource` structure for `jiffies`.
+
+Using the jiffies
+--------------------------------------------------------------------------------
+
+We just saw initialization of two `jiffies` based clock sources in the previous paragraph:
+
+* standard `jiffies` based clock source;
+* refined  `jiffies` based clock source;
+
+Don't worry if you don't understand the calculations here. They look frightening at first, but step by step we will learn these things. So, we just saw the initialization of the `jiffies` based clock sources and we also know that the Linux kernel has the global variable `jiffies` that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use `jiffies` we can simply access the `jiffies` global variable by its name or call the `get_jiffies_64` function. This function is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and just returns the full `64-bit` value of `jiffies`:
+
+```C
+u64 get_jiffies_64(void)
+{
+	unsigned long seq;
+	u64 ret;
+
+	do {
+		seq = read_seqbegin(&jiffies_lock);
+		ret = jiffies_64;
+	} while (read_seqretry(&jiffies_lock, seq));
+	return ret;
+}
+EXPORT_SYMBOL(get_jiffies_64);
+```
+
+Note that the `get_jiffies_64` function is not implemented like `jiffies_read`, for example:
+
+```C
+static cycle_t jiffies_read(struct clocksource *cs)
+{
+	return (cycle_t) jiffies;
+}
+```
+
+We can see that the implementation of `get_jiffies_64` is more complex. The reading of the `jiffies_64` variable is implemented using [seqlocks](https://en.wikipedia.org/wiki/Seqlock). Actually this is done for machines that cannot atomically read the full 64-bit value.
+
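+For completeness, here is a small sketch of the writer side of the same [seqlock](https://en.wikipedia.org/wiki/Seqlock) pattern. It is not kernel code from [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c); it just protects a hypothetical private `64-bit` counter the same way `jiffies_64` is protected by `jiffies_lock`:
+
+```C
+#include <linux/seqlock.h>
+#include <linux/types.h>
+
+static DEFINE_SEQLOCK(my_counter_lock);
+static u64 my_counter_64;
+
+/* writer: take the sequence lock, update the value, release the lock */
+static void my_counter_inc(void)
+{
+	write_seqlock(&my_counter_lock);
+	my_counter_64++;
+	write_sequnlock(&my_counter_lock);
+}
+
+/* reader: retry the read if a writer was active in the meantime */
+static u64 my_counter_get(void)
+{
+	unsigned long seq;
+	u64 ret;
+
+	do {
+		seq = read_seqbegin(&my_counter_lock);
+		ret = my_counter_64;
+	} while (read_seqretry(&my_counter_lock, seq));
+
+	return ret;
+}
+```
+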
+Once we can access the `jiffies` or the `jiffies_64` variable, we can convert it to `human` time units. To get one second we can use the following expression:
+
+```C
+jiffies / HZ
+```
+
+So, if we know this, we can get any time units. For example:
+
+```C
+/* Thirty seconds from now */
+jiffies + 30*HZ
+
+/* Two minutes from now */
+jiffies + 120*HZ
+
+/* One millisecond from now */
+jiffies + HZ / 1000
+```
+
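+The typical in-kernel pattern built on top of these expressions is a timeout loop. Here is a minimal sketch of such a loop; the `my_device_ready` helper is hypothetical and used only for illustration:
+
+```C
+#include <linux/jiffies.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+
+extern bool my_device_ready(void);	/* hypothetical hardware check */
+
+static int wait_for_my_device(void)
+{
+	/* one second from now, expressed in timer ticks */
+	unsigned long timeout = jiffies + HZ;
+
+	/* time_before() compares jiffies values safely even across a wrap */
+	while (time_before(jiffies, timeout)) {
+		if (my_device_ready())
+			return 0;
+		msleep(10);
+	}
+
+	return -ETIMEDOUT;
+}
+```
+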
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: `jiffies` and `clocksource`. In the next part we will continue to dive into this interesting theme, and as I already wrote in this part, we will try to understand the internals of these and other time management concepts in the Linux kernel.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [system call](https://en.wikipedia.org/wiki/System_call)
+* [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)
+* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
+* [cgroups](https://en.wikipedia.org/wiki/Cgroups)
+* [bss](https://en.wikipedia.org/wiki/.bss)
+* [initrd](https://en.wikipedia.org/wiki/Initrd)
+* [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms)
+* [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
+* [void](https://en.wikipedia.org/wiki/Void_type)
+* [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
+* [Jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29)
+* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
+* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
+* [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
+* [clocksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
+* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html)

+ 451 - 0
Timers/timers-2.md

@@ -0,0 +1,451 @@
+Timers and time management in the Linux kernel. Part 2.
+================================================================================
+
+Introduction to the `clocksource` framework
+--------------------------------------------------------------------------------
+
+The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
+
+  * `jiffies`
+  * `clocksource`
+
+The first is a global variable that is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file and represents the counter that is increased during each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert `jiffies` to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant called `HZ` in the Linux kernel. The value of `HZ` is equal to the value of the `CONFIG_HZ` kernel configuration option, and if we look into the [arch/x86/configs/x86_64_defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig) kernel configuration file, we will see that the:
+
+```
+CONFIG_HZ_1000=y
+```
+
+kernel configuration option is set. This means that the value of `CONFIG_HZ` will be `1000` by default for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. So, if we divide the value of `jiffies` by the value of `HZ`:
+
+```
+jiffies / HZ
+```
+
+we will get the number of seconds that elapsed since the moment the Linux kernel started to work, or in other words we will get the system [uptime](https://en.wikipedia.org/wiki/Uptime). Since `HZ` represents the number of timer interrupts in a second, we can set a value for some time in the future. For example:
+
+```C
+/* one minute from now */
+unsigned long later = jiffies + 60*HZ;
+
+/* five minutes from now */
+unsigned long later = jiffies + 5*60*HZ;
+```
+
+This is a very common practice in the Linux kernel. For example, if you look into the [arch/x86/kernel/smpboot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/smpboot.c) source code file, you will find the `do_boot_cpu` function. This function boots all processors besides the bootstrap processor. You can find a snippet that waits ten seconds for a response from the application processor:
+
+```C
+if (!boot_error) {
+	timeout = jiffies + 10*HZ;
+	while (time_before(jiffies, timeout)) {
+		...
+		...
+		...
+		udelay(100);
+	}
+	...
+	...
+	...
+}
+```
+
+We assign the `jiffies + 10*HZ` value to the `timeout` variable here. As I think you already understood, this means a ten-second timeout. After this we enter a loop where we use the `time_before` macro to compare the current `jiffies` value and our timeout.
+
+Or for example, if we look into the [sound/isa/sscape.c](https://github.com/torvalds/linux/blob/master/sound/isa/sscape.c) source code file, which is the driver for the [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite) sound card, we will see the `obp_startup_ack` function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:
+
+```C
+static int obp_startup_ack(struct soundscape *s, unsigned timeout)
+{
+	unsigned long end_time = jiffies + msecs_to_jiffies(timeout);
+
+	do {
+		...
+		...
+		...
+		x = host_read_unsafe(s->io_base);
+		...
+		...
+		...
+		if (x == 0xfe || x == 0xff)
+			return 1;
+		msleep(10);
+	} while (time_before(jiffies, end_time));
+
+	return 0;
+}
+```
+
+As you can see, the `jiffies` variable is very widely used in the Linux kernel [code](http://lxr.free-electrons.com/ident?i=jiffies). As I already wrote, we also met another new time management related concept in the previous part - `clocksource`. We have only seen a short description of this concept and the API for clock source registration. Let's take a closer look at it in this part.
+
+Introduction to `clocksource`
+--------------------------------------------------------------------------------
+
+The `clocksource` concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The `time` concept is a fundamental concept in the Linux kernel and in other operating system kernels, and timekeeping is one of the necessities to use this concept. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and many many more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock or [RTC](https://en.wikipedia.org/wiki/Real-time_clock), which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the [drivers/rtc](https://github.com/torvalds/linux/tree/master/drivers/rtc) directory. Besides this, each architecture can provide a driver for the architecture-dependent real time clock, for example - `CMOS/RTC` - [arch/x86/kernel/rtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/rtc.c) for the [x86](https://en.wikipedia.org/wiki/X86) architecture. The second source is the system timer - a timer that raises [interrupts](https://en.wikipedia.org/wiki/Interrupt) at a periodic rate. For example, for [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer) compatibles it was the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer).
+
+We already know that for timekeeping purposes we can use `jiffies` in the Linux kernel. The `jiffies` can be considered a read-only global variable which is updated with `HZ` frequency. We know that `HZ` is a compile-time kernel parameter whose reasonable range is from `100` to `1000` [Hz](https://en.wikipedia.org/wiki/Hertz). So, it is guaranteed to have an interface for time measurement with `1` - `10` milliseconds resolution. Besides the standard `jiffies`, we saw the `refined_jiffies` clock source in the previous part that is based on the `i8253/i8254` [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) tick rate, which is almost `1193182` hertz. So with the `refined_jiffies` we can get a resolution of about `1` microsecond. Nowadays, [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) are the favorite choice for the time value units of a given clock source.
+
+The availability of more precise techniques for time interval measurement is hardware-dependent. We now know a little about `x86`-dependent timer hardware, but each architecture provides its own timer hardware, and earlier each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the `clocksource` framework.
+
+The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing low-level hardware pieces of clocksources. Measuring time intervals on different architectures with different hardware takes a large amount of effort and is very complex. The implementation of each clock related service is strongly associated with an individual hardware device, and as you can understand, this results in similar implementations for different architectures.
+
+Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are the favorite choice for the time value units of a clock source nowadays. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to select, access and scale different clock sources.
+
+The clocksource structure
+--------------------------------------------------------------------------------
+
+The fundamental data structure of the `clocksource` framework is the `clocksource` structure that is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). Let's look at the full definition of this structure and try to describe all of its fields:
+
+```C
+struct clocksource {
+	cycle_t (*read)(struct clocksource *cs);
+	cycle_t mask;
+	u32 mult;
+	u32 shift;
+	u64 max_idle_ns;
+	u32 maxadj;
+#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
+	struct arch_clocksource_data archdata;
+#endif
+	u64 max_cycles;
+	const char *name;
+	struct list_head list;
+	int rating;
+	int (*enable)(struct clocksource *cs);
+	void (*disable)(struct clocksource *cs);
+	unsigned long flags;
+	void (*suspend)(struct clocksource *cs);
+	void (*resume)(struct clocksource *cs);
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+	struct list_head wd_list;
+	cycle_t cs_last;
+	cycle_t wd_last;
+#endif
+	struct module *owner;
+} ____cacheline_aligned;
+```
+
+We already saw the first field of the `clocksource` structure in the previous part - it is a pointer to the `read` function that returns the current value of the clock source's counter. For example, we use the `jiffies_read` function to read the `jiffies` value:
+
+```C
+static struct clocksource clocksource_jiffies = {
+	...
+	.read		= jiffies_read,
+	...
+}
+```
+
+where `jiffies_read` just returns:
+
+```C
+static cycle_t jiffies_read(struct clocksource *cs)
+{
+	return (cycle_t) jiffies;
+}
+```
+
+Or the `read_tsc` function:
+
+```C
+static struct clocksource clocksource_tsc = {
+	...
+    .read                   = read_tsc,
+	...
+};
+```
+
+for reading the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter).
+
+The next field is `mask`, which ensures that subtraction between counter values of non-`64-bit` counters does not need special overflow logic. After the `mask` field, we can see two fields: `mult` and `shift`. These are the fields that form the basis of the mathematical functions providing the ability to convert time values specific to each clock source. In other words, these two fields help us to convert the abstract machine time units of a counter to nanoseconds.
+
+After these two fields we can see the `64-bit` `max_idle_ns` field, which represents the maximum idle time permitted by the clocksource in nanoseconds. We need this field for a Linux kernel with the `CONFIG_NO_HZ` kernel configuration option enabled. This kernel configuration option enables the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that the dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The `max_idle_ns` field represents this sleeping limit.
+
+The next field after `max_idle_ns` is the `maxadj` field, which is the maximum adjustment value to `mult`. The main formula by which we convert cycles to nanoseconds:
+
+```C
+((u64) cycles * mult) >> shift;
+```
+
+is not `100%` accurate. Instead, the number is taken as close as possible to a nanosecond, and `maxadj` helps to correct this and allows the clocksource API to avoid `mult` values that might overflow when adjusted. The next four fields are pointers to functions:
+
+* `enable` - optional function to enable clocksource;
+* `disable` - optional function to disable clocksource;
+* `suspend` - suspend function for the clocksource;
+* `resume` - resume function for the clocksource;
+
+The next field is `max_cycles`, and as we can understand from its name, this field represents the maximum cycle value before a potential overflow. And the last field, `owner`, represents a reference to the kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) that owns the clocksource. That's all. We just went through all the standard fields of the `clocksource` structure. But you may have noticed that we missed some fields of the `clocksource` structure. We can divide all the missed fields into two types: fields of the first type are already known to us, for example the `name` field that represents the name of a `clocksource`, the `rating` field that helps the Linux kernel to select the best clocksource, etc. The second type consists of fields which depend on different Linux kernel configuration options. Let's look at these fields.
+
+The first field is `archdata`. This field has the `arch_clocksource_data` type and depends on the `CONFIG_ARCH_CLOCKSOURCE_DATA` kernel configuration option. This field is relevant only for the [x86](https://en.wikipedia.org/wiki/X86) and [IA64](https://en.wikipedia.org/wiki/IA-64) architectures at the moment. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the `vDSO` clock mode:
+
+```C
+struct arch_clocksource_data {
+    int vclock_mode;
+};
+```
+ 
+for the `x86` architectures, where the `vDSO` clock mode can be one of the following:
+
+```C
+#define VCLOCK_NONE 0
+#define VCLOCK_TSC  1
+#define VCLOCK_HPET 2
+#define VCLOCK_PVCLOCK 3
+```
+
+The last three fields, `wd_list`, `cs_last` and `wd_last`, depend on the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. First of all, let's try to understand what a `watchdog` is. In simple words, a watchdog is a timer that is used to detect computer malfunctions and recover from them. All of these three fields contain watchdog related data that is used by the `clocksource` framework. If we grep the Linux kernel source code, we will see that only the [arch/x86/KConfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig#L54) kernel configuration file contains the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. So, why do `x86` and `x86_64` need a [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)? You may already know that all `x86` processors have a special 64-bit register - the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). This register contains the number of [cycles](https://en.wikipedia.org/wiki/Clock_rate) since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the `watchdog` timer in this part; before that we must learn more about timers.
+
+That's all. From this moment we know all the fields of the `clocksource` structure. This knowledge will help us to learn the internals of the `clocksource` framework.
+
+New clock source registration
+--------------------------------------------------------------------------------
+
+We saw only one function of the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). This function was `__clocksource_register`. It is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from the function's name, its main point is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:
+
+```C
+static inline int __clocksource_register(struct clocksource *cs)
+{
+	return __clocksource_register_scale(cs, 1, 0);
+}
+```
+
+Before we look at the implementation of the `__clocksource_register_scale` function, note that `clocksource` provides an additional API for registering a new clock source:
+
+```C
+static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
+{
+        return __clocksource_register_scale(cs, 1, hz);
+}
+
+static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
+{
+        return __clocksource_register_scale(cs, 1000, khz);
+}
+```
+
+All of these functions do the same thing: they return the value of the `__clocksource_register_scale` function, but with a different set of parameters. The `__clocksource_register_scale` function is defined in the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file. To understand the difference between these functions, let's look at the parameters of the `__clocksource_register_scale` function. As we can see, this function takes three parameters (we will see a small registration sketch right after this list):
+
+* `cs` - clocksource to be installed;
+* `scale` - scale factor of a clock source. In other words, if we multiply the value of this parameter by the frequency, we will get the `hz` of a clocksource;
+* `freq` - clock source frequency divided by scale.
+
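+Here is a minimal sketch of how a driver might use this API to register its own clock source. It assumes a hypothetical free-running `32-bit` hardware counter read by `my_read_counter` and running at `10 MHz`; both the helper and the frequency are assumptions made only for illustration:
+
+```C
+#include <linux/clocksource.h>
+#include <linux/types.h>
+#include <linux/init.h>
+
+extern u32 my_read_counter(void);	/* hypothetical hardware accessor */
+
+static cycle_t my_clocksource_read(struct clocksource *cs)
+{
+	return (cycle_t) my_read_counter();
+}
+
+static struct clocksource my_clocksource = {
+	.name	= "my-counter",
+	.rating	= 200,
+	.read	= my_clocksource_read,
+	.mask	= CLOCKSOURCE_MASK(32),
+	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+
+static int __init my_clocksource_init(void)
+{
+	/* the framework will calculate mult/shift from the 10 MHz frequency */
+	return clocksource_register_hz(&my_clocksource, 10000000);
+}
+```
+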
+Now let's look at the implementation of the `__clocksource_register_scale` function:
+
+```C
+int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
+{
+        __clocksource_update_freq_scale(cs, scale, freq);
+        mutex_lock(&clocksource_mutex);
+        clocksource_enqueue(cs);
+        clocksource_enqueue_watchdog(cs);
+        clocksource_select();
+        mutex_unlock(&clocksource_mutex);
+        return 0;
+}
+```
+
+First of all we can see that the `__clocksource_register_scale` function starts with the call of the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed the `frequency` as `0`. We do this only for some clock sources that have self-defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and you will see that we saw the calculation of `mult` and `shift` for `jiffies` there. The `__clocksource_update_freq_scale` function will do it for us for other clock sources.
+
+So, at the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter, and if it is not zero we need to calculate `mult` and `shift` for the given clock source. Let's look at the `mult` and `shift` calculation:
+
+```C
+void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
+{
+        u64 sec;
+
+		if (freq) {
+             sec = cs->mask;
+             do_div(sec, freq);
+             do_div(sec, scale);
+
+             if (!sec)
+                   sec = 1;
+             else if (sec > 600 && cs->mask > UINT_MAX)
+                   sec = 600;
+ 
+             clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
+                                    NSEC_PER_SEC / scale, sec * scale);
+	    }
+	    ...
+        ...
+        ...
+}
+```
+
+Here we can see the calculation of the maximum number of seconds which we can run before a clock source counter overflows. First of all we fill the `sec` variable with the value of the clock source mask. Remember that a clock source's mask represents the maximum number of bits that are valid for the given clock source. After this, we can see two division operations: first we divide our `sec` variable by the clock source frequency and then by the scale factor. The `freq` parameter shows us how many timer interrupts will occur in one second. So, we divide the `mask` value, which represents the maximum value of a counter (for example `jiffy`), by the frequency of a timer, and we get the maximum number of seconds for the certain clock source. The second division operation gives us the maximum number of seconds for the certain clock source depending on its scale factor, which can be `1` hertz or `1` kilohertz (10³ Hz).
+
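+The following user-space sketch (it is not kernel code) repeats this calculation for a hypothetical `32-bit` counter running at `14318180 Hz` (a frequency typical for the `hpet` timer):
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+int main(void)
+{
+        uint64_t mask  = 0xffffffffULL; /* 32-bit counter */
+        uint32_t freq  = 14318180;      /* counter frequency in Hz */
+        uint32_t scale = 1;             /* as in clocksource_register_hz() */
+
+        uint64_t sec = mask;
+        sec /= freq;                    /* do_div(sec, freq) */
+        sec /= scale;                   /* do_div(sec, scale) */
+
+        if (!sec)
+                sec = 1;
+        else if (sec > 600 && mask > UINT32_MAX)
+                sec = 600;
+
+        /* prints 299: the counter overflows after about five minutes */
+        printf("max seconds before overflow: %llu\n",
+               (unsigned long long)sec);
+        return 0;
+}
+```
+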
+After we have got the maximum number of seconds, we check this value in the next step and set it to `1` or `600` depending on the result. These values are the maximum sleeping time for a clocksource in seconds. In the next step we can see the call of the `clocks_calc_mult_shift` function. The main point of this function is the calculation of the `mult` and `shift` values for a given clock source. At the end of the `__clocksource_update_freq_scale` function we check that the just calculated `mult` value of the given clock source will not cause an overflow after adjustment, update the `max_idle_ns` and `max_cycles` values of the given clock source with the maximum nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:
+
+```C
+pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
+	cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
+```
+
+that we can see in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:
+
+```
+$ dmesg | grep "clocksource:"
+[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
+[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
+[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
+[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
+[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
+```
+
+After the `__clocksource_update_freq_scale` function finishes its work, we can return back to the `__clocksource_register_scale` function that will register the new clock source. We can see the calls of the following three functions:
+
+```C
+mutex_lock(&clocksource_mutex);
+clocksource_enqueue(cs);
+clocksource_enqueue_watchdog(cs);
+clocksource_select();
+mutex_unlock(&clocksource_mutex);
+```
+
+Note that before the first one is called, we lock the `clocksource_mutex` [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion). The point of the `clocksource_mutex` mutex is to protect the `curr_clocksource` variable, which represents the currently selected `clocksource`, and the `clocksource_list` variable, which represents the list that contains the registered `clocksources`. Now, let's look at these three functions.
+
+The first function, `clocksource_enqueue`, and the other two are defined in the same source code [file](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c). We go through all already registered `clocksources`, or in other words all elements of the `clocksource_list`, and try to find the best place for the given `clocksource`:
+
+```C
+static void clocksource_enqueue(struct clocksource *cs)
+{
+	struct list_head *entry = &clocksource_list;
+	struct clocksource *tmp;
+
+	list_for_each_entry(tmp, &clocksource_list, list)
+		if (tmp->rating >= cs->rating)
+			entry = &tmp->list;
+	list_add(&cs->list, entry);
+}
+```
+
+In the end we just insert the new clocksource into the `clocksource_list`. The second function, `clocksource_enqueue_watchdog`, does almost the same as the previous function, but it inserts the new clock source into the `wd_list` depending on the flags of the clock source and starts a new [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer) timer. As I already wrote, we will not consider `watchdog` related stuff in this part, but will do it in the next parts.
+
+The last function is `clocksource_select`. As we can understand from the function's name, the main point of this function is to select the best `clocksource` from the registered clocksources. This function consists only of a call to a helper function:
+
+```C
+static void clocksource_select(void)
+{
+	return __clocksource_select(false);
+}
+```
+
+Note that the `__clocksource_select` function takes one parameter (`false` in our case). This [bool](https://en.wikipedia.org/wiki/Boolean_data_type) parameter shows how to traverse the `clocksource_list`. In our case we pass `false`, which means that we will go through all entries of the `clocksource_list`. We already know that the `clocksource` with the best rating will be the first in the `clocksource_list` after the call of the `clocksource_enqueue` function, so we can easily get it from this list. After we have found the clock source with the best rating, we switch to it:
+
+```C
+if (curr_clocksource != best && !timekeeping_notify(best)) {
+	pr_info("Switched to clocksource %s\n", best->name);
+	curr_clocksource = best;
+}
+```
+
+We can see the result of this operation in the `dmesg` output:
+
+```
+$ dmesg | grep Switched
+[    0.199688] clocksource: Switched to clocksource hpet
+[    2.452966] clocksource: Switched to clocksource tsc
+```
+
+Note that we can see two clock sources in the `dmesg` output (`hpet` and `tsc` in our case). Yes, actually there can be many different clock sources on a particular piece of hardware. So the Linux kernel knows about all registered clock sources and switches to the clock source with a better rating each time a new clock source is registered.
+
+If we look at the bottom of the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file, we will see that it has a [sysfs](https://en.wikipedia.org/wiki/Sysfs) interface. The main initialization occurs in the `init_clocksource_sysfs` function, which will be called during the device `initcalls`. Let's look at the implementation of the `init_clocksource_sysfs` function:
+
+```C
+static struct bus_type clocksource_subsys = {
+	.name = "clocksource",
+	.dev_name = "clocksource",
+};
+
+static int __init init_clocksource_sysfs(void)
+{
+	int error = subsys_system_register(&clocksource_subsys, NULL);
+
+	if (!error)
+		error = device_register(&device_clocksource);
+	if (!error)
+		error = device_create_file(
+				&device_clocksource,
+				&dev_attr_current_clocksource);
+	if (!error)
+		error = device_create_file(&device_clocksource,
+					   &dev_attr_unbind_clocksource);
+	if (!error)
+		error = device_create_file(
+				&device_clocksource,
+				&dev_attr_available_clocksource);
+	return error;
+}
+device_initcall(init_clocksource_sysfs);
+```
+
+First of all we can see that it registers a `clocksource` subsystem with the call of the `subsys_system_register` function. In other words, after the call of this function, we will have the following directory:
+
+```
+$ pwd
+/sys/devices/system/clocksource
+```
+
+After this step, we can see registration of the `device_clocksource` device which is represented by the following structure:
+
+```C
+static struct device device_clocksource = {
+	.id	= 0,
+	.bus	= &clocksource_subsys,
+};
+```
+
+and creation of three files:
+
+* `dev_attr_current_clocksource`;
+* `dev_attr_unbind_clocksource`;
+* `dev_attr_available_clocksource`.
+
+These files provide information about the current clock source in the system and the available clock sources in the system, and an interface which allows unbinding of a clock source.
+
+After the `init_clocksource_sysfs` function has been executed, we will be able to find some information about the available clock sources in:
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
+tsc hpet acpi_pm 
+```
+
+Or for example information about current clock source in the system:
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
+tsc
+```
+
+In the previous part, we saw the API for the registration of the `jiffies` clock source, but didn't dive into details about the `clocksource` framework. In this part we did it and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the `clocksource` framework provides. There are a couple of additional functions, like `clocksource_unregister` for removing a given clock source from the `clocksource_list`, etc. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested, you can find them in [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c).
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the second part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: `jiffies` and `clocksource`. In this part we saw some examples of `jiffies` usage and learned more details about the `clocksource` concept.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [x86](https://en.wikipedia.org/wiki/X86)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [uptime](https://en.wikipedia.org/wiki/Uptime)
+* [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite)
+* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
+* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
+* [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer)
+* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
+* [Hz](https://en.wikipedia.org/wiki/Hertz)
+* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
+* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
+* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
+* [loadable kernel module](https://en.wikipedia.org/wiki/Loadable_kernel_module)
+* [IA64](https://en.wikipedia.org/wiki/IA-64)
+* [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)
+* [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
+* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)

+ 444 - 0
Timers/timers-3.md

@@ -0,0 +1,444 @@
+Timers and time management in the Linux kernel. Part 3.
+================================================================================
+
+The tick broadcast framework and dyntick
+--------------------------------------------------------------------------------
+
+This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel, and we stopped at the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html). We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of this chapter is `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:
+
+```C
+register_refined_jiffies(CLOCK_TICK_RATE);
+```
+
+function which is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and performs the initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and performs architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. Look at the implementation of `setup_arch` and you will note that the call of `register_refined_jiffies` is the last step before the `setup_arch` function finishes its work.
+
+There are many different `x86_64` specific things already configured after the end of the `setup_arch` execution. For example, some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers are already able to handle interrupts, memory space is reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) is scanned, the Linux kernel log buffer is already set up, which means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work, [e820](https://en.wikipedia.org/wiki/E820) is parsed and the Linux kernel already knows about the available memory, and many many other architecture specific things (if you are interested, you can read more about the `setup_arch` function and the Linux kernel initialization process in the second [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book).
+
+Now that `setup_arch` has finished its work, we can get back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. So, we shall return to this function. You can see that there are many different functions called right after the `setup_arch` function inside of the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to time management in the Linux kernel is:
+
+```C
+tick_init();
+```
+
+in the `start_kernel`. The `tick_init` function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and does two things:
+
+* Initialization of `tick broadcast` framework related data structures;
+* Initialization of `full` tickless mode related data structures.
+
+We have not seen anything related to the `tick broadcast` framework in this book so far, and we don't know anything about tickless mode in the Linux kernel yet. So, the main point of this part is to look at these concepts and find out what they are.
+
+The idle process
+--------------------------------------------------------------------------------
+
+First of all, let's look at the implementation of the `tick_init` function. As I already wrote, this function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and consists of two calls of the following functions:
+
+```C
+void __init tick_init(void)
+{
+	tick_broadcast_init();
+	tick_nohz_init();
+}
+```
+
+As you can understand from the paragraph's title, we are interested only in the `tick_broadcast_init` function for now. This function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and performs the initialization of the `tick broadcast` framework related data structures. Before we look at the implementation of the `tick_broadcast_init` function and try to understand what this function does, we need to know about the `tick broadcast` framework.
+
+The main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has nothing to execute, the Linux kernel launches the `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html). When the Linux kernel has finished all initialization processes in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, it calls the `rest_init` function from the same source code file. The main point of this function is to launch the kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling and to go to sleep by calling the `cpu_idle_loop` function that is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) source code file.
+
+The `cpu_idle_loop` function represents an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process finishes its work and control is transferred to a new runnable task with the call of the `schedule_preempt_disabled` function:
+
+```C
+static void cpu_idle_loop(void)
+{
+	while (1) {
+		while (!need_resched()) {
+		...
+		...
+		...
+	    /* the main idle function */
+		cpuidle_idle_call();
+	}
+	...
+	...
+	...
+	schedule_preempt_disabled();
+}
+```
+
+Of course, we will not consider the full implementation of the `cpu_idle_loop` function and the details of the `idle` state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that the processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the `idle` process if the processor executes the infinite loop in the `cpu_idle_loop`? The answer is system timer interrupts. When an interrupt occurs, the processor stops the `idle` thread and transfers control to an interrupt handler. After the system timer interrupt handler has been handled, `need_resched` will return true and the Linux kernel will stop the `idle` process and transfer control to the current runnable task. But handling of system timer interrupts is not efficient for [power management](https://en.wikipedia.org/wiki/Power_management), because if a processor is in the `idle` state, there is little point in sending it a system timer interrupt.
+
+By default, the `CONFIG_HZ_PERIODIC` kernel configuration option is enabled in the Linux kernel, and it tells the kernel to handle each interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:
+
+The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the `CONFIG_NO_HZ_IDLE` kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts will be replaced with on-demand interrupts. This mode is called `dyntick-idle` mode. But if the kernel does not handle the interrupts of a system timer, how can the kernel decide if the system has nothing to do?
+
+Whenever the idle task is selected to run, the periodic tick is disabled with the call of the `tick_nohz_idle_enter` function that is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file, and enabled with the call of the `tick_nohz_idle_exit` function. There is a special concept in the Linux kernel called `clock event devices` that are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future and is represented by the `clock_event_device` structure in the Linux kernel. We will not dive into the implementation of the `clock_event_device` structure now. We will see it in the next part of this chapter. But there is one interesting moment for us right now.
+
+The second way is to omit scheduling-clock ticks on processors that are either in the `idle` state or that have only one runnable task, or in other words on busy processors. We can enable this feature with the `CONFIG_NO_HZ_FULL` kernel configuration option and it allows reducing the number of timer interrupts significantly.
+
+Besides the `cpu_idle_loop`, an idle processor can be in a sleeping state. The Linux kernel provides the special `cpuidle` framework. The main point of this framework is to put an idle processor into sleeping states. The name of the set of these states is `C-states`. But how will a processor be woken up if the local timer is disabled? The Linux kernel provides the `tick broadcast` framework for this. The main point of this framework is to assign a timer which is not affected by the `C-states`. This timer will wake a sleeping processor.
+
+Now, after some theory we can return to the implementation of our function. Let's recall that the `tick_init` function just calls two following functions:
+
+```C
+void __init tick_init(void)
+{
+	tick_broadcast_init();
+	tick_nohz_init();
+}
+```
+
+Let's consider the first function. The `tick_broadcast_init` function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and performs the initialization of the `tick broadcast` framework related data structures. Let's look at the implementation of the `tick_broadcast_init` function:
+
+```C
+void __init tick_broadcast_init(void)
+{
+        zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
+        zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
+        zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
+#ifdef CONFIG_TICK_ONESHOT
+         zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
+         zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
+         zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
+#endif
+}
+```
+
+As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function is defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c) source code file and calls the following function:
+
+```C
+bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
+{
+        return alloc_cpumask_var(mask, flags | __GFP_ZERO);
+}
+```
+
+Ultimately, the memory space for the given `cpumask` will be allocated with the certain flags with the help of the `kmalloc_node` function:
+
+```C
+*mask = kmalloc_node(cpumask_size(), flags, node);
+```
+
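+As a side note, here is a minimal sketch of the same `cpumask` API as it could be used from some kernel code (for example from a module init function); error handling is reduced to the bare minimum and the function itself is purely illustrative:
+
+```C
+#include <linux/cpumask.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/gfp.h>
+
+static cpumask_var_t my_mask;
+
+static int my_mask_example(void)
+{
+	/* allocate a zero-filled cpumask (tick_broadcast_init() does the
+	 * same, but with GFP_NOWAIT) */
+	if (!zalloc_cpumask_var(&my_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	cpumask_set_cpu(0, my_mask);	/* mark CPU 0 in the bitmap */
+	pr_info("cpu0 is set: %d\n", cpumask_test_cpu(0, my_mask));
+
+	free_cpumask_var(my_mask);
+	return 0;
+}
+```
+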
+Now let's look at the `cpumasks` that will be initialized in the `tick_broadcast_init` function. As we can see, the `tick_broadcast_init` function initializes six `cpumasks`; moreover, the initialization of the last three `cpumasks` depends on the `CONFIG_TICK_ONESHOT` kernel configuration option.
+
+The first three `cpumasks` are:
+
+* `tick_broadcast_mask` - the bitmap which represents the list of processors that are in a sleeping mode;
+* `tick_broadcast_on` - the bitmap that stores the numbers of processors which are in a periodic broadcast state;
+* `tmpmask` - a bitmap for temporary usage.
+
+As we already know, the next three `cpumasks` depend on the `CONFIG_TICK_ONESHOT` kernel configuration option. Actually, each clock event device can be in one of two modes:
+
+* `periodic` - clock events devices that support periodic events;
+* `oneshot`  - clock events devices that capable of issuing events that happen only once.
+
+The Linux kernel defines two masks for such clock event devices in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file:
+
+```C
+#define CLOCK_EVT_FEAT_PERIODIC        0x000001
+#define CLOCK_EVT_FEAT_ONESHOT         0x000002
+```
+
+So, the last three `cpumasks` are:
+
+* `tick_broadcast_oneshot_mask` - stores the numbers of processors that must be notified;
+* `tick_broadcast_pending_mask` - stores the numbers of processors that have a pending broadcast;
+* `tick_broadcast_force_mask`   - stores the numbers of processors with an enforced broadcast.
+
+We have initialized six `cpumasks` in the `tick broadcast` framework, and now we can proceed to implementation of this framework.
+
+The `tick broadcast` framework
+--------------------------------------------------------------------------------
+
+Hardware may provide some clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers are called `clock events` devices in the Linux kernel. Besides `clock events` devices, each processor in the system has its own local timer which is programmed to issue an interrupt at the time of the next deferred task. Also, these timers can be programmed to do a periodical job, like updating `jiffies`, etc. These timers are represented by the `tick_device` structure in the Linux kernel. This structure is defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.h) header file and looks like:
+
+```C
+struct tick_device {
+        struct clock_event_device *evtdev;
+        enum tick_device_mode mode;
+};
+```
+
+Note that the `tick_device` structure contains two fields. The first field, `evtdev`, is a pointer to the `clock_event_device` structure that is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and represents the descriptor of a clock event device. A `clock event` device allows registering an event that will happen in the future. As I already wrote, we will not consider the `clock_event_device` structure and related API in this part, but will see it in the next part.
+
+The second field of the `tick_device` structure represents the mode of the `tick_device`. As we already know, the mode can be one of the:
+
+```C
+enum tick_device_mode {
+        TICKDEV_MODE_PERIODIC,
+        TICKDEV_MODE_ONESHOT,
+};
+```
+
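+Note that each processor in the system has its own instance of this structure. As far as I can tell, it is declared as a [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file, roughly like this:
+
+```C
+/* per-processor tick device (a sketch; the exact declaration may differ) */
+DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
+```
+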
+Each `clock events` device in the system registers itself by a call of the `clockevents_register_device` function or the `clockevents_config_and_register` function during the initialization process of the Linux kernel. During the registration of a new `clock events` device, the Linux kernel calls the `tick_check_new_device` function that is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the given `clock events` device should be used by the Linux kernel. After all checks, the `tick_check_new_device` function executes a call of the:
+
+```C
+tick_install_broadcast_device(newdev);
+```
+
+function that checks whether the given `clock event` device can be the broadcast device and installs it if so. Let's look at the implementation of the `tick_install_broadcast_device` function:
+
+```C
+void tick_install_broadcast_device(struct clock_event_device *dev)
+{
+	struct clock_event_device *cur = tick_broadcast_device.evtdev;
+
+	if (!tick_check_broadcast_device(cur, dev))
+		return;
+
+	if (!try_module_get(dev->owner))
+		return;
+
+	clockevents_exchange_device(cur, dev);
+
+	if (cur)
+		cur->event_handler = clockevents_handle_noop;
+
+	tick_broadcast_device.evtdev = dev;
+
+	if (!cpumask_empty(tick_broadcast_mask))
+		tick_broadcast_start_periodic(dev);
+
+	if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
+		tick_clock_notify();
+}
+```
+
+First of all we get the current `clock event` device from the `tick_broadcast_device`. The `tick_broadcast_device` is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file:
+
+```C
+static struct tick_device tick_broadcast_device;
+```
+
+and represents the external clock device that keeps track of events for a processor. The first step after we have got the current clock device is a call of the `tick_check_broadcast_device` function, which checks that a given clock events device can be utilized as the broadcast device. The main point of the `tick_check_broadcast_device` function is to check the value of the `features` field of the given `clock events` device. As we can understand from the name of this field, the `features` field contains the clock event device features. Available values are defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and can be, for example, `CLOCK_EVT_FEAT_PERIODIC` - which represents a clock events device that supports periodic events - and so on. So, the `tick_check_broadcast_device` function checks the `features` flags for `CLOCK_EVT_FEAT_ONESHOT`, `CLOCK_EVT_FEAT_DUMMY` and other flags and returns `false` if the given clock events device has one of these features. Otherwise, the `tick_check_broadcast_device` function compares the `ratings` of the given clock event device and the current clock event device and returns the best one.
+
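+To get a feel for these checks, here is a simplified sketch of what the `tick_check_broadcast_device` function boils down to (based on the description above; the exact set of feature flags differs between kernel versions):
+
+```C
+static bool tick_check_broadcast_device(struct clock_event_device *curdev,
+					struct clock_event_device *newdev)
+{
+	/* dummy, per-cpu or C3-stop devices cannot be the broadcast device */
+	if ((newdev->features & CLOCK_EVT_FEAT_DUMMY) ||
+	    (newdev->features & CLOCK_EVT_FEAT_PERCPU) ||
+	    (newdev->features & CLOCK_EVT_FEAT_C3STOP))
+		return false;
+
+	/* do not replace an oneshot capable device with a non-oneshot one */
+	if (tick_broadcast_device.evtdev &&
+	    (tick_broadcast_device.evtdev->features & CLOCK_EVT_FEAT_ONESHOT) &&
+	    !(newdev->features & CLOCK_EVT_FEAT_ONESHOT))
+		return false;
+
+	/* otherwise prefer the device with the higher rating */
+	return !curdev || newdev->rating > curdev->rating;
+}
+```
+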
+After the `tick_check_broadcast_device` function, we can see the call of the `try_module_get` function that checks the module owner of the clock events device. We need to do this to be sure that the given `clock events` device was correctly initialized. The next step is a call of the `clockevents_exchange_device` function that is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and releases the old clock events device and replaces the previous functional handler with a dummy handler.
+
+In the last step of the `tick_install_broadcast_device` function we check that the `tick_broadcast_mask` is not empty and start the given `clock events` device in periodic mode with a call of the `tick_broadcast_start_periodic` function:
+
+```C
+if (!cpumask_empty(tick_broadcast_mask))
+	tick_broadcast_start_periodic(dev);
+
+if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
+	tick_clock_notify();
+```
+
+The `tick_broadcast_mask` is filled in the `tick_device_uses_broadcast` function that checks a `clock events` device during its registration:
+
+```C
+/* cpu is obtained at the call site via smp_processor_id() */
+
+int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
+{
+	...
+	...
+	...
+	if (!tick_device_is_functional(dev)) {
+		...
+		cpumask_set_cpu(cpu, tick_broadcast_mask);
+		...
+	}
+	...
+	...
+	...
+}
+```
+
+You can read more about the `smp_processor_id` macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.
+
+The `tick_broadcast_start_periodic` function checks the given `clock event` device and calls the `tick_setup_periodic` function:
+
+```C
+static void tick_broadcast_start_periodic(struct clock_event_device *bc)
+{
+	if (bc)
+		tick_setup_periodic(bc, 1);
+}
+```
+
+that is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and sets the broadcast handler for the given `clock event` device by calling the following function:
+
+```C
+tick_set_periodic_handler(dev, broadcast);
+```
+
+This function checks the second parameter, which represents the broadcast state (`on` or `off`), and sets the broadcast handler depending on its value:
+
+```C
+void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
+{
+	if (!broadcast)
+		dev->event_handler = tick_handle_periodic;
+	else
+		dev->event_handler = tick_handle_periodic_broadcast;
+}
+```
+
+When a `clock event` device issues an interrupt, the `dev->event_handler` will be called. For example, let's look at the interrupt handler of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), which is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file:
+
+```C
+static irqreturn_t hpet_interrupt_handler(int irq, void *data)
+{
+	struct hpet_dev *dev = (struct hpet_dev *)data;
+	struct clock_event_device *hevt = &dev->evt;
+
+	if (!hevt->event_handler) {
+		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
+				dev->num);
+		return IRQ_HANDLED;
+	}
+
+	hevt->event_handler(hevt);
+	return IRQ_HANDLED;
+}
+```
+
+The `hpet_interrupt_handler` gets the [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) specific data and checks the event handler of the `clock event` device. Recall that we have just set it in the `tick_set_periodic_handler` function. So the `tick_handle_periodic_broadcast` function will be called at the end of the high precision event timer interrupt handler.
+
+The `tick_handle_periodic_broadcast` function calls the
+
+```C
+bc_local = tick_do_periodic_broadcast();
+```
+
+function which stores the numbers of processors which have asked to be woken up in the temporary `cpumask` and calls the `tick_do_broadcast` function:
+
+```C
+cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
+return tick_do_broadcast(tmpmask);
+```
+
+The `tick_do_broadcast` function calls the `broadcast` callback of the given clock events device, which sends an [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt) interrupt to the given set of processors. In the end we can call the event handler of the given `tick_device`:
+
+```C
+if (bc_local)
+	td->evtdev->event_handler(td->evtdev);
+```
+
+which actually represents the interrupt handler of the local timer of a processor. After this a processor will wake up. That is all about the `tick broadcast` framework in the Linux kernel. We have missed some aspects of this framework, for example reprogramming of a `clock event` device, broadcast with the oneshot timer and so on. But the Linux kernel is very big and it is not realistic to cover all of its aspects. I think it will be interesting to dive into it by yourself.
+
+If you remember, we started this part with the call of the `tick_init` function. We have just considered the `tick_broadcast_init` function and the related theory, but the `tick_init` function contains another function call and this function is - `tick_nohz_init`. Let's look at the implementation of this function.
+
+Initialization of dyntick related data structures
+--------------------------------------------------------------------------------
+
+We already saw some information about the `dyntick` concept in this part and we know that this concept allows the kernel to disable system timer interrupts in the `idle` state. The `tick_nohz_init` function initializes the different data structures which are related to this concept. This function is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and starts from the check of the value of the `tick_nohz_full_running` variable which represents the state of the tick-less mode for the `idle` state and the state when system timer interrupts are disabled while a processor has only one runnable task:
+
+```C
+if (!tick_nohz_full_running) {
+    if (tick_nohz_init_all() < 0)
+        return;
+}
+```
+
+If this mode is not running we call the `tick_nohz_init_all` function that is defined in the same source code file and check its result. The `tick_nohz_init_all` function tries to allocate the `tick_nohz_full_mask` with a call of the `alloc_cpumask_var` function that will allocate space for the `tick_nohz_full_mask`. The `tick_nohz_full_mask` will store the numbers of processors that have full `NO_HZ` enabled. After successful allocation of the `tick_nohz_full_mask` we set all bits in the `tick_nohz_full_mask`, set the `tick_nohz_full_running` variable and return the result to the `tick_nohz_init` function:
+
+```C
+static int tick_nohz_init_all(void)
+{
+        int err = -1;
+#ifdef CONFIG_NO_HZ_FULL_ALL
+        if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
+                WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
+                return err;
+        }
+        err = 0;
+        cpumask_setall(tick_nohz_full_mask);
+        tick_nohz_full_running = true;
+#endif
+        return err;
+}
+```
+
+In the next step we try to allocate a memory space for the `housekeeping_mask`:
+
+```C
+if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
+	WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
+	cpumask_clear(tick_nohz_full_mask);
+	tick_nohz_full_running = false;
+	return;
+}
+```
+
+This `cpumask` will store the numbers of processors used for `housekeeping`, or in other words, we need at least one processor that will not be in `NO_HZ` mode, because it will do timekeeping and so on. After this we check the result of the architecture-specific `arch_irq_work_has_interrupt` function. This function checks the ability to send an inter-processor interrupt on the given architecture. We need to check this because the system timer of a processor will be disabled during `NO_HZ` mode, so there must be at least one online processor which can send an inter-processor interrupt to awaken a sleeping processor. This function is defined in the [arch/x86/include/asm/irq_work.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_work.h) header file for [x86_64](https://en.wikipedia.org/wiki/X86-64) and just checks that a processor has an [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) using [CPUID](https://en.wikipedia.org/wiki/CPUID):
+
+```C
+static inline bool arch_irq_work_has_interrupt(void)
+{
+    return cpu_has_apic;
+}
+```
+
+If a processor does not have an `APIC`, the Linux kernel prints a warning message, clears the `tick_nohz_full_mask` cpumask, copies the numbers of all possible processors in the system to the `housekeeping_mask` and resets the value of the `tick_nohz_full_running` variable:
+
+```C
+if (!arch_irq_work_has_interrupt()) {
+	pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
+		   "support irq work self-IPIs\n");
+	cpumask_clear(tick_nohz_full_mask);
+	cpumask_copy(housekeeping_mask, cpu_possible_mask);
+	tick_nohz_full_running = false;
+	return;
+}
+```
+
+After this step, we get the number of the current processor by a call of the `smp_processor_id` macro and check this processor against the `tick_nohz_full_mask`. If the `tick_nohz_full_mask` contains the given processor, we clear the appropriate bit in the `tick_nohz_full_mask`:
+
+```C
+cpu = smp_processor_id();
+
+if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
+	pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
+	cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+}
+```
+
+This is because this processor will be used for timekeeping. After this step we put into the `housekeeping_mask` the numbers of all processors that are in the `cpu_possible_mask` and not in the `tick_nohz_full_mask`:
+
+```C
+cpumask_andnot(housekeeping_mask,
+	       cpu_possible_mask, tick_nohz_full_mask);
+```
+
+After this operation, the `housekeeping_mask` will contain all processors of the system except the processor used for timekeeping. In the last step of the `tick_nohz_init` function, we go through all processors that are defined in the `tick_nohz_full_mask` and call the following function for each processor:
+
+```C
+for_each_cpu(cpu, tick_nohz_full_mask)
+	context_tracking_cpu_set(cpu);
+```
+
+The `context_tracking_cpu_set` function is defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) source code file and the main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `true`. When the `active` field is set to `true` for a certain processor, the Linux kernel context tracking subsystem will start to track [context switches](https://en.wikipedia.org/wiki/Context_switch) (transitions between user space and kernel space) on this processor.
+
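+A very simplified sketch of the core of this function could look as follows (the real function also enables a static key the first time it is called, so that the tracking code gets patched in):
+
+```C
+/* a sketch: mark context tracking as active for the given processor */
+void context_tracking_cpu_set(int cpu)
+{
+	if (!per_cpu(context_tracking.active, cpu))
+		per_cpu(context_tracking.active, cpu) = true;
+}
+```
+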
+That's all. This is the end of the `tick_nohz_init` function. After this, the `NO_HZ` related data structures will be initialized. We didn't see the API of the `NO_HZ` mode, but will see it soon.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the third part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clocksource` concept in the Linux kernel which represents a framework for managing different clock sources in an interrupt and hardware characteristics independent way. We continued to look at the Linux kernel initialization process in a time management context in this part and got acquainted with two new concepts: the `tick broadcast` framework and the `tick-less` mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep and the second concept represents the mode in which the kernel may work to improve power management of `idle` processors.
+
+In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us - `timers`.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [initrd](https://en.wikipedia.org/wiki/Initrd)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface)
+* [printk](https://en.wikipedia.org/wiki/Printk)
+* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
+* [power management](https://en.wikipedia.org/wiki/Power_management)
+* [NO_HZ documentation](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt)
+* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
+* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
+* [CPUID](https://en.wikipedia.org/wiki/CPUID)
+* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [context switches](https://en.wikipedia.org/wiki/Context_switch)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)

+ 427 - 0
Timers/timers-4.md

@@ -0,0 +1,427 @@
+Timers and time management in the Linux kernel. Part 4.
+================================================================================
+
+Timers
+--------------------------------------------------------------------------------
+
+This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html) we learned about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into the time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.
+
+The Linux kernel provides a `software timer` concept to allow kernel functions to be invoked at a future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/master/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides an implementation of the framework for managing groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.
+
+We can find the `list_set` structure that contains a `gc` field in this source code file:
+
+```C
+struct list_set {
+	...
+	struct timer_list gc;
+	...
+};
+```
+
+Note that the `gc` field has the `timer_list` type. This structure is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file and its main point is to store `dynamic` timers in the Linux kernel. Actually, the Linux kernel provides two types of timers called dynamic timers and interval timers. The first type of timers is used by the kernel, and the second can be used by user mode. The `timer_list` structure represents actual `dynamic` timers. The `gc` timer in our `list_set` example represents a timer for garbage collection. This timer will be initialized in the `list_set_gc_init` function:
+
+```C
+static void
+list_set_gc_init(struct ip_set *set, void (*gc)(unsigned long ul_set))
+{
+	struct list_set *map = set->data;
+	...
+	...
+	...
+	map->gc.function = gc;
+	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
+	...
+	...
+	...
+}
+```
+
+The function pointed to by the `gc` pointer will be called after the timeout which is equal to `map->gc.expires`.
+
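+For reference, the `timer_list` structure itself is quite small. Stripped of statistics and debugging fields, it looks roughly like this (a sketch based on [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h); the exact layout differs between kernel versions):
+
+```C
+struct timer_list {
+	struct hlist_node	entry;		/* entry in a timer bucket           */
+	unsigned long		expires;	/* expiration time, in jiffies       */
+	void			(*function)(unsigned long);	/* the callback      */
+	unsigned long		data;		/* argument passed to the callback   */
+	u32			flags;		/* cpu number, deferrable bit, etc.  */
+};
+```
+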
+Ok, we will not dive into this example with [netfilter](https://en.wikipedia.org/wiki/Netfilter), because this chapter is not about [network](https://en.wikipedia.org/wiki/Computer_network) related stuff. But we saw that timers are widely used in the Linux kernel and learned that they represent a concept which allows functions to be called in the future.
+
+Now let's continue to research the source code of the Linux kernel which is related to timers and time management stuff, as we did in all previous chapters.
+
+Introduction to dynamic timers in the Linux kernel
+--------------------------------------------------------------------------------
+
+As I already wrote, we learned about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html). They are initialized in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file by the call of the `tick_init` function. If we look at this source code file, we will see that the next time management related function is:
+
+```C
+init_timers();
+```
+
+This function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and contains calls of four functions:
+
+```C
+void __init init_timers(void)
+{
+	init_timer_cpus();
+	init_timer_stats();
+	timer_register_cpu_notifier();
+	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
+}
+```
+
+Let's look at the implementation of each function. The first function is `init_timer_cpus`, defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file, which just calls the `init_timer_cpu` function for each possible processor in the system:
+
+```C
+static void __init init_timer_cpus(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		init_timer_cpu(cpu);
+}
+```
+
+If you do not know or do not remember what a `possible` cpu is, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) of this book which describes the `cpumask` concept in the Linux kernel. In short, a `possible` processor is a processor which can be plugged in at any time during the life of the system.
+
+The `init_timer_cpu` function does the main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and stores data related to `dynamic` timers for a certain processor. Let's look at the definition of this structure:
+
+```C
+struct tvec_base {
+	spinlock_t lock;
+	struct timer_list *running_timer;
+	unsigned long timer_jiffies;
+	unsigned long next_timer;
+	unsigned long active_timers;
+	unsigned long all_timers;
+	int cpu;
+	bool migration_enabled;
+	bool nohz_active;
+	struct tvec_root tv1;
+	struct tvec tv2;
+	struct tvec tv3;
+	struct tvec tv4;
+	struct tvec tv5;
+} ____cacheline_aligned;
+```
+
+The `tvec_base` structure contains the following fields: the `lock` field for `tvec_base` protection, the next `running_timer` field points to the currently running timer for the certain processor, the `timer_jiffies` field represents the earliest expiration time (it will be used by the Linux kernel to find already expired timers). The next field - `next_timer` contains the next pending timer for the next timer [interrupt](https://en.wikipedia.org/wiki/Interrupt) in a case when a processor goes to sleep and the `NO_HZ` mode is enabled in the Linux kernel. The `active_timers` field provides accounting of non-deferrable timers, or in other words all timers that will not be stopped when a processor goes to sleep. The `all_timers` field tracks the total number of timers, i.e. `active_timers` + deferrable timers. The `cpu` field represents the number of the processor which owns the timers. The `migration_enabled` and `nohz_active` fields represent the possibility of timer migration to another processor and the status of the `NO_HZ` mode respectively.
+
+The last five fields of the `tvec_base` structure represent lists of dynamic timers. The first `tv1` field has:
+
+```C
+#define TVR_SIZE (1 << TVR_BITS)
+#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
+
+...
+...
+...
+
+struct tvec_root {
+	struct hlist_head vec[TVR_SIZE];
+};
+```
+
+type. Note that the value of the `TVR_SIZE` depends on the `CONFIG_BASE_SMALL` kernel configuration option:
+
+![base small](http://s17.postimg.org/db3towlu7/base_small.png)
+
+that reduces the size of the kernel data structures if it is enabled. The `tv1` is an array that may contain `64` or `256` elements where each element represents a dynamic timer that will decay within the next `255` system timer interrupts. The next three fields: `tv2`, `tv3` and `tv4` are lists with dynamic timers too, but they store dynamic timers which will decay within the next `2^14 - 1`, `2^20 - 1` and `2^26 - 1` system timer interrupts respectively. The last `tv5` field represents a list which stores dynamic timers with a large expiration period.
+
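+The `tv2` ... `tv5` fields have a similar but larger bucket type. It is defined next to the `tvec_root` structure in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and looks roughly like this (a sketch; the sizes also depend on `CONFIG_BASE_SMALL`):
+
+```C
+#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
+#define TVN_SIZE (1 << TVN_BITS)
+
+struct tvec {
+	struct hlist_head vec[TVN_SIZE];
+};
+```
+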
+So, now we have seen the `tvec_base` structure and the description of its fields and we can look at the implementation of the `init_timer_cpu` function. As I already wrote, this function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and executes initialization of the `tvec_bases`:
+
+```C
+static void __init init_timer_cpu(int cpu)
+{
+	struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);
+
+	base->cpu = cpu;
+	spin_lock_init(&base->lock);
+
+	base->timer_jiffies = jiffies;
+	base->next_timer = base->timer_jiffies;
+}
+```
+
+The `tvec_bases` is a [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable which represents the main data structure for dynamic timers for a given processor. This `per-cpu` variable is defined in the same source code file:
+
+```C
+static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
+```
+
+First of all we get the address of the `tvec_bases` for the given processor into the `base` variable and, as we have got it, we start to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) value and the number of the possible processor, we need to initialize the `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:
+
+```C
+void __init init_timer_stats(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		raw_spin_lock_init(&per_cpu(tstats_lookup_lock, cpu));
+}
+```
+
+The `tstats_lookup_lock` variable represents a `per-cpu` raw spinlock:
+
+```C
+static DEFINE_PER_CPU(raw_spinlock_t, tstats_lookup_lock);
+```
+
+which will be used to protect operations with statistics of timers that can be accessed through [procfs](https://en.wikipedia.org/wiki/Procfs):
+
+```C
+static int __init init_tstats_procfs(void)
+{
+	struct proc_dir_entry *pe;
+
+	pe = proc_create("timer_stats", 0644, NULL, &tstats_fops);
+	if (!pe)
+		return -ENOMEM;
+	return 0;
+}
+```
+
+For example:
+
+```
+$ cat /proc/timer_stats
+Timerstats sample period: 3.888770 s
+  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  15,     1 swapper          hcd_submit_urb (rh_timer_func)
+   4,   959 kedac            schedule_timeout (process_timeout)
+   1,     0 swapper          page_writeback_init (wb_timer_fn)
+  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
+  ...
+  ...
+  ...
+```
+
+The next step after initialization of the `tstats_lookup_lock` spinlock is the call of the `timer_register_cpu_notifier` function. This function depends on the `CONFIG_HOTPLUG_CPU` kernel configuration option which enables support for [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) processors in the Linux kernel.
+
+When a processor is logically offlined, a notification will be sent to the Linux kernel with the `CPU_DEAD` or the `CPU_DEAD_FROZEN` event by the call of the `cpu_notifier` macro:
+
+```C
+#ifdef CONFIG_HOTPLUG_CPU
+...
+...
+static inline void timer_register_cpu_notifier(void)
+{
+	cpu_notifier(timer_cpu_notify, 0);
+}
+...
+...
+#else
+...
+...
+static inline void timer_register_cpu_notifier(void) { }
+...
+...
+#endif /* CONFIG_HOTPLUG_CPU */
+```
+
+In this case the `timer_cpu_notify` function will be called, which checks the event type and calls the `migrate_timers` function:
+
+```C
+static int timer_cpu_notify(struct notifier_block *self,
+	                        unsigned long action, void *hcpu)
+{
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		migrate_timers((long)hcpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+```
+
+This chapter will not describe `hotplug` related events in the Linux kernel source code, but if you are interested in such things, you can find the implementation of the `migrate_timers` function in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file.
+
+The last step in the `init_timers` function is the call of the:
+
+```C
+open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
+```
+
+function. The `open_softirq` function may already be familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) about interrupts and interrupt handling in the Linux kernel. In short, the `open_softirq` function is defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler.
+
+In our case the deferred function is the `run_timer_softirq` function that will be called after a hardware interrupt in the `do_IRQ` function which is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle software dynamic timers. The Linux kernel does not do this during the hardware timer interrupt handling because it is a time consuming operation.
+
+Let's look at the implementation of the `run_timer_softirq` function:
+
+```C
+static void run_timer_softirq(struct softirq_action *h)
+{
+	struct tvec_base *base = this_cpu_ptr(&tvec_bases);
+
+	if (time_after_eq(jiffies, base->timer_jiffies))
+		__run_timers(base);
+}
+```
+
+At the beginning of the `run_timer_softirq` function we get the `dynamic` timer base for the current processor and compare the current value of [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) with the value of `timer_jiffies` of this structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file:
+
+```C
+#define time_after_eq(a,b)          \
+    (typecheck(unsigned long, a) && \
+     typecheck(unsigned long, b) && \
+    ((long)((a) - (b)) >= 0))
+```
+
+Recall that the `timer_jiffies` field of the `tvec_base` structure represents the relative time when functions delayed by the given timer will be executed. So we compare these two values and if the current time represented by `jiffies` is greater than or equal to `base->timer_jiffies`, we call the `__run_timers` function that is defined in the same source code file. Let's look at the implementation of this function.
+
+As I just wrote, the `__run_timers` function runs all expired timers for a given processor. This function starts by acquiring the `tvec_base`'s lock to protect the `tvec_base` structure:
+
+```C
+static inline void __run_timers(struct tvec_base *base)
+{
+	struct timer_list *timer;
+
+	spin_lock_irq(&base->lock);
+	...
+	...
+	...
+	spin_unlock_irq(&base->lock);
+}
+```
+
+After this it starts a loop that runs while `timer_jiffies` is not greater than `jiffies`:
+
+```C
+while (time_after_eq(jiffies, base->timer_jiffies)) {
+	...
+	...
+	...
+}
+```
+
+We can find many different manipulations in this loop, but the main point is to find expired timers and call the delayed functions. First of all we need to calculate the `index` of the `base->tv1` list that stores the next timer to be handled with the following expression:
+
+```C
+index = base->timer_jiffies & TVR_MASK;
+```
+
+where `TVR_MASK` is a mask for getting the `tvec_root->vec` elements. Once we have the index of the next timer which must be handled, we check its value. If the index is zero, we go through all lists in our cascade tables `tv2`, `tv3` and so on, and rehash them with the call of the `cascade` function:
+
+```C
+if (!index &&
+	(!cascade(base, &base->tv2, INDEX(0))) &&
+		(!cascade(base, &base->tv3, INDEX(1))) &&
+				!cascade(base, &base->tv4, INDEX(2)))
+		cascade(base, &base->tv5, INDEX(3));
+```
+
+After this we increase the value of the `base->timer_jiffies`:
+
+```C
+++base->timer_jiffies;
+```
+
+In the last step we execute the corresponding function for each timer from the list in the following loop:
+
+```C
+hlist_move_list(base->tv1.vec + index, head);
+
+while (!hlist_empty(head)) {
+	...
+	...
+	...
+	timer = hlist_entry(head->first, struct timer_list, entry);
+	fn = timer->function;
+	data = timer->data;
+
+	spin_unlock(&base->lock);
+	call_timer_fn(timer, fn, data);
+	spin_lock(&base->lock);
+
+	...
+	...
+	...
+}
+```
+
+where the `call_timer_fn` function just calls the given function:
+
+```C
+static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
+	                      unsigned long data)
+{
+	...
+	...
+	...
+	fn(data);
+	...
+	...
+	...
+}
+```
+
+That's all. From this moment the Linux kernel has the infrastructure for `dynamic timers`. We will not dive deeper into this interesting theme. As I already wrote, `timers` are a [widely](http://lxr.free-electrons.com/ident?i=timer_list) used concept in the Linux kernel and neither one nor two parts would be enough to cover how they are implemented and how they work in all details. But now we know about this concept, why the Linux kernel needs it and some data structures around it.
+
+Now let's look at the usage of `dynamic timers` in the Linux kernel.
+
+Usage of dynamic timers
+--------------------------------------------------------------------------------
+
+As you may have already noted, if the Linux kernel provides a concept, it also provides an API for managing this concept, and the `dynamic timers` concept is no exception here. To use a timer in the Linux kernel code, we must define a variable with the `timer_list` type. We can initialize our `timer_list` structure in two ways. The first is to use the `init_timer` macro that is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file:
+
+```C
+#define init_timer(timer)    \
+	__init_timer((timer), 0)
+
+#define __init_timer(_timer, _flags)   \
+         init_timer_key((_timer), (_flags), NULL, NULL)
+```
+
+where the `init_timer_key` function just calls the:
+
+```C
+do_init_timer(timer, flags, name, key);
+```
+
+function which fills the given `timer` with default values. The second way is to use the:
+
+```C
+#define TIMER_INITIALIZER(_function, _expires, _data)		\
+	__TIMER_INITIALIZER((_function), (_expires), (_data), 0)
+```
+
+macro which will initialize the given `timer_list` structure too.
+
+After a `dynamic timer` is initialized we can start this `timer` with the call of the:
+
+```C
+void add_timer(struct timer_list * timer);
+```
+
+function and stop it with the:
+
+```C
+int del_timer(struct timer_list * timer);
+```
+
+function.
+
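+To tie all of this together, here is a minimal hypothetical sketch (the `my_timer` name, the callback and the timeout are made up for illustration) that arms a timer to fire roughly two seconds in the future, using the fields and functions described above:
+
+```C
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/printk.h>
+
+static struct timer_list my_timer;
+
+/* the callback runs in softirq context when the timer expires */
+static void my_timer_callback(unsigned long data)
+{
+	pr_info("my_timer expired, data = %lu\n", data);
+}
+
+static void my_timer_arm(void)
+{
+	init_timer(&my_timer);
+	my_timer.function = my_timer_callback;
+	my_timer.data     = 42;
+	my_timer.expires  = jiffies + 2 * HZ;	/* ~2 seconds from now */
+	add_timer(&my_timer);
+}
+```
+
+When the timer is not needed anymore, it should be stopped with `del_timer` (or with `del_timer_sync` on multi-processor systems) as described above.
+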
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the fourth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with two new concepts: the `tick broadcast` framework and the `NO_HZ` mode. In this part we continued to dive into time management related stuff and got acquainted with a new concept - the `dynamic timer`, or software timer. We didn't see the implementation of the `dynamic timers` management code in detail in this part, but we saw the data structures and API around this concept.
+
+In the next part we will continue to dive into timer management related things in the Linux kernel and will get acquainted with yet another new concept - the `clockevents` framework.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
+* [netfilter](https://en.wikipedia.org/wiki/Netfilter)
+* [network](https://en.wikipedia.org/wiki/Computer_network)
+* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
+* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
+* [procfs](https://en.wikipedia.org/wiki/Procfs)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)

+ 415 - 0
Timers/timers-5.md

@@ -0,0 +1,415 @@
+Timers and time management in the Linux kernel. Part 5.
+================================================================================
+
+Introduction to the `clockevents` framework
+--------------------------------------------------------------------------------
+
+This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you may have noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) part of this chapter. It was the `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.
+
+At first let's refresh your memory and try to remember what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):
+
+> For example issuing the command 'date' on a Linux system will eventually read the clock source to determine exactly what time it is.
+
+The Linux kernel supports many different clock sources. You can find some of them in [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource). For example the good old [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) - a [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) with a `1193182` Hz frequency, and yet another one - the [ACPI PM](http://uefi.org/sites/default/files/resources/ACPI_5.pdf) timer with a `3579545` Hz frequency. Besides the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory, each architecture may provide its own architecture-specific clock sources. For example the [x86](https://en.wikipedia.org/wiki/X86) architecture provides the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), and [powerpc](https://en.wikipedia.org/wiki/PowerPC) provides access to the processor timer through the `timebase` register.
+
+Each clock source provides a monotonic atomic counter. As I already wrote, the Linux kernel supports a huge set of different clock sources and each clock source has its own parameters like [frequency](https://en.wikipedia.org/wiki/Frequency). The main goal of the `clocksource` framework is to provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) to select the best available clock source in the system, i.e. the clock source with the highest frequency. An additional goal of the `clocksource` framework is to represent the atomic counter provided by a clock source in human units. At this time, nanoseconds are the favorite choice for the time value units of a given clock source in the Linux kernel.
+
+The `clocksource` framework is represented by the `clocksource` structure which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and which contains the `name` of a clock source, the rating of a certain clock source in the system (a clock source with a higher frequency has the biggest rating in the system), the `list` of all registered clock sources in the system, `enable` and `disable` fields to enable and disable a clock source, a pointer to the `read` function which must return the atomic counter of a clock source and so on.
+
+Additionally the `clocksource` structure provides two fields: `mult` and `shift` which are needed for translation of the atomic counter provided by a certain clock source to human units, i.e. [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond). Translation occurs via the following formula:
+
+```
+ns ~= (clocksource * mult) >> shift
+```
+
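+To make this formula a bit more concrete, here is a small made-up example for a hypothetical `10 MHz` clock source, where one tick of the counter corresponds to `100` nanoseconds:
+
+```C
+/* hypothetical clock source: 10 MHz, i.e. 100 ns per counter tick */
+u64 cycles = 1000;                  /* raw counter value read from the hardware  */
+u32 shift  = 8;
+u32 mult   = 100 << 8;              /* chosen so that (mult >> shift) == 100 ns  */
+
+u64 ns = (cycles * mult) >> shift;  /* (1000 * 25600) >> 8 = 100000 ns == 100 us */
+```
+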
+As we already know, besides the `clocksource` structure, the `clocksource` framework provides an API for registration of clock sources with different frequency scale factors:
+
+```C
+static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
+static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
+```
+
+A clock source unregistration:
+
+```C
+int clocksource_unregister(struct clocksource *cs)
+```
+
+and etc.
+
+Additionally to the `clocksource` framework, the Linux kernel provides `clockevents` framework. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):
+
+> Clock events are the conceptual reverse of clock sources
+
+The main goal of the `clockevents` framework is to manage clock event devices, or in other words - to manage devices that allow us to register an event, i.e. an [interrupt](https://en.wikipedia.org/wiki/Interrupt), that is going to happen at a defined point of time in the future.
+
+Now we know a little about the `clockevents` framework in the Linux kernel, and now it is time to look at its [API](https://en.wikipedia.org/wiki/Application_programming_interface).
+
+API of `clockevents` framework
+-------------------------------------------------------------------------------
+
+The main structure which describes a clock event device is the `clock_event_device` structure. This structure is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and contains a huge set of fields. Just like the `clocksource` structure it has a `name` field which contains a human readable name of a clock event device, for example the [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) timer:
+
+```C
+static struct clock_event_device lapic_clockevent = {
+    .name                   = "lapic",
+    ...
+    ...
+    ...
+}
+```
+
+It also contains the `event_handler`, `set_next_event` and `next_event` fields for a certain clock event device, which are the [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler), the setter of the next event and the local storage for the next event respectively. Yet another field of the `clock_event_device` structure is the `features` field. Its value may be one of the following generic features:
+
+```C
+#define CLOCK_EVT_FEAT_PERIODIC	0x000001
+#define CLOCK_EVT_FEAT_ONESHOT		0x000002
+```
+
+Where the `CLOCK_EVT_FEAT_PERIODIC` represents a device which may be programmed to generate events periodically. The `CLOCK_EVT_FEAT_ONESHOT` represents a device which may generate an event only once. Besides these two features, there are also architecture-specific features. For example [x86_64](https://en.wikipedia.org/wiki/X86-64) supports two additional features:
+
+```C
+#define CLOCK_EVT_FEAT_C3STOP		0x000008
+```
+
+The first one, `CLOCK_EVT_FEAT_C3STOP`, means that a clock event device will be stopped in the [C3](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) state. Additionally the `clock_event_device` structure has `mult` and `shift` fields, just like the `clocksource` structure. The `clock_event_device` structure also contains other fields, but we will consider them later.
+
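+Putting the fields mentioned so far together, a trimmed-down sketch of the `clock_event_device` structure might look like this (the field order and the full set of fields differ between kernel versions, so treat this only as an overview):
+
+```C
+struct clock_event_device {
+	void			(*event_handler)(struct clock_event_device *);
+	int			(*set_next_event)(unsigned long evt,
+						  struct clock_event_device *);
+	ktime_t			next_event;
+	u32			mult;
+	u32			shift;
+	enum clock_event_state	state_use_accessors;
+	unsigned int		features;
+	/* callbacks used to switch the state of the device */
+	int			(*set_state_periodic)(struct clock_event_device *);
+	int			(*set_state_oneshot)(struct clock_event_device *);
+	int			(*set_state_shutdown)(struct clock_event_device *);
+	const char		*name;
+	int			rating;
+	const struct cpumask	*cpumask;
+	struct list_head	list;
+	/* ... other fields ... */
+} ____cacheline_aligned;
+```
+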
+After we have considered part of the `clock_event_device` structure, it is time to look at the `API` of the `clockevents` framework. To work with a clock event device, first of all we need to initialize a `clock_event_device` structure and register a clock events device. The `clockevents` framework provides the following `API` for registration of clock event devices:
+
+```C
+void clockevents_register_device(struct clock_event_device *dev)
+{
+   ...
+   ...
+   ...
+}
+```
+
+This function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and as we may see, the `clockevents_register_device` function takes only one parameter:
+
+* address of a `clock_event_device` structure which represents a clock event device.
+
+So, to register a clock event device, at first we need to initialize a `clock_event_device` structure with the parameters of a certain clock event device. Let's take a look at one random clock event device in the Linux kernel source code. We can find one in the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory or try to take a look at an architecture-specific clock event device. Let's take for example - the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf). You can find its implementation in [drivers/clocksource/timer-atmel-pit.c](https://github.com/torvalds/linux/tree/master/drivers/clocksource/timer-atmel-pit.c).
+
+First of all let's look at initialization of the `clock_event_device` structure. This occurs in the `at91sam926x_pit_common_init` function:
+
+```C
+struct pit_data {
+    ...
+    ...
+    struct clock_event_device       clkevt;
+    ...
+    ...
+};
+
+static void __init at91sam926x_pit_common_init(struct pit_data *data)
+{
+    ...
+    ...
+    ...
+    data->clkevt.name = "pit";
+    data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
+    data->clkevt.shift = 32;
+    data->clkevt.mult = div_sc(pit_rate, NSEC_PER_SEC, data->clkevt.shift);
+    data->clkevt.rating = 100;
+    data->clkevt.cpumask = cpumask_of(0);
+
+    data->clkevt.set_state_shutdown = pit_clkevt_shutdown;
+    data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
+    data->clkevt.resume = at91sam926x_pit_resume;
+    data->clkevt.suspend = at91sam926x_pit_suspend;
+    ...
+}
+```
+
+Here we can see that `at91sam926x_pit_common_init` takes one parameter - a pointer to the `pit_data` structure which contains the `clock_event_device` structure which will contain clock event related information for the `at91sam926x` [Periodic Interval Timer](https://en.wikipedia.org/wiki/Programmable_interval_timer). At the start we fill in the `name` of the timer device and its `features`. In our case we deal with a periodic timer which, as we already know, may be programmed to generate events periodically.
+
+The next two fields `shift` and `mult` are familiar to us. They will be used to translate the counter of our timer to nanoseconds. After this we set the rating of the timer to `100`. This means that if there are no timers with a higher rating in the system, this timer will be used for timekeeping. The next field - `cpumask` - indicates for which processors in the system the device will work. In our case, the device will work for the first processor. The `cpumask_of` macro is defined in the [include/linux/cpumask.h](https://github.com/torvalds/linux/tree/master/include/linux/cpumask.h) header file and just expands to the call of the:
+
+```C
+#define cpumask_of(cpu) (get_cpu_mask(cpu))
+```
+
+Where the `get_cpu_mask` function returns the cpumask containing just the given `cpu` number. You may read more about the `cpumasks` concept in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part. In the last four lines of code we set the callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state.
+
+After we have finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following function:
+
+```C
+clockevents_register_device(&data->clkevt);
+```
+
+Now we can consider the implementation of the `clockevents_register_device` function. As I already wrote above, this function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and starts from the initialization of the initial event device state:
+
+```C
+clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED);
+```
+
+Actually, an event device may be in one of these states:
+
+```C
+enum clock_event_state {
+	CLOCK_EVT_STATE_DETACHED,
+	CLOCK_EVT_STATE_SHUTDOWN,
+	CLOCK_EVT_STATE_PERIODIC,
+	CLOCK_EVT_STATE_ONESHOT,
+	CLOCK_EVT_STATE_ONESHOT_STOPPED,
+};
+```
+
+Where:
+
+* `CLOCK_EVT_STATE_DETACHED` - a clock event device is not used by the `clockevents` framework. Actually it is the initial state of all clock event devices;
+* `CLOCK_EVT_STATE_SHUTDOWN` - a clock event device is powered-off;
+* `CLOCK_EVT_STATE_PERIODIC` - a clock event device may be programmed to generate event periodically;
+* `CLOCK_EVT_STATE_ONESHOT`  - a clock event device may be programmed to generate event only once;
+* `CLOCK_EVT_STATE_ONESHOT_STOPPED` - a clock event device was programmed to generate event only once and now it is temporary stopped.
+
+The implementation of the `clockevent_set_state` function is pretty easy:
+
+```C
+static inline void clockevent_set_state(struct clock_event_device *dev,
+					enum clock_event_state state)
+{
+	dev->state_use_accessors = state;
+}
+```
+
+As we can see, it just fills the `state_use_accessors` field of the given `clock_event_device` structure with the given value which in our case is `CLOCK_EVT_STATE_DETACHED`. Actually all clock event devices have this initial state during registration. The `state_use_accessors` field of the `clock_event_device` structure provides the `current` state of the clock event device.
+
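+The read-side counterpart of this helper is just as simple; as far as I can tell it looks like this, and it will be used a bit later when we switch states:
+
+```C
+static inline enum clock_event_state
+clockevent_get_state(struct clock_event_device *dev)
+{
+	return dev->state_use_accessors;
+}
+```
+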
+After we have set the initial state of the given `clock_event_device` structure, we check that the `cpumask` of the given clock event device is not zero:
+
+```C
+if (!dev->cpumask) {
+	WARN_ON(num_possible_cpus() > 1);
+	dev->cpumask = cpumask_of(smp_processor_id());
+}
+```
+
+Remember that we have set the `cpumask` of the `at91sam926x` periodic timer to the first processor. If the `cpumask` field is zero, we check the number of possible processors in the system and print a warning message if it is greater than one. Additionally we set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.
+
+After this check we protect the actual code of the clock event device registration by calling the following macros:
+
+```C
+raw_spin_lock_irqsave(&clockevents_lock, flags);
+...
+...
+...
+raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+```
+
+Additionally the `raw_spin_lock_irqsave` and the `raw_spin_unlock_irqrestore` macros disable local interrupts, however interrupts on other processors may still occur. We need to do this to prevent a potential [deadlock](https://en.wikipedia.org/wiki/Deadlock) while we are adding a new clock event device to the list of clock event devices and an interrupt occurs from another clock event device.
+
+We can see the following code of the clock event device registration between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros:
+
+```C
+list_add(&dev->list, &clockevent_devices);
+tick_check_new_device(dev);
+clockevents_notify_released();
+```
+
+First of all we add the given clock event device to the list of clock event devices which is represented by `clockevent_devices`:
+
+```C
+static LIST_HEAD(clockevent_devices);
+```
+
+At the next step we call the `tick_check_new_device` function which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the newly registered clock event device should be used or not. The `tick_check_new_device` function gets the currently registered tick device, which is represented by the `tick_device` structure, and compares their ratings and features. Actually a device with the `CLOCK_EVT_FEAT_ONESHOT` feature is preferred:
+
+```C
+static bool tick_check_preferred(struct clock_event_device *curdev,
+				 struct clock_event_device *newdev)
+{
+	if (!(newdev->features & CLOCK_EVT_FEAT_ONESHOT)) {
+		if (curdev && (curdev->features & CLOCK_EVT_FEAT_ONESHOT))
+			return false;
+		if (tick_oneshot_mode_active())
+			return false;
+	}
+
+	return !curdev ||
+		newdev->rating > curdev->rating ||
+	       !cpumask_equal(curdev->cpumask, newdev->cpumask);
+}
+```
+
+If the newly registered clock event device is preferable to the old tick device, we exchange the old and the newly registered devices and install the new device:
+
+```C
+clockevents_exchange_device(curdev, newdev);
+tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
+```
+
+The `clockevents_exchange_device` function releases, or in other words deletes, the old clock event device from the `clockevent_devices` list. The next function - `tick_setup_device`, as we may understand from its name, sets up the new tick device. This function checks the mode of the newly registered clock event device and calls the `tick_setup_periodic` function or the `tick_setup_oneshot` function depending on the tick device mode:
+
+```C
+if (td->mode == TICKDEV_MODE_PERIODIC)
+	tick_setup_periodic(newdev, 0);
+else
+	tick_setup_oneshot(newdev, handler, next_event);
+```
+
+Both of these functions call the `clockevents_switch_state` function to change the state of the clock event device and the `clockevents_program_event` function to program the next event of the clock event device based on the delta between the current time and the time of the next event, clamped to the minimum and maximum deltas supported by the device. The `tick_setup_periodic`:
+
+```C
+clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
+clockevents_program_event(dev, next, false))
+```
+
+and the `tick_setup_oneshot`:
+
+```C
+clockevents_switch_state(newdev, CLOCK_EVT_STATE_ONESHOT);
+clockevents_program_event(newdev, next_event, true);
+```
+
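+The `clockevents_program_event` function essentially converts the requested expiry time into device ticks using the `mult` and `shift` fields we saw earlier and programs the device through its `set_next_event` callback. A heavily simplified sketch (error handling, various checks and the `force` retry path are omitted) looks roughly like this:
+
+```C
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
+			      bool force)
+{
+	unsigned long long clc;
+	int64_t delta;
+
+	/* remember the requested expiry time */
+	dev->next_event = expires;
+
+	/* distance to the event in nanoseconds */
+	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	/* the event is already in the past; the real function may retry
+	 * with a minimum delta when the force argument is set */
+	if (delta <= 0)
+		return -ETIME;
+
+	/* clamp the delta to the limits of the given device */
+	delta = min(delta, (int64_t) dev->max_delta_ns);
+	delta = max(delta, (int64_t) dev->min_delta_ns);
+
+	/* convert nanoseconds to device ticks and program the device */
+	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
+	return dev->set_next_event((unsigned long) clc, dev);
+}
+```
+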
+The `clockevents_switch_state` function checks that the clock event device is not in the given state and calls the `__clockevents_switch_state` function from the same source code file:
+
+```C
+if (clockevent_get_state(dev) != state) {
+	if (__clockevents_switch_state(dev, state))
+		return;
+```
+
+The `__clockevents_switch_state` function just calls a certain callback depending on the given state:
+
+```C
+static int __clockevents_switch_state(struct clock_event_device *dev,
+				      enum clock_event_state state)
+{
+	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
+		return 0;
+
+	switch (state) {
+	case CLOCK_EVT_STATE_DETACHED:
+	case CLOCK_EVT_STATE_SHUTDOWN:
+		if (dev->set_state_shutdown)
+			return dev->set_state_shutdown(dev);
+		return 0;
+
+	case CLOCK_EVT_STATE_PERIODIC:
+		if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC))
+			return -ENOSYS;
+		if (dev->set_state_periodic)
+			return dev->set_state_periodic(dev);
+		return 0;
+    ...
+    ...
+    ...
+```
+
+In our case, for the `at91sam926x` periodic timer, the new state is `CLOCK_EVT_STATE_PERIODIC` and the device provides the corresponding callback:
+
+```C
+data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
+data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
+```
+
+So, the `pit_clkevt_set_periodic` callback will be called. If we read the documentation of the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf), we will see that there is a `Periodic Interval Timer Mode Register` which allows us to control the periodic interval timer.
+
+It looks like:
+
+```
+31                                                   25        24
++---------------------------------------------------------------+
+|                                          |  PITIEN  |  PITEN  |
++---------------------------------------------------------------+
+23                            19                               16
++---------------------------------------------------------------+
+|                             |               PIV               |
++---------------------------------------------------------------+
+15                                                              8
++---------------------------------------------------------------+
+|                            PIV                                |
++---------------------------------------------------------------+
+7                                                               0
++---------------------------------------------------------------+
+|                            PIV                                |
++---------------------------------------------------------------+
+```
+
+Where `PIV` or `Periodic Interval Value` defines the value compared with the primary `20-bit` counter of the Periodic Interval Timer, `PITEN` or `Periodic Interval Timer Enabled` enables the timer when the bit is `1`, and `PITIEN` or `Periodic Interval Timer Interrupt Enable` enables the timer interrupt when the bit is `1`. So, to set periodic mode, we need to set bits `24` and `25` in the `Periodic Interval Timer Mode Register`. And we do exactly this in the `pit_clkevt_set_periodic` function:
+
+```C
+static int pit_clkevt_set_periodic(struct clock_event_device *dev)
+{
+        struct pit_data *data = clkevt_to_pit_data(dev);
+        ...
+        ...
+        ...
+        pit_write(data->base, AT91_PIT_MR,
+                  (data->cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN);
+
+        return 0;
+}
+```
+
+Where the `AT91_PIT_MR`, `AT91_PIT_PITEN` and the `AT91_PIT_PITIEN` macros are declared as:
+
+```C
+#define AT91_PIT_MR             0x00
+#define AT91_PIT_PITIEN       BIT(25)
+#define AT91_PIT_PITEN        BIT(24)
+```
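+
+To make the relationship between the timer rate and the `PIV` value more tangible, here is a small hedged userspace sketch (the `pit_rate` and `hz` values are made up for illustration; the real driver derives the rate from the master clock) which composes such a mode register value:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+#define AT91_PIT_PITIEN (1u << 25)  /* Periodic Interval Timer Interrupt Enable */
+#define AT91_PIT_PITEN  (1u << 24)  /* Periodic Interval Timer Enabled          */
+
+int main(void)
+{
+    /* Hypothetical numbers: timer input clock and desired tick frequency. */
+    uint32_t pit_rate = 6250000;  /* for example, a 100 MHz master clock / 16 */
+    uint32_t hz       = 100;      /* ticks per second                         */
+
+    /* Input clocks between two ticks; must fit into the 20-bit PIV field. */
+    uint32_t cycle = pit_rate / hz;
+
+    /* The driver programs PIV = cycle - 1 and sets both enable bits. */
+    uint32_t mr = (cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN;
+
+    printf("PIV = 0x%05x, mode register = 0x%08x\n", cycle - 1, mr);
+    return 0;
+}
+```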
+
+After the setup of the new clock event device is finished, we can return to the `clockevents_register_device` function. The last function called there is:
+
+```C
+clockevents_notify_released();
+```
+
+This function checks the `clockevents_released` list which contains released clock event devices (remember, they may appear there after a call of the `clockevents_exchange_device` function). If this list is not empty, we go through the clock event devices in the `clockevents_released` list, delete each of them from this list, add it back to the `clockevent_devices` list and re-check it as a candidate tick device:
+
+```C
+static void clockevents_notify_released(void)
+{
+	struct clock_event_device *dev;
+
+	while (!list_empty(&clockevents_released)) {
+		dev = list_entry(clockevents_released.next,
+				 struct clock_event_device, list);
+		list_del(&dev->list);
+		list_add(&dev->list, &clockevent_devices);
+		tick_check_new_device(dev);
+	}
+}
+```
+
+That's all. From this moment we have registered a new clock event device. So the usage of the `clockevents` framework is simple and clear. Architectures register their clock event devices in the clock events core, and users of the clockevents core can get clock event devices for their use. The `clockevents` framework provides notification mechanisms for various clock related management events, such as a clock event device being registered or unregistered, a processor being offlined in a system which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and so on.
+
+We only saw the implementation of the `clockevents_register_device` function, but generally the clock event layer [API](https://en.wikipedia.org/wiki/Application_programming_interface) is small. Besides the `API` for clock event device registration, the `clockevents` framework provides functions to schedule the next event interrupt, a clock event device notification service and support for suspend and resume of clock event devices.
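+
+To give a feeling for this `API`, below is a hedged sketch of how a hypothetical driver might register its clock event device (the `foo` timer, its rating and its callbacks are invented for illustration; real drivers like the `at91sam926x` one above follow the same pattern, typically via the `clockevents_config_and_register` helper, which configures the device and then calls the `clockevents_register_device` function that we saw):
+
+```C
+#include <linux/clockchips.h>
+#include <linux/cpumask.h>
+#include <linux/init.h>
+
+/* Invented hardware callbacks of an imaginary "foo" timer. */
+static int foo_set_periodic(struct clock_event_device *evt)
+{
+	/* program the hardware for periodic mode here */
+	return 0;
+}
+
+static int foo_set_shutdown(struct clock_event_device *evt)
+{
+	/* stop the hardware timer here */
+	return 0;
+}
+
+static struct clock_event_device foo_clockevent = {
+	.name			= "foo-timer",
+	.features		= CLOCK_EVT_FEAT_PERIODIC,
+	.rating			= 150,
+	.set_state_periodic	= foo_set_periodic,
+	.set_state_shutdown	= foo_set_shutdown,
+};
+
+static void __init foo_timer_init(u32 freq)
+{
+	foo_clockevent.cpumask = cpumask_of(0);
+	/* configure the device and register it in the clockevents core */
+	clockevents_config_and_register(&foo_clockevent, freq, 1, 0xfffff);
+}
+```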
+
+If you want to know more about the `clockevents` API, you can start by researching the following source code and header files: [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c), [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) and [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h).
+
+That's all.
+
+Conclusion
+-------------------------------------------------------------------------------
+
+This is the end of the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `timers` concept. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+-------------------------------------------------------------------------------
+
+* [timekeeping documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt)
+* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
+* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
+* [ACPI pdf](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
+* [x86](https://en.wikipedia.org/wiki/X86)
+* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [powerpc](https://en.wikipedia.org/wiki/PowerPC)
+* [frequency](https://en.wikipedia.org/wiki/Frequency)
+* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
+* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
+* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
+* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
+* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
+* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) 
+* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
+* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
+* [deadlock](https://en.wikipedia.org/wiki/Deadlock)
+* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)

+ 413 - 0
Timers/timers-6.md

@@ -0,0 +1,413 @@
+Timers and time management in the Linux kernel. Part 6.
+================================================================================
+
+x86_64 related clock sources
+--------------------------------------------------------------------------------
+
+This is the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html) we saw the `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe the implementation of the [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (you can read more about the `clocksource` concept in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter).
+
+First of all we must know which clock sources may be used on the `x86` architecture. It is easy to find out from [sysfs](https://en.wikipedia.org/wiki/Sysfs), namely from the content of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. The `/sys/devices/system/clocksource/clocksourceN` directory provides two special files for this:
+
+* `available_clocksource` - provides information about available clock sources in the system;
+* `current_clocksource`   - provides information about currently used clock source in the system.
+
+So, let's look:
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
+tsc hpet acpi_pm 
+```
+
+We can see that there are three registered clock sources in my system:
+
+* `tsc` - [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter);
+* `hpet` - [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer);
+* `acpi_pm` - [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf).
+
+Now let's look at the second file which shows the best clock source (the clock source which has the best rating in the system):
+
+```
+$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
+tsc
+```
+
+For me it is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, which describes the internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is the clock source with the best (highest) rating, or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).
+
+The frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. The frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on the processor. For example, on older processors the `Time Stamp Counter` counted internal processor clock cycles. This means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors. Newer processors have an `invariant Time Stamp Counter` that increments at a constant rate in all operational states of the processor. Actually we can get its frequency in the output of `/proc/cpuinfo`. For example, for the first processor in the system:
+
+```
+$ cat /proc/cpuinfo
+...
+model name	: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
+...
+```
+
+And although the Intel manual says that the frequency of the `Time Stamp Counter`, while constant, is not necessarily the maximum qualified frequency of the processor, or the frequency given in the brand string, we may still see that it is much higher than the frequency of the `ACPI PM` timer or of the `High Precision Event Timer`. And we can see that the clock source with the best rating or highest frequency is the current one in the system.
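+
+We can roughly confirm this from user space. The following hedged sketch (it assumes an `x86_64` processor with an invariant `Time Stamp Counter` and uses the compiler's `__rdtsc` intrinsic) counts how many `TSC` cycles pass during one second measured by `CLOCK_MONOTONIC`:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+#include <time.h>
+#include <unistd.h>
+#include <x86intrin.h>   /* __rdtsc() */
+
+int main(void)
+{
+    struct timespec start, end;
+    uint64_t tsc_start, tsc_end;
+    double elapsed;
+
+    clock_gettime(CLOCK_MONOTONIC, &start);
+    tsc_start = __rdtsc();
+
+    sleep(1);
+
+    tsc_end = __rdtsc();
+    clock_gettime(CLOCK_MONOTONIC, &end);
+
+    elapsed = (end.tv_sec - start.tv_sec) +
+              (end.tv_nsec - start.tv_nsec) / 1e9;
+
+    printf("estimated TSC frequency: %.3f MHz\n",
+           (tsc_end - tsc_start) / elapsed / 1e6);
+    return 0;
+}
+```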
+
+You can note that besides these three clock sources, we don't see the other two clock sources already familiar to us in the output of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. These clock sources are `jiffies` and `refined_jiffies`. We don't see them because this file lists only high resolution clock sources, or in other words clock sources with the [CLOCK_SOURCE_VALID_FOR_HRES](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h#L113) flag.
+
+As I already wrote above, we will consider all three of these clock sources in this part. We will consider them in the order of their initialization:
+
+* `hpet`;
+* `acpi_pm`;
+* `tsc`.
+
+We can make sure that the order is exactly like this from the output of the [dmesg](https://en.wikipedia.org/wiki/Dmesg) util:
+
+```
+$ dmesg | grep clocksource
+[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
+[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
+[    0.094369] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
+[    0.186498] clocksource: Switched to clocksource hpet
+[    0.196827] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
+[    1.413685] tsc: Refined TSC clocksource calibration: 3999.981 MHz
+[    1.413688] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x73509721780, max_idle_ns: 881591102108 ns
+[    2.413748] clocksource: Switched to clocksource tsc
+```
+
+The first clock source is the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), so let's start from it.
+
+High Precision Event Timer
+--------------------------------------------------------------------------------
+
+The implementation of the `High Precision Event Timer` for the [x86](https://en.wikipedia.org/wiki/X86) architecture is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file. Its initialization starts from the call of the `hpet_enable` function. This function is called during Linux kernel initialization. If we look into the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, we will see that after all the architecture-specific stuff is initialized, the early console is disabled and the time management subsystem is already ready, we get to the call of the following function:
+
+```C
+if (late_time_init)
+	late_time_init();
+```
+
+which does the initialization of the late architecture specific timers after the early jiffy counter is already initialized. The definition of the `late_time_init` function for the `x86` architecture is located in the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. It looks pretty easy:
+
+```C
+static __init void x86_late_time_init(void)
+{
+	x86_init.timers.timer_init();
+	tsc_init();
+}
+```
+
+As we may see, it does the initialization of the `x86` related timer and the initialization of the `Time Stamp Counter`. The second we will see in the next paragraph, but now let's consider the call of the `x86_init.timers.timer_init` function. The `timer_init` points to the `hpet_time_init` function from the same source code file. We can verify this by looking at the definition of the `x86_init` structure from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c):
+
+```C
+struct x86_init_ops x86_init __initdata = {
+   ...
+   ...
+   ...
+   .timers = {
+		.setup_percpu_clockev	= setup_boot_APIC_clock,
+		.timer_init		= hpet_time_init,
+		.wallclock_init		= x86_init_noop,
+   },
+   ...
+   ...
+   ...
+```
+
+The `hpet_time_init` function sets up the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) if we can not enable the `High Precision Event Timer` and sets up the default timer [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) for the enabled timer:
+
+```C
+void __init hpet_time_init(void)
+{
+	if (!hpet_enable())
+		setup_pit_timer();
+	setup_default_timer_irq();
+}
+```
+
+First of all the `hpet_enable` function checks whether we can enable the `High Precision Event Timer` in the system by calling the `is_hpet_capable` function, and if we can, we map a virtual address space for it:
+
+```C
+int __init hpet_enable(void)
+{
+	if (!is_hpet_capable())
+		return 0;
+
+    hpet_set_mapping();
+}
+```
+
+The `is_hpet_capable` function checks that we didn't pass `hpet=disable` on the kernel command line and that the `hpet_address` was received from the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) `HPET` table. The `hpet_set_mapping` function just maps the virtual address space for the timer registers:
+
+```C
+hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
+```
+
+As we can read in the  [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf):
+
+> The timer register space is 1024 bytes
+
+So, the `HPET_MMAP_SIZE` is `1024` bytes too:
+
+```C
+#define HPET_MMAP_SIZE		1024
+```
+
+After we have mapped the virtual space for the `High Precision Event Timer`, we read the `HPET_ID` register to get the number of timers:
+
+```C
+id = hpet_readl(HPET_ID);
+
+last = (id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT;
+```
+
+We need this number to allocate the correct amount of space for the `General Configuration Register` of the `High Precision Event Timer`:
+
+```C
+cfg = hpet_readl(HPET_CFG);
+
+hpet_boot_cfg = kmalloc((last + 2) * sizeof(*hpet_boot_cfg), GFP_KERNEL);
+```
+
+After the space is allocated for the configuration register of the `High Precision Event Timer`, we allow the main counter to run and allow timer interrupts if they are enabled, by setting the `HPET_CFG_ENABLE` bit in the configuration register for all timers. In the end we just register the new clock source by calling the `hpet_clocksource_register` function:
+
+```C
+if (hpet_clocksource_register())
+	goto out_nohpet;
+```
+
+which just calls the already familiar
+
+```C
+clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
+```
+
+function, where `clocksource_hpet` is the `clocksource` structure with the rating `250` (remember, the rating of the previous `refined_jiffies` clock source was `2`), the name `hpet` and the `read_hpet` callback for reading the atomic counter provided by the `High Precision Event Timer`:
+
+```C
+static struct clocksource clocksource_hpet = {
+	.name		= "hpet",
+	.rating		= 250,
+	.read		= read_hpet,
+	.mask		= HPET_MASK,
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+	.resume		= hpet_resume_counter,
+	.archdata	= { .vclock_mode = VCLOCK_HPET },
+};
+```
+
+After the `clocksource_hpet` is registered, we can return to the `hpet_time_init()` function from the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. Remember that its last step is the call of the:
+
+```C
+setup_default_timer_irq();
+```
+
+function in the `hpet_time_init()`. The `setup_default_timer_irq` function checks the existence of `legacy` IRQs, or in other words support for the [i8259](https://en.wikipedia.org/wiki/Intel_8259), and sets up [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC) depending on this.
+
+That's all. From this moment the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source is registered in the Linux kernel `clocksource` framework and may be used from generic kernel code via the `read_hpet` callback:
+```C
+static cycle_t read_hpet(struct clocksource *cs)
+{
+	return (cycle_t)hpet_readl(HPET_COUNTER);
+}
+```
+
+function which just reads and returns the atomic counter from the `Main Counter Register`.
+
+ACPI PM timer
+--------------------------------------------------------------------------------
+
+The second clock source is the [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf). The implementation of this clock source is located in the [drivers/clocksource/acpi_pm.c](https://github.com/torvalds/linux/blob/master/drivers/clocksource/acpi_pm.c) source code file and starts from the call of the `init_acpi_pm_clocksource` function during the `fs` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html).
+
+If we look at the implementation of the `init_acpi_pm_clocksource` function, we will see that it starts from a check of the value of the `pmtmr_ioport` variable:
+
+```C
+static int __init init_acpi_pm_clocksource(void)
+{
+    ...
+    ...
+    ...
+	if (!pmtmr_ioport)
+		return -ENODEV;
+    ...
+    ...
+    ...
+```
+
+This `pmtmr_ioport` variable contains the extended address of the `Power Management Timer Control Register Block`. It gets its value in the `acpi_parse_fadt` function which is defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) source code file. This function parses the `FADT` or `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and tries to get the value of the `X_PM_TMR_BLK` field which contains the extended address of the `Power Management Timer Control Register Block`, represented in `Generic Address Structure` format:
+
+```C
+static int __init acpi_parse_fadt(struct acpi_table_header *table)
+{
+#ifdef CONFIG_X86_PM_TIMER
+        ...
+        ...
+        ...
+		pmtmr_ioport = acpi_gbl_FADT.xpm_timer_block.address;
+        ...
+        ...
+        ...
+#endif
+	return 0;
+}
+```
+
+So, if the `CONFIG_X86_PM_TIMER` Linux kernel configuration option is disabled or something went wrong in the `acpi_parse_fadt` function, we can't access the `Power Management Timer` register and we return from the `init_acpi_pm_clocksource`. Otherwise, if the value of the `pmtmr_ioport` variable is not zero, we check the rate of this timer and register this clock source by calling the:
+
+```C
+clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
+```
+    
+function. After the call of `clocksource_register_hz`, the `acpi_pm` clock source will be registered in the `clocksource` framework of the Linux kernel:
+
+```C
+static struct clocksource clocksource_acpi_pm = {
+	.name		= "acpi_pm",
+	.rating		= 200,
+	.read		= acpi_pm_read,
+	.mask		= (cycle_t)ACPI_PM_MASK,
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+```
+
+with the rating `200` and the `acpi_pm_read` callback to read the atomic counter provided by the `acpi_pm` clock source. The `acpi_pm_read` function just executes the `read_pmtmr` function:
+
+```C
+static cycle_t acpi_pm_read(struct clocksource *cs)
+{
+	return (cycle_t)read_pmtmr();
+}
+```
+
+which reads the value of the `Power Management Timer` register. This register has the following structure:
+
+```
++-------------------------------+----------------------------------+
+|                               |                                  |
+|  upper eight bits of a        |      running count of the        |
+| 32-bit power management timer |     power management timer       |
+|                               |                                  |
++-------------------------------+----------------------------------+
+31          E_TMR_VAL           24               TMR_VAL           0
+```
+
+The address of this register is stored in the `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and we already have it in the `pmtmr_ioport` variable. So, the implementation of the `read_pmtmr` function is pretty easy:
+
+```C
+static inline u32 read_pmtmr(void)
+{
+	return inl(pmtmr_ioport) & ACPI_PM_MASK;
+}
+```
+
+We just read the value of the `Power Management Timer` register and mask it so that only its lower `24` bits remain.
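+
+This `24`-bit width is exactly why the `mask` field of the `clocksource` structure matters: the counter wraps around quite often, so a delta between two reads must be computed modulo the mask. A small hedged sketch with made up counter values:
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+#define ACPI_PM_MASK 0xFFFFFFu  /* the Power Management Timer is only 24 bits wide */
+
+/* Elapsed ticks between two reads of a 24-bit counter, correct across a wrap. */
+static uint32_t pm_timer_delta(uint32_t prev, uint32_t now)
+{
+    return (now - prev) & ACPI_PM_MASK;
+}
+
+int main(void)
+{
+    /* Hypothetical readings: the counter wrapped around between them. */
+    uint32_t prev = 0xFFFFF0;
+    uint32_t now  = 0x000010;
+
+    printf("elapsed ticks: %u\n", pm_timer_delta(prev, now));  /* prints 32 */
+    return 0;
+}
+```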
+
+That's all. Now we move to the last clock source in this part - `Time Stamp Counter`.
+
+Time Stamp Counter
+--------------------------------------------------------------------------------
+
+The third and last clock source in this part is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) clock source and its implementation is located in the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file. We already saw the `x86_late_time_init` function in this part and the initialization of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) starts from this place. This function calls the `tsc_init()` function from the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file.
+
+At the beginning of the `tsc_init` function we can see a check of whether the processor supports the `Time Stamp Counter`:
+
+```C
+void __init tsc_init(void)
+{
+	u64 lpj;
+	int cpu;
+
+	if (!cpu_has_tsc) {
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+		return;
+	}
+    ...
+    ...
+    ...
+```
+
+The `cpu_has_tsc` macro expands to the call of the `cpu_has` macro:
+
+```C
+#define cpu_has_tsc		boot_cpu_has(X86_FEATURE_TSC)
+
+#define boot_cpu_has(bit)	cpu_has(&boot_cpu_data, bit)
+
+#define cpu_has(c, bit)							\
+	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
+	 test_cpu_cap(c, bit))
+```
+
+which checks the given bit (the `X86_FEATURE_TSC` in our case) in the `boot_cpu_data` array which is filled during early Linux kernel initialization. If the processor supports the `Time Stamp Counter`, we get the frequency of the `Time Stamp Counter` by calling the `calibrate_tsc` function from the same source code file, which tries to get the frequency from different sources like a [Model Specific Register](https://en.wikipedia.org/wiki/Model-specific_register), calibration over the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) and so on. After this we initialize the frequency and scale factor for all processors in the system:
+
+```C
+tsc_khz = x86_platform.calibrate_tsc();
+cpu_khz = tsc_khz;
+
+for_each_possible_cpu(cpu) {
+	cyc2ns_init(cpu);
+	set_cyc2ns_scale(cpu_khz, cpu);
+}
+```
+
+because only the first, bootstrap, processor will call the `tsc_init`. After this we check that the `Time Stamp Counter` is not disabled:
+
+```
+if (tsc_disabled > 0)
+	return;
+...
+...
+...
+check_system_tsc_reliable();
+```
+
+and call the `check_system_tsc_reliable` function which sets `tsc_clocksource_reliable` if the bootstrap processor has the `X86_FEATURE_TSC_RELIABLE` feature. Note that we went through the `tsc_init` function, but did not register our clock source. The actual registration of the `Time Stamp Counter` clock source occurs in the:
+
+```C
+static int __init init_tsc_clocksource(void)
+{
+	if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
+		return 0;
+    ...
+    ...
+    ...
+    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+		clocksource_register_khz(&clocksource_tsc, tsc_khz);
+		return 0;
+	}
+```
+
+function. This function is called during the `device` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html). We do it this way to be sure that the `Time Stamp Counter` clock source will be registered after the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source.
+
+After this, all three clock sources are registered in the `clocksource` framework and the `Time Stamp Counter` clock source is selected as the active one, because it has the highest rating among the registered clock sources:
+
+```C
+static struct clocksource clocksource_tsc = {
+	.name                   = "tsc",
+	.rating                 = 300,
+	.read                   = read_tsc,
+	.mask                   = CLOCKSOURCE_MASK(64),
+	.flags                  = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY,
+	.archdata               = { .vclock_mode = VCLOCK_TSC },
+};
+```
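+
+As a reminder from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, the `clocksource` framework converts raw cycle counts into [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) with a `mult`/`shift` pair: `ns = (cycles * mult) >> shift`. Below is a hedged sketch with an illustrative `4 GHz` `TSC`-like frequency (the kernel computes the real `mult` and `shift` values itself during clock source registration):
+
+```C
+#include <stdio.h>
+#include <stdint.h>
+
+/* The conversion used by the clocksource core: ns = (cycles * mult) >> shift. */
+static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
+{
+    return (cycles * mult) >> shift;
+}
+
+int main(void)
+{
+    /*
+     * Illustrative values for a 4 GHz counter: with shift = 22,
+     * mult = (10^9 << 22) / (4 * 10^9) = 1048576, so one cycle is 0.25 ns.
+     */
+    uint32_t shift = 22;
+    uint32_t mult  = (uint32_t)(((uint64_t)1000000000 << shift) / 4000000000ULL);
+
+    printf("mult = %u\n", mult);
+    printf("4000000000 cycles -> %llu ns\n",
+           (unsigned long long)cyc2ns(4000000000ULL, mult, shift));
+    return 0;
+}
+```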
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clockevents` framework. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about the three different clock sources which are used on the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be the last part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) are implemented in the Linux kernel.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [x86](https://en.wikipedia.org/wiki/X86)
+* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
+* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
+* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [ACPI Power Management Timer (PDF)](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
+* [frequency](https://en.wikipedia.org/wiki/Frequency)
+* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
+* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
+* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) 
+* [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf)
+* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
+* [i8259](https://en.wikipedia.org/wiki/Intel_8259)
+* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html)

+ 421 - 0
Timers/timers-7.md

@@ -0,0 +1,421 @@
+Timers and time management in the Linux kernel. Part 7.
+================================================================================
+
+Time related system calls in the Linux kernel
+--------------------------------------------------------------------------------
+
+This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html) we saw some [x86_64](https://en.wikipedia.org/wiki/X86-64) related clock sources like the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course the kernel is not the only one that needs the `time` concept. Our programs need to know the time too. In this part, we will consider the implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:
+
+* `clock_gettime`;
+* `gettimeofday`;
+* `nanosleep`.
+
+We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and see all the way from the call of the [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of a certain system call. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of a certain system call, we will consider only the [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of these system calls, as this book is related to this architecture.
+
+Additionally, we will not consider the concept of system calls in this part, but only the implementations of these three system calls in the Linux kernel. If you are interested in what a `system call` is, there is a special [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about this.
+
+So, let's start from the `gettimeofday` system call.
+
+Implementation of the `gettimeofday` system call
+--------------------------------------------------------------------------------
+
+As we can understand from its name, the `gettimeofday` function returns the current time. First of all, let's look at the following simple example:
+
+```C
+#include <time.h>
+#include <sys/time.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+    char buffer[40];
+    struct timeval time;
+        
+    gettimeofday(&time, NULL);
+
+    strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
+    printf("%s\n",buffer);
+
+    return 0;
+}
+```
+
+As you can see, here we call the `gettimeofday` function which takes two parameters. The first parameter is a pointer to the `timeval` structure which represents an elapsed time:
+
+```C
+struct timeval {
+    time_t      tv_sec;     /* seconds */
+    suseconds_t tv_usec;    /* microseconds */
+};
+```
+
+The second parameter of the `gettimeofday` function is a pointer to the `timezone` structure which represents a timezone. In our example, we pass the address of the `timeval time` to the `gettimeofday` function; the Linux kernel fills the given `timeval` structure and returns it back to us. Additionally, we format the time with the `strftime` function to get something more human readable than the raw number of elapsed seconds. Let's look at the result:
+
+```C
+~$ gcc date.c -o date
+~$ ./date
+Current date/time: 03-26-2016/16:42:02
+```
+
+As you may already know, a userspace application does not invoke a system call in kernel space directly. Before the actual system call entry is reached, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you may already know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) which describes this concept).
+
+The `glibc` implementation of `gettimeofday` tries to resolve the given symbol; in our case this symbol is `__vdso_gettimeofday`. This is done by calling the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fall back to the usual system call:
+
+```C
+return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
+  ?: (void*) (&__gettimeofday_syscall));
+```
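+
+The `vDSO` here is a small shared object which the kernel maps into the address space of every process, so the resolved `__vdso_gettimeofday` can be called without entering the kernel at all. As a hedged illustration, we can ask the auxiliary vector where the `vDSO` is mapped in our own process:
+
+```C
+#include <stdio.h>
+#include <elf.h>        /* AT_SYSINFO_EHDR */
+#include <sys/auxv.h>   /* getauxval()     */
+
+int main(void)
+{
+    /* Base address of the vDSO ELF image mapped into this process. */
+    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
+
+    if (vdso)
+        printf("vDSO is mapped at 0x%lx\n", vdso);
+    else
+        printf("no vDSO found\n");
+
+    return 0;
+}
+```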
+
+The `gettimeofday` entry is located in the [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) source code file. As we can see, the `gettimeofday` is a weak alias of the `__vdso_gettimeofday`:
+
+```C
+int gettimeofday(struct timeval *, struct timezone *)
+	__attribute__((weak, alias("__vdso_gettimeofday")));
+```
+
+The `__vdso_gettimeofday` is defined in the same source code file and calls the `do_realtime` function if the given `timeval` is not null:
+
+```C
+notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+{
+	if (likely(tv != NULL)) {
+		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+			return vdso_fallback_gtod(tv, tz);
+		tv->tv_usec /= 1000;
+	}
+	if (unlikely(tz != NULL)) {
+		tz->tz_minuteswest = gtod->tz_minuteswest;
+		tz->tz_dsttime = gtod->tz_dsttime;
+	}
+
+	return 0;
+}
+```
+
+If `do_realtime` fails, we fall back to the real system call by executing the `syscall` instruction, passing the `__NR_gettimeofday` system call number and the given `timeval` and `timezone`:
+
+```C
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+	long ret;
+
+	asm("syscall" : "=a" (ret) :
+	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
+	return ret;
+}
+```
+
+The `do_realtime` function gets the time data from the `vsyscall_gtod_data` structure which is defined in the [arch/x86/include/asm/vgtod.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vgtod.h#L16) header file and contains a mapping of the `timespec` structure and a couple of fields which are related to the current clock source in the system. This function fills the given `timeval` structure with values from the `vsyscall_gtod_data` which contains time related data that is updated via the timer interrupt.
+
+First of all we try to access the `gtod`, or `global time of day`, i.e. the `vsyscall_gtod_data` structure, via the call of the `gtod_read_begin` function, and we continue to do so until the read is successful:
+
+```C
+do {
+	seq = gtod_read_begin(gtod);
+	mode = gtod->vclock_mode;
+	ts->tv_sec = gtod->wall_time_sec;
+	ns = gtod->wall_time_snsec;
+	ns += vgetsns(&mode);
+	ns >>= gtod->shift;
+} while (unlikely(gtod_read_retry(gtod, seq)));
+
+ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+ts->tv_nsec = ns;
+```
+
+As we get access to the `gtod`, we fill `ts->tv_sec` with `gtod->wall_time_sec`, which stores the current time in seconds obtained from the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) during the initialization of the timekeeping subsystem in the Linux kernel, and we do the same for the nanoseconds value. In the end of this code we just fill the given `timespec` structure with the resulting values.
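+
+The `gtod_read_begin`/`gtod_read_retry` pair implements a lockless `seqcount` read: the reader snapshots a sequence number, reads the data and retries if the writer bumped the sequence number (or is in the middle of an update) in the meantime. A simplified, hedged userspace sketch of the same idea (it deliberately glosses over the memory-ordering details which the kernel handles with barriers):
+
+```C
+#include <stdatomic.h>
+#include <stdio.h>
+
+static _Atomic unsigned int seq;     /* odd while the writer is updating */
+static unsigned long long wall_sec;  /* data protected by the seqcount   */
+
+static void writer_update(unsigned long long new_sec)
+{
+    atomic_fetch_add(&seq, 1);  /* sequence becomes odd: update in progress */
+    wall_sec = new_sec;
+    atomic_fetch_add(&seq, 1);  /* sequence becomes even again              */
+}
+
+static unsigned long long reader_get(void)
+{
+    unsigned int start;
+    unsigned long long value;
+
+    do {
+        start = atomic_load(&seq);
+        value = wall_sec;
+    } while ((start & 1) || start != atomic_load(&seq));
+
+    return value;
+}
+
+int main(void)
+{
+    writer_update(1459000000ULL);
+    printf("seconds: %llu\n", reader_get());
+    return 0;
+}
+```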
+
+That's all about the `gettimeofday` system call. The next system call in our list is the `clock_gettime`.
+
+Implementation of the clock_gettime system call
+--------------------------------------------------------------------------------
+
+The `clock_gettime` function gets the time of the clock which is specified by its first parameter and stores it at the address given by its second parameter. Generally the `clock_gettime` function takes two parameters:
+
+* `clk_id` - clock identifier;
+* `timespec` - address of the `timespec` structure which represents the elapsed time.
+
+Let's look at the following simple example:
+
+```C
+#include <time.h>
+#include <sys/time.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+    struct timespec elapsed_from_boot;
+
+    clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);
+
+    printf("%ld - seconds elapsed from boot\n", (long)elapsed_from_boot.tv_sec);
+    
+    return 0;
+}
+```
+
+which prints `uptime` information:
+
+```C
+~$ gcc uptime.c -o uptime
+~$ ./uptime
+14180 - seconds elapsed from boot
+```
+
+We can easily check the result with the help of the [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime) util:
+
+```
+~$ uptime
+up  3:56
+```
+
+The `elapsed_from_boot.tv_sec` represents elapsed time in seconds, so:
+
+```python
+>>> 14180 // 60
+236
+>>> 14180 // 60 // 60
+3
+>>> 14180 // 60 % 60
+56
+```
+
+The `clk_id` may be one of the following:
+
+* `CLOCK_REALTIME` - system wide clock which measures real or wall-clock time;
+* `CLOCK_REALTIME_COARSE` - faster version of the `CLOCK_REALTIME`;
+* `CLOCK_MONOTONIC` - represents monotonic time since some unspecified starting point; 
+* `CLOCK_MONOTONIC_COARSE` - faster version of the `CLOCK_MONOTONIC`;
+* `CLOCK_MONOTONIC_RAW` - the same as the `CLOCK_MONOTONIC` but provides non [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol) adjusted time;
+* `CLOCK_BOOTTIME` - the same as the `CLOCK_MONOTONIC` but plus time that the system was suspended;
+* `CLOCK_PROCESS_CPUTIME_ID` - per-process time consumed by all threads in the process;
+* `CLOCK_THREAD_CPUTIME_ID` - thread-specific clock.
+
+The `clock_gettime` is not a usual syscall either; like the `gettimeofday`, this system call is placed in the `vDSO` area. The entry of this system call is located in the same source code file - [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) - as for `gettimeofday`.
+
+The implementation of the `clock_gettime` depends on the clock id. If we have passed the `CLOCK_REALTIME` clock id, the `do_realtime` function will be called:
+
+```C
+notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+{
+	switch (clock) {
+	case CLOCK_REALTIME:
+		if (do_realtime(ts) == VCLOCK_NONE)
+			goto fallback;
+		break;
+    ...
+    ...
+    ...
+fallback:
+	return vdso_fallback_gettime(clock, ts);
+}
+```
+
+In other cases, the corresponding `do_{name_of_clock_id}` function is called. The implementations of some of them are similar. For example, if we pass the `CLOCK_MONOTONIC` clock id:
+
+```C
+...
+...
+...
+case CLOCK_MONOTONIC:
+	if (do_monotonic(ts) == VCLOCK_NONE)
+		goto fallback;
+	break;
+...
+...
+...
+```
+
+the `do_monotonic` function will be called, which is very similar to the implementation of the `do_realtime`:
+
+```C
+notrace static int __always_inline do_monotonic(struct timespec *ts)
+{
+	do {
+		seq = gtod_read_begin(gtod);
+		mode = gtod->vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgetsns(&mode);
+		ns >>= gtod->shift;
+	} while (unlikely(gtod_read_retry(gtod, seq)));
+
+	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+	ts->tv_nsec = ns;
+
+	return mode;
+}
+```
+
+We already saw a little about the implementation of this function in the previous paragraph about the `gettimeofday`. The only difference here is that the `sec` and `nsec` of our `timespec` value will be based on `gtod->monotonic_time_sec` instead of `gtod->wall_time_sec`, which maps the value of `tk->tkr_mono.xtime_nsec`, or the number of [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) elapsed.
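+
+The difference between the two clocks is easy to observe from user space: `CLOCK_REALTIME` returns wall-clock time (seconds since the epoch) while `CLOCK_MONOTONIC` counts from an unspecified starting point, usually boot. A small hedged example:
+
+```C
+#include <stdio.h>
+#include <time.h>
+
+int main(void)
+{
+    struct timespec real, mono;
+
+    clock_gettime(CLOCK_REALTIME, &real);
+    clock_gettime(CLOCK_MONOTONIC, &mono);
+
+    printf("CLOCK_REALTIME:  %ld seconds since the epoch\n", (long)real.tv_sec);
+    printf("CLOCK_MONOTONIC: %ld seconds since an unspecified starting point\n",
+           (long)mono.tv_sec);
+
+    return 0;
+}
+```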
+
+That's all.
+
+Implementation of the `nanosleep` system call
+--------------------------------------------------------------------------------
+
+The last system call in our list is the `nanosleep`. As you can understand from its name, this function provides `sleeping` ability. Let's look at the following simple example:
+
+```C
+#include <time.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+int main (void)
+{    
+   struct timespec ts = {5,0};
+
+   printf("sleep five seconds\n");
+   nanosleep(&ts, NULL);
+   printf("end of sleep\n");
+
+   return 0;
+}
+```
+
+If we compile and run it, we will see the first line
+
+```
+~$ gcc sleep_test.c -o sleep
+~$ ./sleep
+sleep five seconds
+end of sleep
+```
+
+and the second line after five seconds.
+
+The `nanosleep` is not located in the `vDSO` area like the `gettimeofday` and the `clock_gettime` functions. So, let's look at how the `real` system call which is located in the kernel space will be called by the standard library. The implementation of the `nanosleep` system call will be invoked with the help of the [syscall](http://www.felixcloutier.com/x86/SYSCALL.html) instruction. Before the execution of the `syscall` instruction, the parameters of the system call must be put into processor [registers](https://en.wikipedia.org/wiki/Processor_register) in the order described in the [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf), or in other words:
+
+* `rdi` - first parameter;
+* `rsi` - second parameter;
+* `rdx` - third parameter;
+* `r10` - fourth parameter;
+* `r8` - fifth parameter;
+* `r9` - sixth parameter.
+
+The `nanosleep` system call has two parameters - two pointers to `timespec` structures. The system call suspends the calling thread until the given timeout has elapsed. Additionally, it will finish early if a signal interrupts its execution. The first parameter is the `timespec` which represents the timeout for the sleep. The second parameter is a pointer to a `timespec` structure too, and it contains the remainder of the time if the call of the `nanosleep` was interrupted.
+
+So the prototype of `nanosleep` looks as follows:
+
+```C
+int nanosleep(const struct timespec *req, struct timespec *rem);
+```
+
+To invoke the system call, we need to put `req` into the `rdi` register and the `rem` parameter into the `rsi` register. The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) does this job in the `INTERNAL_SYSCALL` macro which is located in the [sysdeps/unix/sysv/linux/x86_64/sysdep.h](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h;h=d023d68174d3dfb4e698160b31ae31ad291802e1;hb=HEAD) header file:
+
+```C
+# define INTERNAL_SYSCALL(name, err, nr, args...) \
+  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)
+```
+
+which takes the name of the system call, storage for a possible error during execution of the system call, the number of arguments of the system call and the arguments themselves (you can find all `x86_64` system call numbers in the [system calls table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)). The `INTERNAL_SYSCALL` macro just expands to the call of the `INTERNAL_SYSCALL_NCS` macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the `syscall` instruction and returns the result:
+
+```C
+# define INTERNAL_SYSCALL_NCS(name, err, nr, args...)      \
+  ({									                                      \
+    unsigned long int resultvar;					                          \
+    LOAD_ARGS_##nr (args)						                              \
+    LOAD_REGS_##nr							                                  \
+    asm volatile (							                                  \
+    "syscall\n\t"							                                  \
+    : "=a" (resultvar)							                              \
+    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);   \
+    (long int) resultvar; })
+```
+
+The `LOAD_ARGS_##nr` macro calls the `LOAD_ARGS_N` macro where `N` is the number of arguments of the system call. In our case, it will be the `LOAD_ARGS_2` macro. Ultimately all of these macros will be expanded to the following:
+
+```C
+# define LOAD_REGS_TYPES_1(t1, a1)					   \
+  register t1 _a1 asm ("rdi") = __arg1;					   \
+  LOAD_REGS_0
+
+# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)				   \
+  register t2 _a2 asm ("rsi") = __arg2;					   \
+  LOAD_REGS_TYPES_1(t1, a1)
+...
+...
+...
+```
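+
+Putting this together, here is a hedged userspace sketch which performs the `nanosleep` system call `by hand`: it loads the two parameters into the `rdi` and `rsi` registers and the system call number into `rax`, and then executes the `syscall` instruction (the `rcx`, `r11` and memory clobbers reflect what the `syscall` instruction may destroy; real programs should of course just call the `libc` wrapper):
+
+```C
+#include <stdio.h>
+#include <time.h>
+#include <sys/syscall.h>   /* SYS_nanosleep */
+
+int main(void)
+{
+    struct timespec req = { 1, 0 };  /* sleep for one second */
+    long ret;
+
+    register struct timespec *arg1 asm("rdi") = &req;
+    register struct timespec *arg2 asm("rsi") = NULL;  /* no remainder wanted */
+
+    asm volatile("syscall"
+                 : "=a"(ret)
+                 : "0"((long)SYS_nanosleep), "r"(arg1), "r"(arg2)
+                 : "rcx", "r11", "memory");
+
+    printf("nanosleep returned %ld\n", ret);
+    return 0;
+}
+```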
+
+After the `syscall` instruction is executed, the [context switch](https://en.wikipedia.org/wiki/Context_switch) occurs and the kernel transfers execution to the system call handler. The system call handler for the `nanosleep` system call is located in the [kernel/time/hrtimer.c](https://github.com/torvalds/linux/blob/master/kernel/time/hrtimer.c) source code file and is defined with the `SYSCALL_DEFINE2` macro helper:
+
+```C
+SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
+		struct timespec __user *, rmtp)
+{
+	struct timespec tu;
+
+	if (copy_from_user(&tu, rqtp, sizeof(tu)))
+		return -EFAULT;
+
+	if (!timespec_valid(&tu))
+		return -EINVAL;
+
+	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+}
+```
+
+You may read more about the `SYSCALL_DEFINE2` macro in the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from userspace to kernelspace. In our case we copy the timeout value to sleep into the kernelspace `timespec` structure and check that the given `timespec` is valid by calling the `timespec_valid` function:
+
+```C
+static inline bool timespec_valid(const struct timespec *ts)
+{
+	if (ts->tv_sec < 0)
+		return false;
+	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
+		return false;
+	return true;
+}
+```
+
+which just checks that the given `timespec` does not represent a date before `1970` and that the nanoseconds value does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` does the main job for us. This function provides the following loop:
+
+```C
+do {
+	set_current_state(TASK_INTERRUPTIBLE);
+	hrtimer_start_expires(&t->timer, mode);
+
+	if (likely(t->task))
+		freezable_schedule();
+    
+} while (t->task && !signal_pending(current));
+
+__set_current_state(TASK_RUNNING);
+return t->task == NULL;
+```
+
+which puts the current task to sleep. After we set the `TASK_INTERRUPTIBLE` flag for the current task, the `hrtimer_start_expires` function starts the given high-resolution timer on the current processor. When the given high resolution timer expires, the task will be running again.
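+
+The second `rem` parameter exists exactly for the interrupted case: if a signal wakes the task up early, the kernel writes the remaining time there, so user space can simply restart the call. A hedged sketch of this common idiom:
+
+```C
+#include <errno.h>
+#include <time.h>
+
+/* Sleep for the requested time, restarting after signal interruptions. */
+static void sleep_full(struct timespec req)
+{
+    struct timespec rem;
+
+    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
+        req = rem;  /* continue with whatever time is left */
+}
+
+int main(void)
+{
+    struct timespec five_seconds = { 5, 0 };
+
+    sleep_full(five_seconds);
+    return 0;
+}
+```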
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the seventh part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw the [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. In this chapter we saw important time management related concepts like the `clocksource` and `clockevents` frameworks, the `jiffies` counter and so on. Of course this does not cover all of the time management in the Linux kernel. Many parts of it are mostly related to scheduling, which we will see in another chapter.
+
+If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+
+Links
+--------------------------------------------------------------------------------
+
+* [system call](https://en.wikipedia.org/wiki/System_call)
+* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
+* [standard library](https://en.wikipedia.org/wiki/Standard_library)
+* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
+* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
+* [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol)
+* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
+* [register](https://en.wikipedia.org/wiki/Processor_register)
+* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
+* [context switch](https://en.wikipedia.org/wiki/Context_switch)
+* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html)
+* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
+* [system calls table for x86_64](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
+* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
+* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
+* [x86_64](https://en.wikipedia.org/wiki/X86-64)
+* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html)

+ 30 - 0
contributors.md

@@ -67,3 +67,33 @@ Thank you to all contributors:
 * [Ian Miell](https://github.com/ianmiell)
 * [DongLiang Mu](https://github.com/mudongliang)
 * [Johan Manuel](https://github.com/29jm)
+* [Brian Rak](https://github.com/brakthehack)
+* [Robin Peiremans](https://github.com/rpeiremans)
+* [xiaoqiang zhao](https://github.com/hitmoon)
+* [aouelete](https://github.com/aouelete)
+* [Dennis Birkholz](https://github.com/dennisbirkholz)
+* [Anton Tyurin](https://github.com/noxiouz)
+* [Bogdan Kulbida](https://github.com/kulbida)
+* [Matt Hudgins](https://github.com/mhudgins)
+* [Ruth Grace Wong](https://github.com/ruthgrace)
+* [Jeremy Lacomis](https://github.com/jlacomis)
+* [Dubyah](https://github.com/Dubyah)
+* [Matthieu Tardy](https://github.com/c0riolis)
+* [michaelian ennis](https://github.com/mennis)
+* [Amitay Stern](https://github.com/amist)
+* [Matt Todd](https://github.com/mtodd)
+* [Piyush Pangtey](https://github.com/pangteypiyush)
+* [Alfred Agrell](https://github.com/Alcaro)
+* [Jakub Wilk](https://github.com/jwilk)
+* [Justus Adam](https://github.com/JustusAdam)
+* [Roy Wellington Ⅳ](https://github.com/thanatos)
+* [Jonathan Rennison](https://github.com/JGRennison)
+* [Mack Stump](https://github.com/rmbreak)
+* [Pushpinder Singh](https://github.com/PrinceDhaliwal)
+* [Xiaoqin Hu](https://github.com/huxq)
+* [Jeremy Cline](https://github.com/jeremycline)
+* [Kavindra Nikhurpa](https://github.com/kavi-nikhurpa)
+* [Connor Mullen](https://github.com/mullen3)
+* [Alex Gonzalez](https://github.com/alex-gonz)
+* [Tim Konick](https://github.com/tijko)
+* [Anastas Stoyanovsky](https://github.com/anastasds)

+ 11 - 11
interrupts/README.md

@@ -1,14 +1,14 @@
 # Interrupts and Interrupt Handling
 
-You will find a couple of posts which describe interrupts and exceptions handling in the linux kernel.
+In the following posts, we will cover interrupts and exceptions handling in the linux kernel.
 
-* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory.
-* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
-* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers.
-* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
-* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
-* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part.
-* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts.
-* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts.
-* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - this part describes softirqs, tasklets and workqueues concepts.
-* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the interrupts and interrupt handling chapter and here we will see a real hardware driver and interrupts related stuff.
+* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes interrupts and interrupt handling theory.
+* [Interrupts in the Linux Kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
+* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - describes early interrupt handlers.
+* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - describes first non-early interrupt handlers.
+* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
+* [Handling non-maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
+* [External hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
+* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
+* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
+* [Last part](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.

+ 28 - 28
interrupts/interrupts-1.md

@@ -9,14 +9,14 @@ This is the first part of the new chapter of the [linux insides](http://0xax.git
 What is an Interrupt?
 --------------------------------------------------------------------------------
 
-We have already heard of the word `interrupt` in several parts of this book. We even saw a couple of examples of interrupt handlers. In the current chapter we will start from the theory i.e.
+We have already heard of the word `interrupt` in several parts of this book. We even saw a couple of examples of interrupt handlers. In the current chapter we will start from the theory i.e.,
 
 * What are `interrupts` ?
 * What are `interrupt handlers`?
 
 We will then continue to dig deeper into the details of `interrupts` and how the Linux kernel handles them.
 
-So..., First of all what is an interrupt? An interrupt is an `event` which is raised by software or hardware when its needs the CPU's attention. For example, we press a button on the keyboard and what do we expect next? What should the operating system and computer do after this? To simplify matters assume that each peripheral device has an interrupt line to the CPU. A device can use it to signal an interrupt to the CPU. However interrupts are not signaled directly to the CPU. In the old machines there was a [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) which is a chip responsible for sequentially processing multiple interrupt requests from multiple devices. In the new machines there is an [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) commonly known as - `APIC`. An `APIC` consists of two separate devices:
+The first question that arises in our mind when we come across the word `interrupt` is `What is an interrupt?` An interrupt is an `event` raised by software or hardware when it needs the CPU's attention. For example, we press a button on the keyboard and what do we expect next? What should the operating system and computer do after this? To simplify matters, assume that each peripheral device has an interrupt line to the CPU. A device can use it to signal an interrupt to the CPU. However, interrupts are not signaled directly to the CPU. In the old machines there was a [PIC](http://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) which is a chip responsible for sequentially processing multiple interrupt requests from multiple devices. In the new machines there is an [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller), commonly known as `APIC`. An `APIC` consists of two separate devices:
 
 * `Local APIC`
 * `I/O APIC`
@@ -37,12 +37,12 @@ Addresses of each of the interrupt handlers are maintained in a special location
 BUG_ON((unsigned)n > 0xFF);
 ```
 
-You can find this check within the Linux kernel source code related to interrupt setup (eg. The `set_intr_gate`, `void set_system_intr_gate` in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h)). First 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
+You can find this check within the Linux kernel source code related to interrupt setup (e.g. the `set_intr_gate`, `void set_system_intr_gate` in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h)). The first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
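
As a rough illustration of that split, the kernel's x86 headers define constants along the lines of the sketch below (names are the kernel's, values shown for orientation; the exact definitions live in the x86 interrupt-vector headers):

```C
/* Illustrative sketch of the vector-space split described above. */
#define NUM_EXCEPTION_VECTORS	32	/* vectors 0..31: reserved by the CPU   */
#define FIRST_EXTERNAL_VECTOR	0x20	/* vector 32: first user-defined vector */
#define NR_VECTORS		256	/* vectors 0..255 in total              */
```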
 
 Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes:
 
-* External or hardware generated interrupts;
-* Software-generated interrupts.
+* External or hardware generated interrupts
+* Software-generated interrupts
 
 The first - external interrupts are received through the `Local APIC` or pins on the processor which are connected to the `Local APIC`. The second - software-generated interrupts are caused by an exceptional condition in the processor itself (sometimes using special architecture-specific instructions). A common example for an exceptional condition is `division by zero`. Another example is exiting a program with the `syscall` instruction.
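
As a small user-space illustration of the second class, an integer division by zero makes the CPU raise the divide-error exception, which the kernel's exception handler ultimately delivers to the offending process as a `SIGFPE` signal (this snippet only demonstrates the idea):

```C
#include <stdio.h>

int main(void)
{
	volatile int divisor = 0;

	/* The CPU raises the #DE (divide error) exception here; the kernel's
	 * exception handler turns it into a SIGFPE that kills this process. */
	int result = 1 / divisor;

	printf("never reached: %d\n", result);
	return 0;
}
```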
 
@@ -154,12 +154,12 @@ static void setup_idt(void)
 }
 ```
 
-from the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c). The `Interrupt Descriptor table` can be located anywhere in the linear address space and the base address of it must be aligned on an 8-byte boundary on `x86` or 16-byte boundary on `x86_64`. Base address of the `IDT` is stored in the special register - `IDTR`. There are two instructions on `x86`-compatible processors to modify the `IDTR` register:
+from the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c). The `Interrupt Descriptor table` can be located anywhere in the linear address space and the base address of it must be aligned on an 8-byte boundary on `x86` or 16-byte boundary on `x86_64`. The base address of the `IDT` is stored in the special register - `IDTR`. There are two instructions on `x86`-compatible processors to modify the `IDTR` register:
 
 * `LIDT`
 * `SIDT`
 
-The first instruction `LIDT` is used to load the base-address of the `IDT` i.e. the specified operand into the `IDTR`. The second instruction `SIDT` is used to read and store the contents of the `IDTR` into the specified operand. The `IDTR` register is 48-bits on the `x86` and contains following information:
+The first instruction `LIDT` is used to load the base-address of the `IDT` i.e., the specified operand into the `IDTR`. The second instruction `SIDT` is used to read and store the contents of the `IDTR` into the specified operand. The `IDTR` register is 48-bits on the `x86` and contains the following information:
 
 ```
 +-----------------------------------+----------------------+
@@ -227,7 +227,7 @@ And the last `Type` field describes the type of the `IDT` entry. There are three
 * Trap gate
 * Task gate
 
-The `IST` or `Interrupt Stack Table` is a new mechanism in the `x86_64`. It is used as an alternative to the the legacy stack-switch mechanism. Previously The `x86` architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The `IST` is a modified version of the `x86` Stack switching mode. This mechanism unconditionally switches stacks when it is enabled and can be enabled for any interrupt in the `IDT` entry related with the certain interrupt (we will soon see it). From this we can understand that `IST` is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The `IST` mechanism provides up to seven `IST` pointers in the [Task State Segment](http://en.wikipedia.org/wiki/Task_state_segment) or `TSS` which is the special structure which contains information about a process. The `TSS` is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the `IDT`.
+The `IST` or `Interrupt Stack Table` is a new mechanism in the `x86_64`. It is used as an alternative to the legacy stack-switch mechanism. Previously the `x86` architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The `IST` is a modified version of the `x86` stack switching mode. This mechanism unconditionally switches stacks when it is enabled, and it can be enabled per interrupt in the `IDT` entry related to that interrupt (we will see it soon). From this we can understand that `IST` is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The `IST` mechanism provides up to seven `IST` pointers in the [Task State Segment](http://en.wikipedia.org/wiki/Task_state_segment) or `TSS`, which is a special structure that contains information about a process. The `TSS` is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the `IDT`.
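
A simplified sketch of the 64-bit hardware `TSS` layout may make the seven `IST` slots easier to picture (field names here are illustrative; it only roughly corresponds to the kernel's `struct x86_hw_tss`):

```C
/* Illustrative layout of the 64-bit hardware TSS; not the kernel's exact
 * definition. The seven ist[] slots are the alternative stack pointers
 * referenced by the IST field of an IDT entry. */
struct hw_tss_sketch {
	unsigned int	reserved0;
	unsigned long	sp0;		/* stack pointer for ring 0 */
	unsigned long	sp1;		/* stack pointer for ring 1 */
	unsigned long	sp2;		/* stack pointer for ring 2 */
	unsigned long	reserved1;
	unsigned long	ist[7];		/* IST1..IST7 */
	unsigned long	reserved2;
	unsigned short	reserved3;
	unsigned short	io_bitmap_base;
} __attribute__((packed));
```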
 
 The `Interrupt Descriptor Table` represented by the array of the `gate_desc` structures:
 
@@ -274,7 +274,7 @@ Each active thread has a large stack in the Linux kernel for the `x86_64` archit
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 ```
 
-The `PAGE_SIZE` is `4096`-bytes and the `THREAD_SIZE_ORDER` depends on the `KASAN_STACK_ORDER`. As we can see, the `KASAN_STACK` depends on the `CONFIG_KASAN` kernel configuration parameter and equals to the:
+The `PAGE_SIZE` is `4096` bytes and the `THREAD_SIZE_ORDER` depends on the `KASAN_STACK_ORDER`. As we can see, the `KASAN_STACK_ORDER` depends on the `CONFIG_KASAN` kernel configuration parameter and is defined as:
 
 ```C
 #ifdef CONFIG_KASAN
@@ -284,7 +284,7 @@ The `PAGE_SIZE` is `4096`-bytes and the `THREAD_SIZE_ORDER` depends on the `KASA
 #endif
 ```
 
-`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). So... the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:
+`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:
 
 ```C
 #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
@@ -304,9 +304,9 @@ union irq_stack_union {
 };
 ```
 
-The first `irq_stack` field is a 16 kilobytes array. Also you can see that `irq_stack_union` contains structure with the two fields:
+The first `irq_stack` field is a 16 kilobyte array. You can also see that `irq_stack_union` contains a structure with two fields:
 
-* `gs_base` - The `gs` register always points to the bottom of the `irqstack` union. On the `x86_64`, the `gs` register is shared by per-cpu area and stack canary (more about `per-cpu` variables you can read in the special [part](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)).  All per-cpu symbols are zero based and the `gs` points to the base of per-cpu area. You already know that [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in the long mode, but we can set base address for the two segment registers - `fs` and `gs` with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can be still be used as address registers. If you remember the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you can remember that we have set the `gs` register:
+* `gs_base` - The `gs` register always points to the bottom of the `irqstack` union. On the `x86_64`, the `gs` register is shared by the per-cpu area and the stack canary (you can read more about `per-cpu` variables in the special [part](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). All per-cpu symbols are zero based and `gs` points to the base of the per-cpu area. You already know that the [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in long mode, but we can set the base address for the two segment registers - `fs` and `gs` - with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can still be used as address registers. If you remember the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you can remember that we have set the `gs` register:
 
 ```assembly
 	movl	$MSR_GS_BASE,%ecx
@@ -321,9 +321,9 @@ where `initial_gs` points to the `irq_stack_union`:
 GLOBAL(initial_gs)
 .quad	INIT_PER_CPU_VAR(irq_stack_union)
 ```
-    
+
 * `stack_canary` - [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) for the interrupt stack is a `stack protector`
-to verify that the stack hasn't been overwritten. Note that `gs_base` is an 40 bytes array. `GCC` requires that stack canary will be on the fixed offset from the base of the `gs` and its value must be `40` for the `x86_64` and `20` for the `x86`.
+to verify that the stack hasn't been overwritten. Note that `gs_base` is a 40-byte array. `GCC` requires that the stack canary be at a fixed offset from the base of `gs`, and that offset must be `40` for `x86_64` and `20` for `x86`.
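
To make the fixed offset concrete, here is a minimal sketch (not kernel code) of how the canary can be read on `x86_64` - it sits 40 bytes above the `gs` base, i.e. right after the `gs_base` array:

```C
/* Sketch only: read the stack-protector canary from its fixed offset of 40
 * bytes relative to the gs base, as GCC-generated prologues do on x86_64. */
static inline unsigned long read_stack_canary(void)
{
	unsigned long canary;

	asm ("movq %%gs:40, %0" : "=r" (canary));
	return canary;
}
```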
 
 The `irq_stack_union` is the first datum in the `percpu` area, we can see it in the `System.map`:
 
@@ -343,7 +343,7 @@ We can see its definition in the code:
 DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
 ```
 
-Now, its time to look at the initialization of the `irq_stack_union`. Besides the `irq_stack_union` definition, we can see the definition of the following per-cpu variables in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h):
+Now, it's time to look at the initialization of the `irq_stack_union`. Besides the `irq_stack_union` definition, we can see the definition of the following per-cpu variables in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h):
 
 ```C
 DECLARE_PER_CPU(char *, irq_stack_ptr);
@@ -374,7 +374,7 @@ for_each_possible_cpu(cpu) {
 }
 ```
 
-Here we go over all the CPUs on-by-one and setup `irq_stack_ptr`. This turns out to be equal to the top of the interrupt stack minus `64`. Why `64`?TODO  [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file is following:
+Here we go over all the CPUs one-by-one and set up the `irq_stack_ptr`. This turns out to be equal to the top of the interrupt stack minus `64`. Why `64`? TODO. The relevant code from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file is the following:
 
 ```C
 void load_percpu_segment(int cpu)
@@ -387,21 +387,21 @@ void load_percpu_segment(int cpu)
 }
 ```
 
-and as we already know `gs` register points to the bottom of the interrupt stack:
+and as we already know, the `gs` register points to the bottom of the interrupt stack:
 
 ```assembly
 	movl	$MSR_GS_BASE,%ecx
 	movl	initial_gs(%rip),%eax
 	movl	initial_gs+4(%rip),%edx
-	wrmsr	
+	wrmsr
 
 	GLOBAL(initial_gs)
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
 ```
 
-Here we can see the `wrmsr` instruction which loads the data from `edx:eax` into the [Model specific register](http://en.wikipedia.org/wiki/Model-specific_register) pointed by the `ecx` register. In our case model specific register is `MSR_GS_BASE` which contains the base address of the memory segment pointed by the `gs` register. `edx:eax` points to the address of the `initial_gs` which is the base address of our `irq_stack_union`.
+Here we can see the `wrmsr` instruction which loads the data from `edx:eax` into the [Model specific register](http://en.wikipedia.org/wiki/Model-specific_register) selected by the `ecx` register. In our case the model specific register is `MSR_GS_BASE`, which contains the base address of the memory segment pointed to by the `gs` register. `edx:eax` holds the value of the `initial_gs`, which is the base address of our `irq_stack_union`.
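
For reference, the same operation can be expressed from C with a small inline-assembly wrapper. This is only a sketch of the idea; the kernel has its own `wrmsr`/`wrmsrl` helpers:

```C
/* Sketch of a wrmsr wrapper: write the 64-bit value in high:low (edx:eax)
 * into the model specific register selected by msr (ecx). */
static inline void wrmsr_sketch(unsigned int msr, unsigned int low, unsigned int high)
{
	asm volatile("wrmsr" : : "c" (msr), "a" (low), "d" (high) : "memory");
}
```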
 
-We already know that `x86_64` has a feature called `Interrupt Stack Table` or `IST` and this feature provides the ability to switch to a new stack for events non-maskable interrupt, double fault and etc... There can be up to seven `IST` entries per-cpu. Some of them are:
+We already know that `x86_64` has a feature called the `Interrupt Stack Table` or `IST` and this feature provides the ability to switch to a new stack for events such as a non-maskable interrupt, double fault, etc. There can be up to seven `IST` entries per cpu. Some of them are:
 
 * `DOUBLEFAULT_STACK`
 * `NMI_STACK`
@@ -457,24 +457,24 @@ When an interrupt or an exception occurs, the new `ss` selector is forced to `NU
 |      RSP      | 32
 |     RFLAGS    | 24
 |      CS       | 16
-|      RIP      | 8 
+|      RIP      | 8
 |   Error code  | 0
 |               |
-+---------------+ 
++---------------+
 ```
 
-If the `IST` field in the interrupt gate is not `0`, we read the `IST` pointer into `rsp`. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we go ahead and push the dummy error code on to the stack. We need to do this to ensure stack consistency. Next we load the segment-selector field from the gate descriptor into the CS register and must verify that the target code-segment is a 64-bit mode code segment by the checking bit `21` i.e. the `L` bit in the `Global Descriptor Table`. Finally we load the offset field from the gate descriptor into `rip` which will be the entry-point of the interrupt handler. After this the interrupt handler begins to execute. After an interrupt handler finishes its execution, it must return control to the interrupted process with the `iret` instruction. The `iret` instruction unconditionally pops the stack pointer (`ss:rsp`) to restore the stack of the interrupted process and does not depend on the `cpl` change.
+If the `IST` field in the interrupt gate is not `0`, we read the `IST` pointer into `rsp`. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we go ahead and push a dummy error code onto the stack. We need to do this to ensure stack consistency. Next, we load the segment-selector field from the gate descriptor into the CS register and must verify that the target code-segment is a 64-bit mode code segment by checking bit `21`, i.e. the `L` bit, in the `Global Descriptor Table`. Finally we load the offset field from the gate descriptor into `rip`, which will be the entry point of the interrupt handler. After this the interrupt handler begins to execute, and when the interrupt handler finishes its execution it must return control to the interrupted process with the `iret` instruction. The `iret` instruction unconditionally pops the stack pointer (`ss:rsp`) to restore the stack of the interrupted process and does not depend on the `cpl` change.
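
The stack layout from the figure above can also be sketched as a C structure (purely illustrative, with the lowest address first):

```C
/* Sketch of the frame the CPU pushes on interrupt entry on x86_64, matching
 * the offsets in the figure above (error code at offset 0, ss at offset 40). */
struct hw_interrupt_frame_sketch {
	unsigned long error_code;	/* real or dummy error code               */
	unsigned long rip;		/* return address in the interrupted code */
	unsigned long cs;		/* code segment selector                  */
	unsigned long rflags;		/* saved flags register                   */
	unsigned long rsp;		/* stack pointer of the interrupted code  */
	unsigned long ss;		/* stack segment selector                 */
};
```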
 
 That's all.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the first part about interrupts and interrupt handling in the Linux kernel. We saw some theory and the first steps of the initialization of stuff related to interrupts and exceptions. In the next part we will continue to dive into interrupts and interrupts handling - into the more practical aspects of it.
+This is the end of the first part of `Interrupts and Interrupt Handling` in the Linux kernel. We covered some theory and the first steps of initialization of things related to interrupts and exceptions. In the next part we will continue to dive into the more practical aspects of interrupts and interrupt handling.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------
@@ -483,8 +483,8 @@ Links
 * [Advanced Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
 * [protected mode](http://en.wikipedia.org/wiki/Protected_mode)
 * [long mode](http://en.wikipedia.org/wiki/Long_mode)
-* [kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
-* [Task State Segement](http://en.wikipedia.org/wiki/Task_state_segment)
+* [kernel stacks](https://www.kernel.org/doc/Documentation/x86/kernel-stacks)
+* [Task State Segment](http://en.wikipedia.org/wiki/Task_state_segment)
 * [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation)
 * [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)
 * [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)

+ 12 - 12
interrupts/interrupts-10.md

@@ -24,7 +24,7 @@ module_init(serial21285_init);
 module_exit(serial21285_exit);
 ```
 
-The most part of device drivers can be compiled as a loadable kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) or in another way they can be statically linked into the Linux kernel. In the first case initialization of a device driver will be produced via the `module_init` and `module_Exit` macros that are defined in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h):
+Most device drivers can be compiled as a loadable kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) or, alternatively, statically linked into the Linux kernel. In the first case, initialization of a device driver is done via the `module_init` and `module_exit` macros that are defined in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h):
 
 ```C
 #define module_init(initfn)                                     \
@@ -179,7 +179,7 @@ Now let's look at the calls of the `request_irq` functions in our example. As we
 #define _DC21285_IRQ(x)         (16 + (x))
 ```
 
-The [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) IRQs on this board are from `0` to `15`, so, our interrupts will have first two numbers: `16` and `17`. Second parameters for two calls of the `request_irq` functions are `serial21285_rx_chars` and `serial21285_tx_chars`. These functions will be called when an `RX` or `TX` interrupt occured. We will not dive in this part into details of these functions, because this chapter covers the interrupts and interrupts handling but not device and drivers. The next parameter - `flags` and as we can see, it is zero in both calls of the `request_irq` function. All acceptable flags are defined as `IRQF_*` macros in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h). Some of it:
+The [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) IRQs on this board are from `0` to `15`, so our interrupts take the first two numbers after that: `16` and `17`. The second parameters of the two `request_irq` calls are `serial21285_rx_chars` and `serial21285_tx_chars`. These functions will be called when an `RX` or `TX` interrupt occurs. We will not dive into the details of these functions in this part, because this chapter covers interrupts and interrupt handling, not devices and drivers. The next parameter is `flags` and, as we can see, it is zero in both calls of the `request_irq` function. All acceptable flags are defined as `IRQF_*` macros in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h). Some of them are:
 
 * `IRQF_SHARED` - allows sharing the irq among several devices;
 * `IRQF_PERCPU` - an interrupt is per cpu;
@@ -219,7 +219,7 @@ if (((irqflags & IRQF_SHARED) && !dev_id) ||
                return -EINVAL;
 ```
 
-First of all we check that real `dev_id` is passed for the shared interrupt and the `IRQF_COND_SUSPEND` only makes sense for shared interrupts. Othrewise we exit from this function with the `-EINVAL` error. After this we convert the given `irq` number to the `irq` descriptor wit the help of the `irq_to_desc` function that defined in the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and exit from this function with the `-EINVAL` error if it was not successful:
+First of all we check that a real `dev_id` is passed for a shared interrupt and that `IRQF_COND_SUSPEND` is only used for shared interrupts. Otherwise we exit from this function with the `-EINVAL` error. After this we convert the given `irq` number to the `irq` descriptor with the help of the `irq_to_desc` function that is defined in the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and exit from this function with the `-EINVAL` error if it was not successful:
 
 ```C
 desc = irq_to_desc(irq);
@@ -296,7 +296,7 @@ if (new->thread_fn && !nested) {
 }
 ```
 
-And fill the rest of the given interrupt descriptor fields in the end. So, our `16` and `17` interrupt request lines are registered and the `` and `` functions will be invoked when an interrupt controller will get event releated to these interrupts. Now let's look at what happens when an interrupt occurs. 
+And in the end we fill the rest of the given interrupt descriptor fields. So, our `16` and `17` interrupt request lines are registered and the `serial21285_rx_chars` and `serial21285_tx_chars` functions will be invoked when the interrupt controller gets an event related to these interrupts. Now let's look at what happens when an interrupt occurs.
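
For completeness, a generic `request_irq` registration looks roughly like the hypothetical snippet below (`my_handler`, `my_driver_init_irq` and `my_dev` are placeholders, not part of the driver discussed here):

```C
#include <linux/interrupt.h>

/* Hypothetical handler: acknowledge the device and report that the
 * interrupt was handled. */
static irqreturn_t my_handler(int irq, void *dev_id)
{
	/* ... talk to the device, move data ... */
	return IRQ_HANDLED;
}

static int my_driver_init_irq(unsigned int irq, void *my_dev)
{
	/* Register my_handler for `irq`; the last argument is passed back to
	 * the handler as dev_id and also identifies us on free_irq(). */
	return request_irq(irq, my_handler, IRQF_SHARED, "my_device", my_dev);
}
```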
 
 Prepare to handle an interrupt
 --------------------------------------------------------------------------------
@@ -331,14 +331,14 @@ ENTRY(irq_entries_start)
 END(irq_entries_start)
 ```
 
-Here we can see the [GNU assembler](https://en.wikipedia.org/wiki/GNU_Assembler) `.rept` instruction which repeats the the sequence of lines that are before `.endr` - `FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR` times. As we already know, the `FIRST_SYSTEM_VECTOR` is `0xef`, and the `FIRST_EXTERNAL_VECTOR` is equal to `0x20`. So, it will work:
+Here we can see the [GNU assembler](https://en.wikipedia.org/wiki/GNU_Assembler) `.rept` instruction which repeats the sequence of lines that are before `.endr` - `FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR` times. As we already know, the `FIRST_SYSTEM_VECTOR` is `0xef`, and the `FIRST_EXTERNAL_VECTOR` is equal to `0x20`. So, it will work:
 
 ```python
 >>> 0xef - 0x20
 207
 ```
 
-times. In the body of the `.rept` instruction we push entry stubs on the stack (note that we use negative numbers for the interrupt vector numbers, because positive numbers already reserved to identify [system calls](https://en.wikipedia.org/wiki/System_call)), increment the `vector` variable and jump on the `common_interrupt` label. In the `common_interrupt` we adjust vector number on the stack and execute `interrupt` number with the `do_IRQ` parameter:
+times. In the body of the `.rept` instruction we push entry stubs on the stack (note that we use negative numbers for the interrupt vector numbers, because positive numbers are already reserved to identify [system calls](https://en.wikipedia.org/wiki/System_call)), increment the `vector` variable and jump to the `common_interrupt` label. At the `common_interrupt` label we adjust the vector number on the stack and invoke the `interrupt` macro with `do_IRQ` as its parameter:
 
 ```assembly
 common_interrupt:
@@ -346,7 +346,7 @@ common_interrupt:
 	interrupt do_IRQ
 ```
 
-The macro `interrupt` defined in the same source code file and saves [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, change the userspace `gs` on the kernel with the `SWAPGS` assembler instruction if need, increment [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) - `irq_count` variable that shows that we are in interrupt and call the `do_IRQ` function. This function defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - `pt_regs` structure that stores values of the userspace registers:
+The `interrupt` macro is defined in the same source code file. It saves the [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, switches from the userspace `gs` to the kernel one with the `SWAPGS` assembler instruction if needed, increments the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) `irq_count` variable that shows that we are in an interrupt, and calls the `do_IRQ` function. This function is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - the `pt_regs` structure that stores the values of the userspace registers:
 
 ```C
 __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
@@ -363,7 +363,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 }
 ```
 
-At the beginning of this function we can see call of the `set_irq_regs` function that returns saved `per-cpu` irq register pointer and the calls of the `irq_enter` and `exit_idle` functions. The first function `irq_enter` enters to an interrupt context with the updating `__preempt_count` variable and the section function - `exit_idle` checks that current process is `idle` with [pid](https://en.wikipedia.org/wiki/Process_identifier) - `0` and notify the `idle_notifier` with the `IDLE_END`.
+At the beginning of this function we can see a call of the `set_irq_regs` function that returns the saved `per-cpu` irq register pointer, and calls of the `irq_enter` and `exit_idle` functions. The first function, `irq_enter`, enters an interrupt context by updating the `__preempt_count` variable, and the second function, `exit_idle`, checks that the current process is `idle` with [pid](https://en.wikipedia.org/wiki/Process_identifier) `0` and notifies the `idle_notifier` with `IDLE_END`.
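
A rough sketch of what entering and leaving interrupt context means for the `__preempt_count` may help here (heavily simplified; the real `irq_enter`/`irq_exit` also deal with RCU, time accounting and deferred work):

```C
/* Simplified sketch: entering hard interrupt context is essentially bumping
 * the HARDIRQ part of the preempt counter, leaving it is the reverse. */
static inline void irq_enter_sketch(void)
{
	preempt_count_add(HARDIRQ_OFFSET);	/* in_irq() is now true */
}

static inline void irq_exit_sketch(void)
{
	preempt_count_sub(HARDIRQ_OFFSET);	/* softirqs may run now, etc. */
}
```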
 
 In the next step we read the `irq` for the current cpu and call the `handle_irq` function:
 
@@ -413,7 +413,7 @@ We already know that when an `IRQ` finishes its work, deferred interrupts will b
 Exit from interrupt
 --------------------------------------------------------------------------------
 
-Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function will be finsihed, we will return back to the assembler code in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry_entry_64.S) to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction and decrement value of the `irq_count` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable. Remember, this variable had value - `1`, when we were in interrupt context:
+Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function is finished, we return to the assembler code in [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction, and decrement the value of the `irq_count` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable. Remember, this variable had the value `1` when we were in interrupt context:
 
 ```assembly
 DISABLE_INTERRUPTS(CLBR_NONE)
@@ -448,11 +448,11 @@ That's all.
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the tenth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and as you have read in the beginning of this part - it is the last part of this chapter. This chapter started from the explanation of the theory of interrupts and we have learned what is it interrupt and kinds of interrupts, then we saw exceptions and handling of this kind of interrupts, deferred interrupts and finally we looked on the hardware interrupts and thanlding of their in this part. Of course, this part and even this chapter does not cover full aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was the big part, I don't know how about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that somewhere there is a book that covers it. We have missed many part and aspects of interrupts and interrupt handling, but I think it will be good point to dive in the kernel code related to the interrupts and interrupts handling.
+It is the end of the tenth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and, as you read at the beginning of this part, it is the last part of this chapter. This chapter started with an explanation of the theory of interrupts: we learned what an interrupt is and what kinds of interrupts there are, then we saw exceptions and the handling of this kind of interrupts, deferred interrupts, and finally, in this part, we looked at hardware interrupts and their handling. Of course, this part and even this chapter do not cover all aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was a big part, I don't know how about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that there is a book somewhere that covers it. We have missed many parts and aspects of interrupts and interrupt handling, but I think this will be a good point from which to dive into the kernel code related to interrupts and interrupt handling.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 30 - 30
interrupts/interrupts-2.md

@@ -4,9 +4,9 @@ Interrupts and Interrupt Handling. Part 2.
 Start to dive into interrupt and exceptions handling in the Linux kernel
 --------------------------------------------------------------------------------
 
-We saw some theory about an interrupts and an exceptions handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and since this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. Since this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code.
+We saw some theory about interrupts and exception handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you can already note, the previous part mostly described theoretical aspects and in this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. In this part we will try to go through all the interrupts and exceptions related stuff which we can find in the Linux kernel source code.
 
-If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specifix source code which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`:
+If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`:
 
 ```C
 void go_to_protected_mode(void)
@@ -17,7 +17,7 @@ void go_to_protected_mode(void)
 }
 ```
 
-The `setup_idt` function defined in the same source code file as the `go_to_protected_mode` function and just loads address of the `NULL` interrupts descriptor table:
+The `setup_idt` function is defined in the same source code file as the `go_to_protected_mode` function and just loads the address of the `NULL` interrupts descriptor table:
 
 ```C
 static void setup_idt(void)
@@ -27,7 +27,7 @@ static void setup_idt(void)
 }
 ```
 
-where `gdt_ptr` represents special 48-bit `GTDR` register which must contain base address of the `Global Descriptor Table`:
+where `gdt_ptr` represents a special 48-bit `GDTR` register which must contain the base address of the `Global Descriptor Table`:
 
 ```C
 struct gdt_ptr {
@@ -36,18 +36,18 @@ struct gdt_ptr {
 } __attribute__((packed));
 ```
 
-Of course in our case the `gdt_ptr` does not represent `GDTR` register, but `IDTR` since we set `Interrupt Descriptor Table`. You will not find `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as `gdt_ptr` but with different name. So, as you can understand there is no sense to have two similar structures which are differ only in a name. You can note here, that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions for this moment. That's why we just fill the `IDT` with the `NULL`.
+Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but the `IDTR`, since we are setting up the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as `gdt_ptr` but with a different name. So, as you can understand, there is no sense in having two similar structures which differ only by name. You can note here that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`.
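
Since the 48-bit pointer format is identical for the `GDTR` and `IDTR`, loading this `NULL` `IDT` boils down to something like the following sketch, which closely mirrors what `setup_idt` does in the boot code (the function name here is illustrative):

```C
/* Sketch: a zero-length, zero-base descriptor-table pointer loaded into the
 * IDTR with the lidt instruction; it reuses the gdt_ptr layout shown above. */
static void load_null_idt(void)
{
	static const struct gdt_ptr null_idt = { 0, 0 };

	asm volatile("lidtl %0" : : "m" (null_idt));
}
```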
 
-And after the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). More about it you can read in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes transition to the protected mode.
+After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). You can read more about it in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.
 
-We already know from the earliest parts that entry of the protected mode located in the `boot_params.hdr.code32_start` and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` in the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c):
+We already know from the earliest parts that the entry point of protected mode is located at `boot_params.hdr.code32_start` and you can see that we pass this entry point and `boot_params` to the `protected_mode_jump` at the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c):
 
 ```C
 protected_mode_jump(boot_params.hdr.code32_start,
 			    (u32)&boot_params + (ds() << 4));
 ```
 
-The `protected_mode_jump` defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling  [convention](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):
+The `protected_mode_jump` is defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling  [conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):
 
 ```assembly
 GLOBAL(protected_mode_jump)
@@ -63,7 +63,7 @@ GLOBAL(protected_mode_jump)
 ENDPROC(protected_mode_jump)
 ```
 
-where `in_pm32` contains jump to the 32-bit entrypoint:
+where `in_pm32` contains a jump to the 32-bit entry point:
 
 ```assembly
 GLOBAL(in_pm32)
@@ -75,12 +75,12 @@ GLOBAL(in_pm32)
 ENDPROC(in_pm32)
 ```
 
-As you can remember 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in the its name. We can see the two similar files in the `arch/x86/boot/compressed` directory:
+As you can remember the 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in its name. We can see the two similar files in the `arch/x86/boot/compressed` directory:
 
 * `arch/x86/boot/compressed/head_32.S`.
 * `arch/x86/boot/compressed/head_64.S`;
 
-But the 32-bit mode entry point the the second file in our case. The first file even not compiled for `x86_64`. Let's look on the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
+But the 32-bit mode entry point is the second file in our case. The first file is not even compiled for `x86_64`. Let's look at the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
 
 ```
 vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
@@ -100,7 +100,7 @@ else
 endif
 ```
 
-Now as we jumped on the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps in it. The `long mode` entry located `startup_64` and it makes preparation before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After kernel decompressed, we jump on the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked [NX](http://en.wikipedia.org/wiki/NX_bit) bit, made setup of the `Extended Feature Enable Register` (see in links), updated early `Global Descriptor Table` wit the `lgdt` instruction, we need to setup `gs` register with the following code:
+Now as we jumped to `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps into it. The `long mode` entry is located in `startup_64` and it makes preparations before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump to `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, set up the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to set up the `gs` register with the following code:
 
 ```assembly
 movl	$MSR_GS_BASE,%ecx
@@ -109,27 +109,27 @@ movl	initial_gs+4(%rip),%edx
 wrmsr
 ```
 
-We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and not time to know better what is going on here. First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks:
+We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html). First of all pay attention to the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks like:
 
 ```C
 #define MSR_GS_BASE             0xc0000101
 ```
 
-From this we can understand that `MSR_GS_BASE` defines number of the `model specific register`. Since registers `cs`, `ds`, `es`, and `ss` are not used in the 64-bit mode, their fields are ignored. But we can access memory over `fs` and `gs` registers. The model specific register provides `back door` to the hidden parts of these segment registers and allows to use 64-bit base address for segment register addressed by the `fs` and `gs`. So the `MSR_GS_BASE` is the hidden part and this part is mapped on the `GS.base` field. Let's look on the `initial_gs`:
+From this we can understand that `MSR_GS_BASE` defines the number of the `model specific register`. Since the registers `cs`, `ds`, `es`, and `ss` are not used in 64-bit mode, their fields are ignored. But we can access memory via the `fs` and `gs` registers. The model specific register provides a `back door` to the hidden parts of these segment registers and allows the use of a 64-bit base address for the segment registers addressed by `fs` and `gs`. So the `MSR_GS_BASE` is the hidden part and this part is mapped onto the `GS.base` field. Let's look at the `initial_gs`:
 
 ```assembly
 GLOBAL(initial_gs)
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
 ```
 
-We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates `init_per_cpu__` prefix with the given symbol. In our case we will get `init_per_cpu__irq_stack_union` symbol. Let's look on the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see following definition:
+We pass the `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates the `init_per_cpu__` prefix with the given symbol. In our case we will get the `init_per_cpu__irq_stack_union` symbol. Let's look at the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see the following definition:
 
 ```
 #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
 INIT_PER_CPU(irq_stack_union);
 ```
 
-It tells us that address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where are `init_per_cpu__irq_stack_union` and `__per_cpu_load` and what they mean. The first `irq_stack_union` defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call of the `init_per_cpu_var` macro:
+It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are and what they mean. The first, `irq_stack_union`, is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to a call of the `init_per_cpu_var` macro:
 
 ```C
 DECLARE_INIT_PER_CPU(irq_stack_union);
@@ -140,7 +140,7 @@ DECLARE_INIT_PER_CPU(irq_stack_union);
 #define init_per_cpu_var(var)  init_per_cpu__##var
 ```
 
-If we will expand all macro we will get the same `init_per_cpu__irq_stack_union` as we got after expanding of the `INIT_PER_CPU` macro, but you can note that it is already not just symbol, but variable. Let's look on the `typeof(percpu_var(var))` expression. Our `var` is `irq_stack_union` and `per_cpu_var` macro defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h):
+If we expand all macros we will get the same `init_per_cpu__irq_stack_union` as we got after expanding the `INIT_PER_CPU` macro, but you can note that it is not just a symbol, but a variable. Let's look at the `typeof(per_cpu_var(var))` expression. Our `var` is `irq_stack_union` and the `per_cpu_var` macro is defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h):
 
 ```C
 #define PER_CPU_VAR(var)        %__percpu_seg:var
@@ -154,13 +154,13 @@ where:
 endif
 ```
 
-So, we accessing `gs:irq_stack_union` and geting its type which is `irq_union`. Ok, we defined the first variable and know its address, now let's look on the second `__per_cpu_load` symbol. There are a couple of percpu variables which are located after this symbol. The `__per_cpu_load` defined in the [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic-sections.h):
+So, we are accessing `gs:irq_stack_union` and getting its type which is `irq_union`. Ok, we defined the first variable and know its address, now let's look at the second `__per_cpu_load` symbol. There are a couple of `per-cpu` variables which are located after this symbol. The `__per_cpu_load` is defined in the [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic-sections.h):
 
 ```C
 extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
 ```
 
-and presented base address of the `per-cpu` variables from the data area. So, we know address of the `irq_stack_union`, `__per_cpu_load` and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`. And we can see it in the [System.map](http://en.wikipedia.org/wiki/System.map):
+and represents the base address of the `per-cpu` variables from the data area. So, we know the addresses of `irq_stack_union` and `__per_cpu_load`, and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`. And we can see it in the [System.map](http://en.wikipedia.org/wiki/System.map):
 
 ```
 ...
@@ -174,7 +174,7 @@ ffffffff819ed000 A init_per_cpu__irq_stack_union
 ...
 ```
 
-Now we know about `initia_gs`, so let's book to the our code:
+Now we know about `initial_gs`, so let's look at the code:
 
 ```assembly
 movl	$MSR_GS_BASE,%ecx
@@ -183,7 +183,7 @@ movl	initial_gs+4(%rip),%edx
 wrmsr
 ```
 
-Here we specified model specific register with `MSR_GS_BASE`, put 64-bit address of the `initial_gs` to the `edx:eax` pair and execute `wrmsr` instruction for filling the `gs` register with base address of the `init_per_cpu__irq_stack_union` which will be bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and on of these preparations is filling of the early `Interrupt Descriptor Table` with the interrupts handlres entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code:
+Here we specify the model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` into the `edx:eax` pair and execute the `wrmsr` instruction to fill the `gs` register with the base address of the `init_per_cpu__irq_stack_union`, which will be at the bottom of the interrupt stack. After this we will jump to the C code in the `x86_64_start_kernel` function from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupt handler entries or `early_idt_handlers`. You may remember it if you have read the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html), where we saw the following code:
 
 ```C
 for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
@@ -266,7 +266,7 @@ union irq_stack_union {
 };
 ```
 
-which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [unioun](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
+which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that a [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure whose members all share the same memory. We can see here that the structure's first field - `gs_base` - is 40 bytes in size and represents the bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about the Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interested in the `BUILD_BUG_ON` macro).
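+
+As a small userspace illustration of this union property (the sizes below are illustrative, not the kernel's exact values), all members overlay the same storage, so the union is only as large as its largest member:
+
+```C
+#include <stdio.h>
+
+union toy_irq_stack_union {
+	char irq_stack[16384];          /* the whole per-cpu interrupt stack     */
+	struct {
+		char gs_base[40];       /* overlays the bottom 40 bytes of stack */
+		unsigned long stack_canary;
+	};
+};
+
+int main(void)
+{
+	/* prints 16384 - the size of the largest member */
+	printf("sizeof(union) = %zu\n", sizeof(union toy_irq_stack_union));
+	return 0;
+}
+```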
 
 After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
 
@@ -304,7 +304,7 @@ This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds
 #endif
 ```
 
-They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In ourcase `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
+They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains a call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is a special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` - for tracing `hardirq` and `softirq` state. In our case the `lockdep` subsystem can give us interesting information about hard/soft irq on/off events which occur in the system. The `trace_hardirqs_off` function is defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
 
 ```C
 void trace_hardirqs_off(void)
@@ -314,7 +314,7 @@ void trace_hardirqs_off(void)
 EXPORT_SYMBOL(trace_hardirqs_off);
 ```
 
-and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` filed of the current process increment the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_internals.h) and located in the `lockdep_stats` structure:
+and just calls the `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases `redundant_hardirqs_off` if the call of `local_irq_disable` was redundant, or `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_internals.h) and located in the `lockdep_stats` structure:
 
 ```C
 struct lockdep_stats {
@@ -340,7 +340,7 @@ static void lockdep_stats_debug_show(struct seq_file *m)
 							 hr1 = debug_atomic_read(redundant_hardirqs_on),
     ...
 	...
-	... 
+	...
     seq_printf(m, " hardirq on events:             %11llu\n", hi1);
     seq_printf(m, " hardirq off events:            %11llu\n", hi2);
     seq_printf(m, " redundant hardirq ons:         %11llu\n", hr1);
@@ -371,7 +371,7 @@ static inline void native_irq_disable(void)
 }
 ```
 
-And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle and interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macr - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
+And you must already remember that the `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines the ability of a processor to handle an interrupt or an exception. Besides the `local_irq_disable`, as you may already know, there is an inverse macro - `local_irq_enable`. This macro has the same tracing mechanism and is very similar to the `local_irq_disable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
 
 ```C
 static inline void native_irq_enable(void)
@@ -499,7 +499,7 @@ static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
 }
 ```
 
-Do you see it? Look on the fourth parameter of the `_set_gate`. It is `0x3`. In the `set_intr_gate` it was `0x0`. We know that this parameter represent `DPL` or privilege level. We also know that `0` is the highest privilge level and `3` is the lowest.Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist`, `set_intr_gate` are work and we can return to the `early_trap_init` function. Let's look on it again:
+Do you see it? Look at the fourth parameter of the `_set_gate`. It is `0x3`. In the `set_intr_gate` it was `0x0`. We know that this parameter represents the `DPL` or privilege level. We also know that `0` is the highest privilege level and `3` is the lowest. Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist` and `set_intr_gate` work and we can return to the `early_trap_init` function. Let's look at it again:
 
 ```C
 set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
@@ -519,9 +519,9 @@ Conclusion
 
 It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We have started from the earliest parts in the Linux kernel source code which are related to the interrupts. In the next part we will continue to dive into this interesting theme and will learn more about the interrupt handling process.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------
@@ -530,7 +530,7 @@ Links
 * [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
 * [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
 * [8086](http://en.wikipedia.org/wiki/Intel_8086)
-* [Long mode](http://en.wikipedia.org/wiki/Long_mode) 
+* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
 * [NX](http://en.wikipedia.org/wiki/NX_bit)
 * [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series)
 * [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register)

+ 273 - 221
interrupts/interrupts-3.md

@@ -1,10 +1,12 @@
 Interrupts and Interrupt Handling. Part 3.
 ================================================================================
 
-Interrupt handlers
+Exception Handling
 --------------------------------------------------------------------------------
 
-This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions:
+This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
+
+We already know that this function performs initialization of architecture-specific stuff. In our case the `setup_arch` function does [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related initializations. The `setup_arch` is a big function, and in the previous part we stopped at the setting of the two exception handlers for the two following exceptions:
 
 * `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
 * `#BP` - breakpoint exception, caused by the `int 3` instruction.
@@ -22,22 +24,28 @@ void __init early_trap_init(void)
 }
 ```
 
-from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look on the implementation of these early exceptions handlers.
+from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw the implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look at the implementation of these two exception handlers.
 
 Debug and Breakpoint exceptions
 --------------------------------------------------------------------------------
 
-Ok, we set the interrupts gates in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now time is to look on their handlers. But first of all let's look on these exceptions. The first exceptions - `#DB` or debug exception occurs when a debug event occurs, for example attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers which present in processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) and as you can understand from its name they are used for debugging. These registers allow to set breakpoints on the code and read or write data to trace, thus tracking the place of errors. The debug registers are privileged resources available and the program in either real-address or protected mode at `CPL` is `0`, that's why we have used `set_intr_gate_ist` for the `#DB`, but not the `set_system_intr_gate_ist`. The verctor number of the `#DB` exceptions is `1` (we pass it as `X86_TRAP_DB`) and has no error code:
+Ok, we set up exception handlers in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now it is time to consider their implementations. But before we do this, first of all let's look at the details of these exceptions.
+
+The first exception - `#DB` or `debug` exception - occurs when a debug event occurs, for example an attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers that were introduced in `x86` processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) processor and, as you can understand from the name of this CPU extension, the main purpose of these registers is debugging.
+
+These registers allow us to set breakpoints on the code and read or write data to trace it. Debug registers may be accessed only in privileged mode and an attempt to read or write the debug registers when executing at any other privilege level causes a [general protection fault](https://en.wikipedia.org/wiki/General_protection_fault) exception. That's why we have used `set_intr_gate_ist` for the `#DB` exception, but not the `set_system_intr_gate_ist`.
+
+The vector number of the `#DB` exception is `1` (we pass it as `X86_TRAP_DB`) and, as we may read in the specification, this exception has no error code:
 
 ```
-----------------------------------------------------------------------------------------------
-|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
-----------------------------------------------------------------------------------------------
-|1     | #DB    |Reserved            |F/T  |NO        |                                      |
-----------------------------------------------------------------------------------------------
++-----------------------------------------------------+
+|Vector|Mnemonic|Description         |Type |Error Code|
++-----------------------------------------------------+
+|1     | #DB    |Reserved            |F/T  |NO        |
++-----------------------------------------------------+
 ```
 
-The second is `#BP` or breakpoint exception occurs when processor executes the [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. We can add it anywhere in our code, for example let's look on the simple program:
+The second exception is `#BP` or the `breakpoint` exception, which occurs when the processor executes the [int 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. Unlike the `#DB` exception, the `#BP` exception may occur in userspace. We can add it anywhere in our code; for example, let's look at this simple program:
 
 ```C
 // breakpoint.c
@@ -94,54 +102,56 @@ Program received signal SIGTRAP, Trace/breakpoint trap.
 ...
 ```
 
-Now we know a little about these two exceptions and we can move on to consideration of their handlers.
+Now that we know a little about these two exceptions, we can move on to consider their handlers.
 
-Preparation before an interrupt handler
+Preparation before an exception handler
 --------------------------------------------------------------------------------
 
-As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions takes an addresses of the exceptions handlers in the second parameter:
+As you may have noted before, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions take the addresses of exception handlers in their second parameter. In our case these two exception handlers will be:
 
-* `&debug`;
-* `&int3`.
+* `debug`;
+* `int3`.
 
-You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h): 
+You will not find implementations of these functions in the C code. All that can be found in the kernel's `*.c/*.h` files are the declarations of these functions, which are located in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h) kernel header file:
 
 ```C
 asmlinkage void debug(void);
+```
+
+and
+
+```C
 asmlinkage void int3(void);
 ```
 
-But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
+You may note the `asmlinkage` directive in the declarations of these functions. The directive is a special specifier of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler. Actually, for `C` functions which are called from assembly, we need an explicit declaration of the function calling convention. In our case, if a function is marked with the `asmlinkage` descriptor, then `gcc` will compile the function to retrieve its parameters from the stack.
+
+So, both handlers are defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file with the `idtentry` macro:
 
 ```assembly
 idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+```
+
+and
+
+```assembly
 idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
 ```
 
-Actually `debug` and `int3` are not interrupts handlers. Remember that before we can execute an interrupt/exception handler, we need to do some preparations as:
+Each exception handler may consist of two parts. The first part is a generic part and it is the same for all exception handlers. An exception handler should save the [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) on the stack, switch to the kernel stack if the exception came from userspace, and transfer control to the second part of the exception handler. The second part of an exception handler does certain work depending on the certain exception. For example, the page fault exception handler should find a virtual page for the given address, the invalid opcode exception handler should send the `SIGILL` [signal](https://en.wikipedia.org/wiki/Unix_signal), etc. A conceptual sketch of this split is shown right below.
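+
+Here is a purely conceptual, userspace-runnable sketch of that split; all names in it are made up for illustration and are not kernel APIs:
+
+```C
+#include <stdio.h>
+
+static void save_registers(void)         { puts("save general purpose registers"); }
+static int  came_from_userspace(void)    { return 1; }
+static void switch_to_kernel_stack(void) { puts("switch to the kernel stack"); }
+static void specific_handler(void)       { puts("second part: e.g. do_debug()/do_int3()"); }
+
+/* first, generic part - in the kernel it is generated by the idtentry macro */
+static void exception_entry(void)
+{
+	save_registers();
+	if (came_from_userspace())
+		switch_to_kernel_stack();
+	specific_handler();              /* second, exception-specific part */
+}
+
+int main(void)
+{
+	exception_entry();
+	return 0;
+}
+```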
 
-* When an interrupt or exception occured, the processor uses an exception or interrupt vector as an index to a descriptor in the `IDT`;
-* In legacy mode `ss:esp` registers are pushed on the stack only if privilege level changed. In 64-bit mode `ss:rsp` pushed on the stack everytime;
-* During stack switching with `IST` the new `ss` selector is forced to null. Old `ss` and `rsp` are pushed on the new stack.
-* The `rflags`, `cs`, `rip` and error code pushed on the stack;
-* Control transfered to an interrupt handler;
-* After an interrupt handler will finish its work and finishes with the `iret` instruction, old `ss` will be poped from the stack and loaded to the `ss` register.
-* `ss:rsp` will be popped from the stack unconditionally in the 64-bit mode and will be popped only if there is a privilege level change in legacy mode.
-* `iret` instruction will restore `rip`, `cs` and `rflags`;
-* Interrupted program will continue its execution.
+As we just saw, an exception handler starts from the definition of the `idtentry` macro from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file, so let's look at the implementation of this macro. As we may see, the `idtentry` macro takes five arguments:
 
-```
-    +--------------------+
-+40 |        ss          |
-+32 |       rsp          |
-+24 |      rflags        |
-+16 |        cs          |
- +8 |       rip          |
-  0 |    error code      |
-    +--------------------+
-```
+* `sym` - defines a global symbol with `.globl name` which will be the entry of the exception handler;
+* `do_sym` - symbol name which represents a secondary entry of an exception handler;
+* `has_error_code` - information about the existence of an error code for the exception.
+
+The last two parameters are optional:
 
-Now we can see on the preparations before a process will transfer control to an interrupt/exception handler from practical side. As I already wrote above the first thirteen exceptions handlers defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly file with the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S#L967) macro:
+* `paranoid` - shows us how we need to check the current mode (we will see a detailed explanation later);
+* `shift_ist` - shows us whether an exception runs on a stack from the `Interrupt Stack Table`.
+
+The definition of the `idtentry` macro looks like this:
 
 ```assembly
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
@@ -153,107 +163,203 @@ END(\sym)
 .endm
 ```
 
-This macro defines an exception entry point and as we can see it takes `five` arguments:
+Before we consider the internals of the `idtentry` macro, we should know the state of the stack when an exception occurs. As we may read in the [Intel® 64 and IA-32 Architectures Software Developer’s Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html), the state of the stack when an exception occurs is the following:
 
-* `sym` - defines global symbol with the `.globl name`.
-* `do_sym` - an interrupt handler.
-* `has_error_code:req` - information about error code, The `:req` qualifier tells the assembler that the argument is required;
-* `paranoid` - shows us how we need to check current mode;
-* `shift_ist` - shows us what's stack to use;
+```
+    +------------+
++40 | %SS        |
++32 | %RSP       |
++24 | %RFLAGS    |
++16 | %CS        |
+ +8 | %RIP       |
+  0 | ERROR CODE | <-- %RSP
+    +------------+
+```
 
-As we can see our exceptions handlers are almost the same:
+Now we may start to consider the implementation of the `idtentry` macro. Both `#DB` and `#BP` exception handlers are defined as:
 
 ```assembly
 idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
 idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
 ```
 
-The differences are only in the global name and name of exceptions handlers. Now let's look how `idtentry` macro implemented. It starts from the two checks:
-
-```assembly
-	.if \shift_ist != -1 && \paranoid == 0
-	.error "using shift_ist requires paranoid=1"
-	.endif
-
-	.if \has_error_code
-	XCPT_FRAME
-	.else
-	INTR_FRAME
-	.endif
-```
-
-First check makes the check that an exceptions uses `Interrupt stack table` and `paranoid` is set, in other way it emits the erorr with the [.error](https://sourceware.org/binutils/docs/as/Error.html#Error) directive. The second `if` clause checks existence of an error code and calls `XCPT_FRAME` or `INTR_FRAME` macros depends on it. These macros just expand to the set of [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) which are used by `GNU AS` to manage call frames. The `CFI` directives are used only to generate [dwarf2](http://en.wikipedia.org/wiki/DWARF) unwind information for better backtraces and they don't change any code, so we will not go into detail about it and from this point I will skip all code which is related to these directives. In the next step we check error code again and push it on the stack if an exception has it with the:
+If we look at these definitions, we can see that the compiler will generate two routines with the names `debug` and `int3`, and both of these exception handlers will call the `do_debug` and `do_int3` secondary handlers after some preparation. The third parameter defines the existence of an error code, and as we may see both of our exceptions do not have one. As we may see on the diagram above, the processor pushes an error code onto the stack if an exception provides it. In our case, the `debug` and `int3` exceptions do not have error codes. This may bring some difficulties because the stack will look different for exceptions which provide an error code and for exceptions which do not. That's why the implementation of the `idtentry` macro starts by putting a fake error code on the stack if an exception does not provide one:
 
 ```assembly
 .ifeq \has_error_code
-	pushq_cfi $-1
+    pushq	$-1
 .endif
 ```
 
-The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/dwarf2.h) and expands to the `pushq` instruction which pushes given error code:
+But it is not only a fake error code. Moreover, the `-1` also represents an invalid system call number, so that the system call restart logic will not be triggered.
+
+The last two parameters of the `idtentry` macro - `shift_ist` and `paranoid` - allow us to know whether an exception handler runs on a stack from the `Interrupt Stack Table` or not. You may already know that each kernel thread in the system has its own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of these stacks is the exception stack. The [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture provides a special feature which is called the `Interrupt Stack Table`. This feature allows switching to a new stack for designated events such as atomic exceptions like `double fault`, etc. So the `shift_ist` parameter allows us to know whether we need to switch to an `IST` stack for an exception handler or not.
+
+The second parameter - `paranoid` - defines the method which helps us to know whether we came to the exception handler from userspace or not. The easiest way to determine this is via the `CPL` or `Current Privilege Level` in the `CS` segment register. If it is equal to `3`, we came from userspace; if it is zero, we came from kernel space:
 
-```assembly
-	.macro pushq_cfi reg
-	pushq \reg
-	CFI_ADJUST_CFA_OFFSET 8
-	.endm
+```assembly
+testl $3,CS(%rsp)
+jnz userspace
+...
+...
+...
+// we are from the kernel space
 ```
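+
+The same check expressed as a hedged userspace C sketch (not kernel code): the two low bits of the saved `CS` selector are the privilege level of the interrupted code, so masking with `3` tells us whether it ran in ring 3 (userspace) or ring 0 (kernel):
+
+```C
+int came_from_userspace(unsigned long saved_cs)
+{
+	/* returns 1 if the interrupted code was running in userspace (ring 3) */
+	return (saved_cs & 3) == 3;
+}
+```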
 
-Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
+But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:
 
-```C
-#define RIP		16*8
-#define CS		17*8
-#define EFLAGS	18*8
-#define RSP		19*8
-#define SS		20*8
+> if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+> which might have triggered right after a normal entry wrote CS to the
+> stack but before we executed SWAPGS, then the only safe way to check
+> for GS is the slower method: the RDMSR.
+
+In other words, an `NMI`, for example, could happen inside the critical section of a [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. In this case we should check the value of the `MSR_GS_BASE` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register) which stores a pointer to the start of the per-cpu area. So, to check whether we came from userspace or not, we should check the value of the `MSR_GS_BASE` model specific register: if it is negative we came from kernel space, otherwise we came from userspace:
+
+```assembly
+movl $MSR_GS_BASE,%ecx
+rdmsr
+testl %edx,%edx
+js 1f
 ```
 
-With the `pushq \reg` we denote that place before the `RIP` will contain error code of an exception:
+In the first two lines of code we read the value of the `MSR_GS_BASE` model specific register into the `edx:eax` pair. We can't set a negative value to the `gs` from userspace. On the other hand, we know that the direct mapping of the physical memory starts from the `0xffff880000000000` virtual address. In this way, `MSR_GS_BASE` will contain an address from `0xffff880000000000` to `0xffffc7ffffffffff`. After the `rdmsr` instruction is executed, the smallest possible value in the `%edx` register will be `0xffff8800`, which is `-30720` when interpreted as a signed 4 byte value. That's why the kernel space `gs`, which points to the start of the `per-cpu` area, will contain a negative value.
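+
+A tiny userspace illustration of that sign trick (not kernel code): the upper half of a kernel-space `MSR_GS_BASE` value is negative when reinterpreted as a signed 32-bit integer, which is exactly what the `testl %edx, %edx; js` check relies on:
+
+```C
+#include <stdio.h>
+
+int main(void)
+{
+	unsigned int edx = 0xffff8800;  /* high 32 bits of a kernel MSR_GS_BASE */
+
+	/* reinterpreted as a signed 32-bit value this prints -30720 */
+	printf("%d\n", (int)edx);
+	return 0;
+}
+```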
 
-```C
-#define ORIG_RAX	15*8
+After we push the fake error code on the stack, we should allocate space for general purpose registers with the:
+
+```assembly
+ALLOC_PT_GPREGS_ON_STACK
 ```
 
-The `ORIG_RAX` will contain error code of an exception, [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) number on a hardware interrupt and system call number on [system call](http://en.wikipedia.org/wiki/System_call) entry. In the next step we can see thr `ALLOC_PT_GPREGS_ON_STACK` macro which allocates space for the 15 general purpose registers on the stack:
+macro which is defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file. This macro just allocates `15*8` bytes of space on the stack to preserve the general purpose registers:
 
 ```assembly
 .macro ALLOC_PT_GPREGS_ON_STACK addskip=0
-subq	$15*8+\addskip, %rsp
-CFI_ADJUST_CFA_OFFSET 15*8+\addskip
+    addq	$-(15*8+\addskip), %rsp
 .endm
 ```
 
-After this we check `paranoid` and if it is set we check first three `CPL` bits. We compare it with the `3` and it allows us to know did we come from userspace or not:
+So the stack will look like this after execution of the `ALLOC_PT_GPREGS_ON_STACK`:
+
+```
+     +------------+
++160 | %SS        |
++152 | %RSP       |
++144 | %RFLAGS    |
++136 | %CS        |
++128 | %RIP       |
++120 | ERROR CODE |
+     |------------|
++112 |            |
++104 |            |
+ +96 |            |
+ +88 |            |
+ +80 |            |
+ +72 |            |
+ +64 |            |
+ +56 |            |
+ +48 |            |
+ +40 |            |
+ +32 |            |
+ +24 |            |
+ +16 |            |
+  +8 |            |
+  +0 |            | <- %RSP
+     +------------+
+```
+
+After we have allocated space for the general purpose registers, we do some checks to understand whether the exception came from userspace or not, and if it did, we should either move back to the interrupted process's stack or stay on the exception stack:
 
 ```assembly
 .if \paranoid
-  .if \paranoid == 1
-    CFI_REMEMBER_STATE
-	testl $3, CS(%rsp)
-	jnz 1f
-  .endif
-  call paranoid_entry
+    .if \paranoid == 1
+	    testb	$3, CS(%rsp)
+	    jnz	1f
+	.endif
+	call	paranoid_entry
 .else
-  call error_entry
+	call	error_entry
 .endif
 ```
 
-If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the:
+Let's consider each of these three cases in turn.
+
+An exception occurred in userspace
+--------------------------------------------------------------------------------
+
+First, let's consider the case when an exception has `paranoid=1`, like our `debug` and `int3` exceptions. In this case we check the selector from the `CS` segment register and jump to the `1f` label if we came from userspace; otherwise the `paranoid_entry` will be called.
+
+Let's consider the first case, when we came from userspace to an exception handler. As described above, we should jump to the `1` label. The `1` label starts with the call of the
+
+```assembly
+call	error_entry
+```
+
+routine which saves all general purpose registers in the previously allocated area on the stack:
 
 ```assembly
 SAVE_C_REGS 8
 SAVE_EXTRA_REGS 8
 ```
 
-from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack:
+Both of these macros are defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file and just move the values of the general purpose registers to a certain place on the stack, for example:
+
+```assembly
+.macro SAVE_EXTRA_REGS offset=0
+	movq %r15, 0*8+\offset(%rsp)
+	movq %r14, 1*8+\offset(%rsp)
+	movq %r13, 2*8+\offset(%rsp)
+	movq %r12, 3*8+\offset(%rsp)
+	movq %rbp, 4*8+\offset(%rsp)
+	movq %rbx, 5*8+\offset(%rsp)
+.endm
+```
+
+After the execution of `SAVE_C_REGS` and `SAVE_EXTRA_REGS` the stack will look like this:
+
+```
+     +------------+
++160 | %SS        |
++152 | %RSP       |
++144 | %RFLAGS    |
++136 | %CS        |
++128 | %RIP       |
++120 | ERROR CODE |
+     |------------|
++112 | %RDI       |
++104 | %RSI       |
+ +96 | %RDX       |
+ +88 | %RCX       |
+ +80 | %RAX       |
+ +72 | %R8        |
+ +64 | %R9        |
+ +56 | %R10       |
+ +48 | %R11       |
+ +40 | %RBX       |
+ +32 | %RBP       |
+ +24 | %R12       |
+ +16 | %R13       |
+  +8 | %R14       |
+  +0 | %R15       | <- %RSP
+     +------------+
+```
+
+After the kernel has saved the general purpose registers on the stack, we should check whether we came from userspace again with:
 
 ```assembly
-	movq %rsp,%rdi
-	call sync_regs
+testb	$3, CS+8(%rsp)
+jz	.Lerror_kernelspace
 ```
 
-We just save all registers to the `error_entry` in the `error_entry`, we put address of the `pt_regs` to the `rdi` and call `sync_regs` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c):
+because we may have a potential fault if, as described in the documentation, a truncated `%RIP` was reported. Anyway, in both cases the [SWAPGS](http://www.felixcloutier.com/x86/SWAPGS.html) instruction will be executed and the values of `MSR_KERNEL_GS_BASE` and `MSR_GS_BASE` will be swapped. From this moment the `%gs` register will point to the base address of kernel structures. So, the `SWAPGS` instruction is called and that was the main point of the `error_entry` routine.
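+
+Conceptually (and only conceptually - `SWAPGS` is a privileged instruction and the names below are made up), what happens is a swap of two MSR values:
+
+```C
+/* what SWAPGS does, expressed as a plain swap of the two MSRs */
+void conceptual_swapgs(unsigned long *gs_base, unsigned long *kernel_gs_base)
+{
+	unsigned long tmp = *gs_base;
+
+	*gs_base        = *kernel_gs_base;
+	*kernel_gs_base = tmp;
+}
+```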
+
+Now we can go back to the `idtentry` macro. We may see the following assembler code after the call of `error_entry`:
+
+```assembly
+movq	%rsp, %rdi
+call	sync_regs
+```
+
+Here we put the base address of the stack into the `%rdi` register, which will be the first argument (according to the [x86_64 ABI](https://www.uclibc.org/docs/psABI-x86_64.pdf)) of the `sync_regs` function, and call this function which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file:
 
 ```C
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
@@ -264,190 +370,136 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
 }
 ```
 
-This function switchs off the `IST` stack if we came from usermode. After this we switch on the stack which we got from the `sync_regs`:
+This function takes the result of the `task_pt_regs` macro which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) header file, stores it in the stack pointer and returns it. The `task_pt_regs` macro expands to an address derived from `thread.sp0`, which represents the pointer to the normal kernel stack:
 
-```assembly
-movq %rax,%rsp
-movq %rsp,%rdi
+```C
+#define task_pt_regs(tsk)       ((struct pt_regs *)(tsk)->thread.sp0 - 1)
 ```
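+
+A userspace sketch of this pointer arithmetic (the structure and sizes below are made up for illustration): casting the top of the stack to `struct pt_regs *` and subtracting `1` steps back by `sizeof(struct pt_regs)` bytes, so the returned pointer names the `pt_regs` area that ends exactly at `sp0`:
+
+```C
+#include <stdio.h>
+
+struct fake_pt_regs { unsigned long regs[21]; };  /* stand-in for struct pt_regs */
+
+int main(void)
+{
+	unsigned char stack[4096];
+	unsigned char *sp0 = stack + sizeof(stack);            /* top of the stack */
+	struct fake_pt_regs *regs = (struct fake_pt_regs *)sp0 - 1;
+
+	/* the pt_regs area occupies the top sizeof(*regs) bytes of the stack */
+	printf("pt_regs area: %p .. %p (%zu bytes)\n",
+	       (void *)regs, (void *)sp0, sizeof(*regs));
+	return 0;
+}
+```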
 
-and put pointer of the `pt_regs` again in the `rdi`, and in the last step we call an exception handler:
+As we came from userspace, this means that the exception handler will run in the real process context. After we get the stack pointer from `sync_regs` we switch stacks:
 
 ```assembly
-call \do_sym
+movq	%rax, %rsp
 ```
 
-So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way:
-
-```assembly
-ENTRY(paranoid_entry)
-	SAVE_C_REGS 8
-	SAVE_EXTRA_REGS 8
-	...
-	...
-	movl $MSR_GS_BASE,%ecx
-	rdmsr
-	testl %edx,%edx
-	js 1f	/* negative -> in kernel */
-	SWAPGS
-	...
-	...
-	ret
-END(paranoid_entry)
-```
+The last two steps before an exception handler calls the secondary handler are:
 
-If `edx` wll be negative, we are in the kernel mode. As we store all registers on the stack, check that we are in the kernel mode, we need to setup `IST` stack if it is set for a given exception, call an exception handler and restore the exception stack:
+1. Passing a pointer to the `pt_regs` structure, which contains the preserved general purpose registers, in the `%rdi` register:
 
 ```assembly
-	.if \shift_ist != -1
-	subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
-	.endif
-
-	call \do_sym
-
-	.if \shift_ist != -1
-	addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
-	.endif
+movq	%rsp, %rdi
 ```
 
-The last step when an exception handler will finish it's work all registers will be restored from the stack with the `RESTORE_C_REGS` and `RESTORE_EXTRA_REGS` macros and control will be returned an interrupted task. That's all. Now we know about preparation before an interrupt/exception handler will start to execute and we can go directly to the implementation of the handlers.
-
-Implementation of ainterrupts and exceptions handlers
---------------------------------------------------------------------------------
+as it will be passed as the first parameter of the secondary exception handler.
 
-Both handlers `do_debug` and `do_int3` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file and have two similar things: All interrupts/exceptions handlers marked with the `dotraplinkage` prefix that expands to the:
+2. Passing the error code in the `%rsi` register, as it will be the second argument of the exception handler, and setting it to `-1` on the stack for the same purpose as before - to prevent restart of a system call:
 
-```C
-#define dotraplinkage __visible
-#define __visible __attribute__((externally_visible))
+```assembly
+.if \has_error_code
+	movq	ORIG_RAX(%rsp), %rsi
+	movq	$-1, ORIG_RAX(%rsp)
+.else
+	xorl	%esi, %esi
+.endif
 ```
 
-which tells to compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). And also they takes two parameters:
-
-* pointer to the `pt_regs` structure which contains registers of the interrupted task;
-* error code.
+Additionally, you may see that we zero the `%esi` register above in the case where an exception does not provide an error code.
 
-First of all let's consider `do_debug` handler. This function starts from the getting previous state with the `ist_enter` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We call it because we need to know, did we come to the interrupt handler from the kernel mode or user mode.
+In the end we just call the secondary exception handler:
 
-```C
-prev_state = ist_enter(regs);
+```assembly
+call	\do_sym
 ```
 
-The `ist_enter` function returns previous state context state and executes a couple preprartions before we continue to handle an exception. It starts from the check of the previous mode with the `user_mode_vm` macro. It takes `pt_regs` structure which contains a set of registers of the interrupted task and returns `1` if we came from userspace and `0` if we came from kernel space. According to the previous mode we execute `exception_enter` if we are from the userspace or inform [RCU](https://en.wikipedia.org/wiki/Read-copy-update) if we are from krenel space:
+which:
 
 ```C
-...
-if (user_mode_vm(regs)) {
-	prev_state = exception_enter();
-} else {
-	rcu_nmi_enter();
-	prev_state = IN_KERNEL;
-}
-...
-...
-...
-return prev_state;
+dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
 ```
 
-After this we load the `DR6` debug registers to the `dr6` variable with the call of the `get_debugreg` macro from the [arch/x86/include/asm/debugreg.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/debugreg.h):
+will be for the `debug` exception and:
 
 ```C
-get_debugreg(dr6, 6);
-dr6 &= ~DR6_RESERVED;
+dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
 ```
 
-The `DR6` debug register is debug status register contains information about the reason for stopping the `#DB` or debug exception handler. After we loaded its value to the `dr6` variable we filter out all reserved bits (`4:12` bits). In the next step we check `dr6` register and previous state with the following `if` condition expression:
-
-```C
-if (!dr6 && user_mode_vm(regs))
-	user_icebp = 1;
-```
+will be for the `int 3` exception. In this part we will not see the implementations of the secondary handlers, because they are very specific, but we will see some of them in one of the next parts.
 
-If `dr6` does not show any reasons why we caught this trap we set `user_icebp` to one which means that user-code wants to get [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) signal. In the next step we check was it [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) trap and if yes we go to exit:
+We have just considered the first case, when an exception occurred in userspace. Let's consider the last two.
 
-```C
-if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
-	goto exit;
-```
+An exception with paranoid > 0 occurred in kernelspace
+--------------------------------------------------------------------------------
 
-After we did all these checks, we clear the `dr6` register, clear the `DEBUGCTLMSR_BTF` flag which provides single-step on branches debugging, set `dr6` register for the current thread and increase `debug_stack_usage` [per-cpu]([Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)) variable with the:
+In this case an exception occurred in kernelspace and the `idtentry` macro is defined with `paranoid=1` for this exception. This value of `paranoid` means that we should use the slower way that we saw in the beginning of this part to check whether we really came from kernelspace or not. The `paranoid_entry` routine allows us to know this:
 
-```C
-set_debugreg(0, 6);
-clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
-tsk->thread.debugreg6 = dr6;
-debug_stack_usage_inc();
+```assembly
+ENTRY(paranoid_entry)
+	cld
+	SAVE_C_REGS 8
+	SAVE_EXTRA_REGS 8
+	movl	$1, %ebx
+	movl	$MSR_GS_BASE, %ecx
+	rdmsr
+	testl	%edx, %edx
+	js	1f
+	SWAPGS
+	xorl	%ebx, %ebx
+1:	ret
+END(paranoid_entry)
 ```
 
-As we saved `dr6`, we can allow irqs:
+As you may see, this function does the same that we covered before. We use the second (slow) method to get information about the previous state of the interrupted task. As we have checked this and executed `SWAPGS` in the case that we came from userspace, we should do the same that we did before: we need to put a pointer to a structure which holds the general purpose registers into the `%rdi` register (which will be the first parameter of a secondary handler) and put the error code, if the exception provides it, into the `%rsi` register (which will be the second parameter of a secondary handler):
 
-```C
-static inline void preempt_conditional_sti(struct pt_regs *regs)
-{
-        preempt_count_inc();
-        if (regs->flags & X86_EFLAGS_IF)
-                local_irq_enable();
-}
+```assembly
+movq	%rsp, %rdi
+
+.if \has_error_code
+	movq	ORIG_RAX(%rsp), %rsi
+	movq	$-1, ORIG_RAX(%rsp)
+.else
+	xorl	%esi, %esi
+.endif
 ```
 
-more about `local_irq_enabled` and related stuff you can read in the second part about [interrupts handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html). In the next step we check the previous mode was [virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) and handle the trap:
+The last step before the secondary handler of an exception is called is the adjustment of the `IST` stack:
 
-```C
-if (regs->flags & X86_VM_MASK) {
-	handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
-	  preempt_conditional_cli(regs);
-      debug_stack_usage_dec();
-	  goto exit;
-}
-...
-...
-...
-exit:
-	ist_exit(regs, prev_state);
+```assembly
+.if \shift_ist != -1
+	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
+.endif
 ```
 
-If we came not from the virtual 8086 mode, we need to check `dr6` register and previous mode as we did it above. Here we check if step mode debugging is
-enabled and we are not from the user mode, we enabled step mode debugging in the `dr6` copy in the current thread, set `TIF_SINGLE_STEP` falg and re-enable [Trap flag](https://en.wikipedia.org/wiki/Trap_flag) for the user mode:
+You may remember that we passed the `shift_ist` as an argument of the `idtentry` macro. Here we check its value and if it is not equal to `-1`, we get the pointer to a stack from the `Interrupt Stack Table` by the `shift_ist` index and adjust it.
 
-```C
-if ((dr6 & DR_STEP) && !user_mode(regs)) {
-        tsk->thread.debugreg6 &= ~DR_STEP;
-        set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
-        regs->flags &= ~X86_EFLAGS_TF;
-}
+At the end of this second way we just call the secondary exception handler as we did before:
+
+```assembly
+call	\do_sym
 ```
 
-Then we get `SIGTRAP` signal code:
+The last method is similar to both previous ones, but the exception occurred with `paranoid=0` and we may use the fast method of determining where we came from.
 
-```C
-si_code = get_si_code(tsk->thread.debugreg6);
-```
+Exit from an exception handler
+--------------------------------------------------------------------------------
 
-and send it for user icebp traps:
+After the secondary handler finishes its work, we will return to the `idtentry` macro and the next step will be a jump to the `error_exit`:
 
-```C
-if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
-	send_sigtrap(tsk, regs, error_code, si_code);
-preempt_conditional_cli(regs);
-debug_stack_usage_dec();
-exit:
-	ist_exit(regs, prev_state);
+```assembly
+jmp	error_exit
 ```
 
-In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
-
-The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching.
+routine. The `error_exit` function is defined in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file and the main goal of this function is to know where we came from (userspace or kernelspace), execute `SWAPGS` depending on this, restore registers to their previous state and execute the `iret` instruction to transfer control to the interrupted task.
 
 That's all.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transfered to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff.
+It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into the preparation done before control is transferred to an exception handler and the implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme, will go further through the `setup_arch` function and will try to understand interrupt handling related stuff.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 22 - 22
interrupts/interrupts-4.md

@@ -4,12 +4,12 @@ Interrupts and Interrupt Handling. Part 4.
 Initialization of non-early interrupt gates
 --------------------------------------------------------------------------------
 
-This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
+This is the fourth part about interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw the first early `#DB` and `#BP` exception handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped right after the `early_trap_init` function that is called in the `setup_arch` function which is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from the place where we left off in the last part. The first thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
 
 Early page fault handler
 --------------------------------------------------------------------------------
 
-The `early_trap_pf_init` function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses `set_intr_gate` macro that filles [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry:
+The `early_trap_pf_init` function is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses the `set_intr_gate` macro that fills the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry:
 
 ```C
 void __init early_trap_pf_init(void)
@@ -99,7 +99,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 }
 ```
 
-This register contains a linear address which caused `page fault`. In the next step we make a call of the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/context_tracking.h). The `exception_enter` and `exception_exit` are functions from context tracking subsytem in the Linux kernel used by the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. Almost in the every exception handler we will see similar code:
+This register contains the linear address which caused the `page fault`. In the next step we make a call of the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/context_tracking.h). The `exception_enter` and `exception_exit` are functions from the context tracking subsystem in the Linux kernel used by the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. In almost every exception handler we will see similar code:
 
 ```C
 enum ctx_state prev_state;
@@ -110,7 +110,7 @@ prev_state = exception_enter();
 exception_exit(prev_state);
 ```
 
-The `exception_enter` function checks that `context tracking` is enabled with the `context_tracking_is_enabled` and if it is in enabled state, we get previous context with te `this_cpu_read` (more about `this_cpu_*` operations you can read in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls `context_tracking_user_exit` function which informs that Inform the context tracking that the processor is exiting userspace mode and entering the kernel:
+The `exception_enter` function checks whether `context tracking` is enabled with the `context_tracking_is_enabled` function and, if it is enabled, we get the previous context with `this_cpu_read` (more about `this_cpu_*` operations you can read in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls the `context_tracking_user_exit` function which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:
 
 ```C
 static inline enum ctx_state exception_enter(void)
@@ -142,7 +142,7 @@ And in the end we return previous context. Between the `exception_enter` and `ex
 __do_page_fault(regs, error_code, address);
 ```
 
-The `__do_page_fault` is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the bingging of the `__do_page_fault` we check state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. The `kmemcheck` detects warns about some uses of uninitialized memory. We need to check it because page fault can be caused by kmemcheck:
+The `__do_page_fault` function is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the beginning of `__do_page_fault` we check the state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. The `kmemcheck` detects and warns about some uses of uninitialized memory. We need to check it because a page fault can be caused by kmemcheck:
 
 ```C
 if (kmemcheck_active(regs))
@@ -150,7 +150,7 @@ if (kmemcheck_active(regs))
 	prefetchw(&mm->mmap_sem);
 ```
 
-After this we can see the call of the `prefetchw` which executes instruction with the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) which fetches [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!) to get exclusive [cache line](https://en.wikipedia.org/wiki/CPU_cache). The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that we got page fault not in the kernel space with the following conditiion:
+After this we can see the call of `prefetchw` which executes the instruction with the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) that, on processors with [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!), prefetches the [cache line](https://en.wikipedia.org/wiki/CPU_cache) in exclusive state. The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that the page fault did not happen in kernel space with the following condition:
 
 ```C
 if (unlikely(fault_in_kernel_space(address))) {
@@ -197,12 +197,12 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 }
 ```
 
-Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the approprite routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kerne, but will see it in the chapter about the [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
+Here we can see the `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If a condition is marked with `unlikely`, the compiler can put the code of the `false` branch right after the branch. Now let's get back to our address check. The comparison between the given address and `0x00007ffffffff000` lets us know whether the page fault occurred in kernel mode or in user mode. After this check the `__do_page_fault` routine will try to understand the problem that provoked the page fault exception and then will pass the address to the appropriate routine. It can be a `kmemcheck` fault, a spurious fault, a [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault, etc. We will not dive into the implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but we will see it in the chapter about [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
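
To make the `unlikely` hint from the paragraph above a bit more concrete, here is a small self-contained sketch (the boundary value is the one mentioned above; the function and variable names are invented only for this example):

```C
#include <stdio.h>

/* `unlikely` is commonly implemented via __builtin_expect(), which tells the
 * compiler which branch is the rare one so the hot path can be laid out first. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* illustrative check: addresses above the user space limit belong to the kernel */
static int in_kernel_space(unsigned long address)
{
	return address >= 0x00007ffffffff000UL;
}

int main(void)
{
	unsigned long fault_address = 0xffffffff81000000UL;

	if (unlikely(in_kernel_space(fault_address)))
		printf("page fault happened in kernel mode\n");   /* rare path   */
	else
		printf("page fault happened in user mode\n");     /* common path */

	return 0;
}
```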
 
 Back to start_kernel
 --------------------------------------------------------------------------------
 
-There are many different function calls after the `early_trap_pf_init` in the `setup_arch` function from different kernel subsystems, but there are no one interrupts and exceptions handling related. So, we have to go back where we came from - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first things after the `setup_arch` is the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function makes initialization of the remaining exceptions handlers (remember that we already setup 3 handlres for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts from the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture):
+There are many different function calls from different kernel subsystems after the `early_trap_pf_init` in the `setup_arch` function, but none of them is related to interrupts and exceptions handling. So, we have to go back where we came from - the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first thing after `setup_arch` is the `trap_init` function from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function makes initialization of the remaining exception handlers (remember that we already set up 3 handlers for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts with the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture):
 
 ```C
 #ifdef CONFIG_EISA
@@ -214,7 +214,7 @@ There are many different function calls after the `early_trap_pf_init` in the `s
 #endif
 ```
 
-Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represetns `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
+Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represents `EISA` support. Here we use the `early_ioremap` function to map the `I/O` memory on the page tables. We use the `readl` function to read the first `4` bytes from the mapped region and if they are equal to the `EISA` string we set `EISA_bus` to one. In the end we just unmap the previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
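
For reference, the body of the `CONFIG_EISA` block described above looks roughly like the following sketch (the probed address and exact details may differ between kernel versions):

```C
#ifdef CONFIG_EISA
	void __iomem *p = early_ioremap(0x0FFFD9, 4);   /* map 4 bytes of I/O memory */

	/* compare the mapped bytes with the "EISA" signature */
	if (readl(p) == 'E' + ('I' << 8) + ('S' << 16) + ('A' << 24))
		EISA_bus = 1;
	early_iounmap(p, 4);                            /* unmap the region */
#endif
```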
 
 After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
 
@@ -235,7 +235,7 @@ set_intr_gate(X86_TRAP_NM, device_not_available);
 Here we can see:
 
 * `#OF` or `Overflow` exception. This exception indicates that an overflow trap occurred when a special [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) instruction was executed;
-* `#BR` or `BOUND Range exceeded` exception. This exception indeicates that a `BOUND-range-exceed` fault occured when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed;
+* `#BR` or `BOUND Range exceeded` exception. This exception indicates that a `BOUND-range-exceed` fault occurred when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed;
 * `#UD` or `Invalid Opcode` exception. Occurs when a processor attempted to execute invalid or reserved [opcode](https://en.wikipedia.org/?title=Opcode), processor attempted to execute instruction with invalid operand(s) and etc;
 * `#NM` or `Device Not Available` exception. Occurs when the processor tries to execute `x87 FPU` floating point instruction while `EM` flag in the [control register](https://en.wikipedia.org/wiki/Control_register#CR0) `cr0` was set.
 
@@ -264,9 +264,9 @@ Here we can see setup for the following exception handlers:
 
 * `#CSO` or `Coprocessor Segment Overrun` - this exception indicates that math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) of an old processor detected a page or segment violation. Modern processors do not generate this exception
 * `#TS` or `Invalid TSS` exception - indicates that there was an error related to the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment).
-* `#NP` or `Segement Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register.
+* `#NP` or `Segment Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register.
 * `#SS` or `Stack Fault` exception indicates one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the `ss` register.
-* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-procetion exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more.
+* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-protection exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more.
 * `Spurious Interrupt` - a hardware interrupt that is unwanted.
 * `#MF` or `x87 FPU Floating-Point Error` exception caused when the [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) has detected a floating point error.
 * `#AC` or `Alignment Check` exception Indicates that the processor detected an unaligned memory operand when alignment checking was enabled.
@@ -329,7 +329,7 @@ __set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
 idt_descr.address = fix_to_virt(FIX_RO_IDT);
 ```
 
+and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that is defined in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function makes initialization of all the `per-cpu` state. In the beginning of `cpu_init` we do the following things: first of all we wait until the current cpu is initialized, then we call the `cr4_init_shadow` function which stores a shadow copy of the `cr4` control register for the current cpu and load the CPU microcode if needed with the following function calls:
+and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginning of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls:
 
 ```C
 wait_for_master_cpu(cpu);
@@ -337,20 +337,20 @@ cr4_init_shadow();
 load_ucode_ap();
 ```
 
-Next we get the `Task State Segement` for the current cpu and `orig_ist` structure which represents origin `Interrupt Stack Table` values with the:
+Next we get the `Task State Segment` for the current cpu and `orig_ist` structure which represents origin `Interrupt Stack Table` values with the:
 
 ```C
 t = &per_cpu(cpu_tss, cpu);
 oist = &per_cpu(orig_ist, cpu);
 ```
 
-As we got values of the `Task State Segement` and `Interrupt Stack Table` for the current processor, we clear following bits in the `cr4` control register:
+As we got values of the `Task State Segment` and `Interrupt Stack Table` for the current processor, we clear following bits in the `cr4` control register:
 
 ```C
 cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
 ```
 
-with this we disable `vm86` extension, virtual interrupts, timestamp ([RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can only be executed with the highest privilege) and debug extension. After this we reload the `Glolbal Descripto Table` and `Interrupt Descriptor table` with the:
+with this we disable the `vm86` extension, virtual interrupts, timestamp ([RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can only be executed with the highest privilege) and debug extension. After this we reload the `Global Descriptor Table` and the `Interrupt Descriptor Table` with:
 
 ```C
 	switch_to_new_gdt(cpu);
@@ -358,7 +358,7 @@ with this we disable `vm86` extension, virtual interrupts, timestamp ([RDTSC](ht
 	load_current_idt();
 ```
 
-After this we setup array of the Thread-Local Storage Descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load CPU microcode. Now is time to setup and load `per-cpu` Task State Segements. We are going in a loop through the all exception stack which is `N_EXCEPTION_STACKS` or `4` and fill it with `Interrupt Stack Tables`:
+After this we set up the array of Thread-Local Storage Descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load the CPU microcode. Now it is time to set up and load the `per-cpu` Task State Segments. We go in a loop through all the exception stacks (there are `N_EXCEPTION_STACKS`, or `4`, of them) and fill them with `Interrupt Stack Tables`:
 
 ```C
 	if (!oist->ist[0]) {
@@ -374,7 +374,7 @@ After this we setup array of the Thread-Local Storage Descriptors, configure [NX
 	}
 ```
 
-As we have filled `Task State Segements` with the `Interrupt Stack Tables` we can set `TSS` descriptor for the current processor and load it with the:
+As we have filled `Task State Segments` with the `Interrupt Stack Tables` we can set `TSS` descriptor for the current processor and load it with the:
 
 ```C
 set_tss_desc(cpu, t);
@@ -421,18 +421,18 @@ set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
 #endif
 ```
 
-Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segement` was not ready yet, but now it is ready after the call of the `cpu_init` function.
+Here we copy the `idt_table` to the `nmi_idt_table` and set up exception handlers for the `#DB` or `Debug` exception and the `#BP` or `Breakpoint` exception. You can remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to set them up again? We set them up again because when we initialized them before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
 
 That's all. Soon we will consider all handlers of these interrupts/exceptions.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) in this part and initialization of the different interrupt handlers as `Divide Error`, `Page Fault` excetpion and etc. You can noted that we saw just initialization stuf, and will dive into details about handlers for these exceptions. In the next part we will start to do it.
+It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. In this part we saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) and the initialization of different interrupt handlers such as `Divide Error`, the `Page Fault` exception, etc. You may have noted that we saw just the initialization stuff; we will dive into the details of the handlers for these exceptions in the next part.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 12 - 12
interrupts/interrupts-5.md

@@ -62,7 +62,7 @@ native_irq_return_iret:
 iretq
 ```
 
-More about the `idtentry` macro you can read in the thirt part of the [http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers:
+More about the `idtentry` macro you can read in the third part of the [http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) chapter. Ok, we have now seen the preparation before an exception handler is executed and it is time to look at the handlers themselves. First of all let's look at the following handlers:
 
 * divide_error
 * overflow
@@ -93,7 +93,7 @@ As we can see the `DO_ERROR` macro takes 4 parameters:
 * String which describes an exception;
 * Exception handler entry point.
 
-This macro defined in the same souce code file and expands to the function with the `do_handler` name:
+This macro is defined in the same source code file and expands to a function with the `do_handler` name:
 
 ```C
 #define DO_ERROR(trapnr, signr, str, name)                              \
@@ -192,7 +192,7 @@ static ATOMIC_NOTIFIER_HEAD(die_chain);
 return atomic_notifier_call_chain(&die_chain, val, &args);
 ```
 
-which just expands to the `atomit_notifier_head` structure that contains lock and `notifier_block`:
+which just expands to the `atomic_notifier_head` structure that contains lock and `notifier_block`:
 
 ```C
 struct atomic_notifier_head {
@@ -211,7 +211,7 @@ static inline void conditional_sti(struct pt_regs *regs)
 }
 ```
 
-more about `local_irq_enable` macro you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `trak_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function:
+more about the `local_irq_enable` macro you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defines the `tsk` variable which has the `task_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function:
 
 ```C
 struct task_struct *tsk = current;
@@ -280,7 +280,7 @@ This is the end of the `do_trap`. We just saw generic implementation for eight d
 Double fault
 --------------------------------------------------------------------------------
 
-The next exception is `#DF` or `Double fault`. This exception occurrs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:
+The next exception is `#DF` or `Double fault`. This exception occurs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:
 
 ```C
 set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
@@ -292,14 +292,14 @@ Note that this exception runs on the `DOUBLEFAULT_STACK` [Interrupt Stack Table]
 #define DOUBLEFAULT_STACK 1
 ```
 
-The `double_fault` is handler for this exception and defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `double_fault` handler starts from the definition of two variables: string that describes excetpion and interrupted process, as other exception handlers:
+The `double_fault` is the handler for this exception and it is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `double_fault` handler starts with the definition of two variables: a string that describes the exception and the interrupted process, as in other exception handlers:
 
 ```C
 static const char str[] = "double fault";
 struct task_struct *tsk = current;
 ```
 
-The handler of the double fault exception splitted on two parts. The first part is the check which checks that a fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment. The `espfix` feature solves this problem. So if the `non-IST` fault on the espfix64 stack we modify the stack to make it look like `General Protection Fault`:
+The handler of the double fault exception is split into two parts. The first part is the check which checks whether the fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment. The `espfix` feature solves this problem. So if we got a `non-IST` fault on the espfix64 stack we modify the stack to make it look like a `General Protection Fault`:
 
 ```C
 struct pt_regs *normal_regs = task_pt_regs(current);
@@ -311,13 +311,13 @@ regs->sp = (unsigned long)&normal_regs->orig_ax;
 return;
 ```
 
-In the second case we do almost the same that we did in the previous excetpion handlers. The first is the call of the `ist_enter` function that discards previous context, `user` in our case:
+In the second case we do almost the same as we did in the previous exception handlers. The first thing is the call of the `ist_enter` function that discards the previous context, `user` in our case:
 
 ```C
 ist_enter(regs);
 ```
 
-And after this we fill the interrupted process with the vector number of the `Double fault` excetpion and error code as we did it in the previous handlers:
+And after this we fill the interrupted process with the vector number of the `Double fault` exception and error code as we did it in the previous handlers:
 
 ```C
 tsk->thread.error_code = error_code;
@@ -348,7 +348,7 @@ The next exception is the `#NM` or `Device not available`. The `Device not avail
 
 * The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87) floating-point instruction while the EM flag in [control register](https://en.wikipedia.org/wiki/Control_register) `cr0` was set;
 * The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set;
-* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` falg in control register `cr0` was set and the `EM` flag is clear.
+* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` flag in control register `cr0` was set and the `EM` flag is clear.
 
 The handler of the `Device not available` exception is the `do_device_not_available` function and it is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file too. It starts and ends with getting the previous context, like the other traps which we saw in the beginning of this part:
 
@@ -465,9 +465,9 @@ Conclusion
 
 It is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 6 - 6
interrupts/interrupts-6.md

@@ -16,7 +16,7 @@ in this part. So, let's start.
 Non-Maskable interrupt handling
 --------------------------------------------------------------------------------
 
-A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignore by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
+A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
 
 * External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
 * The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
@@ -83,7 +83,7 @@ first_nmi:
 	pushq	$1
 ```
 
-Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On the [x86_64](https://en.wikipedia.org/wiki/X86-64), like other architectures, the CPU will not execute another `NMI` until the first `NMI` is complete. A `NMI` interrupt finished with the [iret](http://faydoc.tripod.com/cpu/iret.htm) instruction like other interrupts and exceptions do it. If the `NMI` handler triggers either a [page fault](https://en.wikipedia.org/wiki/Page_fault) or [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) or another exception which are use `iret` instruction too. If this happens while in `NMI` context, the CPU will leave `NMI` context and a new `NMI` may come in. The `iret` used to return from those exceptions will re-enable `NMIs` and we will get nested non-maskable interrupts. The problem the `NMI` handler will not return to the state that it was, when the exception triggered, but instead it will return to a state that will allow new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first NMI handler is complete, the new NMI will write all over the preempted `NMIs` stack. We can have nested `NMIs` where the next `NMI` is using the top of the stack of the previous `NMI`. It means that we cannot execute it because a nested non-maskable interrupt will corrupt stack of a previous non-maskable interrupt. That's why we have allocated space on the stack for temporary variable. We will check this variable that it was set when a previous `NMI` is executing and clear if it is not nested `NMI`. We push `1` here to the previously allocated space on the stack to denote that a `non-maskable` interrupt executed currently. Remember that when and `NMI` or another exception occurs we have the following [stack frame](https://en.wikipedia.org/wiki/Call_stack):
+Why do we push `1` on the stack? As the comment says: `We allow breakpoints in NMIs`. On the [x86_64](https://en.wikipedia.org/wiki/X86-64), like on other architectures, the CPU will not execute another `NMI` until the first `NMI` is completed. An `NMI` interrupt finishes with the [iret](http://faydoc.tripod.com/cpu/iret.htm) instruction like other interrupts and exceptions do. The `NMI` handler may trigger a [page fault](https://en.wikipedia.org/wiki/Page_fault), a [breakpoint](https://en.wikipedia.org/wiki/Breakpoint) or another exception, which uses the `iret` instruction too. If this happens while in `NMI` context, the CPU will leave `NMI` context and a new `NMI` may come in. The `iret` used to return from those exceptions will re-enable `NMIs` and we will get nested non-maskable interrupts. The problem is that the `NMI` handler will not return to the state that it was in when the exception triggered, but instead it will return to a state that will allow new `NMIs` to preempt the running `NMI` handler. If another `NMI` comes in before the first NMI handler is complete, the new NMI will write all over the preempted `NMI's` stack. We can have nested `NMIs` where the next `NMI` is using the top of the stack of the previous `NMI`. It means that we cannot execute it because a nested non-maskable interrupt will corrupt the stack of a previous non-maskable interrupt. That's why we have allocated space on the stack for a temporary variable. We will check whether this variable was set when a previous `NMI` was executing and clear it if it is not a nested `NMI`. We push `1` here to the previously allocated space on the stack to denote that a `non-maskable` interrupt is currently executing. Remember that when an `NMI` or another exception occurs we have the following [stack frame](https://en.wikipedia.org/wiki/Call_stack):
 
 ```
 +------------------------+
@@ -215,7 +215,7 @@ movq	$-1, %rsi
 call	do_nmi
 ```
 
-We will back to the `do_nmi` little later in this part, but now let's look what occurs after the `do_nmi` will finish its execution. After the `do_nmi` handler will be finished we check the `cr2` register, because we can got page fault during `do_nmi` performed and if we got it we restore original `cr2`, in other way we jump on the label `1`. After this we test content of the `ebx` register (remember it must contain `0` if we have used `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore general purpose registers, clear allocated space on the stack for this registers clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:
+We will get back to `do_nmi` a little later in this part, but now let's look at what occurs after `do_nmi` finishes its execution. After the `do_nmi` handler has finished we check the `cr2` register, because we could have got a page fault while `do_nmi` was running; if we got one, we restore the original `cr2`, otherwise we jump to the label `1`. After this we test the content of the `ebx` register (remember it must contain `0` if we have used the `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore the general purpose registers, clear the allocated space on the stack for these registers, clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:
 
 ```assembly
 	movq	%cr2, %rcx
@@ -290,7 +290,7 @@ handled = nmi_handle(NMI_LOCAL, regs, b2b);
 __this_cpu_add(nmi_stats.normal, handled);
 ```
 
-And than non-specific `NMIs` depends on its reason:
+And then non-specific `NMIs` are handled depending on their reason:
 
 ```C
 reason = x86_platform.get_nmi_reason();
@@ -448,9 +448,9 @@ Conclusion
 
 It is the end of the sixth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we saw the implementation of some exception handlers in this part, like the `non-maskable` interrupt, the [SIMD](https://en.wikipedia.org/wiki/SIMD) and the [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exceptions. Finally we have finished with the `trap_init` function in this part and will go ahead in the next part. Our next point is the external interrupts and the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 16 - 16
interrupts/interrupts-7.md

@@ -4,9 +4,9 @@ Interrupts and Interrupt Handling. Part 7.
 Introduction to external interrupts
 --------------------------------------------------------------------------------
 
-This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive to the interrupt handling and will start with the external handware interrupt handling. As you can remember, in the previous part we have finsihed with the `trap_init` function from the [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
+This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you can remember, in the previous part we have finished with the `trap_init` function from [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
 
-Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by a hardware or software. External hardware interrupts allow devices like keyboard, mouse and etc, to indicate that it needs attention of the processor. Once the processor receives the `Interrupt Request`, it will temporary stop execution of the running program and invoke special routine which depends on an interrupt. We already know that this routine is called interrupt handler (or how we will call it `ISR` or `Interrupt Service Routine` from this part). The `ISR` or `Interrupt Handler Routine` can be found in Interrupt Vector table that is located at fixed address in the memory. After the interrupt is handled processor resumes the interrupted process. At the boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by the sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why kernel is can handle an exception quickly. Unfortunatelly we can not use this approach for the external handware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt:
+Interrupts are signals that are sent across an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by hardware or software. External hardware interrupts allow devices like the keyboard, mouse, etc., to indicate that they need the attention of the processor. Once the processor receives the `Interrupt Request`, it will temporarily stop execution of the running program and invoke a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler (or, as we will call it from this part on, an `ISR` or `Interrupt Service Routine`). The `ISR` or `Interrupt Handler Routine` can be found in the Interrupt Vector table that is located at a fixed address in memory. After the interrupt is handled the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why the kernel can handle an exception quickly. Unfortunately we can not use this approach for external hardware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt:
 
 * `I/O` interrupts;
 * Timer interrupts;
@@ -14,17 +14,17 @@ Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/I
 
 I will try to describe all types of interrupts in this book.
 
-Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For exmaple in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI) bus architecture several devices may share the same `IRQ` line. In the simplest way the Linux kernel must do following thing when an `I/O` interrupt occured:
+Generally, a handler of an `I/O` interrupt must be flexible enough to service several devices at the same time. For example in the [PCI](https://en.wikipedia.org/wiki/Conventional_PCI) bus architecture several devices may share the same `IRQ` line. In the simplest case the Linux kernel must do the following things when an `I/O` interrupt occurs (a simplified sketch follows the list):
 
 * Save the value of an `IRQ` and the register's contents on the kernel stack;
 * Send an acknowledgment to the hardware controller which is servicing the `IRQ` line;
 * Execute the interrupt service routine (next we will call it `ISR`) which is associated with the device;
 * Restore registers and return from an interrupt;
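
The four steps above can be sketched in kernel-style pseudo code like this (purely illustrative -- this is not the kernel's real `do_IRQ` path, which also does `irq_enter()`/`irq_exit()`, statistics and more):

```C
static void example_handle_io_interrupt(unsigned int irq, struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);  /* 1. remember the saved registers          */

	ack_APIC_irq();                                 /* 2. acknowledge the interrupt controller  */
	generic_handle_irq(irq);                        /* 3. run the ISR bound to this IRQ line    */

	set_irq_regs(old_regs);                         /* 4. restore and return from the interrupt */
}
```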
 
-Ok, we know a little theory and now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function make early initialziation of the `irq_desc` structure. The `irq_desc` structure is the foundation of interrupt management code in the Linux kernel. An array of this structure, which has the same name - `irq_desc`, keeps track of every interrupt request source in the Linux kernel. This structure defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h) and as you can note it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. This kernel configuration option enables support for sparse irqs. The `irq_desc` structure contains many different fiels:
+Ok, we know a little theory and now let's start with the `early_irq_init` function. The implementation of the `early_irq_init` function is in [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function makes early initialization of the `irq_desc` structure. The `irq_desc` structure is the foundation of the interrupt management code in the Linux kernel. An array of this structure, which has the same name - `irq_desc`, keeps track of every interrupt request source in the Linux kernel. This structure is defined in [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h) and as you can note it depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. This kernel configuration option enables support for sparse irqs. The `irq_desc` structure contains many different fields:
 
 * `irq_common_data` - per irq and chip data passed down to chip functions;
-* `status_use_accessors` - contains status of the interrupt source which is can be combination of of the values from the `enum` from the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) and different macros which are defined in the same source code file;
+* `status_use_accessors` - contains the status of the interrupt source which is a combination of the values from the `enum` from the [include/linux/irq.h](https://github.com/torvalds/linux/blob/master/include/linux/irq.h) and different macros which are defined in the same source code file;
 * `kstat_irqs` - irq stats per-cpu;
 * `handle_irq` - highlevel irq-events handler;
 * `action` - identifies the interrupt service routines to be invoked when the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) occurs;
@@ -113,7 +113,7 @@ static void __init init_irq_default_affinity(void)
 #endif
 ```
 
-We know that when a hardware, such as disk controller or keyboard, needs attention from the processor, it throws an interrupt. The interrupt tells to the processor that something has happened and that the processor should interrupt current process and handle an incoming event. In order to prevent mutliple devices from sending the same interrupts, the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) system was established where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you control how your system will respond to various hardware events (that's why it has certain implementation only if the `CONFIG_SMP` kernel configuration option is set). After we allocated `irq_default_affinity` cpumask, we can see `printk` output:
+We know that when a hardware device, such as a disk controller or keyboard, needs attention from the processor, it throws an interrupt. The interrupt tells the processor that something has happened and that the processor should interrupt the current process and handle an incoming event. In order to prevent multiple devices from sending the same interrupts, the [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) system was established where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. The Linux kernel can assign certain `IRQs` to specific processors. This is known as `SMP IRQ affinity`, and it allows you to control how your system will respond to various hardware events (that's why it has a certain implementation only if the `CONFIG_SMP` kernel configuration option is set). After we have allocated the `irq_default_affinity` cpumask, we can see the `printk` output:
 
 ```C
 printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
@@ -126,7 +126,7 @@ which prints `NR_IRQS`:
 [    0.000000] NR_IRQS:4352
 ```
 
-The `NR_IRQS` is the maximum number of the `irq` descriptors or in another words maximum number of interrupts. Its value depends on the state of the `COFNIG_X86_IO_APIC` kernel configuration option. If the `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses an old [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) chip, the `NR_IRQS` is:
+The `NR_IRQS` is the maximum number of the `irq` descriptors or in another words maximum number of interrupts. Its value depends on the state of the `CONFIG_X86_IO_APIC` kernel configuration option. If the `CONFIG_X86_IO_APIC` is not set and the Linux kernel uses an old [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) chip, the `NR_IRQS` is:
 
 ```C
 #define NR_IRQS_LEGACY                    16
@@ -168,7 +168,7 @@ In the first case (`CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT`), the `NR_IRQS` wil
 [    0.000000] NR_IRQS:4352
 ```
 
-In the next step we assign array of the IRQ descriptors to the `irq_desc` variable which we defined in the start of the `early_irq_init` function and cacluate count of the `irq_desc` array with the `ARRAY_SIZE` macro:
+In the next step we assign array of the IRQ descriptors to the `irq_desc` variable which we defined in the start of the `early_irq_init` function and calculate count of the `irq_desc` array with the `ARRAY_SIZE` macro:
 
 ```C
 desc = irq_desc;
@@ -221,7 +221,7 @@ cpu3 26648 8 6931 678891 414 0 244 0 0 0
 ...
 ```
 
-Where the sixth column is the servicing interrupts. After this we allocate [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. After this before the [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be aqcuired with a call of the `raw_spin_lock` and unlocked with the call of the `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which set the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in the separate chapter.
+Where the sixth column is the time spent servicing interrupts. After this we allocate a [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. Before a [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be acquired with a call of `raw_spin_lock` and unlocked with a call of `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which sets the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in a separate chapter.
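
As a rough sketch of the locking pattern just described (assuming a descriptor with a `raw_spinlock_t` field named `lock`, as `irq_desc` has):

```C
raw_spin_lock_init(&desc->lock);                        /* initialize the per-descriptor spinlock */
lockdep_set_class(&desc->lock, &irq_desc_lock_class);   /* register the lock's lockdep class      */

raw_spin_lock(&desc->lock);                             /* enter the critical section             */
/* ... modify fields of the interrupt descriptor ... */
raw_spin_unlock(&desc->lock);                           /* leave the critical section             */
```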
 
 In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c). This function takes four parameters:
 
@@ -301,7 +301,7 @@ In the end of the `early_irq_init` function we return the return value of the `a
 return arch_early_irq_init();
 ```
 
-This function defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it make a check of the number of the legacy interrupts wit the call of the `nr_legacy_irqs` function. If we have no lagacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to the `0xffffffffffffffff`: 
+This function is defined in [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it makes a check of the number of legacy interrupts with the call of the `nr_legacy_irqs` function. If we have no legacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to `0xffffffffffffffff`: 
 
 ```C
 if (!nr_legacy_irqs())
@@ -330,7 +330,7 @@ That's all.
 Sparse IRQs
 --------------------------------------------------------------------------------
 
-We already saw in the beginning of this part that implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set, not let's look on the its implementation when this option is set. Implementation of this function very similar, but little differ. We can see the same definition of variables and call of the `init_irq_default_affinity` in the beginning of the `early_irq_init` function:
+We already saw in the beginning of this part that the implementation of the `early_irq_init` function depends on the `CONFIG_SPARSE_IRQ` kernel configuration option. Previously we saw the implementation of the `early_irq_init` function when the `CONFIG_SPARSE_IRQ` configuration option is not set, now let's look at its implementation when this option is set. The implementation of this function is very similar, but differs a little. We can see the same definition of variables and the call of the `init_irq_default_affinity` in the beginning of the `early_irq_init` function:
 
 ```C
 #ifdef CONFIG_SPARSE_IRQ
@@ -367,7 +367,7 @@ if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
 nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
 ```
 
-Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represnters it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
+Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
 
 After this we update the `nr` depending on the value of the `gsi_top`:
 
@@ -413,7 +413,7 @@ if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
     initcnt = IRQ_BITMAP_BITS;
 ```
 
-where `IRQ_BITMAP_BITS` is equal to the `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and `NR_IRQS + 8196` in other way. In the next step we are going over all interrupt descript which need to be allocated in the loop and allocate space for the descriptor and insert to the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html):
+where `IRQ_BITMAP_BITS` is equal to `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and to `NR_IRQS + 8196` otherwise. In the next step we go over all interrupt descriptors which need to be allocated in the loop, allocate space for each descriptor and insert it into the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html):
 
 ```C
 for (i = 0; i < initcnt; i++) {
@@ -434,11 +434,11 @@ That's all.
 Conclusion
 --------------------------------------------------------------------------------
 
-It is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we started to dive into external hardware interrupts in this part. We saw early initialization of the `irq_desc` structure which represents description of an external interrupt and contains information about it like list of irq actions, information about interrupt handler, interrupts's owner, count of the unhandled interrupt and etc. In the next part we will continue to research external interrupts.
+It is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we started to dive into external hardware interrupts in this part. We saw the early initialization of the `irq_desc` structure which represents the description of an external interrupt and contains information about it like the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts, etc. In the next part we will continue to research external interrupts.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 17 - 17
interrupts/interrupts-8.md

@@ -6,7 +6,7 @@ Non-early initialization of the IRQs
 
 This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked at the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/master/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Recall that the `irq_desc` structure (defined in [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/master/include/linux/irqdesc.h#L46)) is the foundation of interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.
 
-Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specfic and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c). The `init_IRQ` function makes initialization of the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable that defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) source code file:
+Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c). The `init_IRQ` function initializes the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable that is defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/kernel/irqinit.c) source code file:
 
 ```C
 ...
@@ -66,7 +66,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 }
 ```
 
-Why is `legacy` here? Actuall all interrupts handled by the modern [IO-APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) controller. But these interrupts (from `0x30` to `0x3f`) by legacy interrupt-controllers like [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller). If these interrupts are handled by the `I/O APIC` then this vector space will be freed and re-used. Let's look on this code closer. First of all the `nr_legacy_irqs` defined in the [arch/x86/include/asm/i8259.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/i8259.h) and just returns the `nr_legacy_irqs` field from the `legacy_pic` strucutre:
+Why is `legacy` here? Actually all interrupts are handled by the modern [IO-APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller#I.2FO_APICs) controller. But these interrupts (from `0x30` to `0x3f`) were handled by legacy interrupt-controllers like the [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller). If these interrupts are handled by the `I/O APIC` then this vector space will be freed and re-used. Let's look at this code more closely. First of all the `nr_legacy_irqs` is defined in the [arch/x86/include/asm/i8259.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/i8259.h) and just returns the `nr_legacy_irqs` field from the `legacy_pic` structure:
 
 ```C
 static inline int nr_legacy_irqs(void)
@@ -91,7 +91,7 @@ struct legacy_pic {
 };
 ```
 
-Actuall default maximum number of the legacy interrupts represtented by the `NR_IRQ_LEGACY` macro from the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h):
+Actually, the default maximum number of legacy interrupts is represented by the `NR_IRQS_LEGACY` macro from the [arch/x86/include/asm/irq_vectors.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_vectors.h):
 
 ```C
 #define NR_IRQS_LEGACY                    16
@@ -107,7 +107,7 @@ In the loop we are accessing the `vecto_irq` per-cpu array with the `per_cpu` ma
 
 Why is `0x30` here? You may remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter that the first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill the `vector_irq` starting from `IRQ0_VECTOR`, which is equal to `0x30`, for the `16` legacy interrupts.
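
 Just to make this mapping concrete, the loop described above looks roughly like the following sketch (the exact body differs between kernel versions - some store the irq number, others a pointer to the interrupt descriptor):
 
 ```C
 /* simplified sketch of the legacy vector setup done in init_IRQ() */
 int i;
 
 for (i = 0; i < nr_legacy_irqs(); i++)               /* 16 legacy ISA irqs */
 	per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;  /* vectors 0x30..0x3f */
 ```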
 
-In the end of the `init_IRQ` functio we can see the call of the following function:
+At the end of the `init_IRQ` function we can see the call of the following function:
 
 ```C
 x86_init.irqs.intr_init();
@@ -161,7 +161,7 @@ $ cat /proc/interrupts
   8:          1          0          0          0          0          0          0          0   IO-APIC   8-edge      rtc0
 ```
 
-look on the last columnt;
+look at the last column;
 
 * `(*irq_mask)(struct irq_data *data)`  - mask an interrupt source;
 * `(*irq_ack)(struct irq_data *data)` - start of a new interrupt;
@@ -200,7 +200,7 @@ and writing it with the help of the `apic_write` function:
 apic_write(APIC_SPIV, value);
 ```
 
-After we have enabled `APIC` for the bootstrap processor, we return to the `init_ISA_irqs` function and in the next step we initalize legacy `Programmable Interrupt Controller` and set the legacy chip and handler for the each legacy irq:
+After we have enabled the `APIC` for the bootstrap processor, we return to the `init_ISA_irqs` function and in the next step we initialize the legacy `Programmable Interrupt Controller` and set the legacy chip and handler for each legacy irq:
 
 ```C
 legacy_pic->init(0);
@@ -229,9 +229,9 @@ struct legacy_pic default_legacy_pic = {
 }
 ```
 
-The `init_8259A` function defined in the same source code file and executes initialization of the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) ``Programmable Interrupt Controller` (more about it will be in the separate chapter abot `Programmable Interrupt Controllers` and `APIC`).
+The `init_8259A` function is defined in the same source code file and executes initialization of the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) `Programmable Interrupt Controller` (more about it will be in a separate chapter about `Programmable Interrupt Controllers` and `APIC`).
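
 Just to give an idea of what such an initialization involves, the classic remapping sequence for the master `8259A` looks roughly like this sketch (the real `init_8259A` also programs the slave chip, the interrupt masks and more):
 
 ```C
 /* rough sketch of remapping the master 8259A PIC; illustrative only */
 outb(0x11, 0x20);         /* ICW1: start initialization of the master PIC */
 outb(IRQ0_VECTOR, 0x21);  /* ICW2: IRQ0..7 -> vectors 0x30..0x37          */
 outb(0x04, 0x21);         /* ICW3: a slave PIC is cascaded on the IRQ2    */
 outb(0x01, 0x21);         /* ICW4: 8086/88 mode                           */
 ```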
 
-Now we can return to the `native_init_IRQ` function, after the `init_ISA_irqs` function finished its work. The next step is the call of the `apic_intr_init` function that allocates special interrupt gates which are used by the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) architecture for the [Inter-processor interrupt](https://en.wikipedia.org/wiki/Inter-processor_interrupt). The `alloc_intr_gate` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) used for the interrupt descriptor allocation allocation:
+Now we can return to the `native_init_IRQ` function, after the `init_ISA_irqs` function has finished its work. The next step is the call of the `apic_intr_init` function that allocates special interrupt gates which are used by the [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) architecture for the [Inter-processor interrupts](https://en.wikipedia.org/wiki/Inter-processor_interrupt). The `alloc_intr_gate` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) is used for the interrupt descriptor allocation:
 
 ```C
 #define alloc_intr_gate(n, addr)                        \
@@ -326,7 +326,7 @@ int main() {
 }
 ```
 
-and will look on the assembly output of our example we will see followig assembly code:
+and look at the assembly output of our example, we will see the following assembly code:
 
 ```assembly
 pushq	%rbp
@@ -351,7 +351,7 @@ movl	%eax, %edi
 call	variable_test_bit
 ```
 
-for the `variable_test_bit`. These two code listings starts with the same part, first of all we save base of the current stack frame in the `%rbp` register. But after this code for both examples is different. In the first example we put `$268435456` (here the `$268435456` is our second parameter - `0x10000000`) to the `esi` and `$25` (our first parameter) to the `edi` register and call `constant_test_bit`. We put functuin parameters to the `esi` and `edi` registers because as we are learning Linux kernel for the `x86_64` architecture we use `System V AMD64 ABI` [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions). All is pretty simple. When we are using predifined constant, the compiler can just substitute its value. Now let's look on the second part. As you can see here, the compiler can not substitute value from the `nr` variable. In this case compiler must calcuate its offset on the programm's [stack frame](https://en.wikipedia.org/wiki/Call_stack). We substract `16` from the `rsp` register to allocate stack for the local variables data and put the `$24` (value of the `nr` variable) to the `rbp` with offset `-4`. Our stack frame will be like this:
+for the `variable_test_bit`. These two code listings start with the same part: first of all we save the base of the current stack frame in the `%rbp` register. But after this, the code for the two examples is different. In the first example we put `$268435456` (here the `$268435456` is our second parameter - `0x10000000`) to the `esi` and `$25` (our first parameter) to the `edi` register and call `constant_test_bit`. We put function parameters to the `esi` and `edi` registers because as we are learning the Linux kernel for the `x86_64` architecture we use the `System V AMD64 ABI` [calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions). All is pretty simple. When we are using a predefined constant, the compiler can just substitute its value. Now let's look at the second part. As you can see here, the compiler can not substitute the value of the `nr` variable. In this case the compiler must calculate its offset in the program's [stack frame](https://en.wikipedia.org/wiki/Call_stack). We subtract `16` from the `rsp` register to allocate stack for the local variables data and put the `$24` (value of the `nr` variable) to the `rbp` with offset `-4`. Our stack frame will be like this:
 
 ```
          <- stack grows 
@@ -369,7 +369,7 @@ for the `variable_test_bit`. These two code listings starts with the same part,
 
 After this we put this value to the `eax`, so the `eax` register now contains the value of `nr`. In the end we do the same as in the first example: we put the `$268435456` (the first parameter of the `variable_test_bit` function) and the value of the `eax` (the value of `nr`) to the `edi` register (the second parameter of the `variable_test_bit` function).
 
-The next step after the `apic_intr_init` function will finish its work is the setting interrup gates from the `FIRST_EXTERNAL_VECTOR` or `0x20` to the `0x256`:
+The next step after the `apic_intr_init` function finishes its work is setting the interrupt gates from the `FIRST_EXTERNAL_VECTOR` or `0x20` up to `NR_VECTORS` which is `256`:
 
 ```C
 i = FIRST_EXTERNAL_VECTOR;
@@ -392,7 +392,7 @@ for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
 #endif
 ```
 
-Where the `spurious_interrupt` function represent interrupt handler fro the `spurious` interrupt. Here the `used_vectors` is the `unsigned long` that contains already initialized interrupt gates. We already filled first `32` interrupt vectors in the `trap_init` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file:
+Where the `spurious_interrupt` function represents the interrupt handler for the `spurious` interrupt. Here the `used_vectors` is a bitmap that contains the already initialized interrupt gates. We already filled the first `32` interrupt vectors in the `trap_init` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file:
 
 ```C
 for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
@@ -414,7 +414,7 @@ First of all let's deal with the condition. The `acpi_ioapic` variable represent
 #define acpi_ioapic 0
 ```
 
-The second condition - `!of_ioapic && nr_legacy_irqs()` checks that we do not use [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) `I/O APIC` and legacy interrupt controller. We already know about the `nr_legacy_irqs`. The second is `of_ioapic` variable defined in the [arch/x86/kernel/devicetree.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/devicetree.c) and initialized in the `dtb_ioapic_setup` function that build information about `APICs` in the [devicetree](https://en.wikipedia.org/wiki/Device_tree). Note that `of_ioapic` variable depends on the `CONFIG_OF` Linux kernel configuration opiotn. If this option is not set, the value of the `of_ioapic` will be zero too:
+The second condition - `!of_ioapic && nr_legacy_irqs()` checks that we do not use the [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) `I/O APIC` and that the legacy interrupt controller provides interrupts. We already know about the `nr_legacy_irqs`. The `of_ioapic` variable is defined in the [arch/x86/kernel/devicetree.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/devicetree.c) and initialized in the `dtb_ioapic_setup` function that builds information about `APICs` in the [devicetree](https://en.wikipedia.org/wiki/Device_tree). Note that the `of_ioapic` variable depends on the `CONFIG_OF` Linux kernel configuration option. If this option is not set, the value of the `of_ioapic` will be zero too:
 
 ```C
 #ifdef CONFIG_OF
@@ -430,7 +430,7 @@ extern int of_ioapic;
 #endif
 ```
 
-If the condition will return non-zero vaule we call the:
+If the condition returns a non-zero value, we call:
 
 ```C
 setup_irq(2, &irq2);
@@ -446,7 +446,7 @@ static struct irqaction irq2 = {
 };
 ```
 
-Some time ago interrupt controller consisted of two chips and one was connected to second. The second chip that was connected to the first chip via this `IRQ 2` line. This chip serviced lines from `8` to `15` and after after this lines of the first chip. So, for example [Intel 8259A](https://en.wikipedia.org/wiki/Intel_8259) has following lines:
+Some time ago the interrupt controller consisted of two chips where one was connected to the other. The second chip was connected to the first chip via this `IRQ 2` line and serviced lines from `8` to `15`; the remaining lines belonged to the first chip. So, for example, the [Intel 8259A](https://en.wikipedia.org/wiki/Intel_8259) has the following lines:
 
 * `IRQ 0`  - system time;
 * `IRQ 1`  - keyboard;
@@ -513,9 +513,9 @@ It is the end of the eighth part of the [Interrupts and Interrupt Handling](http
 
 In the next part we will continue to learn about interrupt handling related stuff and will see the initialization of `softirqs`.
 
-If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
+If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 37 - 26
interrupts/interrupts-9.md

@@ -4,32 +4,32 @@ Interrupts and Interrupt Handling. Part 9.
 Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
 --------------------------------------------------------------------------------
 
-It is the ninth part of the [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) book and in the previous [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html) we saw implementation of the `init_IRQ` from that defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) source code file. So, we will continue to dive into the initialization stuff which is related to the external hardware interrupts in this part.
-
-After the `init_IRQ` function we can see the call of the `softirq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). This function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) source code file and as we can understand from its name, this function makes initialization of the `softirq` or in other words initialization of the `deferred interrupts`. What is it deferreed intrrupt? We already saw a little bit about it in the ninth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) of the chapter that describes initialization process of the Linux kernel. There are three types of `deffered interrupts` in the Linux kernel:
-
-* `softirqs`;
-* `tasklets`;
-* `workqueues`;
-
-And we will see description of all of these types in this part. As I said, we saw only a little bit about this theme, so, now is time to dive deep into details about this theme.
-
-Deferred interrupts
-----------------------------------------------------------------------------------
+It is the ninth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html) we saw the implementation of the `init_IRQ` function defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irqinit.c) source code file. So, we will continue to dive into the initialization stuff which is related to the external hardware interrupts in this part.
 
 Interrupts have many important characteristics and two of them are:
 
 * The handler of an interrupt must execute quickly;
 * Sometimes an interrupt handler must do a large amount of work.
 
-As you can understand, it is almost impossible to make so that both characteristics were valid. Because of these, previously the handling of interrupts was splitted into two parts:
+As you can understand, it is almost impossible to satisfy both requirements at the same time. Because of this, the handling of interrupts was previously split into two parts:
 
 * Top half;
 * Bottom half;
 
-Once the Linux kernel was one of the ways the organization postprocessing, and which was called: `the bottom half` of the processor, but now it is already not actual. Now this term has remained as a common noun referring to all the different ways of organizing deffered processing of an interrupt. With the advent of parallelisms in the Linux kernel, all new schemes of implementation of the bottom half handlers are built on the performance of the processor specific kernel thread that called `ksoftirqd` (will be discussed below). The `softirq` mechanism represents handling of interrupts that are `almost` as important as the handling of the hardware interrupts. The deferred processing of an interrupt suggests that some of the actions for an interrupt may be postponed to a later execution when the system will be less loaded. As you can suggests, an interrupt handler can do large amount of work that is impermissible as it executes in the context where interrupts are disabled. That's why processing of an interrupt can be splitted on two different parts. In the first part, the main handler of an interrupt does only minimal and the most important job. After this it schedules the second part and finishes its work. When the system is less busy and context of the processor allows to handle interrupts, the second part starts its work and finishes to process remaing part of a deferred interrupt. That is main explanation of the deferred interrupt handling.
+In the past the Linux kernel had one way of organizing such postprocessing, which was called `the bottom half`, but it is no longer used in that form. Now this term remains as a common noun referring to all the different ways of organizing deferred processing of an interrupt. The deferred processing of an interrupt suggests that some of the actions for an interrupt may be postponed to a later execution when the system is less loaded. As you can guess, an interrupt handler can do a large amount of work that is impermissible as it executes in the context where interrupts are disabled. That's why processing of an interrupt can be split into two different parts. In the first part, the main handler of an interrupt does only the minimal and most important job. After this it schedules the second part and finishes its work. When the system is less busy and the context of the processor allows to handle interrupts, the second part starts its work and finishes processing the remaining part of the deferred interrupt.
 
-As I already wrote above, handling of deferred interrupts (or `softirq` in other words) and accordingly `tasklets` is performed by a set of the special kernel threads (one thread per processor). Each processor has its own thread that is called `ksoftirqd/n` where the `n` is the number of the processor. We can see it in the output of the `systemd-cgls` util:
+There are three types of `deferred interrupts` in the Linux kernel:
+
+* `softirqs`;
+* `tasklets`;
+* `workqueues`;
+
+And we will see a description of all of these types in this part. As I said, we have seen only a little bit about this theme so far, so now it is time to dive deep into the details.
+
+Softirqs
+----------------------------------------------------------------------------------
+
+With the advent of parallelism in the Linux kernel, all new schemes of implementation of the bottom half handlers are built on top of a per-processor kernel thread called `ksoftirqd` (it will be discussed below). Each processor has its own thread that is called `ksoftirqd/n` where `n` is the number of the processor. We can see it in the output of the `systemd-cgls` util:
 
 ```
 $ systemd-cgls -k | grep ksoft
@@ -49,7 +49,7 @@ The `spawn_ksoftirqd` function starts this these threads. As we can see this fun
 early_initcall(spawn_ksoftirqd);
 ```
 
-Deferred interrupts are determined statically at compile-time of the Linux kernel and the `open_softirq` function takes care of `softirq` initialization. The `open_softirq` function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c):
+Softirqs are determined statically at compile-time of the Linux kernel and the `open_softirq` function takes care of `softirq` initialization. The `open_softirq` function is defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c):
 
 
 ```C
@@ -61,7 +61,7 @@ void open_softirq(int nr, void (*action)(struct softirq_action *))
 
 and as we can see this function takes two parameters:
 
-* the index of the `softirq_vec` array; 
+* the index of the `softirq_vec` array;
 * a pointer to the softirq function to be executed;
 
 First of all let's look at the `softirq_vec` array:
@@ -145,7 +145,7 @@ The `raise_softirq_irqoff` function marks the softirq as deffered by setting the
 __raise_softirq_irqoff(nr);
 ```
 
-macro. After this, it checks the result of the `in_interrupt` that returns `irq_count` value. We already saw the `irq_count` in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter and it is used to check if a CPU is already on an interrupt stack or not. We just exit from the `raise_softirq_irqoff`, restore `IF` flang and enable interrupts on the local processor, if we are in the interrupt context, otherwise  we call the `wakeup_softirqd`:
+macro. After this, it checks the result of the `in_interrupt` which returns the `irq_count` value. We already saw the `irq_count` in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter and it is used to check if a CPU is already on an interrupt stack or not. If we are in the interrupt context, we just exit from the `raise_softirq_irqoff`, restore the `IF` flag and enable interrupts on the local processor, otherwise we call the `wakeup_softirqd`:
 
 ```C
 if (!in_interrupt())
@@ -164,7 +164,7 @@ static void wakeup_softirqd(void)
 }
 ```
 
-Each `ksoftirqd` kernel thread runs the `run_ksoftirqd` function that checks existence of deferred interrupts and calls the `__do_softirq` function depends on result. This function reads the `__softirq_pending` softirq bit mask of the local processor and executes the deferrable functions corresponding to every bit set. During execution of a deferred function, new pending `softirqs` might occur. The main problem here that execution of the userspace code can be delayed for a long time while the `__do_softirq` function will handle deferred interrupts. For this purpose, it has the limit of the time when it must be finsihed:
+Each `ksoftirqd` kernel thread runs the `run_ksoftirqd` function that checks for the existence of deferred interrupts and calls the `__do_softirq` function depending on the result. This function reads the `__softirq_pending` softirq bit mask of the local processor and executes the deferrable functions corresponding to every bit set. During execution of a deferred function, new pending `softirqs` might occur. The main problem here is that execution of the userspace code can be delayed for a long time while the `__do_softirq` function handles deferred interrupts. For this purpose, it has a limit of the time when it must be finished:
 
 ```C
 unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -196,7 +196,7 @@ if (!in_interrupt() && local_softirq_pending())
     invoke_softirq();
 ```
 
-that executes the `__do_softirq` too. So what do we have in summary. Each `softirq` goes through the following stages: Registration of a `softirq` with the `open_softirq` function. Activation of a `softirq` by marking it as deferred with the `raise_softirq` function. After this, all marked `softirqs` will be runned in the next time the Linux kernel schedules a round of executions of deferrable functions. And execution of the deferred functions that have the same type.
+that executes the `__do_softirq` too. So what do we have in summary? Each `softirq` goes through the following stages: registration of a `softirq` with the `open_softirq` function, activation of a `softirq` by marking it as deferred with the `raise_softirq` function, after which all marked `softirqs` will be run the next time the Linux kernel schedules a round of executions of deferrable functions, and finally the execution of the deferred functions that have the same type.
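
 A minimal sketch of these stages, using the timer subsystem as an illustration (only the shape of the API calls is shown here, the real code differs in details):
 
 ```C
 /* stage 1: registration, done once at initialization time
  * (this is roughly how the timer subsystem registers its softirq) */
 open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
 
 /* stage 2: activation, typically from the hardware interrupt path */
 raise_softirq(TIMER_SOFTIRQ);
 
 /* stage 3: run_timer_softirq() is executed the next time pending softirqs
  * are processed - on interrupt exit or in the ksoftirqd/n thread */
 ```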
 
 As I already wrote, the `softirqs` are statically allocated and it is a problem for a kernel module that can be loaded. The second concept that built on top of `softirq` -- the `tasklets` solves this problem.
 
@@ -286,7 +286,7 @@ open_softirq(TASKLET_SOFTIRQ, tasklet_action);
 open_softirq(HI_SOFTIRQ, tasklet_hi_action);
 ```
 
-at the end of the `softirq_init` function. The main purpose of the `open_softirq` function is the initalization of `softirq`. Let's look on the implementation of the `open_softirq` function.
+at the end of the `softirq_init` function. The main purpose of the `open_softirq` function is the initialization of a `softirq`. Let's look at the implementation of the `open_softirq` function.
 
 In our case they are: the `tasklet_action` and the `tasklet_hi_action`. In other words, the `softirq` function associated with the `HI_SOFTIRQ` softirq is named `tasklet_hi_action` and the `softirq` function associated with the `TASKLET_SOFTIRQ` is named `tasklet_action`. The Linux kernel provides an API for manipulating `tasklets`. First of all it is the `tasklet_init` function that takes a `tasklet_struct`, a function and a parameter for it and initializes the given `tasklet_struct` with the given data:
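
 A typical usage of this API looks roughly like the following sketch (the handler, variable and function names here are hypothetical, not kernel code):
 
 ```C
 #include <linux/interrupt.h>
 
 /* hypothetical handler: the deferred work runs here in softirq context */
 static void my_tasklet_fn(unsigned long data)
 {
 }
 
 static struct tasklet_struct my_tasklet;
 
 /* the "top half": defer the heavy part of the work to the tasklet */
 static irqreturn_t my_irq_handler(int irq, void *dev_id)
 {
         tasklet_schedule(&my_tasklet);
         return IRQ_HANDLED;
 }
 
 static int __init my_driver_init(void)
 {
         /* initialize the tasklet once at driver initialization time */
         tasklet_init(&my_tasklet, my_tasklet_fn, 0);
         return 0;
 }
 ```
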
 
@@ -364,7 +364,7 @@ static void tasklet_action(struct softirq_action *a)
 }
 ```
 
-In the beginning of the `tasketl_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter). In the next step, we take a head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL` because all tasklets must be executed in a generaly way. After this we enable interrupts for the local processor and go through the list of taklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet that updates state of the given tasklet on `TASKLET_STATE_RUN`: 
+In the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL` because all tasklets must be executed in a general way. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet that updates the state of the given tasklet to `TASKLET_STATE_RUN`:
 
 ```C
 static inline int tasklet_trylock(struct tasklet_struct *t)
@@ -460,13 +460,24 @@ static inline bool queue_work(struct workqueue_struct *wq,
 }
 ```
 
-The `queue_work` function just calls the `queue_work_on` function that queue work on specific processor. Note that in our case we pass the `WORK_STRUCT_PENDING_BIT` to the `queue_work_on` function. It is a part of the `enum` that is defined in the [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h) and represents workqueue which are not bound to any specific processor. The `queue_work_on` function tests and set the `WORK_STRUCT_PENDING_BIT` bit of the given `work` and executes the `__queue_work` function with the `workqueue` for the given processor and given `work`:
+The `queue_work` function just calls the `queue_work_on` function that queues work on a specific processor. Note that in our case we pass the `WORK_CPU_UNBOUND` to the `queue_work_on` function. It is a part of the `enum` that is defined in the [include/linux/workqueue.h](https://github.com/torvalds/linux/blob/master/include/linux/workqueue.h) and represents a workqueue which is not bound to any specific processor. The `queue_work_on` function tests and sets the `WORK_STRUCT_PENDING_BIT` bit of the given `work` and executes the `__queue_work` function with the `workqueue` for the given processor and given `work`:
 
 ```C
-__queue_work(cpu, wq, work);
+bool queue_work_on(int cpu, struct workqueue_struct *wq,
+           struct work_struct *work)
+{
+    bool ret = false;
+    ...
+    if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+        __queue_work(cpu, wq, work);
+        ret = true;
+    }
+    ...
+    return ret;
+}
 ```
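
 For reference, a minimal usage sketch of the workqueue API from a driver's point of view (the handler and work item names are hypothetical):
 
 ```C
 #include <linux/workqueue.h>
 
 /* runs later in process context, in a kworker thread */
 static void my_work_handler(struct work_struct *work)
 {
 }
 
 static DECLARE_WORK(my_work, my_work_handler);
 
 /* typically called from an interrupt handler or other atomic context;
  * schedule_work() uses the system workqueue, queue_work() takes a
  * workqueue created with alloc_workqueue() */
 schedule_work(&my_work);
 ```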
 
-The `__queue_work` function gets the `work pool`. Yes, the `work pool` not `workqueue`. Actually, all `works` are not placed in the `workqueue`, but to the `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field which is list of `worker_pools`. When we create a `workqueue`, it stands out for each processor the `pool_workqueue`. Each `pool_workqueue` associated with `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them `workqueue` interacts with `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with the `raw_smp_processor_id` (you can find information about this marco in the fouth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter), getting the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` to the given `workqueue`:
+The `__queue_work` function gets the `work pool`. Yes, the `work pool`, not the `workqueue`. Actually, `works` are not placed in the `workqueue`, but in the `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field which is a list of `worker_pools`. When we create a `workqueue`, a `pool_workqueue` is set up for each processor. Each `pool_workqueue` is associated with a `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them the `workqueue` interacts with the `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with the `raw_smp_processor_id` (you can find information about this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter), get the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` into the given `workqueue`:
 
 ```C
 static void __queue_work(int cpu, struct workqueue_struct *wq,
@@ -501,7 +512,7 @@ The next part will be last part of the `Interrupts and Interrupt Handling` chapt
 
 If you have any questions or suggestions, write me a comment or ping me at [twitter](https://twitter.com/0xAX).
 
-**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 1 - 0
mm/README.md

@@ -5,3 +5,4 @@ couple of posts which describe different parts of the linux memory management fr
 
 * [Memblock](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-1.md) - describes early `memblock` allocator.
 * [Fix-Mapped Addresses and ioremap](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) - describes `fix-mapped` addresses and early `ioremap`.
+* [kmemcheck](https://github.com/0xAX/linux-insides/blob/master/mm/mm-3.md) - describes the `kmemcheck` tool.

+ 42 - 37
mm/linux-mm-1.md

@@ -4,13 +4,13 @@ Linux kernel memory management Part 1.
 Introduction
 --------------------------------------------------------------------------------
 
-Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No compilcated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
+Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before the call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember how we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
 
 Memblock
 --------------------------------------------------------------------------------
 
 Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and
-running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now time to get acquainted with it closer. We will see how it is implemented.
+running yet. Previously it was called `Logical Memory Block`, but with a [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to `memblock`. The Linux kernel for the `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now it's time to get better acquainted with it. We will see how it is implemented.
 
 We will start to learn `memblock` from its data structures. Definitions of all the data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file.
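
 The top-level `memblock` structure itself looks roughly like the following simplified sketch (see the header file for the exact definition):
 
 ```C
 struct memblock {
 	bool bottom_up;                  /* is bottom up direction?  */
 	phys_addr_t current_limit;       /* limit for allocations    */
 	struct memblock_type memory;     /* usable memory regions    */
 	struct memblock_type reserved;   /* reserved memory regions  */
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 	struct memblock_type physmem;    /* all physical memory      */
 #endif
 };
 ```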
 
@@ -39,7 +39,7 @@ struct memblock_type {
 };
 ```
 
-This structure provides information about memory type. It contains fields which describe the number of memory regions which are inside the current memory block, the size of all memory regions, the size of the allocated array of the memory regions and pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is:
+This structure provides information about the memory type. It contains fields which describe the number of memory regions which are inside the current memory block, the size of all memory regions, the size of the allocated array of the memory regions and a pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is:
 
 ```C
 struct memblock_region {
@@ -52,15 +52,18 @@ struct memblock_region {
 };
 ```
 
-`memblock_region` provides base address and size of the memory region, flags which can be:
+`memblock_region` provides the base address and size of the memory region as well as a flags field which can have the following values:
 
 ```C
-#define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
-#define MEMBLOCK_ALLOC_ACCESSIBLE	0
-#define MEMBLOCK_HOTPLUG	0x1
+enum {
+    MEMBLOCK_NONE	= 0x0,	/* No special request */
+    MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
+    MEMBLOCK_MIRROR	= 0x2,	/* mirrored region */
+    MEMBLOCK_NOMAP	= 0x4,	/* don't add to kernel direct mapping */
+};
 ```
 
-Also `memblock_region` provides integer field - [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector, if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled.
+Also `memblock_region` provides an integer field - [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector, if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled.
 
 Schematically we can imagine it as:
 
@@ -69,7 +72,7 @@ Schematically we can imagine it as:
 |         memblock          |   |                           |
 |  _______________________  |   |                           |
 | |        memory         | |   |       Array of the        |
-| |      memblock_type    |-|-->|      membock_region       |
+| |      memblock_type    |-|-->|      memblock_region      |
 | |_______________________| |   |                           |
 |                           |   +---------------------------+
 |  _______________________  |   +---------------------------+
@@ -85,7 +88,7 @@ These three structures: `memblock`, `memblock_type` and `memblock_region` are ma
 Memblock initialization
 --------------------------------------------------------------------------------
 
-As all API of the `memblock` described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, all implementation of these function is in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look at the top of the source code file and we will see the initialization of the `memblock` structure:
+As the whole API of the `memblock` is described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, all implementations of these functions are in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look at the top of the source code file and we will see the initialization of the `memblock` structure:
 
 ```C
 struct memblock memblock __initdata_memblock = {
@@ -107,7 +110,7 @@ struct memblock memblock __initdata_memblock = {
 };
 ```
 
-Here we can see initialization of the `memblock` structure which has the same name as structure - `memblock`. First of all note on `__initdata_memblock`. Defenition of this macro looks like:
+Here we can see the initialization of a variable of the `memblock` structure which has the same name as the structure - `memblock`. First of all note the `__initdata_memblock`. The definition of this macro looks like:
 
 ```C
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
@@ -119,9 +122,9 @@ Here we can see initialization of the `memblock` structure which has the same na
 #endif
 ```
 
-You can note that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put to the `.init` section and it will be released after the kernel is booted up.
+You can see that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put into the `.init` section and will be released after the kernel is booted up.
 
-Next we can see initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field initialized by the arrays of the `memblock_region`:
+Next we can see the initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field is initialized by an array of `memblock_region`s:
 
 ```C
 static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
@@ -137,7 +140,7 @@ Every array contains 128 memory regions. We can see it in the `INIT_MEMBLOCK_REG
 #define INIT_MEMBLOCK_REGIONS   128
 ```
 
-Note that all arrays are also defined with the `__initdata_memblock` macro which we already saw in the `memblock` strucutre initialization (read above if you've forgot).
+Note that all arrays are also defined with the `__initdata_memblock` macro which we already saw in the `memblock` structure initialization (read above if you've forgotten).
 
 The last two fields describe that `bottom_up` allocation is disabled and the limit of the current Memblock is:
 
@@ -147,20 +150,20 @@ The last two fields describe that `bottom_up` allocation is disabled and the lim
 
 which is `0xffffffffffffffff`.
 
-On this step initialization of the `memblock` structure finished and we can look on the Memblock API.
+At this point the initialization of the `memblock` structure has been finished and we can have a look at the Memblock API.
 
 Memblock API
 --------------------------------------------------------------------------------
 
-Ok we have finished with initilization of the `memblock` structure and now we can look on the Memblock API and its implementation. As I said above, all implementation of the `memblock` presented in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and is implemented, let's look at its usage first of all. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example let's take `memblock_x86_fill` function from the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by the [e820](http://en.wikipedia.org/wiki/E820) and adds memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. As we met `memblock_add` function first, let's start from it.
+Ok, we have finished with the initialization of the `memblock` structure and now we can look at the Memblock API and its implementation. As I said above, the implementation of `memblock` lives entirely in [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and how it is implemented, let's look at its usage first. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example let's take the `memblock_x86_fill` function from the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by the [e820](http://en.wikipedia.org/wiki/E820) and adds memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. Since we have met the `memblock_add` function first, let's start from it.
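
 To get a feeling for the interface before diving into the implementation, here is a tiny usage sketch (the addresses and sizes below are made up):
 
 ```C
 /* register a range of usable RAM (base, size) ... */
 memblock_add(0x100000, 0x3ff00000);      /* 1 MiB .. ~1 GiB is usable RAM  */
 
 /* ... and carve out a piece that must not be handed out to anyone */
 memblock_reserve(0x1000000, 0x200000);   /* keep 2 MiB at 16 MiB untouched */
 ```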
 
-This function takes physical base address and size of the memory region and adds it to the `memblock`. `memblock_add` function does not do anything special in its body, but just calls:
+This function takes a physical base address and the size of the memory region as arguments and adds them to the `memblock`. The `memblock_add` function does not do anything special in its body, but just calls the:
 
 ```C
 memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
 ```
 
-function. We pass memory block type - `memory`, physical base address and size of the memory region, maximum number of nodes which are zero if `CONFIG_NODES_SHIFT` is not set in the configuration file or `CONFIG_NODES_SHIFT` if it is set, and flags. The `memblock_add_range` function adds new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks for existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add new memory region to the `memblock` with the given `memblock_type`.
+function. We pass the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks the existence of memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memblock_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`.
 
 First of all we get the end of the memory region with the:
 
@@ -177,12 +180,12 @@ static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
 }
 ```
 
-`memblock_cap_size` returns new size which is the smallest value between the given size and `ULLONG_MAX - base`.
+`memblock_cap_size` returns the new size which is the smallest value between the given size and `ULLONG_MAX - base`.
 
-After that we have the end address of the new memory region, `memblock_add_range` checks overlap and merge conditions with already added memory regions. Insertion of the new memory region to the `memblcok` consists of two steps:
+Now that we have the end address of the new memory region, `memblock_add_range` checks for overlap and merge conditions with memory regions that have been added before. Insertion of the new memory region into the `memblock` consists of two steps:
 
-* Adding of non-overlapping parts of the new memory area as separate regions;  
-* Merging of all neighbouring regions.
+* Adding of non-overlapping parts of the new memory area as separate regions;
+* Merging of all neighboring regions.
 
 We are going through all the already stored memory regions and checking for overlap with the new region:
 
@@ -202,7 +205,7 @@ We are going through all the already stored memory regions and checking for over
 	}
 ```
 
-If the new memory region does not overlap regions which are already stored in the `memblock`, insert this region into the memblock with and this is first step, we check that new region can fit into the memory block and call `memblock_double_array` in other way:
+If the new memory region does not overlap with regions which are already stored in the `memblock`, we insert this region into the memblock - and this is the first step. Before the insertion we check whether the new region can fit into the memory block and call `memblock_double_array` if it cannot:
 
 ```C
 while (type->cnt + nr_new > type->max)
@@ -223,19 +226,19 @@ while (type->cnt + nr_new > type->max)
 	}
 ```
 
-As we set `insert` to `true` in the first step, now `memblock_insert_region` will be called. `memblock_insert_region` has almost the same implementation that we saw when we insert new region to the empty `memblock_type` (see above). This function gets the last memory region:
+Since we set `insert` to `true` in the first step, `memblock_insert_region` will now be called. `memblock_insert_region` has almost the same implementation as the one we saw when we inserted a new region into the empty `memblock_type` (see above). This function gets the last memory region:
 
 ```C
 struct memblock_region *rgn = &type->regions[idx];
 ```
 
-and copies memory area with `memmove`:
+and copies the memory area with `memmove`:
 
 ```C
 memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
 ```
 
-After this fills `memblock_region` fields of the new memory region base, size and etc... and increase size of the `memblock_type`. In the end of the execution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step.
+After this it fills the `memblock_region` fields of the new memory region (base, size, etc.) and increases the size of the `memblock_type`. At the end of the execution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step.
 
 In the second case the new memory region can overlap already stored regions. For example we already have `region1` in the `memblock`:
 
@@ -279,7 +282,7 @@ if (base < end) {
 }
 ```
 
-In this case we insert `overlapping portion` (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion and merge these portions with `memblock_merge_regions`. As I said above `memblock_merge_regions` function merges neighboring compatible regions. It goes through the all memory regions from the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` and checks that these regions have the same flags, belong to the same node and that end address of the first regions is not equal to the base address of the second region:
+In this case we insert the `overlapping portion` (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion and merge these portions with `memblock_merge_regions`. As I said above the `memblock_merge_regions` function merges neighboring compatible regions. It goes through all memory regions from the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` and checks that these regions have the same flags, belong to the same node and that the end address of the first region is equal to the base address of the second region:
 
 ```C
 while (i < type->cnt - 1) {
@@ -295,19 +298,19 @@ while (i < type->cnt - 1) {
 	}
 ```
 
-If none of these conditions are not true, we update the size of the first region with the size of the next region:
+If all of these conditions are true, we update the size of the first region with the size of the next region:
 
 ```C
 this->size += next->size;
 ```
 
-As we update the size of the first memory region with the size of the next memory region, we copy every (in the loop) memory region which is after the current (`this`) memory region to the one index ago with the `memmove` function:
+As we update the size of the first memory region with the size of the next memory region, we move all memory regions which are after the (`next`) memory region one index backwards with the `memmove` function:
 
 ```C
 memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
 ```
 
-And decrease the count of the memory regions which are belongs to the `memblock_type`:
+The `memmove` here moves all regions which are located after the `next` region to the base address of the `next` region. In the end we just decrease the count of the memory regions which belong to the `memblock_type`:
 
 ```C
 type->cnt--;
@@ -326,11 +329,13 @@ After this we will get two memory regions merged into one:
 +------------------------------------------------+
 ```
 
+In this way we have decreased the count of regions in a memblock of a certain type, increased the size of the `this` region and shifted all regions which are located after the `next` region to its place.
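+
+To make this bookkeeping more concrete, here is a small user-space model of the same idea (only a sketch, not kernel code: the `region` type, the array size and the addresses are made up). It inserts a region by shifting the tail of the array with `memmove` and then merges neighbours whose ranges touch:
+
+```C
+#include <stdio.h>
+#include <string.h>
+
+struct region { unsigned long base, size; };
+
+static struct region regions[8] = {
+        { 0x1000, 0x1000 },     /* 0x1000 - 0x2000 */
+        { 0x4000, 0x1000 },     /* 0x4000 - 0x5000 */
+};
+static int cnt = 2;
+
+static void insert_region(int idx, unsigned long base, unsigned long size)
+{
+        /* shift everything at and after idx one slot to the right ... */
+        memmove(&regions[idx + 1], &regions[idx],
+                (cnt - idx) * sizeof(regions[0]));
+        /* ... then fill the freed slot and account for the new region */
+        regions[idx].base = base;
+        regions[idx].size = size;
+        cnt++;
+}
+
+static void merge_regions(void)
+{
+        int i = 0;
+
+        while (i < cnt - 1) {
+                struct region *this = &regions[i], *next = &regions[i + 1];
+
+                if (this->base + this->size != next->base) {
+                        i++;
+                        continue;
+                }
+                this->size += next->size;
+                /* pull the tail one slot to the left, over the merged entry */
+                memmove(next, next + 1, (cnt - (i + 2)) * sizeof(*next));
+                cnt--;
+        }
+}
+
+int main(void)
+{
+        insert_region(1, 0x2000, 0x2000);   /* fills the 0x2000 - 0x4000 gap */
+        merge_regions();
+
+        for (int i = 0; i < cnt; i++)
+                printf("region %d: %#lx - %#lx\n", i,
+                       regions[i].base, regions[i].base + regions[i].size);
+        return 0;
+}
+```
+
+Running it prints a single merged `0x1000 - 0x5000` region, which is exactly the picture above.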
+
 That's all. This is the whole principle of the work of the `memblock_add_range` function.
 
-There is also `memblock_reserve` function which does the same as `memblock_add`, but only with one difference. It stores `memblock_type.reserved` in the memblock instead of `memblock_type.memory`.
+There is also `memblock_reserve` function which does the same as `memblock_add`, but with one difference. It stores `memblock_type.reserved` in the memblock instead of `memblock_type.memory`.
 
-Of course this is not the full API. Memblock provides an API for not only adding `memory` and `reserved` memory regions, but also:
+Of course this is not the full API. Memblock provides APIs not only for adding `memory` and `reserved` memory regions, but also:
 
 * memblock_remove - removes memory region from memblock;
 * memblock_find_in_range - finds free area in given range;
@@ -342,7 +347,7 @@ and many more....
 Getting info about memory regions
 --------------------------------------------------------------------------------
 
-Memblock also provides an API for getting information about allocated memory regions in the `memblcok`. It is split in two parts:
+Memblock also provides an API for getting information about allocated memory regions in the `memblock`. It is split in two parts:
 
 * get_allocated_memblock_memory_regions_info - getting info about memory regions;
 * get_allocated_memblock_reserved_regions_info - getting info about reserved regions.
@@ -394,20 +399,20 @@ And you will see something like this:
 
 ![Memblock](http://oi57.tinypic.com/1zoj589.jpg)
 
-Memblock has also support in [debugfs](http://en.wikipedia.org/wiki/Debugfs). If you run kernel not in `X86` architecture you can access:
+Memblock also has support in [debugfs](http://en.wikipedia.org/wiki/Debugfs). If you run the kernel on an architecture other than `X86` you can access:
 
 * /sys/kernel/debug/memblock/memory
 * /sys/kernel/debug/memblock/reserved
 * /sys/kernel/debug/memblock/physmem
 
-for getting dump of the `memblock` contents.
+to get a dump of the `memblock` contents.
 
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
+This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 47 - 48
mm/linux-mm-2.md

@@ -4,7 +4,7 @@ Linux kernel memory management Part 2.
 Fix-Mapped Addresses and ioremap
 --------------------------------------------------------------------------------
 
-`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical address do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`:
+`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`:
 
 ```assembly
 NEXT_PAGE(level2_fixmap_pgt)
@@ -16,14 +16,13 @@ NEXT_PAGE(level1_fixmap_pgt)
 	.fill	512,8,0
 ```
 
-As you can see `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from the [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example it contains entries for `VSYSCALL_PAGE` - if emulation of legacy vsyscall page is enabled, `FIX_APIC_BASE` for local [apic](h
-ttp://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and etc... In a virtual memory fix-mapped area is placed in the modules area:
+As you can see `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from the [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example it contains entries for `VSYSCALL_PAGE` - if emulation of legacy vsyscall page is enabled, `FIX_APIC_BASE` for the local [apic](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller), etc. In virtual memory, the fix-mapped area is placed in the modules area:
 
 ```
        +-----------+-----------------+---------------+------------------+
        |           |                 |               |                  |
-	   |kernel text|      kernel     |               |    vsyscalls     |
-	   | mapping   |       text      |    Modules    |    fix-mapped    |
+       |kernel text|      kernel     |               |    vsyscalls     |
+       | mapping   |       text      |    Modules    |    fix-mapped    |
        |from phys 0|       data      |               |    addresses     |
        |           |                 |               |                  |
        +-----------+-----------------+---------------+------------------+
@@ -37,15 +36,15 @@ Base virtual address and size of the `fix-mapped` area are presented by the two
 #define FIXADDR_START		(FIXADDR_TOP - FIXADDR_SIZE)
 ```
 
-Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and as I wrote above: Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines size of a page. For example size of the one page we can get with the `1 << PAGE_SHIFT`. In our case we need to get the size of the fix-mapped area, but not only of one page, that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area. In my case it's a little more than `536` killobytes. In your case it might be a different number, because the size depends on amount of the fix-mapped addresses which are depends on your kernel's configuration.
+Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and as I wrote above: Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines the size of a page. For example we can get the size of one page with `1 << PAGE_SHIFT`. In our case we need to get the size of the fix-mapped area, but not only of one page, that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area. In my case it's a little more than `536` kilobytes. In your case it might be a different number, because the size depends on the number of fix-mapped addresses, which depends on your kernel's configuration.
 
-The second `FIXADDR_START` macro just extracts from the last address of the fix-mapped area its size for getting base virtual address of the fix-mapped area. `FIXADDR_TOP` is rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space:
+The second `FIXADDR_START` macro just subtracts the fix-mapped area size from the last address of the fix-mapped area to get its base virtual address. `FIXADDR_TOP` is a rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space:
 
 ```C
 #define FIXADDR_TOP     (round_up(VSYSCALL_ADDR + PAGE_SIZE, 1<<PMD_SHIFT) - PAGE_SIZE)
 ```
 
-The `fixed_addresses` enums are used as an index to get the virtual address using the `fix_to_virt` function. Implementation of this function is easy:
+The `fixed_addresses` enums are used as an index to get the virtual address by the `fix_to_virt` function. Implementation of this function is easy:
  
 ```C
 static __always_inline unsigned long fix_to_virt(const unsigned int idx)
@@ -71,7 +70,7 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
 }
 ```
 
-`virt_to_fix` takes virtual address, checks that this address is between `FIXADDR_START` and `FIXADDR_TOP` and calls `__virt_to_fix` macro which implemented as:
+`virt_to_fix` takes a virtual address, checks that this address is between `FIXADDR_START` and `FIXADDR_TOP` and calls the `__virt_to_fix` macro which is implemented as:
 
 ```C
 #define __virt_to_fix(x)        ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)
@@ -79,17 +78,17 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
 
 A PFN is simply an index within physical memory that is counted in page-sized units. PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT);
 
-`__virt_to_fix` clears the first 12 bits in the given address, subtracts it from the last address the of `fix-mapped` area (`FIXADDR_TOP`) and shifts right result on `PAGE_SHIFT` which is `12`. Let me explain how it works. As I already wrote we will clear the first 12 bits in the given address with `x & PAGE_MASK`. As we subtract this from the `FIXADDR_TOP`, we will get the last 12 bits of the `FIXADDR_TOP` which are present. We know that the first 12 bits of the virtual address represent the offset in the page frame. With the shiting it on `PAGE_SHIFT` we will get `Page frame number` which is just all bits in a virtual address besides the first 12 offset bits. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. We used `fix-mapped` area in the early `ioremap` initialization. Let's look on it and try to understand what is it `ioremap`, how it is implemented in the kernel and how it is releated to the `fix-mapped` addresses.
+`__virt_to_fix` clears the first 12 bits in the given address, subtracts it from the last address of the `fix-mapped` area (`FIXADDR_TOP`) and shifts the result right by `PAGE_SHIFT` which is `12`. Let me explain how it works. As I already wrote, we clear the first 12 bits in the given address with `x & PAGE_MASK`. As we subtract this from `FIXADDR_TOP`, we get the last 12 bits of `FIXADDR_TOP` which are present. We know that the first 12 bits of the virtual address represent the offset in the page frame. By shifting it right by `PAGE_SHIFT` we get the `Page frame number` which is just all bits in a virtual address besides the first 12 offset bits. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. The `IDT` descriptor is stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the `fix-mapped` area starting from the `FIX_TBOOT_BASE` index, the [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the linux kernel initialization chapter. We use the `fix-mapped` area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to the `fix-mapped` addresses.
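+
+To see this arithmetic in isolation, here is a tiny stand-alone model of `fix_to_virt`/`__virt_to_fix` (a sketch only: the `FIXADDR_TOP` value and the index `3` are example numbers, not the real kernel values):
+
+```C
+#include <stdio.h>
+
+#define PAGE_SHIFT   12
+#define PAGE_SIZE    (1UL << PAGE_SHIFT)
+#define PAGE_MASK    (~(PAGE_SIZE - 1))
+#define FIXADDR_TOP  0xffffffffff5ff000UL    /* example top of the fixmap area */
+
+static unsigned long fix_to_virt(unsigned int idx)
+{
+        return FIXADDR_TOP - (idx << PAGE_SHIFT);
+}
+
+static unsigned long virt_to_fix(unsigned long vaddr)
+{
+        return (FIXADDR_TOP - (vaddr & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+int main(void)
+{
+        unsigned long va = fix_to_virt(3);
+
+        /* converting back recovers the index, the in-page offset is ignored */
+        printf("index 3 -> %#lx -> index %lu\n", va, virt_to_fix(va + 0x123));
+        return 0;
+}
+```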
 
 ioremap
 --------------------------------------------------------------------------------
 
-Linux kernel provides many different primitives to manage memory. For this moment we will touch `I/O memory`. Every device is controlled by reading/writing from/to its registers. For example a driver can turn off/on a device by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know for this moment there are two ways to access device's registers and data buffers:
+The Linux kernel provides many different primitives to manage memory. For this moment we will touch `I/O memory`. Every device is controlled by reading/writing from/to its registers. For example a driver can turn off/on a device by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know for this moment there are two ways to access device's registers and data buffers:
 
 * through the I/O ports;
 * mapping of the all registers to the memory address space;
 
-In the first case every control register of a device has a number of input and output port. And driver of a device can read from a port and write to it with two `in` and `out` instructions which we already saw. If you want to know about currently registered port regions, you can know they by accessing of `/proc/ioports`:
+In the first case every control register of a device has an input and output port number. A device driver can read from a port and write to it with the two `in` and `out` instructions which we already saw. If you want to know about currently registered port regions, you can learn about them by accessing `/proc/ioports`:
 
 ```
 $ cat /proc/ioports
@@ -120,7 +119,7 @@ $ cat /proc/ioports
 ...
 ```
 
-`/proc/ioporst` provides information about what driver used address of a `I/O` ports region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actually `request_region` is a macro which defied as:
+`/proc/ioports` provides information about which driver uses which address of an `I/O` port region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actually `request_region` is a macro which is defined as:
 
 ```C
 #define request_region(start,n,name)   __request_region(&ioport_resource, (start), (n), (name), 0)
@@ -132,7 +131,7 @@ As we can see it takes three parameters:
 * `n`     -  length of region;
 * `name`  -  name of requester.
 
-`request_region` allocates `I/O` port region. Very often `check_region` function called before the `request_region` to check that the given address range is available and `release_region` to release memory region. `request_region` returns pointer to the `resource` structure. `resource` structure presents abstraction for a tree-like subset of system resources. We already saw `resource` structure in the firth part about kernel [initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as:
+`request_region` allocates an `I/O` port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available and the `release_region` function to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel [initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as follows:
 
 ```C
 struct resource {
@@ -144,7 +143,7 @@ struct resource {
 };
 ```
 
-and contains start and end addresses of the resource, name and etc... Every `resource` structure contains pointers to the `parent`, `slibling` and `child` resources. As it has parent and childs, it means that every subset of resuorces has root `resource` structure. For example, for `I/O` ports it is `ioport_resource` structure:
+and contains start and end addresses of the resource, the name, etc. Every `resource` structure contains pointers to the `parent`, `sibling` and `child` resources. As it has a parent and children, it means that every subset of resources has a root `resource` structure. For example, for `I/O` ports it is the `ioport_resource` structure:
 
 ```C
 struct resource ioport_resource = {
@@ -156,7 +155,7 @@ struct resource ioport_resource = {
 EXPORT_SYMBOL(ioport_resource);
 ```
 
-Or for `iomem`, it is `iomem_resource` structure:
+Or for `iomem`, it is the `iomem_resource` structure:
 
 ```C
 struct resource iomem_resource = {
@@ -167,13 +166,13 @@ struct resource iomem_resource = {
 };
 ```
 
-As I wrote about `request_regions` is used for registering of I/O port region and this macro used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example let's look at [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/char/rtc.c). This source code file provides [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. As every kernel module, `rtc` module contains `module_init` definition:
+As I have mentioned before, `request_region` is used to register I/O port regions and this macro is used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example let's look at [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/drivers/char/rtc.c). This source code file provides the [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. Like every kernel module, the `rtc` module contains a `module_init` definition:
 
 ```C
 module_init(rtc_init);
 ```
 
-where `rtc_init` is `rtc` initialization function. This function defined in the same `rtc.c` source code file. In the `rtc_init` function we can see a couple calls of the `rtc_request_region` functions, which wrap `request_region` for example:
+where `rtc_init` is the `rtc` initialization function. This function is defined in the same `rtc.c` source code file. In the `rtc_init` function we can see a couple of calls to the `rtc_request_region` function, which wraps `request_region`, for example:
 
 ```C
 r = rtc_request_region(RTC_IO_EXTENT);
@@ -185,25 +184,25 @@ where `rtc_request_region` calls:
 r = request_region(RTC_PORT(0), size, "rtc");
 ```
 
-Here `RTC_IO_EXTENT` is a size of memory region and it is `0x8`, `"rtc"` is a name of region and `RTC_PORT` is:
+Here `RTC_IO_EXTENT` is the size of the memory region and it is `0x8`, `"rtc"` is the name of the region and `RTC_PORT` is:
 
 ```C
 #define RTC_PORT(x)     (0x70 + (x))
 ```
 
-So with the `request_region(RTC_PORT(0), size, "rtc")` we register memory region, started at `0x70` and with size `0x8`. Let's look on the `/proc/ioports`:
+So with the `request_region(RTC_PORT(0), size, "rtc")` we register a memory region that starts at `0x70` and has a size of `0x8`. Let's look at `/proc/ioports`:
 
 ```
 ~$ sudo cat /proc/ioports | grep rtc
 0070-0077 : rtc0
 ```
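+
+Since these ports are now registered to the `rtc` driver, we can also poke the same `I/O` ports from user space using the port helpers glibc provides on `x86` (a hedged demo: it needs root privileges for `ioperm` and assumes the usual CMOS layout where register `0` holds the seconds, usually BCD-encoded):
+
+```C
+#include <stdio.h>
+#include <sys/io.h>
+
+int main(void)
+{
+        if (ioperm(0x70, 2, 1)) {            /* ask for access to ports 0x70-0x71 */
+                perror("ioperm");
+                return 1;
+        }
+
+        outb(0x00, 0x70);                    /* select CMOS register 0: seconds */
+        printf("CMOS seconds register: %#x\n", inb(0x71));
+        return 0;
+}
+```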
 
-So, we got it! Ok, it was ports. The second way is use of `I/O` memory. As I wrote above this way is mapping of control registers and memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provided by a device to CPU through a bus. All memory-mapped I/O addresses are not used by the kernel directly. There is a special `ioremap` function which allows us to covert the physical address on a bus to the kernel virtual address or in another words `ioremap` maps I/O physical memory region to access it from the kernel. The `ioremap` function takes two parameters:
+So, we got it! Ok, that was it for the I/O ports. The second way to communicate with drivers is through the use of `I/O` memory. As I have mentioned above this works by mapping the control registers and the memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provided by a device to the CPU through a bus. None of the memory-mapped I/O addresses are used by the kernel directly. There is a special `ioremap` function which allows us to convert the physical address on a bus to a kernel virtual address. In other words, `ioremap` maps I/O physical memory regions to make them accessible from the kernel. The `ioremap` function takes two parameters:
 
 * start of the memory region;
 * size of the memory region;
 
-I/O memory mapping API provides function for the checking, requesting and release of a memory region as this does I/O ports API. There are three functions for it:
+The I/O memory mapping API provides functions to check, request and release memory regions as I/O memory. There are three functions for that:
 
 * `request_mem_region`
 * `release_mem_region`
@@ -239,7 +238,7 @@ e0000000-feafffff : PCI Bus 0000:00
 ...
 ```
 
-Part of these addresses is from the call of the `e820_reserve_resources` function. We can find call of this function in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and the function itself defined in the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c). `e820_reserve_resources` goes through the [e820](http://en.wikipedia.org/wiki/E820) map and inserts memory regions to the root `iomem` resource structure. All `e820` memory regions which are will be inserted to the `iomem` resource will have following types:
+Part of these addresses are from the call of the `e820_reserve_resources` function. We can find a call to this function in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and the function itself is defined in [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c). `e820_reserve_resources` goes through the [e820](http://en.wikipedia.org/wiki/E820) map and inserts memory regions into the root `iomem` resource structure. All `e820` memory regions which are inserted into the `iomem` resource have the following types:
 
 ```C
 static inline const char *e820_type_to_string(int e820_type)
@@ -255,15 +254,15 @@ static inline const char *e820_type_to_string(int e820_type)
 }
 ```
 
-and we can see it in the `/proc/iomem` (read above).
+and we can see them in the `/proc/iomem` (read above).
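+
+Before we dive into the implementation, here is a hedged sketch of how a driver typically combines these calls (the `"example"` name, `EXAMPLE_PHYS` and `EXAMPLE_SIZE` are made-up values for a hypothetical memory-mapped device; `iounmap` is the counterpart of `ioremap`):
+
+```C
+#include <linux/module.h>
+#include <linux/ioport.h>
+#include <linux/io.h>
+
+#define EXAMPLE_PHYS 0xfed40000UL   /* hypothetical device base   */
+#define EXAMPLE_SIZE 0x1000         /* hypothetical region length */
+
+static void __iomem *regs;
+
+static int __init example_init(void)
+{
+	if (!request_mem_region(EXAMPLE_PHYS, EXAMPLE_SIZE, "example"))
+		return -EBUSY;
+
+	regs = ioremap(EXAMPLE_PHYS, EXAMPLE_SIZE);
+	if (!regs) {
+		release_mem_region(EXAMPLE_PHYS, EXAMPLE_SIZE);
+		return -ENOMEM;
+	}
+
+	pr_info("example: first register = %#x\n", readl(regs));
+	return 0;
+}
+
+static void __exit example_exit(void)
+{
+	iounmap(regs);
+	release_mem_region(EXAMPLE_PHYS, EXAMPLE_SIZE);
+}
+
+module_init(example_init);
+module_exit(example_exit);
+MODULE_LICENSE("GPL");
+```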
 
-Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split inn two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and call of the `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
+Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
 
 ```C
 BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
 ```
 
-more about `BUILD_BUG_ON` you can read in the first part about [Linux Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). So `BUILD_BUG_ON` macro raises compilation error if the given expression is true. In the next step after this check, we can see call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). This function presents generic initialization of the `ioremap`. `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They are stats from the `FIX_BITMAP_BEGIN` (top) and ends with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:
+more about `BUILD_BUG_ON` you can read in the first part about [Linux Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). So the `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/master/mm/early_ioremap.c). This function presents the generic initialization of the `ioremap`. The `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BITMAP_BEGIN` (top) and end with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:
 
 ```
 #define NR_FIX_BTMAPS		64
@@ -295,7 +294,7 @@ static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;
 static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
 ```
 
-`slot_virt` contains virtual addresses of the `fix-mapped` areas, `prev_map` array contains addresses of the early ioremap areas. Note that I wrote above: `Actually there are 512 temporary boot-time mappings, used by early ioremap` and you can see that all arrays defined with the `__initdata` attribute which means that this memory will be released after kernel initialization process. After `early_ioremap_setup` finished to work, we're getting page middle directory where early ioremap beginning with the `early_ioremap_pmd` function which just gets the base address of the page global directory and calculates the page middle directory for the given address:
+`slot_virt` contains the virtual addresses of the `fix-mapped` areas, the `prev_map` array contains addresses of the early ioremap areas. Note that I wrote above: `Actually there are 512 temporary boot-time mappings, used by early ioremap` and you can see that all arrays are defined with the `__initdata` attribute which means that this memory will be released after the kernel initialization process. After `early_ioremap_setup` has finished its work, we get the page middle directory where the early ioremap begins with the `early_ioremap_pmd` function, which just gets the base address of the page global directory and calculates the page middle directory for the given address:
 
 ```C
 static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
@@ -308,7 +307,7 @@ static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
 }
 ```
 
-After this we fills `bm_pte` (early ioremap page table entries) with zeros and call the `pmd_populate_kernel` function:
+After this we fill `bm_pte` (early ioremap page table entries) with zeros and call the `pmd_populate_kernel` function:
 
 ```C
 pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
@@ -326,7 +325,7 @@ pmd_populate_kernel(&init_mm, pmd, bm_pte);
 static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss;
 ```
 
-The `pmd_popularte_kernel` function defined in the [arch/x86/include/asm/pgalloc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgalloc.) and populates given page middle directory (`pmd`) with the given page table entries (`bm_pte`):
+The `pmd_populate_kernel` function is defined in [arch/x86/include/asm/pgalloc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgalloc.h) and populates the page middle directory (`pmd`) provided as an argument with the given page table entries (`bm_pte`):
 
 ```C
 static inline void pmd_populate_kernel(struct mm_struct *mm,
@@ -357,18 +356,18 @@ That's all. Early `ioremap` is ready to use. There are a couple of checks in the
 Use of early ioremap
 --------------------------------------------------------------------------------
 
-As early `ioremap` is setup, we can use it. It provides two functions:
+As soon as early `ioremap` has been set up successfully, we can use it. It provides two functions:
 
 * early_ioremap
 * early_iounmap
 
-for mapping/unmapping of IO physical address to virtual address. Both functions depends on `CONFIG_MMU` configuration option. [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) is a special block of memory management. Main purpose of this block is translation physical addresses to the virtual. Techinically memory management unit knows about high-level page table address (`pgd`) from the `cr3` control register. If `CONFIG_MMU` options is set to `n`, `early_ioremap` just returns the given physical address and `early_iounmap` does not nothing. In other way, if `CONFIG_MMU` option is set to `y`, `early_ioremap` calls `__early_ioremap` which takes three parameters:
+for mapping/unmapping of I/O physical addresses to virtual addresses. Both functions depend on the `CONFIG_MMU` configuration option. The [memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) is a special block of memory management. The main purpose of this block is the translation of physical addresses to virtual addresses. The memory management unit knows about the high-level page table addresses (`pgd`) from the `cr3` control register. If the `CONFIG_MMU` option is set to `n`, `early_ioremap` just returns the given physical address and `early_iounmap` does nothing. If the `CONFIG_MMU` option is set to `y`, `early_ioremap` calls `__early_ioremap` which takes three parameters:
 
-* `phys_addr` - base physicall address of the `I/O` memory region to map on virtual addresses;
-* `size`      - size of the `I/O` memroy region;
+* `phys_addr` - base physical address of the `I/O` memory region to map on virtual addresses;
+* `size`      - size of the `I/O` memory region;
 * `prot`      - page table entry bits.
 
-First of all in the `__early_ioremap`, we goes through the all early ioremap fixmap slots and check first free are in the `prev_map` array and remember it's number in the `slot` variable and set up size as we found it:
+First of all in `__early_ioremap`, we go through all early ioremap fixmap slots and search for the first free one in the `prev_map` array. When we find it, we remember its number in the `slot` variable and set up the size:
 
 ```C
 slot = -1;
@@ -394,20 +393,20 @@ phys_addr &= PAGE_MASK;
 size = PAGE_ALIGN(last_addr + 1) - phys_addr;
 ```
 
-Here we are using `PAGE_MASK` for clearing all bits in the `phys_addr` besides first 12 bits. `PAGE_MASK` macro defined as:
+Here we are using `PAGE_MASK` for clearing all bits in the `phys_addr` except the first 12 bits. `PAGE_MASK` macro is defined as:
 
 ```C
 #define PAGE_MASK       (~(PAGE_SIZE-1))
 ```
 
-We know that size of a page is 4096 bytes or `1000000000000` in binary. `PAGE_SIZE - 1` will be `111111111111`, but with `~`, we will get `000000000000`, but as we use `~PAGE_MASK` we will get `111111111111` again. On the second line we do the same but clear first 12 bits and getting page-aligned size of the area on the third line. We getting aligned area and now we need to get the number of pages which are occupied by the new `ioremap` are and calculate the fix-mapped index from `fixed_addresses` in the next steps:
+We know that the size of a page is 4096 bytes or `1000000000000` in binary. `PAGE_SIZE - 1` will be `111111111111`, with `~` the lower 12 bits become `000000000000`, and as `~PAGE_MASK` we get `111111111111` there again. On the second line we do the same but clear the first 12 bits, and on the third line we get the page-aligned size of the area. Now that we have an aligned area, we need to get the number of pages which are occupied by the new `ioremap` area and calculate the fix-mapped index from `fixed_addresses` in the next steps:
 
 ```C
 nrpages = size >> PAGE_SHIFT;
 idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
 ```
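+
+To check this arithmetic on an example, here is a small stand-alone model of the alignment math (a sketch: `phys_addr` and `size` are made-up values):
+
+```C
+#include <stdio.h>
+
+#define PAGE_SHIFT 12
+#define PAGE_SIZE  (1UL << PAGE_SHIFT)
+#define PAGE_MASK  (~(PAGE_SIZE - 1))
+#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
+
+int main(void)
+{
+        unsigned long phys_addr = 0xfed00042UL;        /* example I/O address  */
+        unsigned long size      = 0x2100UL;            /* example region size  */
+        unsigned long last_addr = phys_addr + size - 1;
+
+        unsigned long offset = phys_addr & ~PAGE_MASK; /* offset inside a page */
+        phys_addr &= PAGE_MASK;                        /* page-aligned base    */
+        size = PAGE_ALIGN(last_addr + 1) - phys_addr;  /* page-aligned size    */
+
+        printf("offset=%#lx base=%#lx size=%#lx nrpages=%lu\n",
+               offset, phys_addr, size, size >> PAGE_SHIFT);
+        return 0;
+}
+```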
 
-Now we can fill `fix-mapped` area with the given physical addresses. Every iteration in the loop, we call `__early_set_fixmap` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c), increase given physical address on page size which is `4096` bytes and update `addresses` index and number of pages: 
+Now we can fill `fix-mapped` area with the given physical addresses. On every iteration in the loop, we call the `__early_set_fixmap` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c), increase the given physical address by the page size which is `4096` bytes and update the `addresses` index and the number of pages:
 
 ```C
 while (nrpages > 0) {
@@ -424,7 +423,7 @@ The `__early_set_fixmap` function gets the page table entry (stored in the `bm_p
 pte = early_ioremap_pte(addr);
 ```
 
-In the next step of the `early_ioremap_pte` we check the given page flags with the `pgprot_val` macro and calls `set_pte` or `pte_clear` depends on it:
+In the next step of `early_ioremap_pte` we check the given page flags with the `pgprot_val` macro and call `set_pte` or `pte_clear` depending on the flags given:
 
 ```C
 if (pgprot_val(flags))
@@ -439,13 +438,13 @@ As you can see above, we passed `FIXMAP_PAGE_IO` as flags to the `__early_iorema
 (__PAGE_KERNEL_EXEC | _PAGE_NX)
 ```
 
-flags, so we call `set_pte` function for setting page table entry which works in the same manner as `set_pmd` but for PTEs (read above about it). As we set all `PTEs` in the loop, we can see the call of the `__flush_tlb_one` function:
+flags, so we call `set_pte` function to set the page table entry which works in the same manner as `set_pmd` but for PTEs (read above about it). As we have set all `PTEs` in the loop, we can now take a look at the call of the `__flush_tlb_one` function:
 
 ```C
 __flush_tlb_one(addr);
 ```
 
-This function defined in the [arch/x86/include/asm/tlbflush.h](https://github.com/torvalds/linux/blob/master) and calls `__flush_tlb_single` or `__flush_tlb` depends on value of the `cpu_has_invlpg`:
+This function is defined in [arch/x86/include/asm/tlbflush.h](https://github.com/torvalds/linux/blob/master) and calls `__flush_tlb_single` or `__flush_tlb` depending on the value of `cpu_has_invlpg`:
 
 ```C
 static inline void __flush_tlb_one(unsigned long addr)
@@ -457,13 +456,13 @@ static inline void __flush_tlb_one(unsigned long addr)
 }
 ```
 
-`__flush_tlb_one` function invalidates given address in the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer). As you just saw we updated paging structure, but `TLB` not informed of changes, that's why we need to do it manually. There are two ways how to do it. First is update `cr3` control register and `__flush_tlb` function does this:
+The `__flush_tlb_one` function invalidates the given address in the [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer). As you just saw we updated the paging structure, but `TLB` is not informed of the changes, that's why we need to do it manually. There are two ways to do it. The first is to update the `cr3` control register and the `__flush_tlb` function does this:
 
 ```C
 native_write_cr3(native_read_cr3());
 ```
 
-The second method is to use `invlpg` instruction invalidates `TLB` entry. Let's look on `__flush_tlb_one` implementation. As you can see first of all it checks `cpu_has_invlpg` which defined as: 
+The second method is to use the `invlpg` instruction to invalidate the `TLB` entry. Let's look at the `__flush_tlb_one` implementation. As you can see, first of all the function checks `cpu_has_invlpg` which is defined as:
 
 ```C
 #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
@@ -473,7 +472,7 @@ The second method is to use `invlpg` instruction invalidates `TLB` entry. Let's
 #endif
 ```
 
-If a CPU support `invlpg` instruction, we call the `__flush_tlb_single` macro which expands to the call of the `__native_flush_tlb_single`:
+If a CPU supports the `invlpg` instruction, we call the `__flush_tlb_single` macro which expands to the call of `__native_flush_tlb_single`:
 
 ```C
 static inline void __native_flush_tlb_single(unsigned long addr)
@@ -482,7 +481,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
 }
 ```
 
-or call `__flush_tlb` which just updates `cr3` register as we saw it above. After this step execution of the `__early_set_fixmap` function is finsihed and we can back to the `__early_ioremap` implementation. As we set fixmap area for the given address, need to save the base virtual address of the I/O Re-mapped area in the `prev_map` with the `slot` index:
+or call `__flush_tlb` which just updates the `cr3` register as we have seen. After this step execution of the `__early_set_fixmap` function is finished and we can go back to the `__early_ioremap` implementation. When we have set up the fixmap area for the given address, we need to save the base virtual address of the I/O Re-mapped area in the `prev_map` using the `slot` index:
 
 ```C
 prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
@@ -490,22 +489,22 @@ prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
 
 and return it.
 
-The second function is - `early_iounmap` - unmaps an `I/O` memory region. This function takes two parameters: base address and size of a `I/O` region and generally looks very similar on `early_ioremap`. It also goes through fixmap slots and looks for slot with the given address. After this it gets the index of the fixmap slot and calls `__late_clear_fixmap` or `__early_set_fixmap` depends on `after_paging_init` value. It calls `__early_set_fixmap` with on difference then it does `early_ioremap`: it passes `zero` as physicall address. And in the end it sets address of the I/O memory region to `NULL`:
+The second function, `early_iounmap`, unmaps an `I/O` memory region. This function takes two parameters: the base address and the size of an `I/O` region and generally looks very similar to `early_ioremap`. It also goes through fixmap slots and looks for a slot with the given address. After that, it gets the index of the fixmap slot and calls `__late_clear_fixmap` or `__early_set_fixmap` depending on the `after_paging_init` value. It calls `__early_set_fixmap` with one difference to how `early_ioremap` does it: `early_iounmap` passes `zero` as the physical address. And in the end it sets the address of the I/O memory region to `NULL`:
 
 ```C
 prev_map[slot] = NULL;
 ```
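+
+Putting the two functions together, early boot code typically uses them as a short map/read/unmap pair. The following is only a hedged sketch (the function name, the physical address and the length are made up):
+
+```C
+static void __init example_early_probe(void)
+{
+	void __iomem *p;
+
+	p = early_ioremap(0x000f0000, 0x10000);   /* map a physical range */
+	if (!p)
+		return;
+
+	/* ... read whatever is needed from the mapped area here ... */
+
+	early_iounmap(p, 0x10000);                /* and always unmap it again */
+}
+```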
 
-That's all about `fixmaps` and `ioremap`. Of course this part does not cover full features of the `ioremap`, it was only early ioremap, but there is also normal ioremap. But we need to know more things than now before it.
+That's all about `fixmaps` and `ioremap`. Of course this part does not cover all features of `ioremap`, only early ioremap but there is also normal ioremap. But we need to know more things before we study that in more detail.
 
 So, this is the end!
 
 Conclusion
 --------------------------------------------------------------------------------
 
-This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
+This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
 
-**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
 
 Links
 --------------------------------------------------------------------------------

+ 434 - 0
mm/mm-3.md

@@ -0,0 +1,434 @@
+Linux kernel memory management Part 3.
+================================================================================
+
+Introduction to the kmemcheck in the Linux kernel
+--------------------------------------------------------------------------------
+
+This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/mm/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) of this chapter we met two memory management related concepts:
+
+* `Fix-Mapped Addresses`;
+* `ioremap`.
+
+The first concept represents a special area in [virtual memory](https://en.wikipedia.org/wiki/Virtual_memory), whose corresponding physical mapping is calculated at [compile-time](https://en.wikipedia.org/wiki/Compile_time). The second concept provides the ability to map input/output related memory to virtual memory.
+
+For example if you look at the output of `/proc/iomem`:
+
+```
+$ sudo cat /proc/iomem
+
+00000000-00000fff : reserved
+00001000-0009d7ff : System RAM
+0009d800-0009ffff : reserved
+000a0000-000bffff : PCI Bus 0000:00
+000c0000-000cffff : Video ROM
+000d0000-000d3fff : PCI Bus 0000:00
+000d4000-000d7fff : PCI Bus 0000:00
+000d8000-000dbfff : PCI Bus 0000:00
+000dc000-000dffff : PCI Bus 0000:00
+000e0000-000fffff : reserved
+...
+...
+...
+```
+
+you will see a map of the system's memory for each physical device. Here the first column displays the memory registers used by each of the different types of memory. The second column lists the kind of memory located within those registers. Or for example:
+
+```
+$ sudo cat /proc/ioports
+
+0000-0cf7 : PCI Bus 0000:00
+  0000-001f : dma1
+  0020-0021 : pic1
+  0040-0043 : timer0
+  0050-0053 : timer1
+  0060-0060 : keyboard
+  0064-0064 : keyboard
+  0070-0077 : rtc0
+  0080-008f : dma page reg
+  00a0-00a1 : pic2
+  00c0-00df : dma2
+  00f0-00ff : fpu
+    00f0-00f0 : PNP0C04:00
+  03c0-03df : vga+
+  03f8-03ff : serial
+  04d0-04d1 : pnp 00:06
+  0800-087f : pnp 00:01
+  0a00-0a0f : pnp 00:04
+  0a20-0a2f : pnp 00:04
+  0a30-0a3f : pnp 00:04
+...
+...
+...
+```
+
+can show us lists of currently registered port regions used for input or output communication with a device. None of the memory-mapped I/O addresses are used by the kernel directly. So, before the Linux kernel can use such memory, it must map it into the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only the early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like the different types of memory allocators, etc., because otherwise it will be very difficult to understand.
+
+So, before we move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will look at some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), checking for [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control, etc. It will be easier to understand how memory management is arranged in the Linux kernel after learning all of these things.
+
+As you may already guess from the title of this part, we will start to consider memory mechanisms with [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://0xax.gitbooks.io/linux-insides/content/), we will start from the theoretical side and learn what the `kmemcheck` mechanism is in general, and only after this we will see how it is implemented in the Linux kernel.
+
+So let's start. What is `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, `kmemcheck` checks memory. That's true. The main point of the `kmemcheck` mechanism is to check whether some kernel code accesses `uninitialized memory`. Let's take the following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program:
+
+```C
+#include <stdlib.h>
+#include <stdio.h>
+
+struct A {
+        int a;
+};
+
+int main(int argc, char **argv) {
+        struct A *a = malloc(sizeof(struct A));
+        printf("a->a = %d\n", a->a);
+        return 0;
+}
+```
+
+Here we allocate memory for the `A` structure and try to print the value of the `a` field. If we compile this program without additional options:
+
+```
+gcc test.c -o test
+```
+
+The [compiler](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) will not warn us that the `a` field is not initialized. But if we run this program with the [valgrind](https://en.wikipedia.org/wiki/Valgrind) tool, we will see the following output:
+
+```
+~$   valgrind --leak-check=yes ./test
+==28469== Memcheck, a memory error detector
+==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
+==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
+==28469== Command: ./test
+==28469== 
+==28469== Conditional jump or move depends on uninitialised value(s)
+==28469==    at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so)
+==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
+==28469==    by 0x4005B9: main (in /home/alex/test)
+==28469== 
+==28469== Use of uninitialised value of size 8
+==28469==    at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so)
+==28469==    by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so)
+==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
+==28469==    by 0x4005B9: main (in /home/alex/test)
+...
+...
+...
+```
+
+Actually the `kmemcheck` mechanism does the same for the kernel as `valgrind` does for userspace programs. It checks for uninitialized memory.
+
+To enable this mechanism in the Linux kernel, you need to enable the `CONFIG_KMEMCHECK` kernel configuration option in the:
+
+```
+Kernel hacking
+  -> Memory Debugging
+```
+  
+menu of the Linux kernel configuration:
+
+![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg)
+
+Not only may we enable support for the `kmemcheck` mechanism in the Linux kernel, it also provides some configuration options for us. We will see all of these options in the next paragraph of this part. One last note before we consider how `kmemcheck` checks memory: currently this mechanism is implemented only for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. You can verify this by looking at the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) `x86` related kernel configuration file, where you will see the following lines:
+
+```
+config X86
+  ...
+  ...
+  ...
+  select HAVE_ARCH_KMEMCHECK
+  ...
+  ...
+  ...
+```
+
+So, there is nothing specific for other architectures.
+
+Ok, so we know that `kmemcheck` provides a mechanism to check the usage of `uninitialized memory` in the Linux kernel and we know how to enable it. How does it do these checks? When the Linux kernel tries to allocate some memory, i.e. something like this is called:
+
+```C
+struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
+```
+
+or in other words when somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)). If a `page fault` exception occurs, the exception handler knows about it and in the case when `kmemcheck` is enabled it transfers control to it. After `kmemcheck` finishes its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is a little subtlety in this chain. When the first instruction of the interrupted code has been executed, `kmemcheck` will mark the page as `non-present` again. In this way the next access to memory will be caught again.
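+
+The same trick can be modeled in user space: hide a page by removing its access rights, catch the resulting fault, restore the rights and let the interrupted code continue. The following stand-alone program is only an analogy for the behavior described above, it uses `mmap`/`mprotect`/`SIGSEGV` instead of kernel page tables:
+
+```C
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+static char *page;
+static long page_size;
+
+static void segv_handler(int sig, siginfo_t *info, void *ctx)
+{
+        (void)sig; (void)info; (void)ctx;
+
+        /* "analyze" the access here, then make the page present again so
+         * that the faulting instruction is restarted and now succeeds */
+        write(STDERR_FILENO, "caught access to hidden page\n", 29);
+        mprotect(page, page_size, PROT_READ | PROT_WRITE);
+}
+
+int main(void)
+{
+        struct sigaction sa;
+
+        memset(&sa, 0, sizeof(sa));
+        sa.sa_sigaction = segv_handler;
+        sa.sa_flags = SA_SIGINFO;
+        sigaction(SIGSEGV, &sa, NULL);
+
+        page_size = sysconf(_SC_PAGESIZE);
+        page = mmap(NULL, page_size, PROT_NONE,        /* a "non-present" page */
+                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+        page[0] = 42;                /* faults once, the handler un-hides it */
+        printf("page[0] = %d\n", page[0]);
+        return 0;
+}
+```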
+
+We just considered the `kmemcheck` mechanism from the theoretical side. Now let's look at how it is implemented in the Linux kernel.
+
+Implementation of the `kmemcheck` mechanism in the Linux kernel
+--------------------------------------------------------------------------------
+
+So, now we know what `kmemcheck` is and what it does in the Linux kernel. Time to look at its implementation. The implementation of `kmemcheck` is split into two parts. The first, generic part is located in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and the second, [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture-specific part is located in the [arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck) directory.
+
+Let's start with the initialization of this mechanism. We already know that to enable the `kmemcheck` mechanism in the Linux kernel, we must enable the `CONFIG_KMEMCHECK` kernel configuration option. But besides this, we need to pass one of the following parameters:
+
+ * kmemcheck=0 (disabled)
+ * kmemcheck=1 (enabled)
+ * kmemcheck=2 (one-shot mode)
+
+to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches `kmemcheck` into a special mode where it will be turned off after detecting the first use of uninitialized memory. Actually this mode is enabled by default in the Linux kernel:
+
+![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg)
+
+We know from the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in the `do_initcall_level` and `do_early_param` functions. Actually the `kmemcheck` subsystem consists of two stages. The first stage is early. If we look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function which will be called during early command line parsing:
+
+```C
+static int __init param_kmemcheck(char *str)
+{
+	int val;
+	int ret;
+
+	if (!str)
+		return -EINVAL;
+
+	ret = kstrtoint(str, 0, &val);
+	if (ret)
+		return ret;
+	kmemcheck_enabled = val;
+	return 0;
+}
+
+early_param("kmemcheck", param_kmemcheck);
+```
+
+As we already saw, the `kmemcheck` parameter may have one of the following values: `0` (disabled), `1` (enabled) or `2` (one-shot). The implementation of `param_kmemcheck` is pretty simple. We just convert the string value of the `kmemcheck` command line option to an integer representation and set it in the `kmemcheck_enabled` variable.
+
+The second stage will be executed during initialization of the Linux kernel, or rather during initialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html). The second stage is represented by the `kmemcheck_init` function:
+
+```C
+int __init kmemcheck_init(void)
+{
+    ...
+    ...
+    ...
+}
+
+early_initcall(kmemcheck_init);
+```
+
+The main goal of the `kmemcheck_init` function is to call the `kmemcheck_selftest` function and check its result:
+
+```C
+if (!kmemcheck_selftest()) {
+	printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
+	kmemcheck_enabled = 0;
+	return -EINVAL;
+}
+
+printk(KERN_INFO "kmemcheck: Initialized\n");
+```
+
+and returns with `EINVAL` if this check fails. The `kmemcheck_selftest` function checks the sizes of different memory access related [opcodes](https://en.wikipedia.org/wiki/Opcode) like `rep movsb`, `movzwq`, etc. If the sizes of the opcodes are equal to the expected sizes, `kmemcheck_selftest` will return `true`, and `false` otherwise.
+
+So when somebody calls:
+
+```C
+struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
+```
+
+the `kmem_getpages` function will be called through a series of different function calls. This function is defined in the [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c) source code file and its main goal is to try to allocate [pages](https://en.wikipedia.org/wiki/Paging) with the given flags. At the end of this function we can see the following code:
+
+```C
+if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
+	kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
+
+    if (cachep->ctor)
+		kmemcheck_mark_uninitialized_pages(page, nr_pages);
+	else
+		kmemcheck_mark_unallocated_pages(page, nr_pages);
+}
+```
+
+So, here we check that if `kmemcheck` is enabled and the `SLAB_NOTRACK` bit is not set in the flags, we set the `non-present` bit for the just allocated page. The `SLAB_NOTRACK` bit tells us not to track uninitialized memory. Additionally we check whether a cache object has a constructor (details will be considered in later parts): if it does, we mark the allocated pages as uninitialized, otherwise we mark them as unallocated. The `kmemcheck_alloc_shadow` function is defined in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and does the following things:
+
+```C
+void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
+{
+	struct page *shadow;
+	int pages;
+	int i;
+
+	pages = 1 << order;
+
+	shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);
+
+	for (i = 0; i < pages; ++i)
+		page[i].shadow = page_address(&shadow[i]);
+
+	kmemcheck_hide_pages(page, pages);
+}
+```
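+
+Before looking at what happens to these pages, it is worth saying what the shadow actually is: for every tracked data page `kmemcheck` allocates one shadow page, and every byte of the shadow describes the state of the corresponding data byte. In the `kmemcheck` sources the possible states form an enum along these lines (a sketch based on [arch/x86/mm/kmemcheck/shadow.h](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/shadow.h); the exact set may differ between kernel versions):
+
+```C
+enum kmemcheck_shadow {
+	KMEMCHECK_SHADOW_UNALLOCATED,	/* never handed out by the allocator   */
+	KMEMCHECK_SHADOW_UNINITIALIZED,	/* allocated, but not written yet      */
+	KMEMCHECK_SHADOW_INITIALIZED,	/* written at least once, reads are ok */
+	KMEMCHECK_SHADOW_FREED,		/* already returned to the allocator   */
+};
+```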
+
+First of all, the `kmemcheck_alloc_shadow` function allocates memory space for the shadow pages. After the space for the shadow is allocated, every allocated data page gets a pointer to its shadow page. In the end we just call the `kmemcheck_hide_pages` function with the pointer to the allocated pages and the number of these pages. The `kmemcheck_hide_pages` is an architecture-specific function, so its implementation is located in the [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c) source code file. The main goal of this function is to set the `non-present` bit in the given pages. Let's look at the implementation of this function:
+
+```C
+void kmemcheck_hide_pages(struct page *p, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; ++i) {
+		unsigned long address;
+		pte_t *pte;
+		unsigned int level;
+
+		address = (unsigned long) page_address(&p[i]);
+		pte = lookup_address(address, &level);
+		BUG_ON(!pte);
+		BUG_ON(level != PG_LEVEL_4K);
+
+		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
+		set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN));
+		__flush_tlb_one(address);
+	}
+}
+```
+
+Here we go through all pages and try to get the `page table entry` for each page. If this operation was successful, we unset the present bit and set the hidden bit in each page. In the end we flush the [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages were changed. From this point the allocated pages are tracked by `kmemcheck`. Now, as the `present` bit is unset, a [page fault](https://en.wikipedia.org/wiki/Page_fault) will occur as soon as `kmalloc` returns a pointer to the allocated space and some code tries to access this memory.
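+
+For example, the kind of bug that `kmemcheck` is designed to catch looks like this (a hedged sketch: `struct my_struct`, its fields and the surrounding function are made up for illustration):
+
+```C
+#include <linux/printk.h>
+#include <linux/slab.h>
+
+struct my_struct {
+	int a;
+	int b;
+};
+
+static void kmemcheck_demo(void)
+{
+	struct my_struct *x = kmalloc(sizeof(*x), GFP_KERNEL);
+
+	if (!x)
+		return;
+
+	x->a = 1;	/* this range of bytes becomes `initialized` in the shadow */
+
+	/*
+	 * x->b was never written. The read below touches a page which
+	 * kmemcheck hid, the page fault handler consults the shadow,
+	 * sees `uninitialized` and records an error.
+	 */
+	if (x->b == 42)
+		pr_info("will (almost) never happen\n");
+
+	kfree(x);
+}
+```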
+
+As you may remember from the [second part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) source code file and is represented by the `do_page_fault` function. We can see the following check near the beginning of the `__do_page_fault` function:
+
+```C
+static noinline void
+__do_page_fault(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address)
+{
+    ...
+    ...
+    ...
+	if (kmemcheck_active(regs))
+		kmemcheck_hide(regs);
+    ...
+    ...
+    ...
+}
+```
+
+The `kmemcheck_active` function gets the `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) structure and returns the result of comparing the `balance` field of this structure with zero:
+
+```C
+bool kmemcheck_active(struct pt_regs *regs)
+{
+	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
+
+	return data->balance > 0;
+}
+```
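+
+For reference, here is roughly what this per-cpu structure looks like (a sketch of the definition from [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c); treat the exact field names as approximate, they may vary between kernel versions):
+
+```C
+struct kmemcheck_context {
+	bool busy;
+	int balance;
+
+	/*
+	 * There can be at most two memory operands per instruction, but
+	 * each operand may span more than one page.
+	 */
+	unsigned long addr[2];
+	unsigned long n_addrs;
+	unsigned long flags;
+
+	/* Data size of the instruction that caused the fault. */
+	unsigned int size;
+};
+
+static DEFINE_PER_CPU(struct kmemcheck_context, kmemcheck_context);
+```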
+
+The `kmemcheck_context` is a structure which describes the current state of the `kmemcheck` mechanism. It stores uninitialized addresses, the number of such addresses and so on. The `balance` field of this structure represents the current state of `kmemcheck`, or in other words it tells us whether `kmemcheck` has already hidden the pages or not yet. If `data->balance` is greater than zero, the `kmemcheck_hide` function will be called. This means that `kmemcheck` has already set the `present` bit for the given pages and now we need to hide the pages again by unsetting the `present` bit, in order to cause the next page fault. In other words, one session of `kmemcheck` has already finished and a new page fault occurred. On the first pass the `kmemcheck_active` function returns false, as `data->balance` is zero at the start, so the `kmemcheck_hide` function will not be called. Next, we may see the following line of code in `do_page_fault`:
+
+```C
+if (kmemcheck_fault(regs, address, error_code))
+		return;
+```
+
+First of all, the `kmemcheck_fault` function checks that the fault occurred for the right reason. We check the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) and verify that we are in normal kernel mode:
+
+```C
+if (regs->flags & X86_VM_MASK)
+		return false;
+if (regs->cs != __KERNEL_CS)
+		return false;
+```
+
+If these checks were not successful, we return from the `kmemcheck_fault` function as it was not a `kmemcheck` related page fault. After this we try to look up the `page table entry` related to the faulting address and if we can't find it we return:
+
+```C
+pte = kmemcheck_pte_lookup(address);
+if (!pte)
+	return false;
+```
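+
+The `kmemcheck_pte_lookup` helper only returns a page table entry for pages which `kmemcheck` itself hid; this is what distinguishes a `kmemcheck` fault from an ordinary page fault. A sketch of it (based on [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c); details may differ between kernel versions):
+
+```C
+static pte_t *kmemcheck_pte_lookup(unsigned long address)
+{
+	pte_t *pte;
+	unsigned int level;
+
+	pte = lookup_address(address, &level);
+	if (!pte)
+		return NULL;
+	if (level != PG_LEVEL_4K)
+		return NULL;
+	/* only pages with the _PAGE_HIDDEN bit set belong to kmemcheck */
+	if (!pte_hidden(*pte))
+		return NULL;
+
+	return pte;
+}
+```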
+
+The last two steps of the `kmemcheck_fault` function are to call the `kmemcheck_access` function, which checks the access to the given page, and to show the addresses again by setting the present bit in the given page. The `kmemcheck_access` function does the main job: it inspects the instruction which caused the page fault. If it finds an error, the context of this error will be saved by `kmemcheck` to a ring queue:
+
+```C
+static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
+```
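+
+The `CONFIG_KMEMCHECK_QUEUE_SIZE` kernel configuration option bounds the number of errors that may be pending at the same time. Conceptually the `error_fifo` is just a fixed-size ring queue; the following standalone model (not the kernel's code, just an illustration of the bookkeeping) shows how such a queue is usually driven:
+
+```C
+#define QUEUE_SIZE 64
+
+struct error_model {
+	unsigned long address;	/* faulting address       */
+	unsigned int size;	/* size of the bad access */
+};
+
+static struct error_model error_fifo[QUEUE_SIZE];
+static unsigned int error_count;	/* number of queued errors    */
+static unsigned int error_rd;		/* next element to be printed */
+
+/* Producer side (page fault path): reserve a slot, or drop the error if full. */
+static struct error_model *error_next_wr(void)
+{
+	struct error_model *e;
+
+	if (error_count == QUEUE_SIZE)
+		return NULL;
+
+	e = &error_fifo[(error_rd + error_count) % QUEUE_SIZE];
+	error_count++;
+	return e;
+}
+
+/* Consumer side (tasklet): take the oldest pending error. */
+static struct error_model *error_next_rd(void)
+{
+	struct error_model *e;
+
+	if (error_count == 0)
+		return NULL;
+
+	e = &error_fifo[error_rd];
+	error_rd = (error_rd + 1) % QUEUE_SIZE;
+	error_count--;
+	return e;
+}
+```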
+
+The `kmemcheck` mechanism declares a special [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html):
+
+```C
+static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
+```
+
+which runs the `do_wakeup` function from the [arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c) source code file when it is scheduled to run.
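+
+When this tasklet fires, the collected errors end up in the kernel log. A report looks roughly like the following (an illustrative example modeled on the format shown in the [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt); the address, the dumps and the truncated backtrace are fake):
+
+```
+WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
+80000000000000000000000000000000000000000088ffff0000000000000000
+ i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
+         ^
+...
+```
+
+The second line is a hex dump of the memory around the faulting address, and the line below it is the shadow for the same bytes: `i` means initialized, `u` means uninitialized; the caret points at the byte that was actually read.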
+
+The `do_wakeup` function will call the `kmemcheck_error_recall` function which will print the errors collected by `kmemcheck`. As we already saw, the:
+
+```C
+kmemcheck_show(regs);
+```
+
+function will be called at the end of the `kmemcheck_fault` function. This function will set the present bit for the given pages again:
+
+```C
+if (unlikely(data->balance != 0)) {
+	kmemcheck_show_all();
+	kmemcheck_error_save_bug(regs);
+	data->balance = 0;
+	return;
+}
+```
+
+Here the `kmemcheck_show_all` function calls `kmemcheck_show_addr` for each address:
+
+```C
+static unsigned int kmemcheck_show_all(void)
+{
+	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
+	unsigned int i;
+	unsigned int n;
+
+	n = 0;
+	for (i = 0; i < data->n_addrs; ++i)
+		n += kmemcheck_show_addr(data->addr[i]);
+
+	return n;
+}
+```
+
+Each of these addresses is made accessible again by the call of `kmemcheck_show_addr`:
+
+```C
+int kmemcheck_show_addr(unsigned long address)
+{
+	pte_t *pte;
+
+	pte = kmemcheck_pte_lookup(address);
+	if (!pte)
+		return 0;
+
+	set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+	__flush_tlb_one(address);
+	return 1;
+}
+```
+
+At the end of the `kmemcheck_show` function we remember the original flags if the [TF](https://en.wikipedia.org/wiki/Trap_flag) flag wasn't already set, and then set the `TF` flag:
+
+```C
+if (!(regs->flags & X86_EFLAGS_TF))
+	data->flags = regs->flags;
+```
+
+We need to do this because the pages must be hidden again right after the first instruction executed after the page fault has been handled. With the `TF` flag set, the processor switches into single-step mode after that first instruction is executed, so a `debug` exception occurs. From this moment the pages will be hidden again and execution will continue. As the pages are hidden again, a page fault exception will occur on the next access, and `kmemcheck` continues to check/collect errors and print them from time to time.
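+
+The place where the pages get hidden again is the `debug` exception handler: `do_debug` calls back into `kmemcheck`, and the hook looks, in essence, like this (a sketch based on [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c); details may differ between kernel versions):
+
+```C
+bool kmemcheck_trap(struct pt_regs *regs)
+{
+	if (!kmemcheck_active(regs))
+		return false;
+
+	/* We're done: hide the pages again and stop single-stepping. */
+	kmemcheck_hide(regs);
+	return true;
+}
+```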
+
+That's all.
+
+Conclusion
+--------------------------------------------------------------------------------
+
+This is the end of the third part about Linux kernel [memory management](https://en.wikipedia.org/wiki/Memory_management). If you have questions or suggestions, ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see yet another memory debugging related tool - `kmemleak`.
+
+**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
+
+Links
+--------------------------------------------------------------------------------
+
+* [memory management](https://en.wikipedia.org/wiki/Memory_management)
+* [debugging](https://en.wikipedia.org/wiki/Debugging)
+* [memory leaks](https://en.wikipedia.org/wiki/Memory_leak)
+* [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
+* [valgrind](https://en.wikipedia.org/wiki/Valgrind)
+* [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
+* [page fault](https://en.wikipedia.org/wiki/Page_fault)
+* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
+* [opcode](https://en.wikipedia.org/wiki/Opcode)
+* [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
+* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
+* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
+* [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
+* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
+* [Previous part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)