|
@@ -4,7 +4,7 @@ Kernel initialization. Part 5.
|
|
|
Continue of architecture-specific initializations
|
|
|
================================================================================
|
|
|
|
|
|
-In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will miss functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You can remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions:
|
|
|
+In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions:
|
|
|
|
|
|
```
|
|
|
----------------------------------------------------------------------------------------------
|
|
@@ -14,7 +14,7 @@ In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initializat
|
|
|
----------------------------------------------------------------------------------------------
|
|
|
```
|
|
|
|
|
|
-Debug interrupt `#DB` is the primary means of invoking debuggers. `early_trap_init` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This functions sets `#DB` and `#BP` handlers and reloads [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table):
|
|
|
+Debug interrupt `#DB` is the primary method of invoking debuggers. `early_trap_init` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). This functions sets `#DB` and `#BP` handlers and reloads [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table):
|
|
|
|
|
|
```C
|
|
|
void __init early_trap_init(void)
|
|
@@ -25,7 +25,7 @@ void __init early_trap_init(void)
|
|
|
}
|
|
|
```
|
|
|
|
|
|
-We already saw implementation of the `set_intr_gate` in the previous part about interrupts. Here are two similar functions `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these two functions take two parameters:
|
|
|
+We already saw implementation of the `set_intr_gate` in the previous part about interrupts. Here are two similar functions `set_intr_gate_ist` and `set_system_intr_gate_ist`. Both of these two functions take three parameters:
|
|
|
|
|
|
* number of the interrupt;
|
|
|
* base address of the interrupt/exception handler;
|
|
@@ -110,7 +110,7 @@ The next two macro from the `idtentry` implementation are:
|
|
|
|
|
|
First `ASM_CLAC` macro depends on `CONFIG_X86_SMAP` configuration option and need for security resason, more about it you can read [here](https://lwn.net/Articles/517475/). The second `PARAVIRT_ADJUST_EXCEPTION_FRAME` macro is for handling handle Xen-type-exceptions (this chapter about kernel initializations and we will not consider virtualization stuff here).
|
|
|
|
|
|
-The next piece of code checks is interrupt has error code or not and pushes `$-1` which is `0xffffffffffffffff` on `x86_64` on the stack if not:
|
|
|
+The next piece of code checks if interrupt has error code or not and pushes `$-1` which is `0xffffffffffffffff` on `x86_64` on the stack if not:
|
|
|
|
|
|
```assembly
|
|
|
.ifeq \has_error_code
|
|
@@ -144,14 +144,14 @@ Here we checks first and second bits in the `CS`. You can remember that `CS` reg
|
|
|
1: ret
|
|
|
```
|
|
|
|
|
|
-In the next steps we put `pt_regs` pointer to the `rdi`, save error code in the `rsi` if it is and call interrupt handler which is - `do_debug` in our case from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug` like other handlers takes two parameters:
|
|
|
+In the next steps we put `pt_regs` pointer to the `rdi`, save error code in the `rsi` if it has and call interrupt handler which is - `do_debug` in our case from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). `do_debug` like other handlers takes two parameters:
|
|
|
|
|
|
* pt_regs - is a structure which presents set of CPU registers which are saved in the process' memory region;
|
|
|
* error code - error code of interrupt.
|
|
|
|
|
|
After interrupt handler finished its work, calls `paranoid_exit` which restores stack, switch on userspace if interrupt came from there and calls `iret`. That's all. Of course it is not all :), but we will see more deeply in the separate chapter about interrupts.
|
|
|
|
|
|
-This is general view of the `idtentry` macro for `#DB` interrupt. All interrupts are similar on this implementation and defined with idtentry too. After `early_trap_init` finished its work, the next function is `early_cpu_init`. This function defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about a CPU and its vendor.
|
|
|
+This is general view of the `idtentry` macro for `#DB` interrupt. All interrupts are similar to this implementation and defined with idtentry too. After `early_trap_init` finished its work, the next function is `early_cpu_init`. This function defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) and collects information about CPU and its vendor.
|
|
|
|
|
|
Early ioremap initialization
|
|
|
--------------------------------------------------------------------------------
|
|
@@ -177,7 +177,7 @@ for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
|
|
|
slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
|
|
|
```
|
|
|
|
|
|
-After this we get page middle directory entry for the `FIX_BTMAP_BEGIN` and put to the `pmd` variable, fills with zeros `bm_pte` which is boot time page tables and call `pmd_populate_kernel` function for setting given page table entry in the given page middle directory:
|
|
|
+After this we get page middle directory entry for the `FIX_BTMAP_BEGIN` and put to the `pmd` variable, fills `bm_pte` with zeros which is boot time page tables and call `pmd_populate_kernel` function for setting given page table entry in the given page middle directory:
|
|
|
|
|
|
```C
|
|
|
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
|
|
@@ -185,7 +185,7 @@ memset(bm_pte, 0, sizeof(bm_pte));
|
|
|
pmd_populate_kernel(&init_mm, pmd, bm_pte);
|
|
|
```
|
|
|
|
|
|
-That's all for this. If you feeling missunderstanding, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.
|
|
|
+That's all for this. If you feeling puzzled, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.
|
|
|
|
|
|
Obtaining major and minor numbers for the root device
|
|
|
--------------------------------------------------------------------------------
|
|
@@ -208,7 +208,7 @@ Protocol: ALL
|
|
|
deprecated, use the "root=" option on the command line instead.
|
|
|
```
|
|
|
|
|
|
-Now let's try understand what is it `old_decode_dev`. Actually it just calls `MKDEV` inside which generates `dev_t` from the give major and minor numbers. It's implementation pretty easy:
|
|
|
+Now let's try to understand what `old_decode_dev` does. Actually it just calls `MKDEV` inside which generates `dev_t` from the give major and minor numbers. It's implementation is pretty simple:
|
|
|
|
|
|
```C
|
|
|
static inline dev_t old_decode_dev(u16 val)
|
|
@@ -217,7 +217,7 @@ static inline dev_t old_decode_dev(u16 val)
|
|
|
}
|
|
|
```
|
|
|
|
|
|
-where `dev_t` is a kernel data type to present major/minor number pair. But what's the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bit for major number and 8 bit for minor number. But there is problem with this way: 256 major numbers and 256 minor numbers are possible. So 16-bit integer was replaced with 32-bit integer where 12 bits reserved for major number and 20 bits for minor. You can see this in the `new_decode_dev` implementation:
|
|
|
+where `dev_t` is a kernel data type to present major/minor number pair. But what's the strange `old_` prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bit for major number and 8 bit for minor number. But there is a problem: only 256 major numbers and 256 minor numbers are possible. So 16-bit integer was replaced by 32-bit integer where 12 bits reserved for major number and 20 bits for minor. You can see this in the `new_decode_dev` implementation:
|
|
|
|
|
|
```C
|
|
|
static inline dev_t new_decode_dev(u32 dev)
|
|
@@ -248,7 +248,7 @@ The next point is the setup of the memory map with the call of the `setup_memory
|
|
|
bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
|
|
|
```
|
|
|
|
|
|
-All of these parameters we got during boot time and stored in the `boot_params` structure. After this we need to setup the end of the I/O memory. As you know the one of the main purposes of the kernel is resource management. And one of the resource is a memory. As we already know there are two ways to communicate with devices are I/O ports and device memory. All information about registered resources available through:
|
|
|
+All of these parameters we got during boot time and stored in the `boot_params` structure. After this we need to setup the end of the I/O memory. As you know one of the main purposes of the kernel is resource management. And one of the resource is memory. As we already know there are two ways to communicate with devices are I/O ports and device memory. All information about registered resources are available through:
|
|
|
|
|
|
* /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
|
|
|
* /proc/iomem - provides current map of the system's memory for each physical device.
|
|
@@ -315,15 +315,15 @@ EXPORT_SYMBOL(iomem_resource);
|
|
|
|
|
|
TODO EXPORT_SYMBOL
|
|
|
|
|
|
-`iomem_resource` defines root addresses range for io memory with `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As i wrote about our current point is setup the end address of the `iomem`. We will do it with:
|
|
|
+`iomem_resource` defines root addresses range for io memory with `PCI mem` name and `IORESOURCE_MEM` (`0x00000200`) as flags. As i wrote above our current point is setup the end address of the `iomem`. We will do it with:
|
|
|
|
|
|
```C
|
|
|
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
|
|
|
```
|
|
|
|
|
|
-Here we shift `1` on `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is `cpuinfo_x86` structure which we filled during execution of the `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it presents maximum bits amount of the maximum physical address in the system. Note also that `iomem_resource` passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking or in another words it makes a symbol accessible to dynamically loaded modules.
|
|
|
+Here we shift `1` on `boot_cpu_data.x86_phys_bits`. `boot_cpu_data` is `cpuinfo_x86` structure which we filled during execution of the `early_cpu_init`. As you can understand from the name of the `x86_phys_bits` field, it presents maximum bits amount of the maximum physical address in the system. Note also that `iomem_resource` is passed to the `EXPORT_SYMBOL` macro. This macro exports the given symbol (`iomem_resource` in our case) for dynamic linking or in other words it makes a symbol accessible to dynamically loaded modules.
|
|
|
|
|
|
-As we set the end address of the root `iomem` resource address range, as I wrote about the next step will be setup of the memory map. It will be produced with the call of the `setup_memory_map` function:
|
|
|
+After we set the end address of the root `iomem` resource address range, as I wrote above the next step will be setup of the memory map. It will be produced with the call of the `setup_ memory_map` function:
|
|
|
|
|
|
```C
|
|
|
void __init setup_memory_map(void)
|
|
@@ -352,7 +352,7 @@ struct x86_init_ops x86_init __initdata = {
|
|
|
}
|
|
|
```
|
|
|
|
|
|
-As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions collect, print of all regions with printk. You can find this print if you execute `dmesg` command, you must see something like this:
|
|
|
+As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this:
|
|
|
|
|
|
```
|
|
|
[ 0.000000] e820: BIOS-provided physical RAM map:
|
|
@@ -406,7 +406,7 @@ static inline void __init copy_edd(void)
|
|
|
Memory descriptor initialization
|
|
|
--------------------------------------------------------------------------------
|
|
|
|
|
|
-The next step is initialization of the memory descriptor of the init process. As you already can know every process has own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
|
|
|
+The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
|
|
|
|
|
|
```C
|
|
|
#define INIT_TASK(tsk) \
|
|
@@ -420,7 +420,7 @@ The next step is initialization of the memory descriptor of the init process. As
|
|
|
}
|
|
|
```
|
|
|
|
|
|
-`mm` points to the process address space and `active_mm` points to the active address space if process has no own as kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill memory descriptor of the initial process:
|
|
|
+`mm` points to the process address space and `active_mm` points to the active address space if process has no address space such as kernel threads (more about it you can read in the [documentation](https://www.kernel.org/doc/Documentation/vm/active_mm.txt)). Now we fill memory descriptor of the initial process:
|
|
|
|
|
|
```C
|
|
|
init_mm.start_code = (unsigned long) _text;
|
|
@@ -429,7 +429,7 @@ The next step is initialization of the memory descriptor of the init process. As
|
|
|
init_mm.brk = _brk_end;
|
|
|
```
|
|
|
|
|
|
-with the kernel's text, data and brk. `init_mm` is memory descriptor of the initial process and defined as:
|
|
|
+with the kernel's text, data and brk. `init_mm` is the memory descriptor of the initial process and defined as:
|
|
|
|
|
|
```C
|
|
|
struct mm_struct init_mm = {
|
|
@@ -444,7 +444,7 @@ struct mm_struct init_mm = {
|
|
|
};
|
|
|
```
|
|
|
|
|
|
-where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is address space users, `mm_count` is primary usage counter and `mmap_sem` is memory area semaphore. After that we setup memory descriptor of the initiali process, next step is initialization of the intel Memory Protection Extensions with `mpx_mm_init`. The next step after it is initialization of the code/data/bss resources with:
|
|
|
+where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointer to the page global directory, `mm_users` is address space users, `mm_count` is primary usage counter and `mmap_sem` is memory area semaphore. After we setup memory descriptor of the initiali process, next step is initialization of the intel Memory Protection Extensions with `mpx_mm_init`. The next step is initialization of the code/data/bss resources with:
|
|
|
|
|
|
```C
|
|
|
code_resource.start = __pa_symbol(_text);
|
|
@@ -455,7 +455,7 @@ where `mm_rb` is a red-black tree of the virtual memory areas, `pgd` is a pointe
|
|
|
bss_resource.end = __pa_symbol(__bss_stop)-1;
|
|
|
```
|
|
|
|
|
|
-We already know a little about `resource` structure (read above). Here we fills code/data/bss resources with the physical addresses of they. You can see it in the `/proc/iomem` output:
|
|
|
+We already know a little about `resource` structure (read above). Here we fills code/data/bss resources with their physical addresses. You can see it in the `/proc/iomem`:
|
|
|
|
|
|
```C
|
|
|
00100000-be825fff : System RAM
|
|
@@ -464,7 +464,7 @@ We already know a little about `resource` structure (read above). Here we fills
|
|
|
01a11000-01ac3fff : Kernel bss
|
|
|
```
|
|
|
|
|
|
-All of these structures defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization:
|
|
|
+All of these structures are defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) and look like typical resource initialization:
|
|
|
|
|
|
```C
|
|
|
static struct resource code_resource = {
|
|
@@ -490,7 +490,7 @@ void x86_configure_nx(void)
|
|
|
Conclusion
|
|
|
--------------------------------------------------------------------------------
|
|
|
|
|
|
-It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecutre-specific stuff. It was long part, but we not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover full of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameteres, early dump of the pci devices, direct Media Interface scanning and many many more.
|
|
|
+It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecutre-specific stuff. It was long part, but we have not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameteres, early dump of the pci devices, direct Media Interface scanning and many many more.
|
|
|
|
|
|
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
|
|
|