13/3/11

The Kernel Boot Process

The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I’ll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what’s happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff – like interrupts and memory – gets only a nod for now. The post ends with the highlights for the Windows boot.

At this point in the Intel x86 boot story the processor is running in real-mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system:
RAM Contents after Bootloader
The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file in your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory.

The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It’s then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:
Architecture-specific Linux Kernel Initialization
The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It’s in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a “bugger_off_msg” to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware.
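To make the header a bit more tangible, here is a minimal Python sketch that pulls a few boot-protocol fields out of the first sectors of a kernel image. The field offsets (setup_sects at 0x1F1, the 0xAA55 boot flag at 0x1FE, the "HdrS" signature at 0x202, the protocol version at 0x206 and the version-string pointer at 0x20E) are as I read them in the kernel's boot protocol document, so treat them as assumptions and check them against your own tree. The function name and the returned dictionary are made up for illustration.

```python
import struct

def parse_setup_header(image: bytes) -> dict:
    """Read a handful of real-mode header fields from a kernel image."""
    if len(image) < 0x210:
        raise ValueError("image too short to contain the real-mode header")

    # Size of the setup code in 512-byte sectors; the protocol treats 0 as 4.
    setup_sects = image[0x1F1] or 4
    (boot_flag,) = struct.unpack_from("<H", image, 0x1FE)   # expected 0xAA55
    magic = image[0x202:0x206]                              # expected b"HdrS"
    (protocol,) = struct.unpack_from("<H", image, 0x206)
    (version_ptr,) = struct.unpack_from("<H", image, 0x20E)

    version = None
    if magic == b"HdrS" and version_ptr:
        # The pointer is relative to offset 0x200 and names a NUL-terminated string.
        start = version_ptr + 0x200
        end = image.find(b"\0", start)
        if end != -1:
            version = image[start:end].decode("ascii", "replace")

    return {
        "boot_flag": boot_flag,
        "magic": magic,
        "protocol": protocol,
        # Boot sector plus setup sectors make up the real-mode piece.
        "real_mode_bytes": (setup_sects + 1) * 512,
        "kernel_version": version,
    }
```

Reading the first few kilobytes of a file such as /boot/vmlinuz-2.6.22-14-server and passing them to this function should show the same human-readable version string the boot loader sees.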

After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It’s in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb. You can verify this by running hexdump on your kernel image and seeing the bytes at that offset – just a sanity check to make sure it’s not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122.
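The 0x3aeb value is easier to read once it is split into bytes. Stored little-endian, the word is the byte 0xEB followed by 0x3A: 0xEB is the x86 opcode for a short jump and 0x3A is its signed 8-bit displacement, counted from the end of the instruction. A small sketch of that arithmetic (the helper name is mine):

```python
def decode_short_jump(code: bytes, address: int) -> int:
    """Return the target of an x86 short jump located at `address`."""
    if len(code) < 2 or code[0] != 0xEB:
        raise ValueError("not a short jmp (opcode 0xEB)")
    # The displacement is a signed byte relative to the next instruction.
    displacement = int.from_bytes(code[1:2], "little", signed=True)
    return address + 2 + displacement

word = 0x3AEB
code = word.to_bytes(2, "little")        # b"\xeb\x3a", as hexdump would show it
target = decode_short_jump(code, 0x200)  # 0x200 + 2 + 0x3a
```

With the numbers quoted above the jump lands at offset 0x23c of the image, which is where start_of_setup would sit for that particular build; a different kernel version may well have a different displacement.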

main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real-mode the interrupt vector table for the processor is conventionally at memory address 0 (the IDTR register could relocate it, but in practice it is left at 0), whereas in protected mode the location of the interrupt vector table is always taken from a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real-mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.
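Here is a rough sketch of the two translation schemes, written as plain arithmetic. In real mode a logical address is segment * 16 + offset, and interrupt vector n occupies 4 bytes starting at n * 4 when the table sits at address 0. In protected mode the segment selects an 8-byte descriptor in the GDT. The packing function follows the Intel descriptor layout; the flag values 0xc09b and 0xc093 are what I recall the temporary boot GDT using for flat 4 GB code and data segments, so take them as an assumption. All function names are illustrative.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Real-mode translation: shift the segment left 4 bits and add the offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

def ivt_slot(vector: int, ivt_base: int = 0) -> int:
    """Address of the 4-byte real-mode entry for an interrupt vector."""
    return ivt_base + 4 * vector

def pack_descriptor(base: int, limit: int, flags: int) -> int:
    """Pack base, 20-bit limit and flags into a 64-bit segment descriptor.

    `flags` carries the access byte in bits 0-7 and the granularity/size
    nibble in bits 12-15; bits 8-11 are filled from the top of the limit.
    """
    return (
        (limit & 0xFFFF)                      # limit bits 0-15
        | ((base & 0xFFFFFF) << 16)           # base bits 0-23
        | ((flags & 0xF0FF) << 40)            # access byte and flag nibble
        | (((limit >> 16) & 0xF) << 48)       # limit bits 16-19
        | (((base >> 24) & 0xFF) << 56)       # base bits 24-31
    )

boot_sector = real_mode_linear(0x07C0, 0x0000)    # the classic 0x7C00
disk_vector = ivt_slot(0x13)                      # BIOS disk services entry
flat_code = pack_descriptor(0, 0xFFFFF, 0xC09B)   # base 0, 4 GB, code
flat_data = pack_descriptor(0, 0xFFFFF, 0xC093)   # base 0, 4 GB, data
```

With a base of 0 and a 4 GB limit, protected-mode translation collapses to linear = offset, which is why these are called flat segments.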

We’re now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we’re running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there’s no need for it yet. What’s important is that we’re no longer confined to the 640K barrier and can now address up to 4GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.
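The mode switch itself boils down to two bits in CR0: PE (bit 0) turns on protected mode and PG (bit 31) turns on paging. The sketch below only models the register as an integer to show the order in which the boot code flips them; the helper name is hypothetical, and 0x60000010 is the CR0 value I remember the Intel manuals giving for a freshly reset processor.

```python
CR0_PE = 1 << 0    # Protection Enable
CR0_PG = 1 << 31   # Paging

def describe_mode(cr0: int) -> str:
    """Name the operating mode implied by a CR0 value."""
    if not cr0 & CR0_PE:
        return "real mode"
    if cr0 & CR0_PG:
        return "protected mode, paging on"
    return "protected mode, paging off"

cr0 = 0x60000010                 # assumed reset value: PE and PG both clear
after_jump = cr0 | CR0_PE        # what protected_mode_jump achieves
after_paging = after_jump | CR0_PG   # done later, once page tables exist
```

The middle state, protected mode with paging off, is where the kernel sits for the whole decompression step.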

decompress_kernel() prints the familiar “Decompressing Linux…” message. Decompression happens in-place and once it’s finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints “done.” and the comforting “Booting the kernel.” By “Booting” it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.
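In-place decompression only works if the output never overwrites compressed bytes that have not been read yet. One way to get that guarantee is to park the compressed image near the top of the buffer, leaving a buffer zone, and write the output from the bottom. The toy below models that with Python's zlib in a single bytearray; the 64-byte chunk size and 4096-byte slack are arbitrary choices of mine, not the kernel's numbers.

```python
import zlib

def decompress_in_place(compressed: bytes, uncompressed_len: int,
                        slack: int = 4096) -> bytes:
    """Inflate inside one buffer: input parked at the top, output from the bottom."""
    buf = bytearray(uncompressed_len + slack)
    read_pos = len(buf) - len(compressed)   # move the compressed image up
    if read_pos < 0:
        raise ValueError("buffer too small for the compressed image")
    buf[read_pos:] = compressed

    write_pos = 0
    inflater = zlib.decompressobj()
    while read_pos < len(buf):
        # Copy the next input chunk out before any output can land on it.
        chunk = bytes(buf[read_pos:read_pos + 64])
        read_pos += len(chunk)
        out = inflater.decompress(chunk)
        # Safety condition: output must stay below the unread input.
        if write_pos + len(out) > read_pos:
            raise RuntimeError("buffer zone too small: output caught up with input")
        buf[write_pos:write_pos + len(out)] = out
        write_pos += len(out)

    tail = inflater.flush()                 # normally empty; kept for completeness
    buf[write_pos:write_pos + len(tail)] = tail
    write_pos += len(tail)
    return bytes(buf[:write_pos])
```

By the time the loop ends, the decompressed data has overwritten the area where the compressed copy used to live, which is the same end state described above.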

The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:
Architecture-independent Linux Kernel Initialization
start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do – a runnable process – process zero gets booted out of the CPU, only to return when no runnable processes are available.
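The relationship between process zero and everything else can be captured in a toy model: whenever the run queue is empty the scheduler falls back to PID 0, and as soon as something becomes runnable the idle process is pushed aside. This is only a sketch of the idea, with made-up names and a one-tick-per-process simplification; the real scheduler is far more involved.

```python
from collections import deque

IDLE_PID = 0

def pick_next(run_queue: deque) -> int:
    """Pick the next runnable PID, or the idle process if there is none."""
    return run_queue.popleft() if run_queue else IDLE_PID

def simulate(arrivals: dict, ticks: int) -> list:
    """Return which PID runs on each tick; each process runs once and exits."""
    run_queue = deque()
    timeline = []
    for tick in range(ticks):
        run_queue.extend(arrivals.get(tick, []))   # processes made runnable now
        timeline.append(pick_next(run_queue))
    return timeline

timeline = simulate({1: [1], 2: [7, 8]}, ticks=5)
```

In this run PID 0 owns the CPU on the first and last ticks and yields to PIDs 1, 7 and 8 in between.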

But here’s the kicker for us. This idle loop is the end of the long thread we followed since boot; it’s the final descendant of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here; jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can’t be the whole story, otherwise the computer would do no work.

At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we’ve seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real-mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors. Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.
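The init fallback order is simple enough to sketch. The kernel actually attempts to execute each path in turn and moves on when the attempt fails; the version below approximates that with an "is it executable?" check so it can run safely from user space. The function name and the injectable predicate are mine.

```python
import os

# Order taken from the description of init_post() above.
INIT_CANDIDATES = ("/sbin/init", "/etc/init", "/bin/init", "/bin/sh")

def choose_init(can_execute=lambda path: os.access(path, os.X_OK)) -> str:
    """Return the first init candidate that looks runnable."""
    for path in INIT_CANDIDATES:
        if can_execute(path):
            return path
    # This is the point where the real kernel would panic.
    raise RuntimeError("No init found")
```

Passing a custom predicate makes it easy to see what would happen on a damaged system, for example one where only /bin/sh survives.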

The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to boot one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected mode code, into the boot loader itself (C:\NTLDR). So instead of having 2 regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:
Windows Kernel Initialization
The Windows user-mode start-up is naturally very different. There’s no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the local security authentication subsystem. The classic Windows login dialog runs in the context of Winlogon.

This is the end of this boot series. Thanks everyone for reading and for feedback. I’m sorry some things got superficial treatment; I’ve gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular “Software Illustrated” posts like this series along with other topics. Meanwhile, here are some resources:
■ The best, most important resource, is source code for real kernels, either Linux or one of the BSDs.
■ Intel publishes excellent Software Developer’s Manuals, which you can download for free.
■ Understanding the Linux Kernel is a good book and walks through a lot of the Linux Kernel sources. It’s getting outdated and it’s dry, but I’d still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I’ve heard other positive reviews for that book, so it sounds worth checking out.
■ For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.
[Update: In a comment below, Nix covered a lot of ground on the initial root file system that I glossed over. Thanks to Marius Barbu for catching a mistake where I wrote "CR3" instead of GDTR]

Comments
80 Responses to “The Kernel Boot Process”

1.FAb on June 23rd, 2008 8:40 am
Great article, I loved it. Thanks.

What tool do you use to generate these cute schema and illustrations ?
2.Frank Spychalski on June 23rd, 2008 8:44 am
Excellent article, thanks!
3.xxx on June 23rd, 2008 8:45 am
> Wow, that sounds very complicated. Is the process really that complicated?

No, he made it all up. What were you thinking?!
4.Maurice on June 23rd, 2008 8:45 am
Only for people that spam.
5.Gustavo Duarte on June 23rd, 2008 9:21 am
@FAb: cool, you’re welcome. The diagrams were all done in Visio 2007.

@Frank: thanks for reading.

@xxx: hahaha.

@Maurice: Yea, that comment cum URL is borderline. Sigh.
6.Traverse Davies on June 23rd, 2008 9:34 am
I have been seeing that ultimate anonymity crap come up in so many comment threads lately. Funny, always under different URL’s too. Other than that, great article (although I really could have used it about two weeks ago when trying to fix some weird boot errors, ah well, muddled through them in the end)
7.Gustavo Duarte on June 23rd, 2008 9:39 am
Alright, then I’m deleting it. Thanks for the heads up.
8.Dreamtorrent on June 23rd, 2008 11:26 am
Kick ass article, I enjoyed it!

It fills in a few blanks I got in my vague knowledge of this process, and being pretty humble in my knowledge, don’t think you under-complicated it at all – however I now am curious for more detail.

Oh, yeah. Did an upgrade which crashed and it deleted /sbin/init – at least now I know what step of the process that was … in hindsight. LOL

Keep it up!

H
9.meneame.net on June 23rd, 2008 11:36 am
The boot sequence in the Linux kernel (in English)…

A great explanation of how the Linux operating system starts up. It may be a bit complex for non-computer people, but I found it interesting….
10.kaizen on June 23rd, 2008 12:26 pm
what software do you use to create those nice diagrams?
11.[FAQ] How the Kernel Starts Up - Overclock.net - Overclocking.net on June 23rd, 2008 1:39 pm
[...] startup of a linux kernel primarily, but describes how the windows kernel is different in the end. [...]
12.Marius Barbu on June 23rd, 2008 2:42 pm
Nice writeup, subscribed!

However, there’s a little error in the article:
“Protected mode requires a register called CR3 to be loaded with the address of a Global Descriptor Table for memory”.

CR3 is the PDBR (Page Directory Base Register, holds the physical address of the page directory) so is only needed when paging is enabled. The Global Descriptor Table is loaded into GDTR (special register just like IDTR) by the lgdt instruction.
13.Stop Being Carbon · Things I wanna read in the next few days on June 23rd, 2008 2:55 pm
[...] Kernel Boot Process Diary of a failed Startup Who needs a Computer Science Degree when there’s Wikipedia Programmer Insecurity Metaclass Programming in Python [...]
14.Patrick Moroney on June 23rd, 2008 2:58 pm
I also highly recommend Linux Kernel Development by Robert Love
Much less dry than Understanding the Linux Kernel, and also more recent.
http://www.amazon.com/Linux-Kernel-Development-Novell-Press/dp/0672327201
15.Gustavo Duarte on June 24th, 2008 12:44 am
@Dreamtorrent: thanks

@kaizen: MS Visio 2007. I use ‘themes’, which make it easy to make decent-looking stuff.

@Marius: thanks for catching it. Fixed in the text.

@Patrick: thanks for the reference, I’ll add it to the text as well.
16.Jeff Moser on June 24th, 2008 9:07 pm
Thanks for the well researched post! I especially liked the links to the specific functions in the Linux kernel source.

I’ve subscribed to your feed and look forward to upcoming posts.

Keep up the great work!
17.pligg.com on June 25th, 2008 6:08 am
The Kernel Boot Process Explained: 2.6.25.6…

The kernel boot process for linux 2.6.25.6 explained…
18.Nikesh on June 25th, 2008 7:50 am
Cannot get better than this, awesome !!!

Thanks.
19.The Burgeoning Openly Owned Web » links for 2008-06-28 on June 27th, 2008 7:09 pm
[...] The Kernel Boot Process : Gustavo Duarte the birth of “life” (tags: linux kernel boot bootloader) [...]
20.Sara Eulodue on June 29th, 2008 6:07 pm
This would be infinitely more useful if you showed the Windows boot process first since that is what most computers actually use. And THEN at the end you can use the academic Linux boot process for completeness. Nope, sorry, this gets a thumbs down from me on SU. Next.
21.Kevin DuBois on June 29th, 2008 10:01 pm
Great trifecta of boot-up articles, thanks!
22.Naseer on July 1st, 2008 5:14 am
Awesome article, Thank you !
23.Gustavo Duarte on July 1st, 2008 10:18 am
Thank you all for reading and for the feedback.
24.Justin Blanton | The kernel boot process on July 2nd, 2008 1:16 am
[...] The kernel boot process. [...]
25.Ben Petering on July 7th, 2008 12:40 am
Very good article. I love the ‘illustrated’ style you used.

Incidentally, I’ve just skimmed your entire blog, and I’m rather impressed. Not only is your English quite good (a concern you voiced in one post – IMO, reading TCP/IP Illustrated is a damn good start if you’re doing technical writing), but _every_ post you’ve written so far looks interesting and substantial.

Keep up the good work. I’ll be back.

-ben
26.Linkdump: Teorija kategorija, kako radi kernel… by Nikola Plejić on July 7th, 2008 4:14 am
[...] Chipsets and the Memory Map, How Computers Boot Up and The Kernel Boot Process For those interested in how computers work on the inside, Gustavo Duarte has written a series of articles whose [...]
27.The Kernel Boot Process « Vietwow’s Weblog on July 8th, 2008 8:09 am
[...] Source: http://duartes.org/gustavo/blog/post/kernel-boot-process [...]
28.Christian on July 8th, 2008 1:59 pm
Very good, Gustavo!
May I translate it and put it on my blog, with a reference back here?

Regards
29.eto demerzel on July 8th, 2008 2:51 pm
It’s fake, photoshopped. Look, you can see the blurred pixel area

Great article, definitely the best explanation of the boot-up flow I’ve found; the graphics are tops.

Good work.
30.Gustavo Duarte on July 9th, 2008 4:56 am
@Ben: thanks a ton. I’m a huge fan of the W. Richard Stevens books as well, so I fully agree they’re a damn good start. What I meant by the English comment was that I sometimes feel a lack of non-tech reading has hampered my English. Say, when it’s time to come up with a metaphor or the ‘right word’, that kind of thing. But I’ve been here in the US for a few years now, so it’s less of a problem now.

@Christian: thank you, and you’re welcome to translate it, as long as it includes the link. If you’d like, I can send you the Visio 2007 files for the images or translate them for you.

@eto: hahaha, fake computer pr0n. Anyhow, thanks for the kind words
31.Regular (S)expressions :: Entries :: linkz on July 10th, 2008 7:18 am
[...] linux kernel boot process; http://duartes.org/gustavo/blog/post/kernel-boot-process The previous post explained how computers boot up right up to the point where the boot loader, [...]
32.Christian on July 15th, 2008 8:31 am
Hi Gustavo!

Rest assured, I will include the link.
Please send me the files so I can translate them.

When I finish translating everything, I’ll send it to you for a review. Would you like that?

Thanks and a big hug!

Christian
33.Idefix on July 16th, 2008 4:55 am
Excellent article, but I have one question:
If decompression happens in-place, how come the compressed parts don’t get overwritten by uncompressed data before those compressed part are read?
34.Alfredo Reino » Archivo del Blog » Cómo arrancan los ordenadores on July 16th, 2008 8:07 am
[...] The kernel boot process [...]
35.Gustavo Duarte on July 16th, 2008 10:43 am
@Idefix: the compressed image is temporarily moved up in memory a notch, creating a ‘buffer zone’ between the place in memory where uncompressed contents are being written to and the place where compressed contents are read from.

The code is here.

cheers
36.Amjith on July 16th, 2008 11:05 am
Hi Gustavo,
The whole process of computer boot up from memory map to kernel loading was amazing. I linked your articles to http://www.osnews.com/story/20064/Computer_Boot_Up_Process. It is refreshing to see articles that are succinct and resourceful.
37.Mojes on July 16th, 2008 3:24 pm
Thank You!

These three articles show something very complicated in an easy way. Good job!
I was looking for such text for a long time.

-mojes
38.Kilian Hekhuis on July 17th, 2008 1:45 am
“In real-mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR” – This is not true. Also in real mode, the CPU uses the IDTR to locate the (real mode) IVT. In practice, the IDTR is always set to 0, but it could be changed.
39.Gustavo Duarte on July 17th, 2008 2:01 am
@Amjith: thank you for the kind words and also for the link. I got a ton of traffic from you.

@Mojes: you’re very welcome!

@Kilian: Thanks for noting this. I’ll change the language to be more accurate.
40.nakisa on August 8th, 2008 12:46 am
thanks a lot, that was great

I will be very glad if you post more and more such sweets.
41.Frederik Braun on August 11th, 2008 8:00 am
Well explained. I think I got it, despite the fact that I didn’t know much about this topic.
So, thank you

Frederik

P.S.: More posts on this topic will be appreciated
42.Memory Translation and Segmentation : Gustavo Duarte on August 12th, 2008 2:33 am
[...] segmentation, protection, and paging in Intel-compatible (x86) computers, in the spirit of the boot series, as the next step down the path of how kernels work. As usual, I’ll link to Linux kernel [...]
43.Kimia on August 26th, 2008 11:20 am
Hi Gustavo,

that was a great article. I have a question: is it possible to monitor the boot parameters in order to find out whether everything is OK, such as:
- MBR parameters (its code and partition table), their place in memory, and which of them are still in RAM or have been removed.

Actually, I would like to write a program for this, to use as a utility in a shell environment, and honestly I don’t know what I must do, or even whether it must be an assembly program or can be a C one.

I will really be very thankful if you can guide me on this,
thanks a lot,

-Kimia
44.kavitha on August 27th, 2008 11:10 pm
hi duartes
you have done a very good job..

I have few queries : Is there any fixed size in memory reserved for user and kernel space? otherwise how can we know the size of memory used by kernel and user application?
45.Gustavo Duarte on August 28th, 2008 1:05 am
@Kimia: Nearly all of the real-mode data from the kernel is wiped out once the protected mode part starts running, so some of these early parameters are lost.

You can however read the MBR and partition table right off of the disk, by reading for example a device like /dev/hdxx or /dev/sdxx corresponding to your hard disk. It would not be too hard to read the partition table and MBR doing that. You might want to read the source code for fdisk and other Linux disk utilities.

Does this help?

@kavitha: There is memory that the kernel reserves for itself, yes. But there’s also dynamic memory that the kernel allocates and frees as it runs. The kernel keeps a database of all of the memory and how it has been distributed (which process owns it, etc).

I’ll write a post on memory this weekend that will cover some of this.
46.Kimia on August 28th, 2008 1:48 pm
Hi Gustavo,

First of all, thanks a lot for your attention and help.
I see, so as you said I can read from disk, but unfortunately I don’t know exactly what command to use to read the first sector from disk in Linux. I’m new to Linux system programming, so I need a lot of help in this field.

Would you please guide me to a good way and reference to learn more about this? I have only scattered knowledge that I have not applied on real systems, and I’m new to this world, with a hungry mind.

best regards,
Kimia
47.Gustavo Duarte on August 28th, 2008 9:01 pm
@Kimia: no problem at all. Regarding reading from the disk, it’s Unix tradition to expose hardware devices as magic files in the filesystem, usually under directory /dev

Many devices are exposed there as files, including disks. The exact name of the file depends on the nature of the device (hard disk, USB disk, scanner, sound card, etc), the bus (ide, scsi, sata, usb, etc), the order of the device (1st hard drive on bus, 2nd, etc).

So to read the hard drive, you’d need to find the right device, and then you can use regular C functions like open(), read() to read raw bytes out of disk.

However, what I _really_ suggest you do is _read source code_. It’s one of the _best_ ways to learn, and in this case there are tools that do exactly what you want (MBR and partition table manipulation) and whose code is open source. So get yourself the code for Linux fdisk, maybe the GRUB configuration installer, and read the code. It can teach you a lot.

Of course, you need some books too. My favorite Unix author was W. Richard Stevens, he’s got some great books, but sadly he died and the books haven’t been updated. Look in Amazon for his books, maybe see what the commenters are saying, and find some 5-star books on Unix programming.

hope this helps,
gustavo
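To make the open()/read() suggestion concrete, here is a minimal sketch in Python (its file calls wrap the same system calls) that parses the classic MBR layout: four 16-byte partition entries starting at byte 446, followed by the 0x55 0xAA signature at bytes 510 and 511. The function name and the dictionary keys are illustrative only.

```python
import struct

def parse_mbr(sector: bytes) -> list:
    """Return the non-empty partition entries found in a 512-byte MBR sector."""
    if len(sector) < 512:
        raise ValueError("need the full 512-byte first sector")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")

    partitions = []
    for index in range(4):
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped), start LBA, size
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:                      # type 0 marks an unused slot
            partitions.append({
                "index": index,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return partitions

# Typical use, as root; the device name is an assumption about your system:
#     with open("/dev/sda", "rb") as disk:
#         print(parse_mbr(disk.read(512)))
```

Reading the device needs root privileges, and a disk partitioned with GPT will show only a single protective entry here.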
48.Sahaya Darcius on September 10th, 2008 12:00 am
I really enjoyed this article and it is a very good one to understand the kernel boot process. Thank you very much for this wonderful article that too in simplified form.
49.sergio on September 16th, 2008 5:09 am
Hello Gustavo, I was reading your article to see if it would clear up a few things about the kernel boot process, since I’m fairly new to this.
I’m working with an embedded system and I want to update the kernel. I compiled it and it generated a kernel image and another file with the file system, rootfs.ext2. My question is: do I need to store both in the flash of my embedded system, or is the kernel image alone enough?
I tried to boot my new kernel and it starts decompressing but then gives me a panic:
Kernel panic – not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
Maybe you can clarify a bit how this works. I’m a little lost…
Thanks in advance
50.Peter Teoh on September 19th, 2008 3:19 am
Fantastic article!!! I love it!!!!
51.ashoka1 on September 23rd, 2008 5:23 am
Nice details about the kernel boot-up process. Never found better than this. Thank you.
52.dholm.com » Blog Archive » Tumblelog: 080929 on September 29th, 2008 1:06 am
[...] The Kernel Boot Process by Gustavo Duarte explains the Linux kernel boot process on an x86 platform. Very well written with descriptive and good looking diagrams. [...]
53.kimia on November 30th, 2008 1:15 am
Hi Gustavo again,
thanks again and again, this time for your helpful advice and guidance, which helped me a lot.
54.Tim on December 9th, 2008 12:57 am
thank you very much !
55.Gautam on December 12th, 2008 6:52 am
Hi, thanks for this great work.
I am trying to read MBR using linux 2.6 kernel.
I need a source code for fdisk utility for intel x86 architecture.
Could you please guide me to the correct link to follow ?
56.Gustavo Duarte on December 12th, 2008 10:28 am
@Gautam:

The easiest way to do it is to just install source packages for the package that contains fdisk. Not sure which distro you have, but you would need to:

1. Find which package fdisk came from (rpm or yum or apt should be able to tell you)

2. Install the source package that corresponds to that package. There’s a 1-to-1 correspondence between binary packages and source packages.

I think in Debian-based distros the package is util-linux, see here:

http://packages.qa.debian.org/u/util-linux.html

The FSF has an fdisk project too, but I’m not sure if it’s the same thing because they say they provide an alternative to util-linux fdisk. The FSF page for their fdisk is here:

http://www.gnu.org/software/fdisk/

But go with the distro’s source.
57.Frank on December 30th, 2008 12:23 pm
Gustavo,

Great article. Any idea where the source code for /sbin/init itself is? I’m also interested in the “events” process that is spawned by the “init” process, and would like to know where the source code for that is as well, if you know.

-thanks
Frank
58.ajit mote on January 19th, 2009 11:40 pm
Thanks Man !!!
Awesome article on boot process !!!
59.Nix on February 5th, 2009 6:14 am
One minor additional complexity. The initial root filesystem is, by default, assembled from the contents of the usr/ subdirectory in the kernel source tree (it’s a compressed cpio archive) and linked into the kernel image; alternatively, a compressed filesystem image can be linked into the kernel image or provided in a separate file. Part of the boot process (in init/main.c:do_basic_setup()) involves executing ‘initcalls’, which are stored in an array of pointers to functions to be called at boot time, constructed by the linker. One of these initcalls is init/initramfs.c:populate_rootfs(), which initializes the nonswappable memory-backed filesystem which is always mounted at / (the *real* root filesystem is mounted over the top of it, later on). The rootfs is never unmounted: you can see it as the first entry in /proc/mounts. Then it uncompresses the cpio archive or arranges for the filesystem to be backed by that compressed filesystem image, if either are present, and executes /init on that filesystem, if present, to complete the boot via an ‘early userspace’, chroot to the real root filesystem once it’s found it, and exec the real init. So the job of finding root filesystems is *completely* customizable. You can assemble it from a RAID array with some components pulled over the network if you like (I’ve done this in extremis as part of disaster recovery).

Finally, if that didn’t work and we still don’t have a useful root filesystem with an /sbin/init on it, just before calling init_post(), the system may call prepare_namespace() in init/do_mounts.c. This can try to dig up a root filesystem in a variety of ways: pausing for a configurable amount of time so the user can do something to provide a filesystem, waiting for delayed device probes in case the root filesystem is on some slow-to-start thing like a SCSI disk or a USB key, doing automated RAID probing (somewhat dangerous because it can’t tell if the array it’s assembling is actually made of pieces that are meant to go together: the recommended way to boot off RAID is to use one of the earlier customizable boot processes and run the mdadm tool in there to do the assembly), mounting a block device specified via root= on the kernel command line, or even asking the user to insert a separate floppy containing the root filesystem (I’m not sure *anyone* does this anymore, even in emergencies).

I haven’t got into the half a dozen horrible ways the various early userspaces can signal their completion (echoing the real device numbers into a file in /proc, executing the horrible ‘pivot_root()’ syscall, or just deleting everything on the rootfs and doing a ‘chroot exec /sbin/init’ into the real root filesystem, which is the modern way to boot up because it doesn’t rely on any horrible early-userspace-specific hacks). For more, see Documentation/filesystems/ramfs-rootfs-initramfs.txt and Documentation/initrd.txt in your favourite Linux kernel tree.
60.Raj on February 13th, 2009 10:46 pm
Gustavo, thanks a ton for this article. Excellent explanation of a really complex process, even a newbie like me has no problems following.

Big thumbs up!!!
61.Ya-tou & me » Blog Archive » Memory Translation and Segmentation on February 19th, 2009 1:47 am
[...] in Intel-compatible (x86) computers, going further down the path of how kernels work. As in the boot series, I’ll link to Linux kernel sources but give Windows examples as well (sorry, I’m ignorant about [...]
62.Ya-tou & me » Blog Archive » The Kernel Boot Process on February 19th, 2009 1:49 am
[...] In a comment below, Nix covered a lot of ground on the initial root file system that I glossed over. Thanks to Marius [...]
63.steampunk.dk » Blog Archive » Nå sådan! on April 22nd, 2009 8:33 pm
[...] The kernel boot process Posted in IT Teknologi | Leave a Comment [...]
64.Andreea Lucau on May 25th, 2009 4:13 pm
I really liked your article. You have a nice Linux-like sense of humor:)
65.Bob Forder on June 5th, 2009 11:53 am
[quote]This would be infinitely more useful if you showed the Windows boot process first since that is what most computers actually use. And THEN at the end you can use the academic Linux boot process for completeness. Nope, sorry, this gets a thumbs down from me on SU. Next.[/quote]

Yeah, all those Windows kernel hackers out there must be disappointed…
66.dp on July 11th, 2009 6:22 pm
This is *easily* the best article on this subject I’ve been able to find on the web. Very clear, right level of detail (for me, at least), and good references. Thanks very much for writing it.

I infer from some prior comments that you aren’t a native English speaker. Fear not — your English is better than most natives’.

Any chance you’ll write something comparable, describing EFI? (For that matter, I’d like coverage of OSX as well, but suppose that’s asking too much.)

I’ve bookmarked your homepage.
67.rajakama on July 28th, 2009 6:49 pm
Nice Article. Thank you
68.Ashoka on August 19th, 2009 10:20 pm
Very good article.
Thanks a lot
69.Naveen on August 20th, 2009 6:41 am
hi Gustavo,
Thanks for the wonderful article but i have question

what is that “0x3aeb”? i found only “0xeb” (unconditional jump) in header.S. what does that “3a” represent? And why do we need this unconditional jump here? why can’t we put the start_of_setup code directly there?
70.The Boot Process of a Computer | From thoughts to text on October 16th, 2009 11:07 am
[...] The Kernel Boot Process [...]
71.dubhe on November 11th, 2009 11:22 am
Great explanation of a difficult process that is not easy to understand. I’d like to give thanks for this good article
72.Interesting Reading… – The Blogs at HowStuffWorks on January 11th, 2010 11:44 am
[...] The Kernel Boot Process – “The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life…” [...]
73.raj on January 29th, 2010 1:36 am
Thanks for such a wonderful article.Need more of this article in future.
I’ve bookmarked your homepage.
74.pravar on March 23rd, 2010 12:33 am
good job ………
75.mahayodha on November 9th, 2010 6:52 am
Thanks a ton for not only this but other great articles too.
76.Kaiwan on November 18th, 2010 6:41 am
Hi,

Awesome article Gustav, as usual!

On the ref books topic, I highly recommend this book on drivers: “Essential Linux Device Drivers” by S Venkateswaran.

http://www.amazon.com/Essential-Device-Drivers-Sreekrishnan-Venkateswaran/dp/0132396556/ref=sr_1_2?ie=UTF8&qid=1290086793&sr=8-2

It not only covers the “usual” driver-related topics (and plenty of ‘em) but also gives a fantastic quick look at kernel internals topics most relevant to driver authors.
77.pravink.22 on November 28th, 2010 3:23 am
thxs buddy..!!!!
Try this, very easy to understand
http://www.redhatlinux.info/2010/11/steps-of-boot-process.html
78.figaro on February 10th, 2011 6:32 pm
Great article, I loved it. Thanks.
79.Jakub Jankiewicz on February 14th, 2011 8:43 am
Thank a lot for this article.
80.jonny rocket on February 14th, 2011 9:39 am
nice article. thanks.

Memory Translation and Segmentation

This post is the first in a series about memory and protection in Intel-compatible (x86) computers, going further down the path of how kernels work. As in the boot series, I’ll link to Linux kernel sources but give Windows examples as well (sorry, I’m ignorant about the BSDs and the Mac, but most of the discussion applies). Let me know what I screw up.

In the chipsets that power Intel motherboards, memory is accessed by the CPU via the front side bus, which connects it to the northbridge chip. The memory addresses exchanged in the front side bus are physical memory addresses, raw numbers from zero to the top of the available physical memory. These numbers are mapped to physical RAM sticks by the northbridge. Physical addresses are concrete and final – no translation, no paging, no privilege checks – you put them on the bus and that’s that. Within the CPU, however, programs use logical memory addresses, which must be translated into physical addresses before memory access can take place. Conceptually address translation looks like this:
Memory Address Translation in x86 with paging enabled
This is not a physical diagram, only a depiction of the address translation process, specifically for when the CPU has paging enabled. If you turn off paging, the output from the segmentation unit is already a physical address; in 16-bit real mode that is always the case. Translation starts when the CPU executes an instruction that refers to a memory address. The first step is translating that logical address into a linear address. But why go through this step instead of having software use linear (or physical) addresses directly? For roughly the same reason humans have an appendix whose primary function is getting infected. It’s a wrinkle of evolution. To really make sense of x86 segmentation we need to go back to 1978.

The original 8086 had 16-bit registers and its instructions used mostly 8-bit or 16-bit operands. This allowed code to work with 2^16 bytes, or 64K of memory, yet Intel engineers were keen on letting the CPU use more memory without expanding the size of registers and instructions. So they introduced segment registers as a means to tell the CPU which 64K chunk of memory a program’s instructions were going to work on. It was a reasonable solution: first you load a segment register, effectively saying “here, I want to work on the memory chunk starting at X”; afterwards, 16-bit memory addresses used by your code are interpreted as offsets into your chunk, or segment. There were four segment registers: one for the stack (ss), one for program code (cs), and two for data (ds, es). Most programs were small enough back then to fit their whole stack, code, and data each in a 64K segment, so segmentation was often transparent.

Nowadays segmentation is still present and is always enabled in x86 processors. Each instruction that touches memory implicitly uses a segment register. For example, a jump instruction uses the code segment register (cs) whereas a stack push instruction uses the stack segment register (ss). In most cases you can explicitly override the segment register used by an instruction. Segment registers store 16-bit segment selectors; they can be loaded directly with instructions like MOV. The sole exception is cs, which can only be changed by instructions that affect the flow of execution, like CALL or JMP. Though segmentation is always on, it works differently in real mode versus protected mode.

In real mode, such as during early boot, the segment selector is a 16-bit number specifying the physical memory address for the start of a segment. This number must somehow be scaled, otherwise it would also be limited to 64K, defeating the purpose of segmentation. For example, the CPU could use the segment selector as the 16 most significant bits of the physical memory address (by shifting it 16 bits to the left, which is equivalent to multiplying by 2^16). This simple rule would enable segments to address 4 gigs of memory in 64K chunks. Sadly Intel made a bizarre decision to multiply the segment selector by only 2^4 (or 16), which in a single stroke confined memory to about 1MB and unduly complicated translation. Here’s an example showing a jump instruction where cs contains 0x1000:
Real Mode Segmentation
Real mode segment starts range from 0 all the way to 0xFFFF0 (16 bytes short of 1 MB) in 16-byte increments. To these values you add a 16-bit offset (the logical address) between 0 and 0xFFFF. It follows that there are multiple segment/offset combinations pointing to the same memory location, and that physical addresses can fall above 1MB if your segment is high enough (see the infamous A20 line). Also, when writing C code in real mode a far pointer is a pointer that contains both the segment selector and the logical address, which allows it to address 1MB of memory. Far indeed. As programs started getting bigger and outgrowing 64K segments, segmentation and its strange ways complicated development for the x86 platform. This may all sound quaintly odd now but it has driven programmers into the wretched depths of madness.
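The real-mode arithmetic is small enough to write down. What follows is only a sketch of the rule described above, not code from any kernel; the function name real_mode_address and the offset 0x5678 in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Real-mode translation as described in the text: the 16-bit segment
 * selector is multiplied by 16 (shifted left by 4 bits) and the 16-bit
 * offset is added to the result. The name is hypothetical. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With cs holding 0x1000, a hypothetical offset of 0x5678 lands at 0x15678. The pairs 0x1000:0x0010 and 0x1001:0x0000 both yield 0x10010, which is the aliasing mentioned above, and 0xFFFF:0xFFFF gives 0x10FFEF, just past the 1MB mark.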

In 32-bit protected mode, a segment selector is no longer a raw number, but instead it contains an index into a table of segment descriptors. The table is simply an array containing 8-byte records, where each record describes one segment and looks thus:
Segment Descriptor
There are three types of segments: code, data, and system. For brevity, only the common features in the descriptor are shown here. The base address is a 32-bit linear address pointing to the beginning of the segment, while the limit specifies how big the segment is. Adding the base address to a logical memory address yields a linear address. DPL is the descriptor privilege level; it is a number from 0 (most privileged, kernel mode) to 3 (least privileged, user mode) that controls access to the segment.
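To make the descriptor fields concrete, here is a rough sketch that pulls base, limit, and DPL out of the two 32-bit halves of a descriptor. The bit positions follow the layout in the Intel manuals as I understand it, including the granularity flag that the simplified diagram leaves out; the struct and function names are invented for this example and are not the kernel's.

```c
#include <stdint.h>

/* One 8-byte segment descriptor, split into its low and high 32-bit
 * halves. Struct and function names are hypothetical. */
struct descriptor {
    uint32_t low;
    uint32_t high;
};

/* The base address is scattered over three fields. */
uint32_t descriptor_base(struct descriptor d)
{
    return (d.low >> 16)               /* base bits 15..0  */
         | ((d.high & 0xFFu) << 16)    /* base bits 23..16 */
         | (d.high & 0xFF000000u);     /* base bits 31..24 */
}

/* The limit is a 20-bit field; when the granularity flag (bit 23 of the
 * high half) is set it counts 4K units instead of bytes. */
uint32_t descriptor_limit(struct descriptor d)
{
    uint32_t limit = (d.low & 0xFFFFu) | (d.high & 0x000F0000u);
    if (d.high & (1u << 23))
        limit = (limit << 12) | 0xFFFu;
    return limit;
}

/* DPL lives in bits 14..13 of the high half. */
unsigned descriptor_dpl(struct descriptor d)
{
    return (d.high >> 13) & 0x3u;
}
```

Feeding it the pair { 0x0000ffff, 0x00cffa00 }, the user code segment entry quoted from the Linux sources in the comments further down, gives a base of 0, an effective limit of 0xFFFFFFFF, and a DPL of 3.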

These segment descriptors are stored in two tables: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Each CPU (or core) in a computer contains a register called gdtr which stores the linear memory address of the first byte in the GDT. To choose a segment, you must load a segment register with a segment selector in the following format:
Segment Selector
The TI bit is 0 for the GDT and 1 for the LDT, while the index specifies the desired segment descriptor within the table. We’ll deal with RPL, Requested Privilege Level, later on. Now, come to think of it, when the CPU is in 32-bit mode registers and instructions can address the entire linear address space anyway, so there’s really no need to give them a push with a base address or other shenanigan. So why not set the base address to zero and let logical addresses coincide with linear addresses? Intel docs call this “flat model” and it’s exactly what modern x86 kernels do (they use the basic flat model, specifically). Basic flat model is equivalent to disabling segmentation when it comes to translating memory addresses. So in all its glory, here’s the jump example running in 32-bit protected mode, with real-world values for a Linux user-mode app:
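The selector format can be sketched with a few shifts and masks: the RPL sits in the lowest two bits, TI in the next bit, and the index in the upper 13 bits. The helper names below are made up, and the values 0x73 and 0x60 in the usage note are, to the best of my knowledge, what 32-bit Linux kernels of this era use for the user and kernel code segments; treat them as an assumption rather than gospel.

```c
#include <stdint.h>

/* Build a 16-bit segment selector from its three fields.
 * All four function names are hypothetical. */
uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}

/* Index of the descriptor within the GDT or LDT (upper 13 bits). */
unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}

/* Table indicator: 0 selects the GDT, 1 selects the LDT. */
unsigned selector_ti(uint16_t selector)
{
    return (selector >> 2) & 1u;
}

/* Requested Privilege Level (lowest two bits). */
unsigned selector_rpl(uint16_t selector)
{
    return selector & 3u;
}
```

Decoding 0x73 yields index 14, TI 0 (the GDT), and RPL 3; decoding 0x60 yields index 12, TI 0, and RPL 0. Since each descriptor is 8 bytes, clearing the low three bits of a selector also happens to give the byte offset of its descriptor within the table.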
Protected Mode Segmentation
The contents of a segment descriptor are cached once they are accessed, so there’s no need to actually read the GDT in subsequent accesses, which would kill performance. Each segment register has a hidden part to store the cached descriptor that corresponds to its segment selector. For more details, including more info on the LDT, see chapter 3 of the Intel System Programming Guide Volume 3a. Volumes 2a and 2b, which cover every x86 instruction, also shed light on the various types of x86 addressing operands – 16-bit, 16-bit with segment selector (which can be used by far pointers), 32-bit, etc.

In Linux, only 3 segment descriptors are used during boot. They are defined with the GDT_ENTRY macro and stored in the boot_gdt array. Two of the segments are flat, addressing the entire 32-bit space: a code segment loaded into cs and a data segment loaded into the other segment registers. The third segment is a system segment called the Task State Segment. After boot, each CPU has its own copy of the GDT. They are all nearly identical, but a few entries change depending on the running process. You can see the layout of the Linux GDT in segment.h and its instantiation is here. There are four primary GDT entries: two flat ones for code and data in kernel mode, and another two for user mode. When looking at the Linux GDT, notice the holes inserted on purpose to align data with CPU cache lines – an artifact of the von Neumann bottleneck that has become a plague. Finally, the classic “Segmentation fault” Unix error message is not due to x86-style segments, but rather invalid memory addresses normally detected by the paging unit – alas, topic for an upcoming post.

Intel deftly worked around their original segmentation kludge, offering a flexible way for us to choose whether to segment or go flat. Since coinciding logical and linear addresses are simpler to handle, they became standard, such that 64-bit mode now enforces a flat linear address space. But even in flat mode segments are still crucial for x86 protection, the mechanism that defends the kernel from user-mode processes and every process from each other. It’s a dog eat dog world out there! In the next post, we’ll take a peek at protection levels and how segments implement them.

Comments
36 Responses to “Memory Translation and Segmentation”

1.Chuck on August 12th, 2008 7:03 am
For what it’s worth, I’m really enjoying your articles. Please continue to write
2.Mahesh on August 12th, 2008 9:40 pm
I liked the explanation, very well articulated.
3.Gustavo Duarte on August 13th, 2008 12:34 am
@Chuck: it’s worth a lot. I enjoy writing these posts, I write them for fun but the fact that people seem to like them is definitely encouraging too.

I’m cooking up the next one here… I’m actually in Hawaii this week, I’ve been waking up at ~6am to snorkel and dive, sleeping early, but this evening I’m writing a bit of the protection stuff

@Mahesh: thanks!
4.Arvind on August 13th, 2008 2:35 am
Nice article!! I have subscribed to your blog feed and everytime I check for new items your blog is the first I look for unread items..Elated that I found one today..Good read..
5.Ben fowler on August 13th, 2008 4:53 pm
I thoroughly enjoyed this well-written article as well as your others. It’s interesting, but challenging material for a lot of people, and I know it’d otherwise take a LOT of reading around to get the information elsewhere.

You have a knack for this Gustavo! Have you considered teaching, or at least turning this material into a book? Keep up the great work!
6.notmuch on August 15th, 2008 9:08 am
Nicely done. For long I have been pondering upon how to visualize the workings of hardware. From CPU clock cycle and instructions, interrupt mechanism and interaction with software, and memory. These articles are a good step in that direction. Thanks much.
7.JinxterX on August 17th, 2008 10:48 am
You write great articles, thanks, any chance of doing one on Linux and MTRRs?
8.Hormiga Jones on August 19th, 2008 9:22 am
Your series of articles are HIGHLY informative and well written. I have forwarded them onto several of my work colleagues. Thank you for this valuable resource and keep up the good work.
9.CPU Rings, Privilege, and Protection : Gustavo Duarte on August 20th, 2008 12:39 am
[...] let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they [...]
10.Gustavo Duarte on August 20th, 2008 12:51 am
Thank you all for the feedback!

I was in Hawaii last week, hence the belated reply. I did crank out the protection stuff in the flight back to Denver though

@Ben: You know, I started the blog just as a way to write random stuff. I never expected to get hits hehe. So now I have thought about some of the stuff you mentioned, especially books.

The trouble with books is that the stuff ends up locked and inaccessible to people, plus you lose things like being able to link to the source code directly (or linking in general).

I also thought about doing a print-friendly CSS or maybe render some stuff as PDF to let people download. That might be a good way to go.

Teaching would be fun… I love teaching people who are interested in learning. hehe. I don’t plan ahead too much, so I guess I’ll see where it goes. Thanks for the encouraging words.

@JinxterX: yea, that sounds cool. I thought about writing about cache lines and how memory access happens “for real”, and talking about MTRRs would make a lot of sense. I’ll write it down here. I think after this last series though I’ll do a hiatus on CPU internals type of stuff, because I didn’t want the blog to be just about that. But I’ve added MTRRs to the !ideas.txt here
11.Peter Teoh on September 19th, 2008 3:19 am
Thank you for the article. I am playing with the memory in Linux Kernel now. When I touch/modify things like PTE, PDE etc do I need to be in preemption disabled mode? And whenever I modify these, is flushing of TLB needed? Will it lead to crash if not? I am always getting crashes doing all these, so not sure what causes the crashes?
12.Mario on September 23rd, 2008 7:27 pm
Hey guy,

These are some of the best-written architecture articles I have came across. I started teaching myself x86 assembly awhile back and realized I was getting nowhere without really understanding memory management and how the CPU operates. Your articles are filling in a lot of gaps for me and I truly do appreciate it.

-Mario
13.Nikhil on December 11th, 2008 7:43 am
Hi Gustavo,
like everyone already has said…highly informative, well articulated, best written literature about x86 on the internet. In my pursuit of trying to understand segments, descriptors, selectors this is serving the purpose wonderfully.

again i cannot help but echo other people’s thoughts when i suggest you should write a book.
Thanks a lot.
14.Gustavo Duarte on December 12th, 2008 12:56 am
@Peter: sorry for not replying in time. Unfortunately it’s hard for me to keep up with the comments sometimes, though I’m trying to do a better job. I’ll email you to see if your doubts are still current.

@Mario, @Nikhil: wow, thanks a LOT for the kind words. That’s really encouraging. I had no idea I was any good at this stuff until I started blogging, but the fact that my stuff works at least for some of you is really cool. I get a huge kick out of teaching stuff, especially computing since I like it a lot. So, thanks again – hearing stuff like this makes me want to post more for sure.

Regarding the book, I have thought about it. The trouble is that I wouldn’t want to lock the content away. I want it to be freely accessible. But I also thought about maybe assembling the stuff and making an online book.

I would love to do something like that. Time is the problem, as usual hehe.
15.el_bot on December 28th, 2008 9:15 am
Hi Gustavo. I have a doubt. Are you sure that gdtr store a linear address? Is not it a physical address? If paging is disabled it should be physical; but if paging is enabled, well, I don’t know. If it is linear then you can get a fault page when accesing to the GDT; something (I think) problematic in this stage (you can have 2 page faults in only one memory access!).

Good blog and happy new year.
16.Gustavo Duarte on December 29th, 2008 12:55 am
@el_bot: thanks, and happy new year to you as well!

I’m sure about the GDTR. Here is the relevant bit from Section 2.4.1 in the Intel System Programming Guide 3A:

“The GDTR register holds the base address (32 bits in protected mode; 64 bits in IA-32e mode) and the 16-bit table limit for the GDT. The base address specifies the linear address of byte 0 of the GDT; the table limit specifies the number of bytes in the table.”

The CR3 register though (also called PDBR) which points to the page directory does hold a physical address. Also, see section 8.8, Software Initialization for Protected Mode Operation. It covers some of the initializations that must be done.

My post on the kernel boot up also might help at http://duartes.org/gustavo/blog/post/kernel-boot-process

Cheers!
17.el_bot on December 30th, 2008 8:58 am
Thanks for the specific references. When I have a bit of free time I will read theses.
Ok, your are (again) right. But in the case the CPU running in protected mode 32bit with paging disabled, you would need storing a physical address in GDTR (ok, you can say “it’s anyway a linear adress, but in this case is like if you are using (fictitious) page tables performing a identity mapping; i.e; X (linear)-> X (physical) ). In any case, before switch to mode 32bit protected with paging disabled, you must store in the GTDR the physical address of the GTD. It’s a assumption, but I will try check it in the linux kernel code (http://lxr.linux.no/linux+v2.6.25.6/arch/x86/boot/pm.c#L115 ? I think it is done in the line ‘asm volatile(“lgdtl %0″ : : “m” (gdt)); ‘ but my understanding about assembler embeded in C is very poor… My “theory” is “gdt.ptr store, in this point, the physical adress of the array boot_gdt” ).
Yes, CR3 MUST store a physical address (btw, it happen in any architecture supporting paging; the pointer to the table page must be a physical pointer). If it not, well… that’s don’t work.

And yes, I read your great article about booting-up; actually I am basing in it for my asummption(s)!

Saludos, and thanks for your replies.
P.S : Again, my English surely is not correct (I make my best effort…). Please, if you believe necessary, correct my words. English readers and I will are grateful with you
18.Anatomy of a Program in Memory : Gustavo Duarte on January 27th, 2009 9:28 am
[...] on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux [...]
19.McGrew Security Blog » Blog Archive » Gustavo Duarte’s Great Internals Series on January 27th, 2009 3:23 pm
[...] Memory Translation and Segmentation [...]
20.Quick Note on Diagrams and the Blog : Gustavo Duarte on January 28th, 2009 6:21 pm
[...] colors hold from the earliest post about memory to the latest. This convention is why the post about Intel CPU caches shows a blue [...]
21.travis on February 6th, 2009 3:08 am
Great posts…

I went to where the gdt_page is instantiated (http://lxr.linux.no/linux+v2.6.25.6/arch/x86/kernel/cpu/common.c#L24)

It has the following code:

[GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00cffa00 } } }

Do you know what that means?
22.Gustavo Duarte on February 6th, 2009 11:35 pm
@travis:

This line is building the 8-byte segment descriptor for the user code segment. To really follow it, there are 3 things you must bear in mind:

1. The x86 is little endian, meaning that for multi-byte data types (say, 32-bit or 64-bit integers), the significance of bytes grows with memory address. If you declare a 32-bit integer as 0xdeadbeef, then it would be laid out in memory like this (in hex, assuming memory addresses are growing to the right):

ef be ad de
lower => higher

2. In array declarations, or in this case a struct declaration, earlier elements go into lower memory addresses.

3. The convention for Intel diagrams is to draw things with HIGHER memory addresses on the LEFT and on TOP. This is a bit counter intuitive, but I followed it to be consistent with Intel docs.

When you put this all together, the declaration above will translate into the following bytes in memory, using Intel’s ‘reversed’ notation:

(higher)

00 cf fa 00
00 00 ff ff

(lower)

If you compare these values against the segment descriptor diagram above, you’ll see that: the ‘base’ fields are all zero, so the segment starts at 0x00000000; the limit is 0xfffff which, with the granularity flag set so that it counts 4K units, covers 4GB; and bits 47-40 are 11111010, so the DPL is 3 for ring 3.

If you look into the Intel docs, they describe the fields I left grayed out. Hope this helps!
23.travis on February 7th, 2009 1:27 am
Awesome! Thanks for the very clear explanation.

Do you know of a Linux forum that is open to these type of detailed questions? Sometimes it’s very difficult to find answers using google, and “Understanding the Linux Kernel” doesn’t cover some of the things that confuse me.
24.Gustavo Duarte on February 7th, 2009 9:57 am
@travis: I don’t ;( There used to be a ‘kernel janitors’ project to get people to do simple patches for the kernel, and a ‘kernel newbies’ to try to teach kernel basics. But I’m not sure where they are. I don’t use forums much, so there might be something good out there. If you find anything, I’d like to hear about it.

I also thought about installing some forum software on my server so people could talk about this stuff. However, I’m afraid of spending gobs of time there. I’m pretty strict about not getting into stuff that takes too much time, which is why I don’t touch Twitter : P
25.Ya-tou & me » Blog Archive » CPU Rings, Privilege, and Protection on February 19th, 2009 1:46 am
[...] let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they [...]
26.Raúl on April 12th, 2009 9:06 pm
Gustavo, I don’t know how to thank you. Your articles are beautiful and very well explained. Please continue writing. Why don’t you write a book?. You are one of the best teachers I’ve ever found. You save me hours trying to find information and hard-studying. Sincerily, thank you very much.
27.Gustavo Duarte on April 14th, 2009 1:44 pm
@Raul: wow, thank you very much for your comment. It’s great to hear from people who have learned from or have been helped by this material. It’s the best incentive.

Regarding the book, stay tuned
28.内存剖析 « Rock2012’s Blog on May 3rd, 2009 4:26 am
[...] on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux [...]
29.Joel on May 11th, 2009 6:47 am
Thanks for such a beautiful article, articulated way beyond expression. I hope you write a book some day.

One of the things that confused me though was: Your gdt had ‘limit’ but you hadn’t mentioned that there was a granularity flag that multiplied it by 4k when set. Later on you go on to mention that in modern kernels flat model, each descriptor describes a segment of upto 4GB in size (32 bits) but the gdt ‘limit’ being only 20 bits made me wonder how.

Thanks
30.Rahul on July 13th, 2009 1:49 pm
Hi Gustavo,

In terms of making technical concepts clear, your posts are the best I have seen.

One question, can you please explain as to what purpose is served by the LDT ? Does any real OS ever really uses the LDT.

Thanks
Rahul
31.Krish on August 4th, 2009 1:46 pm
Hi Gustavo,

Let me begin by thanking you for a wonderful article.

In your article (the diagram for “Protected mode segmentation”), the logical address is the same as the linear address because the base is 0 in a flat model.

My understanding is that the linear address is used to decode the physical address of the page directory table and the page table thereafter to finally get the physical page value containing the segment.

There is a good chance that another process might also generate the same logical address; and with the base 0, will generate the same linear address. Does this mean that it will eventually point to the same page table entry?

Who decides which page table is assigned to which process segment? How is the segment selector value assigned (populated in the segment registers) to the process?

Thanking you in anticipation.

Krishnan.
32.Darshan on February 1st, 2010 11:37 pm
Hi Gustavo,

Thank you for giving such a informative article. I learnt many concepts from this.. thank you brother!!

Darshan.
33.Jon on February 19th, 2010 5:06 pm
Thanks for these articles, I feel I have come a bit late to the party, having only just found them, and the few I have read do far have been the clearest of anything I have read!

Just one thought though, I see you mentionerd that you had thought of a forum?

Perhaps you are right, that it might take up too much time, but I feel that there might be a better format than a blog, now that there is so much content.

I dunno what form would be best however =)

I want to read all of them, but would really appreciate a way of jumping between them, to re-read/cross reference and quickly find a specific topic.

But saying that you seem to have a talent of comunicating these tough subjects clearly and with a good deal of humour (needed with the “dryness” of the subject matter!) pls pls pls keep them coming.

jon
34.saurin on July 22nd, 2010 12:43 pm
Very good explanation. Thanks
Saurin
35.Ishan on October 1st, 2010 4:44 am
Can you explain the difference between logical address and virtual address?
Thanks

CPU Rings, Privilege, and Protection

You probably know intuitively that applications have limited powers in Intel x86 computers and that only operating system code can perform certain tasks, but do you know how this really works? This post takes a look at x86 privilege levels, the mechanism whereby the OS and CPU conspire to restrict what user-mode programs can do. There are four privilege levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being protected: memory, I/O ports, and the ability to execute certain machine instructions. At any given time, an x86 CPU is running in a specific privilege level, which determines what code can and cannot do. These privilege levels are often described as protection rings, with the innermost ring corresponding to highest privilege. Most modern x86 kernels use only two privilege levels, 0 and 3:
x86 Protection Rings
About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. Many others have limitations on their operands. These instructions can subvert the protection mechanism or otherwise foment chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring zero causes a general-protection exception, like when a program uses invalid memory addresses. Likewise, access to memory and I/O ports is restricted based on privilege level. But before we look at protection mechanisms, let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they are:
Segment Selector - Data and Code
The full contents of data segment selectors are loaded directly by code into various segment registers such as ss (stack segment register) and ds (data segment register). This includes the contents of the Requested Privilege Level (RPL) field, whose meaning we tackle in a bit. The code segment register (cs) is, however, magical. First, its contents cannot be set directly by load instructions such as mov, but rather only by instructions that alter the flow of program execution, like call. Second, and importantly for us, instead of an RPL field that can be set by code, cs has a Current Privilege Level (CPL) field maintained by the CPU itself. This 2-bit CPL field in the code segment register is always equal to the CPU’s current privilege level. The Intel docs wobble a little on this fact, and sometimes online documents confuse the issue, but that’s the hard and fast rule. At any time, no matter what’s going on in the CPU, a look at the CPL in cs will tell you the privilege level code is running with.

Keep in mind that the CPU privilege level has nothing to do with operating system users. Whether you’re root, Administrator, guest, or a regular user, it does not matter. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user mode, for example user-mode device drivers in Windows Vista, but these are just special processes doing a job for the kernel and can usually be killed without major consequences.

Due to restricted access to memory and I/O ports, user mode can do almost nothing to the outside world without calling on the kernel. It can’t open files, send network packets, print to the screen, or allocate memory. User processes run in a severely limited sandbox set up by the gods of ring zero. That’s why it’s impossible, by design, for a process to leak memory beyond its existence or leave open files after it exits. All of the data structures that control such things – memory, open files, etc – cannot be touched directly by user code; once a process finishes, the sandbox is torn down by the kernel. That’s why our servers can have 600 days of uptime – as long as the hardware and the kernel don’t crap out, stuff can run for ever. This is also why Windows 95 / 98 crashed so much: it’s not because “M$ sucks” but because important data structures were left accessible to user mode for compatibility reasons. It was probably a good trade-off at the time, albeit at high cost.

The CPU protects memory at two crucial points: when a segment selector is loaded and when a page of memory is accessed with a linear address. Protection thus mirrors memory address translation where both segmentation and paging are involved. When a data segment selector is being loaded, the check below takes place:
Segment Protection
Since a higher number means less privilege, MAX() above picks the least privileged of CPL and RPL, and compares it to the descriptor privilege level (DPL). If the DPL is higher or equal, then access is allowed. The idea behind RPL is to allow kernel code to load a segment using lowered privilege. For example, you could use an RPL of 3 to ensure that a given operation uses segments accessible to user-mode. The exception is for the stack segment register ss, for which the three of CPL, RPL, and DPL must match exactly.
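The check described above fits in a couple of lines. This is only a sketch of the rule as stated in the text, with hypothetical function names; real hardware also looks at the descriptor type, which is ignored here.

```python
def data_segment_load_allowed(cpl: int, rpl: int, dpl: int) -> bool:
    """Data segment rule: the least privileged of CPL and RPL must not exceed DPL."""
    return max(cpl, rpl) <= dpl


def stack_segment_load_allowed(cpl: int, rpl: int, dpl: int) -> bool:
    """Stack segment rule: all three privilege levels must match exactly."""
    return cpl == rpl == dpl


# Ring 3 code trying to load a ring 0 data segment is refused
print(data_segment_load_allowed(cpl=3, rpl=3, dpl=0))
# Ring 0 code may load a ring 3 data segment
print(data_segment_load_allowed(cpl=0, rpl=0, dpl=3))
# Ring 0 code that lowers its privilege with RPL 3 loses access to ring 0 data
print(data_segment_load_allowed(cpl=0, rpl=3, dpl=0))
```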

In truth, segment protection scarcely matters because modern kernels use a flat address space where the user-mode segments can reach the entire linear address space. Useful memory protection is done in the paging unit when a linear address is converted into a physical address. Each memory page is a block of bytes described by a page table entry containing two fields related to protection: a supervisor flag and a read/write flag. The supervisor flag is the primary x86 memory protection mechanism used by kernels. When it is on, the page cannot be accessed from ring 3. While the read/write flag isn’t as important for enforcing privilege, it’s still useful. When a process is loaded, pages storing binary images (code) are marked as read only, thereby catching some pointer errors if a program attempts to write to these pages. This flag is also used to implement copy on write when a process is forked in Unix. Upon forking, the parent’s pages are marked read only and shared with the forked child. If either process attempts to write to the page, the processor triggers a fault and the kernel knows to duplicate the page and mark it read/write for the writing process.
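Here is a rough sketch of the page-level check. The bit positions are the ones in a 32-bit x86 page table entry, where the hardware bit is set for user-accessible pages (so a "supervisor" page is one with that bit clear). The function name is hypothetical, and the `write_protect` parameter models the CR0.WP setting, which the text above does not go into.

```python
PTE_PRESENT = 1 << 0  # page is mapped
PTE_RW = 1 << 1       # set: writable, clear: read only
PTE_USER = 1 << 2     # set: ring 3 may access, clear: supervisor only


def page_access_allowed(pte: int, cpl: int, write: bool,
                        write_protect: bool = True) -> bool:
    """Return True if the access goes through, False if it would page-fault."""
    if not pte & PTE_PRESENT:
        return False
    user_mode = cpl == 3
    if user_mode and not pte & PTE_USER:
        return False  # supervisor page touched from ring 3
    if write and not pte & PTE_RW:
        # Ring 3 can never write a read-only page; rings 0-2 can only
        # when write protection is switched off.
        return not user_mode and not write_protect
    return True


kernel_page = PTE_PRESENT | PTE_RW
code_page = PTE_PRESENT | PTE_USER  # read-only binary image
print(page_access_allowed(kernel_page, cpl=3, write=False))  # ring 3 read of kernel page
print(page_access_allowed(code_page, cpl=3, write=True))     # ring 3 write to code page
```

The copy-on-write trick uses the second case: the fault on a write to a shared read-only page is the kernel's cue to duplicate it.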

Finally, we need a way for the CPU to switch between privilege levels. If ring 3 code could transfer control to arbitrary spots in the kernel, it would be easy to subvert the operating system by jumping into the wrong (right?) places. A controlled transfer is necessary. This is accomplished via gate descriptors and via the sysenter instruction. A gate descriptor is a segment descriptor of type system, and comes in four sub-types: call-gate descriptor, interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide a kernel entry point that can be used with ordinary call and jmp instructions, but they aren’t used much so I’ll ignore them. Task gates aren’t so hot either (in Linux, they are only used in double faults, which are caused by either kernel or hardware problems).

That leaves two juicier ones: interrupt and trap gates, which are used to handle hardware interrupts (e.g., keyboard, timer, disks) and exceptions (e.g., page faults, divide by zero). I’ll refer to both as an “interrupt”. These gate descriptors are stored in the Interrupt Descriptor Table (IDT). Each interrupt is assigned a number between 0 and 255 called a vector, which the processor uses as an index into the IDT when figuring out which gate descriptor to use when handling the interrupt. Interrupt and trap gates are nearly identical. Their format is shown below along with the privilege checks enforced when an interrupt happens. I filled in some values for the Linux kernel to make things concrete.
Interrupt Descriptor with Privilege Check
Both the DPL and the segment selector in the gate regulate access, while segment selector plus offset together nail down an entry point for the interrupt handler code. Kernels normally use the segment selector for the kernel code segment in these gate descriptors. An interrupt can never transfer control from a more-privileged to a less-privileged ring. Privilege must either stay the same (when the kernel itself is interrupted) or be elevated (when user-mode code is interrupted). In either case, the resulting CPL will be equal to the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If an interrupt is triggered by code via an instruction like int n, one more check takes place: the gate DPL must be at the same or lower privilege as the CPL. This prevents user code from triggering random interrupts. If these checks fail – you guessed it – a general-protection exception happens. All Linux interrupt handlers end up running in ring zero.
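The gate checks can be sketched as a small function. The names are hypothetical and the model only covers the privilege rules described above, not the rest of interrupt delivery.

```python
class GeneralProtection(Exception):
    """Stands in for the CPU's general-protection exception."""


def deliver_interrupt(cpl: int, gate_dpl: int, handler_segment_dpl: int,
                      software: bool) -> tuple[int, bool]:
    """Return (new CPL, whether a stack switch occurs) or raise GeneralProtection."""
    # Extra check for "int n": the gate must be reachable from the current ring.
    if software and cpl > gate_dpl:
        raise GeneralProtection("gate not accessible from this ring")
    # An interrupt never drops to a less-privileged ring.
    if handler_segment_dpl > cpl:
        raise GeneralProtection("cannot enter a less-privileged ring")
    new_cpl = handler_segment_dpl
    return new_cpl, new_cpl != cpl


# User code executing "int 0x80" through a gate with DPL 3
print(deliver_interrupt(cpl=3, gate_dpl=3, handler_segment_dpl=0, software=True))
# A hardware interrupt arriving while the kernel is already running
print(deliver_interrupt(cpl=0, gate_dpl=0, handler_segment_dpl=0, software=False))
```

User code that tries `int n` on a gate with DPL 0 takes the first branch and gets the exception instead of the handler.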

During initialization, the Linux kernel first sets up an IDT in setup_idt() that ignores all interrupts. It then uses functions in include/asm-x86/desc.h to flesh out common IDT entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with “system” in its name is accessible from user mode and its set function uses a DPL of 3. A “system gate” is an Intel trap gate accessible to user mode. Otherwise, the terminology matches up. Hardware interrupt gates are not set here however, but instead in the appropriate drivers.

Three gates are accessible to user mode: vectors 3 and 4 are used for debugging and checking for numeric overflows, respectively. Then a system gate is set up for the SYSCALL_VECTOR, which is 0x80 for the x86 architecture. This was the mechanism for a process to transfer control to the kernel, to make a system call, and back in the day I applied for an “int 0x80” vanity license plate. Starting with the Pentium II, the sysenter instruction was introduced as a faster way to make system calls. It relies on special-purpose CPU registers that store the code segment, entry point, and other tidbits for the kernel system call handler. When sysenter is executed the CPU does no privilege checking, going immediately into CPL 0 and loading new values into the registers for code and stack (cs, eip, ss, and esp). Only ring zero can load the sysenter setup registers, which is done in enable_sep_cpu().
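As a rough model of what sysenter does with those special-purpose registers, the sketch below loads cs, ss, eip, and esp from three values standing in for the IA32_SYSENTER_CS, IA32_SYSENTER_ESP, and IA32_SYSENTER_EIP registers. The class and function names are invented for illustration, and the sample addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SysenterMsrs:
    cs: int   # stands in for IA32_SYSENTER_CS
    esp: int  # stands in for IA32_SYSENTER_ESP
    eip: int  # stands in for IA32_SYSENTER_EIP


def sysenter(msrs: SysenterMsrs) -> dict:
    """Model the register state right after sysenter: no checks, straight to CPL 0."""
    cs = msrs.cs & 0xFFFC  # low two bits cleared, so the new CPL is 0
    return {
        "cs": cs,
        "ss": cs + 8,      # the stack segment is the next descriptor after cs
        "eip": msrs.eip,
        "esp": msrs.esp,
        "cpl": cs & 0b11,
    }


# Hypothetical values a kernel might have loaded at boot
state = sysenter(SysenterMsrs(cs=0x60, esp=0xC1234000, eip=0xC0103000))
print({name: hex(value) for name, value in state.items()})
```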

Finally, when it’s time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls, respectively, thus leaving ring 0 and resuming execution of user code with a CPL of 3. Vim tells me I’m approaching 1,900 words, so I/O port protection is for another day. This concludes our tour of x86 rings and protection. Thanks for reading!

Comments
59 Responses to “CPU Rings, Privilege, and Protection”

1.Amjith on August 20th, 2008 2:58 pm
Wow! Amazing articles Gustavo. I have a question regarding the Memory Translation article. You had mentioned that modern x86 kernels use the “flat model” without any segmentation. But won’t that restrict the size of the addressable memory to ~4GB? But I’ve seen computers with more than 4GB installed, how does that work? Or is it a restriction per program rather than the total memory?
2.Gustavo Duarte on August 21st, 2008 4:51 am
Thanks!

The segments only affect the translation of “logical” addresses into “linear” addresses. Flat model means that these addresses coincide, so we can basically ignore segmentation. But all of the linear addresses are still fed to the paging unit (assuming paging is turned on), so there’s more magic that happens there to transform a linear address into a physical address.

Check the first diagram of the previous post, it should make it more clear. Flat model eliminates the first step (logical to linear), but the last step remains and enables addressing of more than 4GB.

Now, regarding maximum memory, there are three issues: the size of the linear address space, the conversion of a linear address into a physical address, and the physical address pins in the processor so it can talk to the bus.

When the CPU is in 32-bit mode, the linear address space is _always_ 32-bits and is therefore limited to 4 GB. However, the physical pins coming out of CPU can address up to 64GB of RAM on the bus since the Pentium Pro.

So the trouble is all in the translation of linear addresses into physical addresses. When the CPU is running in “traditional” mode, the page tables that transform a linear address into a physical one only work with 32-bit physical addresses, so the computer is confined to 4 GB total RAM.

But then Intel introduced the Physical Address Extension (PAE) to enable servers to use more physical memory. This changes the MECHANISM of translation of a linear address into a physical address. It works by changing the format of the page tables, allowing more bits for the physical address. So at that point, more than 4 GB of physical memory CAN be used.

The problem is that processes are still confined to a 32-bit linear address space. So if you have a database server that wants to address 12 gigs, say, it will have to map different regions of physical memory at a time. It only has a 4 gig linear window into the 12 gigs of physical ram.
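The sizes involved are plain powers of two. A quick sketch of the arithmetic, with a hypothetical helper name, taking 36 bits as the physical address width that gives the 64 GB figure above:

```python
GIB = 2 ** 30


def addressable_gib(address_bits: int) -> int:
    """How many GiB can be reached with the given number of address bits."""
    return 2 ** address_bits // GIB


print(addressable_gib(32))  # linear address space of a 32-bit process
print(addressable_gib(36))  # physical address space with 36 address bits
```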

Did that make sense?
3.Alex Railean on August 21st, 2008 8:01 am
Hi, this is a nice article, thank you. I also read the other stuff I found on this site and I really like your writing style (very user friendly, you could be a teacher) and the topics you cover. I’ve subscribed to the RSS feed and am looking forward to your new articles.

I’m really impressed.

I have a question – which tool do you use to draw those awesome diagrams? (I hope not Visio)
4.Gustavo Duarte on August 21st, 2008 11:54 am
@Alex: thanks!

Unfortunately, it is Visio 2007. I haven’t tried the open office counterpart. If anyone knows of a FOSS alternative that can produce good diagrams, I’d love to hear about it. I’d like to publish some of this stuff as diagrams for people to reuse (especially network packets), so it’d be cool to have an open platform.
5.Joey on August 21st, 2008 11:59 am
I always thought each user process in Linux had its OWN segment and 4 gigs of virtual address space. I kind of see now that each process does not have its own segment, but if that is true how does the kernel let each process have its own 4 gigs worth of virtual space?
6.Gustavo Duarte on August 21st, 2008 12:27 pm
Having the same segment only affects the translation of logical addresses into linear addresses – and flat mode makes them the same. Most CPUs don’t even have a distinction between “logical” and “linear” – they only deal with linear and physical addresses. It’s an accident of history that x86 ended up with this logical/linear distinction, and x86-64 basically gets rid of it.

But each process still has a different set of _page tables_ mapping its 32-bit linear address space into physical addresses. So the way you thought is actually accurate, it’s just the terminology that was fuzzy. Processes have their own _page tables_, not segment.
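A toy illustration of that point: below, two processes use the same linear address but land on different physical memory because each has its own mapping. A flat dictionary stands in for the real multi-level page tables, and the page and frame numbers are invented.

```python
PAGE_SIZE = 4096


def translate(page_table: dict[int, int], linear: int) -> int:
    """Map a linear address to a physical one via a per-process page table."""
    page, offset = divmod(linear, PAGE_SIZE)
    frame = page_table[page]  # a missing key would be a page fault
    return frame * PAGE_SIZE + offset


# Hypothetical mappings: same linear page, different physical frames
editor_pages = {0x08048: 0x1A2B3}
shell_pages = {0x08048: 0x0C0DE}

address = 0x08048123
print(hex(translate(editor_pages, address)))
print(hex(translate(shell_pages, address)))
```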
7.Amjith on August 22nd, 2008 8:59 am
Hey Gustavo, thanks for taking the time to clarify my question. Now that brings up another sub-question. If we want to use the PAE, we need changes in the kernel code, right? Is that why we have server versions of Windows? A little bit of googling provides insight into how Linux handles this. I believe they enable PAE by default these days, is my understanding correct?
8.Gustavo Duarte on August 22nd, 2008 6:47 pm
@Amjith: that’s right, the kernel needs to do most of the work, since it’s the one responsible for building the page tables for the processes.

Also, if a single process wants to use more than 4GB, then the process _also_ must be aware of this stuff, because it needs to make system calls into the kernel saying “hey, I want to map physical range 5GB to 6GB in my 2GB-3GB linear range”, or “map 10GB-11GB now”, and so on. (Of course, there would be security checks. Also, these are some nice round numbers, usually it’d probably be done in chunks of X KB of memory, depends on the application).

Regarding Windows, that’s an interesting point. It’s too strong to say that PAE _is_ why we have server versions of Windows, but it’s definitely something Microsoft has used extensively for price discrimination. Not only on the Windows kernel, but also apps like SQL Server have pricier editions that support PAE. The kernel for the server editions of Windows has other tweaks as well though, in the algorithms for process scheduling and also memory allocation. But PAE has definitely been one carrot (or stick?) to get some more money.

Linux has had PAE support since the 2.4 series (it first appeared in the 2.3 development kernels). To use it one must enable it at kernel compile time. I’m not sure if it’s enabled in the kernels that ship with the various distros. I’ve never looked much into the kernel PAE code to be honest, so I’m ignorant here. My understanding though is that if it’s enabled, it’s once and for all in the machine, for all CPUs and processes.
10.Manav on August 25th, 2008 11:02 pm
Nice article. Just wondering, which software do you use to create the images?
11.Gustavo Duarte on August 25th, 2008 11:03 pm
@Manav: thanks. I use MS Visio 2007.
12.kernel_daemon on August 26th, 2008 12:22 am
I really like your articles. Keep on writing
13.amjith on August 26th, 2008 12:55 am
Hey Gustavo,
Since you asked for an alternative for Visio you can give Dia a shot. “Dia is roughly inspired by the commercial Windows program ‘Visio’, though more geared towards informal diagrams for casual use.” That is a quote from their website.
14.Aditya Nag on August 26th, 2008 7:50 am
Great article. Though I understood very little, I can appreciate that the subject has been lucidly explained. Do keep on writing, I for one am going to be dropping by often.
15.Gokdeniz Karadag on August 26th, 2008 9:05 am
Great series of articles.

I learned lots of intricate parts of the x86 arch that I did not learn at school.

Keep up the good work!
16.D'oh on August 27th, 2008 1:56 pm
What happened to ring 1 and 2?
17.Gustavo Duarte on August 28th, 2008 1:07 am
Thank you all for the feedback

@Aditya: is there a specific part where it “trails off” and stops making sense? Anything I can do, any definitions to introduce, that might clear it up? I’m really interested in learning how to improve my writing to make it more understandable, so I appreciate feedback on this.

@D’oh: I’ll quote from an excellent comment on osnews by siride:

Because other platforms only had two modes, so OSes intended to be used cross-platform preferred to use only two of the four rings. Also, some parts of the IA-32 architecture don’t distinguish 4 rings, but only two modes: system and user, where system means rings 0, 1 and 2, and user means ring 3. Page protection is like this, for example. And finally, adding extra rings just adds extra complexity that could probably be dealt with by using more comprehensive security methodologies, which is currently the case. In much the same way that it’s better to avoid using x86 segmentation or TSS’s for task switching in favor of a software solution that is portable and can be fine-tuned to the needs of the OS in general, or even at a particular point in time (such as under heavy load, etc.).
18.Raminder on August 28th, 2008 6:05 am
Great article as always Gustavo, keep up the good work. You’ve helped clear so many cobwebs in my head. Thanks!
20.Marcelo Gomez on August 30th, 2008 10:21 am
Great article Gustavo! Like the others. Could you mention good resources and bibliography? Keep on writing:)
21.Gustavo Duarte on August 31st, 2008 11:39 pm
Thank you all for the feedback.

@Marcelo: Check out the suggestions at the end of my posts about motherboard chipsets and also the kernel boot process, they link to a few good resources. I really like the Intel documents and those kernel books.

Also, if you are particularly interested in low-level security, the book Subverting the Windows Kernel is a great read, focused on Windows.
22.Ken on September 9th, 2008 9:36 am
Nice Article,
I found this researching the extensions Intel made to support virtualization. If you have any insight into that I think it would make a great article.
23.lallous on September 15th, 2008 9:53 am
Hello Gustavo,

How can one transfer from Ring 0 to Ring 3 ? Or more generalized from a more privileged to a less privileged mode?

Thanks,
lallous
24.Anand Thakur on September 27th, 2008 3:23 am
Hi Gustavo Duarte,
All your articles are extremely great:). I am enjoying your articles.

I have a question about your reply to the comment “Amjith on August 22nd, 2008 8:59 am”. In that reply you mentioned that using PAE requires some changes in the kernel code.
Please correct me if I am wrong.
My thinking is as follows:
Whenever we want to use PAE, we need to write kernel code for:
1. enabling the PAE bit in the control register
2. security checks at the kernel code level
And whenever an application wants to map a physical range, let’s say 5GB to 6GB, into its 2GB-3GB linear range, this request will be handled by the hardware itself (note: this contradicts your reply). The kernel doesn’t have to handle it.
To handle the above request, the hardware will do the following:
1. First check whether PAE is enabled
2. If PAE is not enabled, a trap happens
3. If PAE is enabled, complete the request.
All of the above is just my intuition.
Please correct me if I am wrong.
Thanks
26.baozhao on October 12th, 2008 6:28 am
“First, its contents cannot be set directly by load instructions such as mov, ” is probably wrong. The following is excerpted from the INTEL 80386 PROGRAMMER’S REFERENCE MANUAL, 1986.

3.10 Segment Register Instructions
This category actually includes several distinct types of instructions.
These various types are grouped together here because, if systems designers
choose an unsegmented model of memory organization, none of these
instructions is used by applications programmers. The instructions that deal
with segment registers are:
1. Segment-register transfer instructions.
MOV SegReg, …
MOV …, SegReg
PUSH SegReg
POP SegReg
2. Control transfers to another executable segment.
JMP far ; direct and indirect
CALL far
RET far
3. Data pointer instructions.
LDS
LES
LFS
LGS
LSS
27.baozhao on October 12th, 2008 9:21 am
sorry,you’re right. It’s my fault
28.Anand on October 19th, 2008 2:47 am
Hi baozhao ,
Sorry, I did not get you. Are you replying to my question?
If yes, please write a clearer reply (because I don’t know much about x86 assembly programming)…
Thanks for your reply… Please reply to my question ASAP.

Thank You
Anand
30.Robert on November 4th, 2008 4:44 am
Hi Gustavo,

“When sysenter is executed the CPU does no privilege checking, going immediately into CPL 0”

I was wondering what exactly happens when SYSENTER is called. The code control will be transferred to SegSelector:Offset pointed by vector 0x80 in the IDT is it? Usually what does that code do….?

How will the usermode application then differentiate the various calls made to the kernel mode, eg: a call to get a file; to access a port etc.

How will the return values from the kernel be passed back to the usermode process then?

Thanks and great article btw… Appreciate if you could furnish more details
31.Timur Izhbulatov on November 15th, 2008 2:46 pm
Thanks for the article Gustavo! It’s really helpful.

But there’s still a question I can’t answer. Why do people often talk about “a process in kernel/user mode”, while this is actually the CPU that gets switched from one mode to another?

Thanks,
Timur
32.Bruno on November 17th, 2008 8:59 am
@lallous: Did you read the blog entry before replying?
“Finally, when it’s time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls (…)”

@Anand: If you use PAE, the kernel _must_ be fully aware, because the kernel must tell the CPU that virtual address x is physical address y; this is done in the page tables.
Without PAE the page table has the following format:
– 32bits page address entry
– 2 levels deep
With PAE the page table has the following format:
– 64bits page address entry
– 3 levels deep
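To make the two formats concrete, here is a sketch of how a 32-bit linear address is carved into table indexes in each mode, assuming 4 KB pages. The function names and the sample address are made up for illustration.

```python
def split_classic(linear: int) -> dict:
    """2-level paging: 10-bit directory index, 10-bit table index, 12-bit offset."""
    return {
        "directory": linear >> 22,
        "table": (linear >> 12) & 0x3FF,
        "offset": linear & 0xFFF,
    }


def split_pae(linear: int) -> dict:
    """PAE paging: 2-bit pointer-table index, two 9-bit indexes, 12-bit offset."""
    return {
        "pointer_table": linear >> 30,
        "directory": (linear >> 21) & 0x1FF,
        "table": (linear >> 12) & 0x1FF,
        "offset": linear & 0xFFF,
    }


address = 0xC0101234  # arbitrary example address
print(split_classic(address))
print(split_pae(address))
```

The indexes shrink from 10 to 9 bits under PAE because each entry doubles in size, so a 4 KB table holds 512 entries instead of 1024; the extra top level makes up for the lost bits.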

There’s another thing to be aware of: the virtual address space is split in two halves. The lower half is the per-process address space and the upper half is the kernel address space. Effectively, user mode only has 2GB (or 8EB in 64 bits) of address space. Now you should see the importance of 64 bits: the kernel’s 2GB space is getting _very_ small (those who say that 32 bits is enough for desktop PCs are ignoring the kernel).

@Robert: The sysenter instruction (on x86-32) jumps to the address specified in the IA32_SYSENTER_EIP machine-specific register. You pass the syscall number in the eax register (Linux and NT) and get the return code in eax. Arguments are passed (by value or by reference) in registers and on the stack; you also have to pass a pointer to the user stack in a register, since the stack will be swapped for the kernel stack on sysenter. Once sysenter jumps to the kernel handler, arguments passed on the user stack are copied to the kernel stack, and the handler jumps to the function in the system call table (sys_xxx on Linux, NtXxx on NT) using the index provided by user mode (after it has been validated, of course).
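The table lookup at the end can be sketched like this. Everything here is a stand-in: the two handlers, their numbering, and the `dispatch` function are hypothetical, and a Python list plays the role of the system call table.

```python
import errno
import os


def sys_getpid() -> int:
    """Stand-in handler: return the caller's process id."""
    return os.getpid()


def sys_add(a: int, b: int) -> int:
    """Made-up handler, only here to show argument passing."""
    return a + b


# Hypothetical numbering; real kernels define their own tables
SYSCALL_TABLE = [sys_getpid, sys_add]


def dispatch(eax: int, *args: int) -> int:
    """Validate the number supplied by user mode, then call through the table."""
    if not 0 <= eax < len(SYSCALL_TABLE):
        return -errno.ENOSYS  # unknown system call
    return SYSCALL_TABLE[eax](*args)  # the result goes back to the caller in eax


print(dispatch(1, 2, 3))
print(dispatch(99) == -errno.ENOSYS)
```

Validating the index before the call is the important part, since the number comes straight from untrusted ring 3 code.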
33.Berny G on December 1st, 2008 2:15 pm
Gustavo, I greatly enjoy reading your articles. Thank you.

BTW: Enjoyed the pertinent pic of Ken and Dennis and their/your inference to Ring 0.

Keep the articles coming
34.Tej parkash on December 5th, 2008 2:44 am
Nice Article.
It would have been nice if you give some example e.g vi test.c and ./test
35.Gustavo Duarte on December 5th, 2008 10:17 am
Thanks for the feedback.

@Berny: You’re the first to comment on the pic haha, it was my favorite part of the post

@Tej: that’s an interesting idea. What would you show though? Some stepping through C code in a user mode-to-kernel transition? Or the assembly as you transition into the kernel?
36.Gustavo Duarte on December 5th, 2008 10:26 am
@Timur:

Sorry for the delay.

that’s a great question. I actually hope to write a post exactly about these transitions between user mode and kernel mode.

You’re absolutely correct that it is the CPU that changes between modes (ring 0 and ring 3).

But we say ‘process’ because it is the process on whose behalf the CPU is running. So for example, you are running your text editor. It is in user mode (ring 3) say, while it is doing some formatting on the text for you. Then you tell it to SAVE the file. Since the text editor needs to rely on the kernel to do that, it makes a _system call_ (“hey kernel, write THIS stuff to that file”).

As part of that system call, there is a transition to kernel mode and ring 0. The code that actually performs the transition is part of the C library that underlies the system call (glibc in Linux, and DLLs in Windows).

But now the KERNEL starts to execute, in ring 0. But it is ON BEHALF of your text editor process, so we say that your process is now in kernel mode.

Does that make sense?
37.Timur Izhbulatov on December 5th, 2008 11:57 am
@Gustavo: Thank you so much for your reply!

Indeed this is a very interesting question. I hope to see your new article about processes!

I would like to share some of my thoughts. Please correct me if I’m wrong.

I think we need to further clarify what a process is. As I understand, a process is a running program. That is, a binary image (instructions and data) loaded by the kernel into main memory for execution. OTOH, there are internal data structures that represent running processes in the kernel and a whole subsystem to manage them.

So, from user’s point of view, a system call looks like a function call, but which is translated into a sequence of CPU instructions that set registers and issue an interrupt. Everything beyond this point is hidden from user, but essentially the control is passed over to the kernel interrupt handler. That is, the CPU starts executing some other instructions which are not part of the initially loaded image…

This leads me to the point that my initial statement was not completely correct. Seems that it only explains how a process is viewed from the user space. Which is actually… well… an illusion created by the kernel…

Apparently, there is some level of indirection. The kernel routine executing on behalf of my process does NOT belong to the latter. I assume that while the routine is advancing on the CPU, the process itself is “waiting” (its state is saved by the kernel) until the system call returns. I can imagine a situation when there are several processes waiting for some I/O to finish but can we say they are all in kernel mode while another process’ instructions are being executed by the CPU?..

Looks like I gotta get a copy of Understanding the Linux Kernel or a similar book

Again, thanks for such a fascinating article and discussion!
38.Gustavo Duarte on December 6th, 2008 3:31 am
@Timur:

Basically everything you have said is correct. I’ll only try to clarify some of the points you expressed doubts about.

First, regarding memory. This _definitely_ needs a post, but I’ll do a quick explanation for now. Let’s assume the processor is running in 32-bit mode to keep everything easy.

Each process has 4 gigabytes of memory that it can access (because the processor is using 32 bits to address memory). The kernel sets up a virtual memory space for the process to run in.

Here’s one catch: the first 3 gigabytes actually contain the PROCESS (its executable code, stack, allocated memory, etc). The final gigabyte belongs to the KERNEL and is full of kernel code and data. So after the interrupt (or SYSENTER instruction) the kernel code mapped in this fourth (and last) gigabyte starts running. Meanwhile, the lower 3 gigabytes still contain the process that jumped into kernel mode.

Another interesting thing: 4th gigabyte is COMMON across all processes. So the kernel is always ‘resident’ so to speak, but the mapping of the 3 lower gigabytes keeps changing with changing processes.
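In code, the split boils down to comparing an address against a single boundary. This sketch assumes the 3 GB / 1 GB layout described above, with the boundary at 0xC0000000; the constant and function names are chosen for illustration.

```python
KERNEL_START = 0xC0000000  # 3 GB boundary, assuming the layout described above


def region(linear: int) -> str:
    """Say which part of the 32-bit address space a linear address falls in."""
    if not 0 <= linear < 2 ** 32:
        raise ValueError("not a 32-bit address")
    return "kernel" if linear >= KERNEL_START else "user"


print(region(0x08048000))  # typical spot for a program's code
print(region(0xC0100000))  # inside the shared top gigabyte
```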

Regarding your example of several processes waiting for some I/O. Yes, it IS correct to say that they are all in kernel mode. They are also _sleeping_ though, waiting for that I/O. When the I/O completes, the kernel will scan a data structure that stores the processes that are sleeping on it. It will then wake up each process that was sleeping in kernel mode.

When the process wakes up, it’ll resume execution in kernel mode, exactly where it went to sleep. Then the I/O completes, the kernel returns, and the process is back in user mode.

There’s quite a lot of material here, but I hope the comment helps out some. I hope to cover some of these topics in upcoming posts. Also, your understanding is pretty much all correct as far as I know, it sounds like there are just some details you don’t know about, but your idea of the whole thing seems accurate to me.
39.Gustavo Duarte on December 6th, 2008 4:09 am
the process itself is “waiting” (its state is saved by the kernel) until the system call returns.

Regarding the state being saved, this is all accomplished via the stack. Processes have two stacks: a user mode stack and a kernel mode stack. The stacks are used to preserve state when the process goes from user mode to kernel mode, and also when a process goes to sleep in kernel mode.
40.Timur Izhbulatov on December 6th, 2008 7:41 am
Gustavo, thanks for the explanations! I look forward to your new posts. This is just great that while having such deep understanding of system internals you put so much effort into sharing it with the world in accessible form.

Maybe someday, once you have enough articles, you’ll put them together and publish as a book?
41.Andrew Kirsanow on December 11th, 2008 2:33 pm
It’s nice to clear up some of this stuff, so thank you Sir. Would you happen to know if there is any test code out there to verify/demonstrate legal and illegal control transfers? I have been writing an X86 emulator as a hobby project for about 5 years now (started 5 years ago and took 4 and a half years off) and now I’ve come back to it I’m finding that some of the apps which fail to run are protected mode apps which die in suspect ways that I can only assume are due to errors in the control transfer opcode handling code. I would love a way to test the validity of this code without having to write yet another longwinded ASM test harness!
42.Gustavo Duarte on December 12th, 2008 12:58 am
@Andrew: you’re welcome, my pleasure.

Sounds like a cool project, but I don’t know of such code off the top of my head. Two projects that may have stuff like that, though, are Valgrind and the project for the user-mode kernel. They may have things that could help you out.
43.Andrew Kirsanow on December 15th, 2008 6:02 am
Thanks man! I will have a look into those projects. Yes, the project is cool to me because I made the primary design goal to use nobody else’s code either core or BIOS etc. In that way, I suppose it’s 3 projects in one as I develop VGA, BIOS and emulator core. I can currently run a lot of 386 DOS extender apps and Windows 3 runs in standard mode but there are still some apps that just fail for no apparent reason. Hence the thinking that I may have missed something in the PM privilege testing code, maybe some apps rely on the GP exceptions intentionally to allow the extenders to perform some task or other, a bit like marking pages Read only to signal the time to produce physical copies of process pages in fork.
44.el_bot on December 28th, 2008 2:45 pm
Regarding PAE in Windows: XP supports PAE using the “PAE switch” in boot.ini (but it’s still restricted to 4GB! http://www.microsoft.com/whdc/system/platform/server/PAE/PAEdrv.mspx OK, with PAE you get hardware support for DEP… why? I don’t know). It should be noted that Windows and Linux support PAE in different ways: boot option vs. compilation option (correct me if I am wrong).
Regarding “process in kernel/user mode”: I think that terminology is wrong (or at least confusing): a process never runs in kernel mode (i.e., ring 0); only the kernel runs in ring 0. At every instant the CPU can be in one of the following states:
- running a user process (ring 3, your code)
- running a syscall (ring 0, kernel code)
- running an interrupt handler (ring 0, kernel code)
- running a kernel thread (ring 0, kernel code)
The phrase “the process is in kernel mode” is, anyway, common (albeit incorrect… at least for the “purists” ).

Saludos
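As a rough illustration of the ring numbers in the list above: on x86 the current privilege level is held in the low two bits of the selector loaded in CS, so ring 0 versus ring 3 can be read straight off the selector. The Python sketch below decodes a selector into its fields. The two example selector values are the ones I believe 32-bit Linux uses for kernel and user code and should be treated as assumptions; the function and constant names are made up for illustration.

```python
# Sketch: decode an x86 segment selector into its three fields.
# Bits 0-1: privilege level, bit 2: table indicator, bits 3-15: descriptor index.

def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into privilege level, table and index."""
    return {
        "privilege": selector & 0x3,                  # ring number, 0 to 3
        "table": "LDT" if selector & 0x4 else "GDT",  # which descriptor table
        "index": selector >> 3,                       # entry within that table
    }

def ring_of(cs_selector: int) -> int:
    """Ring the CPU is running in when this selector is loaded in CS."""
    return cs_selector & 0x3

# Assumed example values (hypothetical here, believed to match 32-bit Linux).
KERNEL_CS = 0x60   # kernel code: syscalls, interrupt handlers, kernel threads
USER_CS = 0x73     # user code: your process

for name, cs in (("kernel", KERNEL_CS), ("user", USER_CS)):
    print(name, hex(cs), decode_selector(cs), "ring", ring_of(cs))
```

All three ring 0 cases in the list (syscall, interrupt handler, kernel thread) would show the same privilege level here; what distinguishes them is how the CPU got there, not the ring.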
45.Gustavo Duarte on December 29th, 2008 1:05 am
@el_bot:

Regarding PAE, Windows actually has a different binary for the PAE-enabled kernel. It is \Windows\System32\Ntkrnlpa.exe. So it is a compile-time option as well, but you are correct that we can pass a boot-time option to select the PAE kernel.

Regarding the kernel modes, I think you are correct in all that you wrote:

A process’ code does NOT run in kernel mode, ever;
The states you describe are all correct.

But it’s a terminology issue. The case of the running syscall is what people usually think of when saying “the process is running in kernel mode”. Because the process has a kernel-mode stack that is directly tied to it, I think it’s a fair thing to call it. At any rate, people use the term widely, so it became a de facto term for the syscall case.
46.el_bot on December 30th, 2008 7:54 am
Yes, you are right: there are two kernels (maybe more) in Windows. And yes: “the process is running in kernel mode” == “the kernel is running a syscall on behalf of the user process”.
Regarding the kernel-mode stack, I have this hypothesis:
“Actually, there is only one kernel-mode stack (per CPU or core). It is shared by syscalls (and therefore by all processes), interrupt handlers, and kernel threads running on that CPU. It is about a page in size (4KB or 8KB). The excellent quality of the kernel code makes that size sufficient (i.e., a stack overflow is unlikely).”
I’m not completely sure about this, except that interrupt handlers really do use the same kernel stack that the kernel would use if the current user process had issued a syscall. Well, maybe I’m wrong on this last point.

I’m waiting for your article about the memory layout (logical and/or physical) of a running Linux kernel (with processes, handlers, syscalls, etc.). And, of course, with your beautiful illustrations!
Saludos
47.Anatomy of a Program in Memory : Gustavo Duarte on January 27th, 2009 12:34 am
[...] space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, [...]
48.Mark Lambert on January 31st, 2009 11:52 pm
Brilliant article. Gustavo, you are a hero for taking the time to put these articles together. Your style is absolutely wonderful.
49.Gustavo Duarte on February 1st, 2009 1:17 am
@Mark: thanks so much for the feedback! *blush*

I enjoy writing the posts though, so I can’t claim self sacrifice. : P It’s fun, I learn a ton myself, and it’s great to feel like I’m helping people out a little by making decent content.

Regards to all you folks at MS.
50.Ya-tou & me » Blog Archive » Anatomy of a Program in Memory on February 19th, 2009 1:44 am
[...] map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, [...]
51.内存剖析 « Rock2012’s Blog on May 3rd, 2009 4:22 am
[...] map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, [...]
52.Funktionsweise eines Betriebssystems | duetsch.info - GNU/Linux, Open Source, Softwareentwicklung, Methodik und Vim. on December 28th, 2009 6:20 am
[...] CPU Rings, Privilege, and Protection [...]
53.Jose on January 7th, 2010 2:31 pm
Very nice writings, Mr David.
I have a question.

When the BIOS has finished its part of the boot and the kernel begins starting, does the kernel use only the memory reported by the BIOS?
54.nebor on May 24th, 2010 10:06 pm
Fantastic article, Gustavo!!!

But one thing started confusing me. It’s related to sysexit. After this instruction, the CPU sets CS to a hardcoded value (a code selector that points to a segment descriptor with base 0, limit 4GB, and privilege level 3).
How does it work after this?
The linear address is now different, because a new CS is in use (the logical address is added to the NEW code segment base address to create the linear address).
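A minimal sketch of the arithmetic behind this question, assuming the flat memory model in which the kernel code segment also has base 0 (that is an assumption here; the comment only states base 0 for the segment loaded by sysexit). The linear address is the segment base plus the offset, so switching CS from one base-0 segment to another leaves the linear address of any given offset unchanged. The sample offset and all names are hypothetical.

```python
MASK_32 = 0xFFFFFFFF  # linear addresses wrap at 4GB in 32-bit mode

def linear_address(segment_base: int, offset: int) -> int:
    """Linear address = segment base + logical offset, wrapped to 32 bits."""
    return (segment_base + offset) & MASK_32

KERNEL_CS_BASE = 0  # assumption: flat kernel code segment
USER_CS_BASE = 0    # base 0, as described for the segment sysexit selects

resume_offset = 0x08048123  # hypothetical user-mode instruction pointer

before = linear_address(KERNEL_CS_BASE, resume_offset)
after = linear_address(USER_CS_BASE, resume_offset)

# Same base in both segments, so the same offset maps to the same linear address.
print(hex(before), hex(after), before == after)
```

Under that assumption, what changes across sysexit is the privilege level attached to CS, not where a given offset lands in the linear address space.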
55.Tuan on May 28th, 2010 9:48 pm
I just found this website by chance, and it’s really helpful.
56.Inside Program Memory « h e a d – w o r d on June 6th, 2010 12:32 am
[...] map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, [...]
57.Johnson on June 21st, 2010 7:45 am
I was just passing by
58.Faraz on November 10th, 2010 3:08 am
Thanks, Gustavo

This article is very helpful for me.

Thanks again