13/3/11

Memory Translation and Segmentation

This post is the first in a series about memory and protection in Intel-compatible (x86) computers, going further down the path of how kernels work. As in the boot series, I’ll link to Linux kernel sources but give Windows examples as well (sorry, I’m ignorant about the BSDs and the Mac, but most of the discussion applies). Let me know what I screw up.

In the chipsets that power Intel motherboards, memory is accessed by the CPU via the front side bus, which connects it to the northbridge chip. The memory addresses exchanged in the front side bus are physical memory addresses, raw numbers from zero to the top of the available physical memory. These numbers are mapped to physical RAM sticks by the northbridge. Physical addresses are concrete and final – no translation, no paging, no privilege checks – you put them on the bus and that’s that. Within the CPU, however, programs use logical memory addresses, which must be translated into physical addresses before memory access can take place. Conceptually address translation looks like this:
Memory Address Translation in x86 with paging enabled
This is not a physical diagram, only a depiction of the address translation process, specifically for when the CPU has paging enabled. If you turn off paging, the output from the segmentation unit is already a physical address; in 16-bit real mode that is always the case. Translation starts when the CPU executes an instruction that refers to a memory address. The first step is translating that logic address into a linear address. But why go through this step instead of having software use linear (or physical) addresses directly? For roughly the same reason humans have an appendix whose primary function is getting infected. It’s a wrinkle of evolution. To really make sense of x86 segmentation we need to go back to 1978.

The original 8086 had 16-bit registers and its instructions used mostly 8-bit or 16-bit operands. This allowed code to work with 216 bytes, or 64K of memory, yet Intel engineers were keen on letting the CPU use more memory without expanding the size of registers and instructions. So they introduced segment registers as a means to tell the CPU which 64K chunk of memory a program’s instructions were going to work on. It was a reasonable solution: first you load a segment register, effectively saying “here, I want to work on the memory chunk starting at X”; afterwards, 16-bit memory addresses used by your code are interpreted as offsets into your chunk, or segment. There were four segment registers: one for the stack (ss), one for program code (cs), and two for data (ds, es). Most programs were small enough back then to fit their whole stack, code, and data each in a 64K segment, so segmentation was often transparent.

Nowadays segmentation is still present and is always enabled in x86 processors. Each instruction that touches memory implicitly uses a segment register. For example, a jump instruction uses the code segment register (cs) whereas a stack push instruction uses the stack segment register (ss). In most cases you can explicitly override the segment register used by an instruction. Segment registers store 16-bit segment selectors; they can be loaded directly with instructions like MOV. The sole exception is cs, which can only be changed by instructions that affect the flow of execution, like CALL or JMP. Though segmentation is always on, it works differently in real mode versus protected mode.

In real mode, such as during early boot, the segment selector is a 16-bit number specifying the physical memory address for the start of a segment. This number must somehow be scaled, otherwise it would also be limited to 64K, defeating the purpose of segmentation. For example, the CPU could use the segment selector as the 16 most significant bits of the physical memory address (by shifting it 16 bits to the left, which is equivalent to multiplying by 216). This simple rule would enable segments to address 4 gigs of memory in 64K chunks. Sadly Intel made a bizarre decision to multiply the segment selector by only 24 (or 16), which in a single stroke confined memory to about 1MB and unduly complicated translation. Here’s an example showing a jump instruction where cs contains 0×1000:
Real Mode Segmentation
Real mode segment starts range from 0 all the way to 0xFFFF0 (16 bytes short of 1 MB) in 16-byte increments. To these values you add a 16-bit offset (the logical address) between 0 and 0xFFFF. It follows that there are multiple segment/offset combinations pointing to the same memory location, and physical addresses fall above 1MB if your segment is high enough (see the infamous A20 line). Also, when writing C code in real mode a far pointer is a pointer that contains both the segment selector and the logical address, which allows it to address 1MB of memory. Far indeed. As programs started getting bigger and outgrowing 64K segments, segmentation and its strange ways complicated development for the x86 platform. This may all sound quaintly odd now but it has driven programmers into the wretched depths of madness.

In 32-bit protected mode, a segment selector is no longer a raw number, but instead it contains an index into a table of segment descriptors. The table is simply an array containing 8-byte records, where each record describes one segment and looks thus:
Segment Descriptor
There are three types of segments: code, data, and system. For brevity, only the common features in the descriptor are shown here. The base address is a 32-bit linear address pointing to the beginning of the segment, while the limit specifies how big the segment is. Adding the base address to a logical memory address yields a linear address. DPL is the descriptor privilege level; it is a number from 0 (most privileged, kernel mode) to 3 (least privileged, user mode) that controls access to the segment.

These segment descriptors are stored in two tables: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Each CPU (or core) in a computer contains a register called gdtr which stores the linear memory address of the first byte in the GDT. To choose a segment, you must load a segment register with a segment selector in the following format:
Segment Selector
The TI bit is 0 for the GDT and 1 for the LDT, while the index specifies the desired segment selector within the table. We’ll deal with RPL, Requested Privilege Level, later on. Now, come to think of it, when the CPU is in 32-bit mode registers and instructions can address the entire linear address space anyway, so there’s really no need to give them a push with a base address or other shenanigan. So why not set the base address to zero and let logical addresses coincide with linear addresses? Intel docs call this “flat model” and it’s exactly what modern x86 kernels do (they use the basic flat model, specifically). Basic flat model is equivalent to disabling segmentation when it comes to translating memory addresses. So in all its glory, here’s the jump example running in 32-bit protected mode, with real-world values for a Linux user-mode app:
Protected Mode Segmentation
The contents of a segment descriptor are cached once they are accessed, so there’s no need to actually read the GDT in subsequent accesses, which would kill performance. Each segment register has a hidden part to store the cached descriptor that corresponds to its segment selector. For more details, including more info on the LDT, see chapter 3 of the Intel System Programming Guide Volume 3a. Volumes 2a and 2b, which cover every x86 instruction, also shed light on the various types of x86 addressing operands – 16-bit, 16-bit with segment selector (which can be used by far pointers), 32-bit, etc.

In Linux, only 3 segment descriptors are used during boot. They are defined with the GDT_ENTRY macro and stored in the boot_gdt array. Two of the segments are flat, addressing the entire 32-bit space: a code segment loaded into cs and a data segment loaded into the other segment registers. The third segment is a system segment called the Task State Segment. After boot, each CPU has its own copy of the GDT. They are all nearly identical, but a few entries change depending on the running process. You can see the layout of the Linux GDT in segment.h and its instantiation is here. There are four primary GDT entries: two flat ones for code and data in kernel mode, and another two for user mode. When looking at the Linux GDT, notice the holes inserted on purpose to align data with CPU cache lines – an artifact of the von Neumann bottleneck that has become a plague. Finally, the classic “Segmentation fault” Unix error message is not due to x86-style segments, but rather invalid memory addresses normally detected by the paging unit – alas, topic for an upcoming post.

Intel deftly worked around their original segmentation kludge, offering a flexible way for us to choose whether to segment or go flat. Since coinciding logical and linear addresses are simpler to handle, they became standard, such that 64-bit mode now enforces a flat linear address space. But even in flat mode segments are still crucial for x86 protection, the mechanism that defends the kernel from user-mode processes and every process from each other. It’s a dog eat dog world out there! In the next post, we’ll take a peek at protection levels and how segments implement them.

Comments
36 Responses to “Memory Translation and Segmentation”

1.Chuck on August 12th, 2008 7:03 am
For what it’s worth, I’m really enjoying your articles. Please continue to write
2.Mahesh on August 12th, 2008 9:40 pm
I liked the explanation, very well articulated.
3.Gustavo Duarte on August 13th, 2008 12:34 am
@Chuck: it’s worth a lot I enjoy writing these posts, I write them for fun but the fact that people seem to like them is definitely encouraging too.

I’m cooking up the next one here… I’m actually in Hawaii this week, I’ve been waking up at ~6am to snorkel and dive, sleeping early, but this evening I’m writing a bit of the protection stuff

@Mahesh: thanks!
4.Arvind on August 13th, 2008 2:35 am
Nice article!! I have subscribed to your blog feed and everytime I check for new items your blog is the first I look for unread items..Elated that I found one today..Good read..
5.Ben fowler on August 13th, 2008 4:53 pm
I thoroughly enjoyed this well-written article as well as your others. It’s interesting, but challenging material for a lot of people, and I know it’d otherwise take a LOT of reading around to get the information elsewhere.

You have a knack for this Gustavo! Have you considered teaching, or at least turning this material into a book? Keep up the great work!
6.notmuch on August 15th, 2008 9:08 am
Nicely done. For long I have been pondering upon how to visualize the workings of hardware. From CPU clock cycle and instructions, interrupt mechanism and interaction with software, and memory. These articles are a good step in that direction. Thanks much.
7.JinxterX on August 17th, 2008 10:48 am
You write great articles, thanks, any chance of doing one on Linux and MTRRs?
8.Hormiga Jones on August 19th, 2008 9:22 am
Your series of articles are HIGHLY informative and well written. I have forwarded them onto several of my work colleagues. Thank you for this valuable resource and keep up the good work.
9.CPU Rings, Privilege, and Protection : Gustavo Duarte on August 20th, 2008 12:39 am
[...] let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they [...]
10.Gustavo Duarte on August 20th, 2008 12:51 am
Thank you all for the feedback!

I was in Hawaii last week, hence the belated reply. I did crank out the protection stuff in the flight back to Denver though

@Ben: You know, I started the blog just as a way to write random stuff. I never expected to get hits hehe. So now I have thought about some of the stuff you mentioned, especially books.

The trouble with books is that the stuff ends up locked and inaccessible to people, plus you lose things like being able to link to the source code directly (or linking in general).

I also thought about doing a print-friendly CSS or maybe render some stuff as PDF to let people download. That might be a good way to go.

Teaching would be fun… I love teaching people who are interested in learning. hehe. I don’t plan ahead too much, so I guess I’ll see where it goes. Thanks for the encouraging words.

@JinxterX: yea, that sounds cool. I thought about writing about cache lines and how memory access happens “for real”, and talking about MTRRs would make a lot of sense. I’ll write it down here. I think after this last series though I’ll do a hiatus on CPU internals type of stuff, because I didn’t want the blog to be just about that. But I’ve added MTRRs to the !ideas.txt here
11.Peter Teoh on September 19th, 2008 3:19 am
Thank you for the article. I am playing with the memory in Linux Kernel now. When I touch/modify things like PTE, PDE etc do I need to be in preemption disabled mode? And whenever I modify these, is flushing of TLB needed? Will it lead to crash if not? I am always getting crashes doing all these, so not sure what causes the crashes?
12.Mario on September 23rd, 2008 7:27 pm
Hey guy,

These are some of the best-written architecture articles I have came across. I started teaching myself x86 assembly awhile back and realized I was getting nowhere without really understanding memory management and how the CPU operates. Your articles are filling in a lot of gaps for me and I truly do appreciate it.

-Mario
13.Nikhil on December 11th, 2008 7:43 am
Hi Gustavo,
like everyone already has said…highly informative, well articulated, best written literature about x86 on the internet. In my pursuit of trying to understand segments, descriptors, selectors this is serving the purpose wonderfully.

again i cannot help but echo other people’s thoughts when i suggest you should write a book.
Thanks a lot.
14.Gustavo Duarte on December 12th, 2008 12:56 am
@Peter: sorry for not replying in time. Unfortunately it’s hard for me to keep up with the comments sometimes, though I’m trying to do a better job. I’ll email you to see if your doubts are still current.

@Mario, @Nikhil: wow, thanks a LOT for the kind words That’s really encouraging. I had no idea I was any good at this stuff until I started blogging, but the fact that my stuff works at least for some of you is really cool. I get a huge kick out of teaching stuff, especially computing since I like it a lot. So, thanks again – hearing stuff like this makes me want to post more for sure.

Regarding the book, I have thought about it. The trouble is that I wouldn’t want to lock the content away. I want it to be freely accessible. But I also thought about maybe assembling the stuff and making an online book.

I would love to do something like that. Time is the problem, as usual hehe.
15.el_bot on December 28th, 2008 9:15 am
Hi Gustavo. I have a doubt. Are you sure that gdtr store a linear address? Is not it a physical address? If paging is disabled it should be physical; but if paging is enabled, well, I don’t know. If it is linear then you can get a fault page when accesing to the GDT; something (I think) problematic in this stage (you can have 2 page faults in only one memory access!).

Good blog and happy new year.
16.Gustavo Duarte on December 29th, 2008 12:55 am
@el_bot: thanks, and happy new year to you as well!

I’m sure about the GDTR. Here is the relevant bit from Section 2.4.1 in the Intel System Programming Guide 3A:

“The GDTR register holds the base address (32 bits in protected mode; 64 bits in IA-32e mode) and the 16-bit table limit for the GDT. The base address specifies the linear address of byte 0 of the GDT; the table limit specifies the number of bytes in the table.”

The CR3 register though (also called PDBR) which points to the page directory does hold a physical address. Also, see section 8.8, Software Initialization for Protected Mode Operation. It covers some of the initializations that must be done.

My post on the kernel boot up also might help at http://duartes.org/gustavo/blog/post/kernel-boot-process

Cheers!
17.el_bot on December 30th, 2008 8:58 am
Thanks for the specific references. When I have a bit of free time I will read theses.
Ok, your are (again) right. But in the case the CPU running in protected mode 32bit with paging disabled, you would need storing a physical address in GDTR (ok, you can say “it’s anyway a linear adress, but in this case is like if you are using (fictitious) page tables performing a identity mapping; i.e; X (linear)-> X (physical) ). In any case, before switch to mode 32bit protected with paging disabled, you must store in the GTDR the physical address of the GTD. It’s a assumption, but I will try check it in the linux kernel code (http://lxr.linux.no/linux+v2.6.25.6/arch/x86/boot/pm.c#L115 ? I think it is done in the line ‘asm volatile(“lgdtl %0″ : : “m” (gdt)); ‘ but my understanding about assembler embeded in C is very poor… My “theory” is “gdt.ptr store, in this point, the physical adress of the array boot_gdt” ).
Yes, CR3 MUST store a physical address (btw, it happen in any architecture supporting paging; the pointer to the table page must be a physical pointer). If it not, well… that’s don’t work.

And yes, I read your great article about booting-up; actually I am basing in it for my asummption(s)!

Saludos, and thanks for your replies.
P.S : Again, my English surely is not correct (I make my best effort…). Please, if you believe necessary, correct my words. English readers and I will are grateful with you
18.Anatomy of a Program in Memory : Gustavo Duarte on January 27th, 2009 9:28 am
[...] on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux [...]
19.McGrew Security Blog » Blog Archive » Gustavo Duarte’s Great Internals Series on January 27th, 2009 3:23 pm
[...] Memory Translation and Segmentation [...]
20.Quick Note on Diagrams and the Blog : Gustavo Duarte on January 28th, 2009 6:21 pm
[...] colors hold from the earliest post about memory to the latest. This convention is why the post about Intel CPU caches shows a blue [...]
21.travis on February 6th, 2009 3:08 am
Great posts…

I went to where the gdt_page is instantiated (http://lxr.linux.no/linux+v2.6.25.6/arch/x86/kernel/cpu/common.c#L24)

It has the following code:

[GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00cffa00 } } }

Do you know what that means?
22.Gustavo Duarte on February 6th, 2009 11:35 pm
@travis:

This line is building the 8-byte segment descriptor for the user code segment. To really follow it, there are 3 things you must bear in mind:

1. The x86 is little endian, meaning that for multi-byte data types (say, 32-bit or 64-bit integers), the significance of bytes grows with memory address. If you declare a 32-bit integer as 0xdeadbeef, then it would be laid out in memory like this (in hex, assuming memory addresses are growing to the right):

ef be ad de
lower => higher

2. In array declarations, or in this case a struct declaration, earlier elements go into lower memory addresses.

3. The convention for Intel diagrams is to draw things with HIGHER memory addresses on the LEFT and on TOP. This is a bit counter intuitive, but I followed it to be consistent with Intel docs.

When you put this all together, the declaration above will translate into the following bytes in memory, using Intel’s ‘reversed’ notation:

(higher)

00 cf fa 00
00 00 ff ff

(lower)

If you compare these values against the segment descriptor diagram above, you’ll see that: the ‘base’ fields are all zero, so the segment starts at 0×00000000, the limit is 0xfffff so the limit covers 4GB, byte 47-40 is 11111010, so the DPL is 3 for ring 3.

If you look into the Intel docs, they describe the fields I left grayed out. Hope this helps!
23.travis on February 7th, 2009 1:27 am
Awesome! Thanks for the very clear explanation.

Do you know of a Linux forum that is open to these type of detailed questions? Sometimes it’s very difficult to find answers using google, and “Understanding the Linux Kernel” doesn’t cover some of the things that confuse me.
24.Gustavo Duarte on February 7th, 2009 9:57 am
@travis: I don’t ;( There used to be a ‘kernel janitors’ project to get people to do simple patches for the kernel, and a ‘kernel newbies’ to try to teach kernel basics. But I’m not sure where they are. I don’t use forums much, so there might be something good out there. If you find anything, I’d like to hear about it.

I also thought about installing some forum software on my server so people could talk about this stuff. However, I’m afraid of spending gobs of time there. I’m pretty strict about not getting into stuff that takes too much time, which is why I don’t touch Twitter : P
25.Ya-tou & me » Blog Archive » CPU Rings, Privilege, and Protection on February 19th, 2009 1:46 am
[...] let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they [...]
26.Raúl on April 12th, 2009 9:06 pm
Gustavo, I don’t know how to thank you. Your articles are beautiful and very well explained. Please continue writing. Why don’t you write a book?. You are one of the best teachers I’ve ever found. You save me hours trying to find information and hard-studying. Sincerily, thank you very much.
27.Gustavo Duarte on April 14th, 2009 1:44 pm
@Raul: wow, thank you very much for your comment. It’s great to hear from people who have learned from or have been helped by this material. It’s the best incentive.

Regarding the book, stay tuned
28.内存剖析 « Rock2012’s Blog on May 3rd, 2009 4:26 am
[...] on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux [...]
29.Joel on May 11th, 2009 6:47 am
Thanks for such a beautiful article, articulated way beyond expression. I hope you write a book some day.

One of the things that confused me though was: Your gdt had ‘limit’ but you hadn’t mentioned that there was a granularity flag that multiplied it by 4k when set. Later on you go on to mention that in modern kernels flat model, each descriptor describes a segment of upto 4GB in size (32 bits) but the gdt ‘limit’ being only 20 bits made me wonder how.

Thanks
30.Rahul on July 13th, 2009 1:49 pm
Hi Gustavo,

In terms of making technical concepts clear, your posts are the best I have seen.

One question, can you please explain as to what purpose is served by the LDT ? Does any real OS ever really uses the LDT.

Thanks
Rahul
31.Krish on August 4th, 2009 1:46 pm
Hi Gustavo,

Let me begin by thanking you for a wonderful article.

In your article (the diagram for “Protected mode segmentation”), the logical address is the same as the linear address because the base is 0 in a flat model.

My understanding is that the linear address is used to decode the physical address of the page directory table and the page table thereafter to finally get the physical page value containing the segment.

There is a good chance that another process might also generate the same logical address; and with the base 0, will generate the same linear address. Does this mean that it will eventually point to the same page table entry?

Who decides which page table is assigned to which process segment? How is the segment selector value assigned (populated in the segment registers) to the process?

Thanking you in anticipation.

Krishnan.
32.Darshan on February 1st, 2010 11:37 pm
Hi Gustavo,

Thank you for giving such a informative article. I learnt many concepts from this.. thank you brother!!

Darshan.
33.Jon on February 19th, 2010 5:06 pm
Thanks for these articles, I feel I have come a bit late to the party, having only just found them, and the few I have read do far have been the clearest of anything I have read!

Just one thought though, I see you mentionerd that you had thought of a forum?

Perhaps you are right, that it might take up too much time, but I feel that there might be a better format than a blog, now that there is so much content.

I dunno what form would be best however =)

I want to read all of them, but would really appreciate a way of jumping between them, to re-read/cross reference and quickly find a specific topic.

But saying that you seem to have a talent of comunicating these tough subjects clearly and with a good deal of humour (needed with the “dryness” of the subject matter!) pls pls pls keep them coming.

jon
34.saurin on July 22nd, 2010 12:43 pm
Very good explanation. Thanks
Saurin
35.Ishan on October 1st, 2010 4:44 am
Can you explain the difference between logical address and virtual address?
Thanks Sphere: Related Content

1 comentario:

Anónimo dijo...

what can i eat with fatty liver disease what can i eat with fatty liver disease what
can i eat with fatty liver disease

Here is my web-site - vitamin e fatty liver adults