
Anatomy of a Program in Memory

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.

Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:

Kernel User Memory Split
This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
Virtual Memory in Process Switch
Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply ranges of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
Linux Flexible Address Space Layout
When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
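To see the randomization, here's a throwaway sketch of mine (not from the original post): run it a few times on a Linux box with ASLR enabled and the stack, heap, and library addresses should shift between runs.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int local;                    /* lives on the stack */
    void *block = malloc(16);     /* lives on the heap */

    printf("stack: %p\n", (void *)&local);
    printf("heap:  %p\n", block);
    printf("libc:  %p\n", (void *)&printf);   /* memory mapping segment */

    free(block);
    return 0;
}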

The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the CPU caches, speeding up access. Each thread in a process gets its own stack.
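To get a feel for frames being pushed and popped, here's a quick sketch of mine (compile without optimizations so the calls are not flattened). It prints the address of a local at each depth – on x86 Linux the addresses should decrease as the stack grows down:

#include <stdio.h>

static void recurse(int depth)
{
    int local;    /* one local per frame, just to have an address to show */
    printf("depth %d: local at %p\n", depth, (void *)&local);
    if (depth < 3)
        recurse(depth + 1);
}

int main(void)
{
    recurse(0);
    return 0;
}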

It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
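You can check the limit yourself with ulimit -s in a shell, or from a program via getrlimit(). A minimal sketch (note the hard limit often comes back as RLIM_INFINITY, which prints as a huge number):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("stack soft limit: %lu bytes, hard limit: %lu bytes\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}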

Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.

Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call (implementation) or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
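As a rough sketch (mine, with error handling trimmed; /etc/hostname is just an arbitrary small file for illustration), a file-backed mapping and the mallopt() knob look like this on Linux:

#include <fcntl.h>
#include <malloc.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Map a file directly into the address space, read-only and private. */
    int fd = open("/etc/hostname", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    /* ... the file contents can now be read through 'data' like ordinary memory ... */
    munmap(data, st.st_size);
    close(fd);

    /* Lower the threshold so requests above 64 kB go to anonymous mmap()
       instead of the heap (the default is 128 kB). */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);
    void *big = malloc(256 * 1024);
    free(big);
    return 0;
}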

Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.

If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call (implementation) to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:
Fragmented Heap
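To watch the program break move, here's a tiny sketch of mine (glibc on Linux assumed; each request stays below MMAP_THRESHOLD so it is served from the heap, and the first one or two may fit in space the heap already has, in which case the break won't budge right away):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    printf("initial break: %p\n", sbrk(0));
    for (int i = 0; i < 4; i++) {
        void *p = malloc(100 * 1024);   /* below the mmap threshold */
        printf("after malloc #%d: block at %p, break at %p\n", i, p, sbrk(0));
    }
    return 0;
}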
Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.

The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!

The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:
Mapping Binary Image
You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
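To line the diagram up with a live process, a short sketch of mine (the variable names echo the examples above) prints one address per segment and then dumps its own map from /proc/self/maps:

#include <stdio.h>
#include <stdlib.h>

static int cntActiveUsers;            /* uninitialized static: BSS */
static int cntWorkerBees = 10;        /* initialized static: data segment */
static const char *gonzo = "hello";   /* pointer in data, literal mapped read-only with the binary image */

int main(void)
{
    int local;                        /* stack */
    void *block = malloc(16);         /* heap */

    printf("text:    %p (main)\n", (void *)&main);
    printf("data:    %p (cntWorkerBees)\n", (void *)&cntWorkerBees);
    printf("bss:     %p (cntActiveUsers)\n", (void *)&cntActiveUsers);
    printf("literal: %p\n", (void *)gonzo);
    printf("heap:    %p\n", block);
    printf("stack:   %p\n", (void *)&local);

    system("cat /proc/self/maps");    /* compare the addresses against the areas listed here */

    free(block);
    return 0;
}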

You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts back to the “classic” layout shown below:
Linux Classic Address Space Layout
That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.

Comments

Gustavo Duarte said:
Thank you all for the feedback!
@web dev: The kernel is not really using that much physical memory though; it simply has that virtual range available to itself to map whatever physical memory it wishes. Thanks for the question though – I clarified this in the post.
Both the Linux and Windows kernel are extremely well built. It’s hard to find areas where one really has an edge, imho. Two outstanding pieces of software.
@numerodix: You know, this ‘tightness’ of the address space is a sort of recent phenomenon. When the kernels were designed, 2 or 3GB seemed like a lot. So partially it’s an evolutionary artifact, one that is fixed by 64-bit address spaces.
But also, it is good for performance to give the kernel an ample window into memory. I think the next couple posts should clarify why this is.
@Reader1: thanks for the corrections. I’ll add the heap randomization to the post. Regarding ELF sections, I thought about them, but I’m always balancing what to include in these blog posts. I try hard to keep it crisp, covering one area well, but without dumbing anything down. But there’s so much interconnected stuff, it’s not always clear where to put the line. I think I’m going to leave ELF sections out for now though.
@Jose: thanks for the heads up
@el_bot: This is one similar to ELF sections above. The tradeoff between conciseness and completeness. I’m planning a post covering the stack in detail, and talking about buffer overflows, and I think that’d go in well there.
------------------------------------------------------------------------------------
First off, thank you all for the feedback!
It is great to hear that the post helped out a little bit. Contributing to the community is one of the major reasons I write this stuff, though it doesn’t hurt that it’s fun.
@vs: the idea of a book does surface in the comment threads from time to time. I see a few issues though: 1) I want to keep the content free, no matter what; 2) the color would be gone in a normal book; 3) the links would be gone.
Lately I’ve been thinking about maybe collecting all the stuff once there’s enough, and having an online book of sorts. Then maybe make color prints for a small fee if people wanted hard copies (I wouldn’t mind making money on these).
I really had no idea where this blog would go, though now it’s becoming a bit clearer. So yea, I’m munching on it.
@John: you are correct. Basically the set of threads in a thread group share all the memory regions except for the stack and thread-local-storage. Within the kernel, threads are represented with the same data structure used for processes, task_struct, so again you are correct.
Does this help clear it up? I could dig up the relevant links to kernel code if you’d like to see the stuff in action. Let me know.
-----------------------------------------------------------------------------------
[...] Anatomy of a Program in Memory Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory. (tags: windows reference programming hardware) [...]
-----------------------------------------------------------------------------------
Interesting article, very didactic and with figures!! xD … Well, being a little more serious, I think it explains the concepts very clearly and simply. Keep it up!
PS: for a more visual understanding of the Linux kernel, kernel profiling (aka /proc/profile & kerneltop) is very useful to see the internal functions and their corresponding behavior.
-----------------------------------------------------------------------------------
@Prabhu: that’s a great topic. There’s a good book about this called Linkers and Loaders. It’s from 2000, not sure how much has changed since. I’m going to add this to my write queue, though I have no idea when the post would actually come out.
----------------------------------------------------------------------------------
Hi Gustavo, thank you for all your excellent articles.
I have a question, two actually. As you’ve said, each thread has its own stack area. How are these stack areas located with respect to each other? E.g., if there are two threads in a process, T1 & T2, where would their stacks start and end in memory? The second question is similar: does each thread have two stacks – one for user mode and one for kernel mode?

@Raminder: Sorry for the delay, I’ve been working a bit lately. Per our email, I’ll talk about Windows only.

I don’t know where in the virtual space the thread stacks go. I googled briefly but didn’t see an answer, so I think the easiest thing to do is to write a short test program to spawn a few threads calling a function that prints the address of a local variable. If you do a loop of 10 or so the pattern should become clear. I found two relevant posts:
http://blogs.msdn.com/oldnewthing/archive/2005/07/29/444912.aspx

http://software.intel.com/en-us/articles/adjusting-thread-stack-address-to-improve-performance-on-intel-xeonr-processors/
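Something along those lines might look like this – a rough sketch I haven’t run, using plain Win32 calls:

#include <stdio.h>
#include <windows.h>

static DWORD WINAPI worker(LPVOID arg)
{
    int local;   /* lives in this thread's own stack */
    (void)arg;
    printf("thread %lu: local at %p\n", GetCurrentThreadId(), (void *)&local);
    return 0;
}

int main(void)
{
    HANDLE threads[10];
    for (int i = 0; i < 10; i++)
        threads[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(10, threads, TRUE, INFINITE);
    for (int i = 0; i < 10; i++)
        CloseHandle(threads[i]);
    return 0;
}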
Regarding the second question, YES, threads have two stacks: a large one for user mode (1MB by default it looks like) and a tiny one for kernel mode (just a few kilobytes, 12k for x86). The kernel stack is kept in kernel-mode data structures and can’t be touched from user mode.
Hope this helps.
----------------------------------------------------------------------------------
@Alex,
Thanks! That’s right, a process can’t use more than 3GB of RAM. That’s why, for example, the memcached folks tell you to run multiple instances when your box has more than 3GB and is running in 32-bit mode with PAE. Regarding the numbers in top, that would be interesting to see. It could be a quirk with the numbers themselves, or it could be that there’s some exception going on – but in general your understanding is correct – processes can’t use more than 3GB.

Regarding the 1GB not being shown in VIRT, it’s because the kernel accounting ignores the kernel space. It’s just a design issue – why worry about it since it’s there for every process? People would be shocked to see /bin/ls running with 1GB of virt space.
-----------------------------------------------------------------------------------
maverick said:
great post.. keep up the good work.. I have a doubt regarding the memory mapped area for shared libraries.. it starts at 0x40000000. Does it grow upwards or downwards? I remember it grows upwards. In both your figures, it’s drawn differently.

Narayanan said:
Hi, I have a doubt regarding malloc allocating memory. How does malloc store information about the size of an allocation, given that free takes only the pointer as an argument and not the size? Can you explain in which part of the address space this is stored? Thanks in advance…

@maverick: in x86 Linux it grows as shown in the diagrams, but this varies by CPU architecture and kernel.
@Narayanan: Malloc does its own housekeeping to know how much was allocated to each pointer. The best place to check this out is to read the libc source code for malloc and free.
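For a quick peek without opening the source, glibc also exposes malloc_usable_size(), which reports how much usable space the allocator set aside for a given pointer. A small, glibc-specific sketch:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(100);
    /* glibc keeps the chunk size in a header just before the returned
       pointer; malloc_usable_size() reads that bookkeeping. */
    printf("requested 100 bytes, usable size: %zu bytes\n", malloc_usable_size(p));
    free(p);
    return 0;
}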
---------------------------------------------------------------------------------
Brian said:
Thanks for this post Gustavo. I have a question, though.
I’m mainly a sysadmin, not a low-level developer, but I need to understand this stuff for low-level debugging at the system level. Near the top of this post, you mention “ring 2 or lower” as if we should all just know what that even means, and I’m sorry to say that I do not. Could you point me to a doc that’ll explain that, or could you expand on what this notion of “rings” relates to?
Thanks — all of your posts are top notch.

Hi Brian,
Here you go: CPU Rings, Privilege, and Protection
cheers
--------------------------------------------------------------------------------
Great post! Solved tons of doubts of mine.
Though I still have a few questions I hope you can clarify for me.
If I understand them right:
1)The kernel stuff of kernel space (1GB) is in the physical memory (1GB) all the time, unless a user process is trying to access the physical memory mapped to kernel space. If that is the case, swapping will happen. Right?
2)So if I only have enough physical memory for kernel mapping, all my user processes will need to use the memory mapped to kernel. So I will have a lot of swapping going on. Right?
3)Why 1GB? Is it based on the size of resident processes and other necessary structures? Or is it hardware related?
Thanks in advance!
---------------------------------------------------------------------------------
Nice article. It should be noted that there is a special case: When clone(2) is used instead of fork(2) to create a process, the address mappings are replicated (stack excepted).
----------------------------------------------------------------------------------
It’s not clear which kernel/libc version the above information applies to.

For instance, on CentOS 5.3 running a 2.6.18 kernel with Redhat’s patches 2.6.18-128.7.1.el5PAE and GNU libc 2.5, some shared libraries are located beneath the code segment – which contradicts the figure shown above.

> tac /proc/21779/maps
bfeba000-bff0f000 rw-p bffaa000 00:00 0 [stack]
b7f98000-b7f9f000 r--s 00000000 fd:00 36079468 /usr/lib/gconv/gconv-modules.cache
b7f93000-b7f95000 rw-p b7f93000 00:00 0
b7d93000-b7f93000 r--p 00000000 fd:00 36001563 /usr/lib/locale/locale-archive
08ae9000-08b4c000 rw-p 08ae9000 00:00 0 [heap]
0809c000-080f9000 rw-p 0809c000 00:00 0
08098000-0809c000 rw-p 00050000 fd:00 90800216 /bin/tcsh
08047000-08098000 r-xp 00000000 fd:00 90800216 /bin/tcsh
02b4d000-02b4e000 rwxp 00002000 fd:00 95783275 /lib/libtermcap.so.2.0.8
02b4a000-02b4d000 r-xp 00000000 fd:00 95783275 /lib/libtermcap.so.2.0.8
028ac000-028d3000 rwxp 028ac000 00:00 0
028ab000-028ac000 rwxp 00009000 fd:00 95783323 /lib/libcrypt-2.5.so
028aa000-028ab000 r-xp 00008000 fd:00 95783323 /lib/libcrypt-2.5.so
028a1000-028aa000 r-xp 00000000 fd:00 95783323 /lib/libcrypt-2.5.so
0090a000-0090b000 rwxp 00009000 fd:00 95780903 /lib/libnss_files-2.5.so
00909000-0090a000 r-xp 00008000 fd:00 95780903 /lib/libnss_files-2.5.so
00900000-00909000 r-xp 00000000 fd:00 95780903 /lib/libnss_files-2.5.so
008da000-008dc000 rwxp 008da000 00:00 0
008d9000-008da000 rwxp 0000f000 fd:00 95783322 /lib/libresolv-2.5.so
008d8000-008d9000 r-xp 0000e000 fd:00 95783322 /lib/libresolv-2.5.so
008c9000-008d8000 r-xp 00000000 fd:00 95783322 /lib/libresolv-2.5.so
006d4000-006d7000 rwxp 006d4000 00:00 0
006d3000-006d4000 rwxp 00140000 fd:00 95781738 /lib/libc-2.5.so
006d1000-006d3000 r-xp 0013e000 fd:00 95781738 /lib/libc-2.5.so
00593000-006d1000 r-xp 00000000 fd:00 95781738 /lib/libc-2.5.so
00480000-00481000 rwxp 0001a000 fd:00 95780954 /lib/ld-2.5.so
0047f000-00480000 r-xp 00019000 fd:00 95780954 /lib/ld-2.5.so
00465000-0047f000 r-xp 00000000 fd:00 95780954 /lib/ld-2.5.so
00451000-00452000 r-xp 00451000 00:00 0 [vdso]
00115000-00116000 rwxp 00004000 fd:00 95780901 /lib/libnss_dns-2.5.so
00114000-00115000 r-xp 00003000 fd:00 95780901 /lib/libnss_dns-2.5.so
00110000-00114000 r-xp 00000000 fd:00 95780901 /lib/libnss_dns-2.5.so
----------------------------------------------------------------------------------
