3/4/11

What really happens when you navigate to a URL

As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.
In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.

1. You enter a URL into the browser

It all starts here:

2. The browser looks up the IP address for the domain name

The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:
  • Browser cache – The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2 – 30 minutes)
  • OS cache – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache
  • Router cache – The request continues on to your router, which typically has its own DNS cache
  • ISP DNS cache – The next place checked is your ISP’s DNS server, which naturally has a cache of its own
  • Recursive search – Your ISP’s DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook’s nameserver. Normally, the DNS server will have names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary
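The first two layers of this lookup can be sketched in Python. This is only an illustration, not how any particular browser is implemented: the class name `FixedTtlDnsCache`, the 60-second default, and the injectable `resolve` and `clock` parameters are assumptions made for the example. On a miss it falls back to `socket.gethostbyname`, which asks the OS resolver.

```python
import socket
import time


class FixedTtlDnsCache:
    """Browser-style DNS cache: every record is kept for the same fixed time."""

    def __init__(self, ttl_seconds=60.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.resolve = resolve    # fallback: ask the OS, which has its own cache
        self.clock = clock
        self._records = {}        # host -> (ip, time the record was stored)

    def lookup(self, host):
        entry = self._records.get(host)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl_seconds:
            return entry[0]       # cache hit, no system call needed
        ip = self.resolve(host)   # cache miss or expired record
        self._records[host] = (ip, now)
        return ip
```

Because the cache never sees the real time-to-live of a record, a single fixed duration is the only expiry rule it can apply, which mirrors the behavior described in the first bullet.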
Here is a diagram of what a recursive DNS search looks like:
One worrying thing about DNS is that the entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:
  • Round-robin DNS is a solution where the DNS lookup returns multiple IP addresses, rather than just one. For example, facebook.com actually maps to four IP addresses.
  • Load-balancer is the piece of hardware that listens on a particular IP address and forwards the requests to other servers. Major sites will typically use expensive high-performance load balancers.
  • Geographic DNS improves scalability by mapping a domain name to different IP addresses, depending on the client’s geographic location.
    This is great for hosting static content so that different servers don’t have to update shared state.
  • Anycast is a routing technique where a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP and is rarely used in that scenario.
Most of the DNS servers themselves use anycast to achieve high availability and low latency of the DNS lookups.
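As a rough illustration of round-robin DNS, here is a toy record set that returns all of its addresses but rotates their order on every query, so that successive clients tend to connect to different servers. The class name `RoundRobinRecord` is made up for this sketch, and real DNS servers may implement rotation differently.

```python
from collections import deque


class RoundRobinRecord:
    """Toy DNS record set that rotates its address list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        reply = list(self._addresses)   # full list in the current order
        self._addresses.rotate(-1)      # the next query starts one address later
        return reply
```

Clients usually try the first address in the reply, so rotating the order spreads new connections across the servers without any shared state between them.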

3. The browser sends an HTTP request to the web server

You can be pretty sure that Facebook’s homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).
So, the browser will send this request to the Facebook server:

GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: “http://facebook.com/”. The browser identifies itself (User-Agent header), and states what types of responses it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.

The request also contains the cookies that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a web site in between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of user’s settings, etc. The cookies will be stored in a text file on the client, and sent to the server with every request.
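To make the key-value structure concrete, here is a small sketch of how a server-side program might read the Cookie header using Python's standard `http.cookies` module. The helper name `parse_cookie_header` is an assumption for this example.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Turn a Cookie header value into a plain dict of name -> value."""
    jar = SimpleCookie()
    jar.load(header_value)   # parses "name=value; name2=value2"
    return {name: morsel.value for name, morsel in jar.items()}
```

Calling it on a value shaped like the one in the request above, such as `"locale=en_US; c_user=2101"`, yields one dictionary entry per cookie.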

There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is Fiddler, but there are many other tools (e.g., FireBug). These tools are a great help when optimizing a site.

In addition to GET requests, another type of request that you may be familiar with is the POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g.: http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just under the headers.
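The difference is easy to see when the raw request text is built by hand. The two helpers below are a minimal sketch (the names `build_get` and `build_post` are invented for the example). They write the path-only form of the request line, which is what a browser sends when talking directly to a server, rather than the full-URL form shown in the capture above.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """Return raw HTTP/1.1 GET text with the parameters in the query string."""
    target = path + "?" + urlencode(params)
    return f"GET {target} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def build_post(host, path, params):
    """Return raw HTTP/1.1 POST text with the parameters in the body."""
    body = urlencode(params)
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"               # blank line separates headers from body
        f"{body}"
    )
```

With `build_get("robozzle.com", "/puzzle.aspx", {"id": 85})` the parameter appears in the request line, while `build_post` with the same arguments places `id=85` after the blank line that ends the headers.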

The trailing slash in the URL “http://facebook.com/” is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.

4. The Facebook server responds with a permanent redirect

This is the response that the Facebook server sent back to the browser request:
HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT; path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0

The server responded with a 301 Moved Permanently response to tell the browser to go to “http://www.facebook.com/” instead of “http://facebook.com/”.
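What the browser does with such a response can be sketched as follows. The function names are hypothetical and the parsing is deliberately simplified (for example, repeated headers overwrite each other), so treat it as an illustration rather than a complete HTTP parser.

```python
def parse_response_head(raw):
    """Split a raw HTTP response head into (status code, headers dict)."""
    lines = raw.split("\r\n")
    status_code = int(lines[0].split(" ", 2)[1])   # "HTTP/1.1 301 Moved ..."
    headers = {}
    for line in lines[1:]:
        if not line:
            break                                  # blank line ends the headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


def redirect_target(raw):
    """Return the Location URL for a redirect response, or None otherwise."""
    status_code, headers = parse_response_head(raw)
    if status_code in (301, 302, 303, 307, 308):
        return headers.get("location")
    return None
```

For the response above, `redirect_target` would return the value of the Location header, and the browser would issue a new request to that URL.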

There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.

One reason has to do with search engine rankings. See, if there are two URLs for the same page, say http://www.igoro.com/ and http://igoro.com/, a search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.

Also, multiple URLs for the same content are not cache-friendly. When a piece of content has multiple names, it will potentially appear multiple times in caches.

5. The browser follows the redirect

The browser now knows that “http://www.facebook.com/” is the correct URL to go to, and so it sends out another GET request:
GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com

The meaning of the headers is the same as for the first request.

6. The server ‘handles’ the request

The server will receive the GET request, process it, and send back a response.

This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here – even on a simple site like my blog, let alone on a massively scalable site like Facebook.
  • Web server software
    The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, …) that reads the request and generates the HTML for the response.
    In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example http://example.com/folder1/page1.aspx URL will map to file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be http://example.com/folder1/page1.
  • Request handler
    The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate an HTML response.
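The two URL-mapping strategies described under "Web server software" can be sketched in a few lines. The function name `resolve_handler`, the `routes` dictionary, and the default `/httpdocs` root are assumptions for the example. Real servers such as IIS or Apache are configured through their own mechanisms.

```python
import posixpath


def resolve_handler(url_path, routes, doc_root="/httpdocs"):
    """Pick a handler file for a URL path.

    Explicit routes win; otherwise the URL path is mirrored under doc_root.
    """
    if url_path in routes:
        return routes[url_path]
    # normpath collapses ".." segments so a request cannot escape doc_root
    safe = posixpath.normpath("/" + url_path.lstrip("/"))
    return doc_root + safe
```

With an empty `routes` table, `/folder1/page1.aspx` maps to `/httpdocs/folder1/page1.aspx`. Adding the entry `{"/folder1/page1": "/httpdocs/folder1/page1.aspx"}` gives the same handler a public URL without the extension.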
One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics.
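As a minimal sketch of sharding by primary key, the function below maps each key to one of a fixed number of databases with a stable hash. The name `shard_for` and the choice of CRC32 are assumptions for the example, not a description of how any particular site does it.

```python
import zlib


def shard_for(primary_key, shard_count):
    """Map a primary key to a shard index using a stable hash."""
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return digest % shard_count
```

A stable hash is used instead of Python's built-in `hash` so that every process and every restart sends the same key to the same shard. One known drawback of the plain modulo scheme is that changing `shard_count` reassigns most keys.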

One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the “People you may know” feature may only need to be updated nightly (my guess, I don’t actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.

7. The server sends back an HTML response

Here is the response that the server generated and sent back:

HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT

2b3��������T�n�@����[...]

The entire response is 36 kB, the bulk of it in the byte blob at the end that I trimmed.

The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you’ll see the HTML you’d expect:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en" id="facebook" class=" no_js">
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-language" content="en" />
...

In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.
Notice the header that sets Content-Type to text/html. The header instructs the browser to render the response content as HTML, instead of say downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.
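The decoding that the browser performs on this response body can be reproduced with Python's standard library. The response uses Transfer-Encoding: chunked, so the `2b3` in front of the blob is presumably the hexadecimal size of the first chunk. The sketch below first reassembles the chunks and then decompresses the result. The function names are invented for the example, and trailers after the last chunk are ignored.

```python
import gzip


def decode_chunked(body):
    """Decode an HTTP/1.1 chunked body (bytes) into the raw payload."""
    payload = b""
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Chunk size is hexadecimal; anything after ";" is a chunk extension.
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return payload              # a zero-size chunk marks the end
        start = line_end + 2
        payload += body[start:start + size]
        pos = start + size + 2          # skip the chunk data and its CRLF


def decode_body(body):
    """Undo Transfer-Encoding: chunked, then Content-Encoding: gzip."""
    return gzip.decompress(decode_chunked(body))
```

The order matters: chunking is applied to the already compressed bytes, so it has to be removed before `gzip.decompress` can run.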

8. The browser begins rendering the HTML

Even before the browser has received the entire HTML document, it begins rendering the website:

9. The browser sends requests for objects embedded in HTML

As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.
  • Images
    http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
    http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
  • CSS style sheets
    http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
    http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
  • JavaScript files
    http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
    http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
Each of these URLs will go through a process similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.

However, static files – unlike dynamic pages – allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number: when the browser later revalidates a file, it sends the ETag of its cached copy, and if the file has not changed the server can answer with a short 304 Not Modified response instead of sending the file again.
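Both caching decisions fit in a few lines of Python. This is a simplified sketch with made-up function names. Real browsers also honor Cache-Control and other headers, which are left out here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """True if a cached copy may be used without contacting the server."""
    return now < parsedate_to_datetime(expires_header)


def revalidate(if_none_match, current_etag):
    """Server side of a conditional GET: 304 if the client's copy is current."""
    return 304 if if_none_match == current_etag else 200
```

Here `now` is expected to be a timezone-aware datetime, for example `datetime.now(timezone.utc)`. An Expires date in the past, like the one Facebook sends for its dynamic pages, makes `is_fresh` return False on every request.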

Can you guess what “fbcdn.net” in the URLs stands for? A safe bet is that it means “Facebook content delivery network”. Facebook uses a content delivery network (CDN) to distribute static content – images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.

Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CDN themselves. For example, Facebook’s static files are hosted by Akamai, the largest CDN provider.

As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the host a couple of times, you may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.

10. The browser sends further asynchronous (AJAX) requests

In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.

For example, Facebook chat will continue to update the list of your logged-in friends as they come and go. To update that list, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.

This pattern is sometimes referred to as “AJAX”, which stands for “Asynchronous JavaScript And XML”, even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.

Among other things, the Fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only can you observe the requests passively, but you can also modify and resend them. The fact that it is this easy to “spoof” AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don’t cheat that way.)

Facebook chat provides an example of an interesting problem with AJAX: pushing data from server to client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages arrived.

Long polling is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it holds the request open instead of responding right away. If a message for this client arrives within the timeout period, the server finds the outstanding request and returns the message in the response.
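The server side of long polling can be sketched with a blocking queue. The class name `LongPollChannel` and its methods are assumptions for this example. A real chat server would keep one such mailbox per connected client and tie `poll` to an open HTTP request.

```python
import queue


class LongPollChannel:
    """Per-client mailbox: a poll blocks until a message arrives or times out."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, message):
        """Called when a new message for this client arrives."""
        self._messages.put(message)

    def poll(self, timeout_seconds):
        """Return the next message, or None if none arrives in time."""
        try:
            return self._messages.get(timeout=timeout_seconds)
        except queue.Empty:
            return None
```

A `None` result corresponds to the timeout case, after which the client simply issues the next poll. Compared with polling every few seconds, the server answers only when it has something to say or when the timeout expires.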

Conclusion

Hopefully this gives you a better idea of how the different web pieces work together.

Read more of Igor Ostrovsky's articles:
Gallery of processor cache effects
Human heart is a Turing machine, research on XBox 360 shows. Wait, what?
Self-printing Game of Life in C#!
Skip lists are fascinating

Trinity - A M$ Research Area

Trinity is a graph database and computation platform over distributed memory cloud. As a database, it provides features such as highly concurrent query processing, transaction, consistency control. As a computation platform, it provides synchronous and asynchronous batch-mode computations on large scale graphs. Trinity can be deployed on one machine or hundreds of machines.

A graph is an abstract data structure with high expressive power. Many real-life applications can be modeled by graphs, including biological networks, the semantic web and social networks. Thus, a graph engine is important to many applications. Currently, there are several players in this field, including Neo4j, HyperGraphDB, InfiniteGraph, etc. Neo4j is a disk-based transactional graph database. HyperGraphDB is built on the key/value store Berkeley DB. InfiniteGraph is a distributed system for large graph data analysis.

In 2009, Google announced Pregel as its large scale graph processing platform. Pregel is a batch system, and it does not support online query processing or graph serving. In comparison, Trinity supports both online query and offline batch processing. Furthermore, batch processing in Pregel is strictly synchronized, while Trinity supports asynchronous computation for better performance.

Features of Trinity

  • Data model: hypergraph.
  • Distributed: Trinity can be deployed on one machine or hundreds of machines.
  • A graph database: Trinity is a memory-based graph store with rich database features, including highly concurrent online query processing, ACI transaction support, etc. Currently, Trinity provides C# APIs to the user for graph processing.
  • A parallel graph processing system: Trinity supports large scale, offline batch processing. Both synchronous and asynchronous batch computation are supported.

Graph Model

Trinity adopts the hypergraph model. The difference between a simple graph and a hypergraph is that an edge in a hypergraph (called hyperedge) connects an arbitrary number of nodes, while an edge in a simple graph connects two nodes only.

Hypergraphs are more general than simple graphs:
  • A hypergraph model is more intuitive to many applications, because many relationships are not one-to-one relationships.
  • Some multilateral relationships cannot easily be modeled by simple graphs. Naïve modeling by simple graphs often leads to information loss.
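To make the model concrete, here is a minimal hypergraph in Python. It only illustrates the data structure and is not Trinity's API (Trinity exposes C# APIs). The class and method names are invented for the example.

```python
class Hypergraph:
    """Minimal hypergraph: each hyperedge is a named set of nodes."""

    def __init__(self):
        self.edges = {}        # edge name -> frozenset of member nodes
        self.incidence = {}    # node -> set of names of edges containing it

    def add_edge(self, name, nodes):
        members = frozenset(nodes)
        self.edges[name] = members
        for node in members:
            self.incidence.setdefault(node, set()).add(name)

    def neighbors(self, node):
        """Nodes that share at least one hyperedge with `node`."""
        result = set()
        for name in self.incidence.get(node, ()):
            result |= self.edges[name]
        result.discard(node)
        return result
```

A three-author paper can be stored as one hyperedge such as `add_edge("paper1", ["ann", "bob", "cy"])`. A simple graph would need three pairwise edges and would lose the fact that all three took part in the same relationship.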

Trinity is a Distributed Graph Database

A graph database should support some essential database features, such as indexing for query, transactions, concurrency control and consistency maintenance.

Trinity supports content-rich graphs. Each node (or edge) is associated with a set of data, or a set of key/value pairs. In other words, nodes and edges in Trinity are of heterogeneous types.

Trinity is optimized for concurrent online query processing. When deployed on a single machine, Trinity can access 1,000,000 nodes in one second (e.g., when performing BFS). When deployed over a network, the speed is affected by network latency. Trinity provides a graph partitioning mechanism to minimize latency. We are deploying Trinity on infiniband networks, and we will report results soon.

To support highly efficient online query processing, Trinity deploys various types of indices. Currently, we provide trie and hash for accessing node/edge names and key/value pairs associated with nodes/edges. We are implementing structural index for subgraph matching.

Trinity also provides support for concurrent updates on graphs. It implements transactions, concurrency control, and consistency.

Trinity does not yet have a graph query language. Graph accesses are performed through C# APIs. We are designing a high level query language for Trinity.

Trinity is a Distributed Parallel Platform for Graph Data

Many operations on graphs are carried out in batch mode, for example, PageRank, shortest path discovery, frequent subgraph mining, random walk, graph partitioning, etc.

Like Google's Pregel, Trinity supports node-based parallel processing on graphs. Through a web portal, the user provides a script (currently C# code or a DLL) to specify the computation to be carried out on a single node, including what messages it passes to its neighbors. The system will carry out the computation in parallel.

Unlike Google's Pregel, operations on nodes do not have to be conducted in strictly synchronous manner. Certain operations (e.g., shortest path discovery) can be performed in an asynchronous mode for better performance.

As an example, here is the code for synchronous shortest path search (pseudocode, C# code), and here is the code for asynchronous shortest path search (pseudocode, C# code).
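The difference between the two modes can be illustrated with a single-process Python sketch. This is not Trinity's code or API. It only mimics the control flow, and it assumes `graph` maps every node to a list of `(neighbor, weight)` pairs with non-negative weights.

```python
from collections import deque


def sssp_synchronous(graph, source):
    """Superstep model: messages sent in round k are delivered in round k+1."""
    dist = {node: float("inf") for node in graph}
    inbox = {source: [0]}
    while inbox:
        outbox = {}
        for node, messages in inbox.items():
            best = min(messages)
            if best < dist[node]:
                dist[node] = best
                for neighbor, weight in graph[node]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox             # barrier: the next superstep starts here
    return dist


def sssp_asynchronous(graph, source):
    """No barrier: each update is applied as soon as it is dequeued."""
    dist = {node: float("inf") for node in graph}
    pending = deque([(source, 0)])
    while pending:
        node, candidate = pending.popleft()
        if candidate < dist[node]:
            dist[node] = candidate
            for neighbor, weight in graph[node]:
                pending.append((neighbor, candidate + weight))
    return dist
```

Both functions compute the same distances. In a distributed setting the synchronous version makes every machine wait at the barrier after each round, while the asynchronous version lets a node act on a better distance as soon as it arrives.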

We are also designing a high level language so that users can write their scripts with ease.

Trinity Architecture

Trinity is based on a memory cloud. It uses memory as the main storage, and disk is used only as backup storage.

Applications

As more and more applications handle graph data, we expect Trinity will have many applications. Currently, Trinity is supporting the following two applications: Probase (a research prototype) and AEther (a production system). If your applications require graph engine support, please let us know.
Trinity is the infrastructure of Probase, a large-scale knowledge base automatically acquired from the web. Probase has millions of nodes (representing concepts) and edges (representing relationships). Hypergraphs are more appropriate than simple graphs for modeling knowledge. Trinity is used for: 1) taxonomy building; 2) data integration (e.g. adding Freebase data into Probase); 3) querying Probase.
Microsoft Bing’s AEther project now uses Trinity for managing AEther’s experimental data, which consists of a large number of workflows and the evolutions among them. Trinity is the backend graph storage engine of AEther's workflow management system. We are adding more functionality, in particular subgraph matching and frequent subgraph mining, to support the project.

Project Contact

Bin Shao (binshao@microsoft.com)
Haixun Wang (haixunw@microsoft.com)

3 Free Tools to Plan and Visualise Your Start-Up Business

If you’ve decided to take the plunge, abandoning the 9-to-5 rat race to launch out on your own, the first step to getting your start-up off the ground is to create a business model. This can be a very daunting task, and rather than start with a completely blank canvas, there are several free online tools which can help guide you through the initial steps.

Whether you’re a seasoned entrepreneur or new to the world of business, these tools will come in handy. All you need to bring to the table is your concept to create a business plan, the first step in taking it from an idea to reality. These tools can be used independently of one another, or you can choose to combine and tailor them to suit your personal needs.

Business Model Canvas
One of the best known tools for creating a visual business model comes courtesy of Alexander Osterwalder. Accounting for all of the essential elements included in any business plan, he has provided an easy-to-use business plan template and a guide to the information to be included.

The canvas can be downloaded as a PDF from his website and an iPad application is currently in the works. He also provides a blog post on how to use the canvas in a working session.

The business plan template is divided into 9 sections, each accompanied by a short series of questions making it easier to fill out the information. The sections include key partners, activities, cost structure and revenue streams, amongst others.

PlanCruncher
PlanCruncher is a free, no-registration-required service which is perfect for the budding entrepreneur who needs a step-by-step guide on how to put together a visual presentation.
  • The first step in PlanCruncher is to introduce your start-up. Choose a name, and describe your pitch.
  • Determine what kind of business idea you’re bringing to the table, and whether you want to use a non-disclosure agreement.
  • The next step is to introduce your team and their capabilities.
  • Next, describe the current state of your product, and determine the product’s intellectual property status.
  • Next, describe your revenue model.
  • Then determine the kind of funding you need.
  • Select the kind of partnership you are seeking and the share you are willing to offer.
  • Finally, enter your contact information and any additional comments you feel are necessary to include in your plan. You can also choose to send a copy of your business plan to PlanCruncher where it will be shared with investors who could eventually contact you. They do include a disclaimer that you should not submit any information you consider confidential or proprietary, and they do not accept responsibility for protecting against misuse or disclosure of any confidential or proprietary information, which is a little unsettling when putting your business concept in their hands.
Once you generate the business plan, right click the link that reads PDF business plan summary and click ‘Save link as…’ to save the document to your computer.

The final product will look a little something like this.

It’s worth mentioning that it includes a footer stating that the document was generated using PlanCruncher. If you would rather not include the footer or submit your idea to a third party site, you can download the icons and put together the presentation yourself.

Startup Toolkit
The Startup Toolkit is a free service that allows you to create a canvas visually describing your business model.
After signing up for an account, rather than provide step by step instructions, you are presented with a canvas to be filled in as you see fit.
In addition to creating a canvas describing your business model, you also have access to a ‘Risk Dashboard’, a to-do list for your business risks and leaps of faith.
There are three canvases to choose from.
  • The Startup Canvas, which focuses on finding and resolving early startup risks.
  • The Lean Canvas, which focuses on the product and the customer equally.
  • And lastly, the Business Model Canvas seen earlier, developed by Osterwalder.
Each canvas provides you with a guideline and questions to answer for each section.
After you have entered all the information on your startup, you can save a snapshot to return to later, but the site does not provide any easy way to export it as a document, so it is better suited for internal or collaborative use only.
If you want to share the canvas with other members of your team, you can invite them via email either to view or edit the information.
The Risk Dashboard is where you can enter your leap of faith (what are the major beliefs and assumptions your business is built on?) and your hypothesis. After saving the information, you can then fill in the actual results of your experiment to test the hypothesis, and your insight and course correction.

Do you have any tips on how to get your business concept down on paper? Have you used any of these techniques? Let us know how they worked out for you in the comments.