
Build A Basic Web Crawler To Pull Information From A Website

Web Crawlers, sometimes called scrapers, automatically scan the Internet attempting to glean context and meaning of the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links – then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.

Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using the most common programming language of the internet – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume a very basic knowledge of HTML, though – enough that you understand how a link or image is added to an HTML document.

Before we start, you will need a server to run PHP. You have a number of options here:
• If you host your own blog using WordPress, you already have one, so upload the files you write via FTP and run them from there. Matt showed us some free FTP clients for Windows you could use.
• If you don’t have a web server but do have an old PC sitting around, then you could follow Dave’s tutorial here to turn an old PC into a web server.
• Just one computer? Don’t worry – Jeffry showed us how we can run a local server inside of Windows or Mac.

Getting Started
We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:
<?php
include_once('simple_html_dom.php');
phpinfo();
?>

Access the file through your internet browser. If you don’t have a server set up, you can still run the program from my server if you want. If everything has gone right, you should see a big page of random debug and server information printed out like below – all from the little line of code! It’s not really what we’re after, but at least we know everything is working.
The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, any commands we have must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
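To see how PHP and plain HTML can sit side by side in the same file, here is a minimal sketch. The greeting text and variable name are just placeholders for illustration, not part of the crawler we are building:

<html>
<body>
<h1>My first PHP page</h1>
<?php
// Everything between the PHP tags runs on the server;
// the browser only ever receives the finished HTML.
$greeting = "Hello from PHP!"; // each statement ends with a semicolon
echo $greeting;
?>
</body>
</html>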

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.
<?php
include_once('simple_html_dom.php');

// The page we want to crawl
$target_url = "http://www.tokyobit.com/";

// Create a DOM object and load the target page into it
$html = new simple_html_dom();
$html->load_file($target_url);

// Find every <a> element on the page and print out its href attribute
foreach($html->find('a') as $link){
    echo $link->href."<br />";
}
?>

Again, you can run that from my server too if you don’t have your own set up. You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you’re linking to, but that’s outside the scope of this tutorial.

If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.
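Incidentally, if you are curious about the internal-versus-external distinction mentioned above, here is a minimal sketch of one way to keep only the links that point away from the target site. The parse_url() check and the variable names are my own additions, not part of the original script:

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$target_host = parse_url($target_url, PHP_URL_HOST);

$html = new simple_html_dom();
$html->load_file($target_url);

foreach($html->find('a') as $link){
    $href = $link->href;
    $host = parse_url($href, PHP_URL_HOST);
    // Keep only absolute links whose host differs from the target site's host
    if($host && $host != $target_host){
        echo $href."<br />";
    }
}
?>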

That second example was quite a jump from the first, so let’s go through it in pseudo-code to make sure you understand what’s going on.

Include once the simple HTML DOM helper file.
Set the target URL as http://www.tokyobit.com.
Create a new simple HTML DOM object to store the target page
Load our target URL into that object
For each link <a> that we find on the target page
- Print out the HREF attribute


That’s it for today, but if you’d like a bit of a challenge – try to modify the second example so that instead of searching for links (a elements), it grabs images instead (img). Remember, an image specifies its URL with the src attribute, not href.

Would you like to learn more? Let me know in the comments if you’re interested in reading a part 2 (complete with homework solution!), or even if you’d like a back-to-basics PHP tutorial – and I’ll rustle one up next time for you. I warn you though – once you get started with programming in PHP, you’ll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you.
Programming is fun.

This is part 2 in a series I started last time about how to build a web crawler in PHP. Previously I introduced the Simple HTML DOM helper file, as well as showing you how incredibly simple it was to grab all the links from a webpage, a common task for search engines like Google.

If you read part 1 and followed along, you’ll know I set some homework to adjust the script to grab images instead of links.

I dropped some pretty big hints, but if you didn’t get it or if you couldn’t get your code to run right, then here is the solution. I added an additional line to output the actual images themselves as well, rather than just the source address of the image.
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

$html = new simple_html_dom();
$html->load_file($target_url);

// Find every <img> element, then print its source URL followed by the image itself
foreach($html->find('img') as $img)
{
    echo $img->src."<br />";
    echo $img."<br />";
}
?>

This should output something like this:
Of course, the results are far from elegant, but it does work. Notice that the script is only capable of grabbing images that appear in the content of the page as <img> tags – a lot of the page design elements are hard-coded into the CSS, so our script can’t grab those. Again, you can run this on my server at this URL if you wish, but to enter your own target site you’ll have to edit the code and run it on your own server as I explained in part 1. At this point, bear in mind that downloading images from a website puts significantly more stress on the server than simply grabbing text links, so only try the script on your own blog or mine, and try not to refresh it lots of times.
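One way to be kinder to the target server (this is my own addition, not part of the original script) is to cache the downloaded page in a local file and only fetch it again when the local copy is more than, say, an hour old:

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";
$cache_file = "cache.html";   // local copy of the target page (hypothetical filename)
$max_age    = 3600;           // re-fetch after one hour

// Only hit the remote server if the cache is missing or stale
if(!file_exists($cache_file) || time() - filemtime($cache_file) > $max_age){
    file_put_contents($cache_file, file_get_contents($target_url));
}

$html = new simple_html_dom();
$html->load_file($cache_file);   // load_file() works on local files as well as URLs

foreach($html->find('img') as $img){
    echo $img->src."<br />";
}
?>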

Let’s move on and be a little more adventurous. We’re going to build upon our original file, and instead of just grabbing all the links randomly, we’re going to make it do something more useful by getting the post content instead. We can do this quite easily because standard WordPress wraps the post content within a <div class="post"> tag, so all we need to do is grab any div with that class, and output them – effectively stripping everything except the main content out of the original site. Here is our initial code:
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

$html = new simple_html_dom();
$html->load_file($target_url);

// Find every div with the class "post" and output it
foreach($html->find('div[class=post]') as $post)
{
    echo $post."<br />";
}
?>

You can see the output by running the script from here (forgive the slowness, my site is hosted at GoDaddy and they don’t scale very well at all), but it doesn’t contain any of the original design – it is literally just the content.
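As a small aside (my own addition, not from the original article), every element returned by Simple HTML DOM also exposes a plaintext property, so if you wanted just the text of each post with all of the HTML stripped out, a variation like this would do it:

<?php
include_once('simple_html_dom.php');

$html = new simple_html_dom();
$html->load_file("http://www.tokyobit.com");

foreach($html->find('div[class=post]') as $post)
{
    // plaintext gives the element's text content with all of the tags removed
    echo $post->plaintext."<br />";
}
?>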

Let me show you another cool feature now – the ability to delete elements of the page that we don’t like. For instance, I find the meta data quite annoying – like the date and author name – so I’ve added some more code that finds those bits (identified by various classes of div such as post-date, post-info, and meta). I’ve also added a simple CSS style-sheet to format the output a little. Daniel covered a number of great places to learn CSS online if you’re not familiar with it.

As I mentioned in part 1, even though the file contains PHP code, we can still add standard HTML or CSS to the page and the browser will understand it just fine – the PHP code is run on the server, then everything is sent to the browser, to you, as standard HTML. Anyway, here’s the whole final code:
<head>
<style type="text/css">
div.post{background-color: gray;border-radius: 10px;-moz-border-radius: 10px;padding:20px;}
img{float:left;border:0px;padding-right: 10px;padding-bottom: 10px;}
body{width:60%;font-family: verdana,tahoma,sans-serif;margin-left:20%;}
a{text-decoration:none;color:lime;}
</style>
</head>

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

$html = new simple_html_dom();
$html->load_file($target_url);

// Grab each post div, blank out the bits we don't want, then output what's left
foreach($html->find('div[class=post]') as $post)
{
    // Setting outertext to an empty string removes that element from the output
    $post->find('div[class=post-date]',0)->outertext = '';
    $post->find('div[class=post-info]',0)->outertext = '';
    $post->find('div[class=meta]',0)->outertext = '';

    echo $post."<br />";
}
?>

You can check out the results here. Pretty impressive, huh? We’ve taken the content of the original page, got rid of a few bits we didn’t want, and completely reformatted it in the style we like! And more than that, the process is now automated, so if new content were published, it would automatically show up in our script’s output too.

That’s only a fraction of the power available to you, though. You can read the full manual online here if you’d like to explore the PHP Simple HTML DOM helper a little more and see how it greatly aids and simplifies the web crawling process. It’s a great way to take your knowledge of basic HTML up to the next, dynamic level.
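To give a small taste of what else is in there (these particular selectors are my own examples, not from the article), find() also accepts descendant selectors and an index argument for grabbing a single element, alongside the plaintext property used above:

<?php
include_once('simple_html_dom.php');

$html = new simple_html_dom();
$html->load_file("http://www.tokyobit.com");

// Descendant selector: every link that sits inside a post div
foreach($html->find('div[class=post] a') as $link){
    echo $link->plaintext." -> ".$link->href."<br />";
}

// The second argument to find() is an index; 0 grabs just the first match
$first_heading = $html->find('h2', 0);
if($first_heading){
    echo $first_heading->plaintext."<br />";
}
?>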

What could you use this for though? Well, let’s say you own lots of websites and wanted to gather all the contents onto a single site. You could copy and paste the contents every time you update each site, or you could just do it all automatically with this script. Personally, even though I may never use it, I found the script to be a useful exercise in understanding the underlying structure of modern internet documents. It also exposes how simple it is to re-use content when everything is published on a similar system using the same semantics.
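As a rough sketch of that idea (the site list below is purely hypothetical, and you should only do this with sites you own), you could loop the same post-grabbing logic over several target URLs:

<?php
include_once('simple_html_dom.php');

// Hypothetical list of your own WordPress sites to aggregate
$my_sites = array(
    "http://www.tokyobit.com",
    "http://example.com/blog"
);

foreach($my_sites as $site){
    $html = new simple_html_dom();
    $html->load_file($site);

    echo "<h2>Posts from ".$site."</h2>";

    // Same trick as before: pull out each post div and print it
    foreach($html->find('div[class=post]') as $post){
        echo $post."<br />";
    }

    // Free the memory used by this page before moving on to the next one
    $html->clear();
}
?>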

What do you think? Again, do let me know in the comments if you’d like to learn some more basic web programming, as I feel like I’ve started you off on level 5 and skipped the first 4! Did you follow along and try yourself, or did you find it a little too confusing? Would you like to learn more about some of the other technologies behind the modern internet browsing experience?

If you’d prefer learning to program on the desktop side of things, Bakari covered some great beginner resources for learning Cocoa Mac OSX desktop programming at the start of the year, and our featured directory app CodeFetch is useful for any programming language. Remember, skills you develop programming in any language can be used across the board.


James is a web developer and SEO consultant who currently lives in the quaint little English town of Surbiton with his Chinese wife. He speaks fluent Japanese and PHP, and when he isn't burning the midnight oil on MakeUseOf articles, he's burning it on iPad and iPhone board game reviews or random tech tutorials instead.