November 22, 2008
A short introduction to Web Page Scraping
While we all favour using an API in production applications, there are plenty of situations, such as ‘mashups’ (I love how that term has been reappropriated from Jungle music), where you need to do some page scraping.
It’s occurred to me that these very easy techniques seem inaccessible to many people, so I thought I’d post a few bits and bobs about some basic scraping methods.
Here’s a bit of code I wrote that uses PHP’s DOMDocument class to treat an HTML page as XML and fetch, in this case, the incredibly useful current world population… fantastic!
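<?php
// where to find population data...
$location = array(
    'url' => 'http://www.census.gov/ipc/www/popclockworld.html',
    'id'  => 'worldnumber',
);

// initialise a new document and fetch the page
$d = new DOMDocument();
$file = file_get_contents($location['url']);
if ($file === false) {
    die('Could not fetch ' . $location['url']);
}

// real-world HTML is rarely valid, so collect parser warnings quietly
libxml_use_internal_errors(true);
$d->loadHTML($file);

// get and print current world population
$e = $d->getElementById($location['id']);
print $e ? $e->nodeValue : 'Element not found';
?>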
Sample output: 6,738,610,278
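If you’d like to try the technique without depending on a live page, here’s a minimal sketch that runs the same DOMDocument lookup against a string of HTML. The helper name scrapeById and the sample markup are my own assumptions for illustration, not anything taken from the census page, and the type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (an assumption for this sketch): returns the trimmed
// text of the element with the given id, or null if there is no such element.
function scrapeById(string $html, string $id): ?string
{
    $doc = new DOMDocument();

    // Scraped HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $doc->getElementById($id);
    return $element === null ? null : trim($element->nodeValue);
}

// Made-up sample markup standing in for the fetched page.
$sampleHtml = '<html><body><p>World: <span id="worldnumber"> 6,738,610,278 </span></p></body></html>';

$population = scrapeById($sampleHtml, 'worldnumber');
$missing    = scrapeById($sampleHtml, 'nosuchid');

print $population === null ? "Element not found\n" : $population . "\n";
```

With the sample markup above this prints 6,738,610,278, and $missing comes back as null because no element carries that id. To scrape a real page you would pass the result of file_get_contents() in as $html, after checking that it isn’t false.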