Acumen Development

Basic Scraping

November 22, 2008
categories: Development Processes, Open Source
tags: mashups, php, scraping
A short introduction to Web Page Scraping

In production applications we all favour using an API, but there are plenty of situations, such as ‘Mashups’ (I love how that term has been reappropriated from Jungle music), where you need to do some page scraping.

It’s occurred to me that these very easy techniques seem inaccessible to many people, so I thought I’d post a few bits and bobs about some basic scraping methods.

Here’s a bit of code I wrote that uses PHP’s DOMDocument class to parse an HTML page into a DOM tree, much as you would with XML, and then look up an element by its id to fetch, in this case, the incredibly useful current world population… fantastic!

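<?php
 // where to find population data...
 $location['url'] = 'http://www.census.gov/ipc/www/popclockworld.html';
 $location['id'] = 'worldnumber';

 // fetch the page, stopping if the request fails
 $file = file_get_contents($location['url']);
 if ($file === false) {
  die('Could not fetch ' . $location['url']);
 }

 // real-world HTML is rarely valid, so collect parser warnings instead of printing them
 libxml_use_internal_errors(true);
 $d = new DOMDocument();
 $d->loadHTML($file);

 // get and print current world population
 $e = $d->getElementById($location['id']);
 if ($e === null) {
  die('No element with id ' . $location['id']);
 }
 print $e->nodeValue;
?>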

Sample output: 6,738,610,278
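
If you want to try the same lookup without hitting a live page, here is a minimal sketch that runs against an inline HTML string. The helper name text_by_id and the sample markup are made up for illustration and are not part of the original snippet. The type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the original post): return the trimmed text of
// the element with the given id, or null if there is no such element.
function text_by_id(string $html, string $id): ?string
{
    // Nothing to parse; recent PHP versions reject an empty string in loadHTML().
    if ($html === '') {
        return null;
    }

    // Collect parser warnings instead of printing them; scraped HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $d = new DOMDocument();
    $loaded = $d->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $e = $d->getElementById($id);
    return $e === null ? null : trim($e->nodeValue);
}

// Made-up sample markup standing in for a fetched page.
$sample = '<html><body><p>World: <span id="worldnumber"> 6,738,610,278 </span></p></body></html>';

print text_by_id($sample, 'worldnumber') . "\n";
var_dump(text_by_id($sample, 'missing'));
```

Returning null for a missing element lets the caller decide what to do when a page changes its markup, which is the most common way a scraper breaks.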