Last year, I worked on two projects that required automated Web scraping, that is, aggregating content from Web pages. I evaluated different methods for Web scraping, with varying levels of success, owing to the changing structure of Web pages, pages that are not well-formed and URL redirects.
Besides regular expressions and DOM (Document Object Model) parsing, I also used XPath. XPath works great for well-formed Web pages. An HTML Web page is called well-formed when all the opening tags have corresponding closing tags and the tags are nested properly (refer to this link for more). A well-formed HTML page is, loosely speaking, an XHTML page.
XPath is a query language for accessing content on a well-formed page, XHTML or XML. All the content in a Web page lies within HTML elements or tag pairs. The following is needed to extract the content of interest from a Web page using XPath:
An XPath expression can look as cryptic as .//*[@id='home_featured']/div. This is where the Firefox web browser with a plugin called FireXPath helps (as explained below). The second and third requirements are met by PHP (in my case), which is used for automated Web scraping using the XPath expression.
Click and build XPath expressions
Building the cryptic XPath expression is easy and intuitive with Firefox and a couple of its plugins – namely Firebug and FireXPath. Install Firebug from here, followed by FireXPath from here and restart Firefox. As of this writing, I’m using the following versions of Firefox and the plugins:
After the Firefox restart, browse to the Web page of interest. Next:
For example, the screenshot below shows the XPath expression (.//*[@id='home_featured']/div) for the “Featured Post” chunk on the home page of this blog:
A few examples of XPath expressions:
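As a minimal sketch, the following PHP snippet runs a handful of expressions against a small made-up page and prints what each one selects. The markup, ids and class names in it are assumptions chosen for illustration, not taken from any real site.

```php
<?php
// Hypothetical sample page, invented for this illustration.
$html = <<<HTML
<html><body>
  <div id="home_featured">
    <div><h2 class="post-title">Featured Post</h2></div>
  </div>
  <ul id="links">
    <li><a href="/first">First</a></li>
    <li><a href="/second">Second</a></li>
  </ul>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath expression => what it selects in the sample page.
$expressions = array(
    ".//*[@id='home_featured']/div"  => 'div children of the element with id "home_featured"',
    "//ul[@id='links']/li[1]/a"      => 'link inside the first list item',
    "//a/@href"                      => 'href attribute of every link',
    "//h2[contains(@class,'title')]" => 'h2 elements whose class contains "title"',
);

foreach ($expressions as $expression => $meaning) {
    $values = array();
    foreach ($xpath->query($expression) as $node) {
        // textContent works for elements and attribute nodes alike
        $values[] = trim($node->textContent);
    }
    printf("%s\n  (%s)\n  => %s\n", $expression, $meaning, implode(' | ', $values));
}
```

Note that positions in XPath start at 1, so li[1] is the first list item, and that a leading // searches the whole document regardless of depth.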
Automated scraping using PHP
With the XPath expression in hand, a PHP script such as the following can extract the required content from the Web page:
<?php
class WebScrap
{
    private $url;
    private $xpath;

    // __construct replaces the old class-named constructor, which PHP 8 no longer treats as a constructor
    public function __construct($url, $xpath)
    {
        $this->url = $url;
        $this->xpath = $xpath;
    }

    public function GetScrap()
    {
        // use Tidy to try to make the page well formed
        $page = $this->TidyIt($this->url);
        // create a document out of the well formed content
        $domDocument = new DOMDocument();
        // @ silences warnings about markup that Tidy could not fully repair
        @$domDocument->loadHTML($page);
        // create an XPath object out of the document and query it for the supplied xpath
        $domXPath = new DOMXPath($domDocument);
        $domNodeList = $domXPath->query($this->xpath);
        // Get the content (HTML) out of the NodeList returned by the DOMXPath::query
        $content = $this->GetHTMLFromNodeList($domNodeList);
        return $content;
    }

    private function TidyIt($url)
    {
        $tidy = new tidy();
        $tidy->parseFile($url);
        $tidy->cleanRepair();
        // return the repaired markup as a string, which is what loadHTML expects
        return tidy_get_output($tidy);
    }

    private function GetHTMLFromNodeList($domNodeList)
    {
        // the query failed or nothing matched the XPath expression
        if ($domNodeList === false || $domNodeList->length === 0) {
            return '';
        }
        $domDocument = new DOMDocument();
        // only the first matching node is used
        $node = $domNodeList->item(0);
        foreach ($node->childNodes as $childNode) {
            $domDocument->appendChild($domDocument->importNode($childNode, true));
        }
        return $domDocument->saveHTML();
    }
}
?>
Note that the GetScrap( ) method of the WebScrap PHP class first calls the TidyIt( ) method, which uses the Tidy library to repair the HTML (if required) so that it is well-formed. The PHP class for Tidy fetches the Web page from the given URL and repairs it. The DOMXPath object is then used to query the well-formed page content with the XPath expression.
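To see why the repair step matters, here is a small sketch, separate from the script above, that feeds a deliberately malformed fragment to DOMDocument and lists what the parser reports. It relies on libxml's own error collection rather than Tidy, so it needs only the XML module; the fragment and its id are made up for illustration, and the exact messages depend on the installed libxml version.

```php
<?php
// Hypothetical malformed fragment: <b> is never closed and </span> has no opening tag.
$malformed = '<html><body><div id="box"><p>Hello <b>world</p></div></span></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($malformed);

// Print whatever the parser complained about.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

// The parser still recovers a usable tree, so the query runs on the repaired structure.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(".//*[@id='box']/p");
print($nodes->length . " match(es): " . trim($nodes->item(0)->textContent) . "\n");
```

The recovered tree is the parser's best guess, which is why running the page through Tidy first, as GetScrap( ) does, tends to give more predictable XPath results on badly broken pages.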
To use the above code, you will need to install PHP modules for XML and Tidy. On a RedHat/CentOS/Fedora Linux machine, these modules can be installed using the following command:
yum install php-xml php-tidy
Save the above code in a file named, say, class.WebScrap.php. The WebScrap class can then be used as follows:
<?php
include("class.WebScrap.php");
$scrap = new WebScrap("http://news.google.com",".//*[@id='top-stories']/div[1]/h2/a");
print($scrap->GetScrap());
?>
The code should be self-explanatory for a seasoned PHP programmer. If not, post your questions as comments on this post. For repeated and automated Web scraping, a scheduler like Cron can be used to execute the above PHP script at regular intervals and fetch the latest content.
The Leftovers
The code above is readable, crisp and focuses on the subject. For this reason, it has deliberate exclusions. In a real-world application, you should:
Note that XPath may not always give you the structured content that you desire. For example, using the expression .//*[@id='latest_post']/span[1] (for posted and modified dates of the latest post on this blog) will result in something like the following:
<strong>Posted on:</strong> January 8, 2010 <span class="dot">⋅</span> <strong>Last modified:</strong> January 8, 2010 @ 4:49 pm
So there are tags and text (like “Posted on:” and “Last modified:”) to be stripped out to get the posted date (January 8, 2010) and the modified date (January 8, 2010 @ 4:49 pm). For this, you may still have to use regular expressions and/or string manipulation functions like explode( ) or preg_split( ) (the older split( ) function was removed in PHP 7).
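One possible way to do that clean-up, sketched with strip_tags( ) and preg_match( ). The variable names and the pattern are my own choices for this example, and the pattern assumes the labels and the dot separator appear exactly as in the fragment above.

```php
<?php
// Fragment as returned by the XPath query in the example above.
$fragment = '<strong>Posted on:</strong> January 8, 2010 <span class="dot">⋅</span> '
          . '<strong>Last modified:</strong> January 8, 2010 @ 4:49 pm';

// Drop the markup, keeping only the text.
$text = strip_tags($fragment);

// Capture whatever follows each label. The "u" modifier makes the pattern
// and subject UTF-8, so the dot separator is matched as a single character.
$posted = null;
$modified = null;
if (preg_match('/Posted on:\s*(.+?)\s*⋅\s*Last modified:\s*(.+)$/u', $text, $matches)) {
    $posted = $matches[1];
    $modified = $matches[2];
}

print("Posted: $posted\n");     // Posted: January 8, 2010
print("Modified: $modified\n"); // Modified: January 8, 2010 @ 4:49 pm
```

If the page layout changes, the pattern will simply not match and both variables stay null, which is easier to detect than silently wrong dates.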
@shekharg
Good exhaustive tutorial. Learned and implemented quite a few things from it.
Keep sharing
Is it possible to implement this technique using JavaScript? Can you guide me to some tutorials or something..
Wonderful explanation!!
Thanks a ton!!
JavaScript or Java? JavaScript will typically be confined to parsing the page in which it runs. What’s your use case? That is, what’s the application that you have in mind for which you need to use JavaScript?
Most welcome
Thanks for the comment and appreciation.
Hi, I am having a problem. I ran the same code as above but didn’t get any result. The page is blank.
I am using Fedora 9. I installed php-xml and php-tidy. Do I need to install anything else? Please help.
Hi Subh,
Turn on PHP error display in php.ini. Hint: display_errors = On
This should show you error(s), if any, in the code.
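If editing php.ini is not an option, roughly the same effect can be had at runtime. A minimal sketch, assuming these lines go at the very top of the script that includes class.WebScrap.php:

```php
<?php
// Report every error, warning and notice...
error_reporting(E_ALL);
// ...and print them in the page output instead of hiding them.
ini_set('display_errors', '1');
```

Because these calls only take effect once the script runs, a syntax error in that same file will still produce a blank page; for that case the php.ini setting is the reliable route.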
Excellent tutorial! Even though I’m using Python and not PHP, it inspired me to check out using XPath. I had previously been using regular expressions along with BeautifulSoup, but it was painstaking. Now, using the lxml bindings, along with XPather, I can turn out solid scraping functions in a fraction of the time, and they are robust without excessive testing.
If you’re into Python, I’d suggest checking this out:
http://codespeak.net/lxml/parsing.html
Hi Nick,
To be very honest, thanks for your comment, because I learnt something new today.
Also glad that this blog post helped you in some way
Need your guidance / support on automated web scraping for real time personal use.
Kindly contact..
Hi Sean,
Let me know how can I help you with this.
Hi! It’s a very nice tutorial. Can you give me a brief idea of what purposes web scraping is used for, and how? In which places will it be useful? What technologies can support this?
Thanks.
Hi Shekhar, great post thanks – I’ve been searching the internet for a while now trying to find an automated scraping technique. I’m taking your advice and posting here as I’m not a seasoned programmer and just need a quick piece of advice. I need to collect the road speeds at junctions along the M1 over a month time period. Would your above code suffice if I wanted to extract the values from this link http://www.frixo.com/m1-north.asp (bottom left table)? Thanks in advance
Hi James,
I believe the above code should work for your requirement. The XPath expression to extract the road speeds would be .//*[@id='roaddiagram']. This will give you the required data. Subsequently, you may need to strip out the HTML tags. Enjoy
Hi Shekhar,
Thanks for your help. Is there any way to contact you directly as I am still having a few problems and feel it could be something very basic, plus can show you some of the error messages I am getting. At this stage I’m still trying to get your code to work. Appreciate all help you can provide at this stage.
Thanks
James,
while I prefer that you post your queries here so that it can be of help to others too, you can e-mail me at – shekhar at it4enterprise dot com
Wow!!
Thanks for the tutorial, I am going to bookmark this.
I’ve only programmed in LISP and VBA up until now (no need for anything else, really), so I definitely have a lot to learn before I can implement your suggestions.
I wish there was good freeware for data mining– it seems like there would be as it could be a tremendous tool for students and academic researchers, as well as individual persons.
I started looking into web-scraping this evening as an alternative to trying to investigate all 600 of the colleges and universities on forbes’ list for viable distance education programs with literature degrees. Any ideas for a lazy engineering mom who secretly pines after stimulating literature analysis?
Hi Sydney,
First, let me apologize to you for the delay in responding to your comment. The reason is that I had spotted some easy-to-use tools and it was only today that I searched for and recalled them.
See if one of the following gets you started:
1. Scraper (Google Chrome)
2. Check out this blog post for screen scraping with Firefox.
Let me know if the above are useful else we will find out more such tools!