Last year, I worked on two projects that required automated Web scraping, that is, aggregating content from Web pages. I evaluated different methods for Web scraping with varying levels of success, thanks to the changing structure of Web pages, pages that are not well-formed and URL redirects.
Besides regular expressions and DOM (Document Object Model) parsing, I used XPath too. XPath works great for well-formed Web pages. An HTML Web page is called well-formed when all the opening tags have corresponding closing tags and the tags are nested properly (refer to this link for more). XHTML pages are required to be well-formed.
XPath is a query language to access content on a well-formed page, XHTML or XML. All the content in a Web page lies within HTML elements or tag pairs. The following are needed to extract the content of interest from a Web page using XPath:
An XPath expression can look as cryptic as .//*[@id='home_featured']/div. This is where the Firefox Web browser with a plugin called FireXPath helps (as explained below). The second and third requirements are met by PHP (in my case), which is used for automated Web scraping with the XPath expression.
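To make the idea of well-formedness concrete, here is a minimal sketch that checks whether a piece of markup parses as XML using PHP's DOM extension. The function name isWellFormed is my own, chosen for illustration, and the two sample strings are made up.

```php
<?php
// Hypothetical helper (not part of the scraping class in this post):
// reports whether a markup string parses as well-formed XML.
function isWellFormed(string $markup): bool
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $ok = $document->loadXML($markup);

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

var_dump(isWellFormed('<div><p>Hello</p></div>')); // tags closed and nested properly
var_dump(isWellFormed('<div><p>Hello</div>'));     // the <p> tag is never closed
```

A strict XML parser rejects the second string, which is why a repair step is useful before running XPath on real-world pages.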
Click and build XPath expressions
Building the cryptic XPath expression is easy and intuitive with Firefox and a couple of its plugins – namely Firebug and FireXPath. Install Firebug from here, followed by FireXPath from here and restart Firefox. As of this writing, I’m using the following versions of Firefox and the plugins:
After the Firefox restart, browse to the Web page of your interest. Next:
For example, the screenshot below shows the XPath expression (.//*[@id='home_featured']/div) for the “Featured Post” chunk on the home page of this blog:
A few examples of XPath expressions:
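As an illustration, the sketch below runs a handful of common expression shapes against a small, made-up HTML fragment using PHP's DOMXPath. The markup, ids and class names are assumptions for demonstration only and do not come from any real page.

```php
<?php
// Made-up markup for illustration only; the ids and classes are assumptions.
$html = <<<HTML
<html><body>
  <div id="home_featured">
    <div><h2><a href="/post-1">First post</a></h2></div>
    <div><h2><a href="/post-2">Second post</a></h2></div>
  </div>
  <p class="note">Footer note</p>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$expressions = [
    "//*[@id='home_featured']/div",       // child divs of the element with that id
    "//*[@id='home_featured']/div[1]//a", // links inside the first such div
    "//a/@href",                          // the href attribute of every link
    "//p[@class='note']",                 // paragraphs whose class is "note"
];

foreach ($expressions as $expression) {
    $nodes = $xpath->query($expression);
    echo $expression, ' => ', $nodes->length, " node(s)\n";
    foreach ($nodes as $node) {
        echo '    ', trim($node->textContent), "\n";
    }
}
```

Note that positions in XPath predicates such as div[1] start at 1, not 0.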
Automated scraping using PHP
With the XPath expression in hand, a PHP script as follows can extract the required content from the Web page:
<?php
class WebScrap
{
    private $url;
    private $xpath;

    // PHP 8 no longer treats a method named after the class as a constructor
    public function __construct($url, $xpath)
    {
        $this->url = $url;
        $this->xpath = $xpath;
    }

    public function GetScrap()
    {
        // use Tidy to try to make the page well formed
        $page = $this->TidyIt($this->url);
        // create a document out of the well formed content
        $domDocument = new DOMDocument();
        $domDocument->loadHTML($page);
        // create an XPath object out of the document and query it for the supplied xpath
        $domXPath = new DOMXPath($domDocument);
        $domNodeList = $domXPath->query($this->xpath);
        // Get the content (HTML) out of the NodeList returned by the DOMXPath::query
        return $this->GetHTMLFromNodeList($domNodeList);
    }

    private function TidyIt($url)
    {
        $tidy = new tidy();
        $tidy->parseFile($url);
        $tidy->cleanRepair();
        // return the repaired markup as a string, which is what loadHTML( ) expects
        return tidy_get_output($tidy);
    }

    private function GetHTMLFromNodeList($domNodeList)
    {
        // an invalid expression yields false; an expression with no match yields an empty list
        if ($domNodeList === false || $domNodeList->length === 0)
            return '';
        $domDocument = new DOMDocument();
        $node = $domNodeList->item(0);
        foreach ($node->childNodes as $childNode)
            $domDocument->appendChild($domDocument->importNode($childNode, true));
        return $domDocument->saveHTML();
    }
}
?>
Note that the GetScrap( ) method of the WebScrap PHP class first calls the TidyIt( ) method. This method uses the Tidy library to fix (if required) the HTML for well-formedness. The PHP class for Tidy is used to fetch the Web page via the given URL and repair it. The DOMXPath object is then used to query the well-formed Web page content for the XPath expression.
To use the above code, you will need to install PHP modules for XML and Tidy. On a RedHat/CentOS/Fedora Linux machine, these modules can be installed using the following command:
yum install php-xml php-tidy
Save the above code in a file named class.WebScrap.php (say). Subsequently, the WebScrap class can be used as:
<?php
include("class.WebScrap.php");
$scrap = new WebScrap("http://news.google.com",".//*[@id='top-stories']/div[1]/h2/a");
print($scrap->GetScrap());
?>
The code should be self-explanatory for a seasoned PHP programmer. If not, shoot your questions via comments to this post. For repeated and automated Web scraping, a scheduler like Cron can be used to execute the above PHP script at regular intervals and fetch the latest content.
The Leftovers
The code above is readable, crisp and focused on the subject. For this reason, it has deliberate exclusions. In a real-world application, you should:
Note that XPath may not always give you the structured content that you desire. For example, using the expression .//*[@id='latest_post']/span[1] (for posted and modified dates of the latest post on this blog) will result in something like the following:
<strong>Posted on:</strong> January 8, 2010 <span class="dot">⋅</span> <strong>Last modified:</strong> January 8, 2010 @ 4:49 pm
So there are tags and text (like “Posted on:” and “Last modified:”) to be stripped out to get the posted date (January 8, 2010) and the modified date (January 8, 2010 @ 4:49 pm). For this, you may still have to use regular expressions and/or string manipulation functions like explode( ) or preg_split( ) (the older split( ) function was removed in PHP 7).
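One possible way to do this clean-up is sketched below: strip the tags with strip_tags( ) and then capture the two dates with a single regular expression. The helper name extractDates and the array keys are my own, and the pattern assumes the labels and the dot separator appear exactly as in the fragment above.

```php
<?php
// The fragment quoted above, as returned by the XPath query.
$fragment = '<strong>Posted on:</strong> January 8, 2010 <span class="dot">⋅</span> '
          . '<strong>Last modified:</strong> January 8, 2010 @ 4:49 pm';

// Hypothetical helper: removes the tags, then captures the text after each label.
function extractDates(string $fragment): array
{
    // Leaves only the text: labels, dates and the dot separator.
    $text = strip_tags($fragment);

    // Assumes the labels and the separator are present in this exact order.
    $pattern = '/Posted on:\s*(.+?)\s*⋅\s*Last modified:\s*(.+)$/u';
    if (!preg_match($pattern, $text, $matches)) {
        return []; // the fragment did not have the expected shape
    }

    return ['posted' => trim($matches[1]), 'modified' => trim($matches[2])];
}

print_r(extractDates($fragment));
```

If the page layout changes, the pattern stops matching and the function returns an empty array, which is easier to detect than a silently wrong date.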
@shekharg
Good exhaustive tutorial. Learned and implemented quite a few things from it.
Keep sharing
Is it possible to implement this technique using JavaScript? Can you guide me to some tutorials or something..
Wonderful explanation!!
Thanks a ton!!
JavaScript or Java? JavaScript will be typically confined to parse the page in which it is written. What’s your use case? That is, what’s the application that you have in mind for which you need to use JavaScript?
Most welcome
Thanks for the comment and appreciation.
Hi, I am having a problem. I ran the same code as above but didn't get any result. The page is blank.
I am using Fedora 9. I installed php-xml and php-tidy. Do I need to install anything else? Please help.
Hi Subh,
Turn on PHP error display in php.ini. Hint: display_errors = On
This should show you error(s), if any, in the code.