Web Scraping with Firefox and PHP, using XPath

Last year, I faced two projects that required automated Web scraping to aggregate content from Web pages. I evaluated different methods for Web scraping, with varying levels of success, thanks to the changing structure of Web pages, pages that are not well-formed, and URL redirects.

Besides regular expressions and DOM (Document Object Model) parsing, I used XPath too. XPath works great for well-formed Web pages. An HTML Web page is called well-formed when all the opening tags have corresponding closing tags and the tags are nested properly (refer to this link for more). XHTML is the variant of HTML that requires pages to be well-formed.

XPath is a query language for accessing content on a well-formed page, XHTML or XML. All the content in a Web page lies within HTML elements, or tag pairs. The following are needed to extract the content of interest from a Web page using XPath:

  1. A well-formed Web page
  2. An XPath expression for the HTML element of interest
  3. XPath query using the expression

An XPath expression can look as cryptic as .//*[@id='home_featured']/div. This is where the Firefox web browser, with a plugin called FireXPath, comes to help (as explained below). The first and third requirements are met by PHP (in my case), which repairs the page and runs the XPath query for automated Web scraping.
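
To make the three requirements concrete, here is a minimal sketch that runs an XPath query against a small inline page instead of a live URL. The markup and the text inside it are made up for illustration; only the expression is the one quoted above.

```php
<?php

// Hypothetical, already well-formed markup standing in for a fetched page
$html = '<html><body>'
      . '<div id="home_featured"><div>Featured post teaser</div></div>'
      . '<div id="sidebar"><div>Something else</div></div>'
      . '</body></html>';

// Requirement 1: a well-formed page, loaded into a DOM document
$domDocument = new DOMDocument();
$domDocument->loadHTML($html);

// Requirements 2 and 3: the expression, and a query that uses it
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query(".//*[@id='home_featured']/div");

// Print the text of every matching element
foreach ($domNodeList as $node) {
    print($node->textContent . "\n");
}
```

Only the div nested under the element with the id home_featured matches, so the sidebar text is not printed.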

Click and build XPath expressions
Building the cryptic XPath expression is easy and intuitive with Firefox and a couple of its plugins, namely Firebug and FireXPath. Install Firebug from here, followed by FireXPath from here, and restart Firefox. As of this writing, I’m using the following versions of Firefox and the plugins:

  • Firefox 3.5.6
  • Firebug 1.4.5
  • FireXPath 0.9.1

After Firefox restarts, browse to the Web page of interest. Next:

  • Launch Firebug by clicking on the bug-like icon at the right corner of the status bar.
  • Click on the tab named XPath (in Firebug).
  • Click on the blue arrow and move your mouse over the content of interest on the Web page.
  • Once the required chunk is highlighted (with a blue border), note down the XPath expression in the text field.

For example, the screenshot below shows the XPath expression (.//*[@id='home_featured']/div) for the “Featured Post” chunk on the home page of this blog:

With FireXPath, point your mouse on the chunk/element of interest, to build an XPath expression

A few examples of XPath expressions:

  • .//*[@id='latest_post']/span[1] : The posted and modified dates of the latest post on this blog – http://www.shekhargovindarajan.com
  • .//*[@id='top-stories']/div[1]/h2/a : The headline of the top story on Google News – http://news.google.com
  • .//*[@id='mp-tfa']/p[1] : Contents of featured article on Wikipedia – http://en.wikipedia.org/wiki/Main_Page
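
To show how such expressions read, the sketch below evaluates two of them against a tiny inline document. The markup is hypothetical: the ids mirror the expressions above, but the content is invented and is not what those sites actually serve.

```php
<?php

// Hypothetical markup; the ids mirror the expressions above, the content is invented
$html = '<html><body>'
      . '<div id="latest_post"><span>first span</span><span>second span</span></div>'
      . '<div id="top-stories"><div><h2><a href="/story">Headline</a></h2></div></div>'
      . '</body></html>';

$domDocument = new DOMDocument();
$domDocument->loadHTML($html);
$domXPath = new DOMXPath($domDocument);

// [@id='...'] filters on an attribute value; [1] keeps only the first span child
$firstSpan = $domXPath->query(".//*[@id='latest_post']/span[1]")->item(0);

// each /name step walks one level down: div, then h2, then a
$link = $domXPath->query(".//*[@id='top-stories']/div[1]/h2/a")->item(0);

print($firstSpan->textContent . "\n");
print($link->textContent . " (" . $link->getAttribute('href') . ")\n");
```

Note that positions in XPath start at 1, not 0, so span[1] is the first span.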

Automated scraping using PHP
With the XPath expression in hand, a PHP script like the following can extract the required content from the Web page:

<?php

class WebScrap
	{
	private $url;
	private $xpath;

	// __construct replaces the class-named constructor, which PHP 8 no longer treats as one
	public function __construct($url,$xpath)
		{
		$this->url = $url;
		$this->xpath = $xpath;
		}

	public function GetScrap()
		{
		// use Tidy to try to make the page well formed
		$page = $this->TidyIt($this->url);

		// create a document out of the well formed content
		$domDocument=new DOMDocument();
		$domDocument->loadHTML($page);

		// create an XPath object out of the document and query it for the supplied xpath
		$domXPath = new DOMXPath($domDocument);
		$domNodeList = $domXPath->query($this->xpath);

		// Get the content (HTML) out of the NodeList returned by the DOMXPath::query
		$content = $this->GetHTMLFromNodeList($domNodeList);

		return $content;
		}

	private function TidyIt($url)
		{
		$tidy = new tidy();
		$tidy->parseFile($url);
		$tidy->cleanRepair();
		// return the repaired markup as a string for DOMDocument::loadHTML
		return tidy_get_output($tidy);
		}

	private function GetHTMLFromNodeList($domNodeList)
		{
		$domDocument = new DOMDocument();

		$node = $domNodeList->item(0);   

		foreach($node->childNodes as $childNode)
			$domDocument->appendChild($domDocument->importNode($childNode, true));

		return $domDocument->saveHTML();
		}

	}

?>

Note that the GetScrap( ) method of the WebScrap PHP class first calls the TidyIt( ) method. This method uses the Tidy library to fetch the Web page from the given URL and repair its HTML (if required) so that it is well-formed. The DOMXPath object is then used to query the well-formed page content with the XPath expression.

To use the above code, you will need to install PHP modules for XML and Tidy. On a RedHat/CentOS/Fedora Linux machine, these modules can be installed using the following command:

yum install php-xml php-tidy

Save the above code in a file named, say, class.WebScrap.php. The WebScrap class can then be used as:

<?php

include("class.WebScrap.php");
$scrap = new WebScrap("http://news.google.com",".//*[@id='top-stories']/div[1]/h2/a");
print($scrap->GetScrap());

?>

The code should be self-explanatory for a seasoned PHP programmer. If not, shoot your questions via comments to this post. For repeated and automated Web scraping, a scheduler like Cron can be used to execute the above PHP script at regular intervals and fetch the latest content.

The Leftovers
The code above is readable and crisp, and focusses on the subject. For this reason, it has deliberate exclusions. In a real-world application, you should:

  • Use the cURL library in PHP or external tools like Wget to fetch the URL, then pass the fetched content to Tidy. Hint: use Tidy's parseString( ) instead of parseFile( ).
  • Handle errors raised by Tidy and XPath.
  • Fall back to other means (say, regular expressions) when Tidy or the XPath query fails.
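
One way these points could fit together is sketched below. It is an illustration rather than a drop-in replacement for the class above: fetchPage( ) and queryPage( ) are hypothetical helper names, the 30-second timeout is arbitrary, and the sample markup is invented so that the script runs without network access. If the Tidy extension is not loaded, the sketch passes the raw markup straight to DOMDocument, which also attempts to recover from malformed HTML.

```php
<?php

// Fetch a URL with cURL. Returns the response body, or false on failure.
function fetchPage($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow URL redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);          // arbitrary timeout, in seconds
    return curl_exec($curl);
}

// Repair $html with Tidy (when the extension is loaded) and run $xpath on it.
// Returns a DOMNodeList, or false if the markup or the expression is unusable.
function queryPage($html, $xpath)
{
    if (extension_loaded('tidy')) {
        $tidy = new tidy();
        if (!$tidy->parseString($html, array(), 'utf8')) {
            return false;
        }
        $tidy->cleanRepair();
        $html = tidy_get_output($tidy);
    }

    // collect libxml parse errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $domDocument = new DOMDocument();
    $loaded = $domDocument->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    $domXPath = new DOMXPath($domDocument);
    return $domXPath->query($xpath);
}

// Inline, deliberately malformed sample (the <a> tag is never closed).
// For a live page, use: $html = fetchPage("http://news.google.com");
$html = '<div id="top-stories"><div><h2><a href="#">Headline</h2></div></div>';

$nodes = ($html === false) ? false : queryPage($html, ".//*[@id='top-stories']/div[1]/h2/a");

if ($nodes === false || $nodes->length == 0) {
    print("Nothing matched; fall back to another method here\n");
} else {
    print(trim($nodes->item(0)->textContent) . "\n");
}
```

Keeping the fetch separate from the parse means the fetching step can be swapped for Wget or a cached file without touching the XPath code.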

Note that XPath may not always give you the structured content that you desire. For example, the expression .//*[@id='latest_post']/span[1] (for the posted and modified dates of the latest post on this blog) results in something like the following:

<strong>Posted on:</strong>
January 8, 2010
<span class="dot">⋅</span>
<strong>Last modified:</strong>
January 8, 2010 @ 4:49 pm

So there are tags and text (like “Posted on:” and “Last modified:”) to be stripped out to get the posted date (January 8, 2010) and the modified date (January 8, 2010 @ 4:49 pm). For this, you may still have to use regular expressions and/or string manipulation functions like explode( ) (the older split( ) was removed in PHP 7).
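
As a rough sketch of that clean-up, the snippet below strips the tags from the fragment shown above and pulls out the two dates with regular expressions. The patterns assume the labels and line layout stay exactly as in that fragment; the variable names are only illustrative.

```php
<?php

// The fragment returned for .//*[@id='latest_post']/span[1], as shown above
$fragment = "<strong>Posted on:</strong>\n"
          . "January 8, 2010\n"
          . "<span class=\"dot\">⋅</span>\n"
          . "<strong>Last modified:</strong>\n"
          . "January 8, 2010 @ 4:49 pm\n";

// drop the tags, keeping only the text
$text = strip_tags($fragment);

// capture whatever follows each label, up to the end of that line
preg_match('/Posted on:\s*([^\n]+)/', $text, $posted);
preg_match('/Last modified:\s*([^\n]+)/', $text, $modified);

$postedDate = trim($posted[1]);
$modifiedDate = trim($modified[1]);

print($postedDate . "\n");   // January 8, 2010
print($modifiedDate . "\n"); // January 8, 2010 @ 4:49 pm
```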

Discussion

One comment for “Web Scraping with Firefox and PHP, using XPath”

  1. Good exhaustive tutorial. Learned and implemented quite a things from it.
    Keep sharing :)

    Posted by Inderjeet | January 26, 2010, 12:38 am