Web Scraping with Firefox and PHP, using XPath

Last year, I worked on two projects that required automated Web scraping to aggregate content from web pages. I evaluated different methods for Web scraping, with varying levels of success, thanks to the changing structure of Web pages, pages that are not well-formed, and URL redirects.

Besides regular expressions and DOM (Document Object Model) parsing, I also used XPath. XPath works great for well-formed Web pages. An HTML Web page is called well-formed when every opening tag has a corresponding closing tag and the tags are nested properly (refer to this link for more). Well-formedness is what XHTML, the XML-based flavour of HTML, requires of a page.

XPath is a query language for accessing content on a well-formed page, whether XHTML or XML. All the content in a Web page lies within HTML elements, or tag pairs. The following are needed to extract the content of interest from a Web page using XPath:

  1. A well-formed Web page
  2. An XPath expression for the HTML element of interest
  3. XPath query using the expression

An XPath expression can look as cryptic as .//*[@id='home_featured']/div. This is where the Firefox web browser, with a plugin called FireXPath, helps with the second requirement (as explained below). The first and third requirements are met by PHP (in my case), which repairs the page with Tidy and then runs the XPath query for automated Web scraping.
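The three ingredients listed above can be sketched in a few lines of PHP. This is only a minimal illustration: the inline page is a made-up sample (not the real markup of this blog), and it is already well-formed, so no repair step is needed here.

```php
<?php
// Ingredient 1: a well-formed page (hypothetical sample markup).
$page = <<<HTML
<html>
  <body>
    <div id="home_featured">
      <div>Featured post teaser</div>
    </div>
    <div id="sidebar"><div>Not wanted</div></div>
  </body>
</html>
HTML;

// Ingredient 2: the XPath expression for the element of interest.
$expression = ".//*[@id='home_featured']/div";

// Ingredient 3: the XPath query using that expression.
$document = new DOMDocument();
$document->loadXML($page);
$xpath = new DOMXPath($document);
$nodes = $xpath->query($expression);

// Take the text of the first match, if there is one.
$featured = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
print($featured . "\n");
```

Only the div inside the element with the id home_featured is matched; the sidebar div is left out.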

Click and build XPath expressions
Building the cryptic XPath expression is easy and intuitive with Firefox and a couple of its plugins – namely Firebug and FireXPath. Install Firebug from here, followed by FireXPath from here and restart Firefox. As of this writing, I’m using the following versions of Firefox and the plugins:

  • Firefox 3.5.6
  • Firebug 1.4.5
  • FireXPath 0.9.1

After restarting Firefox, browse to the Web page of interest. Next:

  • Launch Firebug by clicking the bug-like icon in the right corner of the status bar.
  • Click the tab named XPath (in Firebug).
  • Click the arrow (blue arrow) and move your mouse over the content of interest on the web page.
  • Once the required chunk is highlighted (with a blue border), note down the XPath expression in the text field.

For example, the screenshot below shows the XPath expression (.//*[@id='home_featured']/div) for the “Featured Post” chunk on the home page of this blog:

With FireXPath, point your mouse on the chunk/element of interest, to build an XPath expression

A few examples of XPath expressions:

  • .//*[@id='latest_post']/span[1] : The posted and modified dates of the latest post on this blog – http://www.shekhargovindarajan.com
  • .//*[@id='top-stories']/div[1]/h2/a : The headline of the top story on Google News – http://news.google.com
  • .//*[@id='mp-tfa']/p[1] : Contents of featured article on Wikipedia – http://en.wikipedia.org/wiki/Main_Page
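To make the pieces of such expressions less cryptic, here is a small sketch that runs three variants against one document. The markup is hypothetical (loosely modelled on a news front page, not Google's real HTML), and textOf( ) is an illustrative helper, not part of any library.

```php
<?php
// Hypothetical markup: two stories inside a container with an id.
$page = <<<HTML
<div id="top-stories">
  <div><h2><a href="/story-1">First headline</a></h2></div>
  <div><h2><a href="/story-2">Second headline</a></h2></div>
</div>
HTML;

$document = new DOMDocument();
$document->loadXML($page);
$xpath = new DOMXPath($document);

// Illustrative helper: the trimmed text of every node an expression matches.
function textOf(DOMXPath $xpath, string $expression): array
	{
	$texts = [];
	foreach ($xpath->query($expression) as $node)
		$texts[] = trim($node->textContent);
	return $texts;
	}

// [@id='...'] picks the element by its id; div[1] is the FIRST div (XPath counts from 1).
$first = textOf($xpath, ".//*[@id='top-stories']/div[1]/h2/a");

// Dropping the [1] matches the link in every div.
$all = textOf($xpath, ".//*[@id='top-stories']/div/h2/a");

// A trailing /@href selects the attribute instead of the element.
$hrefs = textOf($xpath, ".//*[@id='top-stories']/div[2]/h2/a/@href");

print(implode(' | ', $all) . "\n");
```

The point to remember is that positions in XPath start at 1, not 0, so div[1] and span[1] refer to the first div and the first span.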

Automated scraping using PHP
With the XPath expression in hand, a PHP script like the following can extract the required content from the Web page:

<?php

class WebScrap
	{
	private $url;
	private $xpath;

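	// __construct replaces the old class-named constructor, which PHP 8 no longer treats as a constructor
	public function __construct($url,$xpath)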
		{
		$this->url = $url;
		$this->xpath = $xpath;
		}

	public function GetScrap()
		{
		// use Tidy to try to make the page well formed
		$page = $this->TidyIt($this->url);

		// create a document out of the well formed content
		$domDocument=new DOMDocument();
		$domDocument->loadHTML($page);

		// create an XPath object out of the document and query it for the supplied xpath
		$domXPath = new DOMXPath($domDocument);
		$domNodeList = $domXPath->query($this->xpath);

		// Get the content (HTML) out of the NodeList returned by the DOMXPath::query
		$content = $this->GetHTMLFromNodeList($domNodeList);

		return $content;
		}

	private function TidyIt($url)
		{
		$tidy = new tidy();
		$tidy->parseFile($url);
		$tidy->cleanRepair();
		// return the repaired markup as a string, which is what loadHTML() expects
		return tidy_get_output($tidy);
		}

	private function GetHTMLFromNodeList($domNodeList)
		{
		$domDocument = new DOMDocument();

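		// return an empty string when the query failed or matched nothing
		if (!$domNodeList || $domNodeList->length == 0)
			return '';

		$node = $domNodeList->item(0);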

		foreach($node->childNodes as $childNode)
			$domDocument->appendChild($domDocument->importNode($childNode, true));

		return $domDocument->saveHTML();
		}

	}

?>

Note that the GetScrap( ) method of the WebScrap class first calls the TidyIt( ) method, which uses the Tidy library to fix the HTML (if required) so that it is well-formed. The PHP Tidy class both fetches the web page from the given URL and repairs it. The DOMXPath object is then used to query the well-formed page content with the XPath expression.

To use the above code, you will need to install PHP modules for XML and Tidy. On a RedHat/CentOS/Fedora Linux machine, these modules can be installed using the following command:

yum install php-xml php-tidy

Save the above code in a file named class.WebScrap.php (say). Subsequently, the WebScrap class can be used as:

<?php

include("class.WebScrap.php");
$scrap = new WebScrap("http://news.google.com",".//*[@id='top-stories']/div[1]/h2/a");
print($scrap->GetScrap());

?>

The code should be self-explanatory for a seasoned PHP programmer. If not, post your questions as comments on this post. For repeated, automated Web scraping, a scheduler like Cron can be used to execute the above PHP script at regular intervals and fetch the latest content.

The Leftovers
The code above is readable, crisp and focusses on the subject. For this reason, it has deliberate exclusions.  In a real world application, you should:

  • Use the Curl library in PHP, or external tools like Wget, to fetch the URL. Then pass the fetched content to Tidy. Hint: use Tidy's parseString( ) instead of parseFile( ).
  • Handle errors raised by Tidy and by the XPath query.
  • Fall back to other means (say, regular expressions) when Tidy or the XPath query fails.
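The points above could be sketched roughly as follows. Treat this as one possible arrangement under stated assumptions, not a finished implementation: the function names fetchPage( ), tidyMarkup( ) and extractText( ) are made up for this example, the 15-second timeout is arbitrary, and the demonstration at the end works on an inline sample so that it runs without network access.

```php
<?php
// Fetch a URL with cURL. Returns the body, or null on any failure.
function fetchPage(string $url): ?string
	{
	$handle = curl_init($url);
	if ($handle === false)
		return null;
	curl_setopt_array($handle, [
		CURLOPT_RETURNTRANSFER => true,  // hand the body back instead of printing it
		CURLOPT_FOLLOWLOCATION => true,  // follow URL redirects
		CURLOPT_TIMEOUT        => 15,    // seconds; an arbitrary choice
	]);
	$body = curl_exec($handle);
	return is_string($body) ? $body : null;
	}

// Repair markup with Tidy's parseString(); pass it through if Tidy is missing.
function tidyMarkup(string $html): string
	{
	if (!class_exists('tidy'))
		return $html;
	$tidy = new tidy();
	$tidy->parseString($html, [], 'utf8');
	$tidy->cleanRepair();
	return tidy_get_output($tidy);
	}

// Query the page. Returns the text of the first match, or null on any failure.
function extractText(string $html, string $expression): ?string
	{
	if (trim($html) === '')
		return null;

	$previous = libxml_use_internal_errors(true);  // keep parser warnings quiet
	$document = new DOMDocument();
	$loaded = $document->loadHTML(tidyMarkup($html));
	libxml_clear_errors();
	libxml_use_internal_errors($previous);
	if (!$loaded)
		return null;

	$xpath = new DOMXPath($document);
	$nodes = @$xpath->query($expression);  // false when the expression is invalid
	if ($nodes === false || $nodes->length === 0)
		return null;

	return trim($nodes->item(0)->textContent);
	}

// Offline demonstration on a small, deliberately broken sample (<b> is never closed).
$html = "<div id='box'><p>Hello <b>world</p></div>";
$found = extractText($html, ".//*[@id='box']/p");
$missing = extractText($html, ".//*[@id='no_such_id']");

print(($found ?? 'no match') . "\n");
```

For a live page, the result of fetchPage( ) would be passed to extractText( ) after checking it for null. A null from extractText( ) is the signal to fall back to another method, such as a regular expression over the raw HTML.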

Note that XPath may not always give you the structured content that you desire. For example, using the expression .//*[@id='latest_post']/span[1] (for posted and modified dates of the latest post on this blog) will result in something like the following:

<strong>Posted on:</strong>
January 8, 2010
<span class="dot">⋅</span>
<strong>Last modified:</strong>
January 8, 2010 @ 4:49 pm

So there are tags and text (like “Posted on:” and “Last modified:”) to be stripped out to get the posted date (January 8, 2010) and the modified date (January 8, 2010 @ 4:49 pm). For this, you may still have to use regular expressions and/or string manipulation functions like explode( ) or preg_split( ) (the older split( ) was removed in PHP 7).
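As a rough sketch of that clean-up step, the following strips the tags with strip_tags( ) and then pulls out the two dates with regular expressions. The date pattern is an assumption based only on the sample output above; other date formats would need a different pattern.

```php
<?php
// The fragment returned by the XPath query, as shown above.
$fragment = <<<HTML
<strong>Posted on:</strong>
January 8, 2010
<span class="dot">⋅</span>
<strong>Last modified:</strong>
January 8, 2010 @ 4:49 pm
HTML;

// Remove the tags and keep only the text.
$text = strip_tags($fragment);

// A date such as "January 8, 2010", optionally followed by "@ 4:49 pm".
$date = '[A-Z][a-z]+ \d{1,2}, \d{4}(?: @ \d{1,2}:\d{2} [ap]m)?';

// Capture the date that follows each label; null when the label is absent.
$posted = preg_match("/Posted on:\s*($date)/", $text, $m) ? $m[1] : null;
$modified = preg_match("/Last modified:\s*($date)/", $text, $m) ? $m[1] : null;

print("Posted: $posted\n");
print("Modified: $modified\n");
```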

Discussion

16 comments for “Web Scraping with Firefox and PHP, using XPath”

  1. Good exhaustive tutorial. Learned and implemented quite a few things from it.
    Keep sharing :)

    Posted by Inderjeet | January 26, 2010, 12:38 am
  2. Is it possible to implement this technique using JavaScript? Can you guide me to some tutorials or something..

    Wonderful explanation!!

    Thanks a ton!! :)

    Posted by vikky | March 22, 2010, 12:59 pm
  3. “Is it possible to implement this technique using JavaScript? Can you guide me to some tutorials or something.”

    JavaScript or Java? JavaScript is typically confined to parsing the page in which it is written. What’s your use case? That is, what’s the application that you have in mind for which you need to use JavaScript?

    “Wonderful explanation!! Thanks a ton!!”

    Most welcome :-) Thanks for the comment and appreciation.

    Posted by Shekhar | March 22, 2010, 7:09 pm
  4. Hi, I am having a problem. I ran the same code as above but didn’t get any result. The page is blank.

    I am using Fedora 9. I installed php-xml and php-tidy. Do I need to install anything else? Please help.

    Posted by subh | May 27, 2010, 1:15 pm
  5. Hi Subh,
    Turn on PHP error display in php.ini. Hint: display_errors = On

    This should show you error(s), if any, in the code.

    Posted by Shekhar | May 27, 2010, 2:59 pm
  6. Excellent tutorial! Even though I’m using Python and not PHP, it inspired me to check out using XPath. I had previously been using regular expressions along with BeautifulSoup, but it was painstaking. Now, using the lxml bindings, along with XPather, I can kick out solid scraping functions in a fraction of the time and that are robust without excessive testing.

    If you’re into Python, I’d suggest checking this out:
    http://codespeak.net/lxml/parsing.html

    Posted by Nick Bennett | October 12, 2010, 12:08 am
  7. Hi Nick,
    To be very honest, thanks for your comment, because I learnt something new today.

    Also glad that this blog post helped you in some way :-)

    Posted by Shekhar | October 12, 2010, 12:22 am
  8. Need your guidance / support on automated web scraping for real time personal use.

    Kindly contact..

    Posted by Sean | October 15, 2010, 3:08 am
  9. Hi Sean,

    “Need your guidance / support on automated web scraping for real time personal use.”

    Let me know how I can help you with this.

    Posted by Shekhar | October 16, 2010, 4:46 pm
  10. Hi! It’s a very nice tutorial. Can you give me a brief idea of what purposes web scraping is used for, and how? In which places will it be useful? What technologies can support this?

    Thanks.

    Posted by Arohi | November 29, 2010, 3:13 pm
  11. Hi Shekhar, great post thanks – I’ve been searching the internet for a while now trying to find an automated scraping technique. I’m taking your advice and posting here as I’m not a seasoned programmer and just need a quick piece of advice. I need to collect the road speeds at junctions along the M1 over a month time period. Would your above code suffice if I wanted to extract the values from this link http://www.frixo.com/m1-north.asp (bottom left table)? Thanks in advance

    Posted by James | December 16, 2010, 8:47 pm
  12. Hi James,

    “Would your above code suffice if I wanted to extract the values from this link http://www.frixo.com/m1-north.asp (bottom left table)? Thanks in advance”

    I believe the above code should work for your requirement. The XPath expression to extract the road speeds would be .//*[@id='roaddiagram']. This will give you the required data. Subsequently, you may need to strip out the HTML tags. Enjoy :-)

    Posted by Shekhar | December 17, 2010, 3:35 pm
  13. Hi Shekhar,

    Thanks for your help. Is there any way to contact you directly as I am still having a few problems and feel it could be something very basic, plus can show you some of the error messages I am getting. At this stage I’m still trying to get your code to work. Appreciate all help you can provide at this stage.

    Thanks

    Posted by James | December 22, 2010, 7:34 pm
  14. James,

    while I prefer that you post your queries here so that it can be of help to others too, you can e-mail me at – shekhar at it4enterprise dot com

    Posted by Shekhar | December 23, 2010, 5:27 pm
  15. Wow!!

    Thanks for the tutorial, I am going to bookmark this.

    I’ve only programmed in LISP and VBA up until now (no need for anything else, really), so I definitely have a lot to learn before I can implement your suggestions.

    I wish there was good freeware for data mining– it seems like there would be as it could be a tremendous tool for students and academic researchers, as well as individual persons.

    I started looking into web-scraping this evening as an alternative to trying to investigate all 600 of the colleges and universities on forbes’ list for viable distance education programs with literature degrees. Any ideas for a lazy engineering mom who secretly pines after stimulating literature analysis?

    Posted by Sydney | May 24, 2011, 7:21 am
  16. Hi Sydney,

    “I started looking into web-scraping this evening as an alternative to trying to investigate all 600 of the colleges and universities on forbes’ list for viable distance education programs with literature degrees. Any ideas for a lazy engineering mom who secretly pines after stimulating literature analysis?”

    First, let me apologize to you for the delay in responding to your comment. The reason is that I had spotted some easy-to-use tools, and it was only today that I searched for and recalled them :-)

    See if one of the following gets you started:

    1. Scraper (Google Chrome)
    2. Check out this blog post for screen scraping with Firefox.

    Let me know if the above are useful else we will find out more such tools!

    Posted by Shekhar | May 30, 2011, 7:03 pm
