Last year, I worked on two projects that required automated Web scraping, that is, aggregating content from web pages. I evaluated different methods for Web scraping, with varying levels of success, thanks to the changing structure of Web pages, pages that are not well-formed, and URL redirects.
Besides regular expressions and DOM (Document Object Model) parsing, I also used XPath. XPath works great for well-formed Web pages.
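To illustrate what XPath on a well-formed page looks like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup in `PAGE`, the `post` class name, and the helper names `extract_posts` and `is_well_formed` are all made up for this example and are not taken from the projects mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML snippet used only for illustration.
PAGE = """<html>
  <body>
    <div class="post">
      <h2>First entry</h2>
      <a href="/entry/1">Read more</a>
    </div>
    <div class="post">
      <h2>Second entry</h2>
      <a href="/entry/2">Read more</a>
    </div>
  </body>
</html>"""


def extract_posts(markup):
    """Return (title, link) pairs for every <div class="post"> in the markup."""
    root = ET.fromstring(markup)
    posts = []
    # ElementTree understands a subset of XPath, including attribute predicates.
    for div in root.findall(".//div[@class='post']"):
        title = div.findtext("h2")
        link = div.find("a").get("href")
        posts.append((title, link))
    return posts


def is_well_formed(markup):
    """Report whether the markup parses as XML at all."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for title, link in extract_posts(PAGE):
        print(title, link)
```

The same parser rejects markup with unclosed or mismatched tags by raising `ParseError`, which is the practical reason XPath alone struggles with pages that are not well-formed; `is_well_formed` shows one way to detect that case before querying.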