A Comparative Study on Web Scraping
Abstract
The World Wide Web contains information of many kinds and origins, including social, financial, security and academic data, and many people access it for educational purposes. Information on the web is available in different formats and through different access interfaces, so indexing or semantically processing the data on websites can be cumbersome. Web scraping is a technique that aims to address this issue: it transforms unstructured data on the web into structured data that can be stored and analysed in a central local database or spreadsheet. Web scraping techniques include traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognition and computer-vision web-page analysers. Traditional copy-and-paste is the most basic and most tiresome technique, since a person must manually copy what may be many datasets. Web scraping software is the easiest technique to adopt, since all the other techniques except traditional copy-and-paste require some form of technical expertise. Hundreds of web scraping programs are available today, most of them written in Java, Python or Ruby, and both open-source and commercial offerings exist. Web scraping software such as Yahoo Pipes, Google Web Scrapers and the OutWit Firefox extension are good tools for beginners in web scraping. This study focuses on giving a comparative clarification of web scraping techniques and well-known web scraping software. To accomplish this, we compare and contrast several web scraping techniques and some well-known web scraping software. The outcome of this study is a review of web scraping techniques and software that can be used to extract data from educational websites.
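To make the idea of turning unstructured markup into structured data concrete, the following is a minimal sketch of the HTML-parsing technique mentioned above, written in Python (one of the languages the abstract names). The page fragment, the `course` class name, and the `CourseExtractor` class are illustrative assumptions, not taken from the study itself; a real scraper would fetch the page over HTTP first.

```python
from html.parser import HTMLParser

# Hypothetical fragment of an educational web page (assumed markup,
# not from the study): unstructured HTML we want to turn into a list.
PAGE = """
<ul>
  <li class="course">Introduction to Computing</li>
  <li class="course">Data Structures</li>
  <li class="course">Web Technologies</li>
</ul>
"""

class CourseExtractor(HTMLParser):
    """Collects the text of <li class="course"> elements into a list."""

    def __init__(self):
        super().__init__()
        self.in_course = False
        self.courses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "li" and ("class", "course") in attrs:
            self.in_course = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_course = False

    def handle_data(self, data):
        # Keep only non-whitespace text found inside a course <li>.
        if self.in_course and data.strip():
            self.courses.append(data.strip())

parser = CourseExtractor()
parser.feed(PAGE)
print(parser.courses)
# → ['Introduction to Computing', 'Data Structures', 'Web Technologies']
```

The same extraction could be done with regular expression matching, but a real HTML parser is more robust to attribute ordering and nesting, which is why the abstract lists HTML and DOM parsing as separate, more capable techniques.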