Towards Automatic Web Data Scraper and Aligner (WDSA).

Authors

  • Shridevi A. Swami, Pune Institute of Computer Technology, Pune University, Maharashtra.
  • Pujashree S. Vidap, Pune Institute of Computer Technology, Pune University, Maharashtra.

DOI:

https://doi.org/10.24297/ijct.v13i3.2762

Keywords:

Data extraction, Wrapper, Data scraping, Data values alignment, Data integration.

Abstract

The web is an immense and fast-growing source of information. Web browsers and search engines have become popular tools for retrieving and accessing information on the web, but the enormous growth of the web has made extracting data from it harder than ever. This paper presents the Automatic Web Data Scraper and Aligner (WDSA). Automatic WDSA extracts the data of interest from the dynamically generated web page that a search engine returns in response to a user query. Automatic web data scraping is necessary because, although a human can readily identify the query-relevant content in a query result page, doing so is difficult for computer applications. The extracted web data can then be transformed into a format suitable for applications such as comparison shopping, data integration, and value-added services. WDSA does this by aligning the extracted data values pairwise as well as holistically in a table. The novelty of Automatic WDSA is that the Data Scraper and Aligner use a new approach that combines tag similarity and value similarity in both the extraction and the alignment process. The Data Scraper also handles data that is non-contiguous because of auxiliary information such as advertisement banners, navigational links, and pop-ups. Experimental results show that Automatic WDSA achieves high precision and recall. Automatic WDSA is further compared with widely used existing tools such as Helium Scraper, OutWit Hub, and Screen Scraper. The comparison shows that the existing tools require manual labeling or user-specified extraction patterns for the desired data, whereas Automatic WDSA requires no user involvement, which makes it fully automatic.
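
The idea of combining tag similarity with value similarity can be illustrated with a minimal sketch. The abstract does not give the formula WDSA actually uses, so everything below is an assumption made for illustration: a data item is modeled as a pair of (tag path, text value), both similarities are computed with Python's standard `difflib.SequenceMatcher`, and the two scores are mixed with a hypothetical `tag_weight` parameter.

```python
from difflib import SequenceMatcher


def tag_similarity(tags_a, tags_b):
    """Similarity of two tag paths (lists of tag names), from 0.0 to 1.0."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def value_similarity(value_a, value_b):
    """Character-level similarity of two text values, from 0.0 to 1.0."""
    return SequenceMatcher(None, value_a, value_b).ratio()


def combined_similarity(item_a, item_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity.

    Each item is a (tag_path, value) pair. The weighting scheme is an
    illustrative assumption, not the formula from the paper.
    """
    tags_a, value_a = item_a
    tags_b, value_b = item_b
    tag_score = tag_similarity(tags_a, tags_b)
    value_score = value_similarity(value_a, value_b)
    return tag_weight * tag_score + (1 - tag_weight) * value_score


# Two items rendered with the same tag path but holding different values.
price_a = (["td", "span"], "abc")
price_b = (["td", "span"], "xyz")
score = combined_similarity(price_a, price_b)
```

In this sketch, items that share a tag path still receive a partial score when their values differ, which is the kind of signal an aligner could use to place them in the same table column.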

Published

2014-04-15

How to Cite

Swami, S. A., & Vidap, P. S. (2014). Towards Automatic Web Data Scraper and Aligner (WDSA). INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, 13(3), 4308–4318. https://doi.org/10.24297/ijct.v13i3.2762

Section

Research Articles