Building Application Powered by Web Scraping
Phan, Huy (2019)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-201904175517
Abstract
Being able to collect and process online content can help businesses make informed decisions. Given the explosion of data available online, this process cannot practically be accomplished by manual browsing, but it can be done with web scraping: an automated system that collects just the necessary data. This paper examines the use of web scrapers in building web applications, in order to identify the major advantages and challenges of web scraping. Two applications based on web scrapers were built to study how scrapers can help developers retrieve and analyze data. One has a web scraper backend that fetches data from web stores on demand. The other scrapes and accumulates data over time.
A good web scraper requires a robust, multi-component architecture that is fault tolerant. The retrieval logic can be complicated, since the data can come in different formats. A typical application based on a web scraper requires regular maintenance in order to function smoothly. Site owners may not want a robot scraper to visit and extract data from their sites, so it is important to check a site's policy before trying to scrape its contents.
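A site's scraping policy is conventionally published in its robots.txt file. As a minimal sketch of the policy check described above, Python's standard library can parse such a file and answer whether a given URL may be fetched; the robots.txt content, URLs, and user-agent name below are hypothetical examples, not taken from the applications in this paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this would be
# downloaded from https://<site>/robots.txt before scraping.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

def may_scrape(url: str, agent: str = "MyScraperBot") -> bool:
    """Return True if the site's policy permits fetching the given URL."""
    return parser.can_fetch(agent, url)

# Allowed page vs. a disallowed section:
# may_scrape("https://example.com/products")      -> True
# may_scrape("https://example.com/private/data")  -> False
```

A scraper would run a check like this before each request, skipping any URL the policy disallows.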
It would be beneficial to look into ways to optimize the scraper's traffic. The next step after data retrieval is to have a well-defined pipeline that processes the raw data down to just the meaningful data the developer intended to collect.
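One step of such a pipeline might reduce raw HTML to the values of interest. The sketch below, using only Python's standard-library HTML parser, pulls price strings out of a page under the assumption (purely illustrative) that prices are marked up as `<span class="price">` elements:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text inside <span class="price"> elements.

    The class name "price" is an assumed markup convention for this
    example; a real pipeline would use the target site's actual markup.
    """
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

def extract_prices(raw_html: str) -> list:
    """Pipeline step: raw HTML in, cleaned price strings out."""
    extractor = PriceExtractor()
    extractor.feed(raw_html)
    return [p for p in extractor.prices if p]
```

Further pipeline stages could then normalize these strings (currency, decimal separators) before storing or analyzing them.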