Automated Parsing and Integration of Job Requirements from Public Sources : A Case Study in Excel and SQL
Neittamo, Joona (2024)
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-202404085998
Abstract
Extracting structured information from Excel files is becoming increasingly essential in an era of data-driven decision making. This thesis explores the main challenges of extracting and processing data from Excel files whose layout patterns vary from file to file, and presents versatile strategies for handling these complexities. Using datasets of job requirement information collected by a web-parser robot, the study addresses the main obstacles encountered during extraction: location, structure, and pattern repetition.
The introductory chapter establishes the context of Excel parsing, emphasizing the importance of efficient data extraction for informed decision-making. The subsequent literature review examines known tools, methodologies, and obstacles in data extraction from Excel files to give a comprehensive grasp of the subject matter. The chapter on location-based challenges investigates the identification of relevant cells and the careful handling of merged cells and spans. The chapter on structural challenges addresses the normalization of inconsistent data formats and the extraction of hierarchical data. Lastly, the section on pattern repetition examines how to recognize repetitive structures and presents strategies for managing irregular patterns.
The concluding chapter brings together the findings and implications of the study. The identified challenges and their corresponding solutions contribute to the advancement of data extraction practices by improving the efficiency and precision of these processes.