Development of Modern Data Platform using Medallion Architecture
Wiselka, Michal (2024)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024111828633
Abstract
Worldwide data volumes continue to grow, forcing organizations to develop modern, scalable data solutions. This thesis focuses on the development of a modern data platform using the Medallion Architecture for a company specialising in the manufacture of sports devices. The objective of the thesis is to further develop the data platform by integrating new data sources and enhancing the analytical capabilities of the internal data team. The author also aims to familiarize the reader with data engineering concepts and to showcase a practical implementation of the Medallion Architecture.
The first chapters introduce the reader to the theoretical background and to the Medallion Architecture itself. The reader is then walked through an overview of the solution implementation, in which selected technologies are described and their use within the solution is explained.
The main part of the thesis describes the implementation of the solution, following the journey of data from source systems to business use cases. Every step, from data ingestion through data storage and data processing to data governance, is shown and described in detail. This part also includes a deep dive into the core technologies that form the backbone of the solution: Apache Spark and Delta Lake.
The thesis concludes by evaluating the set objectives and presenting the author's reflections on the solution and the technologies used. The author shares positive and negative aspects of the implementation, as well as possible improvements that should be made.