Exploring end-to-end data engineering: a GCP case study
Nawaz, Muhammad Kashif (2024)
Nawaz, Muhammad Kashif
2024
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2024052715933
Abstract
In an era dominated by data, Data Engineering (DE) plays a central role in building robust data management solutions. Its importance has grown with the rise of machine learning (ML) and artificial intelligence: ML requires large volumes of data, so handling that data well is a prerequisite for adopting ML in an organization. Data engineering ensures efficient data processing, including storage and retrieval, maintains data quality, and supports the batch and real-time analytics that ML and analytics workflows depend on.
This thesis offers a hands-on exploration of modern cloud technologies, with insights into real-world challenges and solutions. By following a complete data flow from collection to reporting, it aims to connect theoretical knowledge with practical application and prepare readers for the demands of the evolving DE field. The thesis presents a comprehensive case study that constructs an end-to-end solution on Google Cloud Platform (GCP) with supporting tools, showing what a simple real-world data engineering workflow looks like. Raw data from multiple sources is collected into cloud storage through an Extract, Transform, Load (ETL) process, transformations are then applied in the data warehouse, and finally a business dashboard containing key performance indicators (KPIs) is built. Mage AI is used for orchestration and the dbt library for transformations. Business users are most interested in data in the context of their business domain, so such dashboards are useful when they analyze historic trends or performance.
The thesis introduces the libraries, databases, and other modern tools used in data engineering, shows how a cloud provider (in this case, Google Cloud) fits into the workflow, and demonstrates how an open-source tool such as Mage AI can be integrated into it. The work delivers Python scripts, dbt models, Terraform scripts, and database views into the production environment.
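The collect-transform-load flow summarized above can be sketched minimally in plain Python. This is an illustrative stand-in only: the column names, sample data, and in-memory "warehouse" are hypothetical, whereas the actual pipeline in this work uses Mage AI for orchestration, Google Cloud Storage and BigQuery for storage, and dbt for transformations.

```python
# Minimal ETL sketch (illustrative; not the thesis's production pipeline).
import csv
import io


def extract(raw_csv: str) -> list[dict]:
    """Read raw source data; an in-memory CSV stands in for a source file."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list[dict]) -> list[dict]:
    """Clean rows before loading; in the thesis, dbt performs this step."""
    return [
        {
            "trip_id": row["trip_id"],
            "fare": round(float(row["fare"]), 2),  # normalize numeric types
        }
        for row in rows
    ]


def load(rows: list[dict], target: list) -> None:
    """Append transformed rows to a target (stand-in for a warehouse table)."""
    target.extend(rows)


raw = "trip_id,fare\n1,12.50\n2,7.2\n"
warehouse_table: list[dict] = []
load(transform(extract(raw)), warehouse_table)
```

A dashboard layer would then query the warehouse table to compute the KPIs that business users track.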