Real-time Data Analytic on Google Cloud : A Complete Data Pipeline from Self-Host Databases to GCP Services
Bui, Tan (2025)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202502256789
Abstract
Data analysis plays an important role in business operations today, and there is a growing demand for integrating a data analytics pipeline into existing software systems. Although cloud providers such as AWS and Google Cloud offer many cloud-native solutions, many businesses run self-hosted infrastructure that is independent of those cloud services, which makes it challenging to design an architecture for a data analysis system.
Real-time processing is particularly demanding, because data changes must be streamed continuously from the existing software system to the data analysis pipeline. However, many businesses still use traditional data storage systems that do not support event streaming.
The thesis introduces a system architecture for a complete real-time ELT pipeline for data analysis. In this pipeline, data changes from multiple sources, such as MySQL and MongoDB databases in an existing infrastructure, are extracted and loaded into Google Cloud services in real time. The data is then transformed so that it is available for visualization or other services.
To verify the architecture, the whole system was implemented as part of the thesis. The existing software system was simulated with a data generator that writes thousands of sample records per second to the databases. New components observe the data changes and convert them into events that are streamed to Google Cloud services.
The results showed that the architecture can be applied to different data sources and that millions of events can be processed every second, depending on network bandwidth and cloud service quotas.
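As an illustration of the "observe and convert" step described above, the following is a minimal sketch of how a single data change might be wrapped into an event and handed to a publisher. It is not the thesis implementation: the envelope fields (`source`, `collection`, `op`, `ts`, `data`), the function names `change_to_event` and `stream_changes`, and the `publish` callback are all assumptions made for this example, and the actual transport to Google Cloud is deliberately left out.

```python
import json
import time
from typing import Any, Callable, Dict, Iterable, Optional


def change_to_event(source: str, collection: str, op: str,
                    row: Dict[str, Any], ts: Optional[float] = None) -> bytes:
    """Wrap one observed data change in a JSON event envelope.

    The envelope schema is hypothetical and chosen only for illustration.
    """
    if op not in {"insert", "update", "delete"}:
        raise ValueError(f"unsupported operation: {op}")
    event = {
        "source": source,          # e.g. "mysql" or "mongodb"
        "collection": collection,  # table or collection name
        "op": op,                  # kind of change
        "ts": time.time() if ts is None else ts,  # observation time, seconds
        "data": row,               # the changed record
    }
    # default=str keeps non-JSON types (dates, ids) serializable as text.
    return json.dumps(event, sort_keys=True, default=str).encode("utf-8")


def stream_changes(changes: Iterable[Dict[str, Any]],
                   publish: Callable[[bytes], None]) -> int:
    """Convert each change to an event, pass it to `publish`, return the count.

    `publish` is an assumed callback standing in for whatever client
    delivers events to the cloud side of the pipeline.
    """
    count = 0
    for change in changes:
        publish(change_to_event(**change))
        count += 1
    return count
```

For example, passing `sent.append` (for a list `sent`) as `publish` collects the encoded events in memory, which is convenient for testing the conversion without any network access. Keeping the publisher behind a callback means the same conversion code could serve both MySQL and MongoDB change sources.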