Near Real-Time IoT Data Pipeline: Implementing Scalable Architecture Using Confluent Kafka, TimescaleDB and Cloud Deployment with Integrated Monitoring Solutions
Manandhar, Rabindra (2026)
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202603315344
Abstract
The rapid growth of Internet of Things (IoT) systems requires data architectures capable of processing high-velocity time-series data with low latency while ensuring scalability and fault tolerance. The existing infrastructure at Metropolia AIoT Garage and its IoT partners introduced a delay of approximately 24 hours between sensor data generation and its availability for analytics, limiting real-time decision-making and anomaly detection.
This thesis presents the design, implementation, and evaluation of a cloud-native IoT data pipeline that eliminates this delay. Environmental data (temperature, humidity, and pressure) is collected from RuuviTags via an ESP32-WROOM-32 gateway and transmitted over MQTT to a Mosquitto broker running on Google Kubernetes Engine (GKE). The data is processed by a custom adapter service and streamed through a three-node Confluent Kafka cluster operating in KRaft mode, with Avro serialization enforced by Confluent Schema Registry. A sink service stores the data in TimescaleDB, optimized for time-series storage with hypertables, compression, and continuous aggregates.
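The adapter step described above can be illustrated with a minimal sketch. It assumes the gateway publishes JSON payloads; the field names (`sensor_id`, `ts_ms`, `temperature_c`, `humidity_pct`, `pressure_hpa`) and the `Reading` type are hypothetical and are not taken from the thesis. The sketch only shows how a payload might be validated and typed before it is handed to an Avro serializer, and it omits the MQTT and Kafka clients entirely.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Reading:
    """One environmental sample; field names are illustrative assumptions."""
    sensor_id: str
    ts_ms: int            # sensor-side timestamp, milliseconds since epoch
    temperature_c: float
    humidity_pct: float
    pressure_hpa: float


def parse_reading(payload: bytes) -> Reading:
    """Decode a JSON MQTT payload and coerce each field to its expected type.

    Raises KeyError if a field is missing and ValueError if a value
    cannot be converted, so malformed messages are rejected early.
    """
    raw = json.loads(payload)
    return Reading(
        sensor_id=str(raw["sensor_id"]),
        ts_ms=int(raw["ts_ms"]),
        temperature_c=float(raw["temperature_c"]),
        humidity_pct=float(raw["humidity_pct"]),
        pressure_hpa=float(raw["pressure_hpa"]),
    )


def to_record(reading: Reading) -> dict:
    """Return a plain dict, the shape a schema-based serializer would accept."""
    return asdict(reading)
```

Coercing types in the adapter keeps schema violations out of the Kafka topic; in a deployment like the one described, the resulting record would then be serialized against the schema registered in Schema Registry.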
The system is containerized using Docker and deployed via Terraform and Kubernetes manifests, supporting horizontal autoscaling, persistent storage, and comprehensive monitoring with Prometheus and Grafana. Performance evaluation shows a median end-to-end latency of 1.4 seconds (minimum 26 milliseconds), with stable resource utilization and query execution in under 15 milliseconds.
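As a rough illustration of how end-to-end latency figures of this kind can be derived, the sketch below computes the minimum, median, and maximum from pairs of timestamps. It assumes each stored row carries both a sensor-side timestamp and a database write timestamp in milliseconds, and that the two clocks are synchronized; the function name `latency_summary` is hypothetical, and the sample values in the usage note are made up for demonstration, not measurements from the thesis.

```python
from statistics import median
from typing import Iterable, Tuple


def latency_summary(pairs: Iterable[Tuple[int, int]]) -> dict:
    """Summarize end-to-end latency from (sensor_ts_ms, stored_ts_ms) pairs.

    Latency for one sample is the stored timestamp minus the sensor timestamp.
    """
    latencies = [stored - sent for sent, stored in pairs]
    if not latencies:
        raise ValueError("at least one timestamp pair is required")
    return {
        "min_ms": min(latencies),
        "median_ms": median(latencies),
        "max_ms": max(latencies),
    }
```

For example, `latency_summary([(1000, 1120), (2000, 2900), (3000, 4500)])` yields a minimum of 120 ms and a median of 900 ms. The median is less sensitive than the mean to occasional slow samples, which is one reason it is a common headline figure for pipeline latency.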
The results demonstrate a scalable, near real-time IoT pipeline suitable for analytics. Future work will focus on high availability, stronger security, and the integration of machine learning for real-time analytics and predictive capabilities.
