Near real-time clickstream analysis : a journey in big data systems and architectures
Corrao, Salvatore (2019)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2019051510083
Abstract
Clickstream analysis focuses on the records generated as a user clicks through a web page. The field is nowadays part of the Big Data phenomenon and relies on near real-time software implementations.
The aim of this thesis was to implement a near real-time Big Data infrastructure that can support clickstream analysis. The clickstream analysis itself was limited mainly to the user sessionization function. The infrastructure architecture used open-source software to provide five core data capabilities: ingestion (consuming the click records), transformation (data cleaning, user sessionization, user-agent enrichment), storage, analytics (insights), and visualization (presenting the insights in an accessible form).
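The core of the sessionization function mentioned above can be illustrated with a minimal sketch: group each user's clicks into sessions, starting a new session whenever the gap since the user's previous click exceeds an inactivity timeout. Note this is an assumed illustration, not code from the thesis; the field names, the 30-minute timeout, and the session-id format are all hypothetical choices.

```python
from datetime import datetime, timedelta

# Assumed inactivity timeout; 30 minutes is a common convention in web
# analytics, not a value taken from the thesis.
SESSION_GAP = timedelta(minutes=30)

def sessionize(clicks):
    """Assign a session id to each (user, timestamp) click record.

    A new session starts for a user whenever the gap since that user's
    previous click exceeds SESSION_GAP. Returns (user, ts, session_id)
    tuples, sorted per user by timestamp.
    """
    last_seen = {}    # user -> timestamp of the previous click
    session_no = {}   # user -> running session counter
    result = []
    for user, ts in sorted(clicks, key=lambda c: (c[0], c[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev > SESSION_GAP:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append((user, ts, f"{user}-{session_no[user]}"))
    return result

clicks = [
    ("alice", datetime(2019, 5, 1, 10, 0)),
    ("alice", datetime(2019, 5, 1, 10, 10)),
    ("alice", datetime(2019, 5, 1, 11, 0)),   # 50-minute gap -> new session
    ("bob",   datetime(2019, 5, 1, 10, 5)),
]
for user, ts, sid in sessionize(clicks):
    print(user, ts.time(), sid)
```

In the thesis this logic runs as a parallel Spark job rather than a single-process loop, but the stateful pattern is the same: per-user ordering, a last-seen timestamp, and a gap test deciding session boundaries.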
The implementation proceeded iteratively, moving step by step through different technical options. The iterations followed a simple scheme: What is easy to install, configure, and test? What is general enough to solve more complex requirements? What can be removed? In particular, the ease of implementing the sessionization algorithm served as a benchmark for comparing the infrastructure iterations.
There were four design iterations and four infrastructure implementations. The first realization was a zero-coding infrastructure. The second phase delivered a more capable parallel data processing component based on Apache Spark, a central framework in this work. The next implementation simplified the data storage and began exploring the Apache Spark streaming features. The last experiment showed that streams of clickstream data, arriving continuously from a web log, can be processed with low latency using Apache Spark Structured Streaming.
Spark Structured Streaming has a few SQL limitations that require adapting the algorithms and the processing sequence. However, Spark Structured Streaming is still young, and there are good reasons to believe its limitations will shrink. Beyond the central place of Spark, many technologies could set new directions or bring improvements to this work; for example, specialized databases, since clickstream records form both a time series and a graph.