  • Universities of Applied Sciences
  • Tampere University of Applied Sciences
  • Theses (Open collection)

Near real-time clickstream analysis : a journey in big data systems and architectures

Corrao, Salvatore (2019)

 

All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2019051510083
Abstract
Clickstream analysis studies the records generated as a user clicks through web pages. The field is now part of the Big Data phenomenon and relies on near real-time software implementations.

The aim of this thesis was to implement a near real-time Big Data infrastructure that can support clickstream analysis. The analysis itself was limited mainly to the user sessionization function. The infrastructure architecture used open-source software to enable five core data capabilities: ingestion (consuming the click records), transformation (data cleaning, user sessionization, user agent enrichment), storage, analytics (insights) and visualization (presenting the insights in an accessible form).
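To make the sessionization step concrete, here is a minimal sketch in plain Python. It is not the thesis's implementation: it assumes the common convention that a session ends after 30 minutes of inactivity, and the record layout `(user_id, timestamp, url)` and the function name `sessionize` are hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: 30 minutes of inactivity closes a session (a common
# convention, not a value taken from the thesis).
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Tag each (user_id, timestamp, url) click with a per-user session number.

    Returns a list of (user_id, session_no, timestamp, url) tuples,
    ordered by user and then by time.
    """
    # Sort by user, then by timestamp, so each user's clicks are contiguous.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    result = []
    for user_id, user_clicks in groupby(ordered, key=itemgetter(0)):
        session_no = 0
        previous = None
        for _, ts, url in user_clicks:
            # A gap longer than the timeout starts a new session.
            if previous is not None and ts - previous > timeout:
                session_no += 1
            result.append((user_id, session_no, ts, url))
            previous = ts
    return result


if __name__ == "__main__":
    t0 = datetime(2019, 1, 1, 12, 0)
    sample = [
        ("u1", t0, "/home"),
        ("u2", t0 + timedelta(minutes=1), "/home"),
        ("u1", t0 + timedelta(minutes=5), "/products"),
        ("u1", t0 + timedelta(minutes=50), "/home"),
    ]
    for row in sessionize(sample):
        print(row)
```

This batch version sorts all clicks up front. A streaming engine cannot do that, and has to keep per-user state between arrivals instead.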

The implementation proceeded iteratively, moving step by step through different technical options. Each iteration asked the same simple questions: What is easy to install, configure, and test? What is general enough to solve more complex requirements? What can be removed? In particular, the ease of implementing the sessionization algorithm served as a benchmark for comparing the infrastructure iterations.

Four design iterations produced four infrastructure implementations. The first was a zero-coding infrastructure. The second delivered a more capable parallel data-processing component based on Apache Spark, a central framework in this work. The third simplified the data storage and began exploring Apache Spark's streaming features. The last experiment showed that clickstream data arriving continuously from a weblog can be processed with low latency using Apache Spark Structured Streaming.

Spark Structured Streaming imposes a few SQL limitations that require adapting the algorithms and the processing sequence. It is, however, still in its infancy, and there are good reasons to expect these limits to ease. Beyond Spark's central role, many other technologies could set new directions for this work or improve on it; specialized databases are one example, since clickstream records form both a time series and a graph.
Collections
  • Theses (Open collection)
Theses and publications of the universities of applied sciences