Analysis of Customer Journey Video Data Using Eye-Tracking and Multimodal AI
Heikkinen, Sami; Väisänen, Jaani; Kaikonen, Hannu; Makkula, Sami; Markkanen, Petteri (2026)
Becomes publicly available: 05.01.2027
Editors
Chira, Camelia
Matei, Oliviu
Pop, Florin
Pop-Sitar, Petrică
Springer Nature
2026
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2026022516193
Abstract
This study introduces a novel methodological approach for analyzing customer journey data using eye-tracking technology and multimodal AI. Customer journey research faces significant limitations due to reliance on resource-intensive qualitative methods that are difficult to scale. We address this gap by employing vision language models (VLMs) to automate the interpretation of eye-tracking data, eliminating the manual coding bottleneck while enabling analysis of larger and more diverse datasets. Our research evaluates five locally-deployable VLMs to determine the optimal balance between semantic accuracy and computational efficiency. Using data collected with Tobii eye-tracking glasses, we developed an analytical pipeline that synchronizes gaze location data with verbal think-aloud protocols to create a comprehensive multimodal dataset. Results demonstrate that Gemma3 (4B parameters) achieved 100% semantic accuracy on our test set while maintaining reasonable processing efficiency (43.26 s per image). When validated against human coding across the complete dataset, the model achieved a 74.2% recall rate. The integration of eye-tracking and verbal data revealed distinctive attention patterns including “navigational uncertainty,” “confirmatory scanning,” and “socially-mediated attention” throughout the customer journey. Our approach provides objective behavioral evidence of visual attention that complements traditional self-reported measures, enabling more comprehensive touchpoint analysis while aligning with event-driven perspectives from process mining research. This methodology offers promising applications for service design by identifying discrepancies between reported and actual customer attention patterns and providing a foundation for developing automated behavioral indicators to detect moments of customer confusion, decision-making, or confirmation throughout service journeys.
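The synchronization of gaze location data with think-aloud speech mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `GazeSample` and `Utterance` types, their field names, and the `align` function are hypothetical, and the sketch assumes that both streams share one clock (seconds from recording start) and that utterances do not overlap in time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GazeSample:
    # Hypothetical record: one gaze point from the eye-tracking glasses.
    t: float  # seconds from recording start
    x: float  # normalized horizontal gaze position (0..1)
    y: float  # normalized vertical gaze position (0..1)


@dataclass
class Utterance:
    # Hypothetical record: one transcribed think-aloud segment.
    start: float  # seconds from recording start
    end: float    # seconds from recording start (exclusive)
    text: str


def align(
    gaze: List[GazeSample], utterances: List[Utterance]
) -> List[Tuple[GazeSample, Optional[str]]]:
    """Pair each gaze sample with the utterance spoken at that moment.

    Assumes utterances do not overlap. Samples that fall in silence
    are paired with None.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    starts = [u.start for u in ordered]
    aligned = []
    for g in gaze:
        # Index of the last utterance that starts at or before g.t.
        i = bisect_right(starts, g.t) - 1
        if i >= 0 and g.t < ordered[i].end:
            aligned.append((g, ordered[i].text))
        else:
            aligned.append((g, None))
    return aligned


if __name__ == "__main__":
    gaze = [GazeSample(0.5, 0.1, 0.2), GazeSample(1.5, 0.4, 0.5), GazeSample(3.2, 0.7, 0.3)]
    speech = [Utterance(1.0, 2.0, "where is the exit"), Utterance(3.0, 4.0, "ah, there")]
    for sample, text in align(gaze, speech):
        print(sample.t, text)
```

Binary search over the sorted utterance start times keeps the lookup logarithmic per gaze sample, which matters because eye trackers produce many samples per second while speech segments are comparatively few.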