Designing a smart search pipeline with web scraping and LLM-based retrieval on the HAMK website
Zou, Qiaoqiao (2025)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2025060921741
Abstract
This thesis explores the practical application of a Retrieval-Augmented Generation (RAG) system for improving information access on the HAMK website. Although the site is flat and static in structure, users often struggle to locate specific content, as they must manually browse and read through multiple pages. To address this issue, an automated solution was developed to retrieve website information efficiently and accurately. To guide the development, this thesis focuses on answering the following research questions: What technologies should be used to develop the application? What process should the application follow? Can the approach be extended to other websites?
The system is built upon three core technologies: web scraping, semantic embeddings, and Large Language Models (LLMs). SentenceTransformer was used to convert raw text into dense vector representations, FAISS indexed these vectors and performed similarity search, and an LLM generated responses based on the retrieved content. Gradio served as the front-end interface to facilitate user interaction.
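As a minimal sketch of this retrieval step, the snippet below encodes scraped page text with SentenceTransformer, indexes the vectors in FAISS, and looks up the passage most similar to a user query. The model name, the example documents, and the use of an L2 flat index are illustrative assumptions; the thesis body specifies the actual configuration.

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Hypothetical page texts scraped from the HAMK website.
documents = [
    "HAMK offers bachelor's degree programmes in English.",
    "The Riihimäki campus focuses on engineering education.",
]

# 1) Convert raw text into dense vector representations.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
doc_vectors = model.encode(documents, convert_to_numpy=True).astype(np.float32)

# 2) Index the vectors in FAISS for similarity search.
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 3) Retrieve the most similar passage for a query; in the full pipeline,
#    this retrieved text is passed to the LLM as context for generation.
query = model.encode(["Which campus teaches engineering?"], convert_to_numpy=True).astype(np.float32)
distances, ids = index.search(query, k=1)
print(documents[ids[0][0]])
```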
This is a practical, implementation-oriented thesis. The proposed system was developed using the Scrapy framework for web scraping, as it effectively extracts structured content from static websites. The final application integrates all components—Scrapy, SentenceTransformer, FAISS, and LLMs—into a complete pipeline, with Gradio providing a lightweight and interactive user interface (UI).
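The scraping component can be pictured with a minimal Scrapy spider sketch like the one below, which collects the visible paragraph text of static pages and follows internal links. The start URL and CSS selectors are illustrative assumptions, not the spider actually used in the thesis.

```python
import scrapy


class HamkSpider(scrapy.Spider):
    """Sketch of a spider that yields page text for later embedding."""

    name = "hamk"
    start_urls = ["https://www.hamk.fi/en/"]  # hypothetical entry point

    def parse(self, response):
        # Emit the page URL together with its paragraph text.
        yield {
            "url": response.url,
            "text": " ".join(response.css("p::text").getall()),
        }
        # Follow links to crawl further static pages.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```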
The remainder of this thesis presents the development background, theoretical methods, tool selection rationale, and implementation process. The research demonstrates how to build a smart, RAG-based search application that enables faster and more accurate information retrieval on static websites.