Data Mining Thesis Topics in Finland
Bajo Rouvinen, Ari (2017)
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-201704084467
Abstract
The Theseus open repository contains metadata on more than 100,000 theses from the universities of applied sciences in Finland. In this project, data mining techniques were applied to the Theseus dataset, using Python and JavaScript libraries, to build a web application for exploring thesis topics and degree programmes. Thesis topics were extracted from keywords manually assigned by the authors and from subjects curated by librarians. During the project, how well these keywords and subjects represent the thesis topics was evaluated, and several data quality issues were identified.
The deliverables are this written thesis, which presents the data mining techniques applied to the Theseus dataset; the open-source code used to mine the thesis metadata; and a web application accessible at www.ammattiko.com. The web application lets users discover popular topics for a university or a selection of degrees and popular degrees for a selection of topics, and explore related topics and related degrees.
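The two lookups described above (popular topics for selected degrees, and popular degrees for selected topics) can be illustrated with a minimal sketch. It is not the code behind the web application: the records, the function names and the counting rule are assumptions made for illustration, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical records: (degree programme, topics) pairs standing in for
# thesis metadata. These are not taken from the Theseus dataset.
theses = [
    ("Nursing", ["elderly care", "patient safety"]),
    ("Nursing", ["patient safety", "mental health"]),
    ("Business IT", ["web applications", "data mining"]),
    ("Business IT", ["data mining", "marketing"]),
    ("Marketing", ["marketing", "social media"]),
]


def popular_topics(records, degrees, n=3):
    """Return the n most frequent topics among theses from the selected degrees."""
    counts = Counter(
        topic
        for degree, topics in records
        if degree in degrees
        for topic in topics
    )
    return counts.most_common(n)


def popular_degrees(records, selected_topics, n=3):
    """Return the n degrees with the most theses mentioning any selected topic."""
    wanted = set(selected_topics)
    counts = Counter(
        degree for degree, topics in records if wanted & set(topics)
    )
    return counts.most_common(n)
```

For example, `popular_topics(theses, {"Nursing"}, n=1)` returns the single most frequent topic among the hypothetical nursing theses, together with its count.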
Special focus was put on comparing the results of different dimensionality reduction and clustering techniques to visualize similar degrees based on topics. t-SNE proved to be a powerful method for visualizing degrees on a two-dimensional interactive map, and hierarchical clustering was found to be the most flexible technique for obtaining multiple clusterings at different levels.
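To illustrate why hierarchical clustering yields clusterings at several levels, the following sketch performs agglomerative clustering and records the partition after every merge. It is not the thesis implementation: the topic-count vectors, the cosine distance and the average linkage are assumptions chosen for illustration, and only the Python standard library is used. t-SNE is not shown here.

```python
import math

# Hypothetical topic-count vectors per degree
# (columns: care, safety, data, marketing). Not real Theseus data.
degrees = {
    "Nursing": [5, 4, 0, 0],
    "Physiotherapy": [4, 2, 0, 0],
    "Business IT": [0, 0, 6, 2],
    "Marketing": [0, 0, 1, 7],
}


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def clusterings_by_level(vectors):
    """Merge the two closest clusters repeatedly.

    Returns a dict mapping the number of clusters to the partition
    (a set of frozensets) that exists at that level of the hierarchy.
    """
    clusters = [frozenset([name]) for name in vectors]
    levels = {len(clusters): set(clusters)}
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels[len(clusters)] = set(clusters)
    return levels


levels = clusterings_by_level(degrees)
```

A single run produces one partition per level, so `levels[2]` gives a coarse grouping and `levels[3]` a finer one without clustering again.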