Accuracy of different Machine Learning techniques at predicting students' study path selection
Ruegger, Adrien (2021)
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-202105209629
Abstract
Machine learning has changed how data is used and analyzed. Although it is built on long-established algorithms, the sheer amount of data collected today has made it a staple technology, because pattern recognition and other statistical methods require large sample sizes to yield relevant results.
The overarching goal of the thesis was to introduce Machine Learning in an intuitive and readily comprehensible way, as well as encourage schools to consider machine learning-driven personalized guidance as a future tool to improve educational offering. It teaches concepts such as artificial intelligence, statistical analysis and machine learning algorithms.
The study illustrates this knowledge by comparing different machine learning libraries, namely scikit-learn and TensorFlow, at predicting students’ study path selection. The data used for this research comes from BITe students at Haaga-Helia in Finland and HES-SO ARC in Switzerland, 97 of whom responded to a questionnaire.
The questionnaire focused on study level, study path selection, demographic questions, as well as statements about BITe studies judged using a Likert scale.
The implementation was built exclusively in Python. Four algorithms were tested on the dataset: logistic regression, support-vector machine, k-nearest neighbor and decision tree. In addition, a deep learning neural network was compared against these algorithms.
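As an illustration of the four classical algorithms named above, the following is a minimal sketch using scikit-learn. It is not the thesis' actual code: the hyperparameters, the helper name build_models and the randomly generated stand-in data (97 respondents, 10 Likert items, 4 study paths) are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return one instance of each classical algorithm (hypothetical settings)."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "support_vector_machine": SVC(),
        "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }


# Synthetic stand-in for the survey, NOT the real data:
# 97 respondents, 10 Likert items scored 1-5, 4 hypothetical study paths.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(97, 10))
y = rng.integers(0, 4, size=97)

models = build_models()
for name, model in models.items():
    model.fit(X, y)                      # train on all rows, for illustration only
    print(name, model.predict(X[:3]))    # predicted study path for 3 respondents
```

All four estimators share the same fit and predict interface, which is what makes a like-for-like comparison straightforward.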
As comparison was the main element of this research, each method was tested 500 times and the best, worst and mean accuracy of each technology were extracted. Each run used a new random split between the training set (75% of the data) and the testing set (25% of the data), with the proportion of each study path preserved in both sets.
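The repeated evaluation described above can be sketched as follows. This is an assumption about how such a procedure might look with scikit-learn, not the thesis' implementation: the function name repeated_accuracy, the choice of k-nearest neighbors and the random stand-in data are hypothetical. The stratify argument of train_test_split is what keeps the class proportions equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def repeated_accuracy(make_model, X, y, n_runs=500, test_size=0.25):
    """Train and score a fresh model on n_runs random stratified splits."""
    scores = []
    for seed in range(n_runs):
        # stratify=y preserves the proportion of each study path in both sets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = make_model()
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))  # accuracy on the test set
    scores = np.array(scores)
    return {
        "best": float(scores.max()),
        "worst": float(scores.min()),
        "mean": float(scores.mean()),
    }


# Synthetic stand-in for the survey, NOT the real data.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(97, 10))
y = rng.integers(0, 4, size=97)

# 50 runs keep the example fast; the study used 500.
summary = repeated_accuracy(lambda: KNeighborsClassifier(n_neighbors=5), X, y, n_runs=50)
print(summary)
```

With only 97 rows, each test set holds about 25 respondents, so a single prediction shifts accuracy by roughly four percentage points, which helps explain a wide gap between the best and worst runs.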
The average accuracy ranged from 36% to 46%, while the best runs reached 67% to 83% and the worst 6% to 22%, highlighting the importance of cross-validation and preprocessing, especially with small sample sizes. The study also called attention to the similarities and differences between the two schools’ students and their mindsets, as shown by the survey results.
In line with the thesis’ objectives, this project could be used to showcase how to improve machine learning results with deeper implementations and iterations. The idea itself could be developed further and implemented by schools to better guide students towards their ideal studies.