Fake news detection using natural language processing and machine learning: a comparative study of supervised algorithms and text representation techniques

Nguyen, Loi

Nguyen, Loi (2025)

The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2025101125915

Abstract

The widespread circulation of fake news on digital platforms causes serious social problems, highlighting the need for precise and scalable detection methods. This study is a comparative review of existing research that assesses how supervised machine learning algorithms and text representation techniques perform in fake news detection. Its main purpose was to identify the most effective combinations of feature extraction techniques and algorithms for improving classification performance. Following a comprehensive evaluation methodology, pertinent peer-reviewed publications were retrieved from databases such as IEEE Xplore, ScienceDirect, and SpringerLink using search phrases such as “supervised learning”, “fake news detection”, “Bag-of-Words”, and “word-embeddings”. Bag-of-Words (BoW) and word embeddings (Word2Vec, GloVe) were the two main text representations investigated, and they were contrasted in combination with supervised algorithms such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and Naïve Bayes (NB). Comparative tables were used to code the reviewed studies by dataset, feature extraction technique, classifier, and performance indicator. The results showed that embedding-based techniques outperformed BoW. More sophisticated classifiers, such as SVM and ensemble models, performed better in the majority of scenarios. Pre-trained embeddings outperformed standard lexical models in accuracy and F1-score on datasets such as LIAR and FEVER.
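To make the lexical baseline concrete, the following is a minimal sketch of a Bag-of-Words representation, not the procedure used in any of the reviewed studies. The two sentences, the whitespace tokenisation, and the function names `build_vocabulary` and `bag_of_words` are assumptions made for illustration only.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy corpus, invented for this sketch.
docs = ["breaking news shocks the world", "the world reads the news"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Each vector records only how often a word occurs, so word order and meaning are discarded. Embedding-based representations such as Word2Vec and GloVe instead map words to dense vectors learned from large corpora.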
The degree of preprocessing, the sophistication of the algorithm, and the quality of the data all had a major impact on detection performance. Although the results were encouraging, some limitations remained, including inconsistent evaluation criteria, imbalanced datasets, and differing preprocessing techniques across studies. One practical implication is that the choice of a suitable model depends on the data available and on computing capacity. Future studies are suggested to investigate deep learning and unsupervised learning techniques and to create standardised datasets for more reliable assessment.
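The two performance indicators named in the abstract, accuracy and F1-score, can diverge sharply on imbalanced data, which is one reason results are hard to compare across studies. The sketch below computes both from scratch; the label lists and the function name `accuracy_and_f1` are hypothetical and are not taken from the reviewed studies.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Hypothetical imbalanced labels: 1 = fake, 0 = real (2 fake items out of 10).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10  # a classifier that always predicts "real"
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

In this toy case the classifier never detects a fake item, yet its accuracy is 0.8 while its F1-score for the fake class is 0.0.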

Collections

Theses (Open collection)