Distinguishing Truth from Deception: A Machine Learning Approach to Fake News Detection
Mary Ann Cabilao Paulin1, Efren I. Balaba2
International Journal of Information Technology, Research & Application ISSN: 2583-5343, Vol. 4 No. 4: Dec 2025
Mary Ann Cabilao Paulin, Efren I. Balaba (2025). Distinguishing Truth from Deception: A Machine Learning Approach to Fake News Detection, Issue 4(4), 33-40.
1,2Dept. of Information Technology, Southern Leyte University, Philippines

Article history:
Received May 14, 2025
Revised Dec 16, 2025
Accepted Dec 31, 2025
Keywords:
Fake news detection
Machine learning
LSTM
SVM
Natural language processing
Misinformation
Feature analysis
IPO model
Statistical validation

ABSTRACT

The fast-paced dissemination of false information on social media endangers not only public trust but also political stability and societal cohesion. To address this issue, the present article develops a machine learning framework for fake news identification, structured around the Input-Process-Output (IPO) model to organize the research systematically. Combining Natural Language Processing (NLP) tools, statistical feature validation, and supervised learning models, specifically Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) networks, this research aims to create a stable, interpretable, and reliable classification system. Textual data were gathered from both credible and non-credible news sources, on which sentiment analysis, TF-IDF vectorization, and syntactic feature extraction were performed as initial processing tasks. Statistical techniques such as Chi-square tests, T-tests, and Pearson correlation coefficients were applied to pinpoint Feature 100 as the key attribute among those examined. The findings reveal that the LSTM model significantly outperformed SVM in classification accuracy, achieving high precision and recall rates and an overall accuracy of 94%. The study's central point is that uniting statistical methods with deep learning models makes fake news detection substantially more effective. This research adds new knowledge about building a dependable automatic misinformation filtering system and contributes to a safer digital information environment.
This is an open access article under the CC BY-SA license.
Corresponding Author:
 Mary Ann Cabilao Paulin
Department of Information Technology
Southern Leyte University 
Philippines
Email: anncabilaopaulin22@gmail.com

Introduction

Digital platforms are expanding at a rapid rate and revolutionizing the spread of information, making it accessible to large numbers of people and allowing them to keep up with the latest news. Nevertheless, this growth has also widened the spread of fake news: misinformation or intentionally misleading information meant to manipulate the public. Fake news can lead people to make poor decisions, distort public discourse, and pose a significant threat to democratic institutions. Solving this issue calls for new and better technological options, and machine learning is one approach capable of weeding out misinformation.
The fundamental objective of the study is to propose a machine learning-oriented methodology for the detection of fake news, which embraces Natural Language Processing (NLP) techniques, statistical feature validation, and supervised learning models, specifically, Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) networks. Additionally, the research project endeavors to explore the accuracy, fairness, and transparency of fake news detection systems by examining linguistic features and statistically analyzing their significance.
Prior work has approached the application of machine learning to fake news in various ways. One article compared several machine learning algorithms and found that supervised learning models, namely Support Vector Machines (SVM) and Long Short-Term Memory (LSTM) networks, delivered very good fake news classification accuracy. Similarly, another author proposed a new deep learning-based framework to improve the authenticity of information by correctly distinguishing reliable from unreliable news sources.
In other research, the main points stressed by the researchers were feature selection and implementation in fake news recognition by means of natural language processing (NLP), with the finding that these techniques are effective at making models less biased and more precise. Other researchers reported that a machine learning approach combined with linguistic analysis was successful in detecting intentional misinformation in the news. Furthermore, several authors contributed comprehensive reviews of these enhancements, pointing out the need to continually refine machine learning methods. The present study's results fall within the larger framework of evaluating IT-based services, including online transactions and activities, by preventing misinformation from disrupting the balance of trust and reliability. This research introduces machine learning as a powerful approach to fake news identification, making it possible to build a more thoughtful and sustainable digital society through more accurate and reliable news.

Conceptual Framework

This study examines the effectiveness of modern machine learning techniques in detecting fake news. The research was motivated by the growing problem of online misinformation, which misleads people and affects political, social, and economic issues. Given the great potential of machine learning, practitioners in the field can use Natural Language Processing (NLP) and deep learning to improve the accuracy of fake news detection systems. Automating fake news identification via machine learning is a strategic necessity because traditional manual fact-checking methods are labor-intensive and slow.
Existing literature has highlighted a number of difficulties in fake news detection, the biggest among them being the subtlety of grammatical markers that distinguish true from false news and the fact that sources differ widely in writing style. Earlier research has also shown that these challenges are much easier to address with supervised machine learning models such as Support Vector Machines (SVMs) and Long Short-Term Memory (LSTM) networks.
Even so, gaps remain in the knowledge, particularly regarding the comprehensive evaluation of machine learning algorithms with respect to feature importance, dataset characteristics, and interpretability.
Furthermore, the number of studies that have discussed a thorough comparison of classical and deep learning models on datasets that are different from each other in size, quality, and linguistic complexity is very limited.
The major goal of this investigation is, therefore, to cover these aspects by carrying out a complete study of machine learning algorithms for fake news detection. Emphasis will be placed on why linguistic feature engineering and statistical validation are instrumental in model performance improvement.
Furthermore, the research analyzes the impact of the diverse nature of the dataset, class imbalance, and textual variability on the performance and generalization capabilities of the models involved. Statistical tools, like T-tests, Chi-Square tests, and Pearson correlation coefficients, were utilized for feature significance validation before model training. Primarily, this research is anticipated to outline the way to develop fake news detection systems that are not only robust but also easily understood and effective. Also, the study, by understanding the strengths and limitations of different approaches, aims to support future efforts in fighting misinformation and, in turn, fostering a more reliable digital information ecosystem.
image: e_a9e14a0c15cb_Fig-1.png
Figure 1: Conceptual Framework

Methodology

The study takes a quantitative approach based on the Input-Process-Output (IPO) model, which served as the foundation for constructing the fake news detection system. The stages of the investigation, including data collection, preprocessing, feature extraction, model training, and testing, were arranged to guarantee that all methods used in the study were scientific and repeatable.

A. Research Design

The procedure used in the study follows the IPO model. In the Input phase, a comprehensive dataset is gathered from both reliable (e.g., Reuters, BBC) and unreliable sources (e.g., known misinformation platforms). The Process stage entails the deployment of Natural Language Processing (NLP) techniques along with machine learning algorithms, namely Support Vector Machines (SVM) and Long Short-Term Memory (LSTM) networks. The Output phase covers the analysis of model performance by means of classification metrics, as well as testing the importance of features using statistical methods. This design allows for a methodical and replicable approach, consistent with the requirements of fake news detection for accuracy, fairness, and generalization.

B. Data Collection

Text data were drawn from the LIAR dataset and FakeNewsNet, publicly accessible databases containing labeled real and fake news articles.
Articles were selected using criteria that included linguistic diversity, topic variety, and source credibility. Each article was coded as either "fake" or "real," depending on the verdict previously given by community consensus or a professional fact-checker.

C. Preprocessing and Feature Extraction

Multiple stages of refinement and handling were applied to the raw text.
The deceptive-feature extraction strategy was based on two groups of indicators: surface-level (lexical) features and deep-level (syntactic and semantic) markers of deception. We also used linguistic features such as emotional exaggeration, modal verbs, and passive-voice usage.
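A minimal sketch of this two-level feature extraction, assuming scikit-learn's TfidfVectorizer for the surface-level lexical features; the modal-verb list and exclamation-mark count below are illustrative stand-ins for the study's actual deep-level lexicons, which are not published here:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative modal-verb lexicon (an assumption, not the paper's feature list).
MODALS = {"might", "could", "should", "must", "may", "would"}

def linguistic_features(text: str) -> dict:
    """Count simple deep-level cues: modal verbs, plus exclamation
    marks as a crude proxy for emotional exaggeration."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "modal_count": sum(t in MODALS for t in tokens),
        "exclamations": text.count("!"),
        "n_tokens": len(tokens),
    }

docs = [
    "Scientists confirm the results in a peer-reviewed study.",
    "SHOCKING!!! You must see what they could be hiding!!!",
]

# Surface-level lexical features: TF-IDF over unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
X_lexical = tfidf.fit_transform(docs)

deep = [linguistic_features(d) for d in docs]
print(X_lexical.shape, deep[1])
```

In practice the sparse TF-IDF matrix and the dense linguistic counts would be concatenated into a single feature matrix before model training.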

E. Machine Learning Models

Two supervised learning models were used: Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) networks.
Both models were trained on 80% of the dataset and tested on the remaining 20%. Hyperparameter tuning was conducted using grid search, and model validation was carried out through 5-fold cross-validation.
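The split and tuning procedure described above can be sketched with scikit-learn as follows; the synthetic data and the parameter grid are illustrative, not the ones actually used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF feature matrix and fake/real labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# 80/20 train-test split, stratified to preserve class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Grid search over an illustrative SVM grid, with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.best_score_, 3))
print("held-out f1:", round(search.score(X_te, y_te), 3))
```

The same split and cross-validation scheme applies to the LSTM, with the grid ranging over architecture choices (units, dropout, learning rate) instead of kernel parameters.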

F. Evaluation Metrics

The models were evaluated using the following metrics:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

These metrics were computed from the confusion matrix generated during classification, where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
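Under these definitions, all four metrics follow directly from the confusion-matrix counts; a small sketch with illustrative counts (not the study's actual confusion matrix):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts for a 200-article test set.
m = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print({k: round(v, 3) for k, v in m.items()})
```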

G. Statistical Analysis

Tools were used to run statistical tests to determine the reliability and significance of the features utilized in the classification models. These statistical tests provided much insight into the relations between certain features and the probability that an article is identified as fake or real.
Chi-Square Test:
This test was employed to examine if there is a dependence relation between categorical variables like source credibility and the binary classification label. It checked how much the actual data distribution differed from the anticipated one. A high value of the chi-square statistic signified that the feature and the class label were unlikely to be correlated by chance.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
T-Test:
Used to compare the means of feature distributions between the fake and real news groups. The test was employed to assess differences in the mean scores of continuous features, such as sentiment polarity, between the two groups, with a significant p-value (< 0.05) indicating a meaningful difference. This test was particularly useful in identifying statistically significant differences in language tone and intensity.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Pearson Correlation Coefficient:
Calculated to measure the linear relationship between numerical features and the classification outcome. This analysis helped to identify the strength and direction of associations, guiding the selection of high-impact predictors.
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Collectively, these tests ensured that the features used were statistically significant and that the model's predictive decisions were supported by genuine relationships rather than random variation. The application of these methods enhanced the validity and interpretability of the classification process, as demonstrated in similar studies.
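All three tests are available in SciPy; a sketch on synthetic data (the contingency table and feature arrays are illustrative, not the study's actual features):

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr, ttest_ind

rng = np.random.default_rng(0)

# Chi-square: dependence between a categorical feature (source credibility)
# and the fake/real label, from a 2x2 contingency table.
table = np.array([[120, 30],    # credible source:     real, fake
                  [40, 110]])   # non-credible source: real, fake
chi2, p_chi, dof, expected = chi2_contingency(table)

# T-test: difference in mean sentiment polarity between real and fake articles.
polarity_real = rng.normal(0.1, 0.2, 200)
polarity_fake = rng.normal(0.4, 0.2, 200)
t_stat, p_t = ttest_ind(polarity_real, polarity_fake, equal_var=False)

# Pearson correlation: linear association between a numeric feature and the label.
labels = rng.integers(0, 2, 200)
feature = labels + rng.normal(0, 0.5, 200)  # feature built to track the label
r, p_r = pearsonr(feature, labels)

print(round(chi2, 1), round(t_stat, 1), round(r, 2))
```

A feature would be retained when its test yields p < 0.05, which is the screening role these tests play before model training.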
H. Implementation Tools
The system was implemented in Python 3.10 with libraries such as Keras and TensorFlow for the LSTM network and NLTK for text preprocessing. Evaluation and visualization were conducted using Matplotlib and Seaborn.
This rigorous and well-defined methodology ensures the scientific validity and reproducibility of the research findings in detecting fake news through machine learning.

Results and Discussion

4.1 Feature Selection Analysis

To determine each feature’s relevance to the target classification, we conducted three statistical assessments: a T-test, a Chi-Square test, and a Pearson correlation, visualized in Figures 2 to 4. As shown in Figure 2, the majority of features exhibit relatively low t-statistic values, implying minimal individual discriminative ability. Notably, Feature 100 recorded a significantly high t-value, suggesting it holds strong statistical relevance for class separation. This outlier indicates that Feature 100 may play a crucial role in improving model performance.
image: e_478e69912514_Fig-2.png
Figure 2: T-Test Scores for Each Feature
The Chi-square test, presented in Figure 3, reinforces the prior observation. Feature 100 achieves a high score, whereas most other features have negligible values. This result indicates a strong dependency between Feature 100 and the target label, highlighting its potential importance in classification.
image: e_1af55acf79a0_Fig-3.png
Figure 3: Chi-Square Feature Dependency Scores
In Figure 4, the correlation coefficients for the features show that most variables have very weak linear relationships with the target class (values close to zero). However, Feature 100 again stands out.

4.2 LSTM Model Evaluation

After identifying the most informative features, we trained a Long Short-Term Memory (LSTM) network on the dataset. In Figure 4, Feature 100 displayed a modest but meaningful positive correlation with the target (approximately 0.08); though not high in absolute terms, its relative strength supports its consistency as a predictive feature.
image: e_564fb0818c7f_Fig-4.png
Figure 4: Correlation Coefficients with Targets
The performance of the model was assessed with the help of a confusion matrix and classification report, as shown in Figure 5 and Table 1. The confusion matrix in Figure 5 indicates that the model correctly classified approximately 6,680 cases of each class while misclassifying comparatively few.
Table 1: LSTM Classification Metrics
              precision    recall  f1-score   support

           0       0.92      0.94      0.93      7089
           1       0.94      0.93      0.94      7338

    accuracy                           0.94     14427
   macro avg       0.94      0.94      0.94     14427
weighted avg       0.94      0.94      0.94     14427
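A report in the format of Table 1 can be generated with scikit-learn; a sketch on illustrative labels and predictions (not the study's actual outputs):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative ground truth and predictions for a 10-article test set.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# Confusion matrix rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# output_dict=True returns the per-class and aggregate metrics as a dict.
report = classification_report(y_true, y_pred, output_dict=True)
print(round(report["accuracy"], 2), round(report["1"]["f1-score"], 2))
```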

CONCLUSION

The goal of the study has been reached: the researchers produced a machine learning-based model for recognizing fake news by combining Natural Language Processing (NLP) technologies, statistical validation of features, and supervised learning models, namely SVM and LSTM. Rigorous preprocessing, feature engineering, and statistical analysis were systematically applied to demonstrate the predictive power of Feature 100, which the research found to be relevant across several validation tests (T-test, Chi-Square, Pearson correlation). As a result, the model achieved balanced performance across both classes and consistently exhibited a low error rate.
Among the models tested, the Long Short-Term Memory (LSTM) network significantly outperformed the Support Vector Machine (SVM), achieving a remarkable 94% overall accuracy with balanced precision, recall, and F1 scores. This performance demonstrates that the LSTM's ability to capture long-range dependencies in text is well suited to the specific task of fake news detection and shows strong signs of generalization.
The study ultimately shows that merging statistical feature validation with deep learning architectures gives misinformation detection systems greater interpretability, accuracy, and fairness, which not only has a meaningful impact in itself but is also instrumental in producing a safer digital information environment.
Recommendations
  1. Further Feature Analysis Using Interpretability Tools: Researchers are encouraged to use SHAP (SHapley Additive Explanations), LIME (Local Interpretable Model-agnostic Explanations), or other interpretability methods to verify which features (such as Feature 100) are truly decisive in the LSTM model's learning process.
  2. Explore Dimensionality Reduction: Techniques such as Principal Component Analysis (PCA) or autoencoders could reduce feature dimensionality, potentially improving performance and computational efficiency while making the model easier to interpret.
  3. Model Comparison and Hybrid Approaches: Better insights could be gained by comparing alternative architectures such as CNNs and GRUs, or combinations thereof, against the LSTM to determine whether any of them offers a stronger model for identifying fake news.
  4. Dataset Expansion and Diversity: Adding more varied datasets, possibly from different domains (for example, health-related misinformation) and other languages, would help assess the model's wide applicability while exposing the contextual biases the model itself may carry.
  5. Real-world Deployment and Stress Testing: Launching the model in a real-world environment, such as a real-time social media monitoring system, would provide practical evidence of its efficiency, latency, scalability, and performance under live, noisy conditions.
ACKNOWLEDGEMENT
We extend our sincere gratitude to the Department of Information Technology at Southern Leyte University for providing the academic environment and resources necessary to conduct this research. Our deepest appreciation goes to the developers and maintainers of the FakeNewsNet, whose publicly available data were instrumental in validating our machine learning models.
We would also like to acknowledge the contributions of the open-source community, particularly the developers of Python libraries such as TensorFlow, Keras, NLTK, and Scikit-learn, which enabled the efficient implementation of our framework. Special thanks to the researchers cited in this work, whose pioneering studies on NLP and deep learning laid the foundation for our methodological approach.
Finally, we are grateful to our colleagues, peers, and reviewers for their constructive feedback, which significantly improved the quality of this research paper. Any remaining errors or omissions are solely our own.

References