Deep Learning Approaches for Classifying Informal and Formal English Texts Using Linguistic Features
Date
2025-01
Author
Karunarathna, KMGS
Rupasingha, RAHM
Kumara, BTGS
Abstract
Effective techniques for automatically classifying texts are becoming increasingly necessary due to the
exponential expansion of digital material. Differentiating between formal and informal documents can help students
identify appropriate resources for their assignments and improve the effectiveness of information retrieval systems.
Although machine learning is widely used for text classification, little research has focused on the effective
differentiation of formal and informal writings through linguistic features. This gap highlights the necessity for advanced
methodologies that improve classification accuracy and enhance the value of digital content in academic and retrieval
systems. Our research addresses this problem by applying deep learning methods to a set of 13 linguistic
attributes to improve text classification performance. Artificial Neural Networks (ANN), Convolutional Neural
Networks (CNN), and Long Short-Term Memory Networks (LSTM) were considered. A dataset including both formal
(news articles, formal documents) and informal (personal letters, personal blogs) texts was gathered from several web
sources. We considered linguistic markers such as colloquialisms, contractions, modal verbs, slang, acronyms, pronouns,
phrasal verbs, grammar complexity, vocabulary complexity, voice, and language type to generate the feature vector. The
feature vectors were used to train and assess the classification models under k-fold cross-validation with
3, 5, 7, and 10 folds. The models were evaluated using four performance indicators: F-measure,
accuracy, precision, and recall. With the highest accuracy of 99.8% and robustness in differentiating between formal and
informal texts, the LSTM model outperformed the others. Future research will examine larger datasets, more linguistic
characteristics, more sophisticated deep learning models, and real-time and multilingual classification systems.
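
The pipeline described in the abstract (linguistic markers turned into a feature vector, k-fold evaluation, and the four reported metrics) can be illustrated with a minimal sketch. It is not the authors' implementation: the word lists, the regular expression, and the function names `feature_vector`, `k_fold_indices`, and `classification_metrics` are all hypothetical, only three of the 13 attributes are covered, and no neural model is trained.

```python
import re

# Hypothetical, deliberately tiny lexicons; the abstract does not list the ones used.
CONTRACTION_RE = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)
MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them"}


def feature_vector(text):
    """Return [contraction rate, modal-verb rate, pronoun rate] per token."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    contractions = len(CONTRACTION_RE.findall(text))
    modals = sum(t in MODAL_VERBS for t in tokens)
    pronouns = sum(t in PRONOUNS for t in tokens)
    return [contractions / n, modals / n, pronouns / n]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f_measure) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure


if __name__ == "__main__":
    print(feature_vector("I can't go, but you should."))
    for k in (3, 5, 7, 10):  # the fold counts named in the abstract
        print(k, [len(test) for _, test in k_fold_indices(20, k)])
```

In practice the per-fold predictions of a trained model (ANN, CNN, or LSTM) would be passed to `classification_metrics` and averaged across folds; shuffling or stratifying the samples before splitting is usually advisable and is omitted here for brevity.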