Deep Learning Approaches for Classifying Informal and Formal English Texts Using Linguistic Features

Karunarathna, KMGS; Rupasingha, RAHM; Kumara, BTGS

View/Open

IJRC V 4 I (pages 9-22).pdf (495.4Kb)

Date

2025-01

Abstract

The exponential growth of digital material makes effective techniques for automatic text classification increasingly necessary. Differentiating between formal and informal documents can help students identify appropriate resources for their assignments and improve the effectiveness of information retrieval systems. Although machine learning is widely used in text classification, little research has focused on effectively differentiating formal and informal writing through linguistic features. This gap highlights the need for advanced methodologies that improve classification accuracy and enhance the value of digital content in academic and retrieval systems. Our research addresses the problem by applying deep learning methods to a set of 13 linguistic features to improve text classification performance. Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks were considered. A dataset including both formal (news articles, formal documents) and informal (personal letters, personal blogs) texts was gathered from several web sources. We considered linguistic markers such as colloquialisms, contractions, modal verbs, slang, acronyms, pronouns, phrasal verbs, grammar complexity, vocabulary complexity, voice, and language type to generate the feature vectors. The feature vectors were used to train and assess the classification models under k-fold cross-validation with 3, 5, 7, and 10 folds. The models were evaluated using accuracy, precision, recall, and F-measure. The LSTM model outperformed the others, achieving the highest accuracy of 99.8% and proving robust in differentiating between formal and informal texts. Future research will examine larger datasets, additional linguistic features, more sophisticated deep learning models, and real-time and multilingual classification systems.
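The pipeline described in the abstract (linguistic feature vector, k-fold splits, then accuracy, precision, recall, and F-measure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word lists, the regular expression, the three-feature vector, and the function names `feature_vector`, `k_fold_indices`, and `binary_metrics` are all hypothetical stand-ins chosen for illustration, and no neural model is included.

```python
import re

# Illustrative lexicons only (assumptions, not the lexicons used in the paper).
MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them"}
# Matches apostrophe endings such as can't, I'm, we'll; possessive 's is also counted.
CONTRACTION = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)


def feature_vector(text):
    """Return [contraction rate, modal-verb rate, pronoun rate] per token."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    contractions = len(CONTRACTION.findall(text))
    modals = sum(t in MODAL_VERBS for t in tokens)
    pronouns = sum(t in PRONOUNS for t in tokens)
    return [contractions / n, modals / n, pronouns / n]


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

In such a setup, one would call `k_fold_indices(len(texts), k)` for each k in 3, 5, 7, and 10, train a classifier on the training indices of every fold, and average the `binary_metrics` results over the folds. In practice the documents would be shuffled (or stratified by class) before splitting, which this sketch omits.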

URI

https://ir.kdu.ac.lk/handle/345/8917
http://doi.org/10.64701/ijrc/345/8917

Collections

Volume 04, Issue 01, 2025 [6]