Automatically Correcting Data with Noisy Labels for Improving Training Set of Sentiment Classification Domain

Please use this identifier to cite or link to this item: http://202.28.34.124/dspace/handle123456789/3631

Title:	Automatically Correcting Data with Noisy Labels for Improving Training Set of Sentiment Classification Domain
Authors:	Thananchai Khamket (ธนันชัย คำเกตุ); Jantima Polpinij (จันทิมา พลพินิจ), Jantima.p@msu.ac.th; Mahasarakham University
Keywords:	Sentiment classification; Noisy label correction; Polarity Label Analyzer; Machine learning; Deep learning
Issue Date:	19
Publisher:	Mahasarakham University
Abstract:	Sentiment classification is crucial in natural language processing, but noisy or mislabeled data can substantially degrade model performance. This study proposes an automated label correction method that improves training data quality before sentiment classification models are applied. It introduces the Polarity Label Analyzer, a predictive model built on sentence-level sentiment analysis that detects and corrects mislabeled sentiment data. Three datasets of TripAdvisor hotel reviews were used. The first, manually validated by linguistic experts, was used to train the Polarity Label Analyzer. The second, containing a mix of correctly and incorrectly labeled reviews, was used to analyze the impact of label noise on model performance. The third, also validated by experts, served as a test set for assessing the effect of label correction on various sentiment classification models. Seven classification models were evaluated: KNN, Logistic Regression, Multinomial Naïve Bayes, Random Forest, SVM with a linear kernel, CNN, and BERT Base. Accuracy and F1-score improved significantly across all models when they were trained on corrected data. SVM performed best among the traditional models, while BERT Base achieved the highest accuracy (0.95) and F1-score (0.94), highlighting the importance of label quality for deep learning models. The findings suggest that correcting noisy labels before training enhances sentiment classification models, especially deep learning architectures such as CNN and BERT. The Polarity Label Analyzer proves to be a valuable tool for improving training set quality, reinforcing the importance of data reliability in sentiment analysis tasks.
URI:	http://202.28.34.124/dspace/handle123456789/3631
Appears in Collections:	The Faculty of Informatics
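
The abstract describes a pipeline in which a model trained on expert-validated reviews is used to relabel a noisy training set before any classifier is fitted. This record does not specify how the Polarity Label Analyzer works internally, so the following is only a minimal sketch of the general idea, not the authors' method: it swaps in a tiny Naive Bayes word scorer, sums its scores over the sentences of a review, and overrides the given label only when the evidence is clear. The function names, the "pos"/"neg" label strings, the `margin` threshold, and the toy reviews are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


def train_analyzer(clean_reviews):
    """Count words per label from (text, label) pairs whose labels are trusted."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in clean_reviews:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab


def sentence_score(sentence, model):
    """Log-odds of positive vs. negative for one sentence (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    score = math.log(doc_counts["pos"] / doc_counts["neg"])
    for label, sign in (("pos", 1), ("neg", -1)):
        total = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(sentence):
            if tok in vocab:  # words never seen in training are ignored
                score += sign * math.log((word_counts[label][tok] + 1) / total)
    return score


def correct_label(text, given_label, model, margin=1.0):
    """Sum sentence-level scores; relabel only if the total exceeds the margin."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(sentence_score(s, model) for s in sentences)
    if total > margin:
        return "pos"
    if total < -margin:
        return "neg"
    return given_label  # evidence too weak: keep the original label


# Toy stand-in for the expert-validated set (hypothetical data).
clean = [
    ("The room was clean and the staff were friendly.", "pos"),
    ("Great location and excellent breakfast.", "pos"),
    ("The room was dirty and the staff were rude.", "neg"),
    ("Terrible service and noisy room.", "neg"),
]
model = train_analyzer(clean)

# Toy stand-in for the noisy set: the first label is deliberately wrong.
noisy = [
    ("Excellent breakfast. Friendly staff.", "neg"),
    ("Dirty room. Rude staff.", "neg"),
]
corrected = [(text, correct_label(text, label, model)) for text, label in noisy]
```

In this sketch the corrected pairs would then be used as the training set for a downstream classifier. The margin keeps the original label when the sentence-level evidence is weak, which is one possible way to avoid introducing new errors during correction.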

Files in This Item:

File	Description	Size	Format
65011293501.pdf		3.5 MB	Adobe PDF
