Machine Learning and Lexical Rule-Based Cost-Efficient Emotion Annotation of Hinglish Utterances

Pratibha Verma; Amandeep Kaur; Meenu Khurana; Deepali Gupta

doi:10.3791/68437

Research Article

August 19th, 2025

Pratibha Verma¹, Amandeep Kaur¹, Meenu Khurana², Deepali Gupta¹

¹Chitkara University Institute of Engineering & Technology, Chitkara University, ²Chitkara University School of Engineering & Technology, Chitkara University

Summary

This study combines a rule-based strategy with machine learning and expert assistance to annotate Hinglish and English text. Evaluated on 19,000 tweets, the approach reached 81% accuracy at a much lower cost than fully manual annotation. It could be useful for tracking emotions during a crisis.

Abstract

Emotion annotation in code-mixed languages such as Hinglish (Hindi-English) presents unique challenges due to linguistic complexity and resource constraints. This study introduces a hybrid active learning framework that combines lexical rules, machine learning, and iterative expert feedback to achieve cost-efficient, high-accuracy emotion annotation. Grounded in psychological theories of emotion, including Discrete Emotions Theory and Cognitive Appraisal Theory, the framework employs bilingual emotion dictionaries (e.g., mapping gussa and rage to anger), subword tokenization for compound terms (e.g., splitting the Devanagari word "भयानक", Hindi for "terrible", into भव्य + अंक), and active learning to prioritize ambiguous samples. Evaluated on a dataset of 19,000 war- and conflict-related Hinglish tweets, the framework achieved 81% accuracy (F-score: 0.76) while reducing operational costs by 40% compared to manual annotation. Lexical rules resolved 89% of code-switching ambiguities, and iterative refinements raised accuracy incrementally from 72% to 81%. The system's efficiency stems from limiting human effort to 73% of the dataset, with automated preprocessing of emojis, hashtags, and slang. The study is based on the hypothesis that integrating lexical rule-based methods with active learning and machine learning can enhance the accuracy of emotion annotation in Hinglish text while reducing manual labeling and overall annotation effort.
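The dictionary-based step described above can be illustrated with a minimal sketch. It is not the authors' implementation: only the mapping of gussa and rage to anger comes from the article, while the other lexicon entries, the names `EMOTION_LEXICON`, `tokenize`, and `lexicon_label`, and the simple majority-vote rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical bilingual lexicon. Only "gussa" and "rage" -> "anger" are taken
# from the article; the remaining entries are illustrative assumptions.
EMOTION_LEXICON = {
    "gussa": "anger",
    "rage": "anger",
    "dar": "fear",
    "fear": "fear",
    "khushi": "joy",
    "happy": "joy",
}

# Matches runs of Latin letters, so "#rage" yields the token "rage".
TOKEN_PATTERN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def lexicon_label(text, lexicon=EMOTION_LEXICON):
    """Return the most frequent lexicon emotion in the text.

    Returns None when no lexicon entry matches, which signals that the
    sample should be passed on to the classifier or to an expert.
    Ties resolve to the emotion that was encountered first.
    """
    counts = Counter(lexicon[tok] for tok in tokenize(text) if tok in lexicon)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

In this sketch, Hindi and English cue words share one lookup table, so a Romanized Hindi word and its English counterpart lead to the same label. A sample for which `lexicon_label` returns None is the kind of case that would be left to the later stages of the pipeline.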

Introduction

When two or more languages are mixed within a single sentence or utterance, the result is called code-mixed language. Code-mixing is common in casual dialogue, and Hinglish is one example. Human emotions can be understood in multiple ways, for instance at the biological, physiological, and psychological levels. One way to model a series of emotional statements computationally is to have them annotated by the people who uttered them. According to scientists such as Roger Penrose, many phenomena in our world are non-computational, whereas scientists such as Wolfram consider that every phenomenon can be modeled computationally¹. Penrose belie....

Protocol

This section explains how the multimodal framework for annotating eight emotions was constructed. It begins with a discussion of the properties of the dataset, followed by the subsequent procedures. For a better understanding of the research procedure, refer to Figure 1.

Machine learning flowchart for tweet emotion analysis using preprocessing, dynamic rules, active learning.
Figure 1: Systematic framewo....

Results

The findings of this research suggest that integrating lexical rules with machine learning and active learning techniques offers a viable pathway for enhancing the efficiency and accuracy of emotion annotation in code-mixed Hinglish text. Through iterative refinement and expert suggestions, the proposed framework achieved notable reductions in manual effort while sustaining high performance across evaluation metrics. The outcomes indicate potential for broader applicability .......

Discussion

The dataset for this study was curated using a combination of manual annotation and active learning. Initially, 10,040 Hinglish tweets related to war and conflict were manually labeled with eight predefined emotions. The dataset was then expanded to 19,000 tweets using a semi-automated approach. Active learning enabled selective expert intervention, reducing manual effort by 40% while maintaining a high annotation accuracy of 81% with an F-score of 0.76. Lexical rules and emotion-specific dictionaries played a crucial ro.......
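The selective expert intervention mentioned above can be sketched with entropy-based uncertainty sampling, a common active learning strategy. The article does not state which selection criterion the framework uses, so the entropy criterion, the function names `entropy` and `select_for_review`, and the sample data (four classes rather than eight, for brevity) are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_review(predictions, budget):
    """Return the ids of the `budget` most uncertain samples.

    `predictions` maps a sample id to the model's class probabilities.
    Samples are ranked by entropy, highest first, so the samples the
    model is least sure about are the ones routed to an expert.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for three tweets.
predictions = {
    "tweet_a": [0.97, 0.01, 0.01, 0.01],  # confident prediction
    "tweet_b": [0.25, 0.25, 0.25, 0.25],  # maximally ambiguous
    "tweet_c": [0.60, 0.30, 0.05, 0.05],  # moderately ambiguous
}

print(select_for_review(predictions, budget=2))
```

With a budget of two, the sketch selects `tweet_b` and `tweet_c` and leaves the confidently predicted `tweet_a` to automatic labeling, which is how a review budget could keep expert effort focused on ambiguous samples.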

Disclosures

The authors declare no conflict of interest.

Acknowledgements

This research received no external funding.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
fastText	Facebook AI	N/A	Word representation and classification
Google Colab	Google	N/A	Cloud-based Jupyter Notebook environment
Google Colab GPU/TPU	Google	N/A	Cloud-based hardware acceleration
Intel Core i5/i7 or AMD Ryzen 5/7	Intel / AMD	N/A	Processor for local execution (if required)
Matplotlib	Open-source	N/A	Data visualization library
NLTK	Open-source	N/A	Natural Language Toolkit for text processing
NumPy	Open-source	N/A	Numerical computing library
NVIDIA GTX 1650 or Higher (Optional)	NVIDIA	N/A	GPU for deep learning tasks
Pandas	Open-source	N/A	Data manipulation library
Python	Python Software Foundation	N/A	Programming language for ML and NLP
PyTorch	Meta AI	N/A	Deep Learning framework
RAM (8GB Minimum, 16GB Recommended)	Various	N/A	Memory requirement for ML tasks
Scikit-learn	Open-source	N/A	Machine Learning library
Seaborn	Open-source	N/A	Statistical data visualization
SpaCy	Explosion AI	N/A	Industrial-strength NLP library
SSD Storage (256GB Minimum, 512GB Recommended)	Various	N/A	Storage for dataset processing
TensorFlow	Google	N/A	Deep Learning framework

References

Herce, R. Non-locality of the phenomenon of consciousness according to Roger Penrose. Dialogo. 3 (2), 127-134 (2016).
Wolfram, S. The future of computation. Math J. 10 (2), 329-362 (2006).
Kusal, S., et al.
