Research Article

Machine Learning and Lexical Rule-Based Cost-Efficient Emotion Annotation of Hinglish Utterances

DOI: 10.3791/68437

August 19th, 2025


Summary


This study combines a rule-based strategy with machine learning and expert assistance to annotate Hinglish and English text. The approach was tested on 19,000 tweets, achieving 81% accuracy at a much lower cost than manual annotation. It could be useful for tracking emotions during a crisis.

Abstract


Emotion annotation in code-mixed languages such as Hinglish (Hindi-English) is challenging because of linguistic complexity and resource constraints. This study introduces a hybrid active learning framework that combines lexical rules, machine learning, and iterative expert feedback to achieve cost-efficient, high-accuracy emotion annotation. Grounded in psychological theories of emotion, including Discrete Emotions Theory and Cognitive Appraisal Theory, the framework employs bilingual emotion dictionaries (e.g., mapping gussa and rage to anger), subword tokenization for compound terms (e.g., splitting the Devanagari word भयानक, "terrible" in Hindi, into भव्य + अंक), and active learning to prioritize ambiguous samples. Evaluated on a dataset of 19,000 war- and conflict-related Hinglish tweets, the framework achieved 81% accuracy (F-score: 0.76) while reducing operational costs by 40% compared with manual annotation. Lexical rules resolved 89% of code-switching ambiguities, and iterative refinements raised accuracy incrementally from 72% to 81%. The system's efficiency stems from limiting human effort to 73% of the dataset, with automated preprocessing of emojis, hashtags, and slang. The working hypothesis is that integrating lexical rule-based methods with active learning and machine learning can improve the accuracy of emotion annotation in Hinglish text while reducing manual labeling and overall annotation effort.
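The bilingual dictionary lookup mentioned above can be illustrated with a minimal sketch, assuming romanized (Latin-script) Hinglish input. Only the mapping of gussa and rage to anger comes from this article; the remaining lexicon entries, the function names `preprocess` and `lexical_label`, and the tie-handling rule are hypothetical illustrations and not the authors' implementation.

```python
import re

# Hypothetical bilingual emotion lexicon. Only "gussa" and "rage" -> anger
# are taken from the article; the other entries are illustrative assumptions.
EMOTION_LEXICON = {
    "gussa": "anger",
    "rage": "anger",
    "dar": "fear",
    "fear": "fear",
    "khushi": "joy",
    "happy": "joy",
}


def preprocess(tweet):
    """Lowercase, drop URLs and @mentions, keep hashtag words, tokenize.

    Assumption: input is romanized, so only a-z tokens are kept.
    """
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word itself
    return re.findall(r"[a-z]+", text)


def lexical_label(tweet):
    """Return the majority lexicon emotion, or None if no rule fires or
    the top two emotions are tied (an ambiguous case for a later stage)."""
    counts = {}
    for token in preprocess(tweet):
        emotion = EMOTION_LEXICON.get(token)
        if emotion is not None:
            counts[emotion] = counts.get(emotion, 0) + 1
    if not counts:
        return None
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Returning None for unmatched or tied tweets is one possible design: such tweets could be passed to the classifier or to an expert rather than being labeled by rule.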

Introduction


When two or more languages are mixed within a single sentence or utterance, the result is called a code-mixed language. Code-mixing is common in casual dialogue, as in Hinglish. Human emotions can be understood at several levels, including the biological, physiological, and psychological, and one way to model a series of emotional statements computationally is to have them annotated by the people who uttered them. According to scientists such as Roger Penrose, many phenomena in our world are non-computational, whereas scientists such as Wolfram consider that every phenomenon can be modeled computationally¹. Penrose belie....


Protocol


This section explains how the multimodal framework for annotating eight emotions was constructed. It begins with the properties of the dataset, followed by the subsequent procedures. For an overview of the research procedure, refer to Figure 1.

Machine learning flowchart for tweet emotion analysis using preprocessing, dynamic rules, active learning.
Figure 1: Systematic framewo....


Results


The findings of this research suggest that integrating lexical rules with machine learning and active learning techniques offers a viable pathway for improving the efficiency and accuracy of emotion annotation in code-mixed Hinglish text. Through iterative refinement and expert feedback, the proposed framework achieved notable reductions in manual effort while sustaining high performance across evaluation metrics. The outcomes indicate potential for broader applicability .......
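As a hedged illustration of the evaluation metrics referred to here, the sketch below computes accuracy and a macro-averaged F-score from gold and predicted labels. The visible text does not state which averaging scheme produced the reported F-score, so macro averaging is an assumption, and the function names and toy labels are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example with made-up labels, not data from the study.
gold = ["anger", "anger", "fear", "joy"]
pred = ["anger", "fear", "fear", "joy"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```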


Discussion


The dataset for this study was curated using a combination of manual annotation and active learning. Initially, 10,040 Hinglish tweets related to war and conflict were manually labeled with eight predefined emotions. The dataset was then expanded to 19,000 tweets using a semi-automated approach. Active learning enabled selective expert intervention, reducing manual effort by 40% while maintaining a high annotation accuracy of 81% with an F-score of 0.76. Lexical rules and emotion-specific dictionaries played a crucial ro.......
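The selective expert intervention described above can be sketched as uncertainty sampling: the classifier's most uncertain predictions are routed to an expert, and the rest keep their automatic labels. The visible text does not say which selection criterion was used, so the entropy criterion, the `budget` parameter, and the function names below are assumptions for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_expert(predictions, budget):
    """Return the ids of the `budget` most uncertain items.

    predictions: dict mapping a tweet id to its list of class probabilities.
    Higher entropy means the model is less sure, so those items go first.
    """
    ranked = sorted(predictions,
                    key=lambda tweet_id: entropy(predictions[tweet_id]),
                    reverse=True)
    return ranked[:budget]


# Toy probabilities over three emotions, invented for this example.
predictions = {
    "t1": [0.90, 0.05, 0.05],  # confident, keep the automatic label
    "t2": [0.40, 0.30, 0.30],  # ambiguous, send to an expert
    "t3": [0.50, 0.50, 0.00],  # ambiguous between two emotions
}
print(select_for_expert(predictions, budget=2))
```

In a full loop, the expert-labeled items would be added to the training set and the classifier retrained before the next selection round.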


Disclosures


The authors declare no conflict of interest.

Acknowledgements


This research received no external funding.


Materials

List of materials used in this article
| Name | Company | Catalog Number | Comments |
|---|---|---|---|
| fastText | Facebook AI | N/A | Word representation and classification |
| Google Colab | Google | N/A | Cloud-based Jupyter Notebook environment |
| Google Colab GPU/TPU | Google | N/A | Cloud-based hardware acceleration |
| Intel Core i5/i7 or AMD Ryzen 5/7 | Intel / AMD | N/A | Processor for local execution (if required) |
| Matplotlib | Open-source | N/A | Data visualization library |
| NLTK | Open-source | N/A | Natural Language Toolkit for text processing |
| NumPy | Open-source | N/A | Numerical computing library |
| NVIDIA GTX 1650 or Higher (Optional) | NVIDIA | N/A | GPU for deep learning tasks |
| Pandas | Open-source | N/A | Data manipulation library |
| Python | Python Software Foundation | N/A | Programming language for ML and NLP |
| PyTorch | Meta AI | N/A | Deep Learning framework |
| RAM (8GB Minimum, 16GB Recommended) | Various | N/A | Memory requirement for ML tasks |
| Scikit-learn | Open-source | N/A | Machine Learning library |
| Seaborn | Open-source | N/A | Statistical data visualization |
| SpaCy | Explosion AI | N/A | Industrial-strength NLP library |
| SSD Storage (256GB Minimum, 512GB Recommended) | Various | N/A | Storage for dataset processing |
| TensorFlow | Google | N/A | Deep Learning framework |

References

  1. Herce, R. Non-locality of the phenomenon of consciousness according to Roger Penrose. Dialogo. 3 (2), 127-134 (2016).
  2. Wolfram, S. The future of computation. Math J. 10 (2), 329-362 (2006).
  3. Kusal, S., et al.



Tags

Emotion Annotation, Hinglish Utterances, Code Mixed Language, Lexical Rule Based, Machine Learning, Active Learning, Bilingual Emotion Dictionary, Subword Tokenization, Cognitive Appraisal Theory, Discrete Emotions Theory