Two-Stage Recruitment Text Labeling via Lexicon-Guided Routing and Retrieval-Augmented Large Language Models

Hushuang Shen; Yuan Wang

doi:10.3791/70153

Method Article

May 8th, 2026

Hushuang Shen¹, Yuan Wang²

¹School of Petroleum, China University of Petroleum-Beijing at Karamay Campus, ²School of Business Administration, South China University of Technology

Summary

A two-stage workflow is described for classifying recruitment texts into Artificial Intelligence, Environmental Protection, or Other. Lexicon-guided triage assigns routine cases, while retrieval-augmented, prompt-optimized large language model inference resolves ambiguous cases, enabling accurate large-scale labeling and downstream descriptive labor-market summarization.
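The routing rule of the first stage can be illustrated with a minimal sketch. It assumes plain substring matching against two keyword lists; the matching procedure actually used (for example, matching on jieba tokens) may differ. The toy lexicons, the `UNCERTAIN` marker, and the function name `triage` are hypothetical and are not taken from the protocol.

```python
from typing import Iterable

AI = "Artificial Intelligence"
ENV = "Environmental Protection"
OTHER = "Other"
UNCERTAIN = "UNCERTAIN"  # hypothetical marker: posting is routed to stage two


def triage(text: str, ai_terms: Iterable[str], env_terms: Iterable[str]) -> str:
    """Stage-one routing by substring matching (an assumed matching rule)."""
    hits_ai = any(term in text for term in ai_terms)
    hits_env = any(term in text for term in env_terms)
    if hits_ai and hits_env:
        return UNCERTAIN  # dual match: defer to retrieval-augmented inference
    if hits_ai:
        return AI
    if hits_env:
        return ENV
    return OTHER  # neither lexicon matched


# Toy lexicons for illustration only; the study's keyword lists are in Supplement File 3.
ai_lexicon = ["机器学习", "深度学习"]
env_lexicon = ["污水处理", "碳排放"]

labels = [
    triage("负责机器学习模型训练", ai_lexicon, env_lexicon),
    triage("负责污水处理设备运维", ai_lexicon, env_lexicon),
    triage("用深度学习预测碳排放", ai_lexicon, env_lexicon),
    triage("负责门店销售", ai_lexicon, env_lexicon),
]
```

Only the third toy posting matches both lexicons, so it is the only one that would reach the language model; the other three are labeled directly.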

Abstract

Recruitment texts are heterogeneous, and domain terminology evolves over time, making large-scale labeling difficult with limited manual annotation capacity. This protocol provides a reproducible two-stage workflow for classifying Chinese recruitment texts into Artificial Intelligence, Environmental Protection, or Other. Stage one performs lexicon-guided triage using two curated domain keyword lists. Postings that match only one domain lexicon are directly labeled with that domain, postings that match neither lexicon are labeled as Other, and dual-match postings are routed to stage two. Stage two applies retrieval-augmented large language model inference. A knowledge base compiled from 12 authoritative domain documents is converted to text, split into chunks of about 1,000 characters with a 20-character overlap, embedded using all-MiniLM-L6-v2, and indexed in LanceDB. For each uncertain posting, cosine-similarity k-nearest-neighbor retrieval returns candidate chunks filtered by a minimum similarity threshold (τ), and up to four chunks are injected into a fixed prompt template that defines label boundaries and constrains the output to a single label. Manual verification of a benchmark set of 3,000 postings (1,000 per class) yields an inter-annotator agreement of 95.0% and a Cohen's kappa (κ) of about 0.93. Evaluation compares four open-source large language models in a prompt-only setting, then applies prompt refinement and retrieval augmentation to Qwen2.5-7B-Instruct. The retrieval-augmented configuration achieves 98.47% accuracy and supports corpus-scale labeling of about 6.93 million postings for downstream descriptive aggregation.
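The retrieval filter and prompt assembly described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the protocol's implementation: toy 2-dimensional vectors stand in for all-MiniLM-L6-v2 embeddings, a brute-force scan stands in for the LanceDB index, the value `tau=0.5` is a placeholder because the threshold is not stated here, and the prompt wording is hypothetical (the actual template is in Supplement File 2).

```python
import math
from typing import List, Sequence, Tuple

LABELS = ("Artificial Intelligence", "Environmental Protection", "Other")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_chunks(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    tau: float,
    max_chunks: int = 4,
) -> List[str]:
    """Keep chunks with similarity >= tau, best first, at most max_chunks."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = sorted((p for p in scored if p[0] >= tau), key=lambda p: p[0], reverse=True)
    return [text for _, text in kept[:max_chunks]]


def build_prompt(posting: str, context: List[str]) -> str:
    """Hypothetical fixed template that constrains the answer to one label."""
    lines = [
        "Classify the job posting into exactly one label: " + ", ".join(LABELS) + ".",
        "Reference passages:",
        *[f"- {c}" for c in context],
        f"Posting: {posting}",
        "Answer with the label only.",
    ]
    return "\n".join(lines)


# Toy knowledge base: (chunk text, embedding) pairs with made-up 2-D vectors.
kb = [
    ("chunk a", [1.0, 0.0]),
    ("chunk b", [1.0, 1.0]),
    ("chunk c", [0.0, 1.0]),
    ("chunk d", [-1.0, 0.0]),
]
picked = select_chunks([1.0, 0.0], kb, tau=0.5)  # tau=0.5 is a placeholder value
prompt = build_prompt("用深度学习预测碳排放", picked)
```

Filtering by τ before truncating to four chunks means a posting with no sufficiently similar passage is classified from the prompt alone rather than from weakly related context.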

Introduction

In recent years, the rapid development of AI and environmental technologies has created a large number of new occupations. On the one hand, the global job market has seen an influx of new positions built around AI and sustainability. Vacancy-based evidence documents a rapid expansion in AI-related skill demand, which increases the heterogeneity of recruitment texts and motivates a reproducible, scalable domain-tagging protocol¹. On the other hand, China's latest revision of the Occupational Classification Dictionary (OCD) records similarly robust growth in new occupations in the fields of employment informatization and green and....

Protocol

This study did not involve human participants or animal subjects. Therefore, ethical approval and informed consent were not required. The workflow was executed in a Python 3.10 environment on a 64-bit Windows operating system using the hardware, tools, and software listed in the Table of Materials.

1. Modeling

Constructing the data and knowledge base
1. Collect and curate an in-house structured corpus of job advertisements for publicly listed companies from major online recruitment platforms in China (2014–2023), ensuring compliance with platform policies and applicable regulation....
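For the knowledge-base side of this step, the chunking described in the Abstract (about 1,000 characters per chunk with a 20-character overlap) can be sketched as a sliding character window. The function name `chunk_text` and the exact boundary handling are assumptions; the splitter used in the protocol may, for instance, prefer sentence or paragraph boundaries.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += size - overlap  # step back by `overlap` so neighbors share context
    return chunks


# Synthetic 2,500-character document used only to show the window arithmetic.
document = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(document)
```

Each chunk would then be embedded and written to the vector index; the short overlap keeps a term that straddles a boundary visible in both neighboring chunks.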

Results

Four open-source, instruction-tuned large language model backbones (Backbone A–D) listed in the Table of Materials are benchmarked on the three-class job-posting classification task. All backbones are evaluated with the same test protocol, preprocessing procedures, and metrics (overall accuracy, precision, recall, and F1). Prompt refinement and retrieval-augmented generation (RAG) are subsequently applied to Backbone B and re-evaluated under identical conditions. The performanc.......
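The metrics named above can be computed from paired gold and predicted labels as in the following sketch. Macro-averaging over the three classes is an assumption, since the averaging scheme is not stated here, and the six toy label pairs are invented for illustration; they are not results from the study.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

LABELS = ("AI", "ENV", "OTHER")  # short stand-ins for the three class names


def evaluate(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str] = LABELS
) -> Tuple[float, Dict[str, Tuple[float, float, float]], Tuple[float, float, float]]:
    """Return accuracy, per-class (precision, recall, F1), and their macro averages."""
    pairs = Counter(zip(y_true, y_pred))  # confusion counts keyed by (gold, predicted)
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    per_class = {}
    for c in labels:
        tp = pairs[(c, c)]
        predicted = sum(pairs[(t, c)] for t in labels)  # column total
        actual = sum(pairs[(c, p)] for p in labels)     # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (precision, recall, f1)
    macro = tuple(sum(v[i] for v in per_class.values()) / len(labels) for i in range(3))
    return accuracy, per_class, macro


# Invented toy labels, used only to exercise the function.
gold = ["AI", "AI", "ENV", "ENV", "OTHER", "OTHER"]
pred = ["AI", "ENV", "ENV", "ENV", "OTHER", "AI"]
accuracy, per_class, macro = evaluate(gold, pred)
```

On a class-balanced benchmark such as the 1,000-per-class set, macro and micro averages of recall coincide with accuracy, so per-class precision is usually the more informative figure to report alongside it.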

Discussion

Prior work has applied classical machine learning and neural models to recruitment text categorization, including resume and job description classification, but such approaches often require substantial labeled data and may degrade when domain terminology shifts. Ali et al. proposed a resume classification system using NLP and machine learning techniques¹⁸. Jalili et al. explored BiLSTM-based resume classification¹⁹. Pal et al. evaluated resume categorization using classica.......

Disclosures

The authors have nothing to disclose.

Acknowledgements

The authors thank colleagues for technical support and constructive feedback during data curation and manuscript preparation. This project received no external funding.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
all-MiniLM-L6-v2 (sentence encoder)	Hugging Face (sentence-transformers)	https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2	Sentence embedding model used for vectorization.
bert-base-chinese (baseline encoder)	Hugging Face (google-bert)	https://huggingface.co/google-bert/bert-base-chinese	Baseline encoder (Chinese BERT base).
DDR5 memory, 16 GB	Workstation/laptop configuration	N/A	System memory (16 GB DDR5); vendor/part number not specified.
DeepSeek-R1-Distill-Qwen-7B (Backbone A)	Hugging Face (deepseek-ai)	https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	LLM backbone A.
GLM-4-9B-Chat (Backbone C)	Hugging Face (zai-org)	https://huggingface.co/zai-org/glm-4-9b-chat-hf	LLM backbone C.
Intel Core i5-13500HX CPU	Intel	https://www.intel.com/content/www/us/en/products/sku/232156/intel-core-i513500hx-processor-24m-cache-up-to-4-70-ghz/specifications.html	Workstation CPU used for experiments.
InternLM2.5-7B-Chat (Backbone D)	Hugging Face (internlm)	https://huggingface.co/internlm/internlm2_5-7b-chat	LLM backbone D.
jieba (Chinese tokenizer)	PyPI (jieba)	https://pypi.org/project/jieba/	Chinese tokenization/segmentation library; version not specified.
Keyword libraries (AI & Environmental) (Supplement File 3)	This study	Supplement File 3	Domain keyword lexicons used for lexicon-guided pre-filtering; archive exact version used per run.
LanceDB (vector database)	LanceDB	N/A	Vector database for storing/searching embeddings; version not specified.
NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)	NVIDIA	https://www.nvidia.com/en-us/geforce/laptops/40-series/	GPU used for inference/experiments.
Online recruitment platform job postings (2014–2023)	Major Chinese recruitment platforms (public web data)	N/A	Publicly posted job advertisements collected and curated by the authors; ~6.93 million postings after cleaning.
pip (Python package installer)	PyPA	N/A	Package installer used to install dependencies and export an environment snapshot using pip freeze.
Prompt template evolution (Supplement File 2)	This study	Supplement File 2	Successive prompt variants and the final prompt used for evaluation.
Python 3.10	Python Software Foundation	N/A	Runtime interpreter (Python 3.10); patch version not specified.
Qwen2.5-7B-Instruct (Backbone B)	Hugging Face (Qwen)	https://huggingface.co/Qwen/Qwen2.5-7B-Instruct	LLM backbone B.
RAG knowledge base field descriptions/index (Supplement File 1)	This study	Supplement File 1	Schema/field notes and document index for the knowledge base used in retrieval-augmented inference.
Thesaurus (AI & Environmental) (Supplement File 4)	This study	Supplement File 4	Term dictionaries used in reclassification and analysis; archive exact version used per run.
Windows 11 (64-bit)	Microsoft	N/A	Operating system (Windows 11, 64-bit); build/version not specified.

References

Alekseeva, L., et al. The demand for AI skills in the labor market. Labour Economics. 71, 102002 (2021).
Ministry updates official dictionary of careers. Ministry of Human Resources and Social Security of the People’s Republic of China. Available at: https://english.www.gov.cn/statecouncil/ministries/202209/30/content_WS633647bbc6d0a75....
