Research Article
Erratum Notice
Important: An erratum has been issued for this article.
This protocol integrates Case-Based Reasoning (CBR) with multi-stage transformer models to summarize legal texts. It preprocesses legal cases, retrieves similar precedents, adapts reasoning structures, and generates accurate, coherent summaries. Applications include legal research, judgment analysis, and decision-support systems, ensuring factual consistency and domain-specific reasoning fidelity.
Legal documents are long and complex, which makes it difficult for legal practitioners and researchers to quickly identify and extract relevant information. Here, a hybrid approach is presented that combines Case-Based Reasoning (CBR) for contextual comprehension of legal texts with deep learning techniques for summary generation, producing summaries accurately and efficiently and outperforming prior extractive and abstractive baselines on both lexical overlap and domain-specific reasoning metrics. Using a dataset of 4,968 legal cases from Kaggle, a multi-stage transformer architecture was constructed on top of a previously developed CBR retrieval model to produce brief summaries, with CBR providing context comprehension. The system was evaluated on legal outcome prediction and summary coherence, with results showing performance superior to existing extractive and abstractive methods: the proposed model reached 98% accuracy on legal entities, produced summaries 46% more coherent than state-of-the-art methods on the baseline-enhanced legal corpus, and improved ROUGE scores over previous approaches by 23%. This study presents a hybrid legal text summarization framework that integrates CBR with transformer-based models. Extensive experiments show superior performance over recent baselines, achieving higher factual accuracy, reasoning fidelity, and legal entity preservation.
The growing volume of legal documentation has necessitated advanced methods to retrieve pertinent information in a timely manner. Legal text summarization is important for improving accessibility and decision-making. Legal opinions, judgments, and precedents are so verbose that judges, practitioners, and researchers can have difficulty working their way through them, which has led to the development of automated methods to summarize these documents accurately and efficiently1.
Despite their state-of-the-art performance on more general texts, existing summarization methods struggle to capture the unique complexity and legal jargon that characterize legal documents. Extractive methods that simply pick the top sentences by log-likelihood, without reorganization, can obtain good ROUGE scores but lose context and complete semantics. Abstractive approaches use natural language generation to compose the summary, so they can capture contextual nuance, but they struggle to preserve the logical thought process and legal reasoning of courtroom cases2. The proposed method is particularly effective for long-form legal case texts containing structured sections (e.g., facts, arguments, precedents, and judgments). It excels when applied to corpora with explicit citations and standardized formatting, such as appellate or supreme court decisions. However, performance may be limited on noisy or unstructured datasets (e.g., scanned PDFs with OCR errors, or documents lacking clear rhetorical segmentation). Thus, the method is best suited for well-structured digital case repositories where both linguistic and citation features are accessible3.
The proposed hybrid framework combines case-based reasoning (CBR) with advanced natural language processing (NLP) to generate summaries. It uses a multi-stage transformer architecture with legal domain-specific attention layers and proposes a cross-document reasoning module. CBR allows the system to load similar past cases that can be used for examining all contextual variables relevant to legal reasoning4. The proposed model achieved superior accuracy compared with state-of-the-art baselines, as demonstrated by quantitative gains across ROUGE, BLEU, and Legal-SemSim metrics, and validated by human expert evaluation. For instance, the proposed model outperformed Legal-BART and PALM-Law by margins of 15%-20% on reasoning chain accuracy. By combining general representations with case-specific knowledge and neural summarization, it achieved 98% precision for legal entity recognition and 97% for precedent retrieval, vastly exceeding previous work5.
Related work
Legal text summarization, a subdomain of natural language processing (NLP), has gained increasing attention due to the growing demand for efficient legal document processing, including judicial opinions, statutes, and case law. The primary challenge lies in preserving semantic integrity, legal reasoning, and domain-specific context in the generated summaries. Prior research spans across extractive, abstractive, and hybrid approaches, with recent emphasis on domain-adapted neural architectures6.
Literature review
Legal text summarization has evolved as a critical area within natural language processing, driven by the pressing need to make voluminous, complex legal documents accessible and actionable. The domain's literature reflects a trajectory from surface-level extractive models to sophisticated hybrid architectures that attempt to balance fluency, factual integrity, and legal logic. Early efforts focused on extractive summarization techniques, which prioritized sentence importance based on lexical similarity and statistical patterns. Though effective in selecting legally salient fragments, these approaches, such as TextRank and LexRank, often ignore the deeper semantic structure and fail to capture the rhetorical and argumentative layers essential in legal reasoning6. As legal texts differ markedly from general corpora due to their rigid semantics and domain-specific expressions, these models produced summaries that lacked coherence and contextual adequacy. With the advent of pre-trained transformer architectures, the research focus shifted to abstractive methods. Models like BART, T5, and PEGASUS began to exhibit the capacity to synthesize summaries using learned language generation patterns. Their legal adaptations, such as LegalBART and LegalPEGASUS, further refined the performance by incorporating law-specific pretraining corpora7. However, abstractive methods continued to suffer from limitations in reasoning consistency and explainability, challenges that are particularly problematic in legal settings, where even minor factual deviations can lead to misinterpretation.
Hybrid models emerged in response to these deficits, incorporating symbolic reasoning or retrieval-based strategies to improve contextual grounding. A notable direction was the integration of CBR, where knowledge from precedent cases was utilized to contextualize the summary generation process. These models attempt to retain the factual rigor of extractive methods while generating linguistically coherent and legally valid outputs8.
Recent trends emphasize multi-stage architectures that combine legal entity recognition, cross-document reasoning, and rhetorical role parsing, suggesting a shift towards holistic document understanding rather than mere summarization. Research now increasingly considers legal ontology alignment, domain-specific evaluation metrics (e.g., Legal-SemSim), and human-centric evaluation to assess the quality and usability of summaries for legal professionals9.
Recent surveys, such as Exploring LLMs Applications in Law in 2023 and Exploring the Use of LLMs in the Italian Legal Domain in 2024, highlight the growing role of large language models (LLMs) in automating legal reasoning, summarization, and retrieval tasks10. These works emphasize not only the capacity of LLMs like GPT-4, PaLM, and LLaMA to capture contextual nuances of legal discourse but also the challenges of domain adaptation, explainability, and factual consistency11. Hybrid systems that integrate retrieval-augmented generation (RAG) or CBR with LLMs are increasingly being explored to combine the strengths of precedent-aware reasoning with the generative fluency of transformers6. Positioning the present work within this trend, the proposed multi-stage hybrid model extends prior retrieval-augmented frameworks by explicitly encoding legal reasoning chains, while maintaining coherence through transformer-based abstractive summarization.
Despite significant progress, there remains a gap in models that can synthesize factual, logically structured, and domain-consistent legal summaries while maintaining interpretability12. The present work addresses this void by proposing a multi-stage hybrid approach grounded in legal CBR and deep neural architecture, offering a model that is both accurate and practically usable.
Extractive summarization approaches
The term extractive summarization is used for methods that extract and stitch together the most informative sentences from the document. Graph-based algorithms, e.g., TextRank13 and LexRank14, are traditional approaches that rank sentences based on their importance. Though computationally feasible, such approaches typically struggle with complex legal reasoning and argumentation8. With the advent of pre-trained language models, BERT-based methods like BERTSUM8 demonstrated significant improvements. BERTSUM fine-tunes BERT embeddings for sentence-level extractive summarization, capturing contextual nuance better than statistical models9. Legal domain adaptations such as Legal-BERTSUM further refined performance by incorporating legal-specific corpora, enabling improved legal term recognition and sentence selection. Nonetheless, extractive methods are limited in reconstructing the argumentative flow or legal reasoning chain, particularly when sentences are interdependent or references are implicit15.
Abstractive summarization approaches
Abstractive summarization generates novel phrases and sentences that rephrase the source content, often using encoder-decoder architectures. The BART model4 and PEGASUS7 are leading examples of pre-trained sequence-to-sequence models applied to general-purpose summarization. These have been adapted for the legal domain through fine-tuning on legal corpora, resulting in models like Legal-BART and Legal-PEGASUS16,17. These models produce more fluent and human-like summaries and are better at capturing contextual meaning than extractive approaches18.
However, their application in legal summarization poses challenges. Abstractive models often struggle with factual consistency and preservation of legal semantics. Misinterpretation of legal clauses or omission of key legal entities can render a summary misleading19. Furthermore, neural text generation models are generally opaque, raising issues around explainability and verifiability, both of which are critical in legal domains.
Hybrid summarization models
To leverage the strengths of both paradigms, hybrid summarization models integrate extractive and abstractive components. One early strategy was to use extractive modules to select candidate sentences, which are then refined by an abstractive decoder. More recently, transformer architecture has been combined with knowledge-infused modules to enhance legal comprehension5.
CBR has emerged as a promising hybrid strategy. Waterworth pioneered the use of prior cases to inform current decisions, and its integration with neural models enables semantic enrichment through analogical reasoning6. In legal summarization, combining CBR with transformers allows for both contextual reuse and abstract synthesis. Models like BART+CBR integrate retrieval-based legal context with generative mechanisms, offering improved coherence and legal relevance20.
The proposed Multistage + CBR architecture builds on this line of work by embedding domain-specific reasoning chains into the summarization pipeline. It features hierarchical transformer blocks, cross-document attention, and CBR-enhanced retrieval layers, tailored specifically for legal applications. This multi-layered design facilitates both fine-grained content understanding and macro-level structure preservation21.
Legal-specific challenges and considerations
Legal documents exhibit unique structural and linguistic features, including hierarchical formatting, cross-referencing of precedents, statutory citations, and domain-specific terminology22. Summarization models must therefore not only handle syntactic and semantic complexities but also respect logical coherence and legal argument progression23. Effective summarization in this context hinges on accurate legal entity recognition, reasoning structure preservation, and interpretation of rhetorical roles such as facts, arguments, and judgments.
Legal ontologies and citation networks have been employed to improve domain understanding. For instance, integrating structured legal knowledge bases during model training has been shown to enhance legal entity linking and citation consistency24. These efforts contribute to summarizing models that better align with legal practitioners' expectations25.
Another challenge is interpretability. In legal settings, output must be auditable and interpretable. Hence, explainable AI (XAI) frameworks such as LIME and SHAP are increasingly explored to justify summarization decisions, though integration with large-scale models remains an ongoing research challenge26.
Evaluation metrics in legal summarization
Standard summarization metrics such as ROUGE8, BLEU27, and METEOR are commonly used for lexical overlap evaluation. However, these measures are insufficient for capturing semantic fidelity, legal consistency, and argumentative coherence.
Recent efforts introduce domain-specific metrics like Legal-SemSim, which incorporates legal ontologies to assess semantic similarity, and reasoning chain accuracy, which evaluates preservation of logical flow and conclusion validity28. Human expert evaluation remains indispensable, particularly for measuring legal correctness, coherence, and actionability. Inter-annotator agreement using metrics such as Fleiss' Kappa ensures reliability of qualitative assessments29.
Recent advancements and future directions in legal summarization
Legal summarization research is moving toward more robust, context-aware, and explainable systems. Hybrid symbolic-neural architectures that integrate rule-based reasoning with transformer models are gaining traction. These systems maintain the interpretability of logic-based approaches while benefiting from the generalization capacity of deep learning28.
Multilingual legal summarization is another emerging frontier, addressing the global nature of legal texts and proceedings30. Temporal reasoning, necessary for understanding time-dependent legal sequences, and argumentative discourse modeling are also under exploration9.
Moreover, advancements in contrastive learning and curriculum learning are being used to enhance model generalization across diverse legal domains17. By gradually increasing task complexity and differentiating fine-grained semantic classes, these techniques improve the model's robustness in real-world scenarios31.
This protocol uses publicly available legal datasets. No sensitive personal or confidential data were used. The study complies with institutional ethical guidelines for the use of legal corpora in research (Koneru Lakshmaiah Education Foundation (Deemed to be University), Hyderabad, Telangana, India, Approval No.: KLEF/CS/2024/IRB-017). The study uses a dataset of 4,968 legal cases, each with inclusive annotations. Table 1 shows the dataset description.
| Attribute | Description |
| precedent_citations | Structured citations to precedent cases |
| reasoning_chains | Annotated logical reasoning sequences |
Table 1: Dataset attribute description.
Dataset for the study
After pre-processing, the dataset was normalized to standardize outcomes. The length of the case texts predominantly ranges between 1,000 and 5,000 words, providing comprehensive coverage for training and evaluation. Case outcomes were grouped into primary labels according to their frequency in the dataset, allowing summarization performance to be explored across different legal contexts. Annotation was conducted by a team of five legal professionals, including three practicing attorneys and two doctoral researchers specializing in legal informatics. Each case was independently annotated by two experts, with conflicts resolved by a senior annotator. The annotation schema included legal entities, precedent citations, and reasoning chains. Inter-annotator agreement was measured using Fleiss' Kappa (κ = 0.82), indicating strong consistency across annotators. This process ensured both the legal validity and reproducibility of the dataset.
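As a reference point for the agreement statistic reported above, the following is a minimal Python sketch of Fleiss' Kappa. The input format (one list of category labels per item, one label per annotator) is an illustrative assumption, not the annotation tooling actually used in the study.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each given as a list of category
    labels (one label per annotator). All items need the same rater count."""
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # Count matrix: one row per item, one column per category
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    n_items = len(counts)
    # Per-item observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Expected agreement from marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1.0, while systematic disagreement drives κ toward negative values.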
Dataset distribution
The dataset includes cases that contain various legal domains, as shown in Table 2.
| Legal Domain | Number of Cases | Percentage |
| Contract Law | 1,278 | 25.70% |
| Intellectual Property | 1,053 | 21.20% |
| Constitutional Law | 512 | 10.30% |
| Administrative Law | 432 | 8.70% |
| Criminal Law | 308 | 6.20% |
| Tort Law | 219 | 4.40% |
| Other/Miscellaneous | 1,166 | 23.50% |
Table 2: Cases containing various legal domains.
The distribution of case outcomes across the dataset is shown in Table 3.
| Case Outcome | Count | Percentage |
| Cited | 2,460 | 49.20% |
| Referred to | 855 | 17.10% |
| Followed | 471 | 9.40% |
| Applied | 447 | 8.90% |
| Considered | 353 | 7.10% |
| Other outcomes | 382 | 8.30% |
Table 3: Distribution of case outcomes across the dataset.
Stratified sampling was used to ensure balanced representation across domains in the training, validation, and test splits, preventing domain-specific biases. The dataset was divided into training (80%), validation (10%), and test (10%) sets.
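The stratified 80/10/10 split described above can be sketched as follows. The dictionary-based case records and the `domain` field name are illustrative assumptions, not the study's actual data schema.

```python
import random
from collections import defaultdict

def stratified_split(cases, key, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split cases into train/val/test while preserving the distribution
    of `key` (e.g., legal domain) within each split."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for case in cases:
        by_group[case[key]].append(case)
    train, val, test = [], [], []
    for group in by_group.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Because each domain is split independently, every split mirrors the overall domain proportions in Table 2.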
Text length distribution
The analysis of the text length distribution is presented in Table 4.
| Length Range (words) | Count | Percentage |
| Under 500 | 465 | 9.40% |
| 500-1,000 | 1,359 | 27.40% |
| 1,000-5,000 | 2,698 | 54.30% |
| 5,000-10,000 | 323 | 6.50% |
| Over 10,000 | 154 | 3.10% |
Table 4: Analysis of the text length distribution.
This distribution highlights the variety in document lengths, with the majority falling in the medium-length range (1,000-5,000 words), which presents an appropriate challenge for summarization techniques.
Enhanced data pre-processing
The preprocessing pipeline was informed by recent studies on the impact of preprocessing in transformer-based legal NLP tasks. Chalkidis et al. demonstrate that domain-aware tokenization and citation normalization significantly improve downstream summarization accuracy30. Similarly, Bommarito and Katz show that preprocessing choices, particularly entity normalization, directly affect transformer performance in legal summarization32. Based on these insights, the pipeline employed section-aware tokenization, legal citation normalization, and rhetorical role parsing to maximize contextual coherence. A novel pre-processing pipeline was developed, integrating general pre-processing methods with domain-specific techniques to obtain high-quality legal text. First, cases with empty or very short case text fields were filtered out to ensure data integrity. Next, input normalization was performed, applying legal tokenization rules to ensure input usability33. Key entities were extracted using a custom Legal Entity Recognition (LER) model (F1 score of 98%), and discourse parsing techniques were used to capture structural elements. To enhance precedent analysis, a citation graph construction step was introduced. Further pre-processing steps used domain-specific regular expression parsers to validate that inputs conform precisely to the strict formats of standardized legal citations. Section-aware tokenizers for hierarchical document structures were applied, and a fine-tuned RoBERTa classifier was added to label each legal text with rhetorical roles such as facts, arguments, and decisions. This broader pipeline readily supports downstream tasks such as legal text retrieval, reasoning, and summarization.
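Two of the steps above, citation normalization and short-text filtering, can be illustrated with the following sketch. The citation pattern, the `<CITE:...>` token format, and the `case_text` field name are assumptions for demonstration, not the pipeline's actual parsers.

```python
import re

# Illustrative pattern for reporter-style citations like "410 U.S. 113";
# a real pipeline needs jurisdiction-specific citation grammars.
CITATION_RE = re.compile(r"\b(\d+)\s+(U\.S\.|F\.2d|F\.3d)\s+(\d+)\b")

def normalize_citations(text):
    """Rewrite matched citations into a canonical VOLUME|REPORTER|PAGE token
    so downstream retrieval treats formatting variants identically."""
    return CITATION_RE.sub(
        lambda m: f"<CITE:{m.group(1)}|{m.group(2)}|{m.group(3)}>", text)

def filter_cases(cases, min_words=50):
    """Drop empty or very short case texts before training."""
    return [c for c in cases if len(c.get("case_text", "").split()) >= min_words]
```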
Advanced case-based reasoning module
The study presents, for the first time, an application of a state-of-the-art Case-Based Reasoning (CBR) system that uses semantic similarity together with legal domain knowledge to deliver relevant and comparable cases with 98% retrieval accuracy on the benchmark dataset. Enhanced case representation encodes each case with a hybrid vector that combines contextualized embeddings, produced by a domain-focused BERT model on legal text, with structural embeddings that reflect the position of elements in the document, entity-aware embeddings that emphasize the legal relations present in the cases, and citation network embeddings that represent how precedents are bound together. Multi-stage case retrieval proceeds in several steps: it first performs Approximate Nearest Neighbor (ANN) search to obtain seed candidates, then applies cross-attention to find optimal matches among the candidates, and finally uses a domain-specific scoring function and ensemble-based relevance scoring to select the best retrievals. During the Advanced Case Adaptation Stage (ACAS), the system extracts reasoning patterns through a graph-based representation template that captures logical dependencies among legal arguments. It then identifies precedent relevance via citation analysis, weighing each case component according to its contextual importance to the query. Finally, the argumentation structure is maintained using discourse parsing-based methods to ensure the logical flow of legal reasoning is preserved. The final module, the Knowledge Integration module, aids retrieval through legal ontologies, records temporal precedent relationships, and retains validity in complex chains of legal reasoning. The proposed CBR system achieves a 98% correct retrieval rate, which is higher than previously reported benchmarks (e.g., LexGLUE, 92%-95%) for the retrieval and adaptation of legal cases.
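The two-stage retrieval (ANN candidate search followed by ensemble re-ranking) can be sketched with plain NumPy as below. Exhaustive cosine similarity stands in for a true ANN index, and the 0.7/0.3 ensemble weights are illustrative assumptions, not the system's tuned scoring function.

```python
import numpy as np

def retrieve(query_vec, case_vecs, citation_scores, k=5, alpha=0.7):
    """Two-stage retrieval sketch: (1) cosine-similarity search over hybrid
    case embeddings to collect candidates, (2) re-rank candidates by a
    weighted ensemble of embedding similarity and a citation-based score."""
    # Stage 1: cosine similarity against all cases (stand-in for ANN search)
    q = query_vec / np.linalg.norm(query_vec)
    C = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = C @ q
    candidates = np.argsort(-sims)[: k * 2]  # over-retrieve candidates
    # Stage 2: ensemble relevance (alpha weights embedding similarity)
    scores = alpha * sims[candidates] + (1 - alpha) * citation_scores[candidates]
    order = candidates[np.argsort(-scores)]
    return order[:k].tolist()
```

In the full system, the cross-attention scorer would replace the simple linear ensemble in stage 2.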
Multi-stage transformer architecture
Building on the references mentioned above, the study proposes a transformer-based multi-stage architecture that combines the best elements of the algorithms discussed to improve the automated processing, reasoning, and summarization of legal documents. The document understanding stage uses a BERT encoder adapted to the legal domain, hierarchical attention mechanisms, and modules for legal entity recognition, relationship extraction, and citation graph processing to gain deep structural and contextual understanding of legal documents. The cross-document reasoning stage supports case-based reasoning by linking queries with retrieved cases through a cross-attention mechanism, extracting and validating legal reasoning chains, modeling precedent application, and preserving argumentation structure. SBERT and Bi-LSTM were used at the sentence embedding level, while a custom decoder with legal domain constraints was used at the abstractive summary generation level to accurately maintain legal terms, factual consistency, and hierarchy in the summary as required by legal writing conventions. By splitting the process into distinct stages, document processing preserves factual consistency, contextual coherence, and citation fidelity. The model training hyperparameters are shown in Table 5.
| Hyperparameter | Value |
| Batch size | 32 |
| Learning rate | 3e-5 with warmup |
| Epochs | 15 |
| Max sequence length | 2048 tokens |
| Dropout rate | 0.15 |
| Gradient accumulation | 8 steps |
| Warmup steps | 1000 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
Table 5: Training hyperparameters.
To enhance model performance and generalization, multiple advanced training techniques were implemented. Curriculum learning gradually increased task complexity, enabling the model to acquire basic skills before tackling more challenging instances. Experiments on the training dataset confirmed that contrastive fine-tuning refined semantic representations by helping the model differentiate between highly similar and dissimilar examples, deepening its understanding of the dataset. For multi-task learning, summarization was coupled with auxiliary tasks, e.g., classification and entailment recognition, fostering a more comprehensive sense of the legal domain. Finally, knowledge distillation transferred high-level knowledge from larger domain-focused models while keeping the model compact7.
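The curriculum learning step can be illustrated with a minimal sketch that orders examples from short to long documents as a simple proxy for difficulty; the batching interface and the `case_text` field are assumptions, and real curricula may use richer difficulty signals (citation density, rhetorical complexity).

```python
def curriculum_batches(examples, batch_size,
                       difficulty=lambda ex: len(ex["case_text"].split())):
    """Order examples from easy to hard (shorter documents first, a crude
    complexity proxy) and yield fixed-size training batches."""
    ordered = sorted(examples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```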
Multi-stage hybrid legal summarization methodology
To operationalize the proposed framework, stepwise algorithm 1 is presented, representing the multi-stage hybrid legal summarization methodology.
Algorithm 1:
Algorithm: Multi-Stage Hybrid Legal Summarization
Input: Legal Case Dataset D with fields {case_text, legal_entities, citations, outcome}
Output: Summary S for each case with legal reasoning preserved
Begin
// Stage 1: Data Preprocessing
For each case in D:
Normalize the text to remove noise and unify formatting.
Apply Legal Entity Recognition (LER) to extract legal entities.
Build a citation graph from referenced precedents.
Tokenize the text using domain-aware legal tokenization rules.
NOTE: If the input corpus contains noisy or incomplete texts, perform additional normalization such as stop-word filtering or section segmentation.
// Stage 2: Case-Based Retrieval
For each case:
Generate contextual embeddings using a domain-adapted BERT encoder.
Generate structural embeddings based on section position.
Generate citation network embeddings.
Retrieve similar past cases using Approximate Nearest Neighbor (ANN) search.
Refine retrieved cases using cross-attention scoring and ensemble-based relevance ranking.
// Stage 3: Case Adaptation
For each retrieved case:
Extract reasoning patterns using discourse parsing.
Identify relevant legal arguments and assign weights based on their importance.
Preserve the argumentation flow during adaptation.
NOTE: If precedent cases contain irrelevant or contradictory arguments, exclude them during adaptation.
// Stage 4: Multi-Stage Transformer-Based Summarization
For each input case and its adapted precedents:
Encode the input case using a legal-domain BERT encoder with hierarchical and entity-aware attention.
Apply cross-document reasoning between the query and retrieved cases.
Encode sentences using SBERT and Bi-LSTM for enhanced contextual embeddings.
Decode the summary using a domain-constrained decoder to preserve factual and legal accuracy.
// Stage 5: Output and Validation
For each generated summary:
Validate the generated summary for coherence, factual correctness, and legal reasoning fidelity.
Return the finalized summary S.
NOTE: If expert validation is required, include an additional human-in-the-loop review step before deployment.
End
To set up the computing environment, Python 3.10 was installed, along with the required libraries: PyTorch (v2.0), Hugging Face Transformers (v4.32.0), SpaCy (v3.6), NLTK (v3.8.1), NetworkX (v3.1), and DGL (v1.1). GPU support was configured (two GPUs with 40 GB memory each). See Table of Materials for exact versions and sources.
The legal case dataset (≈5,000 cases) was downloaded from the specified dataset source, and each case was verified to include the fields facts, arguments, citations, and judgment outcome. Documents were stored in UTF-8 plain text format. Empty or very short texts were removed using a Python script (preprocess.py), and noise removal and formatting unification were carried out. Legal-domain tokenization was applied with SpaCy (spacy.load("en_core_legal_sm")). Legal Entity Recognition was performed using a fine-tuned BERT model (transformers.pipeline("ner", model="legal-bert-ler")). A citation graph was built with NetworkX (nx.DiGraph()), linking nodes by referenced precedents. If OCR-based documents are included, an additional text cleaning script (ocr_clean.py) is run.
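The citation graph construction step can be sketched with a plain adjacency map, a simplified stdlib stand-in for the nx.DiGraph used in the protocol. The field names case_id and precedent_citations follow the dataset description; the in_degree helper is a hypothetical authority measure added for illustration.

```python
from collections import defaultdict

def build_citation_graph(cases):
    """Directed citation graph as an adjacency map: an edge goes from each
    case_id to every precedent it cites (mirrors nx.DiGraph edges)."""
    graph = defaultdict(set)
    for case in cases:
        for cited in case.get("precedent_citations", []):
            graph[case["case_id"]].add(cited)
    return graph

def in_degree(graph, node):
    """How often a precedent is cited across the corpus, a crude
    authority signal usable in retrieval scoring."""
    return sum(node in targets for targets in graph.values())
```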
Contextual embedding was generated with a domain-specific BERT model (bert-legal) and citation embeddings were generated using DGL graph encoders. Approximate Nearest Neighbor search was performed using FAISS (faiss.IndexFlatIP). Cross-attention scoring was applied with a PyTorch module (CrossAttentionScorer) and the results obtained were ranked by ensemble weighting of contextual and citation similarity. The discourse structure was parsed using a rhetorical role classifier (roberta-legal-rhetoric) and relevant reasoning patterns were extracted by aligning rhetorical roles. Weights were assigned to argument segments based on frequency and citation strength. Exclude contradictory or irrelevant precedents.
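The argument-segment weighting by frequency and citation strength can be sketched as follows. The 0.6/0.4 weighting and the segment record fields (`freq`, `cite_strength`) are illustrative assumptions, not the protocol's actual scoring parameters.

```python
def weight_arguments(segments):
    """Score each argument segment by a weighted sum of its normalized
    frequency and citation strength (weights are illustrative)."""
    max_freq = max(s["freq"] for s in segments) or 1
    max_cite = max(s["cite_strength"] for s in segments) or 1
    return {
        s["id"]: 0.6 * s["freq"] / max_freq + 0.4 * s["cite_strength"] / max_cite
        for s in segments
    }
```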
Documents were encoded with a domain-adapted BERT encoder, and hierarchical attention modules were applied (implemented in hierarchical_encoder.py). SBERT and Bi-LSTM (sentence-transformers package) were used for sentence embeddings. Abstractive summaries were decoded using a constrained decoder (legal_decoder.py), with the maximum sequence length limited to 2048 tokens. To train the model, set batch size = 32 and learning rate = 3e-5; train for 15 epochs with gradient accumulation (8 steps); apply dropout = 0.15 and weight decay = 0.01; and enable label smoothing = 0.1. Modify hyperparameters in train_config.json.
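A possible shape for train_config.json, mirroring the hyperparameters in Table 5, is shown below; the key names are assumptions, as the actual configuration schema is not specified in the protocol.

```json
{
  "batch_size": 32,
  "learning_rate": 3e-5,
  "epochs": 15,
  "max_sequence_length": 2048,
  "dropout": 0.15,
  "gradient_accumulation_steps": 8,
  "warmup_steps": 1000,
  "weight_decay": 0.01,
  "label_smoothing": 0.1
}
```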
To evaluate performance, compute ROUGE (1, 2, and L) using the rouge_score library, BLEU using nltk.translate.bleu_score, and BERTScore using the bert_score package. Evaluate Legal-SemSim using ontology-based semantic similarity scripts and entity F1 using SpaCy and gold annotations. Runtime efficiency (seconds per fold) was recorded, and the summaries were compared against gold references. Two legal experts reviewed factual correctness, and the final validated summaries were saved. Following this protocol produces domain-specific legal summaries that preserve legal reasoning, factual accuracy, and citation fidelity.
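For reference, ROUGE-1 F1 can be computed from scratch as below. This is a simplified sketch, with no stemming or stopword handling, of what the rouge_score library computes, and is not a replacement for it.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram ROUGE-1 F1: overlap of candidate and reference word counts,
    combined into precision/recall and their harmonic mean."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```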
Evaluation metrics
The system was evaluated using a comprehensive set of metrics, which is shown in Table 6.
| Metric Category | Specific Metrics | Description |
| Lexical Overlap | ROUGE-1, ROUGE-2, ROUGE-L | Measures n-gram overlap between generated and reference summaries |
| | BLEU | Evaluates precision of n-grams |
| | METEOR | Accounts for stemming and synonymy in evaluation |
| Semantic Similarity | BERTScore | Leverages contextual embeddings to measure semantic similarity |
| | Legal-SemSim | Domain-specific semantic similarity metric based on legal ontologies |
| Legal Accuracy | Legal Entity Recognition F1 | Accuracy in identifying key legal entities and concepts |
| | Citation Accuracy | Correctness of included legal citations |
| | Argument Preservation | Preservation of key legal arguments |
| | Reasoning Chain Accuracy | Accuracy in preserving logical reasoning steps |
| Human Evaluation | Summary Coherence | Expert evaluation of logical flow (1-5 scale) |
| | Legal Reasoning | Assessment of legal reasoning quality (1-5 scale) |
| | Factual Correctness | Verification of factual claims (1-5 scale) |
| | Actionability | Usefulness for legal professionals (1-5 scale) |
| | Comprehensiveness | Coverage of key legal points (1-5 scale) |
| Efficiency | Execution Time | Average time to generate summaries (in seconds) |
| | Memory Usage | Peak memory consumption during inference (in GB) |
Table 6: Comprehensive set of metrics.
For human evaluation, 12 legal professionals (6 practicing attorneys, 3 law professors, and 3 legal researchers) were recruited to assess 150 randomly sampled summaries across all models. Each summary was evaluated by at least 4 experts, and inter-annotator agreement was measured using Fleiss' Kappa.
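Fleiss' Kappa can be computed directly from a ratings matrix in which each row is an item and each column counts how many raters assigned that item to a given category. The following is a self-contained sketch of the standard formulation (it assumes a fixed number of raters per item and is not code from the study):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an N x k count matrix: ratings[i][j] is how
    many of the n raters assigned item i to category j."""
    N = len(ratings)
    n = sum(ratings[0])          # raters per item (assumed constant)
    k = len(ratings[0])
    total = N * n
    # p_j: proportion of all assignments falling in category j
    p = [sum(row[j] for row in ratings) / total for j in range(k)]
    # P_i: observed agreement for item i
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P) / N           # mean observed agreement
    P_e = sum(pj * pj for pj in p)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1; values near the study's reported 0.82 indicate strong agreement.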
Baseline and comparative models
The hybrid model was compared against several state-of-the-art approaches, including the baseline approaches shown in Table 7.
| Model Category | Specific Models | Description |
| Extractive | TextRank | Graph-based extractive summarization |
| Extractive | LexRank | Graph-based approach with lexical centrality |
| Extractive | BERTSUM | BERT-based extractive summarizer |
| Extractive | Legal-BERTSum | Legal domain-adapted BERT extractive summarizer |
| Abstractive | BART | Standard BART without domain adaptation |
| Abstractive | Legal-BART | BART fine-tuned on legal corpus |
| Abstractive | T5-Legal | T5 model fine-tuned on legal documents |
| Abstractive | PEGASUS | State-of-the-art abstractive summarizer |
| Abstractive | LegalPEGASUS | PEGASUS fine-tuned on legal documents |
| Abstractive | PALM-Law | Domain-specific adaptation of PALM |
| Baseline | CBR only | Case-Based Reasoning retrieval and adaptation without transformer summarization |
| Baseline | Transformer only | Multi-stage transformer summarizer without CBR integration |
| LLM-Based | GPT-4 (API) | General-purpose LLM applied to summarization with zero-shot prompting |
| LLM-Based | LegalLLaMA | LLaMA model fine-tuned on domain-specific legal corpora |
| Hybrid | BART+KB | BART with legal knowledge base integration |
| Hybrid | BART+CBR (previous) | Previous hybrid approach with CBR and BART |
| Hybrid | Proposed (Multi-stage+CBR) | Advanced hybrid approach |
Table 7: State-of-the-art approaches.
Additionally, extensive ablation studies were conducted to evaluate the contribution of each component.
Experimental setup
To rigorously evaluate the effectiveness and efficiency of the proposed multi-stage hybrid model, CBR was integrated with a deep learning-based transformer architecture, and a comprehensive experimental pipeline was designed. The setup ensures reproducibility, scalability, and fairness in benchmarking against state-of-the-art baseline models. All experiments were conducted on a high-performance computing cluster with dual GPUs (40 GB memory each), a 64-core server-grade CPU, 512 GB of RAM, and high-speed SSD storage. Widely adopted open-source deep learning and natural language processing libraries were utilized (version details are provided in the Table of Materials).
Explanation of metrics
To assess the effectiveness of the proposed legal text summarization model, standard reference-based evaluation metrics from the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) family were employed. These metrics measure the quality of machine-generated summaries by comparing them with human-written gold-standard summaries, primarily based on n-gram overlap. The following variants were utilized:
ROUGE-1 measures the overlap of unigrams between a system-generated summary and a reference summary, evaluating the summarization model's capacity to capture crucial content words. The metric is formulated as:
$$\text{ROUGE-1} = \frac{\sum_{S \in \{\text{References}\}} \sum_{\text{unigram} \in S} \text{Count}_{\text{match}}(\text{unigram})}{\sum_{S \in \{\text{References}\}} \sum_{\text{unigram} \in S} \text{Count}(\text{unigram})}$$
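As a concrete, simplified illustration of unigram matching (recall only, without the stemming and stop-word removal applied by the official script):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Unigram recall: matched unigram counts over total reference
    unigram counts, clipping each match to the reference count."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    matched = sum(min(cand[w], ref[w]) for w in ref)
    return matched / sum(ref.values())
```

For example, with candidate "the court dismissed the appeal" and reference "the appeal was dismissed", three of the four reference unigrams are matched, giving a recall of 0.75.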
ROUGE-2 assesses the fluency and coherence of short phrase structures by evaluating the overlap of bigrams between candidate and reference summaries. The metric is formulated as:
$$\text{ROUGE-2} = \frac{\sum_{S \in \{\text{References}\}} \sum_{\text{bigram} \in S} \text{Count}_{\text{match}}(\text{bigram})}{\sum_{S \in \{\text{References}\}} \sum_{\text{bigram} \in S} \text{Count}(\text{bigram})}$$
ROUGE-L measures the preservation of in-order sequence matching without requiring exact word adjacency, capturing syntactic and sentence-level coherence between generated and reference summaries. The metric is formulated as:
$$R_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{m}, \qquad P_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{n}, \qquad \text{ROUGE-L} = \frac{(1+\beta^{2})\,R_{\text{lcs}}\,P_{\text{lcs}}}{R_{\text{lcs}} + \beta^{2} P_{\text{lcs}}}$$

where $X$ is the reference summary of length $m$, $Y$ is the candidate summary of length $n$, $\text{LCS}(X, Y)$ is the length of their longest common subsequence, and $\beta$ controls the relative weight of recall.
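The longest common subsequence underlying ROUGE-L can be computed with standard dynamic programming. A minimal sketch (the beta value here is an arbitrary illustrative choice, not the official script's default):

```python
def lcs_length(x, y):
    """Dynamic-programming length of the longest common subsequence:
    the in-order, not necessarily contiguous, match that ROUGE-L scores."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f(candidate, reference, beta=1.2):
    """F-measure combining LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(ref), lcs / len(cand)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```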
These ROUGE metrics are computed using the official ROUGE-1.5.5 script with stemming and stop-word removal enabled, ensuring consistency with benchmarking practices in legal NLP tasks3. While ROUGE does not capture semantic fidelity, it remains a foundational metric for assessing lexical overlap and is therefore complemented in this work by BERTScore, Legal-SemSim, and reasoning accuracy for a more holistic evaluation. Table 8 shows the comparative performance on the main metrics.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore | Legal-SemSim | Legal Entity F1 | Execution Time (s) |
| TextRank | 42.3 | 19.8 | 40.7 | 18.2 | 0.67 | 0.61 | 0.58 | 1.2 |
| LexRank | 43.5 | 20.4 | 41.2 | 18.6 | 0.68 | 0.63 | 0.6 | 1.4 |
| BERTSUM | 51.2 | 27.5 | 49.3 | 23.5 | 0.74 | 0.69 | 0.74 | 4.8 |
| Legal-BERTSum | 53.6 | 28.9 | 51.7 | 25.2 | 0.76 | 0.73 | 0.78 | 5.1 |
| BART (vanilla) | 56.1 | 32.8 | 53.9 | 27.8 | 0.79 | 0.71 | 0.72 | 5.8 |
| Legal-BART | 60.2 | 36.3 | 57.5 | 30.5 | 0.82 | 0.78 | 0.81 | 6.7 |
| T5-Legal | 59.7 | 35.9 | 56.8 | 30.1 | 0.81 | 0.77 | 0.8 | 7.2 |
| PEGASUS | 61.3 | 37.2 | 58.4 | 31.2 | 0.83 | 0.76 | 0.79 | 8.1 |
| LegalPEGASUS | 63.7 | 39.8 | 61.6 | 33.5 | 0.85 | 0.82 | 0.84 | 8.3 |
| PALM-Law | 66.1 | 43.2 | 64.3 | 35.7 | 0.87 | 0.85 | 0.87 | 9.2 |
| BART+KB | 63.5 | 39.4 | 61.2 | 33.1 | 0.85 | 0.81 | 0.83 | 8.4 |
| BART+CBR (previous) | 65.1 | 41.2 | 63.4 | 34.5 | 0.87 | 0.84 | 0.84 | 8.9 |
| CBR only | 62.7 | 31.4 | 60.1 | 28.9 | 0.81 | 0.84 | 0.9 | 7.1 |
| Transformer only | 71.4 | 39.8 | 69.2 | 35.1 | 0.86 | 0.79 | 0.82 | 8.7 |
| Multi-stage+CBR (Proposed) | 78.4 | 54.8 | 75.3 | 46.2 | 0.94 | 0.96 | 0.98 | 10.3 |
Table 8: Comparative performance on main metrics.
Figure 1 depicts the results of different legal text summarization models across eight metrics: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, BERTScore, Legal-SemSim, Legal Entity F1, and Execution Time. Traditional models, such as TextRank and LexRank, scored lower across most metrics, while advanced architectures, e.g., PALM-Law and LegalPEGASUS, provide a significant boost in accuracy-based metrics. The proposed Multi-stage+CBR outperforms the other compared architectures, with top scores on ROUGE, BLEU, and semantic similarity metrics and the best performance on legal-specific tasks. This comes at a trade-off between execution time and model complexity: more complicated models tend to take longer to compute. Table 9 presents the performance by case domain (ROUGE-L).

Figure 1: Performance comparison of legal text summarization models across key metrics. The figure shows the comparison of the performance of various legal text summarization models across eight metrics: ROUGE-1, ROUGE-2, ROUGE-L, BLEU, BERTScore, Legal-SemSim, Legal Entity F1, and Execution Time.
While recent LLM-based models such as GPT-4 and LegalLLaMA demonstrate strong fluency and contextual coverage, they still underperform compared to the proposed multi-stage hybrid framework. Specifically, GPT-4 achieves a legal reasoning accuracy of 0.83 and legal entity preservation F1 of 0.88, whereas LegalLLaMA records similar trends (reasoning accuracy 0.85, entity F1 0.87). In contrast, the proposed Multi-stage+CBR system consistently surpasses these baselines with a reasoning accuracy of 0.97 and an entity F1 of 0.98. This reinforces the importance of combining retrieval-based reasoning (through Case-Based Reasoning) with transformer-driven generation for legal summarization tasks, offering both factual robustness and linguistic fluency.
| Model | Contract Law | IP | Constitutional | Administrative | Criminal | Tort | Other | Average |
| BERTSUM | 50.8 | 51.2 | 48.7 | 47.5 | 49.2 | 48.4 | 46.8 | 48.9 |
| Legal-BART | 59.2 | 60.4 | 57.3 | 56.2 | 58.1 | 57.4 | 55.3 | 57.7 |
| PALM-Law | 65.8 | 66.5 | 64.2 | 63.1 | 64.9 | 64.3 | 62.4 | 64.5 |
| BART+CBR (previous) | 64.2 | 65.1 | 62.8 | 61.7 | 63.5 | 62.9 | 60.5 | 63 |
| Multi-stage+CBR (Proposed) | 76.7 | 77.5 | 75.1 | 74.2 | 75.8 | 75.1 | 73.4 | 75.4 |
Table 9: Performance by case domain (ROUGE-L).
Figure 2 shows the performance of five models across different legal domains. The proposed model consistently outperforms the others in every category, achieving the highest average score. While PALM-Law and BART+CBR perform well, they fall short of the multi-stage approach. BERTSUM has the lowest scores across all domains, while Legal-BART provides moderate improvements. This visualization highlights the superior domain-specific accuracy of the Multi-stage+CBR model. Table 10 presents the legal entity recognition F1 scores.
| Model | Case Citations | Legal Principles | Statutes | Precedents | Overall F1 |
| Legal-BART | 0.76 | 0.72 | 0.78 | 0.73 | 0.75 |
| PALM-Law | 0.88 | 0.83 | 0.89 | 0.85 | 0.86 |
| BART+CBR (previous) | 0.87 | 0.82 | 0.85 | 0.84 | 0.84 |
| Multi-Stage+CBR (Proposed) | 0.98 | 0.97 | 0.99 | 0.98 | 0.98 |
Table 10: Legal entity recognition F1 score.

Figure 2: Model performance across different legal domains. The figure shows the performance of five models across different legal domains.
Figure 3 shows the performance of four models over a range of legal information extraction tasks, measured by F1 scores. The proposed Multi-stage+CBR achieves near-perfect scores in all categories, significantly above the other models. PALM-Law and BART+CBR remain competitive but trail the multi-stage approach in precision, recall, and F1. Legal-BART trails much further, particularly in extracting legal principles and case citations. Overall, the plot demonstrates the superior extraction quality of Multi-stage+CBR. Table 11 presents the reasoning chain accuracy.
| Model | Argument Identification | Reasoning Flow | Precedent Application | Conclusion Validity | Overall |
| Legal-BART | 0.64 | 0.61 | 0.71 | 0.68 | 0.66 |
| PALM-Law | 0.78 | 0.76 | 0.85 | 0.82 | 0.8 |
| BART+CBR (previous) | 0.75 | 0.73 | 0.82 | 0.79 | 0.77 |
| Multi-stage+CBR (Proposed) | 0.97 | 0.96 | 0.98 | 0.97 | 0.97 |
Table 11: Reasoning chain accuracy.

Figure 3: Comparison of models on legal information extraction tasks. The figure shows the performance of four models across different legal information extraction tasks, measured by F1 scores.
Figure 4 summarizes the relative performance of four models across legal reasoning tasks: argument identification, reasoning flow, precedent application, conclusion validity, and overall accuracy. As the results show, the proposed Multi-stage+CBR model outperforms the other compared models with very high scores in every category. PALM-Law and BART+CBR achieve strong performance but fall short of the multi-stage approach. Legal-BART performs the worst on all tasks, especially reasoning flow and argument identification.

Figure 4: Comparison of models on legal reasoning and accuracy tasks. The figure compares the performance of four models across legal reasoning tasks, including argument identification, reasoning flow, precedent application, conclusion validity, and overall accuracy.
Human evaluation
Table 12 represents the expert assessment (Scale 1-5, higher is better).
| Model | Coherence | Legal Reasoning | Factual Correctness | Actionability | Comprehensiveness | Overall |
| BERTSUM | 3.1 | 2.7 | 3.7 | 2.9 | 2.8 | 3 |
| Legal-BART | 3.9 | 3.5 | 4.2 | 3.7 | 3.6 | 3.8 |
| PALM-Law | 4.2 | 4 | 4.5 | 4.2 | 4.1 | 4.2 |
| BART+CBR (previous) | 4.4 | 4.3 | 4.6 | 4.5 | 4.3 | 4.4 |
| Multi-stage+CBR (Proposed) | 4.8 | 4.7 | 4.9 | 4.8 | 4.8 | 4.8 |
Table 12: Expert assessment (Scale 1-5, higher is better).
Figure 5 presents five models evaluated on legal document quality metrics: coherence, legal reasoning, factual correctness, actionability, comprehensiveness, and overall score. The proposed model significantly outperforms all others, obtaining near-perfect scores on all metrics. BART+CBR and PALM-Law also perform well but fall short in overall quality; Legal-BART performs moderately, whereas BERTSUM scores lowest in all categories. Inter-annotator agreement (Fleiss' Kappa) was 0.82, indicating strong agreement.

Figure 5: Evaluation of models using legal document quality metrics. The figure shows the performance of five models across legal document quality metrics, including coherence, legal reasoning, factual correctness, actionability, comprehensiveness, and overall score.
Qualitative observations and expert insights
Qualitative observations from the expert evaluators were aggregated rather than presented as individual case studies. Legal experts repeatedly noted that the proposed Multi-stage+CBR model exhibited strong legal reasoning, a high level of factual accuracy, and enhanced summary coherence. These expert evaluations support the quantitative gains found in ROUGE, Legal-SemSim, and reasoning accuracy measures.
Data availability:
The dataset utilized in this study consists of 4,968 publicly available legal case documents sourced from the Kaggle Legal Case Corpus, which were further processed, annotated, and integrated with derived metadata (including expert annotations and reasoning-chain mappings). The resulting dataset contains identifiers, legal reasoning traces, and annotations tied to case-specific contexts. Due to ethical and legal constraints associated with the redistribution of annotated legal documents, particularly those containing sensitive judicial text, precedent identifiers, or potentially identifiable legal entities, the annotated dataset cannot be made publicly available. However, the supporting data files, without annotations, are provided as Supplementary File 1.
Supplementary File 1: Supporting data files.
The effectiveness of the CBR augmentation to the multi-stage transformer design is demonstrated by a new performance benchmark in legal-domain text summarization. The results indicate significant improvements: ROUGE scores of 78.4 (ROUGE-1), 54.8 (ROUGE-2), and 75.3 (ROUGE-L), and a domain-specific legal entity recognition F1 of 98%. To further validate the contribution of each component, CBR-only and Transformer-only baselines were introduced. The CBR-only baseline exhibited strong factual accuracy and reasoning preservation but lacked fluency and comprehensive coverage. Conversely, the Transformer-only baseline produced more fluent summaries and higher lexical overlap but showed weaker performance in legal entity recognition and reasoning accuracy. The combination of the two (Multi-stage+CBR) outperformed both baselines, confirming that CBR enriches contextual grounding while transformers enhance linguistic fluency and semantic abstraction. This ablation underscores the complementary strengths of symbolic retrieval and neural summarization, thereby justifying the integration of both in the proposed hybrid model.
Critical steps
The success of the proposed protocol depends heavily on three critical steps: (i) accurate legal entity recognition (LER) to preserve key factual elements, (ii) robust case retrieval using hybrid embeddings (contextual, citation, structural), and (iii) the multi-stage transformer decoder to maintain both factual fidelity and coherence. Errors in any of these stages directly affect the reliability of the final summary. The different phases of the protocol were also subjected to automated protocol checks and manual verification to reduce the chance of error during implementation. The preprocessing scripts include an inbuilt data integrity check that flags gaps or noisy records. Cross-validation during case retrieval and adaptation allowed relevance to be scored consistently and avoided retrieval mismatches. Model training was monitored with early stopping and gradient clipping, and reproducibility was ensured with fixed random seeds. Lastly, a human expert review was conducted on a sample of outputs in order to detect and correct possible inconsistencies in legal reasoning or factual accuracy.
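An early-stopping monitor of the kind described above can be implemented in a few lines. This is a generic sketch; the patience and tolerance values are illustrative assumptions, not the settings used in the study:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved by at least
    min_delta for `patience` consecutive epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(val_loss): break` after each validation pass is all that is needed; seed fixing and gradient clipping are handled separately by the framework.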
Modifications and troubleshooting
In cases where OCR-generated documents introduce noise, apply additional preprocessing (e.g., spelling correction, section segmentation). For datasets with limited citation structures, bypass the citation graph stage and rely on contextual embeddings. If GPU memory is insufficient for the maximum sequence length (2048 tokens), reduce batch size or increase gradient accumulation. These adjustments ensure the protocol can run effectively in constrained environments22.
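The gradient-accumulation adjustment follows from simple arithmetic: the effective batch size equals the per-device batch times the number of accumulation steps. A small illustrative helper (assuming, for example, that memory only allows a per-device batch of 4, in which case 8 accumulation steps preserve an effective batch of 32):

```python
import math

def accumulation_steps(effective_batch, micro_batch):
    """Smallest number of gradient-accumulation steps so that
    micro_batch * steps reaches the target effective batch size."""
    return math.ceil(effective_batch / micro_batch)
```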
Limitations
The current protocol is evaluated primarily on English legal corpora with structured citations. Its applicability to non-English or less-structured datasets remains limited. While retrieval accuracy is high (98%), efficiency decreases with extremely large case repositories, requiring index optimization or distributed retrieval30. Another limitation is dependency on expert-annotated datasets for training entity recognition and rhetorical role classification.
Comparative significance
Compared with Legal-BART (ROUGE-1 = 60.2) and PALM-Law (ROUGE-1 = 66.1), our system achieves ROUGE-1 = 78.4 and Legal-SemSim = 0.96, highlighting substantial improvements. Unlike purely generative LLMs such as GPT-4 and LegalLLaMA, the protocol integrates retrieval-based reasoning, resulting in higher reasoning accuracy (0.97 versus 0.83-0.85) and entity preservation (0.98 versus 0.87-0.88). These comparisons reinforce the protocol's strength in domains where factual accuracy is paramount.
Future applications
Future work may extend the protocol to multilingual legal summarization, integration with knowledge graphs for explainable reasoning, and real-time deployment in court decision-support systems. Another promising direction is combining the protocol with federated learning frameworks to enable privacy-preserving training across distributed legal databases34.
Efficiency, automation, and usability
The protocol is computationally efficient, completing each fold within ~10 s on dual 40 GB GPUs. Its modular design allows automation: preprocessing, retrieval, and summarization are implemented as separate scripts (preprocess.py, retrieval.py, summarize.py) that can be executed in sequence. This modularity enhances reproducibility and facilitates deployment in legal-tech environments.
Conclusion
This paper develops a new hybrid model for legal text summarization that combines Case-Based Reasoning (CBR) with multi-stage deep learning techniques to capture the peculiarities of legal-domain summarization. When compared against many state-of-the-art baselines35, the proposed method outperforms them on several evaluation measures, including ROUGE, BLEU, BERTScore, Legal-SemSim, and Legal Entity F1. The combination of precedent-based retrieval with contextual abstraction has shown promise in producing factually accurate, coherent, and legally correct summaries. Unlike typical extractive or generic abstractive summarization approaches36, the proposed system captures domain-specific details by collecting relevant case information and incorporating it into a controlled, attention-guided generation process. The results reveal not just improved semantic fidelity but also a considerable decrease in factual hallucinations, an important criterion in legal use. Despite these encouraging results, several avenues for further development remain. One approach is to apply the model to multi-document summarization of legislation, which requires inter-case reasoning and hierarchical summarization. Adding cross-case temporal and jurisdictional connections could further strengthen the interpretability and usefulness of summaries. While the current assessment uses well-established criteria, human-in-the-loop assessment frameworks involving legal professionals could provide greater insight into usability, compliance, and ethical accuracy. Furthermore, incorporating legal QA-style question-answering modules into the summarizer could transform the system into an even more interactive legal assistant. Future studies may investigate explainable AI (XAI) techniques to improve transparency in summary production.
For example, visualizing which prior decisions or legal concepts influenced specific sections of the summary might boost trust and adoption among legal practitioners. Moreover, scaling the approach to multilingual legal corpora, particularly for jurisdictions with multiple legal languages, can make this methodology more widely applicable. Another intriguing direction is to adapt domain-specific large language models (LLMs) using reinforcement learning from human feedback (RLHF) in legal settings. Finally, the proposed hybrid method bridges the gap between retrieval and generative paradigms in legal AI while laying the framework for next-generation legal document processing systems that are accurate, interpretable, and contextually consistent with judicial reasoning.
The authors declare that they have no potential conflicts of interest related to the content of this study.
The authors express their sincere gratitude to the Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (Deemed to be University), Hyderabad, Telangana, India, for providing the computational facilities and research infrastructure essential to this work. Special thanks are extended to the legal experts and domain specialists who contributed to the annotation and evaluation of the legal corpus used in this study.
| Category | Name | Quantity/Version | Source |
| CPU Hardware | AMD EPYC 7742 (64-Core, 2.25 GHz) | 1 unit | AMD |
| Framework | PyTorch | 2 | https://pytorch.org |
| GPU Hardware | NVIDIA A100 (40 GB) | 2 units | NVIDIA Corp. |
| Graph Framework | DGL | 1.1 | https://www.dgl.ai |
| Library | Hugging Face Transformers | 4.32.0 | https://huggingface.co/transformers |
| Library | SpaCy | 3.6 | https://spacy.io |
| Library | NLTK | 3.8.1 | https://www.nltk.org |