Research Article
Ying Lin1,2, Linxuan Li1,2, Jingyu Zeng2, Rongkang Zhao1,2, Xin Jin2,3,4,5,6, Huanhuan Zhu2,3
1College of Life Sciences, University of Chinese Academy of Sciences, 2Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, 3State Key Laboratory of Genome and Multi-omics Technologies, BGI Research, 4The Innovation Centre of Ministry of Education for Development and Diseases, School of Medicine, South China University of Technology, 5Shanxi Medical University-BGI Collaborative Center for Future Medicine, Shanxi Medical University, 6State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University Health Science Center
Erratum Notice
Important: There has been an erratum issued for this article. View Erratum Notice
Three imputation tools (STITCH, QUILT2, and GLIMPSE2) were benchmarked across varying sequencing depths and sample sizes, using the CKB and EAS reference panels. The results provide a practical framework for selecting appropriate imputation strategies for ultra-low-depth sequencing data, facilitating large-scale population genomic and complex trait studies.
Ultra-low-depth sequencing (ULDS) is a cost-effective strategy for large-scale genomic studies, but its utility hinges on accurate genotype imputation. This study evaluates three imputation tools (STITCH, QUILT2, and GLIMPSE2) across varying sequencing depths and sample sizes, using the China Kadoorie Biobank (CKB) and the 1000 Genomes Project (1KGP) East Asian (EAS) reference panels. Three critical performance divergences are demonstrated. (1) Sample size sensitivity: STITCH's accuracy improved markedly with larger samples, whereas QUILT2 and GLIMPSE2 showed minimal dependence on sample size. (2) Reference panel optimization: the population-specific CKB panel significantly enhanced accuracy for QUILT2 and GLIMPSE2 but had a negligible impact on STITCH, which relies on internal haplotype inference. (3) Depth thresholds: all tools achieved robust accuracy at moderate sequencing depths (≥ 0.5x), but STITCH underperformed drastically at ultra-low depths (≤ 0.1x). GLIMPSE2 with CKB delivered the highest overall accuracy, while QUILT2 balanced precision and computational efficiency. For non-invasive prenatal testing (NIPT) data, GLIMPSE2+CKB maintained sufficient accuracy for downstream analyses. A decision framework is proposed, prioritizing population-matched panels and depth-adapted tools, offering actionable guidelines for optimizing ULDS-WGS in diverse research settings. These insights bridge methodological advancements with practical implementation, enabling cost-effective scaling of genomic studies without compromising data quality.
Ultra-low-depth sequencing (ULDS), defined as sequencing coverage below 1x, has gained traction due to its low cost, broad genome coverage, and compatibility with diverse sample types. It has already shown clinical value in applications such as non-invasive prenatal testing (NIPT)1, cancer monitoring2, and chromosomal copy number variation (CNV) detection3,4. Beyond clinical diagnostics, the decreasing cost of sequencing and rapid advances in bioinformatics have enabled ULDS to play a growing role in population genomics and complex trait research. By combining ULDS data with population-scale haplotype reference panels, genotype imputation enables the recovery of genome-wide variant information at the individual level. As a result, ULDS has emerged as a cost-effective alternative to traditional single-nucleotide polymorphism (SNP) arrays and high-depth whole-genome sequencing (WGS)5, particularly in large-scale studies such as genome-wide association studies (GWAS) and population structure analyses.
Previous research has demonstrated the feasibility of conducting various genetic studies using NIPT sequencing data, including variant calling, population history reconstruction, viral infection pattern inference, and GWAS6.
Despite these advantages, the extremely sparse nature of ULDS data presents unique challenges. At the variant level, many sites are entirely unobserved or represented by only a single allele per individual, leading to insufficient data quality for downstream analyses. Genotype imputation is therefore essential, leveraging haplotype structure from large reference panels (e.g., 1000 Genomes7 or population-specific resources) to statistically infer missing or uncertain genotypes. Previous work has shown that imputation from NIPT data can achieve high accuracy and retain robust statistical power in GWAS for identifying trait-associated variants8. Using the STITCH9 algorithm, NIPT data (mean depth ~0.15x) in a cohort of 20,900 Chinese pregnant women were successfully imputed, leading to the identification of pregnancy-associated loci. The imputed genotypes showed strong concordance with high-depth WGS data in GWAS results (Pearson R² > 0.8)10.
The success of ULDS-based analyses critically depends on imputation accuracy, which is influenced by sequencing depth, reference panel quality and population match, imputation algorithm performance, sample size, and allele frequency spectrum11. Among these, the choice of reference panel is a major determinant of imputation accuracy. Commonly used panels include globally representative resources such as the 1000 Genomes Project (1KGP)7, TOPMed12, and the Haplotype Reference Consortium (HRC)13, as well as increasingly available population- or region-specific panels such as Singapore 10,000 Genomes (SG10K)14 and the China Kadoorie Biobank (CKB)15. Another key factor in imputation performance is the choice of algorithm. Several tools have been developed to accommodate the unique challenges of low-depth sequencing, significantly advancing the practical use of imputation in large-scale genetic research. While imputation methods such as Beagle (v5+)16, Minimac417, and IMPUTE511 are widely used for SNP array and medium-to-high-depth WGS data, they often perform suboptimally in ULDS settings. More recently, specialized tools have been developed to address these challenges. STITCH9 infers haplotypes directly from low-depth sequencing reads, making it particularly suitable for large homogeneous cohorts. QUILT218 employs a compressed haplotype library and localized likelihood model, enabling efficient imputation with massive reference panels and offering unique applications in prenatal genomics. GLIMPSE219, an extension of the original GLIMPSE framework, provides further improvements in both accuracy and computational efficiency.
Although these tools represent major advances, their relative performance under different experimental designs (e.g., sequencing depth, cohort size, and reference panel choice) has not been systematically evaluated, leaving researchers without clear guidance on selecting the most appropriate strategy. To bridge this gap, three widely used ULDS imputation tools (STITCH, QUILT2, and GLIMPSE2) were systematically benchmarked under multiple sequencing depths and sample sizes. Their performance was evaluated using two East Asian reference panels highly relevant to Chinese populations. The findings indicate that ULDS imputation is generally reliable at sequencing depths ≥0.5x, while depths <0.1x require substantially larger cohorts to achieve acceptable accuracy. Reference panel selection should be tailored to the study context, with population-matched panels such as CKB improving imputation accuracy. Moreover, these approaches are directly applicable to ultra-low-depth data generated in large-scale population studies and NIPT. This study thus establishes a practical framework for tool selection in ULDS-based research, providing methodological guidance for future applications in population genetics and complex trait analyses.
All participants provided written informed consent prior to participation. The study involving high-depth WGS data was reviewed and approved by the BGI Institutional Review Board (BGI-IRB 23058-T2), and approval for the collection of human genetic resources was obtained from the Human Genetic Resources Administration of China ([2023] CJ0262). The study involving ULDS data from NIPT was approved by the Institutional Review Board of Wuhan Children's Hospital (2021R062) and the BGI Institutional Review Board (BGI-IRB 21088), with additional approval from the Human Genetic Resources Administration of China ([2021] CJ2002).
NOTE: This study included two types of WGS data. The first type consisted of high-depth WGS data (30x) obtained from blood samples of 500 individuals recruited from a natural population cohort in Shenzhen. These data were used to construct a high-quality ground truth dataset and for subsequent down-sampling and accuracy evaluations. The second type comprised ULDS data derived from the NIPT of 10,000 pregnant women from the Wuhan area.
1. High-depth whole genome sequencing data
2. Ultra-low depth NIPT data (~0.1x WGS)
3. Data preprocessing pipeline
4. Genotype imputation
5. Imputation accuracy evaluation
Impact of sample size on imputation accuracy
Increasing the sample size from N = 200 to N = 500 improved the imputation accuracy of STITCH, particularly under low-coverage conditions. For example, with the CKB reference panel at 1x coverage, STITCH achieved an R² of 0.916 (N = 500) compared to 0.882 (N = 200), an absolute increase of 0.034, or 3.4 percentage points (Figure 3; Supplementary File 2). Similarly, at 0.5x coverage, its accuracy rose from 0.800 to 0.868, an absolute increase of 0.068 (8.5% in relative terms). In contrast, QUILT2 and GLIMPSE2 showed minimal sensitivity to sample size variations, with R² fluctuations of less than 0.5% across all tested conditions. For instance, QUILT2 maintained stable performance under the CKB panel at 1x coverage (R² = 0.970 for N = 200 vs. 0.971 for N = 500). This indicates that STITCH benefits more from larger sample sizes, consistent with its hidden Markov model (HMM)-based framework, which infers haplotypes from the cohort's own sequencing reads.
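The accuracy values reported above are R² statistics. The exact formula is not reproduced in this section, so the following is only a minimal sketch that assumes the common convention of a squared Pearson correlation between imputed dosages and high-depth truth genotypes at a single variant. The function name `dosage_r2` and the toy values are hypothetical.

```python
import math


def dosage_r2(truth, imputed):
    """Squared Pearson correlation between truth genotypes (0/1/2)
    and imputed dosages (floats in [0, 2]) at one variant.

    Assumption: this is the accuracy metric; the text does not spell it out.
    """
    if len(truth) != len(imputed) or len(truth) < 2:
        raise ValueError("need two equally long vectors with at least 2 samples")
    n = len(truth)
    mean_t = sum(truth) / n
    mean_d = sum(imputed) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in imputed)
    # A monomorphic site has zero variance, so the correlation is undefined.
    if var_t == 0 or var_d == 0:
        return math.nan
    return (cov * cov) / (var_t * var_d)


# Toy example (hypothetical values, not data from the study).
truth_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0, 1.9]
print(round(dosage_r2(truth_genotypes, imputed_dosages), 3))
```

In practice such a per-variant value would be averaged over variants or computed within allele-frequency bins; which aggregation was used here is not stated in this section.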
Reference panel compatibility
The choice of reference panel affected the imputation accuracy of QUILT2 and GLIMPSE2, while having little impact on STITCH (Figure 3; Supplementary File 2). When using the Chinese population-specific CKB panel, both QUILT2 and GLIMPSE2 achieved higher accuracy than with the general-purpose EAS panel. For example, at 1x coverage, the R² of GLIMPSE2 increased from 0.956 with EAS to 0.974 with CKB (ΔR² = 1.8%), while QUILT2 rose from 0.950 to 0.971 (ΔR² = 2.1%). In contrast, STITCH showed almost no difference between the two panels (R² = 0.916 versus 0.915), consistent with its independence from external reference panels. These results highlight the practical advantage of population-specific reference panels in precision medicine research.
Sequencing depth and accuracy trade-offs
The imputation accuracy of all tools increased with higher sequencing depth, but differences between methods became more evident under ultra-low coverage conditions (≤ 0.1x) (Figure 3; Supplementary File 2). Using the CKB panel as an example, GLIMPSE2 achieved an R² of 0.676 at 0.05x coverage, which increased to 0.974 at 1x. Similarly, QUILT2 rose from 0.666 to 0.971 over the same range. In contrast, STITCH performed poorly at 0.05x coverage, with an R² of 0.255, indicating reduced robustness under extremely low coverage. At 0.1x coverage, QUILT2 and GLIMPSE2 reached R² values above 0.817 with the CKB panel, whereas STITCH reached only 0.532 (N = 500).
SNP count considerations and impact on tool selection
The total number of imputed SNPs varied across tools: STITCH produced fewer total SNPs than QUILT2 and GLIMPSE2, but after quality control, the differences in effective SNP numbers among the tools were reduced. For example, at a sample size of N = 500 and coverage of 1x, STITCH generated 21,829 SNPs with the CKB panel and 24,546 with the EAS panel, compared with 398,026 (CKB) and 205,711 (EAS) for QUILT2, and 467,754 (CKB) and 239,495 (EAS) for GLIMPSE2 under the same conditions. After applying standard quality control procedures (minor allele frequency [MAF] and Hardy-Weinberg equilibrium [HWE] filtering), the number of SNPs retained for downstream analysis became comparable across all tools (Figure 4; Supplementary File 2). This suggests that although the raw SNP counts differ across methods, the number of usable variants after QC is similar, and therefore total imputed SNP count should not be considered a decisive factor when selecting an imputation tool.
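The MAF and HWE filtering step can be illustrated with a small sketch that works on per-variant genotype counts. It is not the authors' pipeline: the MAF cutoff of 0.05 follows the Figure 4 caption, whereas the HWE p-value cutoff (`hwe_p_min=1e-6`) is an assumed placeholder because no threshold is given in this section, and a simple 1-degree-of-freedom chi-square test is swapped in for the exact test that tools such as PLINK typically apply.

```python
from math import erfc, sqrt


def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """MAF from genotype counts at one biallelic variant."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped samples")
    alt_freq = (n_het + 2 * n_hom_alt) / (2 * n)
    return min(alt_freq, 1 - alt_freq)


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) test of Hardy-Weinberg proportions.

    Simplification: an exact test is usually preferred for small counts.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic site: HWE holds trivially
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


def passes_qc(counts, maf_min=0.05, hwe_p_min=1e-6):
    """Keep a variant if MAF > maf_min and the HWE p-value >= hwe_p_min.

    hwe_p_min is an assumed value, not taken from the study.
    """
    return (minor_allele_frequency(*counts) > maf_min
            and hwe_chi2_pvalue(*counts) >= hwe_p_min)


# Hypothetical genotype counts (hom-ref, het, hom-alt) for N = 500 samples.
variants = {
    "common_in_hwe": (125, 250, 125),
    "rare": (490, 10, 0),
    "no_heterozygotes": (250, 0, 250),
}
kept = [name for name, counts in variants.items() if passes_qc(counts)]
print(kept)
```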
Computational resource consumption and tool selection recommendations
In this study, core-hours (the product of the number of CPU cores and runtime) were used as a key metric to quantify the computational burden of genotype imputation tools. Under the same sample size and sequencing depth, imputation using the CKB reference panel consistently required more core-hours than using the EAS panel, suggesting that larger or more complex reference panels increase computational demand. Additionally, core-hour usage increased with both sample size and sequencing depth, highlighting the need for careful resource planning in large-scale studies (Figure 5; Supplementary File 3). Among the three tools evaluated, QUILT2 consumed substantially more core-hours than both STITCH and GLIMPSE2, especially at the largest sample size (N = 10,050), where QUILT2 required 361-496 core-hours per imputation run, compared to approximately 20-22 for STITCH and 87-145 for GLIMPSE2. Based on these observations, GLIMPSE2 is recommended for small- to medium-sized datasets, particularly when imputation accuracy is a primary concern but computational resources are constrained. Conversely, STITCH should be considered the tool of choice in studies that lack high-quality reference panels or aim to handle very large cohorts, as it offers a favorable trade-off between accuracy and efficiency. Researchers are therefore advised to balance accuracy, reference panel availability, and computational costs when selecting imputation tools for ultra-low-depth WGS studies.
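As a small worked illustration of the core-hour metric defined above (cores multiplied by runtime), the sketch below sums core-hours over the jobs of one imputation run. The job list is hypothetical and the function names are not taken from the authors' code.

```python
def core_hours(n_cores, runtime_hours):
    """Core-hours for one job: number of CPU cores times wall-clock hours."""
    if n_cores <= 0 or runtime_hours < 0:
        raise ValueError("n_cores must be positive and runtime non-negative")
    return n_cores * runtime_hours


def total_core_hours(jobs):
    """Sum core-hours over (n_cores, runtime_hours) pairs, e.g. one per chunk."""
    return sum(core_hours(cores, hours) for cores, hours in jobs)


# Hypothetical run split into two chunks, each using 8 cores.
example_jobs = [(8, 1.5), (8, 1.0)]
print(total_core_hours(example_jobs))
```

Because the metric multiplies cores by time, a run on 8 cores for 2.5 hours and a run on 4 cores for 5 hours both count as 20 core-hours, which makes runs on different hardware allocations comparable.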

Figure 1: Main steps of the comprehensive evaluation of genotype imputation tools. This flowchart illustrates the three main steps of genotype imputation accuracy evaluation: Data Preprocessing Pipeline, Genotype Imputation Workflow, and Imputation Accuracy Evaluation.

Figure 2: Sequencing depth distribution of 10,000 ultra-low-depth NIPT samples. Histogram showing the coverage depth distribution of 10,000 non-invasive prenatal testing (NIPT) samples sequenced at ultra-low depth. The red dashed line indicates the mean coverage depth (0.102x), and the green dashed line represents the median coverage depth (0.104x). The x-axis ranges from 0.05x to 0.35x, with the majority of samples concentrated around 0.10x.

Figure 3: Evaluation of imputation accuracy for three genotype imputation tools under varying conditions. (A) Accuracy across varying coverage depths (0.05x, 0.1x, 0.5x, 1x) using two reference panels (CKB versus EAS), with sample size fixed at 500. (B) Accuracy across varying coverage depths under two sample sizes (200 versus 500), using the EAS reference panel. (C) Accuracy on 0.1x coverage depth data under different sample sizes (200, 500, and 10,050).

Figure 4: Comparison of SNP numbers from different genotype imputation tools. Results are shown for a sample size of N = 500 and a coverage depth of 1x. (A) The total number of SNPs predicted and the number of SNPs remaining after quality control (MAF > 0.05 and HWE filtering) by three genotype imputation tools (STITCH, QUILT2, and GLIMPSE2) in the EAS and CKB populations. (B) The number of SNPs after quality control and the number of effective SNPs used for calculating imputation accuracy.

Figure 5: Comparison of core-hour consumption across tools under varying coverage depths, sample sizes, and reference panels. (A) Core-hour consumption with increasing coverage depth under a fixed sample size (N = 500), comparing three tools (STITCH, GLIMPSE2, and QUILT2) using the EAS and CKB reference panels. (B) Core-hour usage across different sample sizes (200 vs. 500) as coverage depth increases, under the EAS reference panel. (C) The impact of sample size (200, 500, and 10,050) on core-hour consumption at a fixed coverage depth of 0.1x, under both EAS and CKB reference panels.
| Reference Panel | Sample size | Sequencing depth | Ancestries |
| CKB | 9,964 | ~15x | Chinese |
| 1KGP-EAS | 585 | ~30x | Five East Asian populations |
Table 1: Information on reference panels. The table provides key details about the reference panels used for genotype imputation, including the name of the reference panel, sample size, sequencing depth, and ancestries.
| Tool | Strengths | Limitations | Recommended Scenarios |
| STITCH | Reference-free; low computational burden; scalable to very large cohorts | Lower accuracy, especially at ultra-low depth | Large-scale cohorts (N ≥ 10,000); populations without appropriate reference panels |
| QUILT2 | High accuracy; robust across depths; supports maternal–fetal separation | Higher computational cost; under active development | Ultra-low-depth NIPT data; studies requiring maternal–fetal genome separation |
| GLIMPSE2 | Highest accuracy (R² up to 0.974 with CKB at 1×); efficient resource usage | Requires reference panel; slightly less robust at extremely low depths | Small- to medium-sized datasets; studies prioritizing accuracy with available reference |
Table 2: Recommended application scenarios for STITCH, QUILT2, and GLIMPSE2. Recommendations are derived from comparative evaluations under varying sequencing depths, sample sizes, and reference panel conditions, aiming to guide tool selection in ultra-low-depth WGS studies.
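The recommendations in Table 2 can be read as a rough decision rule. The sketch below is one deliberately simplified encoding of that table, not a procedure defined by the authors: the function name, the argument names, and in particular the order in which the conditions are checked are assumptions made for illustration.

```python
def recommend_tool(n_samples, has_matched_panel,
                   needs_maternal_fetal_separation=False):
    """Return a tool name following a simplified reading of Table 2.

    Assumed precedence (not stated in the source):
    1. no suitable reference panel        -> STITCH (reference-free)
    2. maternal-fetal separation required -> QUILT2
    3. very large cohort (N >= 10,000)    -> STITCH (low computational burden)
    4. otherwise                          -> GLIMPSE2 (highest accuracy)
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not has_matched_panel:
        return "STITCH"
    if needs_maternal_fetal_separation:
        return "QUILT2"
    if n_samples >= 10_000:
        return "STITCH"
    return "GLIMPSE2"


# Hypothetical study designs.
print(recommend_tool(500, has_matched_panel=True))
print(recommend_tool(10_050, has_matched_panel=True))
print(recommend_tool(500, has_matched_panel=False))
```

A real choice would also weigh sequencing depth, since STITCH's accuracy drops sharply at or below 0.1x, and the available compute budget, neither of which this sketch models.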
Supplementary File 1: Original code for analyses.
Supplementary File 2: Imputation accuracy and SNPs number.
Supplementary File 3: Core-hour consumption of imputation.
This study systematically assessed the performance of three widely used genotype imputation tools for ULDS, with high-depth WGS serving as the gold standard. A key methodological strength lies in the adoption of a unified preprocessing pipeline (encompassing alignment, quality control, and base quality score recalibration) that minimizes batch effects and ensures comparability across tools and conditions. By downsampling deeply sequenced samples, ultra-low-depth data were simulated under controlled settings, thereby providing an objective framework for benchmarking. Restricting analyses to a defined 10 Mb genomic interval on chromosome 1 ensured computational feasibility while retaining sufficient variant density. Comparative assessment of STITCH, QUILT2, and GLIMPSE2 reveals distinct strengths and weaknesses across imputation strategies.
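The downsampling step mentioned above amounts to keeping a fraction of reads equal to the target depth divided by the source depth. The following sketch computes those fractions for the depths evaluated in Figure 3, assuming a nominal 30x source depth for every sample. In practice each sample's measured depth would be used, and the exact downsampling commands are not given in this section. A read sampler that accepts a fraction, such as seqtk (listed in the materials table), could consume these values.

```python
def downsample_fraction(target_depth, source_depth):
    """Fraction of reads to keep so that source_depth becomes target_depth."""
    if not 0 < target_depth <= source_depth:
        raise ValueError("target depth must be positive and not exceed source depth")
    return target_depth / source_depth


SOURCE_DEPTH = 30.0                    # nominal depth of the truth WGS data (assumed exact)
TARGET_DEPTHS = [0.05, 0.1, 0.5, 1.0]  # depths evaluated in Figure 3

for depth in TARGET_DEPTHS:
    fraction = downsample_fraction(depth, SOURCE_DEPTH)
    print(f"{depth}x: keep {fraction:.5f} of reads")
```

Using a fixed random seed per sample (an assumption about good practice, not a documented detail of the study) would keep paired-end files in sync and make the simulated datasets reproducible.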
Sequencing depth and sample size exerted predictable effects on imputation accuracy. Accuracy declined sharply below ~0.1x depth, reflecting insufficient information to reliably recover haplotypes from sparse read data. Larger sample sizes mitigated this limitation by enabling improved inference of population haplotype structure, consistent with theoretical expectations that more samples improve imputation performance29. The results indicate that depths of at least 0.5x maintain acceptable accuracy for common variants, even in modest sample sizes, whereas depths below 0.1x require substantially larger cohorts to achieve comparable performance. These findings have practical implications for study design: sample size can partly compensate for sequencing depth, but only within certain limits.
Reference panel selection emerged as a decisive factor influencing imputation performance. The 1KGP-EAS panel, widely used in global studies, provided robust coverage of common variants. By contrast, the CKB panel, derived from large-scale local sequencing, improved accuracy for population-specific variants. This underscores that population-specific reference panels, which capture a larger repertoire of haplotypes, substantially enhance imputation quality. This observation aligns with previous findings that leveraging local haplotype diversity increases the chance that any given allele is tagged by a representative haplotype7,15. Future efforts to expand and diversify Chinese reference panels will be critical for further improving imputation accuracy.
Comprehensive evaluation demonstrates distinct performance profiles for STITCH, QUILT2, and GLIMPSE2 under varying sequencing depths, sample sizes, and reference panel configurations. GLIMPSE2 achieved the highest imputation accuracy (R² = 0.974) when using the CKB reference panel at 1x coverage, closely followed by QUILT2 (R² = 0.971). Both tools exhibited greater robustness to changes in sample size and sequencing depth compared with STITCH. Although STITCH yielded lower overall accuracy, its minimal computational requirements and ability to function without an external reference panel make it particularly advantageous for large-scale cohorts (N ≥ 10,000) or populations where no well-matched reference panel is available.
Overall, these findings suggest that tool selection should be tailored to study design, dataset characteristics, reference panel availability, and computational constraints. A comparative summary of recommended applications is provided in Table 2. Notably, QUILT2 offers unique functionality for maternal-fetal genome separation, making it particularly applicable to ultra-low-depth NIPT datasets. GLIMPSE2 balances accuracy and computational efficiency, providing high-quality results with manageable resource usage.
Several technical issues were identified that can influence imputation outcomes. For GLIMPSE2, the BAM list file must be correctly formatted with two columns (BAM path and sample name); omission of the second column may lead to inconsistent sample identifiers in the output VCF11. For STITCH, the parameter K (number of ancestral haplotypes) requires careful adjustment: larger values improve accuracy at higher depth and sample size, but can reduce performance under ultra-low-depth conditions9. In addition, STITCH occasionally fails during multi-threaded runs due to known instability in its parallelization library; switching between single- and multi-thread execution generally resolves this issue30 (GitHub issue #86: https://github.com/rwdavies/STITCH/issues/86). Addressing these points was essential to ensure reproducibility and stable imputation results.
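To illustrate the two-column BAM list described above, the sketch below builds one line per BAM file with the path followed by a sample name. Deriving the sample name from the file name, the single-space separator, and the example paths are all assumptions for illustration. The GLIMPSE2 documentation should be checked for the exact format it expects.

```python
from pathlib import Path, PurePosixPath


def format_bam_list(bam_paths):
    """Return 'bam_path sample_name' lines, one per BAM file.

    Assumption: the sample name is the file name up to the first dot,
    e.g. /data/S001.sorted.bam -> S001.
    """
    lines = []
    seen = set()
    for bam in bam_paths:
        sample = PurePosixPath(bam).name.split(".")[0]
        if not sample:
            raise ValueError(f"cannot derive a sample name from {bam!r}")
        if sample in seen:
            raise ValueError(f"duplicate sample name: {sample}")
        seen.add(sample)
        lines.append(f"{bam} {sample}")
    return lines


def write_bam_list(bam_paths, out_path):
    """Write the two-column list to out_path and return the lines written."""
    lines = format_bam_list(bam_paths)
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical input paths.
example_lines = format_bam_list(["/data/S001.sorted.bam", "/data/S002.sorted.bam"])
print("\n".join(example_lines))
```

Rejecting duplicate sample names up front guards against the inconsistent identifiers in the output VCF that the text warns about.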
In this study, a representative genomic interval on chromosome 1 was selected to evaluate overall imputation performance, with a focus on common variants. The aim was to benchmark tool behavior under typical genomic contexts rather than highly complex regions such as the HLA locus, which involves structural variation and may not be equally well addressed by current imputation algorithms. This design provides a reliable overview of population-level imputation accuracy; however, it does not capture the challenges associated with rare variants or structurally complex loci, which remain important limitations. Future studies may focus on challenging genomic regions such as the HLA locus to further test tool performance. Moreover, while the CKB panel substantially improved accuracy for Chinese populations, continued expansion of population-specific reference panels will be essential to improve the resolution of low-frequency variants. Finally, emerging approaches such as deep learning-based imputation methods31 hold promise for further enhancing accuracy and robustness under ultra-low-depth conditions.
The authors declare no competing interests.
This study was supported by Shenzhen Medical Research Fund (B2404004), National Key Research and Development Program of China (2023YFC2605400, 2022YFC2502402), Shenzhen Science and Technology Program (SYSPG20241211173852024), Open Research Project in State Key Laboratory of Vascular Homeostasis and Remodeling (Peking University) (2025-SKLVHR-013), and Key-Area Research and Development Program of Guangdong Province (2023B0303040001).
| Name | Source | Description |
| Data | | |
| 10,000 NIPT low-depth samples | This paper | Ultra-low-depth whole genome sequencing data used for imputation benchmark. |
| 500 high-depth WGS samples | This paper | 30× high-depth WGS used as gold standard/truth set. |
| Reference Panel | | |
| 1KGP-EAS reference panel | 1000 Genomes Project (East Asia) | Subset of 1KGP for East Asian ancestry-specific imputation. |
| CKB reference panel | China Kadoorie Biobank | Custom population-specific panel for genotype imputation. |
| Software and algorithms | | |
| BCFtools v1.11 | GitHub (samtools/bcftools) | Used for merging and sorting chromosome-level results, and filtering variants. |
| BQSR of the GATK 4.0.4.0 toolset | Broad Institute | Used for base quality score recalibration (BQSR). |
| BWA-MEM 0.7.16a-r1181 | Heng Li / GitHub | For aligning raw reads to GRCh38. |
| DPGT (Distributed Population Genetics Tool) | BGI | A distributed population genetics analysis tool which enabled joint calling on millions of WGS samples. Available at [GitHub - BGI-flexlab/DPGT](https://github.com/BGI-flexlab/DPGT) |
| fastp 0.23.4 | Open-source (Chen et al., 2018) | For quality control and adapter trimming. |
| GLIMPSE2 | University of Oxford | Fast genotype phasing and imputation for low-coverage WGS |
| Original Code for analyses | This paper | Supplementary File 1 |
| Picard toolkit | Broad Institute | Used for marking duplicates and file format conversion. |
| Plink 2.0 | C. Chang, S. Purcell / Broad Institute | For genotype format conversion and association analysis. |
| Python 3.8 | Python Software Foundation | Used for scripting, automation, and data analysis. |
| QUILT2 | Oxford Big Data Institute | HMM-based imputation using external reference panels |
| R 4.1.3 | The R Foundation | Used for running STITCH, QUILT2 and plotting/statistics. |
| SAMtools v1.3 | GitHub (samtools/samtools) | For manipulating SAM/BAM files. |
| Seqtk-1.5 | GitHub (lh3/seqtk) | Toolkit for processing sequences in FASTA/Q formats. Available at [GitHub - lh3/seqtk](https://github.com/lh3/seqtk) |
| SOAPnuke | BGI | For NGS data quality control and filtering. |
| STITCH v1.6.6 | University of Oxford | Imputation tool optimized for ultra-low coverage sequencing |
| tabix | GitHub (samtools/tabix) | Used for indexing and querying bgzipped VCF files. |
| Other Materials | | |
| GATK bundle files | GATK | Available at https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json |
| Genetic Map for 1000G (GRCh38) | Oxford / 1000 Genomes Project | Required for phasing/imputation tools |
| GRCh38 | Genome Reference Consortium | Used for read alignment and variant calling |