Research Article

Integrated Bioinformatics Analysis of Human Transcriptomic Data Identifies Three Key Diagnostic and Prognostic Biomarkers in Lung Adenocarcinoma

DOI: 10.3791/71214

June 30th, 2026

Summary


This study identified diagnostic and prognostic biomarkers for lung adenocarcinoma using TCGA-LUAD and GEO GSE115002 transcriptomic data. B3GNT3, FERMT1, and SPP1 were upregulated, distinguishing tumors from normal tissue. These genes are linked to epithelial-mesenchymal transition and immune suppression. A nomogram combining gene expression with TNM stage showed reliable predictive value.

Abstract


Lung cancer is the leading cause of cancer-related death worldwide, and lung adenocarcinoma (LUAD) is its most common histological subtype. Despite advances in surgery, targeted therapy, and immunotherapy, the 5-year survival rate of advanced LUAD remains below 20%, indicating an urgent need for reliable molecular biomarkers for early detection and prognosis. In this study, the authors hypothesized that three consistently upregulated genes could act as effective diagnostic and prognostic biomarkers for LUAD. The authors analyzed transcriptomic data from two independent cohorts, TCGA-LUAD (535 tumors, 59 normal samples) and GSE115002 (52 tumors, 52 matched normal samples), to screen for differentially expressed genes. Three core genes (B3GNT3, FERMT1, and SPP1) were consistently overexpressed in LUAD tumors in both datasets. These genes showed excellent diagnostic performance, with AUC values above 0.95 in TCGA-LUAD and high accuracy in GSE115002. Survival analysis showed that high expression of each gene was significantly associated with shorter overall and disease-free survival, and multivariate Cox regression verified their independent prognostic value. Functional enrichment analysis indicated that these three genes participate in epithelial-mesenchymal transition, extracellular matrix remodeling, and immune suppression, all of which are closely related to LUAD invasion and metastasis. The authors further constructed a prognostic nomogram combining the three genes and TNM stage, which achieved a concordance index of 0.743. These findings indicate that B3GNT3, FERMT1, and SPP1 are promising diagnostic and prognostic biomarkers for LUAD and support their potential clinical application in risk stratification and management.

Introduction


Lung cancer is the leading cause of cancer mortality worldwide; in 2020, it accounted for approximately 1.8 million deaths1. Lung adenocarcinoma (LUAD) represents nearly 40% of all lung cancer cases2. Despite advances in surgery, targeted therapy, and immunotherapy, the 5-year survival rate for advanced LUAD remains below 20%3,4. Reliable molecular biomarkers for early detection and precise prognostication are urgently needed. High-throughput sequencing and public databases such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) enable systematic transcriptomic profiling of cancers5,6. Integrating independent cohorts in a single bioinformatics workflow improves the reliability of candidate biomarker discovery5.

Many genes and pathways have been implicated in LUAD, including those involved in cell proliferation, EGFR signaling, and immune escape7. However, few have been translated into clinical use. Risk models that combine gene signatures with clinicopathologic features, especially nomograms, improve prognostic accuracy in LUAD8. Although B3GNT3, FERMT1, and SPP1 have each been linked to cancer progression, their combined diagnostic, prognostic, and immune-microenvironment regulatory value in LUAD has not been systematically validated across independent cohorts. This study provides the first integrated cross-platform analysis of these three genes as a unified biomarker panel for LUAD, together with a clinically applicable prognostic nomogram.

B3GNT3 encodes a glycosyltransferase that stabilizes PD-L1 and promotes immune evasion9,10. FERMT1 (kindlin-1) regulates integrin activation and drives metastasis in non-small cell lung cancer (NSCLC)11,12. SPP1 (osteopontin) mediates extracellular matrix remodeling, epithelial-mesenchymal transition (EMT), and chemoresistance13,14,15. B3GNT3 and SPP1 are secreted or membrane-localized, supporting their potential use as minimally invasive biomarkers. Other approaches have also proven informative in LUAD: circadian clock-related genes predict prognosis and diagnosis16, multi-omics integrative protein signaling networks have uncovered sex differences17, overlapping feature selection methods support classification and biomarker identification18, multi-omic interactions play important functional roles in lung cancer progression19, and mitochondrial gene signatures identified via comprehensive multi-omics integration hold value for prognosis and personalized therapy20. This study aimed to identify robust LUAD biomarkers using integrative bioinformatics, evaluate their diagnostic and prognostic performance, explore their biological functions and immune associations, and build a clinically useful prognostic nomogram.

Protocol


1. Data sources and preprocessing

  1. Process raw data in R (version 4.1.3; Windows 10 Pro).
  2. For GSE115002, apply quantile normalization using limma (version 3.52.3).
  3. Filter low-expression genes for TCGA: retain genes with CPM > 0.5 in ≥50% of samples.
  4. Filter low-expression genes for GSE115002: retain genes with average signal >50.
  5. Log2-transform expression values using a pseudocount of 1, i.e., log2(x + 1).
    NOTE: LUAD gene expression and clinical data were obtained from TCGA-LUAD (version 33.0, GDC Portal, downloaded August 7, 2025) and GSE115002 (Agilent microarray, GEO, downloaded August 7, 2025). TCGA-LUAD included 535 tumors and 59 normal samples. GSE115002 included 52 tumors and 52 matched normal samples.
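
The filtering and transformation steps above can be made concrete with a small sketch. The authors worked in R; the plain-Python functions below are illustrative stand-ins, and their names (`filter_low_expression`, `log2_plus_one`, `quantile_normalize`) are hypothetical. The quantile normalization shown is the textbook version without special tie handling, so it will not reproduce limma's output exactly.

```python
import math
from statistics import mean

def cpm(counts):
    """Convert one sample's raw counts to counts per million (CPM)."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def filter_low_expression(count_matrix, min_cpm=0.5, min_fraction=0.5):
    """Return indices of genes with CPM > min_cpm in >= min_fraction of samples.

    count_matrix is a list of samples; each sample is a list of raw counts per gene.
    """
    cpm_matrix = [cpm(sample) for sample in count_matrix]
    n_samples = len(cpm_matrix)
    kept = []
    for g in range(len(cpm_matrix[0])):
        n_pass = sum(1 for sample in cpm_matrix if sample[g] > min_cpm)
        if n_pass / n_samples >= min_fraction:
            kept.append(g)
    return kept

def log2_plus_one(values):
    """Apply log2(x + 1) so that zero counts map to zero."""
    return [math.log2(v + 1) for v in values]

def quantile_normalize(matrix):
    """Force every sample (row) onto the same distribution (simple version, no tie handling)."""
    n_genes = len(matrix[0])
    sorted_rows = [sorted(row) for row in matrix]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [mean(row[k] for row in sorted_rows) for k in range(n_genes)]
    normalized = []
    for row in matrix:
        order = sorted(range(n_genes), key=lambda i: row[i])
        out = [0.0] * n_genes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Toy data: three samples, three genes, each sample summing to one million reads.
toy_counts = [[10, 0, 999990], [20, 0, 999980], [0, 1, 999999]]
print(filter_low_expression(toy_counts))  # gene 1 passes in only one of three samples
print(log2_plus_one([0, 1, 3]))
```

The pseudocount matters mainly for genes with zero counts, which would otherwise be undefined on the log scale.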

2. Identification of differentially expressed genes

  1. Use DESeq2 (version 1.36.0) for TCGA RNA-seq and limma (version 3.52.3) for GSE115002 for differential expression analysis. Calculate adjusted P‑values (FDR) using the Benjamini–Hochberg method.
  2. To ensure cross-dataset comparability, apply a unified threshold to both cohorts: define DEGs as genes with FDR < 0.05 and |log₂FC| ≥ 1.0. Identify overlapping DEGs using VennDiagram (version 1.7.3).
    NOTE: B3GNT3, FERMT1, and SPP1 were selected as consistently upregulated candidates with known cancer relevance.
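
As a rough illustration of the thresholding logic, the sketch below implements the Benjamini–Hochberg adjustment and the FDR and |log₂FC| cutoffs in plain Python. It is not the DESeq2 or limma workflow the authors used; the function names and the toy gene names (`GENE_X`, `GENE_Y`, `GENE_Z`) and values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def call_degs(genes, log2fc, p_values, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Split genes into up- and downregulated sets using FDR and |log2FC| cutoffs."""
    fdr = benjamini_hochberg(p_values)
    up, down = set(), set()
    for gene, lfc, q in zip(genes, log2fc, fdr):
        if q < fdr_cutoff and abs(lfc) >= lfc_cutoff:
            (up if lfc > 0 else down).add(gene)
    return up, down

# Toy inputs for two cohorts (values are made up for illustration).
genes = ["SPP1", "GENE_X", "GENE_Y", "GENE_Z"]
up_a, _ = call_degs(genes, [3.3, 0.4, -1.5, 1.2], [0.001, 0.002, 0.01, 0.5])
up_b, _ = call_degs(genes, [1.4, 1.1, -2.0, 0.2], [0.004, 0.30, 0.01, 0.6])
# Genes upregulated in both cohorts are the set intersection.
print(up_a & up_b)
```

Taking the intersection of the per-cohort sets is what a two-set Venn diagram reports in its overlap region.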

3. Evaluation of diagnostic value

  1. Construct ROC curves for each candidate gene.
  2. Determine optimal cutoff values using the Youden index.
  3. Calculate AUC, sensitivity, and specificity for each gene.
  4. Build a combined diagnostic panel using multivariate logistic regression.
    NOTE: The pROC package v1.18.0 was used for ROC analysis. The glm function with the binomial family was used to construct the diagnostic model.
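
To show what the ROC steps compute, here is a minimal plain-Python sketch of the AUC and the Youden-index cutoff. It is a didactic stand-in for pROC, not a reproduction of it; the function names and the toy scores are hypothetical, and a sample is assumed to be called tumor when its expression is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every observed score.

    labels: 1 = tumor, 0 = normal. A sample is called tumor when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / n_pos, tn / n_neg))
    return points

def auc(scores, labels):
    """AUC as the probability that a tumor outscores a normal sample (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_cutoff(scores, labels):
    """Return the ROC point maximizing J = sensitivity + specificity - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)

# Toy expression values for one gene: four tumors and three normals.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
labels = [0, 0, 1, 0, 1, 1, 1]
print(auc(scores, labels))            # 11 of 12 tumor-normal pairs are ordered correctly
print(youden_cutoff(scores, labels))  # (threshold, sensitivity, specificity)
```

For the combined panel in step 4, the same functions would be applied to the fitted probabilities from the logistic model rather than to a single gene's expression.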

4. Survival analysis

  1. Stratify patients into high- and low-expression groups using median expression.
  2. Generate Kaplan–Meier survival curves for each gene.
  3. Perform log‑rank tests to compare survival differences.
  4. Conduct univariate Cox regression analysis.
  5. Conduct multivariate Cox regression analysis.
  6. Include clinical covariates in regression models.
  7. Verify the proportional hazards assumption using Schoenfeld residuals.
  8. Calculate a three‑gene risk score.
    NOTE: Survival v3.3.1 and survminer v0.4.9 were used. Covariates included age, sex, T stage, N stage, and M stage. The risk score was calculated as:
    Risk score = (0.328 × B3GNT3) + (0.331 × FERMT1) + (0.321 × SPP1). (1)
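
Equation (1) and the median split can be sketched in a few lines, together with a bare-bones Kaplan–Meier estimator. The coefficients come from Equation (1); everything else (function names, toy patients, and the assumption that expression is supplied on the same scale the coefficients were fitted on) is hypothetical. The authors used the survival and survminer R packages, which this sketch does not replace.

```python
from statistics import median

# Coefficients taken from Equation (1).
COEFFICIENTS = {"B3GNT3": 0.328, "FERMT1": 0.331, "SPP1": 0.321}

def risk_score(expression):
    """Weighted sum of the three genes' expression values (Equation 1)."""
    return sum(COEFFICIENTS[g] * expression[g] for g in COEFFICIENTS)

def split_by_median(values):
    """Label each value 'high' if it is above the median, otherwise 'low'."""
    cutoff = median(values)
    return ["high" if v > cutoff else "low" for v in values]

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events: 1 = death observed, 0 = censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy patient (made-up expression values).
print(risk_score({"B3GNT3": 2.0, "FERMT1": 1.0, "SPP1": 3.0}))
print(split_by_median([1.0, 4.0, 2.0, 3.0]))
# Toy follow-up times in months; the second and last patients are censored.
print(kaplan_meier([2, 3, 3, 5, 8], [1, 0, 1, 1, 0]))
```

Censored patients never trigger a drop in the curve, but they do count in the at-risk denominator until their last follow-up.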

5. Gene set enrichment and functional annotation

  1. Perform GO enrichment analysis using DEGs.
  2. Perform KEGG pathway enrichment analysis using DEGs.
  3. Rank genes by Pearson correlation with candidate gene expression.
  4. Conduct gene set enrichment analysis (GSEA) on the ranked gene list.
  5. Identify significant terms using adjusted P < 0.05.
    NOTE: clusterProfiler v4.6.2 was used for GO and KEGG analyses. fgsea v1.22.0 and MSigDB Hallmark v7.5 were used for GSEA.
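
Two calculations underlie these steps: the correlation-based ranking that feeds GSEA, and the hypergeometric test behind over-representation analysis of GO and KEGG terms. The plain-Python sketch below illustrates both under simplifying assumptions; the function names and toy genes (`GENE_A`, `GENE_B`, `GENE_C`) are hypothetical, and it omits the multiple-testing correction and the GSEA running-sum statistic that clusterProfiler and fgsea provide.

```python
from math import comb, sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_correlation(target, expression):
    """Sort genes by Pearson r with the target gene, highest first."""
    scores = {gene: pearson(target, values) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def enrichment_p_value(n_universe, n_in_set, n_degs, n_overlap):
    """Hypergeometric upper tail: P(overlap >= n_overlap) under random draws."""
    upper = min(n_in_set, n_degs)
    hits = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return hits / comb(n_universe, n_degs)

# Toy expression of a candidate gene and three other genes across four samples.
candidate = [1, 2, 3, 4]
others = {"GENE_A": [2, 4, 6, 8], "GENE_B": [4, 3, 2, 1], "GENE_C": [1, 3, 2, 4]}
print([gene for gene, _ in rank_by_correlation(candidate, others)])
# 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
print(enrichment_p_value(10, 4, 3, 2))
```

The ranked list is the input GSEA expects; the over-representation test instead needs only the DEG list and a background gene universe.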

6. Correlation and network analysis

NOTE: Pearson correlation was used for normally distributed gene expression; Spearman correlation for immune cell fractions. PPI networks were generated using STRING (version 11.5, confidence > 0.7) and visualized in Cytoscape (version 3.9.1). Immune infiltration was estimated using CIBERSORT (absolute mode, 100 permutations). Single-cell RNA sequencing has been shown to reveal niche transitions in the NSCLC microenvironment, which is relevant to immune infiltration analysis21,22, and integrative single-cell analysis can further dissect the roles of immune cells, such as CD8+ memory cells, in LUAD23,24,25.
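
Spearman correlation is Pearson correlation computed on ranks, which is why it suits skewed quantities such as immune cell fractions. The sketch below shows this in plain Python; the function names and the toy values are hypothetical and do not come from the study's CIBERSORT output.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: gene expression versus an estimated immune cell fraction.
expression = [1.0, 2.0, 3.0, 4.0]
cell_fraction = [0.40, 0.30, 0.35, 0.10]
print(spearman(expression, cell_fraction))
```

Because only the ordering matters, a monotone but nonlinear relationship still gives a rho of 1 or -1.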

7. Nomogram construction and validation

NOTE: Variables for the nomogram were selected based on significance in multivariate Cox regression (P < 0.05): T stage, N stage, B3GNT3, FERMT1, and SPP1. The nomogram was built using rms (version 6.5.0). Internal validation used 1,000 bootstrap resamples drawn with replacement. Calibration curves were generated with rms, and decision curve analysis (DCA) was performed using rmda (version 1.7). The computational environment included R 4.1.3, Windows 10 Pro, and Bioconductor 3.15. Analysis scripts are available at https://github.com/[redacted]/LUAD‑biomarker-2025 upon reasonable request.
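
The concordance index used to assess the nomogram can be illustrated with a short plain-Python sketch of Harrell's C plus a simple bootstrap. This is a simplification: it reports a percentile interval over resamples, whereas rms performs optimism-corrected validation, which is a different procedure. The function names and toy data are hypothetical.

```python
import random

def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the higher-risk patient dies first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def bootstrap_c_index(times, events, risks, n_boot=1000, seed=0):
    """Approximate 95% percentile interval of the C-index over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample patients with replacement
        try:
            stats.append(concordance_index(
                [times[i] for i in idx],
                [events[i] for i in idx],
                [risks[i] for i in idx],
            ))
        except ZeroDivisionError:
            continue  # this resample had no comparable pairs; draw again
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Toy cohort: follow-up in months, event indicator, and model risk score.
times = [2, 4, 6, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1]
print(concordance_index(times, events, risks))
print(bootstrap_c_index(times, events, risks, n_boot=200))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.743 should be read.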

8. Statistical analysis

NOTE: All statistical tests were two-sided; P < 0.05 was considered significant.

Results


Global gene expression alterations in LUAD

Transcriptomic comparisons between lung adenocarcinoma tissues and normal lung tissues identified widespread gene expression changes. Figure 1A shows volcano plots of differentially expressed genes in the TCGA-LUAD dataset, and Figure 1B shows those in the GSE115002 dataset. In the TCGA-LUAD cohort (Figure 1A), 1865 genes were significantly upregulated, and 1247 genes were downregulated. In the GSE115002 cohort (Figure 1B), 645 genes were upregulated and 609 downregulated. A total of 421 genes were consistently upregulated in both datasets. Among these overlapping genes, B3GNT3, FERMT1, and SPP1 are labeled in Figure 1A and Figure 1B as markedly overexpressed in tumor samples. In TCGA-LUAD, B3GNT3 expression was increased approximately 5-fold, FERMT1 8-fold, and SPP1 10-fold compared with normal tissues. Similar upregulation was confirmed in GSE115002, with all three genes showing more than twofold elevation.

Diagnostic performance of B3GNT3, FERMT1, and SPP1

Receiver operating characteristic (ROC) curve analysis was used to evaluate the diagnostic performance of B3GNT3, FERMT1, and SPP1. Figure 2A presents ROC curves in the TCGA-LUAD cohort, and Figure 2B presents those in the GSE115002 cohort. All three genes achieved high diagnostic accuracy in both cohorts. In the TCGA-LUAD cohort (Figure 2A), the area under the curve (AUC) exceeded 0.95 for all markers. In the independent GSE115002 cohort (Figure 2B), similarly high AUC values were observed. Sensitivity and specificity ranged from 85% to 95% at the optimal cutoff values. These results indicate that each gene discriminates well between tumor and normal tissues.

Prognostic significance of B3GNT3, FERMT1, and SPP1 expression

Kaplan–Meier survival curves in Figure 3 show that high expression of each gene was significantly associated with shorter overall survival in both cohorts. Figures 3A–3C show overall survival curves for B3GNT3, FERMT1, and SPP1 in the TCGA-LUAD cohort. Figures 3D–3F show the corresponding curves in the GSE115002 cohort. Patients with high B3GNT3, FERMT1, or SPP1 expression showed reduced median survival and lower 5-year survival rates. Multivariate Cox regression analysis confirmed that high SPP1 expression remained an independent poor prognostic factor. Elevated expression of all three genes was also associated with shorter disease-free survival. Consistent trends across both cohorts indicate that overexpression of B3GNT3, FERMT1, and SPP1 predicts unfavorable clinical outcomes in lung adenocarcinoma.

Functional enrichment analysis

Functional enrichment analysis results are summarized in Figure 4. Figure 4A shows GO and KEGG enrichment in the TCGA-LUAD cohort, and Figure 4B shows enrichment in the GSE115002 cohort. Upregulated genes were strongly enriched in cell cycle progression, extracellular matrix organization, focal adhesion, and oncogenic signaling. Downregulated genes were associated with normal epithelial differentiation and p53 signaling. These observations indicate that the three candidate genes participate in pathways that promote proliferation, invasion, and immune dysregulation in lung adenocarcinoma.

Co-expression networks

Co-expression networks associated with B3GNT3, FERMT1, and SPP1 are displayed in Figure 5. Figure 5A shows the network in the TCGA-LUAD cohort, and Figure 5B shows the network in the GSE115002 cohort. Nodes represent genes and edges represent correlation coefficients. The three key genes cluster with ECM remodeling, immune regulation, and cytoskeletal organization genes. These findings suggest that overexpression of the three genes is associated with an immunosuppressive tumor microenvironment.

Performance of the prognostic nomogram

The prognostic nomogram is presented in Figure 6. The model was constructed by integrating pathologic T stage, pathologic N stage, and expression levels of B3GNT3, FERMT1, and SPP1 for predicting 1‑, 2‑, and 3‑year overall survival in LUAD. Points are assigned for each variable, and the total points correspond to the predicted survival probability. The model achieved a concordance index of 0.743, indicating good predictive performance. Calibration curves showed close agreement between predicted and actual survival probabilities. Decision curve analysis confirmed clinical net benefit. This nomogram improves individualized survival prediction beyond conventional TNM staging.
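
Decision curve analysis rests on the net benefit at a chosen risk threshold: true positives per patient minus false positives per patient weighted by the threshold odds. The plain-Python sketch below illustrates the formula for a binary outcome; it ignores censoring, which a survival-time analysis would need to handle, and the function names and toy values are hypothetical.

```python
def net_benefit(predicted_risk, outcomes, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold.

    outcomes: 1 = event occurred, 0 = no event.
    """
    n = len(outcomes)
    tp = sum(1 for p, y in zip(predicted_risk, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predicted_risk, outcomes) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)

def treat_all_net_benefit(outcomes, threshold):
    """Reference strategy: intervene on every patient regardless of the model."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy predictions and outcomes for six patients.
predicted = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1]
observed = [1, 1, 0, 0, 0, 1]
print(net_benefit(predicted, observed, 0.5))
print(treat_all_net_benefit(observed, 0.5))
```

A model shows clinical net benefit at a threshold when its value exceeds both the treat-all reference and zero, the value of treating no one.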

In summary, this study highlights B3GNT3, FERMT1, and SPP1 as core molecular players in LUAD pathogenesis. Their overexpression correlates with invasive tumor phenotypes, stromal remodeling, and immune evasion. By integrating two independent transcriptomic cohorts, the authors demonstrate the combined value of these genes for diagnosis, prognosis, and patient stratification. Future research should explore their predictive relevance for immunotherapy response and assess their potential as therapeutic targets in LUAD.

Figure 1: Volcano plots of differentially expressed genes in LUAD tumor versus normal tissues. (A) TCGA-LUAD dataset. (B) GSE115002 dataset. Red indicates significantly upregulated genes; blue indicates downregulated genes. B3GNT3, FERMT1, and SPP1 are labeled as consistently upregulated.

Figure 2: ROC curves for B3GNT3, FERMT1, and SPP1 in distinguishing LUAD from normal tissues. (A) TCGA-LUAD cohort. (B) GSE115002 cohort. AUC values demonstrate high diagnostic accuracy.

Figure 3: Kaplan-Meier overall survival curves stratified by B3GNT3, FERMT1, and SPP1 expression levels. (A–C) TCGA-LUAD cohort. (A) B3GNT3, (B) FERMT1, (C) SPP1. (D–F) GSE115002 cohort. (D) B3GNT3, (E) FERMT1, (F) SPP1. High expression of each gene is significantly associated with shorter overall survival in both cohorts. HR and P values from log-rank tests are provided.

Figure 4: GO and KEGG enrichment analysis of the DEGs correlated with the three key genes. (A) TCGA-LUAD cohort. (B) GSE115002 cohort. The enrichment terms include biological process (BP), cellular component (CC), molecular function (MF), and KEGG pathways. Upregulated genes are enriched in proliferation, ECM remodeling, and oncogenic signaling.

Figure 5: Gene co-expression networks associated with B3GNT3, FERMT1, and SPP1. (A) TCGA-LUAD cohort. (B) GSE115002 cohort. Nodes represent genes, and edges represent correlation coefficients. The three key genes cluster with ECM remodeling, immune regulation, and cytoskeletal organization genes.

Figure 6: Prognostic nomogram integrating pathologic T stage, pathologic N stage, B3GNT3, FERMT1, and SPP1 expression for predicting 1-, 2-, and 3-year overall survival in LUAD. Points are assigned for each variable, and the total points correspond to the predicted survival probability.

Discussion


The nomogram was constructed using multivariate Cox regression analysis based on the TCGA-LUAD cohort. Predictors include pathologic T stage, pathologic N stage, and gene expression status of B3GNT3, FERMT1, and SPP1 (categorized as High vs. Low based on median expression). For each patient, the individual scores for each variable are summed to generate a “Total Points” value, which corresponds to estimated 1-, 2-, and 3-year overall survival probabilities. Higher total scores indicate increased mortality risk. This tool provides individualized survival prediction and aids in LUAD risk stratification. Machine learning has been used to reveal diverse cell death patterns in LUAD prognosis and therapy21,22,23, which could further optimize our nomogram model.
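
The point-summing logic described above can be sketched as follows. The article does not report the Cox coefficients or baseline survival, so every number below is a HYPOTHETICAL placeholder chosen only to show the arithmetic, as is the ordinal 0 to 3 coding of stage. The scaling convention (the predictor with the widest effect range spans 0 to 100 points) is the usual nomogram convention, and the sketch assumes all coefficients are positive.

```python
import math

# HYPOTHETICAL log hazard ratios for illustration only; not the study's estimates.
EXAMPLE_BETAS = {"T_stage": 0.30, "N_stage": 0.40, "B3GNT3_high": 0.25,
                 "FERMT1_high": 0.20, "SPP1_high": 0.35}
# HYPOTHETICAL coding: stage as 0..3, gene status as 0 = Low, 1 = High.
VALUE_RANGES = {"T_stage": (0, 3), "N_stage": (0, 3), "B3GNT3_high": (0, 1),
                "FERMT1_high": (0, 1), "SPP1_high": (0, 1)}

def points_per_unit(betas, ranges):
    """Scale coefficients so the predictor with the widest effect spans 0-100 points."""
    widest = max(abs(betas[v]) * (hi - lo) for v, (lo, hi) in ranges.items())
    return {v: 100 * betas[v] / widest for v in betas}

def total_points(patient, betas, ranges):
    """Sum each variable's points, measured from that variable's lowest value."""
    scale = points_per_unit(betas, ranges)
    return sum(scale[v] * (patient[v] - ranges[v][0]) for v in betas)

def survival_probability(baseline_survival, linear_predictor):
    """Cox model relation: S(t | x) = S0(t) ** exp(linear predictor)."""
    return baseline_survival ** math.exp(linear_predictor)

# Made-up patient: T2, N1, high B3GNT3, low FERMT1, high SPP1.
patient = {"T_stage": 2, "N_stage": 1, "B3GNT3_high": 1, "FERMT1_high": 0, "SPP1_high": 1}
print(round(total_points(patient, EXAMPLE_BETAS, VALUE_RANGES), 1))
```

Total points are a rescaled linear predictor, so reading survival off the bottom axes of a nomogram is equivalent to applying the Cox relation in `survival_probability` with the fitted baseline survival at 1, 2, or 3 years.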

This study identified B3GNT3, FERMT1, and SPP1 as robust diagnostic and prognostic biomarkers in LUAD using integrated bioinformatics analysis. All three genes are consistently overexpressed in tumors, distinguish tumors from normal tissue with high accuracy, predict poor survival, and regulate pathways related to EMT, matrix remodeling, and immune evasion. A combined nomogram improves risk stratification beyond TNM staging. B3GNT3 promotes immune evasion by stabilizing PD‑L1 via glycosylation9,10. FERMT1 enhances integrin signaling and cell motility, driving invasion and metastasis11,12. SPP1 functions as a secreted driver of EMT, angiogenesis, and M2 macrophage polarization13,14,15. Together, they define an aggressive LUAD subtype characterized by invasiveness and immunosuppression.

Prior studies reported individual roles of B3GNT3, FERMT1, or SPP1 in LUAD. This study is the first to validate all three as a unified panel across independent transcriptomic cohorts, with diagnostic and prognostic performance confirmed in both TCGA and GEO data. Recent studies using integrated bioinformatics and machine learning have identified multiple gene signatures for LUAD diagnosis and prognosis21,22, and related work supports the value of immune-associated gene signatures for prognosis and immunotherapy guidance. Like the three-gene panel described here, these approaches highlight the value of transcriptome-based biomarkers in clinical stratification.

Extracellular matrix remodeling is a core feature of aggressive LUAD, and ECM‑related signatures have been validated as independent prognostic factors. Our findings that FERMT1 and SPP1 are closely associated with focal adhesion and ECM–receptor interaction further support the critical role of matrix remodeling in LUAD progression. Similar to CHAF1B and ubiquitin‑related gene signatures reported in interdisciplinary medicine (IMed)21,23, our three genes are closely associated with immune infiltration and may serve as both prognostic and predictive markers. Alternative strategies to identify LUAD biomarkers include single‑cell RNA sequencing, spatial transcriptomics, machine learning‑based feature selection, and plasma proteomic profiling24,25,26. Machine learning algorithms such as random forest or LASSO could further refine biomarker selection. Wet‑lab validation using qPCR, IHC, and ELISA is essential to confirm clinical translation27,28,29,30.

This study is limited by its retrospective design, reliance on public transcriptomic data, and lack of external clinical validation. Nomogram validation was limited to internal bootstrap resampling. Bulk transcriptomic data cannot resolve cellular‑level expression. Mechanistic causality requires functional experiments. Correlations with immune infiltration are based on computational deconvolution and should be interpreted cautiously. Future studies should validate these biomarkers in prospective cohorts using IHC, qPCR, and serum ELISA. Single‑cell and spatial transcriptomics will clarify cellular sources and spatial distribution. The predictive value for immunotherapy and targeted therapy response should be evaluated. Therapeutic targeting of B3GNT3, FERMT1, and SPP1 may offer new strategies for LUAD treatment.

Disclosures


The authors declare no competing interests.

Acknowledgements


This work was supported by the 2024 Fujian University of Traditional Chinese Medicine University-level Project (Grant No. XB2024012), led by Yuhui Lin from the Affiliated People’s Hospital of Fujian University of Traditional Chinese Medicine, and by the Joint Funds for the Innovation of Science and Technology, Fujian Province (Grant No. 2025Y9530), led by Xiaoting Chen from Jinjiang Municipal Hospital (Shanghai Sixth People's Hospital, Fujian).

Materials

List of materials used in this article
| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| Publicly Available Datasets | TCGA-LUAD Dataset | The Cancer Genome Atlas (TCGA) Portal (https://portal.gdc.cancer.gov/); 535 LUAD tumor samples, 59 adjacent normal lung tissue samples (RNA-sequencing count/FPKM values + clinical data: survival, TNM staging) | Transcriptomic and clinical data for differential expression, survival, and nomogram analysis; primary study cohort |
| | GSE115002 Dataset | Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115002); Agilent microarray, 52 LUAD tumor tissues, 52 matched adjacent normal lung tissues (treatment-naïve primary tumors) | Independent validation cohort for differential expression, diagnostic performance, and immune infiltration analysis |
| Bioinformatics Software & Programming Environment | R Programming Language | Version 4.1 | Core platform for all transcriptomic, statistical, and graphical analyses |
| R Packages (Differential Expression) | DESeq2, limma | | DESeq2: TCGA RNA-seq raw count differential expression analysis; limma: GSE115002 microarray normalization and differential expression analysis (Benjamini–Hochberg FDR correction) |
| R Packages (Diagnostic Analysis) | pROC | | Construction of ROC curves, calculation of AUC (95% CI), optimal cutoff determination (Youden’s index) for diagnostic performance assessment |
| R Packages (Survival Analysis) | survival, survminer | | Kaplan–Meier survival curve generation, log-rank test, univariate/multivariate Cox proportional hazards regression (HR + 95% CI); patient stratification by median gene expression |
| R Packages (Functional Enrichment) | clusterProfiler, fgsea | | clusterProfiler: GO (BP/CC/MF) and KEGG pathway enrichment analysis (adjusted P < 0.05); fgsea: GSEA for MSigDB Hallmark/KEGG gene sets (FDR < 0.25) |
| R Packages (Nomogram Construction & Validation) | rms | | Development of prognostic nomogram (integration of gene expression + TNM stage); Harrell’s C-index calculation, bootstrap resampling (1000 repetitions) for bias correction, calibration plot generation |
| R Packages (Statistical & Visualization) | ggplot2, ComplexHeatmap, corrplot | | Generation of volcano plots, bubble plots (enrichment), heatmaps (immune infiltration correlation), scatter plots (gene co-expression); Pearson/Spearman correlation analysis |
| Bioinformatics Databases & Tools (Network/Immune Analysis) | STRING Database | Confidence score > 0.7 | Construction of protein–protein interaction (PPI) networks for B3GNT3/FERMT1/SPP1 and first-degree interactors |
| | Cytoscape | - | Visualization of PPI and gene co-expression networks (edge weighting by correlation strength, hub gene identification) |
| Immune Deconvolution Algorithm | CIBERSORT | | Estimation of immune cell infiltration abundance (M2 macrophages, CD8+ T cells, neutrophils, NK cells, etc.) in LUAD samples; correlation with candidate gene expression |
| Other Tools | Microsoft Office/LaTeX | - | Manuscript preparation, figure assembly, and table formatting; statistical result compilation |

Tags

Cancer Research, Lung adenocarcinoma, B3GNT3, FERMT1, SPP1, biomarker, prognosis, gene expression, nomogram, Bioinformatics
