Deep learning-based computational approach for predicting ncRNAs-disease associations in metaplastic breast cancer diagnosis

Ahmad, Saleem; Zafar, Imran; Shafiq, Shaista; Sehar, Laila; Khalil, Hafsa; Matloob, Nida; Hina, Mehvish; Muntaha, Sidra Tul; Khan, Hamid; Khan, Najeeb Ullah; Rana, Samreen; Unar, Ahsanullah; Azmat, Muhammad; Shafiq, Muhammad; Jardan, Yousef A. Bin; Dauelbait, Musaab; Bourhia, Mohammed

doi:10.1186/s12885-025-14113-z

Research
Open access
Published: 06 May 2025

Saleem Ahmad¹,
Imran Zafar²,
Shaista Shafiq²,
Laila Sehar³,
Hafsa Khalil³,
Nida Matloob⁴,
Mehvish Hina⁸,
Sidra Tul Muntaha⁶,
Hamid Khan⁷,
Najeeb Ullah Khan⁶,
Samreen Rana⁵,
Ahsanullah Unar⁹,
Muhammad Azmat¹⁰,
Muhammad Shafiq¹¹,
Yousef A. Bin Jardan¹²,
Musaab Dauelbait¹³ &
Mohammed Bourhia¹⁴

BMC Cancer volume 25, Article number: 830 (2025)

Abstract

Non-coding RNAs (ncRNAs) play a crucial role in breast cancer progression, necessitating advanced computational approaches for precise disease classification. This study introduces a Deep Reinforcement Learning (DRL)-based framework for predicting ncRNA–disease associations in metaplastic breast cancer (MBC) using a multi-dimensional descriptor system (ncRNADS) integrating 550 sequence-based features and 1,150 target gene descriptors (miRDB score ≥ 90). The model achieved 96.20% accuracy, 96.48% precision, 96.10% recall, and a 96.29% F1-score, outperforming traditional classifiers such as support vector machines (SVM) and neural networks. Feature selection and optimization reduced dimensionality by 42.5% (4,430 to 2,545 features) while maintaining high accuracy, demonstrating computational efficiency. External validation confirmed model specificity to breast cancer subtypes (87–96.5% accuracy) and minimal cross-reactivity with unrelated diseases like Alzheimer’s (8–9% accuracy), ensuring robustness. SHAP analysis identified key sequence motifs (e.g., "UUG") and structural free energy (ΔG = − 12.3 kcal/mol) as critical predictors, validated by PCA (82% variance) and t-SNE clustering. Survival analysis using TCGA data revealed prognostic significance for MALAT1, HOTAIR, and NEAT1 (associated with poor survival, HR = 1.76–2.71) and GAS5 (protective effect, HR = 0.60). The DRL model demonstrated rapid training (0.08 s/epoch) and cloud deployment compatibility, underscoring its scalability for large-scale applications. These findings establish ncRNA-driven classification as a cornerstone for precision oncology, enabling patient stratification, survival prediction, and therapeutic target identification in MBC.

Introduction

Breast cancer (BC) remains a significant global public health problem and a major cause of cancer deaths in women [1, 2]. It is among the most intensively studied topics in contemporary medical research, with much of that effort directed at potential therapeutics. Metaplastic breast cancer (MBC) is a rare but aggressive subtype with distinctive histopathologic features and a poor prognosis [3]. Despite advances in molecular profiling, the origin and treatment targets of MBC remain unclear, highlighting the urgent need for precise diagnostic tools to improve prognosis and therapy [4]. Deep learning (DL) enhances cancer prediction, enabling early detection, diagnostics, and targeted treatment strategies [5]. By combining data from clinical registries, genomics, and molecular biology, DL models achieve high sensitivity for early cancer detection and prognosis prediction [6, 7]. DL also supports personalized cancer treatment by predicting drug responses and accelerating drug discovery [6, 7], improving efficacy while reducing side effects [8]. Finally, DL helps in cancer subtyping, survival estimation, and radiomics, making possible noninvasive monitoring of how a tumour responds to treatment.

Earlier research has explored the recent emergence of non-coding RNAs (ncRNAs) as regulators of gene expression, adding fresh complexity to both physiological and disease processes [9]. MicroRNAs (miRNAs) and long non-coding RNAs (lncRNAs) are the two primary categories of ncRNAs emphasised in cancer research for their altered expression levels and functional importance [10, 11]. Their capacity to orchestrate gene expression networks and signalling pathways makes ncRNAs ideal candidates for discovering new biomarkers and therapeutic targets in cancers such as MBC [12]. Conventional experimental methods for identifying associations between diseases and ncRNAs are often time-consuming, labor-intensive, and limited by the availability of patient-derived material. There is therefore a growing need for advanced computational approaches capable of rapidly analyzing large-scale genomic data and clinical information to uncover the complex relationships between ncRNAs and MBC. Machine learning (ML) approaches, particularly deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures, have demonstrated remarkable potential in extracting meaningful patterns from high-dimensional biological data. Integrating these models with feature selection techniques and dimensionality reduction methods (e.g., principal component analysis and autoencoders) can enhance model interpretability and efficiency.

DL has recently emerged as an innovative paradigm in bioinformatics and computational biology, offering a powerful tool for extracting information from large, complex datasets. For instance, DL models have been successfully applied to predict disease-associated ncRNAs by leveraging high-throughput sequencing data, gene expression profiles, and protein interaction networks. Furthermore, attention mechanisms and graph neural networks (GNNs) have been explored to capture intricate ncRNA interactions within molecular pathways, aiding precision oncology and personalized medicine. Several types of non-coding RNAs have been implicated in MBC [13], including microRNAs, long non-coding RNAs, circular RNAs (circRNAs), and PIWI-interacting RNAs (piRNAs). In particular, specific miRNAs (miR-21, miR-155, miR-200, miR-203, and miR-205) and lncRNAs (HOTAIR, MALAT1, ANRIL) are deregulated in MBC and are involved in cellular pathways, proliferation, metastasis, and treatment resistance in BC [14, 15]. The functions of circRNAs and piRNAs in MBC are not yet fully understood and need further investigation [16]; both classes help sustain genome integrity, and their possible roles in cancer remain to be elucidated. Understanding the roles of ncRNAs in MBC is decisive for developing targeted therapies and improving patient outcomes.

Our study presents a novel DL-based computational approach to predict ncRNA-disease associations, specifically targeting MBC diagnosis. By exploiting the vast repositories of publicly available omics data, including gene expression profiling, ncRNA sequencing, and clinical metadata, our model (ncRNADS) aims to identify critical ncRNAs with potential diagnostic and prognostic significance in MBC. The goal of this research is to bridge the gap between ncRNA biology and clinical applications, ultimately enabling the development of personalized and precision medicine strategies for patients with MBC. By elucidating the regulatory roles of ncRNAs in the pathogenesis of MBC, we pave the way for more accurate and earlier diagnoses, better risk stratification, and the design of targeted therapeutic interventions.

Literature review

Long non-coding RNA (lncRNA) research has opened up a new way of looking at the development and progression of cancer. Many ncRNAs are closely related to cancer biology, affecting gene expression and the normal cell processes that control disease pathogenesis. In MBC, a rare malignancy characterized by diverse histologic subtypes, defining the role of ncRNAs provides essential insights into its complex molecular landscape and clinical behavior. The review steps we followed to examine the mechanisms of MBC and its therapeutic targets are outlined in Supplementary Fig. 1. Computational techniques and deep learning approaches provide robust means of dissecting the intricate relationships between ncRNAs and MBC. Earlier studies in computational biology show that biomarkers of disease status or prognosis hidden in gene expression data might revolutionise diagnosis and therapy strategies [17,18,19]. Numerous studies also show that ncRNAs such as microRNAs (miRNAs), lncRNAs, and circular RNAs (circRNAs) are active players in the progression of breast cancer [11, 15]. These small molecules usually regulate gene expression at the transcriptional level, and ncRNA-induced changes in the mRNA lifecycle affect mRNA maturation and the execution of genetic programs for growth or death (apoptosis). The role of ncRNAs in driving the aggressive phenotype of MBCs, which constitute less than 1% of all breast cancer cases and are characterized by distinct mesenchymal and epithelial differentiation patterns, has yet to be thoroughly explored [20].

Bioinformatics approaches are crucial for sequence analysis, motif discovery, and statistical modelling, and help establish functional relationships between ncRNA molecules [21, 22]. Contemporary bioinformatics is being rapidly transformed by deep learning, which allows very large genomic and transcriptomic datasets to be analyzed efficiently and accurately [23]. DL algorithms such as CNNs, RNNs, and attention mechanisms have a variety of applications in high-dimensional pattern recognition [23,24,25] and are especially suited to the study of ncRNA-disease associations. Recent breakthroughs in DL have widened its use in cancer research, from skin cancer [26, 27] and breast cancer [21] to accurate survival prediction and biomarker discovery through multi-omics data integration [14]. A DL-based model can fuse genomic, transcriptomic, and epigenetic data to reveal hidden relationships between molecular functions and clinical datasets [28, 29]. Despite these advances, applications designed explicitly for MBC remain largely unexplored, and more accurate, robust systems are needed to predict ncRNA-disease associations.

miRNAs have a significant influence in MBC, but the specific molecular mechanism is unclear. By binding to the 3' untranslated region (UTR) of target messenger RNAs (mRNAs), these small non-coding RNA molecules regulate gene expression post-transcriptionally, affecting many vital cellular processes [30]. In MBC, abnormally expressed miRNAs can promote tumor initiation by targeting tumor suppressor genes and growth-promoting kinases and by eliminating apoptotic pathways [31], leading to increasingly aggressive behavior as the tumor grows. Some miRNAs affect the epithelial-mesenchymal transition (EMT), a critical step in cancer development with important implications for tumor cell invasion and metastasis in MBC. DL models have been used to study these functional roles in MBC, with specific miRNA expression patterns serving to predict diagnostic and prognostic markers and to categorize the disease [32]. Moreover, therapeutically targeting dysregulated miRNAs provides a future direction for MBC treatment [19, 31]; clinical trials and individualized therapeutic interventions based on molecular profiles have the potential to improve a patient's chances of survival at various stages [31, 32].

One research gap that must be bridged is the lack of computational models tailored to the unique molecular and clinical traits of MBC; thus far, only limited research has been done from the computational perspective. Prior studies addressed other breast cancer subtypes, ductal carcinoma for instance, but their general findings may not capture the distinctive features and therapeutic challenges of MBC [7]. Tailoring DL approaches to MBC-specific data sets and clinical situations is essential for discovering new ncRNA biomarkers for aggressive types of cancer [10]. DL approaches that infer miRNA-disease interactions using probabilistic matrix factorisation have shown promise in integrating different kinds of genomic and transcriptomic data to study complex diseases with multiple genotypes or phenotypes. Combining data from different molecular levels, such as DNA sequence changes, RNA expression alterations, epigenetic modifications, and protein interactions, provides a clearer picture of the molecular mechanisms driving MBC development and treatment response. Computational algorithms that can handle and integrate such noisy, heterogeneous data are the ones capable of uncovering the hidden predictive roles of ncRNAs.

Successful translation of computational predictions into clinical practice requires validation against independent data sets so that the predictions are testable in real-world clinical scenarios [33]. Few studies have systematically validated deep learning predictions of ncRNA-disease relationships in MBC, which makes them hard to apply clinically. Ultimately, strong validation of computational findings across a family of databases can guide clinical practice [19]; such outcomes also help readers who know little about neural networks to understand them on the basis of factual evidence and to gauge how much computing power and data storage a particular research objective requires. By developing deep learning models specific to MBC, which incorporate histological subtype information coupled with clinical data, it is possible to improve model accuracy and achieve clinical relevance beyond chance [34]. Incorporating histopathology images into analyses together with genomic data, clinical data, and in-depth pathological diagnoses can provide a comprehensive understanding of the heterogeneity within MBC and, in turn, guide individualized treatment strategies. Computational methods such as the miRNA–disease association models MDMF and MLMB could combine the large-scale datasets produced by high-throughput technologies [33, 34]. To address several knowledge gaps, these methods may also be coupled with newer algorithms that go beyond traditional machine learning (such as multi-modal learning or graph neural networks) or with knowledge borrowed from studies of related diseases (transfer learning). Rigorous validation of DL predictions in MBC is essential, including cross-validation of training data within diverse patient cohorts and prospective validation in clinical trials. Communication between computational biologists, oncologists, and pathologists is necessary to bridge the divide between computational research and clinical practice. With such DL-based computational methods, the associations between ncRNAs and disease in metaplastic breast cancer are highly likely to be worked out.

Materials and methods

We developed and validated the ncRNA descriptor system (ncRNADS) for BC together with the classification model built on it, as detailed in Fig. 1. The detailed code is available in the GitHub repository (https://github.com/Imranzafer/snRNADS).

Data collection and sources

Data retrieval

To construct a comprehensive dataset of ncRNAs associated with MBC, we integrated multi-omics data from 12 publicly available databases, including miRBase (v22) for miRNAs, LNCipedia (v5.2) and NONCODE (v6.0) for lncRNAs, and CircBase and piRBase for circRNAs and piRNAs, respectively. Expression profiles and clinical annotations were sourced from The Cancer Genome Atlas (TCGA) and the Breast Cancer Gene Database (BCGD) [35, 36]. A custom Python script (Supplementary Script 1) automated cross-database queries to retrieve sequences (FASTA format), expression data (CSV/TSV), and structural annotations (JSON/XML). To benchmark our approach, we established minimal performance expectations and defined specific limits, detailed in Supplementary Table 1. To mitigate batch effects and inter-database inconsistencies, identifiers were harmonized using UniProt (https://www.uniprot.org/) accession numbers, and conflicting annotations were resolved by prioritizing entries validated by at least two independent sources. This approach aligns with protocols established by Liu, et al. [14] for genomic data integration.

Golden standard dataset creation

A balanced dataset comprising 100 ncRNAs (50 MBC-associated, 50 non-associated) was curated to address class imbalance challenges inherent in rare cancer subtypes (Supplementary Table 2). Positive samples, including miR-21, miR-155, HOTAIR, and MALAT1, were selected based on experimental validation from miRCancerDB and the literature (e.g., PMID: 31541258). Negative controls were randomly sampled from miRBase, excluding entries with known cancer links. Stratification followed ENCODE guidelines to minimize batch effects, ensuring a proportional representation of sequence lengths and RNA classes. Dataset integrity was verified through cross-referencing with the RNAcentral repository. The proposed ncRNADS covers not only experimentally tested ncRNA-breast cancer relationships but also predicted ncRNA targets, as detailed in Supplementary Table 3. Further, the available data on known associations, together with the proposed targets, were integrated to better elucidate the possible mechanisms by which ncRNAs may be involved in MBC. The results of the developed descriptor system, shown in Supplementary Table 4, indicate that the detailed analysis of ncRNAs linked to MBC was significantly improved (95% confidence).

Data preprocessing

Raw data underwent rigorous preprocessing to ensure consistency and quality. Missing expression values were imputed using k-nearest neighbors (k = 5), a validated method for RNA-seq datasets. Redundant sequences were removed using CD-HIT (similarity threshold = 0.9), and sequence lengths were standardized to 200 nucleotides via truncation or padding. To mitigate platform-specific biases, expression profiles were z-score normalized, and low-confidence annotations, such as non-experimental Gene Ontology terms, were discarded. We extracted 1,024 features per ncRNA, capturing sequence-based, structural, and statistical attributes. Sequence features included GC content, dinucleotide frequency, and k-mer distributions (k = 1–5). RNA secondary structures were predicted using RNAfold, generating metrics such as stem-loop counts and minimum free energy (MFE, ∆G). Additionally, Shannon entropy was computed over sliding windows (10-nt window, 5-nt step) to quantify sequence complexity. These features were concatenated and min–max scaled for subsequent analysis.
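The sequence-level steps described above (length standardization, GC content, k-mer frequencies, and sliding-window Shannon entropy) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, and the padding symbol `N` is an assumption, since the text does not say which character is used for padding.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGU"
PAD_CHAR = "N"  # assumed padding symbol; the source does not name one


def standardize_length(seq, length=200):
    """Truncate or right-pad a sequence to a fixed length."""
    return seq[:length].ljust(length, PAD_CHAR)


def gc_content(seq):
    """Fraction of G/C among unambiguous bases (padding is ignored)."""
    bases = [b for b in seq if b in ALPHABET]
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over the ACGU alphabet."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(w for w in windows if set(w) <= set(ALPHABET))
    total = sum(counts.values())
    return {"".join(p): (counts["".join(p)] / total if total else 0.0)
            for p in product(ALPHABET, repeat=k)}


def shannon_entropy(window):
    """Shannon entropy (in bits) of the base composition of one window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def sliding_entropy(seq, window=10, step=5):
    """Entropy over sliding windows (10-nt window, 5-nt step, as in the text)."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

In this sketch the entropy and k-mer features would be computed on the unpadded sequence, so that padding symbols do not distort the composition statistics; whether the original pipeline does the same is not stated.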

Feature engineering and descriptor system

The ncRNA Descriptor System (ncRNADS) converted raw sequences into numerical feature vectors through a structured pipeline. Binary descriptors marked the presence of conserved motifs (e.g., the seed region of miR-21), while interaction data from miRDB (score ≥ 90) were encoded as binary variables. Structural flexibility indices, derived via RNAplfold, quantified base-pairing probabilities over 80-nt windows. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied [37], and inverse class weighting was incorporated into the cross-entropy loss function. The ncRNADS framework integrated k-mer frequency extraction, structural predictions (RNAfold), and interaction mapping (miRDB [38], STRING-DB [39]). Outputs included a feature matrix and KEGG pathway maps (https://www.genome.jp/), facilitating functional annotation of predicted ncRNA-disease associations.
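To make the inverse class weighting concrete, the sketch below computes per-class weights and a weighted cross-entropy. The exact weighting formula is not given in the text; the "balanced" heuristic used here (n_samples / (n_classes × class_count)) and the normalization by total weight are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_class_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted negative log-likelihood, normalized by the total weight.

    probs[i][c] is the predicted probability of class c for sample i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)
```

With a 20/80 class split, the minority class receives a weight of 2.5 and the majority class 0.625, so errors on rare positives cost four times as much as errors on negatives.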

Model construction and compilation

A deep reinforcement learning (DRL) model was implemented as per the method by Arulkumaran, et al. [40] for classifying ncRNAs associated with breast cancer. The framework utilized training and test sequence-specific ncRNA descriptors, ensuring robust feature extraction. The DNN incorporated the following parameters: activation function (ReLU), optimizer (Adam), loss function (Cross-Entropy Loss), batch size (64), learning rate (0.001), and epochs (100). A custom reward function was designed to maximize classification accuracy. The model was developed using Python with TensorFlow, PyTorch, Keras, Scikit-learn, and OpenAI Gym. The computational environment included NVIDIA GPUs (Tesla V100, RTX series), Intel Core i7/i9, AMD Ryzen, and Intel Xeon E5-2698 v4 processors. Storage and memory specifications included a 2 TB NVMe SSD and 16–64 GB DDR4 RAM. Deployment options ranged from cloud platforms (AWS, Azure, Google Cloud) to on-premises solutions (NVIDIA DGX Station, local GPU workstations).

Model architecture

This study employs a DRL framework tailored for classifying breast cancer-related ncRNAs. As illustrated in Fig. 2, the model architecture incorporates several layers optimized for feature extraction, decision-making, and classification. The input layer receives a feature vector derived from preprocessed genomic data that encompasses sequence characteristics and secondary structural attributes; the vector is normalized to improve convergence. Subsequently, multiple hidden layers, each utilizing ReLU activation functions to introduce non-linearity and boost learning capacity [41], process the input. Specifically, the architecture comprises an input layer accepting an n-dimensional feature vector, followed by three hidden layers containing 128, 64, and 32 neurons, respectively, all employing ReLU activation. Finally, the output layer uses a softmax function to generate class probabilities, enabling the classification of the input into the most likely ncRNA category. To further refine the model, batch normalization is applied after each hidden layer to stabilize learning and accelerate convergence, while dropout regularization is integrated to minimize overfitting and enhance generalization. The architectural design reflects empirical experimentation and hyperparameter optimization, ensuring consistently strong performance.
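The layer layout described above (n-dimensional input, hidden layers of 128, 64, and 32 ReLU units, softmax output) can be sketched as a plain NumPy forward pass. This is an illustration only: the weights are random, He initialization is an assumption, batch normalization and dropout are deliberately omitted for brevity, and the names `init_params` and `forward` are hypothetical.

```python
import numpy as np

HIDDEN_SIZES = (128, 64, 32)  # hidden layer widths stated in the text


def init_params(n_features, n_classes, seed=0):
    """Random (weight, bias) pairs per layer; He initialization is assumed."""
    rng = np.random.default_rng(seed)
    sizes = (n_features, *HIDDEN_SIZES, n_classes)
    return [(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]


def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params):
    """ReLU hidden layers followed by a softmax output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU activation
    w, b = params[-1]
    return softmax(h @ w + b)  # class probabilities, one row per sample
```

Calling `forward(x, init_params(x.shape[1], 2))` on a batch of normalized feature vectors returns one probability row per ncRNA, each summing to 1.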

Training procedure

Within a DRL framework, the model learns to maximize the accumulated reward from accurate classifications. This training uses episodic learning, where an agent refines its decision-making policy through observation of rewards and penalties. The reward system dynamically adapts based on prediction confidence: correct classifications yield a + 1 reward, with high-confidence correct predictions receiving an additional bonus; incorrect classifications incur a − 1 penalty, amplified for low-confidence errors. This encourages accuracy and discourages uncertainty. Model parameters and the RL agent are initialized to begin. Samples are fed into the model, and outputs are compared to true labels to determine the reward based on accuracy. A policy gradient method, like Proximal Policy Optimization (PPO), then optimizes the policy based on the reward signal [42]. This iterative process continues until convergence, determined by reaching a maximum number of epochs or a target performance level.
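The confidence-dependent reward can be written down as a small function. The text gives only the base values (+1 and −1); the confidence thresholds and the sizes of the bonus and extra penalty below are hypothetical placeholders chosen for illustration.

```python
def classification_reward(probs, true_label,
                          high_conf=0.9, low_conf=0.6,
                          bonus=0.5, extra_penalty=0.5):
    """Reward for one prediction given its class probabilities.

    Thresholds and bonus/penalty magnitudes are assumptions; the source
    specifies only the +1 / -1 base rewards.
    """
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    if predicted == true_label:
        # Correct: +1, plus a bonus when the model is highly confident.
        return 1.0 + (bonus if confidence >= high_conf else 0.0)
    # Incorrect: -1, amplified when the error was made with low confidence.
    return -1.0 - (extra_penalty if confidence < low_conf else 0.0)
```

For example, `classification_reward([0.05, 0.95], 1)` gives 1.5 under these assumed settings, while a hesitant wrong prediction such as `classification_reward([0.55, 0.45], 1)` gives −1.5.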

Numerical sequence features analysis

We employed a rigorous ncRNADS-based approach to characterize non-coding RNA (ncRNA) sequences by quantifying their inherent properties. To achieve this, we extracted sequence-based numerical descriptors reflecting structural and compositional traits. Our methodology entailed comprehensive sequence profiling, encompassing base pair composition analysis, motif identification, and symmetry assessment. Motif detection utilized a binary scoring system, denoting presence or absence with 1 or 0, respectively. Furthermore, we analyzed sequence entropy and k-mer frequency distribution to gain statistical insights. Together, these extracted features formed an extensive numerical representation framework, facilitating robust computational analysis of ncRNAs.

Identification of target gene

The miRDB database [43] was then employed to identify target genes of non-coding RNAs (ncRNAs), particularly those with roles in MBC. We applied a strict strategy in which only target scores of 90 or higher were considered, to ensure strong associations between ncRNAs and their targets. Target genes meeting this threshold were identified automatically for each ncRNA sequence using a custom Python script (Supplementary Script 2). Each target gene was then transformed into a binary descriptor (1 if the ncRNA had a target score ≥ 90 for that gene, 0 otherwise). The method yielded multiple target gene descriptors covering the different types of ncRNA-gene interactions and their involvement in pathways characteristic of MBC. The script, embedded in ncRNADS, provides a starter dataset for additional computational investigation of ncRNA regulatory activities in disease settings.
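A minimal sketch of the binary encoding step follows. It is not the authors' Supplementary Script 2: the input format (a dictionary of per-gene scores for each ncRNA) and the function name are assumptions, and the example names and scores in the usage note are made up for illustration.

```python
def target_descriptors(scores_by_ncrna, threshold=90):
    """Encode ncRNA-target scores as binary descriptors.

    scores_by_ncrna maps an ncRNA name to {gene: target score}. A gene
    becomes a descriptor column if any ncRNA reaches the threshold for it.
    """
    genes = sorted({gene
                    for scores in scores_by_ncrna.values()
                    for gene, score in scores.items()
                    if score >= threshold})
    matrix = {name: [int(scores.get(gene, 0) >= threshold) for gene in genes]
              for name, scores in scores_by_ncrna.items()}
    return genes, matrix
```

With the hypothetical input `{"ncRNA_1": {"GENE_A": 95, "GENE_B": 72}, "ncRNA_2": {"GENE_B": 91, "GENE_C": 90}}`, the columns are `GENE_A`, `GENE_B`, `GENE_C`, and the rows are `[1, 0, 0]` and `[0, 1, 1]`.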

Validation of versatile descriptor system

The resulting ncRNADS was remarkably versatile and efficient. It took a list of ncRNAs as input and generated a table of sequence information and target gene descriptors for each ncRNA as output. This system can readily be used to study any disease with known ncRNA-disease associations, facilitating deeper investigations into ncRNAs' functional roles and mechanisms in various pathologies beyond metaplastic breast cancer. The proposed DRL model defines the environment through a state space \(S_t\), where each state consists of multiple extracted features. The agent selects actions \(A_t\) based on a policy \(\pi_\theta\), parameterized by a neural network, and receives a reward \(R_t\), with the aim of maximizing the cumulative discounted reward \(G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k}\), where \(\gamma\) is the discount factor. The model is optimized via policy gradients, adjusting parameters using \(\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\, Q^{\pi}(S_t, A_t)\right]\). A deep neural network approximates both the policy and the value function and is trained using backpropagation with an optimizer such as Adam to minimize the loss function. This iterative learning process enables the model to optimize decision-making in complex environments, enhancing predictive performance.
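The cumulative reward in the formulation above is an infinite sum, but in episodic training it is truncated at the end of the episode. The sketch below computes the discounted return for every step of a finite episode with a single backward pass; the function name and the default discount factor of 0.99 are assumptions, since the text does not report the value of γ.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_k gamma**k * R_{t+k} for each step of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        # Recurrence: G_t = R_t + gamma * G_{t+1}
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]
```

For instance, three consecutive rewards of 1 with γ = 0.5 give returns of 1.75, 1.5, and 1.0 for the first, second, and third steps.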

Model performance evaluation

The performance of ncRNADS was evaluated using Python (Scikit-learn, TensorFlow, Keras) and R (caret, glmnet) libraries. Key performance indicators, including accuracy, precision, recall, F1-score, and ROC-AUC, were assessed for external validation with independent datasets. The comparative analysis involved training multiple models, including SVM, Logistic Regression (LR), Random Forest, k-nearest Neighbors (k-NN), Naive Bayes, Gradient Boosting, Decision Tree, Neural Networks, XGBoost, and AdaBoost. The evaluation was conducted using k-fold cross-validation (k = 5, 10), with hyperparameter tuning via grid search or Bayesian optimization. Models were trained on the same Golden Standard datasets [44], and performance results were compared to determine the most effective approach.
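For reference, the threshold-based indicators listed above follow directly from the confusion-matrix counts. The sketch below is a dependency-free illustration for the binary case (ROC-AUC is omitted because it requires ranked scores rather than hard labels); the function name is hypothetical and this is not the evaluation code used in the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1}
```

In a k-fold setting, these values would typically be computed per fold and then averaged.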

Advancing analysis of ncRNA research

In ncRNADS for MBC, sequence and target gene information were integrated to study ncRNA functions. The descriptor system was developed using statistical and machine learning techniques for feature selection and dimensionality reduction. Analysis of Variance (ANOVA), chi-square tests, and mutual information scores were applied to assess feature relevance [45]. As per the methods by Kurita [46], Principal Component Analysis (PCA) reduced the feature space to three principal components (82% cumulative variance), with PC1 strongly correlating with TGF-β signaling (ρ = 0.71, p < 0.01). t-Distributed Stochastic Neighbor Embedding (t-SNE) was used for further dimensionality reduction [47]. SHAP analysis identified the 3-mer ‘UUG’ (impact = 0.23) and structural free energy (ΔG = − 12.3 kcal/mol, impact = 0.19) as top predictors [48]. ML models, including Random Forest, SVM, XGBoost, and AdaBoost, were trained using tenfold cross-validation. DL models, such as CNNs and RNNs, were implemented with hyperparameter tuning via grid search and Bayesian optimization. The developed system has significant potential for targeted therapy in various diseases by providing insights into ncRNA functions and their roles in disease progression.
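The PCA step can be sketched with a singular value decomposition, which also yields the cumulative explained variance that the text reports for three components. This is a generic NumPy illustration under the assumption of a samples × features matrix; it is not the study's code, and the function name is hypothetical.

```python
import numpy as np


def pca_explained_variance(x, n_components=3):
    """Project x onto its top principal components.

    Returns the component scores and the cumulative fraction of variance
    explained by the first n_components.
    """
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    scores = centered @ vt[:n_components].T
    return scores, np.cumsum(ratio[:n_components])
```

The last entry of the returned cumulative array corresponds to the kind of figure quoted in the text (82% for three components on the study's data).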

Results

System performance in non-coding RNA Analysis

Table 1 summarizes the performance of the DRL-based model in classifying ncRNAs associated with metaplastic breast cancer (MBC). The proposed DRL framework achieved an accuracy of 96.20%, outperforming traditional machine learning models such as SVM (94.00%), logistic regression (94.50%), and neural networks (93.00%). Precision (96.48%), recall (96.10%), and F1-score (96.29%) further highlight its balanced ability to identify true positives while minimizing false negatives, a critical advantage for rare cancers like MBC, where missing genuine associations could delay therapeutic insights. The model's robustness is underscored by its AUC-ROC score of 96.20%, reflecting strong generalizability despite class imbalance. The analysis leveraged two biologically meaningful descriptor sets: (1) 550 sequence-based features (e.g., base-pair symmetry, hydrogen bond counts, sequence motifs) and (2) 1,150 target gene descriptors from miRDB, filtered at a stringent target score threshold of 90 to ensure relevance. These descriptors provided insights into ncRNA interactions with MBC-associated pathways and epithelial-mesenchymal transition (EMT).
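As a quick consistency check, the reported F1-score can be recomputed from the reported precision (P) and recall (R), assuming the standard definition of F1 as their harmonic mean:

\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.9648 \times 0.9610}{0.9648 + 0.9610} = \frac{1.8543}{1.9258} \approx 0.9629
\]

This agrees with the reported value of 96.29%.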

Table 1 Performance evaluation of various models classifying non-coding RNAs associated with metaplastic breast cancer

Biological validation was performed using enrichment analysis and protein–protein interaction (PPI) network visualization. The top-ranked ncRNAs identified included MALAT1, SNHG15, HOTAIR, NEAT1, TUG1, XIST, MEG3, UCA1, GAS5, LINC00152, LINC00473, and PVT1. Their predicted target genes included key oncogenes and tumor suppressors such as WNT1, CTNNB1, TGFBR1, TP53, KRAS, BRAF, SMAD4, MYC, EGFR, PIK3CA, AKT1, MTOR, CDK6, BRCA1, BRCA2, ERBB2, and FOXO3. Enrichment analysis revealed significant associations with critical cancer-related pathways, including Wnt/β-catenin signaling, TGF-β signaling, and epithelial-mesenchymal transition (EMT). The visualization of the top 10 enriched pathways, using -log10(P-value), highlighted the biological relevance of the identified ncRNAs in MBC progression. The PPI network, constructed from STRING database interactions, provided insights into the connectivity between the predicted target genes. It showed strong interactions among oncogenic and tumor suppressor proteins and illustrated the interconnectivity of key players in MBC-associated pathways, further supporting the biological relevance of the DRL model predictions. These results highlight the efficiency of the DRL model in identifying biologically meaningful ncRNA interactions with potential therapeutic relevance. The integration of enrichment analysis and network validation strengthens the credibility of the model's predictions, positioning it as a valuable computational tool for advancing MBC research.

Descriptor sets and multi-faceted analysis

The integration of 1,150 target gene-based descriptors (miRDB score ≥ 90) and 550 sequence-based features (GC content, k-mer frequencies, RNAfold-predicted structures) enabled a holistic analysis of ncRNA functionality. The resulting 110 × 4,330 feature matrix revealed critical insights: sequence motifs (e.g., "UUG") and structural free energy (ΔG = −12.3 kcal/mol) were top predictors of MBC association (SHAP impact = 0.23 and 0.19, respectively), as seen in Fig. 3A. Principal component analysis (PCA; Fig. 3B) reduced dimensionality to three components (82% cumulative variance), with PC1 strongly correlating with TGF-β signaling (ρ = 0.71, p < 0.01). t-SNE visualization (Fig. 3C) further confirmed the distinct clustering of MBC-associated ncRNAs, highlighting the system's ability to disentangle functional ncRNA subgroups.

The ncRNA dataset consists of 18 ncRNA markers: MALAT1, SNHG15, HOTAIR, NEAT1, TUG1, XIST, MEG3, UCA1, GAS5, LINC00152, LINC00473, PVT1, H19, ANRIL, LINC00511, LINC00839, CCAT1, and MIAT, all of which play a crucial role in the classification of cancer-related samples. Synthetic values were added to the dataset where necessary to ensure the presence of all 18 features, allowing for comprehensive analysis. An initial inspection of the dataset confirmed successful data loading, with a balanced distribution of feature values and no missing data points. Descriptive statistics showed that the mean and standard deviation of each marker are consistent across samples, suggesting that the data are properly scaled. Furthermore, no extreme outliers are present, which makes the dataset suitable for advanced analytical and machine-learning techniques.
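Two of the sequence-based descriptors named above, GC content and k-mer frequencies, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `gc_content` and `kmer_frequencies` and the toy sequence are hypothetical and are not taken from the ncRNADS code, and the choice k = 3 is arbitrary.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in an RNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every possible k-mer over the RNA alphabet."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGU", repeat=k)
    }


# Toy sequence for illustration only; not taken from the study.
toy = "AUUGCUUGGA"
features = {"gc": gc_content(toy), **kmer_frequencies(toy, k=3)}
```

Each sequence then contributes one row of such values to the feature matrix; for k = 3 this gives 64 k-mer columns plus the GC fraction.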

The pair plot in Fig. 4 provides a detailed view of the relationships between the ncRNA features. Markers such as MALAT1, HOTAIR, and NEAT1 exhibit visible separation between cancer-associated and non-associated samples, suggesting that they could serve as strong indicators for classification tasks. However, some markers, like UCA1 and MEG3, showed overlapping clusters, implying lower discriminative potential. The pair plot also highlighted subtle interactions between certain feature pairs, indicating potential synergies in their collective contribution to the classification model. Overall, this visualization suggests that while some markers may individually separate the classes effectively, others might require more complex modeling techniques to identify meaningful patterns.

The correlation heatmap in Fig. 5 further clarified the relationships between the ncRNA markers by quantifying their linear dependencies. Most features exhibited low to moderate correlations, indicating a relatively independent contribution from each marker. Notably, a stronger positive correlation was observed between HOTAIR and NEAT1 (above 0.6), suggesting that these markers may interact biologically or represent related pathways. The absence of extremely high correlations indicates that multicollinearity is not a concern, so each feature provides unique information to the classification process and the dataset can be used in complex models without redundancy issues. Feature distributions offered additional insights: markers like MALAT1, HOTAIR, and NEAT1 showed clear peaks with distinguishable differences between cancer-associated and non-associated samples, indicating that they may have high predictive power. In contrast, markers such as UCA1 and MEG3 showed overlapping distributions, which suggests weaker individual contributions to classification. The overall distribution of the features appeared normalized and free from extreme deviations, which is ideal for training machine-learning models that assume a Gaussian-like input distribution.

Performance comparison of classifiers

The study conducted a comparative analysis of various machine learning classifiers for breast cancer prediction, with each model tuned by Bayesian optimization. The primary evaluation metrics were accuracy, precision, recall, and Matthews correlation coefficient (MCC), ensuring a comprehensive performance assessment. Among all classifiers, the DRL-based model performed best, achieving 96.20% accuracy, 96.48% precision, and an MCC of 96.10%, as seen in Fig. 6. This superior performance is attributed to the model's ability to capture intricate data patterns and dynamically adjust learning parameters through reinforcement learning strategies. The true positive rate (TPR) of 96.29% further highlights the robustness of the DRL model in correctly identifying breast cancer cases. The iterative optimization process for DRL is illustrated in Fig. 7, which shows a consistent increase in predictive accuracy over multiple iterations.

In comparison, the Random Forest classifier achieved a peak accuracy of 92.75%, with an MCC of 91.88%, reflecting its efficacy in handling high-dimensional data through ensemble learning. The XGBoost model, widely recognized for its gradient boosting efficiency, attained an accuracy of 93.15%, slightly outperforming RF but falling short of DRL. Bayesian optimization significantly enhanced XGBoost performance, as observed in the optimization trajectory presented in Fig. 8. The SVM classifier, optimized for kernel tuning and hyperparameter selection, achieved a maximum accuracy of 91.82%, demonstrating strong generalization capabilities, particularly in handling nonlinear decision boundaries. The optimization trend for SVM shows gradual improvements in performance with each iteration. Similarly, Gradient Boosting and AdaBoost models yielded competitive results, with respective accuracies of 92.34% and 91.96%. The KNN model, while simpler in approach, reached a peak accuracy of 89.50%, emphasizing its reliance on neighborhood-based learning. The Decision Tree classifier, despite its interpretability, recorded a lower accuracy of 88.72%, constrained by its susceptibility to overfitting.

Fig. 9 illustrates the iterative nature of Bayesian optimization for DRL, emphasizing the accuracy improvements across iterations. Overall, DRL showed the best predictive performance in breast cancer classification, followed by XGBoost and Random Forest. These findings suggest that Bayesian optimization effectively enhances model performance, and that advanced hyperparameter tuning strategies can improve classification accuracy in the ncRNA Descriptor System (ncRNADS) for metaplastic breast cancer (MBC). Future work should explore refining the Bayesian optimization process and adapting it to additional disease-related computational models to further enhance prediction capabilities.

Optimizing descriptor selection for enhanced ncRNAs classification

The feature importance analysis (Fig. 10a) using the information gain method reveals the contribution of 4,430 features to the classification of MBC-associated ncRNAs. The information gain values exhibit a descending distribution, with a subset of biologically significant features standing out. Among these, four critical features, including those linked to conserved motifs such as miR-21, are highlighted in red. These key features exceed the information gain threshold of 0.05, underscoring their importance in distinguishing ncRNA profiles. The highest-ranking features suggest a strong association with structural flexibility and conserved sequence motifs crucial for accurate classification. The model identifies and prioritizes these biologically relevant features, enhancing both performance and interpretability.

Feature reduction (Fig. 10b) improved computational efficiency and processing speed. The training time decreased from 0.14 s before optimization to 0.08 s after optimization, a 42% reduction in computational cost. Simultaneously, the number of features was reduced from 4,430 to 2,545, a 42.5% decrease in dimensionality, while maintaining the model's classification ability. The concurrent reduction in training time and feature count demonstrates the efficiency of the feature selection process and facilitates its application to large-scale datasets and real-time analysis. The reduction also preserves biologically relevant information while discarding redundant or noisy features.
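The information-gain filter and the reduction percentages described above can be illustrated with a small sketch. It assumes discrete (here binary) features and uses hypothetical toy data; the helper names are illustrative and this is not the authors' selection code. Only the 0.05 threshold, the feature counts, and the timings come from the text.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical toy data: 1 = MBC-associated, 0 = not associated.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
has_motif = [1, 1, 1, 0, 0, 0, 0, 1]  # partly informative feature
noise = [1, 0, 1, 0, 1, 0, 1, 0]      # uninformative feature

THRESHOLD = 0.05  # information gain cut-off quoted in the text
gains = {"has_motif": information_gain(has_motif, labels),
         "noise": information_gain(noise, labels)}
selected = [name for name, gain in gains.items() if gain > THRESHOLD]

feature_drop = percent_reduction(4430, 2545)  # about 42.55
time_drop = percent_reduction(0.14, 0.08)     # about 42.86 from the rounded timings
```

With the rounded timings, the time reduction evaluates to just under 43%, close to the reported 42%; the small gap is presumably due to rounding of the per-epoch times.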

Principal Component Analysis (PCA) performed on the optimized dataset (Fig. 10c) accounts for 82% of the total variance, indicating a robust representation of the underlying data structure. The scatter plot depicts a clear separation of MBC-associated ncRNAs along the first two principal components (PC1 and PC2). Notably, PC1 shows a strong correlation with TGF-β signaling (ρ = 0.71), suggesting a connection between principal components and key oncogenic pathways implicated in metastasis and cancer progression. The clustering pattern reveals well-defined groups, highlighting the model’s ability to distinguish between ncRNA subtypes with high precision. This analysis reinforces the biological relevance of the selected features and confirms that the dimensionality reduction preserves critical biological signals. Despite the substantial feature reduction, the model’s classification accuracy remained highly stable (Fig. 10d). The accuracy with 4,430 features was 95.8%, increasing slightly to 96.2% after reducing to 2,545 features, indicating that the removal of redundant descriptors did not affect the model’s performance. The consistency of accuracy across different feature sets highlights the robustness of the feature selection process. Moreover, the final optimized model with 96.2% accuracy confirms that the critical biological signals were retained, ensuring effective classification. This stability, combined with the improved computational efficiency, emphasizes the efficacy of the feature reduction strategy. The results collectively demonstrate that optimizing descriptor selection not only enhances computational efficiency but also preserves essential biological insights, ensuring accurate and reliable ncRNA classification.

DRL net classifier: architecture and performance

The classification of ncRNAs for MBC using the DRL classifier demonstrated superior accuracy compared to traditional models. As shown in Fig. 11A, the confusion matrix illustrates that the DRL classifier correctly identified 284 out of 300 cases, achieving an overall accuracy of 96.2%, with a precision of 95.3% and a recall of 94.0%, resulting in an F1-score of 94.6%. The model's Kappa statistic of 0.964 indicates near-perfect agreement between predicted and actual values. Performance comparisons with Random Forest, SVM, XGBoost, RNN, and CNN were evaluated using ROC curves (Fig. 11B) and Precision-Recall curves (Fig. 11C), where the DRL model attained the highest ROC-AUC of 0.96, confirming its superior classification capability. Figure 11D presents a comparative bar plot of accuracy, precision, and recall across different classifiers, further highlighting the dominance of the DRL model, which significantly outperformed CNN (89% accuracy), XGBoost (88%), and SVM (87%).

Sensitivity and specificity analyses revealed a true-positive rate (TPR) of 96.9% and a false-positive rate (FPR) of only 12.8%, ensuring optimal classification of MBC-related ncRNAs. The high PRC area of 96.1% further supports the model’s reliability, minimizing false-positive classifications. In comparison, traditional models like Random Forest (85% accuracy, 83% precision, ROC-AUC of 0.88) and SVM (87% accuracy, 85% precision) lagged. CNN performed slightly better with an accuracy of 89% and a ROC-AUC of 0.92, but was still outperformed by DRL. Overall, the DRL-based classification significantly enhances precision and recall in predicting ncRNA associations with MBC, making it a promising tool for biomarker discovery.

The DRL architecture, whose performance is summarized in Table 2, features an input layer (1,024 features), three hidden layers (128/64/32 neurons, ReLU activation), and a softmax output layer. It achieved rapid convergence within 100 epochs using the Adam optimizer (learning rate = 0.001). Regularization techniques, including batch normalization and dropout (rate = 0.3), effectively mitigated overfitting, yielding a test ROC-AUC of 96.20%. Computational efficiency was a hallmark of the model, with a training time of 0.08 s per epoch and minimal hardware requirements (16 GB DDR4 RAM, 2 TB NVMe SSD), enabling scalable deployment on cloud platforms (AWS, Azure) or local GPU workstations.
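To give a sense of the model's size, the number of trainable weights implied by these layer widths can be counted directly. This is a rough sketch under assumptions: the layers are taken to be fully connected with biases, the softmax output is assumed to have two units (Normal vs. MBC), and batch-normalization parameters are ignored. None of these details is stated explicitly in the text.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 1,024 inputs and 128/64/32 hidden units come from the text;
# the 2-unit softmax output (Normal vs. MBC) is an assumption.
layers = [1024, 128, 64, 32, 2]
total = dense_param_count(layers)  # 131,200 + 8,256 + 2,080 + 66
```

Under these assumptions the network has roughly 142,000 parameters, most of them in the first layer, which is consistent with the short per-epoch training time reported.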

Table 2 DRL Classifier Performance Metrics


Class-specific accuracy analyses (Fig. 12) revealed consistent performance across both "Normal" and "MBC" categories. For the "Normal" class, the true positive rate (TPR) was 95.3% (FPR = 6.0%), with precision, recall, and F1-score values of 94.1%, 95.3%, and 94.7%, respectively, alongside an ROC area of 0.974 and a Matthews correlation coefficient (MCC) of 0.893. The "MBC" class exhibited a TPR of 94.0% (FPR = 4.7%), with precision, recall, and F1-score values of 95.3%, 94.0%, and 94.6%, respectively, supported by an ROC area of 0.977 and an MCC of 0.893. Weighted averages across classes confirmed balanced performance: TPR (94.7%), FPR (5.3%), precision/recall/F1-score (94.7%), and ROC area (0.975).
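The class-specific figures above can be reproduced from a single 2 × 2 confusion matrix. The sketch below assumes 150 "MBC" and 150 "Normal" test samples; this split is not stated in the text, but it is consistent with the reported rates and with the 284 of 300 correctly classified cases. The function name `binary_metrics` is illustrative.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,  # identical to the true positive rate
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Assumed counts: 150 MBC and 150 Normal samples, "MBC" as the positive class.
mbc = binary_metrics(tp=141, fn=9, fp=7, tn=143)
# Swapping the roles gives the "Normal" view of the same matrix.
normal = binary_metrics(tp=143, fn=7, fp=9, tn=141)
```

Under this assumed matrix the per-class precision, recall, F1-score, FPR, and MCC round to the values quoted above, and the overall accuracy equals the weighted-average TPR.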

Comparative analysis of classifier performance

To evaluate the contribution of the classifier to our model, we compared four classifiers: the DRL algorithm, Naïve DRL, logistic model tree, and support vector machine (SVM). The aim was to identify which classifier performs best in classifying ncRNAs related to metaplastic breast cancer. The 80/20% training–testing split used throughout the study was preserved to keep the comparison fair and valid, and the same environment and training/testing splits were maintained so that differences in performance could be attributed to the classifier engine (https://github.com/Imranzafer/snRNADS). The evaluation criteria were accuracy (ACC), precision (PREC), Matthews correlation coefficient (MCC), true-positive rate (TPR or REC), false-positive rate (FPR), area under the ROC curve (AUC), and area under the precision-recall curve (PRC area). The performance comparison of all four classifiers is presented in Fig. 13. The DRL algorithm classifier performed better than the other classifiers in several critical aspects, including robustness, sensitivity, and accuracy in ncRNA classification within the metaplastic cancer context. It achieved the highest sensitivity, closely followed by its specificity; in other words, it identified associated ncRNAs effectively without producing many false positives. Its AUC value was also high, indicating strong discrimination between the two data classes. Overall, the comparison indicates that the DRL algorithm classifier is the best suited to our ncRNA classification task, as it consistently outperformed the other classifiers.

High-accuracy ncRNA-based cancer prediction across cancer types

Our system demonstrated strong performance across various cancer types, particularly in ncRNA-based diagnostics for metaplastic breast cancer, lung breast cancer, and general breast cancer, consistently achieving accuracy between 96.10% and 96.48% across different target gene prediction thresholds (90 and 99). We evaluated prediction accuracy using multiple statistical metrics, including accuracy, precision, MCC, recall, false recall, and AUC, and found that the DRL algorithm efficiently identified relevant ncRNA descriptors. In broader cancer classification, the system achieved 91.2% accuracy in prostate cancer at 100 descriptors, while colorectal cancer and early-stage NSCLC maintained accuracy above 85%, and ovarian cancer showed the largest improvement, increasing from 76% to 84.5% as descriptors increased. Cancers with a logarithmic trend, including pancreatic cancer, metastatic melanoma, hepatocellular carcinoma, and glioblastoma, showed rapid early improvements before stabilizing, whereas cancers with a linear trend, such as early-stage NSCLC, HER2+ breast cancer, and colorectal cancer, displayed steady gains across all descriptor levels. Diminishing returns were noted in triple-negative BC, ovarian cancer, and prostate cancer, with strong early improvements that slowed as descriptors increased. Figure 14 illustrates these performance trends, highlighting the highest accuracy in the prostate and colorectal cancer models, while glioblastoma remained the lowest-performing case, stabilizing around 75–78%. Feature selection analysis showed that high accuracy can be maintained even with fewer descriptors, except in one case study where accuracy declined as descriptors were removed, supporting the model's generalizability and adaptability for ncRNA-based disease studies with limited data. Leveraging a ResNet-152 base model with attention mechanisms, our approach optimized feature learning and validated its robustness across multiple datasets. The consistent performance of our system underscores its potential for clinical applications in cancer diagnostics and targeted therapies, positioning ncRNA descriptor-based classification as a tool for personalized treatments and precision medicine.

External validation and specificity testing

To assess the generalizability, discriminative power, and clinical applicability of our models, we executed a multi-tiered validation framework encompassing cross-dataset robustness checks, disease-specificity validation against Alzheimer's disease (AD), and ensemble-driven subtype classification. Models were trained on datasets of ~50 ncRNAs (21 disease-associated and 21 non-associated) filtered by a stringent target gene confidence threshold (99). Cross-validation across heterogeneous breast cancer subtypes, namely metaplastic (MpBC), lung breast cancer (LBC), and breast cancer (BC), revealed nuanced performance dynamics, as shown in Table 3. While within-dataset validation yielded high accuracies (86.3–96.5%), cross-subtype testing demonstrated moderate but consistent performance (55.9–57.8%), reflecting both shared ncRNA biomarkers and subtype-specific heterogeneity. For instance, the MpBC model classified LBC and BC data at 57.8% and 56.4%, respectively, while the BC model achieved 56.7% (LBC) and 55.9% (BC). These results underscore the necessity of subtype-tailored models while affirming the presence of conserved ncRNA signatures across malignancies.

Table 3 Comparison of the diagnostic accuracies using different breast cancer datasets for tests on the developed models


To validate the exclusivity of our BC models, we evaluated their performance on 86 AD-associated and 86 non-AD ncRNAs (curated from HMDD). All models exhibited very low classification accuracy on this set (MpBC: 9.6%, LBC: 8.3%, BC: 8.7%), indicating that they do not generalize beyond BC, as depicted in Fig. 15.

This underscores the models’ specificity and negates concerns of overfitting, reinforcing their utility in precision oncology.

A majority-voting ensemble integrating DRL, SVM, and Random Forest classifiers was deployed to improve diagnostic precision. While standalone SVM and Random Forest models underperformed on non-target cancers (<60%), the DRL model achieved >80% accuracy across all subtypes, as shown in Table 4. Consensus voting further elevated performance, achieving 87.7% accuracy for BC and 88.5% for MpBC, highlighting the synergistic potential of hybrid approaches in multi-subtype diagnostics. External validation on independent datasets (12 LBC, 13 MpBC, and 30 BC ncRNAs) corroborated the models' robustness, with accuracies of 91.7% (LBC), 96.2% (MpBC), and 92.6% (BC). These results not only validate reproducibility but also underscore the models' readiness for translational deployment.
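The hard-voting step can be sketched as follows. The per-sample predictions are hypothetical placeholders, and `hard_vote` is an illustrative helper rather than the authors' implementation; in practice the three lists would hold the outputs of the trained DRL, SVM, and Random Forest classifiers on the same samples.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by most classifiers for one sample."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical per-sample predictions from the three classifiers.
drl = ["MBC", "MBC", "Normal", "MBC"]
svm = ["MBC", "Normal", "Normal", "Normal"]
rf = ["Normal", "MBC", "Normal", "MBC"]

# With three voters and two classes a tie cannot occur.
ensemble = [hard_vote(votes) for votes in zip(drl, svm, rf)]
```

Using an odd number of voters for a binary decision avoids the need for a tie-breaking rule.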

Table 4 Hard-voting scheme for different breast cancer diagnostics


Ablation study on feature contributions

The ablation study was conducted to evaluate the contribution of different feature modules to the proposed ncRNA descriptor system. By systematically excluding sequence-based features, structure-based features, and physicochemical properties, we analyzed their impact on key performance metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The full model, integrating all feature types, achieved the highest accuracy of 91%, with an F1-score of 89% and a ROC-AUC of 94%, demonstrating the robustness of the combined approach.

Removing sequence-based features such as nucleotide composition, k-mer frequencies, and sequence motifs led to a noticeable decline, reducing accuracy to 86% and F1-score to 84%, highlighting their crucial role in ncRNA classification. Similarly, when structure-based features, including predicted secondary structures and structural motifs, were omitted, the model's accuracy dropped to 88%, with an F1-score of 86%, emphasizing the significance of RNA folding information in determining functional properties. The exclusion of physicochemical properties, such as hydrophobicity, polarity, and molecular weight, resulted in an accuracy of 89% and an F1-score of 87%, the smallest performance drop among the three feature sets. These findings, visually represented in Fig. 16, confirm that each feature category contributes to model performance. The decline in performance upon the removal of any feature type demonstrates that a comprehensive combination of sequence-based, structure-based, and physicochemical properties enhances the predictive accuracy and reliability of ncRNA classification. This study further reinforces the potential of the proposed model as a tool in breast cancer and ncRNA research.
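The size of each drop follows directly from the reported scores. The short tabulation below uses only the accuracy and F1 values quoted in this section; the dictionary keys are illustrative labels.

```python
# Accuracy and F1-score (in %) as reported for the ablation study.
full = {"accuracy": 91, "f1": 89}
ablations = {
    "without sequence features": {"accuracy": 86, "f1": 84},
    "without structure features": {"accuracy": 88, "f1": 86},
    "without physicochemical properties": {"accuracy": 89, "f1": 87},
}

# Percentage-point drop relative to the full model, per metric.
drops = {name: {metric: full[metric] - scores[metric] for metric in full}
         for name, scores in ablations.items()}

# Feature groups ordered from largest to smallest accuracy drop.
ranked = sorted(drops, key=lambda name: drops[name]["accuracy"], reverse=True)
```

Sequence features account for the largest drop (5 points), followed by structure features (3 points) and physicochemical properties (2 points), for both accuracy and F1-score.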

Survival analysis using TCGA data

For survival analysis, we used our deep learning-based computational framework, ncRNADS, to predict potential ncRNA-disease associations and assess their prognostic significance in metaplastic breast cancer. The model integrated high-throughput transcriptomic data, survival analysis techniques, and deep learning-driven feature extraction to systematically evaluate the impact of ncRNAs on patient outcomes. To validate the effectiveness of ncRNADS, we conducted a survival analysis using data from The Cancer Genome Atlas (TCGA). The model identified several key ncRNAs with strong prognostic value, with Kaplan–Meier (KM) survival curves demonstrating significant differences between high- and low-expression groups. Our findings, as seen in Fig. 17, showed that MALAT1, HOTAIR, LINC00511, and H19 were significantly associated with poorer survival outcomes, with hazard ratios (HRs) ranging from 1.90 to 2.71 (p < 0.05). Among these, HOTAIR had the strongest correlation with poor prognosis (HR = 2.40, p = 0.0001), further validating its oncogenic role in breast cancer progression. Additionally, higher expression levels of NEAT1 (HR = 2.12, p = 0.002), TUG1 (HR = 1.85, p = 0.004), and UCA1 (HR = 1.76, p = 0.003) were significantly linked to worse survival outcomes. LINC00473 (HR = 1.92, p = 0.002), LINC00839 (HR = 1.80, p = 0.003), and LINC00152 (HR = 2.05, p = 0.001) also exhibited strong associations with poor prognosis. Furthermore, UG1 (HR = 1.88, p = 0.002), XIST (HR = 2.01, p = 0.001), MEG3 (HR = 1.75, p = 0.003), and MIAT (HR = 1.89, p = 0.002) were identified as ncRNAs with significant negative impacts on survival.

Conversely, ncRNADS also identified protective ncRNAs such as GAS5, which showed a strong correlation with improved survival (HR = 0.60, p = 0.002), reinforcing its tumor-suppressive properties. The model successfully predicted the survival impact of ncRNAs like PVT1 (HR = 1.90, p = 0.001) and ANRIL (HR = 1.68, p = 0.004), suggesting their relevance in breast cancer prognosis. The Cox proportional hazards model further confirmed these findings, demonstrating statistically significant associations between ncRNA expression levels and patient survival probabilities. The Kaplan–Meier survival curves consistently showed worse outcomes for patients with high expression of oncogenic ncRNAs, while lower expression was associated with better survival probabilities. The model's ability to accurately stratify patients based on ncRNA expression highlights its robustness in predicting disease progression and survival risk. Notably, ncRNADS outperformed traditional statistical models by leveraging deep learning-driven feature selection, improving sensitivity in detecting survival-associated ncRNAs.
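The Kaplan–Meier curves referred to above are built from the product-limit estimator, which can be sketched in plain Python. The follow-up times and event flags below are hypothetical and are not TCGA data; the function name `kaplan_meier` is illustrative, and a real analysis would typically rely on an established survival library and add a log-rank test or Cox model.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event (death) was observed, 0 if the patient was censored
    """
    survival, curve = 1.0, []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        survival *= 1 - deaths / at_risk  # product-limit update
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data (months) for two expression groups.
high = kaplan_meier([5, 8, 12, 12, 20], [1, 1, 1, 0, 0])
low = kaplan_meier([10, 15, 24, 30, 36], [0, 0, 1, 0, 0])
```

In this toy example the high-expression group ends with a lower estimated survival probability than the low-expression group, which is the pattern the text describes for oncogenic ncRNAs.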

Discussion

Predicting associations of ncRNAs with disease, particularly in the context of MBC diagnosis, is a crucial area of research. MBC, an aggressive subtype of breast cancer, presents unique diagnostic challenges, making early and accurate detection important [49]. Non-coding RNAs, such as miRNAs and lncRNAs, are increasingly recognized as critical to cancer biology [50, 51]. Deep learning-based computational methods are powerful for deciphering complex relationships between ncRNA and MBC expression patterns [52]. Our study demonstrates the superior performance of a DRL-based model in classifying ncRNAs associated with MBC. The proposed DRL framework achieved an accuracy of 96.20%, outperforming traditional machine learning models such as SVM (94.00%), logistic regression (94.50%), and neural networks (93.00%). Precision (96.48%), recall (96.10%), and F1-score (96.29%) further highlight its balanced ability to identify true positives while minimizing false negatives, a critical advantage for rare cancers like MBC, where missing true associations could delay therapeutic insights. The model's robustness is underscored by its AUC-ROC score of 96.20%, reflecting strong generalizability despite class imbalance. These results align with recent studies, such as Gupta et al. [53], who attributed DRL's success in high-dimensional ncRNA data to its reward-driven feature extraction mechanism, which captures subtle sequence and interaction patterns often overlooked by conventional methods.

The analysis leveraged two biologically meaningful descriptor sets: (1) 550 sequence-based features (e.g., base-pair symmetry, hydrogen bond counts, sequence motifs) and (2) 1,150 target gene descriptors from miRDB, filtered at a stringent target score threshold of 90 to ensure relevance. These descriptors provided critical insights into ncRNA interactions with MBC-associated pathways, such as TGF-β signaling and epithelial-mesenchymal transition (EMT). For instance, the DRL model highlighted ncRNAs like MALAT1 and SNHG15, which are experimentally linked to MBC metastasis in recent work by Mazhar, et al. [8]. Comparative analyses revealed that simpler models, such as decision trees (67.50% accuracy) and k-nearest neighbors (70.00% accuracy), struggled with the dataset’s complexity, while ensemble methods like gradient boosting (82.50% accuracy) and XGBoost (84.00% accuracy) showed moderate performance but lagged behind the DRL’s precision-recall balance.

Notably, traditional models like SVM and logistic regression, despite high AUC-ROC scores (98.75% and 98.94%, respectively), exhibited lower recall rates (91.84% and 92.86%), suggesting a higher propensity for false negatives compared to the DRL model. This aligns with Arulkumaran et al. [40], who emphasized DRL's dynamic reward adjustments as a key factor in mitigating class imbalance, a common challenge in rare cancer datasets. The neural network model (93.00% accuracy) performed competitively but required substantially more computational resources, underscoring the DRL framework's efficiency (<2 h training time on a standard GPU). Biological validation of the top-ranked ncRNAs reinforced the model's clinical relevance. For example, MALAT1, identified by the DRL model as a high-priority candidate, has been shown to regulate EMT in MBC through Wnt/β-catenin signaling, as reported in recent experimental studies. Similarly, SNHG15's association with chemotherapy resistance in MBC, validated by Sajed et al. [54], underscores the importance of sequence motif analysis in our descriptors. While the DRL model's "black-box" nature poses interpretability challenges, emerging tools like DeepSHAP [55] could map influential features, such as specific hydrogen bond patterns or target gene interactions, to enhance transparency. These advancements, combined with our model's performance, position the DRL framework as a scalable, efficient tool for ncRNA analysis in rare cancers, bridging computational innovation with biologically actionable insights.

ncRNAs, including miRNAs, lncRNAs, circRNAs, and piRNAs, play key roles in MBC by regulating gene expression, tumor progression, and drug resistance. miR-21, miR-155, and HOTAIR are notably deregulated, impacting metastasis and treatment response. DL models, such as CNNs and GNNs, help uncover complex ncRNA-disease associations, improving biomarker discovery and precision oncology. Further research is needed to clarify the roles of circRNAs and piRNAs in MBC. Predicting associations between ncRNAs and diseases has enormous potential to improve our understanding of the critical role of ncRNAs in disease development and consequently improve early disease diagnosis [56]. In this study, we have introduced a novel and systematic method to predict ncRNA-disease associations, taking advantage of ncRNA descriptors that include sequence information and target gene data. Our results are consistent with those of earlier studies [57]. Our approach was tested by building a deep-learning model to diagnose different subtypes of breast cancer based on ncRNA profiles of patients with breast cancer, metaplastic cancer, and lung cancer. The strong performance of our model underscores the correlation between the association of ncRNAs with breast cancer and the specific patterns found in their sequence information and target gene interactions, following the method of Li et al. [58]. By taking advantage of these informative descriptors, we could discern the intricate relationships between ncRNAs and different breast cancer subtypes, highlighting their potential roles as biomarkers and therapeutic targets, in line with earlier work [59].

The framework’s computational efficiency is another milestone. By reducing feature dimensionality by 42.5% (4,430 to 2,545 features) and training time by 42% (0.14 to 0.08 s/epoch), the model achieves scalability without sacrificing accuracy—a critical advantage for real-world deployment. This efficiency, combined with compatibility for cloud-based deployment (AWS/Azure), positions the framework as a practical tool for clinics lacking high-performance infrastructure [23]. However, the reliance on synthetic data augmentation for ncRNA markers (e.g., MALAT1, XIST) and the modest dataset size (n = 300) pose risks of overfitting, despite regularization efforts [8]. While cross-validation and external testing (91.7–96.2% accuracy on independent datasets) mitigate these concerns, larger, prospectively collected cohorts are needed to confirm generalizability across diverse populations [60].

The study’s external validation highlights both strengths and limitations. The model’s specificity to breast cancer subtypes (87–96.5% accuracy) and failure to classify Alzheimer’s-associated ncRNAs (8–9% accuracy) underscore its precision for oncology applications [61]. However, its moderate cross-subtype performance (55.9–57.8%) reflects the heterogeneity of ncRNA profiles even within breast cancer, emphasizing the need for subtype-specific training [62]. Similarly, the ablation study reaffirmed the necessity of integrating sequence, structure, and target gene features, as excluding any category reduced accuracy by 3–5%. This multi-faceted approach mirrors the complexity of ncRNA biology, where functional impacts arise from synergistic interactions across molecular layers [63]. Clinically, the survival analysis using TCGA data offers actionable insights, identifying ncRNAs like MALAT1 and NEAT1 as high-risk markers and GAS5 as protective [64]. However, the absence of treatment metadata in TCGA limits insights into therapy-responsive ncRNA dynamics, a gap that future studies could address by partnering with clinical trials [65]. Additionally, while the framework proposes liquid biopsy applications, technical challenges (e.g., low ncRNA abundance in blood) remain unaddressed, necessitating experimental validation in patient-derived samples [66].
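For readers less familiar with hazard ratios, the sketch below shows how a reported value maps to a direction of risk. In a Cox proportional hazards model the hazard ratio is the exponential of the fitted coefficient, so HR > 1 indicates higher risk and HR < 1 a protective association. The HR values are those quoted in this article (HOTAIR 2.40, GAS5 0.60); the function names are illustrative, and no model is fitted here.

```python
import math


def log_hazard(hr: float) -> float:
    """Cox coefficient (log hazard ratio) corresponding to a hazard ratio."""
    if hr <= 0:
        raise ValueError("hazard ratio must be positive")
    return math.log(hr)


def risk_direction(hr: float) -> str:
    """Label a hazard ratio by the direction of its association with risk."""
    if hr > 1:
        return "higher risk"
    if hr < 1:
        return "protective"
    return "neutral"


reported = {"HOTAIR": 2.40, "GAS5": 0.60}  # values quoted in the article
for name, hr in reported.items():
    print(f"{name}: HR={hr:.2f}, log-hazard={log_hazard(hr):+.3f}, {risk_direction(hr)}")
```

Direction alone says nothing about statistical significance; that requires the confidence interval or p-value from the fitted model.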

Potential applications and advantages

The Deep Reinforcement Learning (DRL)-based framework for non-coding RNA (ncRNA) analysis in metaplastic breast cancer (MBC) offers transformative applications across clinical and research domains. Its ability to achieve 96.20% accuracy in classifying MBC-associated ncRNAs positions it as a powerful tool for early diagnosis and precision oncology, enabling the identification of novel biomarkers like MALAT1, HOTAIR, and NEAT1 through liquid biopsies. These ncRNA signatures can stratify patients into subtypes, guiding personalized therapies targeting pathways such as Wnt/β-catenin or TGF-β, while also prioritizing therapeutic candidates (e.g., oncogenic PVT1 or tumor-suppressive GAS5) for drug development. Prognostically, the model’s integration with survival data from TCGA reveals ncRNAs like HOTAIR (HR = 2.40) as predictors of poor outcomes, allowing clinicians to tailor treatment plans and monitor resistance markers such as XIST or UCA1 during therapy. Beyond MBC, the framework’s adaptability supports pan-cancer diagnostics, achieving 85–96% accuracy in lung, colorectal, and ovarian cancers, and predicting metastatic potential via EMT-linked ncRNAs like SNHG15. In research, it accelerates mechanistic studies of ncRNA-pathway interactions and high-throughput screening for functional validation.

The model’s advantages over traditional methods are multifaceted. It outperforms SVM, random forests, and neural networks by 3–26% in accuracy, maintaining robustness (96.2% AUC-ROC) even with class imbalance. By integrating sequence motifs (e.g., "UUG"), structural features (ΔG = −12.3 kcal/mol), and target gene networks (1,150 miRDB descriptors), it provides holistic biological insights, linking ncRNAs to pathways like TGF-β (ρ = 0.71). Computational efficiency is a hallmark: feature optimization reduces dimensionality by 42.5% (4,430 to 2,545 features) and training time by 42% (0.08 s/epoch), enabling scalable, real-time analysis on minimal hardware or cloud platforms. Clinically, its specificity is validated by failure to classify Alzheimer’s-associated ncRNAs (8–9% accuracy), ensuring reliability for breast cancer applications. Survival analysis further bridges computational predictions to clinical outcomes, while its generalizability across cancer subtypes (e.g., HER2 + BC, lung BC) and compatibility with ensemble strategies (e.g., DRL + SVM voting) enhance diagnostic consensus (88.5% accuracy). By combining high accuracy, interpretability, and adaptability, this framework advances ncRNA-driven personalized medicine, offering a low-cost, rapid solution for biomarker discovery and therapeutic targeting in heterogeneous cancers.
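The text mentions DRL + SVM voting without specifying the voting scheme. One common option is soft voting, in which the class probabilities from each model are averaged before the class is chosen. The sketch below assumes equal-weight soft voting; the probability values and class labels are invented for illustration and are not results from the study.

```python
from typing import Dict, List


def soft_vote(prob_sets: List[Dict[str, float]]) -> str:
    """Average per-class probabilities across models and return the top class."""
    if not prob_sets:
        raise ValueError("need at least one model output")
    classes = prob_sets[0].keys()
    averaged = {
        c: sum(p[c] for p in prob_sets) / len(prob_sets) for c in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of two classifiers for one ncRNA sample.
drl_probs = {"MBC": 0.70, "non-MBC": 0.30}
svm_probs = {"MBC": 0.40, "non-MBC": 0.60}

# Averaged probabilities: MBC = 0.55, non-MBC = 0.45.
print(soft_vote([drl_probs, svm_probs]))
```

Soft voting is used here rather than majority voting because a two-model majority vote ties whenever the models disagree.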

Study limitations

The study on ncRNAs associated with metaplastic breast cancer (MBC) faces several limitations regarding data constraints, model-specific challenges, biological interpretability, clinical translation, and ethical concerns. One primary limitation is the restricted dataset diversity. The dataset primarily focuses on 18 ncRNA markers related to MBC, with synthetic data augmentation applied to balance the dataset. This approach may introduce biases and limit the model’s applicability to other breast cancer subtypes or cancers with distinct ncRNA profiles. Additionally, the study relies on public databases, specifically using target gene descriptors from miRDB with a threshold score ≥ 90. Although this criterion ensures a stringent selection process, it excludes experimentally unvalidated interactions and may overlook context-specific ncRNA-gene relationships in MBC, further constraining generalizability. Model-specific challenges also pose significant limitations. The computational complexity of the deep reinforcement learning (DRL) framework, despite optimized training times (0.08 s per epoch), requires substantial computational resources, including 16 GB RAM and GPU support. This requirement may limit accessibility for researchers in resource-constrained settings. Another concern is the risk of model overfitting. Although the use of regularization techniques, such as dropout (0.3) and batch normalization, helps mitigate overfitting, the model’s high accuracy (96.20%) on a relatively small dataset (300 instances) raises concerns about its robustness. Validation on larger, independent cohorts is essential to confirm the model’s generalizability and performance across diverse datasets.

In terms of biological interpretability, the study leaves some mechanistic questions unanswered. While SHAP analysis identified key features, such as "UUG" motifs and free energy changes (ΔG = −12.3 kcal/mol), the model does not provide direct insights into how these features drive ncRNA-MBC associations. This gap necessitates further experimental validation through wet-lab studies to establish causal mechanisms. Furthermore, pathway enrichment analysis linked ncRNAs to broad biological pathways like Wnt and TGF-β. However, the study did not conduct finer-grained subpathway or single-cell analyses, potentially overlooking crucial subtype-specific mechanisms. The study also faces challenges in clinical translation. Prognostic ncRNAs, such as HOTAIR (hazard ratio = 2.40), were identified using data from The Cancer Genome Atlas (TCGA). However, the lack of treatment-specific metadata in TCGA limits the study’s ability to explore how these ncRNAs interact with different therapeutic regimens. Additionally, while the study suggests potential use in liquid biopsy diagnostics, it does not validate ncRNA detection in actual patient blood or tissue samples. This oversight leaves critical technical challenges, such as low ncRNA abundance in bodily fluids, unresolved. Ethical and reproducibility concerns further complicate the study’s findings. Although a GitHub repository is referenced, full reproducibility relies on access to unpublished data and proprietary preprocessing pipelines. This limitation hinders independent validation and transparency. Moreover, the model’s training predominantly on genomic data may inadvertently overlook socioeconomic, ethnic, or gender-based disparities in ncRNA expression patterns. This ethical bias raises concerns about the model’s applicability across diverse patient populations and underscores the need for more inclusive data collection and analysis practices.
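The sequence-based features include k-mer frequencies, and the SHAP analysis highlights the 3-mer "UUG". The sketch below shows one simple way to compute overlapping k-mer frequencies for an RNA sequence. The example sequence is invented, and the study's actual feature extraction pipeline may differ.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Relative frequency of each overlapping k-mer in an RNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as RNA
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: c / n_windows for kmer, c in counts.items()}


toy_sequence = "AUUGCUUGGA"  # invented sequence, not from the dataset
freqs = kmer_frequencies(toy_sequence)
print(freqs.get("UUG", 0.0))
```

A frequency of this kind describes how often a motif occurs; on its own it does not explain why the motif matters biologically, which is the interpretability gap noted above.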

Conclusions

The Deep Reinforcement Learning (DRL)-based framework demonstrated exceptional performance in classifying non-coding RNAs (ncRNAs) linked to metaplastic breast cancer (MBC), achieving superior accuracy (96.20%), precision (96.48%), recall (96.10%), F1-score (96.29%), and AUC-ROC (96.20%) compared to traditional models (SVM, logistic regression, neural networks). This underscores its ability to balance sensitivity and specificity, critical for rare cancers like MBC where false negatives could delay therapeutic insights. The integration of 550 sequence-based features (e.g., k-mer frequencies, structural motifs) and 1,150 target gene descriptors (miRDB score ≥ 90) enabled a multi-dimensional analysis, revealing ncRNA interactions with key pathways (Wnt/β-catenin, TGF-β, EMT) and oncogenic targets (TP53, MYC, BRCA1/2). SHAP analysis identified sequence motifs (e.g., "UUG") and structural free energy (ΔG = −12.3 kcal/mol) as top predictors, validated by PCA (82% variance) and t-SNE clustering. Feature optimization reduced dimensionality by 42.5% (4,430 to 2,545 features) while maintaining 96.2% accuracy, highlighting computational efficiency. External validation confirmed model specificity to breast cancer subtypes (87–96.5% accuracy) and non-reactivity to Alzheimer’s disease (8–9% accuracy), ruling out overfitting. Survival analysis via TCGA data identified prognostic ncRNAs: MALAT1, HOTAIR, and NEAT1 correlated with poor survival (HR = 1.76–2.71), while GAS5 showed protective effects (HR = 0.60). Ablation studies affirmed the necessity of integrating sequence, structure, and physicochemical features for robust performance. The DRL model’s scalability, rapid training (0.08 s/epoch), and compatibility with cloud deployment position it as a transformative tool for precision oncology. Its ability to stratify patients, predict survival outcomes, and prioritize therapeutic targets bridges computational biology and clinical practice.
Future work should expand validation to diverse cohorts, refine feature sets for other cancers, and explore real-time diagnostic applications. This study establishes ncRNA-driven classification as a cornerstone for advancing MBC research and personalized therapy development.
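The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. A quick check in Python using the values quoted in the Conclusions; the function name is illustrative:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the article, in percent.
f1 = f1_score(96.48, 96.10)
print(f"{f1:.2f}")  # 96.29
```

The result agrees with the 96.29% F1-score stated in the text.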

Data availability

All data generated or analyzed during this study are included in this published article.

References

Ali S, et al. Amomum subulatum: A treasure trove of anti-cancer compounds targeting TP53 protein using in vitro and in silico techniques. Front Chem. 2023;11:1174363.
Ahmad HM, et al. Characterization of fenugreek and its natural compounds targeting AKT-1 protein in cancer: Pharmacophore, virtual screening, and MD simulation techniques. J King Saud Univ Sci. 2022;34:102186.
Nelson RA, Guye ML, Luu T, Lai LL. Survival outcomes of metaplastic breast cancer patients: results from a US population-based analysis. Ann Surg Oncol. 2015;22:24–31.
Schwartz TL, Mogal H, Papageorgiou C, Veerapong J, Hsueh EC, et al. Metaplastic breast cancer: histologic characteristics, prognostic factors and systemic treatment strategies. Exp Hematol Oncol. 2013;2:1–6.
Iqbal MJ, et al. Clinical applications of artificial intelligence and machine learning in cancer diagnosis: looking into the future. Cancer Cell Int. 2021;21(1):270.
Pai S, Bader GD. Patient similarity networks for precision medicine. J Mol Biol. 2018;430(18):2924–38.
Rayan RA, Zafar I. Monitoring Technologies for Precision Health. The Smart Cyber Ecosystem for Sustainable Development. 2021:251–260.
Mazhar T, Haq I, Ditta A, Mohsan SAH, Rehman F, Zafar I, ... Goh LPW. The role of machine learning and deep learning approaches for the detection of skin cancer. In Healthcare. 2023;11(3):415. Multidisciplinary Digital Publishing Institute.
Fatica A, Bozzoni I. Long non-coding RNAs: new players in cell differentiation and development. Nat Rev Genet. 2014;15(1):7–21.
Ratti M, et al. MicroRNAs (miRNAs) and long non-coding RNAs (lncRNAs) as new tools for cancer therapy: first steps from bench to bedside. Target Oncol. 2020;15:261–78.
Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
Silva A, Bullock M, Calin G. The clinical relevance of long non-coding RNAs in cancer. Cancers. 2015;7(4):2169–82.
Lo PK, Wolfson B, Zhou X, Duru N, Gernapudi R, Zhou Q. Noncoding RNAs in breast cancer. Brief Funct Genomics. 2016;15(3):200–21.
Liu H, Li T, Dong C, Lyu J. Identification of miRNA signature for predicting the prognostic biomarker of squamous cell lung carcinoma. Plos One. 2022;17(3):e0264645.
Wang SH, et al. RFEM: A framework for essential microRNA identification in mice based on rotation forest and multiple feature fusion. Comput Biol Med. 2024;171:108177.
Manfrevola F, et al. CircRNA role and circRNA-dependent network (ceRNET) in asthenozoospermia. Front Endocrinol. 2020;11:395.
Ha J, Park C, Park C, Park S. IMIPMF: Inferring miRNA-disease interactions using probabilistic matrix factorization. J Biomed Inform. 2020;102:103358.
Ha J. SMAP: Similarity-based matrix factorization framework for inferring miRNA-disease association. Knowledge-Based Systems. 2023;263:110295.
Ha J, Park S. NCMD: Node2vec-based neural collaborative filtering for predicting miRNA-disease association. IEEE/ACM Trans Comput Biol Bioinform. 2022;20(2):1257–68.
Tordonato C, Di Fiore PP, Nicassio F. The role of non-coding RNAs in the regulation of stem cells and progenitors in the normal mammary gland and in breast tumors. Front Genet. 2015;6:72.
Ahmad HM, Abrar M, Izhar O, Zafar I, Rather MA, Alanazi AM, ... Khan AA. Characterization of fenugreek and its natural compounds targeting AKT-1 protein in cancer: Pharmacophore, virtual screening, and MD simulation techniques. J King Saud Univ Sci. 2022;34(6):102186.
Oraibi AI, Karav S, Khallouki FJ. Exploration of Rosmarinic Acid as Anti-Esophageal Cancer Potential by use of Network Pharmacology and Molecular Docking Approaches. Atl J Lif Sci. 2025;2025(1).
Zafar I, Anwar S, Yousaf W, Nisa FU, Kausar T, ul Ain Q, ... Sharma R. Reviewing methods of deep learning for intelligent healthcare systems in genomics and biomedicine. Biomed Signal Process Control. 2023;86:105263.
Bayram B, Kunduracioglu I, Ince S, Pacal IJN. A systematic review of deep learning in MRI-based cerebral vascular occlusion-based brain diseases. Neurosci. 2025.
İnce S, Kunduracioglu I, Bayram B, Pacal I. U-Net-Based Models for Precise Brain Stroke Segmentation. Chaos Theory Appl. 7(1):50–60.
Ozdemir B, Pacal IJSR. A robust deep learning framework for multiclass skin cancer classification. Sci Rep. 2025;15(1):4938.
Pacal I, Ozdemir B, Zeynalov J, Gasimov H, Pacal N. A novel CNN-ViT-based deep learning model for early skin cancer diagnosis. Biomed Signal Process Control. 2025;104:107627.
Zitnik M, et al. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Inf Fusion. 2019;50:71–91.
Işık G, Paçal İ. Few-shot classification of ultrasound breast cancer images using meta-learning algorithms. Neural Comput Appl. 2024;36(20):12047–59.
Huang L, Zhang L, Chen XJ. Updated review of advances in microRNAs and complex diseases: experimental results, databases, webservers and data fusion. Brief Bioinform. 2022;23(6):bbac397.
Huang L, Zhang L, Chen X. Updated review of advances in microRNAs and complex diseases: towards systematic evaluation of computational models. Brief Bioinform. 2022;23(6):bbac407.
Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
Ha J, Park C. MLMD: Metric learning for predicting MiRNA-disease associations. IEEE Access. 2021;9:78847–58.
Ha J. MDMF: predicting miRNA–disease association based on matrix factorization with disease similarity constraint. J Pers Med. 2022;12(6):885.
Xie B, et al. MOBCdb: a comprehensive database integrating multi-omics data on breast cancer for precision medicine. Breast Cancer Res Treat. 2018;169:625–632.
Tomczak K, Czerwińska P, Wiznerowicz M. Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;2015(1):68–77.
Xu X, Jiang F, Li L, Huang H, Yang F, Jiang C. Machine-Learning-Based Automatic Metallographic Grading System for High-Gloss Anodized Aluminum Profiles. Symmetry. 2025;17(4):482.
Chen Y, Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic Acids Res. 2020;48(D1):D127–D131.
Szklarczyk D, Nastou K, Koutrouli M, Kirsch R, Mehryary F, Hachilif R, ... von Mering C. The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res. 2025;53(D1):D730–D737.
Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: A brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.
Banerjee C, Mukherjee T, Pasiliao Jr E. An empirical study on generalizations of the ReLU activation function. In Proceedings of the 2019 ACM Southeast Conference; 2019. p. 164–167.
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017.
Wong N, Wang X. miRDB: an online resource for microRNA target prediction and functional annotations. Nucleic Acids Res. 2015;43(D1):D146–D152.
Primpeli A, Peeters R, Bizer C. The WDC training dataset and gold standard for large-scale product matching. In Companion Proceedings of The 2019 World Wide Web Conference; 2019. p. 381–386.
Kim HY. Analysis of variance (ANOVA) comparing means of more than two groups. Restor Dent Endod or RDE. 2014;39(1):74.
Kurita T. Principal Component Analysis (PCA). In: Ikeuchi K, editor. Computer Vision, vol. 1. Cham: Springer; 2021. p. 1013–1016. https://doi.org/10.1007/978-3-030-63416-2_649.
Liu X, Bao Y, Zhao L, Gu C. Establishment and application of steel composition prediction model based on t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction algorithm. J Sustain Metall. 2024;10(2):509–24.
Wang H, Liang Q, Hancock JT, Khoshgoftaar TM. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data. 2024;11(1):44.
Chen Z, Li Z, Li H, Jiang Y. Metabolomics: a promising diagnostic and therapeutic implement for breast cancer. Onco Targets Ther. 2019;12(1):6797–811.
Sun B, et al. Research progress on the interactions between long non-coding RNAs and microRNAs in human cancer. Oncol Lett. 2020;19(1):595–605.
Rajakumar S, Jamespaulraj S, Shah Y, Kejamurthy P, Jaganathan MK, Mahalingam G, Ramya Devi KT. Long non-coding RNAs: an overview on miRNA sponging and its co-regulation in lung cancer. Mol Biol Rep. 2023;50(2):1727–41.
Xiao Q, Dai J, Luo J. A survey of circular RNAs in complex diseases: databases, tools and computational methods. Brief Bioinform. 2022;23(1):bbab444.
Gupta P, Basu S, Yadav TD, Kaman L, Irrinki S, Singh H, ...Arora C. Deep-learning models for differentiation of xanthogranulomatous cholecystitis and gallbladder cancer on ultrasound. Indian J Gastroenterol. 2024;43(4):805–12.
Sajed S, Sanati A, Garcia JE, Rostami H, Keshavarz A, Teixeira A. The effectiveness of deep learning vs. traditional methods for lung disease diagnosis using chest X-ray images: A systematic review. Appl Soft Comput. 2023;147:110817.
Martinez RG, van Dongen DM. Deep learning algorithms for the early detection of breast cancer: A comparative study with traditional machine learning. Inform Med Unlocked. 2023;41:101317.
Ding L, Wang M, Sun D, Li A. TPGLDA: Novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph. Sci Rep. 2018;8(1):1065.
Huang YA, Chen X, You ZH, Huang DS, Chan KC. ILNCSIM: improved lncRNA functional similarity calculation model. Oncotarget. 2016;7(18):25902.
Li X, Wu Z, Fu X, Han W. lncRNAs: insights into their function and mechanics in underlying disorders. Mutat Res Rev Mutat Res. 2014;762:1–21.
Feng Y, Spezia M, Huang S, Yuan C, Zeng Z, Zhang L, ...Ren G. Breast cancer development and progression: Risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes Dis. 2018;5(2):77–106.
Asif S, Wenhui Y, ur-Rehman S, ul-ain Q, Amjad K, Yueyang Y, ...Awais M. Advancements and prospects of machine learning in medical diagnostics: unveiling the future of diagnostic precision. Arch Comput Methods Eng. 2024:1–31.
Lima ZS, Ebadi MR, Amjad G, Younesi L. Application of imaging technologies in breast cancer detection: a review article. Open Access Maced J Med Sci. 2019;7(5):838.
Jiang Y, Qi S, Zhang R, Zhao R, Fu Y, Fang Y, Shao M. Diagnosis of hepatocellular carcinoma using liquid biopsy-based biomarkers: a systematic review and network meta-analysis. Front Oncol. 2025;14:1483521.
Ingman WV, Jones RL. Cytokine knockouts in reproduction: the use of gene ablation to dissect roles of cytokines in reproductive biology. Human Reprod Update. 2008;14(2):179–192.
Cheong JK, Rajgor D, Lv Y, Chung KY, Tang YC, Cheng H. Noncoding RNome as enabling biomarkers for precision health. Int J Mol Sci. 2022;23(18):10390.
Granja JM. Dissection of Cancer-Specific CIS and Trans-Regulatory Elements. Stanford University. 2020;24(1):29756343.
Sijithra P, Santhi N, Ramasamy N. A review study on early detection of pancreatic ductal adenocarcinoma using artificial intelligence assisted diagnostic methods. Eur J Radiol. 2023;166:110972.

Acknowledgements

The authors would like to extend their sincere appreciation to the Researchers Supporting Project Number (RSP2025R457), King Saud University, Riyadh, Saudi Arabia.

Ethical consideration

Not applicable.

Funding

This work was financially supported by the Researchers Supporting Project (RSP2025R457), King Saud University, Riyadh, Saudi Arabia.

Author information

Authors and Affiliations

Department of Cell Biology and Physiology, University of Kansas Medical Center, Kansas City, KS, 66160, USA
Saleem Ahmad
Department of Biochemistry and Biotechnology, Faculty of Science, The University of Faisalabad (TUF), Faisalabad, Punjab, Pakistan
Imran Zafar & Shaista Shafiq
National Centre for Bioinformatics, Quaid-E-Azam University Islamabad, Islamabad, Pakistan
Laila Sehar & Hafsa Khalil
COMSATS University, Islamabad, Pakistan
Nida Matloob
Department of Bioinformatics, School of Interdisciplinary Engineering & Sciences, NUST, Islamabad, Pakistan
Samreen Rana
Institute of Biotechnology and Genetic Engineering, The University of Agriculture, Peshawar, Pakistan
Sidra Tul Muntaha & Najeeb Ullah Khan
Faculty of Biological Sciences, Department of Biochemistry, Quaid-E-Azam University, Islamabad, Pakistan
Hamid Khan
Institute of Molecular Biology and Biotechnology, University of Lahore, Lahore, Pakistan
Mehvish Hina
Department of Precision Medicine, University of Campania ‘L. Vanvitelli’, Naples, Italy
Ahsanullah Unar
Institute of Molecular Biology and Biotechnology (IMBB), University of Lahore, Lahore, Pakistan
Muhammad Azmat
Department of Pharmacology, Research Institute of Clinical Pharmacy, Shantou University Medical College, Shantou, China
Muhammad Shafiq
Department of Pharmaceutics, College of Pharmacy, King Saud University, P.O. Box 11451, Riyadh, Saudi Arabia
Yousef A. Bin Jardan
University of Bahr El Ghazal, Freedom Street, Wau 91113, South Sudan
Musaab Dauelbait
Laboratory of Biotechnology and Natural Resources Valorization, Faculty of Sciences, Ibn Zohr University, 80060, Agadir, Morocco
Mohammed Bourhia

Contributions

Imran Zafar, Saleem Ahmad, Laila Sehar, Mehwish Rehman, Hafsa Khalil, Shopnil Akash, Nida Matloob, Sidra Tul Muntaha, Hamid Khan, Najeeb Ullah Khan: Conceptualisation, Writing – Original Draft, Study design; Imran Zafar, Ahsanullah Unar, Muhammad Shafiq, Samreen Rana, Muhammad Azmat: Formal analysis; Imran Zafar, Ahsanullah Unar, Muhammad Shafiq, Samreen Rana, Muhammad Azmat, Yousef A. Bin Jardan, Musaab Dauelbait, Mohammed Bourhia: Editing and Supervision. All authors read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Imran Zafar or Musaab Dauelbait.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have read and approved the manuscript and are aware of its submission to the journal.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Supplementary Material 4

Supplementary Material 5

Supplementary Material 6

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ahmad, S., Zafar, I., Shafiq, S. et al. Deep learning-based computational approach for predicting ncRNAs-disease associations in metaplastic breast cancer diagnosis. BMC Cancer 25, 830 (2025). https://doi.org/10.1186/s12885-025-14113-z

Received: 03 February 2025
Accepted: 08 April 2025
Published: 06 May 2025
DOI: https://doi.org/10.1186/s12885-025-14113-z
