Evaluating key predictors of breast cancer through survival: a comparison of AFT frailty models with LASSO, ridge, and elastic net regularization
BMC Cancer volume 25, Article number: 665 (2025)
Abstract
Background
Frailty models are extensively utilized in survival analysis to address unobserved heterogeneity among individuals. However, selecting the most robust model for survival prediction, especially in the context of high-dimensional data, continues to pose a challenge. This study evaluates the performance of various Accelerated Failure Time (AFT) frailty models and examines the influence of regularization techniques, including LASSO, Ridge, and Elastic Net, on model selection and prediction accuracy.
Methods
We utilized both simulated datasets and a real breast cancer dataset to compare the performance of seven Accelerated Failure Time (AFT) frailty models: Weibull, Log-logistic, Gamma, Gompertz, Log-normal, Generalized Gamma, and the Extreme Value Frailty AFT model. Model performance was evaluated using Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE) metrics across three sample sizes (25%, 50%, and 75%). To enhance parameter estimation and reduce overfitting in high-dimensional survival data, we applied regularization methods, including LASSO, Ridge, and Elastic Net.
Results
The Extreme Value Frailty AFT model consistently outperformed all other models across various sample sizes, demonstrating the lowest values for AIC, BIC, MAE, and MSE. These results indicate its superior fit and predictive accuracy. The forest plot analysis further validates the strong impact of significant covariates. The model's AIC ranged from 100.41 at a 25% sample size to 384.58 at a 75% sample size, consistently lower than that of the second-best Log-logistic model. Furthermore, the application of LASSO regularization improved the model's parsimony by eliminating non-informative covariates, such as Age, PR, and Hospitalization, while retaining essential predictors like Competing Risks, Metastasis, Stage, and Lymph Node involvement.
Conclusion
The Extreme Value Frailty Accelerated Failure Time (AFT) model demonstrated strong predictive performance in survival analysis, particularly when combined with LASSO regularization to enhance interpretability and generalizability. Key predictors, including Comorbidity, Metastasis, Stage, and Lymph Node involvement, remained significant after regularization, with reduced coefficients. Notably, patients without metastasis had 2.63 times longer expected survival than those with metastatic disease, while lower-stage diagnoses and minimal lymph node involvement contributed to 26% and 16% longer survival times, respectively. Other significant factors included absence of recurrence (19% longer survival), HER2 negativity (20% longer survival), absence of the Triple Negative subtype (15% longer survival), and lower tumor grades (11% longer survival). By effectively shrinking less relevant variables, LASSO mitigated overfitting while preserving critical predictors, reinforcing the importance of tumor characteristics and molecular markers in survival outcomes. The study highlights the crucial role of risk stratification, as patients categorized into Low, Medium, and High-risk groups exhibit distinct survival patterns, aligning with the Extreme Value AFT Frailty Model. The forest plot analysis further validates the strong impact of significant covariates, with Competing Risks, Lymph Node Involvement, and Metastasis emerging as the most critical prognostic factors. Kaplan–Meier survival analysis reveals sharp survival declines associated with metastasis, lymph node involvement, tumor grade, HER2 status, and molecular subtypes, reinforcing the urgent need for early detection and targeted interventions. Notably, patients with Triple Negative and HER2-overexpressing subtypes exhibit the poorest survival outcomes, highlighting the necessity for subtype-specific therapies.
Additionally, competing risks, particularly hospitalization-related factors, substantially impact survival, emphasizing the need for integrated treatment approaches. These findings emphasize the role of advanced statistical techniques in improving survival predictions, providing valuable insights that can enhance clinical decision-making in breast cancer prognosis and broader medical research.
Introduction
Survival analysis is essential for understanding time-to-event data, particularly in medical research, where predicting patient outcomes is critical. Traditional survival models, such as the Cox proportional hazards model, have been widely utilized; however, they depend on the proportional hazards assumption, which often fails to hold in real-world situations, especially when unobserved heterogeneity (frailty) is present. Frailty, which represents unaccounted random variability, can significantly impact model outcomes, leading to biased or inaccurate predictions if not adequately addressed [21]. To mitigate this issue, parametric survival models, such as the Accelerated Failure Time (AFT) model, have gained popularity. AFT models do not rely on the proportional hazards assumption and estimate survival times directly, making them more adaptable in scenarios where covariates influence survival in a direct manner (Collett [10]). These models are particularly valuable in oncology, where understanding the time until events such as recurrence or death is crucial. Frailty models introduce a latent random effect to account for unobserved heterogeneity, often modeled using a Gamma distribution [48]. This approach is particularly beneficial in datasets where individuals share unmeasured risk factors, such as patients within the same healthcare facility. Recent advancements have incorporated Extreme Value distributions into frailty models to better capture the variability in survival times, especially in datasets characterized by high levels of frailty or extreme events [13]. These models have demonstrated improved predictive accuracy, particularly in high-frailty datasets. Furthermore, regularization techniques such as LASSO, Ridge, and Elastic Net have emerged to mitigate multicollinearity and overfitting in high-dimensional datasets, which are common in cancer research [46, 54].
These methods enhance model performance by shrinking coefficients, thereby reducing the influence of non-informative predictors and improving interpretability. Several studies have advanced the understanding of Accelerated Failure Time (AFT) and frailty models. Senyefia et al. [41] compared various AFT models in breast cancer survival analysis, identifying the Gompertz model as the best fit. Meanwhile, Mahmoodi et al. [34] demonstrated that AFT models serve as robust alternatives to Cox models, particularly when the proportional hazards assumption is violated. Crowther et al. [11] introduced a flexible parametric AFT model utilizing restricted cubic splines, which enhances flexibility in modeling baseline hazards and time-dependent effects. Despite these contributions, significant gaps persist, especially in addressing unobserved frailty and managing high-dimensional data. Our research aims to bridge these gaps by incorporating the Extreme Value Frailty AFT model alongside regularization techniques such as LASSO and Elastic Net to mitigate multicollinearity and improve model generalizability in high-dimensional survival data. The work of Keiding et al. [26] further underscores the limitations of frailty models when heterogeneity is inadequately addressed, suggesting that Accelerated Failure Time (AFT) models may offer a more stable and interpretable alternative. Chen et al. [9] proposed a generalized gamma frailty distribution; however, its computational complexity remains a significant barrier. Gallardo and Bourguignon [15] introduced the shared weighted Lindley frailty model, which incorporates a frailty term to model unobserved heterogeneity in clustered survival data. This model is particularly useful in accounting for intra-cluster correlations, thereby providing more precise estimates in the presence of unobserved factors.
Chen and Qiu [8] developed an Accelerated Failure Time (AFT) model tailored for length-biased and partly interval-censored survival data with mismeasured covariates, addressing challenges prevalent in cohort studies. Their approach effectively corrects biases arising from non-representative sampling and interval censoring, enhancing the accuracy of survival estimates. However, this model does not account for unobserved heterogeneity through frailty parameters. Further extending the application of the weighted Lindley distribution, Mota et al. [37] proposed a cure rate frailty regression model. This model is designed to handle scenarios where a proportion of subjects are immune to the event of interest, offering a more comprehensive understanding of survival data in medical research. These advancements collectively contribute to the refinement of survival analysis methodologies, each addressing specific complexities such as biased sampling, interval censoring, and unobserved heterogeneity. Pereira et al. [39] contributed to the advancement of survival modeling by introducing the weighted Lindley frailty model, specifically for industrial reliability data. Their study explored accelerated failure time (AFT) models with and without frailty, comparing the conventional AFT model, the AFT model with Gamma frailty, and their proposed AFT model incorporating weighted Lindley frailty. A key advantage of their model is its closed-form Laplace transform, which enhances analytical tractability.
By integrating the intensity function of a power law process, the proposed framework maintains the interpretability of traditional AFT models, where covariates influence the acceleration or deceleration of failure times. Additionally, the study employed parametric approaches for model fitting, enabling efficient estimation of regression parameters and baseline intensity function parameters. Recent research has focused on minimizing bias in Accelerated Failure Time (AFT) models by addressing issues such as measurement errors, biased sampling, and nonlinear covariate effects. A study by Chen & Huang [6] introduces AFFECT, an R package designed for AFT models that handle error-contaminated survival times, particularly in gene expression studies. The model accounts for mismeasurement errors in survival data, improving the estimation of failure times. Chen [7] also extends AFT models to handle error-prone response variables and nonlinear covariates, providing a flexible survival modeling framework.
Accelerated Failure Time (AFT) models are widely used in survival analysis to assess the impact of covariates on survival times. However, a significant limitation of traditional AFT models is their inability to account for unobserved heterogeneity among subjects, which can lead to biased estimates and incorrect inferences [26]. Frailty parameters are instrumental in addressing this issue by modeling random effects and accounting for variability arising from unobserved covariates and latent risk factors [29]. A frailty-enhanced AFT model would provide a more comprehensive framework by accounting for individual or group-level random effects, improving the model’s ability to handle latent variability. Without a frailty component, the model may have limitations in adjusting for latent risk factors, multicenter effects, and unexplained variation between clusters or study sites. In multi-institutional clinical studies, ignoring frailty may lead to underestimation or overestimation of survival probabilities, reducing the generalizability of the findings. The AFFECT framework of Chen & Huang [6] does not explicitly incorporate frailty parameters, so it may be limited in addressing variability arising from unobserved covariates and latent risk factors, which could affect the robustness of survival estimates in complex biological datasets. Similarly, while Chen [7] improves estimation by addressing measurement errors and nonlinear effects, it does not explicitly incorporate frailty terms. In the absence of frailty modeling, these approaches may struggle to capture unobserved heterogeneity among study subjects, potentially leading to biased parameter estimates, particularly in longitudinal or clustered survival data [34, 36, 47, 50, 51, 52].
Regularization techniques, such as the Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regression, introduce a penalty term to the loss function, effectively constraining the magnitude of the estimated coefficients. This constraint discourages the model from becoming overly complex, thereby mitigating the risk of overfitting and enhancing the model's generalizability to new data. For instance, LASSO regularization not only addresses overfitting but also performs variable selection by shrinking some coefficients to zero, simplifying the model and improving interpretability. In the context of AFT models, incorporating regularization has been shown to improve predictive performance and provide more stable estimates, especially in high-dimensional scenarios. Conversely, AFT models that do not incorporate regularization may struggle with multicollinearity among covariates, leading to unstable parameter estimates and difficulties in determining the individual effect of each covariate. By adding a layer of regularization, these models can handle multicollinearity more effectively, resulting in more reliable and interpretable parameter estimates. Therefore, integrating regularization techniques into AFT models is crucial for enhancing their robustness, especially when dealing with complex or high-dimensional data [14, 23, 33, 44, 53]. In survival analysis, the Accelerated Failure Time (AFT) model serves as a powerful alternative to the traditional Cox Proportional Hazards (PH) model, especially in cases where the proportional hazards assumption does not hold or when direct estimation of survival times is necessary [7]. Unlike the Cox model, which focuses on estimating hazard ratios without explicitly modeling survival time, AFT models directly assess the impact of covariates on survival duration, making them more interpretable for time-to-event analysis [8]. 
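The different shrinkage behavior of the three penalties can be illustrated with a minimal sketch. It assumes a single coefficient with unpenalized estimate z and a unit-curvature quadratic loss, which is a textbook simplification and not the estimation procedure used in this study; the function names and the mixing parameter alpha are illustrative. The sketch is written in Python for illustration only.

```python
def soft_threshold(z, lam):
    """LASSO update: argmin over b of 0.5*(b - z)**2 + lam*abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(z, lam):
    """Ridge update: argmin over b of 0.5*(b - z)**2 + 0.5*lam*b**2."""
    return z / (1.0 + lam)  # shrinks toward zero but never reaches it


def elastic_net_shrink(z, lam, alpha):
    """Elastic Net update for the penalty lam*(alpha*|b| + 0.5*(1 - alpha)*b**2)."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized coefficients, chosen only for illustration.
unpenalized = [2.0, 0.3, -0.8]
lam = 0.5
lasso = [soft_threshold(z, lam) for z in unpenalized]
ridge = [ridge_shrink(z, lam) for z in unpenalized]
enet = [elastic_net_shrink(z, lam, 0.5) for z in unpenalized]
print(lasso, ridge, enet)
```

In this toy setting the LASSO update removes the weakest coefficient entirely, whereas Ridge keeps all three at reduced magnitude and Elastic Net falls between the two.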
To identify the most suitable AFT distribution for survival modeling, simulations are conducted to determine the most robust option [6]. The Extreme Value AFT model proves particularly effective in handling skewed survival distributions, as it better captures extreme events and accounts for unobserved heterogeneity, ensuring more reliable parameter estimates.
Recent research has made significant strides in incorporating regularization techniques for feature selection and predictive modeling in high-dimensional datasets. Sirimongkolkasem & Drikvandi [43] further explored regularization methods in high-dimensional data analysis, highlighting the advantages of de-biased LASSO while acknowledging its limitations in handling multicollinearity and multiple hypothesis testing. Li & Liu [30] introduced a connected network-regularized logistic regression (CNet-RLR) model, which embeds network connectivity constraints into the penalty function, improving the selection of structured features in cancer genomics. Similarly, Li & Liu [32] extended this approach through a connected network-constrained support vector machine (CNet-SVM), effectively integrating prior gene–gene interaction networks to enhance the identification of biomarker genes. While methods like CNet-RLR and CNet-SVM aim to enhance feature selection by incorporating network connectivity constraints, they present certain limitations. The CNet-RLR model, for instance, imposes inequality constraints to maintain connectivity between nodes during feature selection, which can lead to increased computational complexity and may not scale efficiently with large datasets [30]. Again, while these studies emphasize the importance of structured feature selection and regularization in high-dimensional datasets, they do not explicitly address survival modeling, particularly in the presence of unobserved heterogeneity and frailty effects. The omission of frailty components in these frameworks limits their applicability in survival analysis, where latent variability significantly impacts model accuracy. Our research builds on these advancements by systematically simulating AFT frailty models and applying regularization techniques such as LASSO, Ridge, and Elastic Net to estimate model coefficients. 
Using simulation studies across varying sample sizes (25%, 50%, and 75%) and robustness assessment through AIC, BIC, MAE, and MSE, we aim to identify the most stable frailty-based AFT model. This systematic approach ensures that our findings remain generalizable across different sample sizes and data structures, addressing key limitations in the reviewed studies and improving survival model reliability in high-dimensional medical datasets.
Li and Liu [31] employed Regularized Cox Proportional Hazards (RCPH) models to identify prognostic biomarkers for breast cancer using gene expression data. Their study constructed a gene regulatory network (GRN) with 1142 genes and identified 72 robust prognostic biomarkers using LASSO-RCPH, Elastic Net-RCPH, and other regularized methods. They validated these biomarkers through literature checks, BRCA-specific GRN, and functional enrichment analysis. Additionally, they developed a prognostic risk score (PRS) using Cox regression analysis, which effectively stratified high- and low-risk groups in both internal and external validation datasets. However, their approach primarily focused on gene-level prognostic biomarkers without incorporating clinical and pathological predictors, which are crucial in survival analysis. The current research fills this gap by evaluating key clinical predictors of breast cancer survival using AFT frailty models combined with LASSO, Ridge, and Elastic Net regularization. By integrating statistical survival modeling with variable selection techniques, this study offers a more comprehensive assessment of breast cancer prognosis, improving predictive accuracy and clinical applicability beyond genomic data alone.
Despite significant advancements in AFT modeling, several gaps remain unaddressed in the reviewed studies. Many existing AFT models, including those by Chen & Huang [6], Chen [7], and Chen & Qiu [8], have focused on handling measurement errors, biased sampling, and nonlinear covariates but have not incorporated frailty parameters. This omission limits their ability to account for unobserved heterogeneity, random effects, and multicenter variability, which are critical for improving model robustness and generalizability in survival analysis. Additionally, while some studies have proposed alternative parametric survival models, they often lack regularization techniques to mitigate issues such as multicollinearity and overfitting, particularly in high-dimensional datasets. Without these enhancements, model estimates may be unstable, leading to unreliable inferences. To bridge these gaps, this study introduces a frailty-based AFT framework that explicitly accounts for unobserved heterogeneity and random effects. Moreover, it incorporates regularization techniques such as LASSO and Ridge regression to improve parameter stability and predictive accuracy. By employing multiple robustness verification methods, including simulations with varying sample sizes and evaluation metrics such as AIC, BIC, MAE, and MSE, this research aims to provide a more comprehensive and reliable approach to survival modeling, particularly in medical and high-dimensional data contexts.
Study question
The central question of this study is: Which Accelerated Failure Time (AFT) frailty model most effectively accounts for unobserved heterogeneity in breast cancer survival data, and how do regularization techniques enhance model performance in high-dimensional contexts? Specifically, the study compares the performance of several AFT frailty models, including the Extreme Value Frailty AFT model, across various sample sizes (25%, 50%, and 75%), using metrics such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE). Additionally, the study examines how regularization techniques (LASSO, Ridge, and Elastic Net) refine the selection of key predictors of breast cancer survival while improving model parsimony, accuracy, and generalizability.
What is already known
Previous research has established the utility of frailty models, particularly for addressing unobserved heterogeneity in survival data. Models like Weibull, Log-logistic, and Gamma Accelerated Failure Time (AFT) frailty models are commonly applied in clinical and survival analyses to account for variability among individuals. However, their performance can be limited in datasets with high dimensionality or noise, where overfitting and multicollinearity pose challenges. Regularization methods, including LASSO and Ridge regression, have been shown to improve generalizability and parsimony in other statistical contexts. Despite this, their integration into AFT frailty models has been inadequately explored, especially in identifying and validating key predictors for survival outcomes in breast cancer patients.
What this study contributes
This study provides a comprehensive evaluation of frailty models, emphasizing their ability to identify and prioritize key predictors of breast cancer survival while addressing unobserved heterogeneity. By comparing the Extreme Value Frailty Accelerated Failure Time (AFT) model to established frailty models such as Weibull, Log-logistic, Gamma, and Gompertz, the study demonstrates the superior performance of the Extreme Value model across all sample sizes. It consistently achieves better model fit, as indicated by the lowest AIC and BIC values, and improved predictive accuracy, reflected in the lowest MAE and MSE scores. These findings confirm the robustness of the Extreme Value model in survival analysis. A significant contribution of this research lies in the identification of key predictors of breast cancer survival. The application of regularization techniques, including LASSO, Ridge, and Elastic Net, ensured that important predictors remained significant while mitigating risks associated with high dimensionality and overfitting. Notably, factors such as Competing Risks (comorbidities), Metastasis, Stage, and Lymph Node involvement consistently emerged as crucial determinants of survival outcomes. For example, patients without metastasis have an expected survival time that is 2.63 times longer than those with metastatic disease, while those without competing risks survive 2.34 times longer than those with such risks. Furthermore, survival time increases by approximately 26% for patients diagnosed at lower stages (0, I, and II) compared to those diagnosed at higher stages (III and IV). Similarly, patients with lower lymph node involvement (0 and 1) experience about 16% longer survival than those with greater involvement. Additional findings highlight the role of tumor characteristics, molecular markers, and disease progression in determining survival. 
Patients without recurrent disease exhibit a 19% longer expected survival time compared to those who experience recurrence. Those with HER2-negative status survive 20% longer than HER2-positive patients, while individuals without the Triple Negative (TRN) subtype have a 15% longer survival time compared to other molecular subtypes, including Luminal A, Luminal B, and HER2-positive overexpression. Additionally, patients with lower tumor grades (Grade I and II) have an 11% longer survival time than those with higher-grade tumors. These predictors were refined and validated through the LASSO regularization process, which excluded non-informative covariates such as Age, PR, and Hospitalization while preserving essential variables. This process enhanced the interpretability and clinical relevance of the model. Overall, this study underscores the importance of integrating advanced frailty models with regularization techniques to improve the understanding of key survival predictors in breast cancer research. The findings offer valuable insights that enhance prognosis accuracy, guide targeted interventions, and support clinical decision-making.
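In an AFT model, a coefficient β on the log-time scale corresponds to a time ratio of exp(β), which is how statements such as "2.63 times longer" or "26% longer" arise. The short Python sketch below shows the conversion; the coefficients are back-calculated from the time ratios reported above purely for illustration and are not the fitted estimates of the study.

```python
import math


def time_ratio(beta):
    """Multiplicative change in expected survival time for a one-unit covariate change."""
    return math.exp(beta)


def percent_longer(beta):
    """The same effect expressed as a percentage increase in survival time."""
    return (math.exp(beta) - 1.0) * 100.0


# Illustrative log-scale coefficients implied by the reported time ratios.
beta_no_metastasis = math.log(2.63)
beta_lower_stage = math.log(1.26)

print(time_ratio(beta_no_metastasis))
print(percent_longer(beta_lower_stage))
```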
Method
This study utilized both real and simulated survival data to evaluate the performance of Accelerated Failure Time (AFT) models that incorporate frailty components, along with regularization techniques to enhance model interpretability. The real dataset comprised breast cancer patients from Korle Bu Teaching Hospital in Ghana, recorded between [49] and [10]. This dataset included 558 patients and contained demographic, tumor, and treatment-related information, such as age, hormone receptor status (ER, PR, HER2), tumor stage, metastasis, and genetic factors. We generated a simulated dataset with survival times that included frailty values drawn from a Gamma distribution to capture unobserved heterogeneity. Both datasets were divided into training (70%) and validation (30%) sets. To ensure the robustness of the model evaluation, three different sample sizes (25%, 50%, and 75%) were simulated. Seven AFT models—Weibull, Log-logistic, Gamma, Gompertz, Log-normal, Extreme Value, and Generalized Gamma—were fitted, both with and without frailty components. The frailty term, modeled as a Gamma distribution, was integrated into the AFT models to account for unobserved heterogeneity, with particular emphasis on the Extreme Value Frailty AFT model, which is well-suited for datasets exhibiting high variability. Regularization techniques were employed to address multicollinearity and enhance model generalization. LASSO (Least Absolute Shrinkage and Selection Operator), which applies an L1 penalty, was utilized for variable selection by shrinking the coefficients of non-significant predictors to zero. Ridge regression, applying an L2 penalty, reduced all coefficients while retaining all variables. Elastic Net, which combines both L1 and L2 penalties, effectively managed correlated variables. Models were evaluated using the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE). 
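The four evaluation metrics follow their standard definitions, sketched below in Python for illustration (the study's analyses were run in R). The argument names are illustrative: log_lik is the maximized log-likelihood, k the number of estimated parameters, and n the number of observations.

```python
import math


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*logL (lower is better)."""
    return k * math.log(n) - 2 * log_lik


def mae(observed, predicted):
    """Mean Absolute Error between observed and predicted survival times."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean Squared Error between observed and predicted survival times."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical values, used only to show the calls.
print(aic(-50.0, 3), bic(-50.0, 3, 100))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

BIC penalizes additional parameters more heavily than AIC whenever ln(n) exceeds 2, that is, for samples of eight or more observations.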
In this analysis, predictive discrimination is summarized by the concordance index (C-index), which generalizes the area under the Receiver Operating Characteristic (ROC) curve to censored survival data. It quantifies a survival model’s capacity to correctly order individuals by their survival durations. A C-index of 0.5 suggests the model has no greater predictive power than random chance, while a value of 1.0 indicates perfect differentiation between patients with varying survival outcomes. All analyses were conducted using R packages such as frailtypack, survival, and glmnet.
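A minimal sketch of Harrell's C-index is given below, assuming the model outputs a predicted survival time for each patient, as an AFT model does. A pair is treated as comparable only when the patient with the shorter time had an observed event. The function name and toy data are illustrative, and the sketch is written in Python rather than the R packages used in the study.

```python
def concordance_index(times, events, predicted):
    """Harrell's C for predicted survival times (larger prediction = longer survival)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i had an observed event strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0   # predicted order matches observed order
                elif predicted[i] == predicted[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third patient is censored (event indicator 0).
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [1.0, 2.0, 3.0, 4.0]))
```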
Parametric models for AFT modeling: a look at the best fit
Fitting a variety of parametric models is essential for identifying the parametric family that most accurately describes the survival data in Accelerated Failure Time (AFT) modeling. To this end, the parametric distributions listed in Table 1 were analysed and used to fit the candidate AFT models, as alternatives to proportional hazards modeling, in order to determine the model that best fits the survival data. Table 1 presents key properties (hazard function, survival function, and probability density function) of the Weibull, Extreme-Value, Log-normal, Log-logistic, Gamma, Gompertz, and Generalized Gamma distributions. These distributions are widely used in AFT models to characterize survival times, and each captures a different underlying failure time mechanism. For instance, the Weibull distribution allows for increasing or decreasing hazard rates, while the Log-normal and Log-logistic distributions accommodate non-monotonic hazard functions. In frailty modeling, parametric assumptions about the survival distribution help in understanding how the inclusion of random effects influences survival probabilities and hazard rates. Additionally, AFT frailty models often require comparing different parametric distributions to determine the best fit for the data [22], making Table 1 a valuable reference for model selection. Furthermore, these distributions influence how frailty (unobserved heterogeneity) affects survival outcomes; the Gamma frailty model, for example, assumes a multiplicative effect on the hazard function, thereby modifying the baseline hazard derived from the chosen distribution [12].
Model overview and formulation
The Accelerated Failure Time (AFT) model posits that covariates can either accelerate or decelerate the life course of an event. In contrast to the proportional hazards model, which shifts the hazard function based on covariates, the AFT model assumes a multiplicative effect on survival time. This characteristic makes the AFT model particularly suitable for analysing how covariates influence the time until an event occurs, rather than focusing on the hazard rate. An Extreme Value Accelerated Failure Time (AFT) model assumes that the logarithm of survival time follows an Extreme Value distribution. In the frailty version of the AFT model, unobserved heterogeneity, referred to as frailty, is modelled using a Gamma distribution. Frailty represents unobserved factors that influence an individual's hazard rate.
The baseline Accelerated Failure Time (AFT) model (without frailty) for a continuous survival time \(T\) is given by Eq. (1).
\[\log T_i = X_i\beta + \varepsilon_i \tag{1}\]
where \(\varepsilon_i\sim \text{Extreme Value}(\mu,\sigma)\)
\(T_i\) is the survival time for individual \(i\)
\(X_i\) is the vector of covariates for individual \(i\)
\(\beta\) is a vector of regression coefficients
\(\varepsilon_i\) is the error term assumed to follow an Extreme Value distribution
In the frailty model, we introduce a frailty term that accounts for unobserved heterogeneity. Here, we assume the frailty follows a Gamma distribution with shape parameter \(\theta\) and scale parameter \(1/\theta\), so that its mean is 1. The frailty affects the survival time multiplicatively. The survival time for the ith individual is expressed as Eq. (2).
\[T_i = T_0\,\exp\left(X_i\beta\right) Z_i \tag{2}\]
Where:
\(Z_i\sim \text{Gamma}(\theta,1/\theta)\) is the frailty term for individual \(i\)
\(T_0\) is the baseline survival time (in the absence of covariates and frailty)
\(\exp\left(X_i\beta\right)\) is the effect of covariates on the survival time
Taking the logarithm of Eq. (2) yields Eq. (3) as follows.
\[\log T_i = \log T_0 + X_i\beta + \log Z_i \tag{3}\]
Where:
\(\log Z_i\) is the frailty component (unobserved heterogeneity)
\(X_i\beta\) represents the covariate effects
\(\log T_0\) is the baseline survival time in log form
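The log-linear form, in which log survival time is the sum of a baseline term, the covariate effect, and the log frailty, can be made concrete with a small simulation. The sketch below is illustrative only and is not the study's simulation design: it assumes a single binary covariate, a smallest extreme value (Gumbel minimum) distribution for the log baseline time, and hypothetical parameter values. It is written in Python using only the standard library.

```python
import math
import random


def simulate_frailty_aft(n, beta, theta, mu, sigma, seed=0):
    """Draw (x, z, t) triples from log T = log T0 + x*beta + log Z."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.choice([0.0, 1.0])                # hypothetical binary covariate
        z = rng.gammavariate(theta, 1.0 / theta)  # Gamma frailty: shape theta, scale 1/theta
        # Smallest extreme value draw: mu + sigma*log(E), with E ~ Exponential(1).
        log_t0 = mu + sigma * math.log(rng.expovariate(1.0))
        log_t = log_t0 + x * beta + math.log(z)
        rows.append((x, z, math.exp(log_t)))
    return rows


rows = simulate_frailty_aft(n=5000, beta=0.5, theta=2.0, mu=1.0, sigma=0.5, seed=42)
mean_z = sum(z for _, z, _ in rows) / len(rows)
print(mean_z)  # sample mean of the frailty, which has theoretical mean 1
```

With shape θ and scale 1/θ, the frailty has mean 1 and variance 1/θ, so larger values of θ correspond to less unobserved heterogeneity.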
The error term \(\varepsilon_i\) follows an Extreme Value Distribution with the probability density function (PDF) expressed in Eq. (4).
Where: \(\mu\) is the location parameter
\(\sigma\) is the scale parameter
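For reference, one standard form of the Extreme Value density with location \(\mu\) and scale \(\sigma\) is the smallest extreme value (Gumbel minimum) form, under which \(\exp(\varepsilon_i)\) follows a Weibull distribution with shape \(1/\sigma\) and scale \(e^{\mu}\). The expressions below assume that convention; the parameterization used in the study's Eq. (4) may differ.

```latex
\begin{align*}
f(\varepsilon_i) &= \frac{1}{\sigma}
  \exp\!\left[\frac{\varepsilon_i-\mu}{\sigma}
  - \exp\!\left(\frac{\varepsilon_i-\mu}{\sigma}\right)\right],
  \qquad -\infty < \varepsilon_i < \infty, \\
P(\varepsilon_i > w) &= \exp\!\left[-\exp\!\left(\frac{w-\mu}{\sigma}\right)\right].
\end{align*}
```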
The frailty term \({Z}_{i}\) follows a Gamma distribution expressed as Eq. (5).
\(\theta\) is the shape parameter of the Gamma distribution; with scale \(1/\theta\) the frailty has mean 1 and variance \(1/\theta\), so \(\theta\) governs the frailty variance.
\(\Gamma (\theta)\) is the Gamma function.
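The parameterization \(Z_i \sim \text{Gamma}(\theta, 1/\theta)\) can be checked by simulation. The minimal sketch below draws frailties with shape \(\theta\) and scale \(1/\theta\) and compares the sample mean and variance with the theoretical values 1 and \(1/\theta\). The value \(\theta = 2\) and the function name are illustrative assumptions, not estimates from the study.

```python
import random
import statistics


def draw_frailties(theta, n, seed=0):
    """Draw n Gamma frailties with shape theta and scale 1/theta."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    return [rng.gammavariate(theta, 1.0 / theta) for _ in range(n)]


theta = 2.0  # illustrative value only
frailties = draw_frailties(theta, 50_000)
sample_mean = statistics.fmean(frailties)      # theory: 1
sample_var = statistics.pvariance(frailties)   # theory: 1 / theta
```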
The survival function S(T) for the frailty model can be written as the conditional survival functions given frailty \({Z}_{i}\) as expressed in Eq. (6).
To obtain the marginal survival function (over the frailty distribution) we integrate out the frailty term as expressed in Eq. (7).
Substituting the Gamma frailty distribution of \(Z_{i}\) and the conditional survival function yields Eq. (8).
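Integrating out the frailty can also be approximated numerically. The sketch below is an illustration, not the authors' derivation: it averages an assumed conditional survival function over Gamma frailty draws. It assumes a minimum extreme value error on the log-time scale and a frailty entering as \(\log Z_i\); the function names and parameter values are hypothetical.

```python
import math
import random


def conditional_survival(t, z, lin_pred, mu=0.0, sigma=1.0):
    """S(t | z): survival at time t given frailty z (assumed model form)."""
    w = (math.log(t) - lin_pred - math.log(z) - mu) / sigma
    # Cap w so math.exp cannot overflow; survival is effectively 0 there.
    return math.exp(-math.exp(min(w, 700.0)))


def marginal_survival(t, lin_pred, theta, mu=0.0, sigma=1.0,
                      n_draws=20000, seed=1):
    """Monte Carlo average of S(t | z) over Gamma(theta, 1/theta) frailties."""
    rng = random.Random(seed)  # same seed for every t: common random numbers
    total = 0.0
    for _ in range(n_draws):
        z = rng.gammavariate(theta, 1.0 / theta)
        total += conditional_survival(t, z, lin_pred, mu, sigma)
    return total / n_draws


time_grid = (0.5, 1.0, 2.0, 4.0)
survival_curve = [marginal_survival(t, lin_pred=0.5, theta=2.0)
                  for t in time_grid]
```

Reusing the same seed at every time point means each curve value is averaged over the same frailty draws, so the estimated marginal survival curve is guaranteed to be non-increasing in time.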
The hazard function h(T) is the instantaneous rate of failure at time T. For the Extreme Value AFT model with frailty, the conditional hazard function (given frailty \(Z_i\)) is expressed in Eq. (9).
To estimate the parameters \(\beta\) (regression coefficients), \(\theta\) (frailty parameter), and \(\sigma\) (scale parameter of the Extreme Value distribution), maximum likelihood estimation (MLE) is typically employed. The likelihood function is formulated based on the survival and hazard functions, explicitly incorporating the frailty term to account for unobserved heterogeneity. The likelihood for the observed data \((T_{i},\delta_{i})\), where \(\delta_{i}\) is the event indicator (1 if the event occurs, 0 if censored), is expressed in Eq. (10).
Where
\(\delta_{i}=1\) is for observed events and
\(\delta_{i}=0\) is for censored observations.
The log-likelihood is maximised with respect to \(\beta\), \(\theta\), and \(\sigma\) to obtain the parameter estimates.
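The role of the event indicator can be made concrete with a censored log-likelihood: an observed event contributes the log-density and a censored observation contributes the log-survival. The sketch below does this for the baseline Extreme Value AFT model only, leaving the frailty term out for brevity. The minimum extreme value form and all function and argument names are assumptions for illustration.

```python
import math


def ev_aft_loglik(times, events, covariates, beta, mu, sigma):
    """Censored log-likelihood of a minimum extreme value AFT model.

    times      : observed times T_i (all > 0)
    events     : delta_i, 1 for an observed event and 0 for censoring
    covariates : one list of covariate values per individual
    """
    total = 0.0
    for t, delta, x in zip(times, events, covariates):
        lin_pred = sum(b * v for b, v in zip(beta, x))
        w = (math.log(t) - lin_pred - mu) / sigma
        log_surv = -math.exp(w)                    # log S(t)
        log_dens = w - math.exp(w) - math.log(sigma * t)  # log f(t)
        total += delta * log_dens + (1 - delta) * log_surv
    return total


# Two individuals at t = 1 with no covariate effect: one event, one censored.
example_loglik = ev_aft_loglik(
    times=[1.0, 1.0], events=[1, 0], covariates=[[0.0], [0.0]],
    beta=[0.3], mu=0.0, sigma=2.0,
)
```

In practice this function would be passed to a numerical optimizer; extending it to the frailty model requires replacing the density and survival terms with their marginal counterparts.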
Ridge Regularisation (L2 Regularisation)
Ridge regularization, also known as L2 regularization, penalizes the sum of the squared coefficients. The regularization term helps prevent overfitting by shrinking the coefficients of less important features towards zero, though they never become exactly zero.
The objective function for Ridge regression is represented by Eq. (11).
- \({y}_{i}\): the actual response for the i-th observation
- \({x}_{ij}\): the j-th feature of the i-th observation
- \({\beta }_{j}\): the coefficient for the j-th predictor
- \(\lambda\): the tuning parameter (regularisation parameter) that controls the strength of the penalty. Higher values of \(\lambda\) shrink the coefficients more.
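For orientation, the textbook least-squares form of the Ridge objective that uses these symbols is sketched below, with \(\beta_0\) taken to be the intercept and \(\beta_j\) the j-th coefficient (an assumption about the notation). For a likelihood-based model such as the AFT frailty model, a common choice is to replace the squared-error term with the negative log-likelihood; whether Eq. (11) is written that way is not stated here.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}
    \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2} \right\}
```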
LASSO (Least Absolute Shrinkage and Selection Operator), also known as L1 regularization, penalizes the absolute values of the coefficients. This results in some coefficients being reduced to zero, effectively performing feature selection.
The objective function for LASSO regression is represented by Eq. (12).
- \(\lambda\): the tuning parameter that controls the strength of the penalty. Higher values of \(\lambda\) shrink the coefficients more.
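The difference between the two penalties is easiest to see for a single coefficient. Assuming an unpenalized estimate `b`, the sketch below solves the one-dimensional problems of minimizing \((b-\beta)^2 + \lambda|\beta|\) (LASSO) and \((b-\beta)^2 + \lambda\beta^2\) (Ridge) in closed form. This simplified setting and the function names are illustrative and are not the estimation routine used in the study.

```python
import math


def lasso_1d(b, lam):
    """argmin over beta of (b - beta)**2 + lam * abs(beta): soft-thresholding."""
    shrunk = abs(b) - lam / 2.0
    return math.copysign(shrunk, b) if shrunk > 0 else 0.0


def ridge_1d(b, lam):
    """argmin over beta of (b - beta)**2 + lam * beta**2: proportional shrinkage."""
    return b / (1.0 + lam)


lam = 1.0
for b in (0.3, -2.0):
    print(f"b = {b:+.2f}: lasso = {lasso_1d(b, lam):+.2f}, "
          f"ridge = {ridge_1d(b, lam):+.2f}")
```

A small estimate such as 0.3 is set exactly to zero by the L1 penalty but only halved by the L2 penalty, which is why LASSO performs variable selection and Ridge does not.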
Elastic Net is a hybrid regularization technique that combines both Ridge (L2) and LASSO (L1) methods. It imposes penalties on both the sum of the squared coefficients (as in Ridge) and the sum of the absolute values of the coefficients (as in LASSO), while preserving some grouped feature information characteristic of Ridge regression.
The objective function for the Elastic Net is represented by Eq. (13).
- \({\lambda }_{1}\): controls the LASSO (L1) penalty
- \({\lambda }_{2}\): controls the Ridge (L2) penalty
- Typically, a combination of \({\lambda }_{1}\) and \({\lambda }_{2}\) is used to tune the model’s performance.
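The way the two tuning parameters combine can be sketched as a penalty function that is added to the model's loss. The function name and the example coefficients below are hypothetical; setting `lam2 = 0` recovers the LASSO penalty and `lam1 = 0` recovers the Ridge penalty.

```python
def elastic_net_penalty(beta, lam1, lam2):
    """lam1 * sum(|beta_j|) + lam2 * sum(beta_j ** 2)."""
    l1_term = sum(abs(b) for b in beta)
    l2_term = sum(b * b for b in beta)
    return lam1 * l1_term + lam2 * l2_term


beta = [1.0, -2.0, 0.0]                                   # illustrative values
mixed = elastic_net_penalty(beta, lam1=0.5, lam2=0.25)    # both penalties
lasso_only = elastic_net_penalty(beta, lam1=0.5, lam2=0.0)
ridge_only = elastic_net_penalty(beta, lam1=0.0, lam2=0.25)
```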
Frailty models compensate for unspecified error distributions
Frailty models address the issue of an unspecified noise distribution by incorporating random effects, heterogeneity, unobserved heterogeneity, and covariates. These mechanisms provide a robust framework for handling variability that cannot be explained by observed factors alone.
Frailty as a random effect to absorb noise variability
Frailty models introduce random effects to account for the influence of unmeasured factors affecting survival times. The frailty term is typically assumed to follow a known distribution (e.g., Gamma, Log-normal, or Inverse Gaussian), which absorbs the impact of unspecified noise distributions [12]. By doing so, the model mitigates the effects of unknown error structures, ensuring that unobserved factors influencing survival times are adequately captured (Wienke [50]). For example, in clustered survival data (such as patients from the same hospital or individuals from the same family), frailty terms allow dependence between individuals without requiring a strict assumption on the error term’s distribution [22]. This flexibility is crucial in settings where survival times exhibit unexplained correlation due to shared but unmeasured risk factors.
Heterogeneity in survival times and model flexibility
Frailty models account for heterogeneity by allowing variations in survival probabilities that are not explained by measured covariates [48]. If the noise term were explicitly modeled with a strict parametric distribution, it might fail to capture the full range of survival variability. However, by including frailty components, the model compensates for this by introducing subject-specific or group-specific random effects (Therneau & Grambsch [45]). For instance, in medical studies, frailty models explain why some individuals have significantly longer survival times despite similar covariate profiles: the difference is due to heterogeneous risks that remain unobserved [17, 25].
Accounting for unobserved heterogeneity
Unobserved heterogeneity, if ignored, can lead to biased survival estimates and incorrect hazard function interpretations [1]. Frailty models explicitly address this issue by incorporating a latent variable (frailty term) that captures individual or group-level variations. The latent frailty term acts as a substitute for an unspecified error distribution by shifting unexplained variability into a structured random effect (Klein & Moeschberger [27]). For example, in population-based mortality studies, frailty terms explain why certain groups exhibit lower or higher mortality rates than predicted by observed covariates alone. By incorporating frailty, researchers obtain more realistic survival estimates that reflect unobserved biological or environmental differences [48].
Covariate adjustment without relying on a fully specified error structure
Frailty models also adjust for observed covariates while accommodating an unspecified noise distribution. Unlike strict parametric survival models that require a fully defined error term, frailty models allow for flexible covariate effects by introducing a latent frailty term. This allows for correct parameter estimation even when the true noise distribution remains unknown (Therneau & Grambsch [45]). For instance, Bayesian frailty models use hierarchical structures where the frailty term is estimated alongside covariate effects, thereby reducing dependence on explicit error distribution assumptions [24]. Such models provide robust predictions while handling correlated survival times in clustered data [17, 25].
In summary, frailty models provide a natural solution to dealing with an unspecified error distribution by incorporating random effects, capturing heterogeneity, accounting for unobserved heterogeneity, and adjusting for covariates flexibly. These mechanisms ensure that survival analysis remains robust even when the true noise distribution is unknown, as supported by numerous studies in survival modeling and biostatistics.
The importance of frailty in disease dynamics modeling
A frailty model is a multiplicative hazard model that has three parts: a frailty (random effect), a baseline hazard function, and a part that models the influence of observable variables (fixed effects) (Adham [2]). Frailty is a latent random variable that accounts for unobserved heterogeneity and unobserved variables in survival modeling [3, 51, 40]. The inclusion of a frailty component in disease modeling accounts for the heterogeneity that occurs among patients. The population is made up of people who are at various levels of risk. As a result, it is critical to treat the population as heterogeneous, that is, as a mix of people who face different risks. Heterogeneity arises due to variations in patients as well as provider care and other latent factors [18]. In real-world medical practice, patient variables are highly heterogeneous and time-variant, with hazard rates that may approach, diverge from, or even intersect one another (time-varying covariates). In this case, the Cox PH model in its original form will not be appropriate [38]. Past research has revealed that patients naturally differ substantially (heterogeneity) with regard to the effects of a medicine, a treatment, or the influence of multiple covariates, such as the multi-centre effect and the absence of influential covariates [16, 19, 35, 51]. Higher mortality at younger ages has been linked to greater covariate heterogeneity (Wienke [50]). Including frailties in the model makes it possible to measure the covariate effects correctly and to avoid underestimation or overestimation of the parameters, by fully accounting for the true differences in risk and survival (Ulviya [47]). In survival studies, when unobserved heterogeneity (frailty) is overlooked, bias is introduced into the estimates, leading to misleading results [4]. If heterogeneity is not taken into account in the model, the estimated influence of known factors on relative risk is also reduced [21].
Patients' covariates vary substantially in medical practice, which may require existing models to be modified to include univariate frailty in the model formulation [16]. The Gamma frailty distribution outperforms Gaussian, lognormal and other frailty distributions. The Gamma distribution is the most commonly utilized frailty distribution because frailties in the conditional likelihood can be integrated out, yielding simple unconditional likelihood expressions whose maximum may be employed for estimation [3].
Results
Simulated results of tentative frailty AFT models against metrics
The study evaluated seven Accelerated Failure Time (AFT) models—Weibull, Log-logistic, Gamma, Gompertz, Log-normal, Extreme Value, and Generalized Gamma—using simulated data across various partition sizes. These models were assessed for their robustness against key evaluation metrics, including the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE). The Gamma distribution was utilized to account for unobserved heterogeneity and random effects in the frailty components. The resulting plots and estimates are presented in this section, providing a comprehensive comparison of model performance across the different metrics, ensuring both fit and predictive accuracy. These analyses demonstrate each model's ability to manage variability within the data and account for random effects through frailty terms.
Figures 1 and 2, along with Table 2, demonstrate that the Extreme Value AFT model consistently exhibits superior performance across all metrics, establishing it as the most robust model in terms of both goodness of fit and prediction accuracy. It records the lowest AIC and BIC values across all sample sizes, indicating an optimal balance between model complexity and fit. For example, at the 25% sample size, the AIC of the extreme model is 100.41, significantly outperforming other models. Similarly, its BIC is the lowest at 107.46, further confirming its parsimony. Additionally, regarding predictive performance, the extreme model achieves the smallest MAE and MSE values across the board, demonstrating its ability to produce accurate and reliable predictions consistently. With MAE values as low as 0.42 and MSE values around 0.37 at the 25% sample size, it stands out as the most dependable model for minimizing errors. The log-logistic model also emerges as a strong contender, particularly for larger sample sizes. Although its AIC and BIC values are slightly higher than those of the extreme model, they remain relatively low, indicating that the log-logistic model is well-fitted with moderate complexity. For instance, at the 25% sample size, the log-logistic model has an AIC of 350.80 and a BIC of 361.37. In terms of prediction accuracy, its MAE and MSE values are competitive, though not as low as those of the extreme model. The log-logistic model consistently produces low error rates, making it a solid second choice, especially when slightly larger sample sizes are utilized. However, the extreme model remains the best overall performer across both small and large sample sizes.
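The four evaluation metrics follow standard definitions, sketched below. Here `loglik` is the maximized log-likelihood, `k` the number of estimated parameters, and `n` the sample size; the function names and the toy numbers are illustrative assumptions, and the sketch does not show how censored observations were handled when computing MAE and MSE in this study.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Toy numbers, not taken from Table 2.
toy_aic = aic(loglik=-45.0, k=3)
toy_bic = bic(loglik=-45.0, k=3, n=100)
toy_mae = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
toy_mse = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Lower values are better for all four metrics. BIC penalizes each extra parameter by \(\ln n\) rather than 2, so it favours smaller models once \(n\) exceeds about 7.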
Table 3 presents the results of assessing different data partitioning strategies (50–30–20, 60–20–20, and 70–20–10) to determine the most robust split for training, testing, and validation before estimating model parameters. The choice of partitioning significantly impacts model performance, influencing both goodness-of-fit and predictive accuracy. The goal is to identify the partition that provides the best balance between model training, generalization, and validation accuracy. A well-chosen partition ensures that the model learns effectively from the training set while maintaining strong predictive performance on unseen data. Tables 2 and 3 illustrate the performance of various AFT frailty models across different sample sizes, using key evaluation metrics such as AIC, BIC, MAE, and MSE. The results highlight the superior performance of the Extreme Value AFT model across all sample sizes, demonstrating its robustness in both goodness-of-fit and predictive accuracy. The partitioning analysis further refines model selection by showing that the 70–20–10 split yields the lowest AIC (1578.6) and BIC (1595.4), suggesting it provides the most efficient balance between model complexity and fit. This implies that allocating 70% of the data for training enhances model learning, while the 20% testing and 10% validation sets guard against overfitting and support generalizability. The inference drawn from this analysis underscores the importance of systematic partitioning in survival modeling, ultimately leading to a more reliable and interpretable predictive framework.
Application of AFT frailty model to Ghana's breast cancer data
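A three-way partition of this kind can be produced by shuffling the row indices once and slicing them. The sketch below is a minimal illustration with a hypothetical function name and an arbitrary seed; it does not reproduce the authors' actual sampling procedure.

```python
import random


def three_way_split(n, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle indices 0..n-1 and split them into train/test/validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n))
    n_test = int(round(fractions[1] * n))
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # whatever remains
    return train, test, validation


train_idx, test_idx, val_idx = three_way_split(1000)
```

Fixing the seed makes the partition reproducible, so that competing models are compared on identical training, testing, and validation sets.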
This section builds upon the previous analysis, in which simulated datasets of varying sample sizes were utilized to identify the most robust Frailty AFT model. Based on the evaluation metrics—Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE)—the Extreme Value Frailty AFT model was selected as the top performer. In this section, we present the parameter estimates of the chosen model and provide a detailed interpretation of the results. To enhance estimation accuracy and mitigate the risk of overfitting, we applied regularization techniques, including LASSO, Ridge, and Elastic Net, to optimize and validate the parameter estimates. Model selection was based on AIC and BIC criteria to ensure an optimal balance between model fit and complexity. We also present diagnostics for the chosen model to evaluate its performance and confirm its validity. Regularization methods can help reduce overfitting and improve generalization by shrinking the coefficients of less important features. This often leads to enhanced model performance, particularly when dealing with a large number of features. Elastic Net, which combines LASSO and Ridge regularization, effectively balances the shrinking of coefficients (as seen in LASSO) while retaining some variables (as in Ridge).
The results presented in Table 4 provide key insights into factors influencing survival time among breast cancer patients. The Extreme Value AFT Frailty Model identifies Competing Risks, Metastasis, Cancer Stage, and Lymph Node Involvement as the most significant predictors of reduced survival time. The negative estimate for Competing Risks (− 0.97, p < 0.001) suggests that patients facing additional health risks experience significantly shorter survival durations. The strong statistical significance (p < 0.001) underscores the importance of accounting for competing risks in survival analysis. Similarly, Metastasis (− 1.21, p < 0.001) substantially accelerates disease progression, leading to reduced survival times. While the confidence interval for Metastasis (− 1.67 to − 0.75) indicates some variability in this effect, its strong significance remains evident. Cancer Stage (− 0.29, p = 0.003) is also associated with a shorter survival time, reinforcing the importance of early detection and intervention in improving prognosis. Additionally, Lymph Node Involvement (− 0.19, p = 0.02) significantly reduces survival duration, though its effect size is smaller compared to Metastasis and Competing Risks. The AFT model estimates acceleration factors, meaning negative coefficients indicate a reduction in survival time rather than an increased risk of event occurrence, as seen in Cox models. Therefore, the interpretations focus on time compression rather than hazard ratios. The findings highlight the dominant role of Metastasis, Cancer Stage, and Lymph Node Involvement in determining survival duration among breast cancer patients. These results emphasize the critical need for early detection, aggressive treatment of metastasis, and targeted management strategies to improve patient outcomes. Future research should explore interaction effects, incorporate additional clinical markers, and refine missing data handling to further enhance predictive accuracy. 
The frailty parameter in survival analysis represents unobserved heterogeneity among individuals, accounting for differences in patient risk that are not explained by the covariates included in the model. In this analysis, the frailty variance is estimated at 0.42, indicating a moderate level of variability due to unmeasured factors. This means that even though the model incorporates multiple explanatory variables, there are still underlying influences on survival that remain unidentified. The presence of this random effect helps to correct for these hidden variations, ensuring that the survival estimates are not biased by unknown confounders. The random effect parameter plays a crucial role in addressing patient-level heterogeneity, multicenter effects, and the influence of unobserved covariates. In a clinical setting, patients often differ in ways that are difficult to measure, such as genetic predisposition, underlying health conditions, or differences in healthcare access. Additionally, when data is collected from multiple centers, there may be variations in treatment protocols, diagnostic accuracy, and patient management strategies that are not explicitly modeled. The frailty term helps to capture these site-specific or individual differences, preventing the model from making overly simplistic assumptions about patient risk. With a concordance index (C-index) of 0.882 and a standard error of 0.011, the model demonstrates a strong ability to discriminate between high- and low-risk individuals. This high predictive accuracy suggests that the included covariates are effective in explaining a large portion of the survival variability. However, the presence of frailty variance (Theta = 0.42) also highlights the fact that unmeasured factors still contribute to the differences in survival outcomes among patients. If Theta were closer to zero, it would imply that all variability is well captured by the model. 
Conversely, a much higher Theta would indicate substantial unobserved heterogeneity, signaling the need for additional explanatory variables. The implications of the frailty variance extend to both research and clinical practice. From a modeling perspective, acknowledging the presence of frailty ensures that covariate effects are estimated more accurately, as ignoring unobserved heterogeneity can lead to biased results. Clinically, the finding that unmeasured factors influence survival suggests the need for a more personalized approach to patient care. Patients with similar observed characteristics may still experience different outcomes due to hidden influences, meaning that individualized treatment strategies should consider factors beyond standard clinical assessments. The presence of frailty in this model suggests that further research could focus on identifying additional covariates that contribute to survival differences. Factors such as lifestyle choices, socioeconomic status, and genetic markers could be incorporated to refine risk prediction. In cases where multicenter data is used, adjustments for hospital-level effects and treatment differences could further improve model precision. By addressing these sources of heterogeneity, future models can enhance their predictive ability and provide more reliable insights for clinical decision-making.
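The concordance index reported above measures how often the model ranks a pair of patients correctly. A minimal sketch of Harrell's C-index is given below, assuming that a higher risk score means a shorter expected survival time (for an AFT model, the negative linear predictor could serve as such a score). The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:  # subject i failed before time of j
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable


# Toy data: four patients, the third is censored.
toy_c = concordance_index(
    times=[2.0, 4.0, 6.0, 8.0],
    events=[1, 1, 0, 1],
    risk_scores=[3.0, 4.0, 2.0, 1.0],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In the toy data, four of the five comparable pairs are ordered correctly.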
The QQ plot of residuals, as shown in Figs. 3 and 4, allows us to evaluate whether the residuals follow a normal distribution, which is a fundamental assumption underlying many statistical models. In this plot, most of the residuals align closely with the diagonal line, indicating that they approximate a normal distribution reasonably well. The survival and hazard function estimates presented in Figs. 5 and 6 provide valuable insights into the survival dynamics of the dataset. The survival function illustrates a gradual decline in survival probability over time, which is expected in survival analysis, where events (such as death or relapse) accumulate over time. The shape of the curve indicates that events accumulate steadily as time progresses. The hazard function, by contrast, begins at a high level and gradually decreases, indicating that the instantaneous risk of experiencing the event diminishes over time. This behavior is consistent with the Extreme Value Frailty AFT model, where individuals with higher frailty experience the event earlier, resulting in a lower-risk population over time. The smooth nature of these plots implies that the model effectively captures the overall trend in the data.
The application of regularization techniques, such as LASSO, Ridge, and Elastic Net, has introduced innovative solutions for addressing issues like multicollinearity, overfitting, and enhancing generalization in statistical models. These methods function by penalizing large coefficient estimates, effectively shrinking them toward zero to reduce model complexity without significantly compromising performance. LASSO (Least Absolute Shrinkage and Selection Operator) not only shrinks coefficients but also performs variable selection by setting some coefficients to zero, thereby eliminating less important predictors. For instance, variables such as AgeCAT, ER, and Hospitalization were assigned a value of zero in the LASSO model, indicating their negligible influence on outcomes, as illustrated in Table 5 (Hastie, Tibshirani, & Friedman [20]). In contrast, Ridge regression retains all variables while shrinking the magnitude of their coefficients, thereby striking a balance between overfitting and model complexity. Model selection based on the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), as presented in Table 6, highlights the effectiveness of regularization methods. In our study, LASSO achieved the lowest AIC (2710.431) and BIC (2745.026), indicating the optimal balance between model fit and complexity. Ridge and Elastic Net exhibited slightly higher values, with Elastic Net integrating LASSO’s variable selection with Ridge’s shrinkage, thereby retaining variables such as Tumor Size and Genetics that LASSO had discarded. These metrics, which assess the trade-off between model fit and complexity, underscore the significance of regularization in high-dimensional data [5].
The benefits of regularization become evident when comparing model estimates before and after applying these techniques. Initially, covariates such as AgeCAT, Menopause, and Ethnicity exhibited large confidence intervals and weak statistical significance. After applying regularization, LASSO and Elastic Net refined the model by eliminating or shrinking coefficients, resulting in a more parsimonious and interpretable framework. The regularized models, particularly LASSO, enhanced generalization by emphasizing key predictors such as CompetingR, Metastasis, and Stage, which maintained significant coefficients across all models. This approach is especially valuable in high-dimensional datasets where overfitting is a concern, as the reduction in variance leads to more stable predictions on new data [54]. In survival models, particularly the Accelerated Failure Time (AFT) frailty model, regularization is essential for enhancing both performance and interpretability. For instance, in the breast cancer dataset, regularization ensured that only the most critical variables were retained, thereby improving the model's ability to generalize across diverse patient populations. By employing LASSO and Elastic Net, which effectively manage multicollinearity and correlated predictors, the models identified the most predictive covariates while preserving interpretability [42]. This capability is crucial in medical research, where developing models that generalize well across various datasets is fundamental. To that end, regularization methods such as LASSO, Ridge, and Elastic Net have significantly advanced statistical modeling by providing effective solutions to issues of overfitting and multicollinearity. These techniques yield parsimonious and predictive models, with LASSO often being the preferred choice when simplicity is essential, while Elastic Net strikes a balance between feature selection and variable retention. 
Prior to the application of LASSO, the breast cancer dataset included numerous variables characterized by wide confidence intervals and weak significance. However, regularization streamlined the model by concentrating on the most relevant predictors, thereby enhancing generalization and mitigating overfitting. The reduction in dimensionality, coupled with improved model fit, underscores the transformative impact of regularization in predictive modeling. After applying LASSO, key predictors such as Competing Risk, Metastasis, Stage, and Lymph Node remained consistent with the original model, albeit with reduced coefficients. LASSO thus refined the model by shrinking or removing less important variables while preserving the influence of the most significant predictors, which improves both interpretability and the reliability of inference.
To derive meaningful interpretations from the LASSO estimates, exponentiating the coefficients is essential. Given that the Extreme Value Accelerated Failure Time (AFT) Frailty Model operates on a logarithmic scale, the estimated coefficients represent log-time ratios. By exponentiating these values, they are converted into time ratios, making it more intuitive to understand how each covariate influences survival time. For instance, the LASSO coefficient for Competing Risk (CompetingR) is 0.8493. Exponentiating this value (exp(0.8493) ≈ 2.34) suggests that individuals without competing risk have an expected survival time that is 2.34 times longer than those with competing risk, holding other factors constant. Cancer staging is a critical determinant of survival and is classified into five levels. Stage 0 consists of abnormal cells that do not invade deeper tissues. In Stage 1, the tumor is localized to its origin and remains small. Stage 2 represents a larger tumor that may have spread to adjacent lymph nodes. Stage 3 indicates further tumor growth, possibly extending to nearby lymph nodes, tissues, or organs. Finally, Stage 4 marks the spread of cancer to distant organs, making it the most advanced stage. The LASSO coefficient for staging is 0.2291, and exponentiating this value (exp(0.2291) ≈ 1.26) suggests a 26% increase in survival time for patients diagnosed at lower stages (0, I, and II) compared to those diagnosed at higher stages (III and IV). Patients who experience recurrence of the disease tend to have a shorter survival time. The exponentiated coefficient for recurrence is exp(0.1724) ≈ 1.19, indicating that individuals without recurrent disease have a 19% longer expected survival time compared to those who experience recurrence. The human epidermal growth factor receptor 2 (HER2) status is categorized as HER2-negative (0) and HER2-positive (1). 
The exponentiated coefficient for HER2 is exp(0.1861) ≈ 1.20, suggesting that patients with HER2-negative status have a 20% longer survival time compared to those with HER2-positive status. Similarly, molecular subtypes of breast cancer influence survival. Patients without the Triple Negative (TRN) subtype have a 15% longer survival time (exp(0.1361) ≈ 1.15) compared to those with other molecular subtypes, including Luminal A, Luminal B, and HER2-positive overexpression. The grade of the tumor also plays a crucial role in survival outcomes. Grade 1 tumors are well-differentiated, closely resembling normal breast tissue and growing slowly. Grade 2 tumors are moderately differentiated, growing at a moderate pace. Grade 3 tumors, however, are poorly differentiated, looking highly abnormal and spreading rapidly. The exponentiated coefficient for tumor grade is exp(0.1081) ≈ 1.11, suggesting that patients with lower tumor grades (Grade I and II) have an 11% longer survival time compared to those with higher-grade tumors. The presence or absence of metastasis significantly affects survival outcomes. Metastasis is classified into three categories: 0 (no available information), 1 (metastasis present, meaning the cancer has spread beyond its original site), and 2 (no metastasis, meaning the cancer remains localized). The exponentiated coefficient for metastasis is exp(0.9666) ≈ 2.63, indicating that patients without metastasis have an expected survival time that is 2.63 times longer than those with metastatic disease, holding other factors constant. The extent of lymph node involvement also plays a crucial role in survival. Lymph nodes are examined microscopically for cancer presence and categorized into four levels: 0 (no positive lymph nodes), 1 (one positive lymph node), 2 (two positive lymph nodes), and 3 (three positive lymph nodes). 
The exponentiated coefficient for lymph node involvement is exp(0.1487) ≈ 1.16, suggesting a 16% longer survival time for patients with lower lymph node involvement (0 and 1) compared to those with higher involvement. The findings highlight the critical role of tumor characteristics, molecular markers, and disease progression in determining survival outcomes. Patients diagnosed at earlier stages (0, I, and II) have significantly better survival prospects than those diagnosed at advanced stages (III and IV). Higher tumor grade, advanced staging, metastasis, high lymph node involvement, HER2 positivity, and the presence of competing risk factors all contribute to shorter survival times. Among these factors, metastasis and competing risk status are the most significant determinants of reduced survival time, emphasizing the need for aggressive monitoring and targeted interventions for high-risk patients. By employing LASSO regularization, this study ensures that only the most relevant predictors are retained, thereby enhancing the robustness and interpretability of survival estimates. These findings underscore the importance of early detection, personalized treatment strategies, and tailored interventions to improve patient outcomes.
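The conversion from log-time ratios to time ratios used throughout this interpretation can be reproduced directly. The sketch below exponentiates the LASSO coefficients quoted in the text; the dictionary keys are shorthand labels chosen for illustration and are not necessarily the variable names in the dataset.

```python
import math

# LASSO coefficients (log-time ratios) as quoted in the text.
lasso_coefficients = {
    "CompetingR": 0.8493,
    "Stage": 0.2291,
    "Recurrence": 0.1724,
    "HER2": 0.1861,
    "TripleNegative": 0.1361,
    "Grade": 0.1081,
    "Metastasis": 0.9666,
    "LymphNode": 0.1487,
}

# exp(coefficient) is the time ratio; (ratio - 1) * 100 is the percent change.
time_ratios = {name: math.exp(coef) for name, coef in lasso_coefficients.items()}

for name, ratio in time_ratios.items():
    print(f"{name:>14}: time ratio {ratio:.2f} "
          f"({100 * (ratio - 1):+.0f}% expected survival time)")
```

Ratios close to 1, such as 1.11 for tumor grade, correspond to modest percentage changes, whereas the ratios for competing risk and metastasis exceed 2 and are better read as multiples of survival time.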
The risk score distribution plot in Fig. 6 categorizes patients into Low, Medium, and High-risk groups, with distinct density distributions. This classification aligns with the parameter estimates from the Extreme Value AFT Frailty Model (Table 4), where significant covariates such as CompetingR, Metastasis, LymphNode, and TypeofBC contribute to risk differentiation. The influence of these variables on survival outcomes justifies the stratification seen in the plot. Moreover, the regularization techniques (Table 5) further validate the key predictors influencing risk scores, as LASSO and Elastic Net retain similar variables with nonzero coefficients. The AIC and BIC model selection (Table 6) suggest that LASSO provides the best model fit, reinforcing the robustness of risk classification. This interplay between statistical modeling and empirical risk score distribution suggests that patient stratification effectively reflects underlying survival risks, facilitating targeted interventions.
The forest plot in Fig. 7 provides a clear visualization of the estimated effects of various prognostic factors in the AFT frailty model, highlighting both significant and non-significant covariates with respect to their effect estimates and confidence intervals. The justification for using a forest plot lies in its ability to succinctly compare multiple predictors simultaneously, particularly in the context of survival analysis, where understanding the impact of each factor on prognosis is critical. Significant factors such as TypeofBC, LymphNode, Age, Metastasis, and CompetingR are marked in red, underscoring their substantial role in influencing patient survival. Notably, CompetingR exhibits a strong negative estimate, suggesting that patients experiencing competing risks have a significantly poorer prognosis, with the worst outcomes observed in those who also underwent hospitalization, reflecting an accelerated decline in survival. Lymph node involvement, categorized from 0 (no positive nodes) to 3 (three positive nodes), emerges as a key determinant of prognosis, reinforcing the clinical understanding that increased lymphatic spread correlates with worse survival outcomes. Similarly, Metastasis, with the covariate distinguishing between 0 (no adequate information), 1 (confirmed metastasis), and 2 (no metastasis), demonstrates a significant association with prognosis, further validating the aggressive nature of cancer once it spreads beyond the primary site. TypeofBC, representing the molecular subtypes (Luminal A, Basal type/Triple Negative, Luminal B, and HER2 Plus overexpression), significantly impacts survival, with more aggressive subtypes (Triple Negative) associated with poorer outcomes. Age also emerges as a significant factor, indicating that survival prognosis varies with patient age, possibly reflecting differences in biological response and treatment efficacy across age groups.
The implications of this analysis extend to clinical decision-making, as it emphasizes the necessity of early detection, stratification of patients based on lymph node involvement and molecular subtypes, and the management of competing risks to improve survival outcomes.
The Kaplan–Meier survival curve stratified by metastasis status in Fig. 8 reveals a significant association with breast cancer prognosis (p < 0.0001). Patients without metastasis (Metastasis = 2) exhibit the highest survival probability, indicating a favorable prognosis, while those with confirmed metastasis (Metastasis = 1) show a considerable decline in survival over time, highlighting the severe impact of cancer spread. The latent phase (0–10 months) shows minimal survival decline across all groups, indicating an initial period of disease stability. The critical phase (10–30 months) marks a steeper decline in survival for patients with metastasis, suggesting increased disease burden and progression. The accelerated phase (beyond 30 months) shows a pronounced drop in survival for metastatic patients, emphasizing the aggressive nature of advanced-stage breast cancer. Patients with no metastasis maintain relatively stable survival trends throughout, reinforcing the importance of early detection and intervention. These findings underscore the critical role of metastasis in determining patient outcomes, necessitating aggressive therapeutic strategies, such as systemic chemotherapy, targeted therapy, and continuous monitoring for metastatic cases. The results highlight the need for timely intervention to delay disease progression and improve survival rates, particularly for patients at high risk of metastatic spread.
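For readers unfamiliar with how such curves are built, the Kaplan–Meier product-limit estimator can be sketched in plain Python. The follow-up times and event indicators below are toy values, not study data, and the function name is illustrative. In a stratified analysis the function would be applied separately to each metastasis group.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up times (e.g. months)
    events : 1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data (hypothetical): months of follow-up and event indicators.
toy_times = [5, 8, 12, 12, 20, 25, 30, 34]
toy_events = [1, 0, 1, 1, 0, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t = {t:>2} months, S(t) = {s:.3f}")
```

Censored patients leave the risk set without producing a step, which is why the curve only drops at observed event times.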
The Kaplan–Meier survival curve stratified by lymph node involvement in Fig. 9 shows a significant impact on breast cancer prognosis, as indicated by the p-value < 0.0001. Patients with no positive lymph nodes (LymphNode = 0) exhibit the highest survival probability throughout the study period, indicating a more favorable prognosis, while those with three positive lymph nodes (LymphNode = 3) have the poorest survival outcomes, reflecting the aggressive nature of lymphatic spread. The latent phase (0–10 months) shows minimal survival decline across all groups, suggesting a period of disease stability. The critical phase (10–30 months) marks a sharper decline, especially for patients with two or more positive lymph nodes, indicating progressive disease impact. Beyond 30 months, patients with three positive lymph nodes enter an accelerated phase of severity, where survival declines significantly compared to those with fewer or no lymph node involvement. These findings highlight the importance of lymph node status in predicting disease progression and guiding treatment strategies. Patients with higher lymph node involvement may require more aggressive interventions, such as chemotherapy and targeted therapy, along with closer monitoring to improve survival outcomes. The results emphasize the prognostic value of lymph node analysis in breast cancer management and the need for early detection to mitigate the risks associated with advanced nodal involvement.
The Kaplan–Meier survival curve stratified by tumor grade demonstrates significant differences in breast cancer prognosis, with a p-value < 0.0001 confirming the strong statistical impact of tumor differentiation on survival. Grade 1 (well-differentiated) exhibits the most favorable prognosis, maintaining the highest survival probability throughout the study period, while Grade 3 (poorly differentiated) has the poorest survival, indicating rapid tumor progression and aggressive disease behavior. Grade 2 (moderately differentiated) follows an intermediate trajectory. The latent phase (0–10 months) shows minimal survival decline across all grades, suggesting a period of stability before disease progression accelerates. The critical phase (10–30 months) is marked by a steeper decline in survival for Grade 3, reflecting the aggressive nature and faster spread of poorly differentiated tumors. Beyond 30 months, the survival rates for Grade 3 patients continue to decline rapidly, indicating an accelerated phase of severity, while Grade 1 patients experience a more gradual decline, reinforcing their slower tumor growth. These findings highlight the importance of early and aggressive treatment strategies for patients with high-grade tumors, as their prognosis is significantly worse compared to those with well-differentiated tumors. The study underscores the role of tumor grade in guiding treatment intensity and monitoring, with Grade 3 patients requiring more intensive surveillance and potentially more aggressive therapeutic interventions to improve survival outcomes (Fig. 10).
The Kaplan–Meier survival analysis stratified by HER2 status in Fig. 11 reveals a significant difference in survival outcomes between HER2-positive and HER2-negative breast cancer patients, with HER2-positive patients experiencing worse prognoses. The survival curves separate early, and the log-rank test (p < 0.0001) confirms the statistical significance of this difference. Initially, during the critical phase (0–15 months), both groups maintain a high survival probability with minimal divergence. However, in the delayed phase (15–30 months), HER2-positive patients begin to show a decline in survival, indicating an increasing mortality risk. This trend continues into the latent phase (30–45 months), where the gap between the two groups widens, signaling progressive disease severity in HER2-positive cases. Beyond 45 months, in the accelerated phase, HER2-positive patients experience a steeper decline in survival probability, reflecting aggressive disease progression. These findings underscore the necessity for early HER2 testing and targeted therapy to improve survival outcomes. Clinically, early-stage management (0–15 months) should focus on early intervention, while the intermediate phase (15–45 months) requires close monitoring and aggressive therapeutic strategies to slow disease progression. After 45 months, palliative care becomes essential for high-risk patients. Overall, HER2-positive patients require intensive treatment and continuous follow-up to mitigate their higher mortality risk and improve long-term survival.
The Kaplan–Meier survival curve stratified by Competing Risk (CompetingR) in Fig. 12 provides crucial insights into the prognosis of breast cancer patients based on their competing risk status. Patients with no evidence of competing risk (CompetingR = 0, red curve) have the best prognosis, with consistently high survival probability throughout the follow-up period. Patients with competing risk but without hospitalization (CompetingR = 1, green curve) show a gradual decline in survival, indicating a moderate impact on prognosis. Patients with competing risk and hospitalization (CompetingR = 2, blue curve) experience the worst survival outcomes, with a significantly sharper decline, suggesting that hospitalization due to competing risks accelerates mortality. The early period (0–20 months) shows minimal differences among the groups, indicating a delayed effect of competing risks. However, from 20 to 40 months, the survival probability of CompetingR = 2 drops significantly, marking a critical window of accelerated severity, where hospitalization strongly correlates with increased mortality risk. Beyond 40 months, survival stabilizes but remains markedly lower for the hospitalized group, suggesting long-term detrimental effects. The statistically significant difference (p < 0.0001) emphasizes the importance of integrating competing risk factors into prognosis assessments. Clinically, this highlights the urgent need for early interventions and targeted management strategies for patients with identified competing risks, especially those requiring hospitalization, to improve survival outcomes.
The Kaplan–Meier survival curve stratified by recurrent breast cancer status in Fig. 13 provides key insights into the prognosis of patients with and without a history of recurrence. Patients with no recurrence history (Recurrent = 0, red curve) exhibit a significantly higher survival probability throughout the follow-up period compared to those with recurrence. In contrast, patients with recurrent breast cancer (Recurrent = 1, blue curve) show a markedly steeper decline in survival, suggesting a worse prognosis and higher mortality risk. The difference between the two groups is statistically significant (p < 0.0001), confirming that recurrence significantly impacts survival outcomes. In terms of time-dependent severity, the early phase (0–20 months) shows a delayed effect, where survival remains relatively high for both groups, with only a slight decline in the recurrent group. However, between 20–40 months, a critical period of accelerated severity is observed, where the survival probability of recurrent patients drops sharply, reflecting increased mortality risk during this window. Beyond 40 months, survival in the recurrent group continues to decline at a higher rate, reinforcing the long-term negative impact of recurrence. These findings underscore the need for aggressive surveillance, early intervention, and personalized treatment strategies for patients with recurrent breast cancer. The substantial survival gap between recurrent and non-recurrent patients highlights the importance of enhanced follow-up care, timely therapeutic adjustments, and possible novel treatment approaches to mitigate the accelerated decline observed in recurrent cases.
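The p-values attached to two-group Kaplan–Meier comparisons such as this one are typically obtained from a log-rank test. The following is a minimal sketch of the two-sample version, assuming that is the test behind Fig. 13. The data are toy values and the function name is illustrative. Comparisons with more than two groups (for example lymph node category) would need the k-sample extension, which is not shown.

```python
import math


def logrank_two_groups(times, events, groups):
    """Two-sample log-rank test. groups holds 0/1 labels; events holds 1 = event, 0 = censored."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                               # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)   # at risk, group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)    # events at t
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)                          # events at t, group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (hypothetical): group 1 = recurrent, group 0 = non-recurrent.
toy_times = [6, 10, 14, 18, 22, 30, 36, 40, 44, 48]
toy_events = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
toy_groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
chi2, p_value = logrank_two_groups(toy_times, toy_events, toy_groups)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

The statistic accumulates, at every event time, the gap between the events observed in one group and the number expected if both groups shared the same hazard.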
The Kaplan–Meier survival curve stratified by molecular subtypes (MSubtype) indicates significant differences in prognosis among breast cancer patients, with a p-value < 0.0001 confirming strong statistical evidence of variation. Luminal A (MSubtype = 0) exhibits the most favorable prognosis, maintaining the highest survival probability throughout the study period, whereas Basal type or Triple Negative (MSubtype = 1) and HER2 Plus overexpression (MSubtype = 3) show the poorest survival outcomes, suggesting their aggressive nature and resistance to treatment. Luminal B (MSubtype = 2) follows an intermediate pattern. The early phase (0–10 months) shows minimal survival decline across all groups, indicating a latent phase of disease severity. The critical phase (10–30 months) sees a pronounced drop in survival for Basal type and HER2 overexpression, reflecting rapid disease progression and treatment resistance. Beyond 30 months, survival rates for these subtypes continue declining, with a particularly steep reduction in HER2 overexpression, indicating an accelerated phase of severity. The findings emphasize the need for aggressive early intervention for Basal type and HER2 overexpression patients, while Luminal A patients may benefit from standard treatment protocols. These results reinforce the importance of molecular subtype-specific therapies, particularly for high-risk groups, to improve long-term survival outcomes in breast cancer patients (Fig. 14).
Discussion
This study underscores the critical role of regularization techniques in survival analysis, particularly in refining variable selection and enhancing model generalization. LASSO, in particular, proved to be an effective tool for identifying key prognostic factors while mitigating overfitting and multicollinearity. The retained predictors—Competing Risk, Metastasis, Stage, Lymph Node involvement, HER2 status, and Tumor Grade—highlight essential clinical determinants of breast cancer survival. The findings of this study provide crucial insights into the key predictors of breast cancer survival by leveraging regularization techniques within the framework of the Extreme Value Accelerated Failure Time (AFT) Frailty Model. Regularization methods, particularly LASSO, Ridge, and Elastic Net, have demonstrated their effectiveness in improving model fit and interpretability, addressing issues of multicollinearity and overfitting common in high-dimensional datasets. The superior performance of LASSO underscores its capacity to refine the model by selecting the most relevant predictors, thereby enhancing the robustness of survival estimates. Regularization played a pivotal role in refining variable selection and improving the reliability of the model estimates. Before applying these techniques, covariates such as Age Category (AgeCAT), Menopause, and Ethnicity exhibited wide confidence intervals and weak statistical significance. LASSO and Elastic Net effectively mitigated these issues by eliminating or shrinking coefficients, leading to a more parsimonious and interpretable model. The retained predictors—Competing Risk, Metastasis, Stage, and Lymph Node involvement—were consistently significant across models, reinforcing their critical role in survival analysis. Exponentiating the LASSO-derived coefficients enabled a more intuitive interpretation of survival outcomes. 
Competing Risk (CompetingR) emerged as a critical determinant, with an exponentiated coefficient (exp(0.8493) ≈ 2.34), indicating that patients without competing risks had an expected survival time 2.34 times longer than those with competing risks. Similarly, Metastasis was a strong predictor, with patients without metastasis exhibiting a 2.63-fold increase in expected survival time (exp(0.9666) ≈ 2.63), emphasizing the detrimental impact of cancer spread on prognosis. Cancer staging further confirmed the significance of early diagnosis, as patients diagnosed at lower stages (0, I, II) had a 26% longer survival time compared to those diagnosed at advanced stages (III, IV) (exp(0.2291) ≈ 1.26). Additionally, recurrence status (exp(0.1724) ≈ 1.19) and HER2-negative status (exp(0.1861) ≈ 1.20) were associated with improved survival, corroborating clinical findings that highlight these factors as key prognostic indicators. Tumor grade was another significant predictor, with patients having lower-grade tumors (Grade I and II) exhibiting an 11% longer survival time than those with high-grade tumors (Grade III) (exp(0.1081) ≈ 1.11). The risk score distribution plot categorized patients into Low, Medium, and High-risk groups, reflecting distinct density distributions. The stratification aligned with parameter estimates from the Extreme Value AFT Frailty Model, wherein significant covariates such as Competing risk status, Metastasis, Lymph Node involvement, and Type of Breast Cancer (TypeofBC) contributed to risk differentiation. LASSO and Elastic Net retained similar predictors with nonzero coefficients, reinforcing their role in risk classification. Further validation was provided by the forest plot, which visualized the effect sizes of key prognostic factors. Notably, Competing risk (CompetingR) exhibited a strong negative estimate, suggesting a significantly poorer prognosis for patients experiencing competing risks. 
Lymph node involvement and metastasis were also significant predictors, reinforcing the established clinical understanding that increased nodal spread and distant metastasis correlate with poorer survival outcomes. The Kaplan–Meier survival curves provided additional validation for the predictive power of key covariates. Stratification by metastasis status revealed a significant decline in survival among patients with confirmed metastasis, highlighting the aggressive nature of advanced-stage breast cancer. Similarly, lymph node involvement was a critical prognostic factor, with patients having three positive lymph nodes experiencing the worst survival outcomes. These findings underscore the importance of early detection and aggressive treatment for patients with high nodal involvement. Tumor grade also significantly impacted survival, with poorly differentiated tumors (Grade III) exhibiting rapid disease progression. The findings suggest that high-grade tumors require more intensive therapeutic interventions to improve survival outcomes. Additionally, HER2 status influenced prognosis, as HER2-positive patients exhibited worse survival outcomes compared to their HER2-negative counterparts. This observation reinforces the clinical significance of targeted therapies such as HER2 inhibitors in improving outcomes for HER2-positive breast cancer patients. These findings align with previous research emphasizing the importance of regularization in high-dimensional datasets, where feature selection and reduction in variance contribute to more stable predictions [54].
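The difference between LASSO, Ridge, and Elastic Net in selecting versus shrinking covariates can be illustrated with their closed-form updates. This sketch assumes a penalized least-squares problem with an orthonormal design, not the AFT frailty likelihood fitted in this study. The value of the penalty `lam`, the mixing weight `alpha`, and the AgeCAT estimate are hypothetical; the other three estimates are the unpenalized values reported in this paper.

```python
def soft_threshold(b, t):
    """Shrink b toward zero by t and set it to exactly zero when |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def lasso_update(b, lam):
    """L1 penalty: small coefficients become exactly zero (variable selection)."""
    return soft_threshold(b, lam)


def ridge_update(b, lam):
    """L2 penalty: proportional shrinkage, never exactly zero for nonzero b."""
    return b / (1.0 + lam)


def elastic_net_update(b, lam, alpha):
    """Mix of both penalties: alpha = 1 gives LASSO, alpha = 0 gives Ridge."""
    return soft_threshold(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Unpenalized estimates: three reported in the paper, AgeCAT hypothetical.
estimates = {"Metastasis": -1.21, "CompetingR": -0.97, "Stage": -0.29, "AgeCAT": 0.04}
lam, alpha = 0.1, 0.5  # hypothetical tuning values

lasso = {k: lasso_update(v, lam) for k, v in estimates.items()}
ridge = {k: ridge_update(v, lam) for k, v in estimates.items()}
enet = {k: elastic_net_update(v, lam, alpha) for k, v in estimates.items()}

for name in estimates:
    print(f"{name:<11} lasso={lasso[name]:+.3f} ridge={ridge[name]:+.3f} enet={enet[name]:+.3f}")
```

In this toy setting the weak AgeCAT effect is removed by LASSO and Elastic Net but only shrunk by Ridge, which mirrors the qualitative behavior described above.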
Major findings
The study evaluated the performance of seven Accelerated Failure Time (AFT) models with a Gamma frailty component to account for unobserved heterogeneity. Based on four evaluation metrics, namely Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Mean Absolute Error (MAE), and Mean Squared Error (MSE), the Extreme Value AFT model emerged as the most robust across all sample sizes. This model recorded the lowest AIC and BIC values, demonstrating the best balance between model fit and complexity among the candidates. Additionally, it exhibited the lowest MAE and MSE values, signifying superior predictive accuracy and reliability. The Log-logistic model also performed well, particularly for larger sample sizes, with competitive AIC, BIC, MAE, and MSE values. However, its performance was slightly inferior to the Extreme Value model, making it a strong but secondary option. The study further examined the impact of data partitioning strategies on model performance, revealing that a 70–20–10 training, testing, and validation split provided the most efficient trade-off between learning and generalization, yielding the lowest AIC (1578.6) and BIC (1595.4).
Parameter estimates from the Extreme Value AFT Frailty Model identified key predictors significantly influencing survival time among breast cancer patients. Competing Risks (−0.97, p < 0.001), Metastasis (−1.21, p < 0.001), Cancer Stage (−0.29, p = 0.003), and Lymph Node Involvement (−0.19, p = 0.02) were all associated with reduced survival duration. The frailty variance (0.42) indicated moderate heterogeneity among individuals, suggesting the presence of unmeasured factors affecting survival outcomes. The model's concordance index (0.882) demonstrated a high ability to distinguish between high- and low-risk individuals. Regularization techniques, including LASSO, Ridge, and Elastic Net, were applied to mitigate overfitting and improve model generalization.
LASSO effectively performed variable selection, eliminating less relevant predictors such as Age Category (AgeCAT), Estrogen Receptor (ER) status, and Hospitalization. The LASSO-regularized model achieved the lowest AIC (2710.431) and BIC (2745.026), confirming its optimal trade-off between model complexity and prediction accuracy. The exponentiation of LASSO estimates provided intuitive interpretations, highlighting the substantial impact of key clinical variables on survival duration. For instance, patients without competing risks had an expected survival time 2.34 times longer than those facing additional health challenges.
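The AIC and BIC comparisons summarized above follow the standard definitions, sketched here in Python. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not the study's fitted values. Counting only the nonzero coefficients of a penalized fit as its parameters is a common approximation and is assumed here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of nonzero parameters).
candidates = {
    "full": (-1350.0, 12),
    "sparse": (-1351.5, 7),
}
n_obs = 500  # hypothetical sample size

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name:<7} AIC = {a:.1f}  BIC = {b:.1f}")
print("preferred by AIC:", best_by_aic, "| preferred by BIC:", best_by_bic)
```

In this toy comparison the sparser model wins on both criteria even though its log-likelihood is slightly lower, because dropping five parameters outweighs the small loss in fit.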
Conclusion
This study demonstrated that the Extreme Value AFT Frailty Model is the most effective in capturing survival patterns among breast cancer patients, outperforming six alternative models across multiple evaluation criteria. The inclusion of a frailty term accounted for unobserved heterogeneity, ensuring more reliable survival estimates. Findings underscore the significance of key predictors such as Metastasis, Cancer Stage, Lymph Node Involvement, and Competing Risks in determining survival duration. The application of regularization techniques, particularly LASSO, significantly improved model interpretability and predictive performance by refining variable selection and mitigating overfitting. The results reinforce the importance of systematic data partitioning, with a 70–20–10 split emerging as the best of the partitions examined for robust model training and validation. Clinically, these findings highlight the need for early detection and targeted treatment strategies to mitigate the adverse effects of metastasis and late-stage diagnoses. The study highlights the crucial role of risk stratification, as patients categorized into Low, Medium, and High-risk groups exhibit distinct survival patterns, aligning with the Extreme Value AFT Frailty Model. The forest plot analysis further validates the strong impact of significant covariates, with Competing Risks, Lymph Node Involvement, and Metastasis emerging as the most critical prognostic factors. Kaplan–Meier survival analysis reveals sharp survival declines associated with metastasis, lymph node involvement, tumor grade, HER2 status, and molecular subtypes, reinforcing the urgent need for early detection and targeted interventions. Notably, patients with Triple Negative and HER2-overexpressing subtypes exhibit the poorest survival outcomes, highlighting the necessity for subtype-specific therapies.
Additionally, competing risks, particularly hospitalization-related factors, substantially impact survival, emphasizing the need for integrated treatment approaches. Future research should explore the integration of additional biomarkers, genetic information, and socioeconomic factors to enhance predictive accuracy further. Additionally, refining multicenter data adjustments and exploring alternative frailty distributions may provide deeper insights into survival modeling and patient-specific risk assessments. The study contributes to the advancement of survival analysis methodologies by emphasizing model selection, frailty incorporation, and regularization techniques in improving predictive frameworks for clinical decision-making.
Data availability
The datasets used and analysed during the current study are available from the corresponding author (Senyefia Bosson-Amedenu, at senyefia.bosson-amedenu@ttu.edu.gh) upon reasonable request.
References
Aalen OO. Heterogeneity in survival analysis. Stat Med. 1988;7(11):1121–37.
Adham SA, AlAhmadi AA. Gamma and inverse Gaussian frailty models: a comparative study. Int J Math Stat Invent (IJMSI). 2016;4(4):4101–05.
Balan TA, Putter H. A tutorial on frailty models. Stat Methods Med Res. 2020;29(11):3424–54.
Bearse P, Canals-Cerda J, Rilstone P. Efficient semiparametric estimation of duration models with unobserved heterogeneity. Econ Theory. 2007;23(2):281–308 (http://www.jstor.org/stable/4126558).
Burnham KP, Anderson DR. Multimodel inference: understanding AIC and BIC in model selection. Sociol Methods Res. 2004;33(2):261–304.
Chen LP, Huang HT. AFFECT: an R package for accelerated functional failure time model with error-contaminated survival times and applications to gene expression data. BMC Bioinformatics. 2024;25(1):265.
Chen LP. Accelerated failure time models with error-prone response and nonlinear covariates. Stat Comput. 2024;34(6):183.
Chen LP, Qiu B. Analysis of length-biased and partly interval-censored survival data with mismeasured covariates. Biometrics. 2023;79(4):3929–40.
Chen P, Zhang J, Zhang R. Estimation of the accelerated failure time frailty model under generalized gamma frailty. Comput Stat Data Anal. 2013;62:171–80. https://doi.org/10.1016/j.csda.2013.01.016.
Collett D. Modeling survival data in medical research. 3rd ed. Chapman and Hall/CRC; 2015.
Crowther MJ, Royston P, Clements M. A flexible parametric accelerated failure time model and the extension to time-dependent acceleration factors. Biostatistics. 2023;24(3):811–31. https://doi.org/10.1093/biostatistics/kxac009.
Duchateau L, Janssen P. The Frailty Model. Springer; 2007.
Duchateau L, Janssen P. The frailty model. 1st ed. Springer; 2008.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22. https://doi.org/10.18637/jss.v033.i01.
Gallardo DI, Bourguignon M. A shared weighted Lindley frailty model for clustered survival data. arXiv preprint. 2022. https://arxiv.org/abs/2206.12973.
Grover S, Kukreti R. A systematic review and meta-analysis of the role of ABCC2 variants on drug response in patients with epilepsy. Epilepsia. 2013;54(5):936–45. https://doi.org/10.1111/epi.12132. Epub 2013 Mar 18. PMID: 23506516.
Gutierrez RG. Parametric frailty and shared frailty survival models. Stata Journal. 2002;2(1):22–44.
Haerting J. Frailty Models in Survival Analysis. From the Institute for Medical Epidemiology: Biometry and Informatics, Sachsen-Anhalt (ULB) University; 2007.
Hanagal DD, Pandey A. Correlated gamma frailty models for bivariate survival data based on reversed hazard rate. International Journal of Data Science. 2017;2(4):301. https://doi.org/10.1504/ijds.2017.088102.
Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer; 2009. p. 1–758.
Hougaard P. Frailty models for survival data. Lifetime Data Anal. 1995;1(3):255–73. https://doi.org/10.1007/BF00985760.
Hougaard P. Analysis of multivariate survival data. Springer; 2000.
Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics. 2006;62(3):813–20. https://doi.org/10.1111/j.1541-0420.2006.00562.x.
Ibrahim JG, Chen MH, Sinha D. Bayesian Survival Analysis. Springer; 2001.
Kats L, Gorfine M. An accelerated failure time regression model for illness-death data: a frailty approach. Biometrics. 2022.
Keiding N, Andersen PK, Klein JP. The role of frailty models and accelerated failure time models in describing heterogeneity due to omitted covariates. Stat Med. 1997;16(1–3):215–24. https://doi.org/10.1002/(sici)1097-0258(19970130)16:2%3c215::aid-sim481%3e3.0.co;2-j.
Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer; 2003.
Kleinbaum DG, Klein M. Survival Analysis: A Self-Learning Text. 2nd ed. United States of America: Springer Science Publishers; 2005. p. 1020.
Lambert P, Collett D, Kimber A, Johnson R. Parametric accelerated failure time models with random effects and an application to kidney transplant survival. Stat Med. 2004;23(20):3177–92.
Li L, Liu ZP. A connected network-regularized logistic regression model for feature selection. Appl Intell. 2022;52:11672–702. https://doi.org/10.1007/s10489-021-02877-3.
Li L, Liu ZP. Detecting prognostic biomarkers of breast cancer by regularized Cox proportional hazards models. J Transl Med. 2021;19:514. https://doi.org/10.1186/s12967-021-03180-y.
Li L, Liu ZP. Biomarker discovery from high-throughput data by connected network-constrained support vector machine. Expert Syst Appl. 2023;226:120179. https://doi.org/10.1016/j.eswa.2023.120179.
Liang Y, Chai H, Liu XY, et al. Cancer survival analysis using semi-supervised learning method based on Cox and AFT models with L1/2 regularization. BMC Med Genomics. 2016;9:11. https://doi.org/10.1186/s12920-016-0169-6.
Mahmoodi M, Hosseini M, Zare A, Mohammad K, Zeraati H, HolakouieNaieni K. A comparison between accelerated failure-time and Cox proportional hazard models in analyzing the survival of gastric cancer patients. Iranian J Public Health. 2015;44(8):1095–102 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4645729).
Matsuo K, Purushotham S, Jiang B, Mandelbaum RS, Takiuchi T, Liu Y, Roman LD. Survival outcome prediction in cervical cancer: Cox models vs deep-learning model. Am J Obstet Gynecol. 2019;220(4):381.e1-381.e14. https://doi.org/10.1016/j.ajog.2018.12.030. Epub 2018 Dec 21. PMID: 30582927; PMCID: PMC7526040.
Moore DF. Applied Survival Analysis Using R. Switzerland: Springer International Publishing; 2016. p. 1233.
Mota TA, Bourguignon M, Cordeiro GM, Pescim RR. A cure rate frailty regression model based on the weighted Lindley distribution. Stat Methods Appl. 2023;32:29–52. https://doi.org/10.1007/s10260-022-00673-y.
Omae K, Eguchi S. Quasi-linear Cox proportional hazards model with cross-L1 penalty. BMC Med Res Methodol. 2020;20:182. https://doi.org/10.1186/s12874-020-01063-2.
Pereira EQ, Gonzatto OAJ, Tomazella VLD, Morita LHM, Mota AL, Louzada FN. Accelerated failure time frailty model for modeling multiple systems subject to minimal repair. Appl Stoch Model Bus Ind. 2024. https://doi.org/10.1002/asmb.2864.
Sarkar K, Chowdhury R, Dasgupta A. Analysis of survival data: challenges and algorithm-based model selection. J Clin Diagn Res. 2017;11(6):LC14–20. https://doi.org/10.7860/JCDR/2017/21903/10019.
Senyefia BA, Acquah J, Nyarko CC, Ouerfelli N. A comparison between accelerated failure time models in analyzing the survival of breast cancer patients. J Cancer Tumor Int. 2022;12(1):16–28. https://doi.org/10.9734/jcti/2022/v12i130166.
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13. https://doi.org/10.18637/jss.v039.i05.
Sirimongkolkasem T, Drikvandi R. On regularisation methods for analysis of high-dimensional data. Annals of Data Science. 2019;6:737–63. https://doi.org/10.1007/s40745-019-00209-4.
Soret P, Avalos M, Wittkop L, et al. Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors. BMC Med Res Methodol. 2018;18:159. https://doi.org/10.1186/s12874-018-0609-4.
Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer; 2000.
Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc: Ser B (Methodol). 1996;58(1):267–88. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/j.2517-6161.1996.tb02080.x.
Ulviya A. Frailty models for modeling heterogeneity. Master's dissertation, McMaster University; 2013.
Vaupel JW, Manton KG, Stallard E. The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography. 1979;16(3):439–54. https://doiorg.publicaciones.saludcastillayleon.es/10.2307/2061224.
Wienke A. Frailty Models in Survival Analysis. 1st ed. Chapman and Hall/CRC; 2010.
Wienke A. Frailty Models in Survival Data. 1st ed. New York: Chapman and Hall; 2011.
Yazdani A, Yaseri M. Investigation of prognostic factors of survival in breast cancer using a frailty model: a multicentre study. Breast Cancer Basic Clin Res. 2019;13:1–10.
Zang W, Chen H, Yan J, Li D, Xiao N, Zheng X, Zhang Z. Research trends and hotspots of exercise for people with sarcopenic: A bibliometric analysis. Medicine. 2023;102(50):e35148.
Zhang D. Analysis of Survival Data (ST745). Spring 2005.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Stat Soc: Ser B (Stat Methodol). 2005;67(2):301–20. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/j.1467-9868.2005.00503.x.
Acknowledgements
Not applicable.
Ethics reference number
IRB KTH15/1083703.
Ethical Clearance Statement
This study utilizes a secondary dataset that contains anonymized patient information, which is untraceable and will not lead to the identification of individual patients.
Funding
The authors received no financial support for the research, authorship, or publication of this article.
Author information
Contributions
S. B-A conceptualized the study, validated the findings, designed the methodology, performed formal analysis, wrote the original draft, and developed the software. E. A contributed to visualization, reviewed and edited the manuscript, conducted investigations, and curated and collected the data. F. A-M supported visualization, validated the findings, reviewed and edited the manuscript, and contributed to investigations. L. A validated the findings, reviewed and edited the manuscript, and assisted with visualization. All authors reviewed and approved the final manuscript and took personal responsibility for their contributions and the integrity of the work presented.
Ethics declarations
Ethics approval and consent to participate
This study utilizes a secondary dataset that contains anonymized patient information, which is untraceable and will not lead to the identification of individual patients. As such, the research adheres to ethical standards for the use of secondary data, ensuring the privacy and confidentiality of patient information. Approval for the use of this dataset was obtained from the relevant ethical review board, ensuring that all necessary safeguards are in place to protect patient confidentiality.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Bosson-Amedenu, S., Ayitey, E., Ayiah-Mensah, F. et al. Evaluating key predictors of breast cancer through survival: a comparison of AFT frailty models with LASSO, ridge, and elastic net regularization. BMC Cancer 25, 665 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12885-025-14040-z
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12885-025-14040-z