A Predictive Model for Distant Metastasis in Patients With Breast Cancer Based on Machine Learning

Article information

J Breast Dis. 2023;11(2):39-47

Publication date (electronic) : 2023 December 31

doi : https://doi.org/10.14449/jbd.2023.11.2.39

Hye Jin Kwon, M.D., Min Hyung Lee, M.D., Soo Yeon Joo, M.D., Kwanbum Lee, M.D., Seung Ah Lee, M.D., Ph.D., Seung Ki Kim, M.D., Ph.D., Isaac Kim, M.D., Ph.D.

Division of Breast Surgery, Department of Surgery, Bundang CHA Medical Center, CHA University School of Medicine, Seongnam, Korea

Correspondence: Isaac Kim, M.D., Ph.D. Division of Breast Surgery, Department of Surgery, Bundang CHA Medical Center, CHA University School of Medicine, 59 Yatap-ro, Bundang-gu, Seongnam 13496, Korea Tel: +82-31-780-5257, Fax: +82-31-780-5266, E-mail: isaac24@chamc.co.kr

Received 2023 March 31; Revised 2023 June 27; Accepted 2023 August 7.

Abstract

Purpose

Breast cancer starts as a local disease, but can metastasize to distant organs. In this study, we described an easy-to-use tool for predicting distant metastases based on clinical characteristics and gene expression profiles.

Methods

We performed a retrospective chart review of 326 patients with breast cancer who underwent surgery and CancerSCAN™ between January 2001 and December 2014 at the Samsung Medical Center. Additional retrospective data for 83 patients during 2015 were used for internal validation. CancerSCAN™, a next-generation sequencing-based targeted deep sequencing analysis, was used for gene analysis, and Azure Machine Learning (ML) was used for the ML processes.

Results

The no-distant metastasis group comprised 267 patients, while the distant metastasis group comprised 59. Using the Azure ML platform, a predictive model was developed with 326 cases. The area under the receiver operating characteristic curve was 0.917. In the internal validation performed using 83 patients, there were 81 true negatives and two true positives when a threshold value of 0.5 was applied.

Conclusion

Patients with breast cancer are at risk of metastasis and experience fear throughout their lives. Our predictive model is a valuable and easy-to-access tool for identifying patients with distant metastasis and it presents a way for each institution to achieve optimal results using its variables. Further evaluation with a larger patient population will improve the reliability of this model.

Keywords: Breast neoplasms; Metastasis; Prediction

INTRODUCTION

Breast cancer is the most common cancer among women worldwide [1]. Breast cancer starts as a local disease but can metastasize to the lymph nodes and distant organs [2]. Despite advances in breast cancer therapy, 20%−30% of patients with early breast cancer experience relapse with distant metastatic disease [3]. Tumor metastasis is a major clinical challenge and accounts for most cancer-related deaths [4].

In previous studies, the prediction of distant metastasis was based on intrinsic biological subtypes [5,6] and clinical status, including tumor size and nodal status [7]. Gene expression profiles have been used as predictive markers for distant metastasis [8,9]; however, clinical application has been difficult.

In this study, we present an easy-to-use tool for predicting distant metastases based on clinical characteristics and gene expression profiles. Gene profiles were obtained from CancerSCAN™, a targeted sequencing platform designed at the Samsung Medical Center [10].

METHODS

Study population

We performed a retrospective chart review of 336 patients with breast cancer who underwent surgery and CancerSCAN™ between January 2001 and December 2014 at the Samsung Medical Center in Seoul, Korea. DNA sequencing results and electronic medical records, including pathology reports, were reviewed. Ten cases were excluded from the analysis owing to incomplete medical data. As a result, 326 cases were included in the analysis. For internal validation, additional retrospective data from 83 patients who underwent surgery and CancerSCAN™ in 2015 were used. This study adhered to the tenets of the Declaration of Helsinki and was approved by the Institutional Review Board (IRB) of Samsung Medical Center (IRB no. 2018-05-005).

The available data for the cohorts included age at diagnosis, subtype (e.g., hormone receptor [HR]+/human epidermal growth factor receptor 2 [HER2]-, HR+/HER2+, HR-/HER2+, HR-/HER2-), histopathology (e.g., invasive ductal carcinoma [IDC], other), operation type (e.g., breast-conserving surgery, total mastectomy, sentinel lymph node biopsy, axillary lymph node dissection), chemotherapy (e.g., neoadjuvant, adjuvant, none), regimen (e.g., adriamycin and cyclophosphamide [AC]; AC+docetaxel/taxol; fluorouracil, adriamycin, and Cytoxan [FAC]; cyclophosphamide, methotrexate, and fluorouracil [CMF]; docetaxel, carboplatin, trastuzumab, and pertuzumab [TCHP]; and others), radiotherapy, hormonal therapy, target therapy, nuclear grade, pathological T-stage, pathological N-stage, distant metastasis, and metastatic site. Distant metastasis was defined as detectable distant metastasis confirmed using clinical and radiographic means and histologically proven for lung, liver, and peritoneal metastases. For bone and brain metastases, definite radiological findings were considered distant metastasis without biopsy. Positron emission tomography, chest computed tomography (CT), abdominal CT, bone scans, and brain magnetic resonance imaging were used to obtain radiological findings.

Targeted deep sequencing using a customized cancer panel (CancerSCAN™)

Genomic DNA (250 ng) from each tissue sample was sheared in a Covaris S220 ultrasonicator (Covaris, Woburn, USA) and used with CancerSCAN™ probes and the SureSelect XT reagent kit, HSQ (Agilent Technologies, Santa Clara, USA), to construct a library, according to the manufacturer’s protocol.

The panel was designed to enrich the exons of 81 genes, covering 366.2 kb of the human genome. After multiplexing, the enriched exome libraries were sequenced on a HiSeq 2500 sequencing platform (Illumina). A paired-end DNA sequencing library was prepared via gDNA shearing, end repair, A-tailing, paired-end adaptor ligation, and amplification. After hybridization of the library with bait sequences for 27 hours, the captured library was purified and amplified using an index barcode tag, and the library quality and quantity were assessed.

The exome library was sequenced using the 100-bp paired-end mode of the TruSeq Rapid PE Cluster Kit and TruSeq Rapid SBS Kit (Illumina).

Sequence reads were mapped to the human genome (hg19) using the Burrows-Wheeler Aligner [11]. Duplicate reads were removed using Picard and SAMtools [12]. Local alignments were optimized using the Genome Analysis Toolkit [13]. Variant calling was performed only in regions targeted by CancerSCAN™. To detect single nucleotide variants, we integrated the results of three variant callers, which increased the sensitivity [14]. Pindel was used to detect indels [15]. Copy number variations were calculated for the targeted regions by dividing the read depth per exon by the estimated normal reads per exon using an in-house reference.
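The copy number calculation described above can be illustrated with a minimal sketch. It assumes that read depths have already been summarized per exon. The function name, exon labels, and depth values are hypothetical, and the log2 transform is a common convention added here for readability rather than a step stated in the text.

```python
import math

def copy_ratio(sample_depth, reference_depth):
    """Return {exon: (ratio, log2_ratio)} for each exon with a usable reference.

    sample_depth:    dict exon -> read depth observed in the tumor sample
    reference_depth: dict exon -> estimated normal read depth (in-house reference)
    """
    ratios = {}
    for exon, depth in sample_depth.items():
        normal = reference_depth[exon]
        if normal <= 0:
            continue  # skip exons without a usable reference depth
        ratio = depth / normal
        log2_ratio = math.log2(ratio) if ratio > 0 else float("-inf")
        ratios[exon] = (ratio, log2_ratio)
    return ratios

# Hypothetical depths for illustration only (not study data)
sample = {"ERBB2_exon1": 1200.0, "PTEN_exon5": 150.0}
reference = {"ERBB2_exon1": 300.0, "PTEN_exon5": 300.0}
result = copy_ratio(sample, reference)
```

In this toy input, a ratio above 1 suggests a gain and a ratio below 1 suggests a loss; the thresholds used to call a copy number variation are not given in the text and are therefore not encoded here.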

Gene profiles

CancerSCAN™ is a next-generation targeted deep sequencing analysis method covering 81 genes. Gradient boosting was performed to identify the genes important for survival. In particular, the input features were alterations in 81 genes, including loss-of-function, mutation, and copy number variations. The target value was alive or dead at the five-year mark. Important features were identified by serially reducing the number of features [16]. The hyperparameters were optimized as follows: number of estimators, 999; learning rate, 0.15; and maximum depth, 6. Bootstrap resampling (n = 100) was performed in which the training sets (85%) and their corresponding test sets (15%) were resampled 100 times to evaluate the internal stability of the model. The Wilcoxon test was used to determine the optimal number of genes (Supplementary Figure 1) [17].
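The resampling scheme can be sketched as follows. This is a simplified illustration that draws 100 random 85%/15% partitions without replacement; it is not necessarily the exact bootstrap procedure used in the study. The function name, the seed, and the use of 326 as the sample size are assumptions made for the example.

```python
import random

def resample_splits(n_samples, n_repeats=100, train_fraction=0.85, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random train/test splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]

# Assumed sample size for illustration
splits = list(resample_splits(326))
train_idx, test_idx = splits[0]
```

A model would be fitted on each training partition and its AUC computed on the matching test partition, giving 100 AUC values per candidate feature set for comparison.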

Statistical analysis

Variables were compared between the no-distant and distant metastasis groups using the chi-square or Fisher’s exact test. Mean age was compared between the two groups using the Mann-Whitney U test with SAS version 9.4 (SAS Institute, Cary, USA). Receiver operating characteristic (ROC) curves and areas under the ROC curves (AUCs) were calculated. All tests were two-sided, and p < 0.05 was considered to indicate statistical significance.
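For readers less familiar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not study data, and this is not the software routine used by the authors.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: 1 = distant metastasis, 0 = no distant metastasis
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.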

Machine learning (ML)

Azure ML (Microsoft, Redmond, USA) is a cloud service that enables the execution of ML processes. Azure ML Studio (Microsoft, Redmond, USA) is available as a workspace to help users build and test predictive models [18]. We built a supervised ML classification model on the Azure ML platform in five steps: (1) data editing, (2) data splitting, (3) model training, (4) model scoring, and (5) model evaluation (Figure 1). We split the modeling data (326 cases) into training and testing sets using a randomized 50–50 split. Thereafter, we trained the model on the training set using the Two-class Decision Forest method [19] to predict distant metastasis and the Multi-class Neural Network method [20] to predict distant metastatic sites.

Figure 1.

Modeling workflow using Azure Machine Learning. The workflow included establishing a dataset (e.g., 27 gene_azureml), editing the metadata, employing an algorithm (e.g., Two-class Decision Forest), splitting the data, training the model, scoring the model, and evaluating the model.

RESULTS

Patient characteristics

Table 1 shows the baseline characteristics of patients. The no-distant metastasis group comprised 267 patients, and the distant metastasis group comprised 59. The median follow-up period was 104 months (range, 1–203) and the average distant metastasis-free interval was 85 months (range, 1–190). The proportion of HR+/HER2- tumors was higher in the no-distant metastasis group (p = 0.011), while the proportion of HR-/HER2+ tumors was higher in the distant metastasis group (p = 0.003). The distant metastasis group had a higher proportion of patients aged 50–59 years (p < 0.001). Further, the mean age was slightly higher in the distant metastasis group than in the no-distant metastasis group; however, the difference was not significant. IDC accounted for the majority of cases in both groups, and no difference was found according to the surgical type. A higher proportion of patients received neoadjuvant chemotherapy and radiotherapy in the distant metastasis group than in the no-distant metastasis group; opposite results were obtained for adjuvant chemotherapy and hormonal therapy. Nuclear grades 1 and 2 and pathological T stage did not differ between the two groups; however, nuclear grade 3, nodal stage, and pathological stage differed between the two groups. The no-distant metastasis group had a higher proportion of N0 disease (p < 0.001), while the distant metastasis group had higher proportions of nuclear grade 3, N2, and N3 disease (p = 0.010, p < 0.001, and p = 0.001, respectively). Regarding the pathological stage after surgery, the no-distant metastasis group had a higher proportion of stage II cases (p = 0.003), while the distant metastasis group had a higher proportion of stage III cases (p < 0.001).

Table 1.

The baseline characteristics of enrolled patients (n=326)

Distant metastasis

Table 2 shows the proportion of metastatic sites in the distant metastasis group. Among the 59 patients, 21 (35.6%) had multiple metastatic sites and 19 (32.2%) had lung metastases. Only one patient had metastasis at another site, the contralateral supraclavicular lymph node. Of the 21 cases of multiple-site metastasis, six were triple-site cases (28.6%) and 15 were double-site cases (71.4%).

Table 2.

Metastatic sites in distant metastasis group (n=59)

Among the multiple-site metastasis cases, the cumulative counts for lung, bone, liver, brain metastasis, and peritoneal seeding were 17, 12, 10, 7, and 2, respectively.
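The cumulative counts follow directly from the site combinations listed in Table 2, as the short tally below shows. The combination counts are copied from Table 2; the variable names are illustrative.

```python
from collections import Counter

# Combination counts copied from the multiple-site rows of Table 2
combinations = {
    ("Bone", "Lung"): 3,
    ("Bone", "Liver"): 3,
    ("Bone", "Peritoneal seeding"): 1,
    ("Lung", "Liver"): 3,
    ("Lung", "Brain"): 4,
    ("Lung", "Peritoneal seeding"): 1,
    ("Bone", "Lung", "Liver"): 3,
    ("Bone", "Lung", "Brain"): 2,
    ("Lung", "Brain", "Liver"): 1,
}

# Each patient contributes once to every site in their combination
site_totals = Counter()
for sites, n_patients in combinations.items():
    for site in sites:
        site_totals[site] += n_patients

n_multiple = sum(combinations.values())
n_double = sum(n for sites, n in combinations.items() if len(sites) == 2)
n_triple = sum(n for sites, n in combinations.items() if len(sites) == 3)
```

The tally reproduces the reported figures: 21 multiple-site cases (15 double-site and six triple-site), with cumulative counts of 17, 12, 10, 7, and 2 for lung, bone, liver, brain, and peritoneal seeding.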

Gene signature

We used 34 alterations (mutation, loss-of-function, or copy number variation) in 27 genes from the CancerSCAN™ data for the analysis. Table 3 shows the cumulative counts of each gene signature in the two groups. PIK3CA mutation was the most frequent gene variation, occurring in 34.5% of the no-distant metastasis group and 27.1% of the distant metastasis group. BRCA1 and BRCA2 loss-of-function alterations were more frequent in the distant metastasis group than in the no-distant metastasis group; however, the total counts were very small. Of the 59 patients with distant metastasis, 1 (1.7%) had seven gene variations and 6 (3.4%) had no gene variation. The most common number of genetic variations in the distant metastasis group was two (16 patients, 27.1%). Among the 267 patients without distant metastasis, 1 (0.4%) had 10 gene variations, and 28 (10.5%) had no gene variation. The most common number of genetic variations in the no-distant metastasis group was also two (81 patients, 30.3%).

Table 3.

The differences of gene signature between no distant metastasis group and distant metastasis group

Predictive model

We developed a predictive model with 326 cases using the Azure ML platform (Figure 1) and compared various classification algorithms, such as Two-class Decision Forest, Two-class Decision Jungle, Two-class Bayes Point Machine, Two-class Support Vector Machine, and Two-class Neural Network. Of these algorithms, the Two-class Decision Forest method was identified as the most suitable for predicting distant metastasis. The AUC was 0.917 and the accuracy was 0.903 (Figure 2). Internal validation was conducted using 83 patients who underwent breast cancer surgery and CancerSCAN™ in 2015. The median follow-up period was 26 months (range, 1–46), and the average distant metastasis-free interval was 21 months (range, 1–46). When a threshold value of 0.5 was applied, there were 81 true negatives and two true positives among the 83 patients. No false-negative or false-positive results were observed. The validation accuracy was 1.000.
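The validation figures can be checked with a small worked calculation from the confusion-matrix counts reported above (two true positives, 81 true negatives, and no false results). The function name is illustrative, and sensitivity and specificity are standard definitions added here; only accuracy is reported in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Counts reported for the 83-patient internal validation at threshold 0.5
validation = classification_metrics(tp=2, tn=81, fp=0, fn=0)
```

All three values equal 1.0 for these counts. Because only two of the 83 patients had distant metastasis, the sensitivity estimate rests on two cases.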

Figure 2.

Receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) was 0.917.

Clinical application

The Azure ML platform provides a function for setting up web services (http://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-services). After the Azure ML predictive model was deployed as a web service, we used a representational state transfer (REST) application programming interface (API) to send data and obtain real-time predictions. For example, when data (0 or 1) were entered for each variable, excluding the final value (distant metastasis), an external application communicated with the ML workflow scoring model in real time, enabling the predicted value to be calculated in a few seconds (Figure 3).
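A client for such a web service could look roughly like the sketch below, which builds (but does not send) a scoring request for one patient. The endpoint URL and API key are placeholders. The JSON layout and the input name `input1` are assumptions based on the request format commonly shown for Azure ML Studio request-response services, and should be checked against the deployed service. The three feature names are taken from the abbreviations in Figure 3 and Table 3 purely as examples, since the full column list is not given in the text.

```python
import json
import urllib.request

def build_scoring_request(url, api_key, features):
    """Build a POST request carrying one patient's 0/1 feature values."""
    columns = list(features)
    body = {
        "Inputs": {
            "input1": {  # assumed input name; depends on the deployed service
                "ColumnNames": columns,
                "Values": [[str(features[c]) for c in columns]],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")

# Placeholder endpoint, key, and feature subset for illustration only
request = build_scoring_request(
    "https://example.invalid/score",
    "YOUR_API_KEY",
    {"BCS": 1, "NAC": 0, "PIK3CA_MUT": 1},
)
# To send: response = urllib.request.urlopen(request)
#          result = json.loads(response.read())
```

Separating request construction from sending keeps the payload easy to inspect before any patient data leave the local system.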

Figure 3.

Illustration of web service usage for our predictive tool. BCS=breast-conserving surgery; TM=total mastectomy; SLNB=sentinel lymph node biopsy; ALND=axillary lymph node dissection; NAC=neoadjuvant chemotherapy; ADJCTX=adjuvant chemotherapy; MUT=mutation; CNV=copy number variation; LOF=loss-of-function.

DISCUSSION

Distant metastasis from primary breast cancer is potentially lethal and has a complex mechanism. A commonly accepted theory is that as cancer grows, cells within the tumor acquire the capability to spread, survive, and flourish within the regional lymph nodes and other distant sites [7]. In addition, models of metastatic spread describe the complex interaction between seed and soil factors involving tumor intravasation, circulation, extravasation, proliferation, angiogenesis [21], and the microenvironment of the target tissue [22,23]. Owing to the heterogeneous nature of breast cancer metastasis, it is difficult to define a cure for this disease and assess the risk factors for metastasis [24].

A previous study identified the presence of lymph node metastasis, large primary tumor, and loss of histopathological differentiation (grade) as breast cancer prognostic markers [2]. A study of nomograms to predict metastasis-free survival used clinical findings, such as estrogen receptor (ER) status, histological grade, age, and chemotherapy cycles; however, the concordance index was 0.72 [25].

Gene expression signatures of human primary breast tumors predict which patients are destined to relapse and ultimately die of metastatic breast cancer more accurately than conventional prognostic factors [2]. Several studies have sought to predict distant metastasis using gene expression. Cheng et al. [9] developed an 18-gene classifier to estimate distant metastasis risk. The 18-gene scoring system classified patients into low- and high-risk groups. Based on external validation, the 5-year probability of freedom from distant metastasis was 89.5% for low-risk patients and 73.6% for high-risk patients (p = 0.003) [9]. Wang et al. [8] used gene expression profiles to predict distant metastases. These researchers identified a 76-gene signature using an RNA microarray from 286 patients, which showed 93% sensitivity and 48% specificity in an independent testing set of 171 patients [8]. Zemmour et al. [26] conducted DNA microarray studies that identified gene expression signatures for predicting metastatic relapse in early breast cancer. Using only six genes, the CoxBoost classifier predicted the 4-year status of metastatic disease with 93% sensitivity [26].

We developed a new predictive tool for distant metastasis using clinical characteristics and the gene profiles of 27 genes, comprising 34 alterations (mutation, loss-of-function, or copy number variation). Our study is valuable as it combined clinical findings and gene profiles and was conducted using ML. The Azure ML platform used in this study offers several advantages: real-time analysis can be performed in the clinical setting and the platform is free. An optional paid tool is available on the Azure ML platform; however, this study was adequately performed with the free-option tool. In addition, the Azure ML platform could be used to develop a suitable model for each hospital. The findings of a predictive study with large data collected at one center may not always be suitable for use by other institutions. This discrepancy may be due to differences in race or variable values. The accuracy of the predictive tool depends on accurate variable information (e.g., histological grading or ER, progesterone receptor, and HER2 immunohistochemical results). These factors were measured according to the official international standards. However, minimal differences may exist between centers and individual patients. Our predictive model can incorporate data from other centers or hospitals and provide proper results for each center; thus, any disparity among centers or hospitals could be reduced.

We developed an additional predictive tool for distant metastatic sites based on the data from 59 patients in the distant metastasis group. A Multi-class Neural Network was used for the analysis, and the overall accuracy was 0.86 (Supplementary Figure 2). When internal validation was conducted using the 83 patients who underwent breast cancer surgery and CancerSCAN™ in 2015, this tool did not predict the sites accurately. Among the 83 patients, one had lung metastasis and one had multiple (bone and liver) metastases, whereas the predicted sites were the liver and bone, respectively. This result was likely due to the small number of patients with distant metastasis. In addition, the follow-up period was shorter than that of the modeling and test groups.

Our study had several limitations. First, our gene data were obtained from tissues collected during surgery, regardless of whether patients had received neoadjuvant therapy. Therefore, pre-existing gene alterations could not be distinguished from those induced by chemotherapy. Second, the number of patients enrolled in this study was small. Third, only an internal validation was performed. Fourth, we did not include time as a factor in our analysis. Therefore, our model predicts the presence or absence of metastasis but not when metastasis occurs.

Our predictive model is a useful and easy-to-access tool for identifying patients with distant metastases. Our model presents a way for each institution to achieve optimal results using its own variables and may also support clinical decisions regarding metastasis work-up during the follow-up period. Further evaluations with a larger patient population will improve the reliability of this model.

Notes

The authors declare that they have no competing interests.

Acknowledgements

The authors thank Dr. Sung Wook Seo for providing advice on machine learning analysis.

SUPPLEMENTARY MATERIAL

Supplementary Figure 1.

(A) List of genes in order of importance score estimated by gradient boosting: A total of 243 features (alterations, including loss-of-function, mutation, and copy number variation in 81 genes) were used as input features. The target value was alive or dead at the 5-year mark. The genes were sorted in descending order of importance. (B) Selection of the optimal number of genes: Bootstrap resampling (n=100) was performed by sequentially removing the less important genetic features, where the training sets (85%) and their corresponding test sets (15%) were resampled 100 times. The performance of each model was evaluated using the AUC values. The AUC decreased as the number of features was reduced. The AUC values of the 27 important genes (0.81; 95% CI: 0.76–0.85) were not significantly different from those with a higher number of genes. However, the AUC values of 26 important genes (0.724; 95% CI: 0.66–0.80) were significantly lower than those of the 27 important genes (p<0.001).

jbd-11-2-39-Supplementary-Fig-1.pdf

Supplementary Figure 2.

Prediction results for distant metastatic sites based on the data of 59 patients in the distant metastasis group.

Class 1: Bone; Class 2: Liver; Class 3: Lung; Class 4: Brain; Class 5: Others; Class 6: Multiple sites.

jbd-11-2-39-Supplementary-Fig-2.pdf

References

1. DeSantis CE, Ma J, Goding Sauer A, Newman LA, Jemal A. Breast cancer statistics, 2017, racial disparity in mortality by state. CA Cancer J Clin 2017;67:439–48.

2. Weigelt B, Peterse JL, Van’t Veer LJ. Breast cancer metastasis: markers and models. Nat Rev Cancer 2005;5:591–602.

3. Early Breast Cancer Trialists’ Collaborative Group. Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomised trials. Lancet 2005;365:1687–717.

4. Kimbung S, Loman N, Hedenfalk I. Clinical and molecular complexity of breast cancer metastases. Semin Cancer Biol 2015;35:85–95.

5. Smid M, Wang Y, Zhang Y, Sieuwerts AM, Yu J, Klijn JG, et al. Subtypes of breast cancer show preferential site of relapse. Cancer Res 2008;68:3108–14.

6. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 2009;27:1160–7.

7. Sopik V, Narod SA. The relationship between tumour size, nodal status and distant metastases: on the origins of breast cancer. Breast Cancer Res Treat 2018;170:647–56.

8. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005;365:671–9.

9. Cheng SHC, Huang TT, Cheng YH, Tan TBK, Horng CF, Wang YA, et al. Validation of the 18-gene classifier as a prognostic biomarker of distant metastasis in breast cancer. PloS One 2017;12:e0184372.

10. Shin HT, Choi YL, Yun JW, Kim NK, Kim SY, Jeon HJ, et al. Prevalence and detection of low-allele-fraction variants in clinical cancer samples. Nat Commun 2017;8:1377.

11. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26:589–95.

12. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.

13. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.

14. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:213–9.

15. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71.

16. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot 2013;7:21.

17. Taheri S, Hesamian G. A generalization of the Wilcoxon signed-rank test and its applications. Stat Pap 2013;54:457–70.

18. Miller S, Curran K, Lunney T. Cloud-based machine learning for the detection of anonymous web proxies. 27th Irish Signals and Systems Conference (ISSC). Londonderry: IEEE; 2016. p.1-6.

19. Hong H, Tong W, Perkins R, Fang H, Xie Q, Shi L. Multiclass decision forest-a novel pattern recognition method for multiclass classification in microarray data analysis. DNA Cell Biol 2004;23:685–94.

20. Ou G, Murphey YL. Multi-class pattern classification using neural networks. Pattern Recognit 2007;40:4–18.

21. Norton L, Massagué J. Is cancer a disease of self-seeding? Nat Med 2006;12:875–8.

22. Fidler IJ. The organ microenvironment and cancer metastasis. Differentiation 2002;70:498–505.

23. Fidler IJ. The pathogenesis of cancer metastasis: the ‘seed and soil’ hypothesis revisited. Nat Rev Cancer 2003;3:453–8.

24. Elston CW, Ellis IO. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology 1991;19:403–10.

25. Rouzier R, Pusztai L, Delaloge S, Gonzalez-Angulo AM, Andre F, Hess KR, et al. Nomograms to predict pathologic complete response and metastasis-free survival after preoperative chemotherapy for breast cancer. J Clin Oncol 2005;23:8331–9.

26. Zemmour C, Bertucci F, Finetti P, Chetrit B, Birnbaum D, Filleron T, et al. Prediction of early breast cancer metastasis from DNA microarray data using high-dimensional Cox regression models. Cancer Inform 2015;14:129–38.


This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table 1.

The baseline characteristics of enrolled patients (n=326)

Characteristic	Total	No distant metastasis group (n = 267) No. (%)	Distant metastasis group (n = 59) No. (%)	p-value
Subtype
HR+/HER2-	180	156 (58.4)	24 (40.7)	0.011
HR+/HER2+	47	37 (13.9)	10 (16.9)	0.721
HR-/HER2+	46	30 (11.2)	16 (27.1)	0.003
HR-/HER2-	53	44 (16.5)	9 (15.3)	0.626
Age (yr)
20–29	17	15 (5.6)	2 (3.4)	0.747
30–39	100	84 (31.4)	16 (27.1)	0.513
40–49	136	115 (43.1)	21 (35.6)	0.292
50–59	46	29 (10.9)	17 (28.8)	<0.001
60 ≤	27	24 (9.0)	3 (5.1)	0.438
Mean age (yr, range)		43.8 (23–79)	44.7 (25–69)	0.588
Pathology
IDC	299	242 (90.6)	57 (96.6)	0.191
Others	27	25 (9.4)	2 (3.4)
Operation
BCS	179	149 (55.8)	30 (50.8)	0.381
TM	147	118 (44.2)	29 (49.2)	0.648
SLNB	112	98 (36.7)	14 (23.7)	0.065
ALND	214	169 (63.3)	45 (76.3)	0.162
Chemotherapy
None	32	28 (10.5)	4 (6.8)	0.386
Neoadjuvant	43	24 (9.0)	19 (32.2)	<0.001
Adjuvant	251	215 (80.5)	36 (61.0)	0.001
Regimen
AC	48	39 (16.3)	9 (16.4)	0.899
AC+D/T	131	95 (39.7)	36 (65.5)	<0.001
FAC	63	58 (24.3)	5 (9.1)	0.020
CMF	34	32 (13.4)	2 (3.5)	0.051
TCHP	7	7 (2.9)	0	0.358
Others	11	8 (3.3)	3 (5.5)	0.426
Radiotherapy
Done	233	182 (68.2)	51 (86.4)	0.005
Undone	93	85 (31.8)	8 (13.6)
Hormonal therapy
Done	221	189 (70.8)	32 (54.2)	0.014
Undone	105	78 (29.2)	27 (45.8)
Target therapy
Done	90	65 (24.3)	25 (42.3)	0.007
Undone	236	202 (75.7)	34 (57.7)
Nuclear grade
1	24	22 (8.2)	2 (3.4)	0.274
2	158	136 (50.9)	22 (37.3)	0.058
3	144	109 (40.8)	35 (59.3)	0.010
T stage
0	6	5 (1.9)	1 (1.7)	1.000
1	142	120 (44.9)	22 (37.3)	0.283
2	160	130 (48.7)	30 (50.8)	0.764
3	17	12 (4.5)	5 (8.5)	0.206
4	1	0	1 (1.7)	0.181
N stage
0	166	151 (56.6)	15 (25.4)	<0.001
Isolated tumor cell	12	9 (3.4)	3 (5.1)	0.461
1	84	70 (26.2)	14 (23.7)	0.692
2	41	25 (9.4)	16 (27.1)	<0.001
3	23	12 (4.5)	11 (18.6)	0.001
Pathologic stage
0	4	3 (1.1)	1 (1.7)	0.552
I	94	83 (31.1)	11 (18.6)	0.056
II	157	139 (52.1)	18 (30.5)	0.003
III	71	42 (15.7)	29 (49.2)	<0.001

HR=hormone receptor; HER2=human epidermal growth factor receptor 2; IDC=invasive ductal carcinoma; BCS=breast conserving surgery; TM=total mastectomy; SLNB=sentinel lymph node biopsy; ALND=axillary lymph node dissection; AC=adriamycin, cyclophosphamide; AC+D/T=adriamycin, cyclophosphamide+docetaxel/taxol; FAC=fluorouracil, adriamycin, Cytoxan; CMF=cyclophosphamide, methotrexate, fluorouracil; TCHP=docetaxel, carboplatin, trastuzumab, pertuzumab.

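Table 3.

The differences of gene signature between no distant metastasis group and distant metastasis group

Gene	Cumulative counts	No distant metastasis group (n=267) No. (%)	Distant metastasis group (n=59) No. (%)	p-value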
AKT1_MUT	18	16 (6.0)	2 (3.4)	0.545
AKT3_LOF	1	0	1 (1.7)	0.181
ARID1B_MUT	24	20 (7.5)	4 (6.8)	0.850
ARID2_MUT	15	12 (4.5)	3 (5.1)	0.845
AURKA_MUT	3	2 (0.7)	1 (1.7)	0.491
BRCA1_LOF	1	0	1 (1.7)	0.033
BRCA1_MUT	21	16 (6.0)	5 (8.5)	0.482
BRCA2_LOF	15	9 (3.4)	6 (10.2)	0.024
BRCA2_MUT	35	29 (10.9)	6 (10.2)	0.877
CDH1_MUT	22	20 (7.5)	2 (3.4)	0.256
ERBB2_LOF	2	1 (0.4)	1 (1.7)	0.330
ERBB2_MUT	36	28 (10.5)	8 (13.6)	0.496
ERBB3_MUT	17	12 (4.5)	5 (8.5)	0.206
FGFR1_CNV	18	15 (5.6)	3 (5.1)	0.871
FGFR3_MUT	12	10 (3.7)	2 (3.4)	0.896
JAK2_MUT	14	11 (4.1)	3 (5.1)	0.724
KDR_MUT	6	5 (1.9)	1 (1.7)	1.000
NF1_LOF	8	7 (2.6)	1 (1.7)	1.000
NF1_MUT	16	12 (4.5)	4 (6.8)	0.503
NOTCH1_MUT	46	39 (14.6)	7 (11.9)	0.584
NPM1_MUT	1	0	1 (1.7)	0.181
PDGFRA_LOF	1	0	1 (1.7)	0.181
PIK3CA_MUT	108	92 (34.5)	16 (27.1)	0.278
PIK3R1_MUT	5	5 (1.9)	0	0.589
PTCH1_MUT	21	16 (6.0)	5 (8.5)	0.556
PTEN_CNV	4	2 (0.7)	2 (3.4)	0.151
PTEN_LOF	5	3 (1.1)	2 (3.4)	0.223
PTEN_MUT	7	7 (2.6)	0	0.358
RB1_MUT	8	5 (1.9)	3 (5.1)	0.160
RET_MUT	18	16 (6.0)	2 (3.4)	0.545
ROS1_MUT	26	22 (8.2)	4 (6.8)	1.000
SMO_MUT	6	4 (1.5)	2 (3.4)	0.298
TP53_LOF	25	18 (6.7)	7 (11.9)	0.182
TP53_MUT	107	86 (32.2)	21 (35.6)	0.616

Table 2.

Metastatic sites in distant metastasis group (n=59)

Metastasis sites	Distant metastasis group No. (%)
Bone	7 (11.9)
Lung	19 (32.2)
Brain	5 (8.5)
Liver	6 (10.2)
Others	1 (1.7)
Multiple	21 (35.6)
Bone+Lung	3
Bone+Liver	3
Bone+Peritoneal seeding	1
Lung+Liver	3
Lung+Brain	4
Lung+Peritoneal seeding	1
Bone+Lung+Liver	3
Bone+Lung+Brain	2
Lung+Brain+Liver	1
Cumulative counts in multiple metastasis
Lung	17
Bone	12
Liver	10
Brain	7
Peritoneal seeding	2