Result: Explainable machine learning for early diagnosis of esophageal cancer: A feature-enriched Light Gradient Boosting Machine framework with Shapley Additive Explanations and Local Interpretable Model-Agnostic Explanations interpretations.
Original Publication: Northampton, Eng., Cambridge Medical Publications ltd.
Further Information
Objective: Esophageal cancer is among the most rapidly spreading malignancies worldwide. Early detection is critical for disease prevention and for improving overall population health. Most studies have used statistical methodologies to assess esophageal cancer risk, and only a few have used prediction models.
Methods: The esophageal cancer dataset, comprising 3985 patient records with 85 demographic, pathological, and follow-up features, was obtained from Kaggle. A comprehensive data-engineering pipeline was implemented, including the removal of null and low-variance features, elimination of identifier variables to prevent data leakage, mode-based imputation, label encoding, and data standardization. Feature relevance was assessed using Mutual Information, and the top 31 clinically meaningful features were retained for model development. Six machine learning classifiers (Support Vector Machine, Gaussian Naïve Bayes, k-nearest neighbors, AdaBoost, Multilayer Perceptron, and Light Gradient Boosting Machine (LightGBM)) were trained and evaluated. Stratified 10-fold cross-validation was applied to preserve class proportions across folds, and GridSearchCV was used for hyperparameter optimization. Model interpretability was assessed using Shapley Additive Explanations (SHAP) for global and local feature attribution and Local Interpretable Model-Agnostic Explanations (LIME) for instance-level explanations. Furthermore, the top features identified by SHAP and LIME were used to retrain the LightGBM model to evaluate performance under reduced dimensionality.
Results: Among all evaluated classifiers, LightGBM exhibited the highest and most stable performance, achieving an accuracy of 99.87% prior to hyperparameter tuning and 99.74% following stratified cross-validated tuning, with near-perfect precision, recall, F1-score, and area under the curve values.
Explainability analyses indicated that clinically relevant variables, including tumor staging, smoking-related factors, and follow-up indicators, played a significant role in model predictions. The SHAP-selected top-20 feature model maintained high predictive performance (99.76%), demonstrating that the classifier remained robust despite dimensionality reduction.
Conclusions: The proposed LightGBM-based model demonstrates exceptional predictive accuracy and strong interpretability, suggesting its potential utility for the early detection of esophageal cancer using machine learning approaches.
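The Mutual Information ranking described in the Methods can be illustrated with a minimal sketch. It assumes discrete (label-encoded) features and uses only the Python standard library; the record does not say which implementation the authors used (scikit-learn's `mutual_info_classif` is one common choice). The feature names and toy values below are hypothetical, and `k=2` stands in for the 31 features retained in the study.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest MI against the labels."""
    scores = {name: mutual_information(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 8 patients, binary outcome, three encoded features.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "stage_flag": [1, 1, 1, 1, 0, 0, 0, 0],   # identical to the label: maximal MI
    "smoker":     [1, 1, 1, 0, 0, 0, 0, 1],   # partly informative
    "record_bit": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the label: zero MI
}

selected = top_k_features(columns, labels, k=2)
```

On this toy data the ranking keeps the two informative columns and drops the independent one. Continuous features would need binning or a density-based estimator before a count-based formula like this applies.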
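The stratified 10-fold cross-validation mentioned in the Methods can be sketched as follows. This is a simplified, deterministic illustration of the idea (each class is dealt round-robin across folds so every fold keeps roughly the overall class proportions); it is not the authors' code, and in practice a library routine such as scikit-learn's `StratifiedKFold`, usually with shuffling, would be used. The 70/30 class split is a hypothetical example.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold keeps roughly the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Hypothetical labels: 70 negative and 30 positive records.
labels = [0] * 70 + [1] * 30
folds = stratified_folds(labels, n_folds=10)

# Each fold serves once as the held-out set; the remaining indices form the training set.
splits = []
for held_out in folds:
    held_out_set = set(held_out)
    train = [i for i in range(len(labels)) if i not in held_out_set]
    splits.append((train, held_out))
```

With these labels every held-out fold contains seven negatives and three positives, matching the 70/30 ratio of the full set. A hyperparameter search would repeat model fitting over these splits for each candidate setting.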
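The SHAP attributions referred to above are based on Shapley values. The sketch below computes exact Shapley values by enumerating every feature coalition for a tiny hypothetical model. It is meant only to show what the attribution measures: the enumeration is exponential in the number of features, whereas SHAP libraries use much faster algorithms for tree models such as LightGBM. The model, the instance, and the all-zero baseline are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley value of each feature for a coalition value function."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to the coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


def model(z):
    """Hypothetical model with one interaction term between features 0 and 2."""
    return 2 * z[0] + 1 * z[1] + 4 * z[0] * z[2]


instance = (1, 1, 1)   # the record being explained (hypothetical)
baseline = (0, 0, 0)   # reference values used for "absent" features (hypothetical)


def value_fn(coalition):
    # Features in the coalition take the instance value; the rest take the baseline value.
    z = [instance[j] if j in coalition else baseline[j] for j in range(len(instance))]
    return model(z)


phi = shapley_values(value_fn, n_features=3)
```

Up to floating-point rounding, the contributions are 4, 1, and 2: the interaction term is split evenly between features 0 and 2, and the contributions sum to the difference between the model output for the instance and for the baseline. A global ranking such as a top-20 list is commonly obtained by averaging the absolute per-record values for each feature.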