Training set size takes precedence over similarity: A comparative study of machine learning models for landslide prediction in the Jishishan earthquake

doi:10.13745/j.esf.sf.2025.9.3

Abstract

Abstract:

Machine learning methods are key tools for predicting earthquake-induced landslide risks, significantly improving post-seismic risk assessment. This study offers practical guidelines for such predictions by evaluating three prominent machine learning models—Random Forest (RF), Artificial Neural Network (ANN), and XGBoost—using data from eight earthquakes in western China. We predict the landslide risk following the Ms 6.2 Jishishan earthquake in Gansu and create four training sets: a heterogeneous set and a traditional set based on all eight earthquakes, and another pair of sets using four earthquakes most similar to Jishishan in magnitude and other factors. Results show that the RF model using the heterogeneous training set from all eight earthquakes achieved the highest AUC value and most accurate predictions. The XGBoost and ANN models using the traditional training set of all eight earthquakes also performed well, with higher AUC values than other sets. Conversely, models trained on the four-earthquake set constructed based on seismic similarity all demonstrated relatively low accuracy, indicating that sample size impacts model performance more than sample similarity. Additionally, all models using the traditional four-earthquake set exhibited overfitting, further demonstrating the importance of the scale of the training set. This study provides insights into training data and model selection for improving the accuracy of earthquake-induced landslide predictions and supports emergency response and disaster risk management efforts.

Key words: Jishishan earthquake, landslide, machine learning model, accuracy assessment

CLC Number:

P694

LIU Meiyu, WU Wei, WANG Hui, LUO Weier, WU Juanjuan, GUO Xudong. Training set size takes precedence over similarity: A comparative study of machine learning models for landslide prediction in the Jishishan earthquake[J]. Earth Science Frontiers, 2025, 32(5): 466-483.

Figures/Tables 12

References 76

[1]	BUDIMIR M E A, ATKINSON P M, LEWIS H G. A systematic review of landslide probability mapping using logistic regression[J]. Landslides, 2015, 12(3): 419-436.
[2]	LEE S, PRADHAN B. Landslide hazard mapping at Selangor, Malaysia using frequency ratio and logistic regression models[J]. Landslides, 2007, 4(1): 33-41.
[3]	HE S W, PAN P, DAI L, et al. Application of kernel-based Fisher discriminant analysis to map landslide susceptibility in the Qinggan River delta, Three Gorges, China[J]. Geomorphology, 2012, 171/172: 30-41.
[4]	SAHA A, TRIPATHI L, VILLURI V G K, et al. Exploring machine learning and statistical approach techniques for landslide susceptibility mapping in Siwalik Himalayan Region using geospatial technology[J]. Environmental Science and Pollution Research, 2024, 31(7): 10443-10459.
[5]	KHALIL U, IMTIAZ I, ASLAM B, et al. Comparative analysis of machine learning and multi-criteria decision making techniques for landslide susceptibility mapping of Muzaffarabad district[J]. Frontiers in Environmental Science, 2022, 10: 1028373.
[6]	HE Q, JIANG Z Y, WANG M, et al. Landslide and wildfire susceptibility assessment in Southeast Asia using ensemble machine learning methods[J]. Remote Sensing, 2021, 13(8): 1572.
[7]	XI C J, HAN M, HU X W, et al. Effectiveness of Newmark-based sampling strategy for coseismic landslide susceptibility mapping using deep learning, support vector machine, and logistic regression[J]. Bulletin of Engineering Geology and the Environment, 2022, 81(5): 174.
[8]	TIAN Y Y, XU C, MA S Y, et al. Inventory and spatial distribution of landslides triggered by the 8th August 2017 Mw 6.5 Jiuzhaigou earthquake, China[J]. Journal of Earth Science, 2019, 30(1): 206-217.
[9]	CHEN W, XIE X S, PENG J B, et al. GIS-based landslide susceptibility evaluation using a novel hybrid integration approach of bivariate statistical based random forest method[J]. Catena, 2018, 164: 135-149.
[10]	PRADHAN B, LEE S. Landslide risk analysis using artificial neural network model focusing on different training sites[J]. International Journal of Physical Sciences, 2009, 3(11): 1-15.
[11]	BISWAJEET P, SARO L. Utilization of optical remote sensing data and GIS tools for regional landslide hazard analysis using an artificial neural network model[J]. Earth Science Frontiers, 2007, 14(6): 143-151.
[12]	ZHANG J Y, MA X L, ZHANG J L, et al. Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost model[J]. Journal of Environmental Management, 2023, 332: 117357.
[13]	KAVZOGLU T, TEKE A. Advanced hyperparameter optimization for improved spatial prediction of shallow landslides using extreme gradient boosting (XGBoost)[J]. Bulletin of Engineering Geology and the Environment, 2022, 81(5): 201.
[14]	HONG H Y, PRADHAN B, JEBUR M N, et al. Spatial prediction of landslide hazard at the Luxi area (China) using support vector machines[J]. Environmental Earth Sciences, 2015, 75(1): 40.
[15]	HUANG Y, ZHAO L. Review on landslide susceptibility mapping using support vector machines[J]. Catena, 2018, 165: 520-529.
[16]	MA J W, LEI D Z, REN Z Y, et al. Automated machine learning-based landslide susceptibility mapping for the Three Gorges Reservoir area, China[J]. Mathematical Geosciences, 2024, 56(5): 975-1010.
[17]	HONG H Y. Landslide susceptibility assessment using locally weighted learning integrated with machine learning algorithms[J]. Expert Systems with Applications, 2024, 237: 121678.
[18]	SHARMA N, SAHARIA M, RAMANA G V. High resolution landslide susceptibility mapping using ensemble machine learning and geospatial big data[J]. Catena, 2024, 235: 107653.
[19]	YOUSEFI Z, ALESHEIKH A A, JAFARI A, et al. Stacking ensemble technique using optimized machine learning models with boruta-XGBoost feature selection for landslide susceptibility mapping: a case of Kermanshah province, Iran[J]. Information, 2024, 15(11): 689.
[20]	TEHRANI F S, CALVELLO M, LIU Z, et al. Machine learning and landslide studies: recent advances and applications[J]. Natural Hazards, 2022, 114(2): 1197-1245.
[21]	POURGHASEMI H R, KARIMINEJAD N, AMIRI M, et al. Assessing and mapping multi-hazard risk susceptibility using a machine learning technique[J]. Scientific Reports, 2020, 10: 3203.
[22]	KORUP O, STOLLE A. Landslide prediction from machine learning[J]. Geology Today, 2014, 30(1): 26-33.
[23]	LIU L L, YANG C, WANG X M. Landslide susceptibility assessment using feature selection-based machine learning models[J]. Geomechanics and Engineering, 2021, 25(1): 1-16.
[24]	WANG Y H, WANG L Q, LIU S L, et al. A comparative study of regional landslide susceptibility mapping with multiple machine learning models[J]. Geological Journal, 2024, 59(9): 2383-2400.
[25]	YANG S L, TAN J Y, LUO D Y, et al. Sample size effects on landslide susceptibility models: a comparative study of heuristic, statistical, machine learning, deep learning and ensemble learning models with SHAP analysis[J]. Computers & Geosciences, 2024, 193: 105723.
[26]	MICHELETTI N, FORESTI L, ROBERT S, et al. Machine learning feature selection methods for landslide susceptibility mapping[J]. Mathematical Geosciences, 2014, 46: 33-57.
[27]	GU T, DUAN P, WANG M, et al. Effects of non-landslide sampling strategies on machine learning models in landslide susceptibility mapping[J]. Scientific Reports, 2024, 14(1): 7201.
[28]	ZHAO F C, MIAO F S, WU Y P, et al. Refined landslide susceptibility mapping in township area using ensemble machine learning method under dataset replenishment strategy[J]. Gondwana Research, 2024, 131: 20-37.
[29]	HUANG F M, XIONG H W, ZHOU X T, et al. Modelling uncertainties and sensitivity analysis of landslide susceptibility prediction under different environmental factor connection methods and machine learning models[J]. KSCE Journal of Civil Engineering, 2024, 28(1): 45-62.
[30]	XU S L, SONG Y X, HAO X L. A comparative study of shallow machine learning models and deep learning models for landslide susceptibility assessment based on imbalanced data[J]. Forests, 2022, 13(11): 1908.
[31]	KADAVI P R, LEE C W, LEE S. Application of ensemble-based machine learning models to landslide susceptibility mapping[J]. Remote Sensing, 2018, 10(8): 1252.
[32]	YANG C, LIU L L, HUANG F M, et al. Machine learning-based landslide susceptibility assessment with optimized ratio of landslide to non-landslide samples[J]. Gondwana Research, 2023, 123: 198-216.
[33]	FU X, LIU Y F, ZHU Q, et al. Reliable assessment approach of landslide susceptibility in broad areas based on optimal slope units and negative samples involving priori knowledge[J]. International Journal of Digital Earth, 2022, 15(1): 2495-2510.
[34]	HUSSIN H Y, ZUMPANO V, REICHENBACH P, et al. Different landslide sampling strategies in a grid-based bi-variate statistical susceptibility model[J]. Geomorphology, 2016, 253: 508-523.
[35]	LIU L L, ZHANG Y L, XIAO T, et al. A frequency ratio-based sampling strategy for landslide susceptibility assessment[J]. Bulletin of Engineering Geology and the Environment, 2022, 81(9): 360.
[36]	LE X H, CHOI C, EU S, et al. Quantitative evaluation of uncertainty and interpretability in machine learning-based landslide susceptibility mapping through feature selection and explainable AI[J]. Frontiers in Environmental Science, 2024, 12: 1424988.
[37]	ZHU J, BAISE L G, THOMPSON E M. An updated geospatial liquefaction model for global application[J]. Bulletin of the Seismological Society of America, 2017, 107(3): 1365-1385.
[38]	NOWICKI JESSEE M A, HAMBURGER M W, ALLSTADT K, et al. A global empirical model for near-real-time assessment of seismically induced landslides[J]. Journal of Geophysical Research: Earth Surface, 2018, 123(8): 1835-1859.
[39]	SHAO X Y, MA S Y, XU C, et al. Effects of sampling intensity and non-slide/slide sample ratio on the occurrence probability of coseismic landslides[J]. Geomorphology, 2020, 363: 107222.
[40]	YANG H, SHI P J, QUINCEY D, et al. A heterogeneous sampling strategy to model earthquake-triggered landslides[J]. International Journal of Disaster Risk Science, 2023, 14(4): 636-648.
[41]	CHANG Z L, HUANG J S, HUANG F M, et al. Uncertainty analysis of non-landslide sample selection in landslide susceptibility prediction using slope unit-based machine learning models[J]. Gondwana Research, 2023, 117: 307-320.
[42]	SONG Y Z, YANG D G, WU W C, et al. Evaluating landslide susceptibility using sampling methodology and multiple machine learning models[J]. ISPRS International Journal of Geo-Information, 2023, 12(5): 197.
[43]	CHEN S, MIAO Z L, WU L X, et al. A one-class-classifier-based negative data generation method for rapid earthquake-induced landslide susceptibility mapping[J]. Frontiers in Earth Science, 2021, 9: 609896.
[44]	ZHANG Y Z, YAN Q S. Landslide susceptibility prediction based on high-trust non-landslide point selection[J]. ISPRS International Journal of Geo-Information, 2022, 11(7): 398.
[45]	ZHU Y C, SUN D L, WEN H J, et al. Considering the effect of non-landslide sample selection on landslide susceptibility assessment[J]. Geomatics, Natural Hazards and Risk, 2024, 15: 2392778.
[46]	DOU H Q, HE J B, HUANG S Y, et al. Influences of non-landslide sample selection strategies on landslide susceptibility mapping by machine learning[J]. Geomatics, Natural Hazards and Risk, 2023, 14: 2285719.
[47]	WU B, SHI Z M, ZHENG H C, et al. Impact of sampling for landslide susceptibility assessment using interpretable machine learning models[J]. Bulletin of Engineering Geology and the Environment, 2024, 83(11): 461.
[48]	XU C, XU X W, YAO X, et al. Three (nearly) complete inventories of landslides triggered by the May 12, 2008 Wenchuan Mw 7.9 earthquake of China and their spatial distribution statistical analysis[J]. Landslides, 2014, 11(3): 441-461.
[49]	XU C, XU X W, YU G H. Landslides triggered by slipping-fault-generated earthquake on a plateau: an example of the 14 April 2010, Ms 7.1, Yushu, China earthquake[J]. Landslides, 2013, 10(4): 421-431.
[50]	XU C, XU X W, SHYU J B H. Database and spatial distribution of landslides triggered by the Lushan, China Mw 6.6 earthquake of 20 April 2013[J]. Geomorphology, 2015, 248: 77-92.
[51]	XU C, XU X W, SHYU J B H, et al. Landslides triggered by the 22 July 2013 Minxian-Zhangxian, China, Mw 5.9 earthquake: inventory compiling and spatial distribution analysis[J]. Journal of Asian Earth Sciences, 2014, 92: 125-142.
[52]	WU W Y, XU C, WANG X Q, et al. Landslides triggered by the 3 August 2014 Ludian (China) Mw 6.2 earthquake: an updated inventory and analysis of their spatial distribution[J]. Journal of Earth Science, 2020, 31(4): 853-866.
[53]	XU C, WANG S Y, XU X W, et al. A panorama of landslides triggered by the 8 August 2017 Jiuzhaigou, Sichuan Ms 7.0 earthquake[J]. Seismology and Geology, 2018, 40(1): 232-260.
[54]	HU K H, ZHANG X P, YOU Y, et al. Landslides and dammed lakes triggered by the 2017 Ms 6.9 Milin earthquake in the Tsangpo gorge[J]. Landslides, 2019, 16(5): 993-1001.
[55]	SHAO X Y, XU C, MA S Y. Preliminary analysis of coseismic landslides induced by the 1 June 2022 Ms 6.1 Lushan earthquake, China[J]. Sustainability, 2022, 14(24): 16554.
[56]	FAN X M, YUNUS A P, SCARINGI G, et al. Rapidly evolving controls of landslides after a strong earthquake and implications for hazard assessments[J]. Geophysical Research Letters, 2021, 48: e2020GL090509.
[57]	CHEN W, PANAHI M, POURGHASEMI H R. Performance evaluation of GIS-based new ensemble data mining techniques of adaptive neuro-fuzzy inference system (ANFIS) with genetic algorithm (GA), differential evolution (DE), and particle swarm optimization (PSO) for landslide spatial modelling[J]. Catena, 2017, 157: 310-324.
[58]	PHAM B T, TIEN BUI D, POURGHASEMI H R, et al. Landslide susceptibility assessment in the Uttarakhand area (India) using GIS: a comparison study of prediction capability of naïve Bayes, multilayer perceptron neural networks, and functional trees methods[J]. Theoretical and Applied Climatology, 2017, 128(1): 255-273.
[59]	TIEN BUI D, TUAN T A, KLEMPE H, et al. Spatial prediction models for shallow landslide hazards: a comparative assessment of the efficacy of support vector machines, artificial neural networks, kernel logistic regression, and logistic model tree[J]. Landslides, 2016, 13(2): 361-378.
[60]	YOUSSEF A M, POURGHASEMI H R, EL-HADDAD B A, et al. Landslide susceptibility maps using different probabilistic and bivariate statistical models and comparison of their performance at Wadi Itwad Basin, Asir Region, Saudi Arabia[J]. Bulletin of Engineering Geology and the Environment, 2016, 75(1): 63-87.
[61]	YANG J, HUANG X. 30 m annual land cover and its dynamics in China from 1990 to 2019[J]. Earth System Science Data, 2021, 13(8): 3907-3925.
[62]	HALL M A. Correlation-based feature selection for machine learning[D]. Hamilton: The University of Waikato, 1999.
[63]	肖克炎, 李程, 唐瑞, 等. 大数据智能预测评价[J]. 地学前缘, 2025, 32(4): 20-37.
[64]	CHEN C, FAN L. Selection of contributing factors for predicting landslide susceptibility using machine learning and deep learning models[J]. Stochastic Environmental Research and Risk Assessment, 2023: 1-26. https://doi.org/10.1007/s00477-023-02556-4.
[65]	LIU Z Q, GILBERT G, CEPEDA J M, et al. Modelling of shallow landslides with machine learning algorithms[J]. Geoscience Frontiers, 2021, 12(1): 385-393.
[66]	LI W, XU Q, LI Q, et al. Preliminary analysis of the coseismic geohazards induced by the 2023 Jishishan Ms 6.2 earthquake[J]. Journal of Chengdu University of Technology(Science & Technology Edition), 2024, 51(1): 33-45.
[67]	VABALAS A, GOWEN E, POLIAKOFF E, et al. Machine learning algorithm validation with a limited sample size[J]. PLoS One, 2019, 14(11): e0224365.
[68]	LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[69]	MEENA S R, SOARES L P, GROHMANN C H, et al. Landslide detection in the Himalayas using machine learning algorithms and U-Net[J]. Landslides, 2022, 19(5): 1209-1229.
[70]	YOUSSEF A M, EL-HADDAD B A, SKILODIMOU H D, et al. Landslide susceptibility, ensemble machine learning, and accuracy methods in the southern Sinai Peninsula, Egypt: assessment and mapping[J]. Natural Hazards, 2024, 120(15): 14227-14258.
[71]	YANG D D, QIU H J, YE B F, et al. Distribution and recurrence of warming-induced retrogressive thaw slumps on the central Qinghai-Tibet Plateau[J]. Journal of Geophysical Research: Earth Surface, 2023, 128(8): e2022JF007047.
[72]	AI X, SUN B T, CHEN X Z. Construction of small sample seismic landslide susceptibility evaluation model based on Transfer Learning: a case study of Jiuzhaigou earthquake[J]. Bulletin of Engineering Geology and the Environment, 2022, 81(3): 116.
[73]	VALIANTE M, GUIDA D, DELLA SETA M, et al. A spatiotemporal object-oriented data model for landslides (LOOM)[J]. Landslides, 2021, 18(4): 1231-1244.
[74]	HOANG D A, LE H V, PHAM D V, et al. Hybrid BBO-DE optimized SPAARCTree ensemble for landslide susceptibility mapping[J]. Remote Sensing, 2023, 15(8): 2187.
[75]	NEZHAD S K, BAROONI M, VELIOGLU SOGUT D, et al. Ensemble neural networks for the development of storm surge flood modeling: a comprehensive review[J]. Journal of Marine Science and Engineering, 2023, 11(11): 2154.
[76]	DE SANTI M, KHAN U T, ARNOLD M, et al. Forecasting point-of-consumption chlorine residual in refugee settlements using ensembles of artificial neural networks[J]. NPJ Clean Water, 2021, 4: 35.

因子	异构4次地震	异构8次地震	传统4次地震	传统8次地震
LCD	0.023 778	0.008 591	0.007 332	0.016 242
Aspect	0.133 877	0.118 729	0.007 290	0.094 656
Slope	0.139 360	0.134 182	0.032 378	0.124 060
DEM	0.170 792	0.144 311	0.027 497	0.132 693
Fault	0.153 277	0.143 819	0.015 115	0.139 639
PGA	0.080 306	0.183 184	0.022 557	0.265 930
River	0.147 946	0.133 952	0.011 720	0.106 703
NDVI	0.150 664	0.133 232	0.876 112	0.120 077

因子	异构4次地震	异构8次地震	传统4次地震	传统8次地震
LCD	0.023 778	0.008 591	0.007 332	0.016 242
Aspect	0.133 877	0.118 729	0.007 290	0.094 656
Slope	0.139 360	0.134 182	0.032 378	0.124 060
DEM	0.170 792	0.144 311	0.027 497	0.132 693
Fault	0.153 277	0.143 819	0.015 115	0.139 639
PGA	0.080 306	0.183 184	0.022 557	0.265 930
River	0.147 946	0.133 952	0.011 720	0.106 703
NDVI	0.150 664	0.133 232	0.876 112	0.120 077

因子	异构4次地震	异构8次地震	传统4次地震	传统8次地震
LCD	0.138 074	0.039 520	0.009 724	0.068 789
Aspect	0.071 046	0.021 518	0.001 652	0.037 196
Slope	0.136 052	0.155 870	0.006 470	0.127 573
DEM	0.147 697	0.107 061	0.005 043	0.089 514
Fault	0.091 351	0.069 240	0.002 893	0.074 135
PGA	0.186 176	0.508 977	0.000 959	0.470 988
River	0.088 134	0.048 028	0.002 270	0.050 365
NDVI	0.141 469	0.049 787	0.970 990	0.081 440

因子	异构4次地震	异构8次地震	传统4次地震	传统8次地震
LCD	0.138 074	0.039 520	0.009 724	0.068 789
Aspect	0.071 046	0.021 518	0.001 652	0.037 196
Slope	0.136 052	0.155 870	0.006 470	0.127 573
DEM	0.147 697	0.107 061	0.005 043	0.089 514
Fault	0.091 351	0.069 240	0.002 893	0.074 135
PGA	0.186 176	0.508 977	0.000 959	0.470 988
River	0.088 134	0.048 028	0.002 270	0.050 365
NDVI	0.141 469	0.049 787	0.970 990	0.081 440

模型	定量评估准确性/%
模型	异构4次地震	异构8次地震	传统8次地震
随机森林	55.00	60.35	53.60
XGBoost	33.60	27.09	45.35
人工神经网络	83.84	22.79	29.65