Earth Science Frontiers, 2024, Vol. 31, Issue (3): 371-380. DOI: 10.13745/j.esf.sf.2023.2.40

• Groundwater and Geothermal Resources •

Risk assessment of groundwater arsenic in the Hetao Basin based on ensemble learning optimization

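FU Yu1, CAO Wengeng2,*, ZHANG Chunju3, ZHAI Wenhua1, REN Yu2, NAN Tian2, LI Zeyan2

  1. North China University of Water Resources and Electric Power, Zhengzhou 450046, Henan, China
    2. Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang 050061, Hebei, China
    3. Hefei University of Technology, Hefei 230009, Anhui, China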
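  • Received: 2022-10-28 Revised: 2022-12-27 Online: 2024-05-25 Published: 2024-05-25
  • Corresponding author: *CAO Wengeng (born 1985), male, Ph.D., associate researcher, engaged in research on hydrogeology and hydrogeochemistry. E-mail: 281084632@qq.com
  • About the author: FU Yu (born 1986), female, Ph.D., lecturer, engaged in work on geological informatization. E-mail: 378048306@qq.com
  • Funding:
    National Natural Science Foundation of China (41972262); Excellent Young Scientists Fund of the Natural Science Foundation of Hebei Province (D2020504032)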

Risk assessment of groundwater arsenic in the Hetao Basin based on ensemble learning optimization

FU Yu1, CAO Wengeng2,*, ZHANG Chunju3, ZHAI Wenhua1, REN Yu2, NAN Tian2, LI Zeyan2

  1. North China University of Water Resources and Electric Power, Zhengzhou 450046, China
    2. The Institute of Hydrogeology and Environmental Geology, Chinese Academy of Geological Sciences, Shijiazhuang 050061, China
    3. Hefei University of Technology, Hefei 230009, China
  • Received:2022-10-28 Revised:2022-12-27 Online:2024-05-25 Published:2024-05-25

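Abstract (translated from Chinese):

Arsenic in the shallow groundwater of the Hetao Basin severely exceeds the standard, and the potential high-arsenic risk seriously threatens the health of local residents. The risk distribution of high-arsenic groundwater at the macroscopic scale is still insufficiently understood. Using data from 605 shallow groundwater samples together with environmental factors such as sedimentary environment, climate, human activities, soil physicochemical characteristics, and hydrogeological conditions, this study built a Stacking ensemble learning model for high-arsenic groundwater with Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) as base learners and Linear Discriminant Analysis (LDA) as the meta-learner, predicted the groundwater arsenic risk distribution in the study area, and identified the key environmental factors controlling that distribution. The results show that the exceedance rate of groundwater arsenic concentration (>10 μg/L) in the study area is 49.59%, with exceedances mostly concentrated in the paleochannel influence zones formed by channel migration and in the crevasse splays of the Yellow River; the Stacking ensemble model is more reliable than RF, the best-performing single model, with the area under the ROC curve (AUC) and accuracy improved by 1.1% and 3.2%, respectively; the high-risk area reaches 5,257 km², accounting for 38.44% of the study area; and sedimentary environment is the key environmental factor affecting the risk distribution, contributing as much as 25.06% to model accuracy. The results provide a method and reference for mapping groundwater arsenic risk distribution and are of great significance for regional drinking-water safety and human health.

Keywords (translated from Chinese): Stacking ensemble learning, groundwater, high arsenic, risk distribution, Hetao Basin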

Abstract:

Arsenic concentrations in the shallow groundwater of the Hetao Basin seriously exceed the standard, and the associated risk poses a serious health threat to local residents. At present, the risk distribution of high-arsenic groundwater is still insufficiently understood at the macroscopic scale. Using 605 shallow groundwater samples together with environmental factors such as sedimentary environment, climate, human activities, soil physical and chemical characteristics, and hydrogeological conditions as data sources, a Stacking ensemble learning model for high-arsenic groundwater was constructed, with Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) as the base learners and Linear Discriminant Analysis (LDA) as the meta-learner. The ensemble model was used to predict the risk distribution of high-arsenic groundwater and to identify the key environmental factors affecting that distribution in the region. The results show that the exceedance rate of groundwater arsenic concentration (>10 μg/L) in the study area was 49.59%, with exceedances mainly concentrated in the paleochannel zones and crevasse splays of the Yellow River. The Stacking ensemble model was more reliable than RF, the best-performing single model: the Area Under the ROC Curve (AUC) and accuracy increased by 1.1% and 3.2%, respectively. The high-risk area reached 5,257 km², accounting for 38.44% of the total study area. The sedimentary environment is the key environmental factor affecting the risk distribution of high-arsenic groundwater, contributing up to 25.06% to the accuracy of the model. The results provide a method and reference for mapping the risk distribution of groundwater arsenic and have important implications for drinking-water safety and human health in the region.
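The Stacking architecture described in the abstract (tree-based and SVM base learners feeding an LDA meta-learner) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the 605 field samples and their environmental factors, and swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that the example has no extra dependency. All hyperparameters, the 12-feature layout, and the 70/30 split are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 605 samples, 12 hypothetical
# environmental predictors, binary label (1 = arsenic > 10 ug/L).
X, y = make_classification(
    n_samples=605, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Base learners. GradientBoostingClassifier stands in for XGBoost here;
# the SVM is wrapped in a scaler because it is sensitive to feature scale.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
]

# The meta-learner (LDA) is trained on out-of-fold class probabilities
# produced by the base learners under 5-fold cross-validation.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LinearDiscriminantAnalysis(),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

# Predicted probability of the high-arsenic class, usable as a risk score.
risk_prob = stack.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, risk_prob)
acc = accuracy_score(y_test, stack.predict(X_test))
print(f"Stacking AUC: {auc:.3f}, accuracy: {acc:.3f}")
```

Training the meta-learner on out-of-fold predictions rather than on in-sample predictions is what keeps the second stage from simply rewarding whichever base learner overfits the training data most. The scores printed here describe the synthetic data only and are not comparable to the figures reported in the abstract.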

Key words: Stacking ensemble learning, groundwater, high arsenic, risk distribution, Hetao Basin

CLC number: