Journal of Tuberculosis and Lung Disease, 2023, Vol. 4, Issue (5): 364-369. doi: 10.19983/j.issn.2096-8493.20230087

• Original Article •

Random forest algorithm-based study of risk factors for tuberculosis incidence in an elderly mobile population

Ma Jianjun1, Zhang Tiejuan2, Zhao Qinglong2, Yu Shihui1, Mei Yang3

  1. Clinical Quality Evaluation Institute, Jilin Provincial Tuberculosis Prevention and Treatment Institute, Changchun 130062, China
  2. Jilin Provincial Center of Disease Control and Prevention (Jilin Provincial Institute of Public Health), Changchun 130062, China
  3. Chinese Center for Disease Control and Prevention, Beijing 102206, China
  • Received: 2023-08-12 Published: 2023-10-20 Released online: 2023-10-16
  • Corresponding author: Mei Yang, Email: meiyang@chinacdc.cn
  • Supported by:
    Jilin Province Health and Wellness Management Model Innovation Project (2020G007)

Abstract:

Objective: To build a random forest (machine learning) model of pulmonary tuberculosis incidence risk among the elderly mobile (migrant) population in Jilin Province and to analyze the associated risk factors, providing a reference for prevention and control strategies aimed at key tuberculosis populations.

Methods: A case-control study with a 1∶1 matched design was used. The case group comprised 281 pulmonary tuberculosis patients aged ≥60 years from the mobile population registered in Jilin Province in 2021; the control group comprised 281 sex-matched healthy individuals without local household registration. The data were randomly divided into a training set (70%, 393 participants) and a test set (30%, 169 participants), and the random forest incidence risk model was built in R version 4.2.1.

Results: The top five risk factors were a history of contact with tuberculosis patients, frequent job changes, poor personal protection, smoking, and low intake of meat, eggs, and milk, with mean decrease in Gini values of 44.344, 29.007, 21.859, 19.703, and 15.242, respectively. The optimal number of trees in the random forest model was 281, and the out-of-bag error rate was 6.44%. The area under the ROC curve was 0.967. Ten-fold cross-validation of the random forest algorithm with the caret package gave an accuracy of 93.5% and a Kappa value of 0.870.

Conclusion: Elderly members of the mobile population with a history of contact with tuberculosis patients were at the highest risk of infection. Routine tuberculosis prevention and control should emphasize isolating patients with infectious pulmonary tuberculosis and strengthening personal protection and nutritional intake.

Key words: Tuberculosis, Aged, Floating population, Machine learning, Risk factors
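
To make two of the reported quantities concrete, the following is a minimal Python sketch, not the authors' R code. It shows the impurity decrease of a single tree split, which is the per-split quantity that Gini-based variable importance aggregates across trees, and the computation of accuracy and Cohen's Kappa from a 2×2 confusion matrix. The function names and all counts are hypothetical and chosen only for illustration; they are not taken from the study data, and the exact scaling used by the R package is not reproduced here.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def gini_decrease(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`.

    Each argument is a list of per-class sample counts, e.g. [cases, controls].
    """
    n_parent = sum(parent)
    weighted_children = (
        sum(left) / n_parent * gini_impurity(left)
        + sum(right) / n_parent * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


def accuracy_and_kappa(tp, fn, fp, tn):
    """Accuracy and Cohen's Kappa for a binary confusion matrix.

    Assumes chance agreement is below 1, i.e. both classes occur.
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: product of actual and predicted marginals per class.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical split on an exposure variable: [cases, controls] per node.
print(gini_decrease(parent=[50, 50], left=[40, 10], right=[10, 40]))

# Hypothetical confusion matrix with 100 cases and 100 controls.
acc, kappa = accuracy_and_kappa(tp=93, fn=7, fp=6, tn=94)
print(round(acc, 3), round(kappa, 3))
```

When cases and controls are equally frequent, as a 1∶1 matched design would suggest, chance agreement is 0.5 and Kappa reduces to 2 × accuracy − 1. Under that assumption, an accuracy of 93.5% corresponds to a Kappa of 0.870, which is consistent with the values reported in the abstract.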

CLC Number: