Currently submitted to: JMIR AI

Date Submitted: Mar 15, 2024
Open Peer Review Period: Mar 19, 2024 - May 14, 2024
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Chronic Obstructive Pulmonary Disease in the United States: A Comparison of Multiple Linear Regression and Machine Learning Models

  • Arnold Kamis; 
  • Nidhi Gadia; 
  • Zilin Luo; 
  • Cyndi Ng; 
  • Mansi Thumbar

ABSTRACT

Background:

Lung disease is a severe problem in the United States. Despite decreasing rates of cigarette smoking, Chronic Obstructive Pulmonary Disease (COPD) continues to be a health burden. In this paper, we focus on COPD in the United States from 2016 to 2019.

Objective:

We gather a diverse set of data sources to better understand and predict COPD rates at the Core-Based Statistical Area (CBSA) level in the United States. The objective is to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD.

Methods:

We integrate data from multiple Centers for Disease Control and Prevention (CDC) sources and use them to analyze COPD with several modeling methods. We include cigarette smoking, a well-known contributing factor, as well as race / ethnicity variables, because health disparities among races and ethnicities in the United States are also well known. The models also include air quality index, education, employment, and economic variables. We fit models with both multiple linear regression and machine learning methods.

Results:

The most accurate multiple linear regression model explains 81.1% of the variance, with a Root Mean Squared Error (RMSE) of 0.73. The most accurate machine learning model explains 87.1% of the variance, with an RMSE of 0.53. Overall, cigarette smoking and household income are the strongest predictor variables. The Hispanic percentage of the Core-Based Statistical Area (CBSA) population, education, and the American Indian / Alaska Native percentage of the CBSA population are moderately strong predictors.
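
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how variance explained and Root Mean Squared Error are commonly computed. It assumes that "variance explained" means the ordinary coefficient of determination (R squared); the abstract does not say whether an adjusted or cross-validated variant was used. The function names and the numbers are hypothetical illustrations, not the authors' code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def variance_explained(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical COPD rates (percent) for four areas, plus predictions
# from two made-up models; these are NOT values from the preprint.
observed = [5.0, 6.0, 7.0, 8.0]
model_a = [5.5, 5.5, 7.5, 7.5]
model_b = [5.0, 6.0, 7.0, 8.5]

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(rmse(observed, preds), 3),
          round(variance_explained(observed, preds), 3))
```

A lower RMSE and a higher variance explained both indicate a closer fit, which is the sense in which the machine learning model is described as more accurate than the linear one.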

Conclusions:

This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model is a Support Vector Machine, which captures non-linearities and is more accurate than the best multiple linear regression model. Our interpretable models suggest ways that individual predictor variables can be used in interventions aimed at decreasing COPD rates. Gaps in understanding the health impacts of air pollution, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.


 Citation

Please cite as:

Kamis A, Gadia N, Luo Z, Ng C, Thumbar M

Chronic Obstructive Pulmonary Disease in the United States: A Comparison of Multiple Linear Regression and Machine Learning Models

JMIR Preprints. 15/03/2024:58455

DOI: 10.2196/preprints.58455

URL: https://preprints.jmir.org/preprint/58455


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.
