A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data
DC Field | Value | Language |
---|---|---|
dc.contributor.author | 유승찬 | - |
dc.date.accessioned | 2022-11-24T00:59:40Z | - |
dc.date.available | 2022-11-24T00:59:40Z | - |
dc.date.issued | 2021-11 | - |
dc.identifier.issn | 0169-2607 | - |
dc.identifier.uri | https://ir.ymlib.yonsei.ac.kr/handle/22282913/191117 | - |
dc.description.abstract | Background and objective: As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). Methods: We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modelling can enable the rapid development towards reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world. | - |
dc.description.statementOfResponsibility | open | - |
dc.language | English | - |
dc.publisher | Elsevier Scientific Publishers | - |
dc.relation.isPartOf | COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE | - |
dc.rights | CC BY-NC-ND 2.0 KR | - |
dc.subject.MESH | COVID-19* | - |
dc.subject.MESH | Humans | - |
dc.subject.MESH | Logistic Models | - |
dc.subject.MESH | Machine Learning | - |
dc.subject.MESH | Pandemics* | - |
dc.subject.MESH | SARS-CoV-2 | - |
dc.title | A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data | - |
dc.type | Article | - |
dc.contributor.college | College of Medicine (의과대학) | - |
dc.contributor.department | Dept. of Biomedical Systems Informatics (의생명시스템정보학교실) | - |
dc.contributor.googleauthor | Sara Khalid | - |
dc.contributor.googleauthor | Cynthia Yang | - |
dc.contributor.googleauthor | Clair Blacketer | - |
dc.contributor.googleauthor | Talita Duarte-Salles | - |
dc.contributor.googleauthor | Sergio Fernández-Bertolín | - |
dc.contributor.googleauthor | Chungsoo Kim | - |
dc.contributor.googleauthor | Rae Woong Park | - |
dc.contributor.googleauthor | Jimyung Park | - |
dc.contributor.googleauthor | Martijn J Schuemie | - |
dc.contributor.googleauthor | Anthony G Sena | - |
dc.contributor.googleauthor | Marc A Suchard | - |
dc.contributor.googleauthor | Seng Chan You | - |
dc.contributor.googleauthor | Peter R Rijnbeek | - |
dc.contributor.googleauthor | Jenna M Reps | - |
dc.identifier.doi | 10.1016/j.cmpb.2021.106394 | - |
dc.contributor.localId | A02478 | - |
dc.relation.journalcode | J00637 | - |
dc.identifier.eissn | 1872-7565 | - |
dc.identifier.pmid | 34560604 | - |
dc.subject.keyword | COVID-19 | - |
dc.subject.keyword | Data harmonization | - |
dc.subject.keyword | Data quality control | - |
dc.subject.keyword | Distributed data network | - |
dc.subject.keyword | Machine learning | - |
dc.subject.keyword | Risk prediction | - |
dc.contributor.alternativeName | You, Seng Chan | - |
dc.contributor.affiliatedAuthor | 유승찬 | - |
dc.citation.volume | 211 | - |
dc.citation.startPage | 106394 | - |
dc.identifier.bibliographicCitation | COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, Vol.211 : 106394, 2021-11 | - |