Machine learning-enabled systematic review on coded healthcare data in heart failure research
Champsi, Asgher; Slater, Karin T; Gill, Simrat; Dyszynski, Tomasz; Schröder, Megan; Suzart-Woischnik, Kiliana; Tyl, Benoit; Allée, Guillaume; Sartorius, Alfonso; Lumbers, R Thomas; Asselbergs, FW; Grobbee, DE; Gkoutos, G; Kotecha, D
Affiliation
University of Birmingham; University Hospitals Birmingham NHS Foundation Trust; Bayer AG; Boehringer Ingelheim; Bayer Healthcare SAS; Servier Laboratories; University College London; Amsterdam University Medical Center; University Medical Center Utrecht
Publication date
2025-10-23
Abstract
Aims: Coded healthcare data are now commonly used in clinical research. This study aimed to assess the transparency of reporting within heart failure studies and employ machine learning to facilitate larger-scale evaluation.
Methods & results: A systematic search of EMBASE and MEDLINE (2015-2020) identified 4279 heart failure studies with accessible Extensible Markup Language published in the top 25 journals by impact factor. Manual extraction by independent human reviewers in a random sample of 170 studies identified 40 studies (23.5%) that used coded healthcare data; 34 of these (85%) reported doing so, and only 19 (47.5%) provided clear descriptions of dataset construction and linkage. Another 420 studies were manually annotated to further train a natural language processing (NLP) model designed for this study to automate and scale up the review. The NLP model processed 3689 studies with a high level of internal accuracy (area under the receiver operating characteristic curve 0.97 and F1 score 0.96). Overall, the NLP approach identified 782 studies (21.2%) that reported coded healthcare data usage (95% CI 19.8-20.9%). No correlation was found between the reporting of coded healthcare data use and the publication year (r = -0.05; P = 0.21) or citation count (r = -0.13; P = 0.12).
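The percentages quoted above follow from the reported counts, and the three study sets (170 manually extracted, 420 annotated for training, 3689 processed by the NLP model) add up to the 4279 studies identified. A minimal Python sketch of that arithmetic is given here; the function and variable names are illustrative and are not taken from the study, and treating the three sets as a partition of the 4279 studies is an inference from the counts rather than something the abstract states.

```python
# Counts as reported in the abstract; all names below are illustrative.
TOTAL_STUDIES = 4279   # studies identified by the search
MANUAL_SAMPLE = 170    # random sample with manual extraction
TRAINING_SET = 420     # additional manually annotated studies
NLP_PROCESSED = 3689   # studies processed by the NLP model


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize() -> dict:
    """Recompute the proportions quoted in the abstract from the raw counts."""
    return {
        # Assumption: the three sets are disjoint and together cover all studies.
        "partition_ok": MANUAL_SAMPLE + TRAINING_SET + NLP_PROCESSED == TOTAL_STUDIES,
        # Manual sample: studies that used coded healthcare data.
        "manual_used_pct": pct(40, MANUAL_SAMPLE),
        # Of those 40: studies that reported using coded data.
        "manual_reported_pct": pct(34, 40),
        # Of those 40: studies with clear dataset construction and linkage.
        "manual_clear_pct": pct(19, 40),
        # NLP-processed set: studies reporting coded healthcare data use.
        "nlp_reported_pct": pct(782, NLP_PROCESSED),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

The 85% and 47.5% figures use the 40 studies that used coded data as the denominator, not the full sample of 170.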
Conclusion: One-fifth of contemporary heart failure research articles are already reporting the use of coded healthcare data, with at-scale evaluation facilitated by a machine-learning model. The limited transparency on how coded healthcare data were used in studies highlights the need for quality standards such as the CODE-EHR framework for the use of healthcare data in research.
Citation
Champsi A, Slater KT, Gill S, Dyszynski T, Schröder M, Suzart-Woischnik K, Tyl B, Allée G, Sartorius A, Lumbers RT, Asselbergs FW, Grobbee DE, Gkoutos G, Kotecha D. Machine learning-enabled systematic review on coded healthcare data in heart failure research. Eur Heart J Digit Health. 2025 Oct 23;7(1):ztaf123. doi: 10.1093/ehjdh/ztaf123.
Type
Article
