Algorithm Pipeline Aids in Identification of Patients With RA

Using machine learning methods, researchers were able to extract records of patients with rheumatoid arthritis (RA) from electronic health record (EHR) data with high precision, enabling research on very large populations with this condition at limited cost. Study findings were published in JMIR Medical Informatics.

Although financial codes are often used to extract diagnoses from EHRs, this method can produce false positives. Query-like algorithms can be constructed to circumvent this issue, but they require knowledge of the diagnosis of interest and are highly language- and center-specific, the researchers explained.

“Advancements in natural language processing and machine learning have created great potential for processing format-free text data such as those in EHRs,” the authors wrote. “A major advantage of machine learning is that it can learn extraction patterns from a set of training examples, relieving the need for extensive domain knowledge.”

To develop an easily implementable workflow that builds a machine learning algorithm to accurately identify patients with RA from format-free text fields in EHRs, the researchers used 2 EHR data sets: Leiden from the Netherlands and Erlangen from Germany.

The Leiden data set consisted of information from 23,300 patients who visited the rheumatology outpatient clinic of a university medical center since 2011. A total of 11,786 patients were eligible for inclusion. Researchers randomly selected 3000 patients and extracted the entries for up to 1 year of follow-up. Data were then divided into 2 independent sets: Leiden-A (n = 2000) for model selection, training, and validation and Leiden-B (n = 1000) for independent testing.
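The random selection and split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_cohort`, the fixed seed, and the use of integer patient IDs are hypothetical and not taken from the study.

```python
import random


def split_cohort(patient_ids, n_select, n_model, seed=0):
    """Randomly pick n_select patients, then split them into two disjoint sets.

    The first n_model patients form the model-building set; the remainder
    form the independent test set. The seed is an assumption for repeatability.
    """
    rng = random.Random(seed)
    # sample() draws without replacement and returns the picks in random order,
    # so slicing the result yields a random split.
    selected = rng.sample(list(patient_ids), n_select)
    return selected[:n_model], selected[n_model:]


# Sizes from the article: 11,786 eligible, 3000 selected, 2000 + 1000 split.
set_a, set_b = split_cohort(range(11786), n_select=3000, n_model=2000)
```

Keeping the second set untouched until the final evaluation is what makes it an independent test of the selected model.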

After these analyses were performed, investigators evaluated the universal application of the pipeline by applying it to the Erlangen data set divided into model (Erlangen-A; n = 4293) and testing (Erlangen-B; n = 478) sets. A health care professional manually reviewed all entries in the Leiden and Erlangen sets.

The pipeline consisted of word segmentation, lowercase conversion, stop word removal, word normalization, and vectorization; default scikit-learn implementations were used to create the machine learning models. Seven machine learning methods were tested, and for each sample set, separate models were trained and evaluated on equally sized training and validation sets.
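To make the preprocessing steps concrete, here is a minimal standard-library sketch of segmentation, lowercasing, stop word removal, normalization, and bag-of-words vectorization. It is not the study's code: the stop word list, the crude plural-stripping `normalize` function, and all function names are illustrative assumptions, and the study itself relied on scikit-learn for the models.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; the study's actual list is not given.
STOP_WORDS = {"the", "a", "of", "with", "and", "is", "has"}


def normalize(token):
    """Crude normalization stand-in: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, segment into letter-only tokens, drop stop words, normalize."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]


def build_vocabulary(docs):
    """Map each distinct preprocessed token to a column index (sorted order)."""
    vocab = sorted({tok for doc in docs for tok in preprocess(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(text, vocab):
    """Turn one note into a vector of token counts over the vocabulary."""
    vec = [0] * len(vocab)
    for tok, n in Counter(preprocess(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


notes = ["Patient has swollen joints and erosions", "The joints of the patient"]
vocab = build_vocabulary(notes)
features = [vectorize(note, vocab) for note in notes]
```

The resulting count vectors are the kind of numeric input a classifier can be trained on; tokens unseen during vocabulary building are simply ignored at vectorization time.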

Researchers also “employed a naïve word-matching algorithm that assigns RA status to a sample when the text contained RA (in German or Dutch) or its abbreviation appeared in the chart. Each classifier gives a score between 0 and 1 that [researchers] interpreted as a probability for each sample to be a case.”
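A naive word-matching baseline of this kind might look like the sketch below. The term list is a hypothetical example (the article does not give the exact terms), and the function returns a hard 0.0 or 1.0 rather than a graded probability.

```python
import re

# Hypothetical term list: Dutch and German spellings plus the abbreviation.
RA_TERMS = ["reumatoïde artritis", "rheumatoide arthritis", "ra"]

# Word boundaries keep the abbreviation from matching inside longer words.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in RA_TERMS) + r")\b",
    re.IGNORECASE,
)


def word_match_score(text):
    """Return 1.0 if any RA term appears as a whole word in the text, else 0.0."""
    return 1.0 if _PATTERN.search(text) else 0.0


print(word_match_score("Diagnose: RA, seropositief"))
print(word_match_score("Patient with arthralgia"))
```

A matcher like this flags every mention of the term, including negated or suspected diagnoses, which is one plausible reason such a baseline can be sensitive but imprecise.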

For the Leiden data set:

  • The word-matching algorithm demonstrated mixed performance (area under the receiver operating characteristic curve [AUROC] 0.90; area under the precision recall curve [AUPRC] 0.33; F1 score 0.55)
  • 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83)
  • Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94)
  • Using the support vector machines, researchers could identify 2873 patients with RA in less than 7 seconds out of the complete collection of 23,300 patients

In the Erlangen data set:

  • Gradient boosting performed best in the training set (AUROC 0.94; AUPRC 0.85; F1 score 0.82)
  • Applying gradient boosting to the test data yielded a lower F1 score but a high PPV (F1 score 0.67; PPV 0.97)
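For readers less familiar with the metrics reported above, PPV (precision), sensitivity (recall), and the F1 score can all be derived from counts of true positives, false positives, and false negatives. The sketch below uses made-up labels purely for illustration; the function names are assumptions, not part of the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives (labels are 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def ppv(tp, fp):
    """Positive predictive value: share of predicted cases that are true cases."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Sensitivity: share of true cases that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of PPV and sensitivity, written in terms of raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels, not study data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
tp, fp, fn = confusion_counts(y_true, y_pred)
```

Because F1 balances PPV against sensitivity, a classifier can pair a high PPV with a modest F1 when it misses a substantial share of true cases.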

When cut-offs for case selection were defined in the first data sets (in both Leiden and Erlangen), aiming for either a high sensitivity or a high PPV, researchers observed robust performances in the second data sets (Leiden-B: PPV 0.94 and sensitivity 0.93; Erlangen-B: PPV 0.97 and sensitivity 0.84). In other words, depending on the chosen cut-off, case selection can be tailored to be either highly precise or highly sensitive.
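One simple way to choose such a cut-off is to scan the classifier scores of the first data set and keep the threshold that meets the chosen target. The sketch below is an assumption about how this could be done, not the authors' procedure; the function names and the toy scores are hypothetical.

```python
def pick_cutoff_for_ppv(scores, labels, target_ppv):
    """Return the lowest cut-off whose PPV on this set meets the target, or None."""
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        if (tp + fp) and tp / (tp + fp) >= target_ppv:
            return cutoff
    return None


def pick_cutoff_for_sensitivity(scores, labels, target_sensitivity):
    """Return the highest cut-off whose sensitivity meets the target, or None."""
    positives = sum(labels)
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        if positives and tp / positives >= target_sensitivity:
            return cutoff
    return None


# Toy scores and labels from a hypothetical first data set, not study data.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
precise_cutoff = pick_cutoff_for_ppv(scores, labels, target_ppv=0.9)
sensitive_cutoff = pick_cutoff_for_sensitivity(scores, labels, target_sensitivity=0.9)
```

The cut-off chosen this way would then be applied unchanged to the second data set, which is what allows its performance there to be read as an independent check.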

Because this approach is more precise, doesn’t require standardization, and ensures high performance within the center, the investigators argue their approach of making a center-specific algorithm is more viable and efficient than an application of an algorithm developed outside the center.

Deploying the pipeline does require user familiarity with implementation software, marking a limitation. To determine ease of use, a future usability study ought to be conducted, the researchers wrote. Additional limitations include the evaluation of the workflow in only 2 centers and the potential for fine-tuning hyperparameters to optimize the model’s performance.

“The workflow facilitates the production of highly reliable center-specific machine learning methods for the identification of patients with rheumatoid arthritis from format-free text fields,” the authors concluded. “Our results suggest that our workflow can easily be applied to other EHRs or other diseases and is not restrained by specific language, EHR software, or treatments. This methodology of machine learning for EHR data extraction facilitates cohort studies (with regard to cost and size).”


Maarseveen TD, Meinderink T, Reinders MJT, et al. Machine learning electronic health record identification of patients with rheumatoid arthritis: algorithm pipeline development and validation study. JMIR Med Inform. 2020;8(11):e23930. doi:10.2196/23930