Machine Learning Predicts Incidence of Gestational Diabetes

December 30, 2020
Gianna Melillo

Gianna is an associate editor of The American Journal of Managed Care® (AJMC®). She has been working on AJMC® since 2019 and has a BA in philosophy and journalism & professional writing from The College of New Jersey.

Using advanced machine learning (ML) models, researchers in China were able to accurately predict incidence of gestational diabetes mellitus (GDM) among pregnant women during their first trimester. Study findings were published in The Journal of Clinical Endocrinology and Metabolism.

GDM affects up to 15% of pregnant women around the world, and current evidence suggests that exposure of embryos or fetuses to a hyperglycemic environment in the uterus may lead to chronic health problems later in life. Because of this, both the American Diabetes Association and the International Association of Diabetes and Pregnancy Study Groups (IADPSG) recommend diagnosing GDM between weeks 24 and 28 of pregnancy.

However, “theoretically, GDM patients could have hyperglycemia for a long or short period of time before the GDM diagnosis, so the fetus will be more or less exposed to an intrauterine hyperglycemic environment in the second trimester (from 13 weeks of pregnancy to the day of the oral glucose tolerance test),” researchers wrote.

They argued that diagnosis at 24 to 28 weeks of gestation may be too late for effective intervention and that current GDM predictors are not sufficient or feasible.

To identify women at high risk of developing GDM in the first trimester, investigators used electronic health record data from a Chinese hospital and developed a clinically cost-effective 7-variable logistic regression (LR) model.

The training data set consisted of medical records from women presenting at a hospital at the Shanghai Jiao Tong University School of Medicine before 12 weeks of gestation in 2017. Any individuals with pre-GDM were excluded. The 2018 obstetrical electronic medical record data were collected and served as the testing group.

According to IADPSG guidelines, GDM was defined as meeting at least 1 of 3 thresholds on the oral glucose tolerance test: fasting plasma glucose (FPG) of at least 5.1 mmol/L, 1-hour plasma glucose of at least 10.0 mmol/L, or 2-hour plasma glucose of at least 8.5 mmol/L. A total of 16,819 cases were included in the training data set and 15,371 cases were included in the testing data set.
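The diagnostic rule above can be written as a short check. This is a minimal sketch rather than the investigators' code: the function and constant names are illustrative, and it assumes all 3 glucose values are reported in mmol/L.

```python
# IADPSG thresholds as quoted in the article, in mmol/L.
# All names here are illustrative assumptions, not taken from the study.
FASTING_THRESHOLD = 5.1
ONE_HOUR_THRESHOLD = 10.0
TWO_HOUR_THRESHOLD = 8.5


def meets_gdm_criteria(fasting, one_hour, two_hour):
    """Return True if any one glucose value reaches its threshold."""
    return (
        fasting >= FASTING_THRESHOLD
        or one_hour >= ONE_HOUR_THRESHOLD
        or two_hour >= TWO_HOUR_THRESHOLD
    )
```

For example, `meets_gdm_criteria(4.8, 10.2, 7.9)` returns `True` because the 1-hour value alone reaches its threshold.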

“To ensure better model discrimination and create an efficient approach for clinical practice with fewer redundant variables, variable selection was conducted to select a panel of biomarkers with the most discriminative power for our outcome,” the authors wrote. Specifically, indicators related to glucose and lipid metabolism had the strongest correlation with GDM.

Using the variable panel, investigators tested 4 ML methods: LR, k-nearest neighbor (KNN), support vector machine (SVM), and deep neural network (DNN). A total of 73 candidate variables, including sociodemographic characteristics, first-trimester laboratory indexes, and clinical variables, were extracted from the data sets. In addition, “6 variables, namely, age, body mass index (BMI), FPG, hemoglobin A1c (A1c), high density lipoprotein (HDL), and triglycerides, were set as categorical variables apart from continuous variables.”

Analyses revealed:

  • Incidence of GDM did not differ significantly between the training data set and the testing data set (16.0% vs 14.4%; P = .0681).
  • Using 73 variables, the DNN model achieved high discriminative power, with an area under the curve (AUC) of 0.80.
  • The 7-variable LR model also achieved effective discriminative power (AUC = 0.77).
  • Low BMI (≤17) was associated with a higher incidence of GDM than a BMI in the range of 17 to 18, the minimum risk interval (11.8% vs 8.7%; P = .0935).
  • Total triiodothyronine and total tetraiodothyronine were superior to free triiodothyronine and free tetraiodothyronine in predicting GDM.
  • Lipoprotein (a) demonstrated a promising predictive value (AUC = 0.66).

The 7-variable LR model took into account age, family history of diabetes in a first degree relative, multiple pregnancy, previous GDM history, FPG, A1c, and triglycerides.
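To show how a model of this kind turns 7 inputs into a risk estimate, here is a minimal sketch of a logistic regression prediction. The article does not report the fitted coefficients, so every number below is an invented placeholder, and the variable names and units are assumptions chosen for illustration.

```python
import math

# PLACEHOLDER values: invented for illustration, NOT taken from the study.
INTERCEPT = -6.0
COEFFICIENTS = {
    "age_years": 0.05,
    "family_history_diabetes": 0.5,   # 1 if a first-degree relative, else 0
    "multiple_pregnancy": 0.4,        # 1 if yes, else 0
    "previous_gdm": 1.2,              # 1 if yes, else 0
    "fpg_mmol_per_l": 0.6,
    "a1c_percent": 0.3,
    "triglycerides_mmol_per_l": 0.2,
}


def predicted_risk(patient):
    """Apply the logistic function to a weighted sum of the 7 inputs."""
    z = INTERCEPT + sum(
        COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Because each input contributes a single weighted term, the effect of any one variable on the estimate can be read directly from its coefficient, which is the interpretability advantage the researchers attribute to LR.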

“The advantage of DNN is its ability to capture subtle non-linear relationships between variables and outcomes,” but DNNs carry a risk of overfitting, researchers explained. In contrast, “LR highlights a clear contribution of each variable, making it useful for real-time clinical implementation.” The LR model also showed slightly better calibration than the DNN.
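Calibration describes how closely predicted risks match observed outcomes: among women given a predicted risk of about 20%, roughly 20% should go on to develop GDM. The article does not say how the investigators assessed it, so the sketch below shows one common approach, a binned comparison, with a hypothetical function name and bin count.

```python
def calibration_table(probabilities, outcomes, n_bins=5):
    """Compare mean predicted risk with the observed event rate per bin.

    Predictions are grouped into equal-width bins over [0, 1]. Each row is
    (bin index, number of cases, mean predicted risk, observed event rate).
    Empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((index, len(members), mean_predicted, observed_rate))
    return rows
```

In a well-calibrated model, the last two values in each row are close to each other across all bins.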

The lack of external validation and the fact that all data were collected from a single center are limitations of the study.

“These findings can help clinicians identify women at high risk of diabetes in early pregnancy and start interventions such as diet changes sooner,” said study author He-Feng Huang, PhD. “The artificial intelligence technology will continue to improve over time and help us better understand the risk factors for gestational diabetes.”


Wu Y, Zhang C, Mol BW, et al. Early prediction of gestational diabetes mellitus in the Chinese population via advanced machine learning. J Clin Endocrinol Metab. Published online December 22, 2020. doi:10.1210/clinem/dgaa899