Abstract
Parkinson’s Disease (PD) is a neurodegenerative disorder that lacks reliable early diagnostic tests. In this study, we present a multimodal machine learning method for PD prediction that uses data from the Parkinson’s Progression Markers
Initiative (PPMI), specifically clinico-demographic, biospecimen, and genetic data, to improve predictive accuracy and facilitate timely interventions. We evaluated data obtained from 598 participants (171 healthy
controls and 427 PD patients) across three modalities: 29 clinical features, five cerebrospinal fluid biomarkers, and 154 single nucleotide polymorphisms (SNPs), which were selected through a biology-driven feature selection method. We used
three multimodal integration strategies—early, intermediate, and late—and trained various machine learning models, including LightGBM and Multilayer Perceptron (MLP), each of which was optimized by hyperparameter tuning and cross-validation. In
early integration, we combined feature sets from all modalities into a single set, leveraging complementary information to increase predictive power. The intermediate integration method
used autoencoders to encode the features into a single 12-dimensional vector that served as input to a Neural Network classifier. Late integration combined outputs from the top-performing models for each modality using ensemble techniques such as Voting Classifier and Stacking. We found that early integration
achieved the highest performance, with Support Vector Machines
reaching 90% accuracy, 0.98 AUC-ROC, 0.99 precision, and a 0.93
F1-score. Intermediate integration with the Neural Network classifier
followed with an AUC-ROC of 0.84 and an F1-score of 0.85. Our
feature importance analysis identified two clinical scores, namely
UPSIT (related to sensory decline) and SCOPA (measuring
autonomic dysfunction), along with two SNPs (chr4 90755939 A G
and chr4 90646886 G A, both previously linked to PD
susceptibility) as crucial predictors, underscoring their
well-established relevance in PD diagnosis. Integrating multiple
data modalities facilitates early diagnosis and treatment of PD
patients, substantially improves predictive accuracy for the
disease, and provides a more thorough understanding of its
complexity.
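
The early and late integration strategies described above can be illustrated with a minimal sketch. It is not the study's pipeline: the data are synthetic stand-ins (only the feature counts 29, 5, and 154 come from the abstract), and the use of scikit-learn, the logistic-regression base models, and the names `modality_model`, `early_auc`, and `late_auc` are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): NOT PPMI data.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)                            # 0 = control, 1 = PD
clinical = rng.normal(size=(n, 29)) + 0.8 * y[:, None]    # 29 clinical features
csf = rng.normal(size=(n, 5)) + 0.3 * y[:, None]          # 5 CSF biomarkers
snps = rng.integers(0, 3, size=(n, 154)).astype(float)    # 154 SNP genotypes (0/1/2)

# Early integration: concatenate all modalities into one feature matrix
# and train a single classifier on it.
X_early = np.hstack([clinical, csf, snps])
early_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
early_auc = cross_val_score(early_model, X_early, y, cv=5, scoring="roc_auc").mean()


def modality_model(cols):
    """Hypothetical helper: a classifier that sees only one modality's columns."""
    return make_pipeline(
        ColumnTransformer([("select", StandardScaler(), cols)]),  # other columns are dropped
        LogisticRegression(max_iter=1000),
    )


# Late integration: one model per modality, combined by a stacking meta-learner.
late_model = StackingClassifier(
    estimators=[
        ("clinical", modality_model(slice(0, 29))),
        ("csf", modality_model(slice(29, 34))),
        ("snp", modality_model(slice(34, 188))),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
late_auc = cross_val_score(late_model, X_early, y, cv=5, scoring="roc_auc").mean()

print(f"early integration AUC-ROC: {early_auc:.2f}")
print(f"late integration AUC-ROC:  {late_auc:.2f}")
```

Because the data here are random, the printed scores say nothing about PD; the sketch only shows where the fusion happens in each strategy (at the feature level for early integration, at the prediction level for late integration).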