Search Results
As presented in Fig. 1, we identified a total of 852 citations from searching the bibliographic databases. EndNote identified and removed 344 duplicates from the retrieved citations. Screening the titles and abstracts of the remaining 508 citations led to the exclusion of 446 citations. After reading the full text of the remaining 62 publications, we excluded 48 publications. One additional systematic review was identified by checking the reference lists of the included reviews. In total, 15 systematic reviews were included in the current review17,18,19,20,21,22,23,24,25,26,27,28,29,30,31.
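The screening counts reported above can be checked with simple arithmetic. The following is only a consistency check on the numbers in this paragraph; the variable names are illustrative and do not come from the review.

```python
# Counts as reported in the text above; variable names are illustrative.
retrieved = 852
duplicates_removed = 344
excluded_on_title_abstract = 446
excluded_on_full_text = 48
added_from_reference_lists = 1

after_deduplication = retrieved - duplicates_removed
full_text_assessed = after_deduplication - excluded_on_title_abstract
included = full_text_assessed - excluded_on_full_text + added_from_reference_lists

print(after_deduplication, full_text_assessed, included)
```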
Characteristics of included reviews
The included reviews were published between 2017 and 2020, and more than half of them (n = 8) were published in 2020 (Table 1). The included reviews were conducted in 7 different countries, but more than half of them were conducted in Italy (n = 5) and the United Kingdom (n = 4). All included reviews were articles in peer-reviewed journals. Only four reviews had a registered protocol. All reviews except one stated that they followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
With regard to the eligibility criteria, the included reviews focused on diagnosing 10 mental disorders, most commonly Alzheimer’s disease (AD) (n = 7), mild cognitive impairment (MCI) (n = 6), and schizophrenia (SCZ) (n = 3) (Table 2). While seven reviews focused on any AI approach, another seven focused solely on supervised machine learning (SML), and one review focused on deep learning (DL). SML uses labeled datasets to train algorithms to predict or label new, unseen examples; it is used for classification and regression. Unsupervised machine learning (UML), by contrast, analyzes unlabeled data to discover hidden features, patterns, and relationships. Clustering, association, and dimensionality reduction are three major applications of unsupervised learning models. It is worth mentioning that most deep learning applications are based on supervised learning. More than half of the reviews (n = 8) focused on neuroimaging data for diagnosing mental disorders. While seven reviews restricted the search to studies in the English language, six reviews imposed no language restriction. Eight reviews applied time restrictions to the search, while the remaining reviews did not.
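The distinction between supervised and unsupervised learning described above can be illustrated with a minimal sketch. The data, the nearest-centroid classifier, and the two-cluster k-means routine below are all assumptions chosen for illustration; they are not methods or data taken from any included review.

```python
from statistics import mean

# Toy 1-D feature values (hypothetical, for illustration only).
features = [1.0, 1.2, 0.8, 4.0, 4.3, 3.9]
labels = ["HC", "HC", "HC", "AD", "AD", "AD"]  # used only by the supervised model


def fit_nearest_centroid(xs, ys):
    """Supervised: learn one centroid per label from labeled examples."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def predict(centroids, x):
    """Assign a new example to the label with the closest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k = 2)."""
    low, high = min(xs), max(xs)  # simple deterministic initialisation
    for _ in range(iterations):
        cluster_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        cluster_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(cluster_low), mean(cluster_high)
    return cluster_low, cluster_high


centroids = fit_nearest_centroid(features, labels)
print(predict(centroids, 3.5))  # label predicted for a new, unseen example
print(two_means(features))      # groups found without using any labels
```

The supervised model needs the `labels` list to learn, whereas `two_means` receives only the feature values and recovers the grouping on its own.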
Varied numbers of electronic databases were searched in the included reviews. The most common databases used in the included reviews are MEDLINE (n = 13), Web of Science (n = 7), EMBASE (n = 6), PsycINFO (n = 5), and Scopus (n = 4) (Table 3). Eight studies used either backward reference list checking (n = 7) or forward reference list checking (n = 1) to identify further studies. Two independent reviewers carried out the study selection process in twelve reviews, performed data extraction in four reviews, and assessed study quality in two reviews. The quality of studies was assessed in nine reviews using six different tools such as a revised tool for Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) and Jadad rating system. Four reviews synthesized the data using meta-analysis.
The number of retrieved studies in the included reviews ranged from 52 to 7,991 (Table 4). The number of included studies in the included reviews ranged from 12 to 114. The size of data sets used to train and validate models in the included studies ranged between 10 and 7,026 data points. The included studies in the included reviews used different types of data to train and validate models, namely: neuroimaging data (n = 13), neuropsychological data (n = 6), genetic data (n = 4), and electroencephalography (EEG) measures (n = 4). As shown in Table 5, many methods were used in the included studies, and the most common ones were Support Vector Machine (SVM) (n = 13), Random Forest (RF) (n = 10), Naïve Bayes (NB) (n = 7), k-Nearest Neighbors (k-NN) (n = 5), and Linear Discriminant Analysis (LDA) (n = 5). The models in the included reviews were validated using only internal validation methods (n = 6) or both internal and external validation methods (n = 3).
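To make the difference between internal and external validation concrete, the sketch below shows k-fold cross-validation, one common form of internal validation. The text does not state which internal validation methods the included studies used, so the choice of k-fold, the function name `k_fold_splits`, and the cohort size are assumptions for illustration.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Internal validation: each sample of the development cohort is held out exactly once.
development_cohort_size = 10  # hypothetical
for train, test in k_fold_splits(development_cohort_size, k=5):
    print("train:", train, "test:", test)

# External validation would instead fit the model on the whole development
# cohort and evaluate it once on a separate, independently collected cohort.
```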
Results of study quality appraisal
Two-thirds of the included reviews clearly stated the review question or aim by identifying the AI approach of interest and its aim, the target disease, and type of data for the model development (Fig. 2). The eligibility criteria were detailed, clear, and matched the review question in 13 reviews. Six reviews showed a clear and adequate search strategy that contained all search terms related to the topic, Subject Headings, and limits. Fewer than half (n = 7) of the included reviews used adequate search sources such as searching multiple major databases and backward and forward reference list checking. Only five reviews assessed the quality of the included studies using a tool suitable for the review question. The quality assessment was carried out by two or more reviewers independently in only a single review. In three reviews, bias and errors in data extraction were minimal, given that at least two reviewers independently extracted the data using a piloted tool. Publication bias and its potential impact on the findings were assessed in only one review. All included reviews used an adequate approach for data synthesis and provided relevant research and practical implications based on the findings. Supplementary Table 1 shows reviewers’ judgments about each appraisal item for each included review.
Results of studies
The included reviews assessed the performance of AI models in diagnosing 8 mental disorders: Alzheimer’s disease, mild cognitive impairment, schizophrenia, autism spectrum disorder, bipolar disorder, obsessive-compulsive disorder, post-traumatic stress disorder, and psychotic disorders. The performance of the AI models in diagnosing these mental disorders is presented in the next subsections.
Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by an ongoing decline in brain functions such as memory, executive functions, and language processing32. Four reviews assessed the performance of AI classifiers in differentiating AD from healthy control (HC) using neuroimaging data17,18,19,20 (Table 6). The number of mutual studies was five between Pellegrini et al.17 and Ebrahimighahnavieh et al.20 and four between Pellegrini et al.17 and Sarica et al.19. Accuracy, sensitivity, and specificity of the classifiers in these four reviews ranged from 56% to 100%, 37.3% to 100%, and 55% to 100%, respectively (Table 6). None of these reviews pooled the results using meta-analysis due to the high heterogeneity in the used classifiers, data types, data features, and types of validation.
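For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, and specificity follow from the counts in a two-class confusion matrix. The counts are hypothetical and the function name `diagnostic_metrics` is illustrative; neither comes from the included reviews.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # proportion of patients correctly identified
    specificity = tn / (tn + fp)  # proportion of controls correctly identified
    return accuracy, sensitivity, specificity


# Hypothetical counts for an AD-versus-HC classifier, for illustration only.
accuracy, sensitivity, specificity = diagnostic_metrics(tp=45, fp=10, tn=40, fn=5)
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Because the three metrics respond differently to class imbalance and decision thresholds, a classifier can show high accuracy alongside low sensitivity, which is one reason the reported ranges differ across the three measures.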
Two other reviews examined the performance of AI classifiers in differentiating AD from HC using neuropsychological data21,22. There are four mutual studies between the two reviews. Accuracy of the classifiers in these reviews ranged from 68% to 100% (Table 6). One of these reviews meta-analyzed sensitivities and specificities reported in eleven studies and showed a pooled sensitivity of 92% and a pooled specificity of 86%22.
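As a rough illustration of what pooling sensitivities and specificities means, the sketch below sums confusion-matrix counts across studies. This is a deliberately naive simplification that ignores between-study heterogeneity; the text does not state which pooling model the review used, and the per-study counts are hypothetical.

```python
# Hypothetical per-study counts (tp, fn, tn, fp); not data from any included review.
studies = [
    (45, 5, 40, 10),
    (30, 10, 35, 5),
    (18, 2, 15, 5),
]


def pooled_sensitivity_specificity(study_counts):
    """Pool by summing counts across studies (naive pooling, no random effects)."""
    tp = sum(s[0] for s in study_counts)
    fn = sum(s[1] for s in study_counts)
    tn = sum(s[2] for s in study_counts)
    fp = sum(s[3] for s in study_counts)
    return tp / (tp + fn), tn / (tn + fp)


pooled_sens, pooled_spec = pooled_sensitivity_specificity(studies)
print(f"pooled sensitivity={pooled_sens:.3f}, pooled specificity={pooled_spec:.3f}")
```

Published diagnostic meta-analyses typically rely on dedicated statistical models rather than raw summation, so this sketch conveys the idea of pooling and not a recommended procedure.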
Three reviews examined the performance of AI classifiers in differentiating AD from mild cognitive impairment (MCI) using neuroimaging data17,18,20 (Table 7). There are five mutual studies between Pellegrini et al.17 and Ebrahimighahnavieh et al.20. Accuracy, sensitivity, and specificity of the classifiers in these three reviews ranged from 56% to 100%, 40.3% to 100%, and 67% to 100%, respectively (Table 7). None of these reviews pooled the results using meta-analysis due to the high heterogeneity. One other review examined the performance of AI classifiers in differentiating AD from MCI using neuropsychological data21. Accuracy of the classifiers in that review ranged from 68% to 86% (Table 7).
One review assessed the performance of AI classifiers in differentiating AD from Lewy body dementia (LBD) using EEG measures23. Accuracy, sensitivity, specificity, and AUC of the classifiers in this review ranged from 66% to 100%, 76% to 100%, 77% to 100%, and 78% to 93%, respectively.
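The AUC reported above can be read as the probability that a classifier scores a randomly chosen patient from one group higher than a randomly chosen patient from the other. The sketch below computes it directly from that definition; the scores are hypothetical and the function name `auc` is illustrative.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores, for illustration only.
ad_scores = [0.9, 0.8, 0.6, 0.4]   # scores assigned to AD patients
lbd_scores = [0.7, 0.5, 0.3, 0.2]  # scores assigned to LBD patients
print(auc(ad_scores, lbd_scores))
```

Unlike accuracy, sensitivity, and specificity, this quantity does not depend on a single decision threshold.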
Mild cognitive impairment (MCI) refers to deterioration in cognitive functions (e.g., memory, thinking, and language) that is detectable but less severe than the deterioration in patients with AD33. MCI represents a transitional stage between the expected cognitive decline associated with normal aging and the more severe decline of dementia33. Four reviews assessed the performance of AI classifiers in differentiating MCI from HC using neuroimaging data17,18,19,20 (Table 8). The number of mutual studies was five between Pellegrini et al.17 and Ebrahimighahnavieh et al.20 and four between Pellegrini et al.17 and Sarica et al.19. Accuracy, sensitivity, and specificity of the classifiers in these four reviews ranged from 47% to 99.2%, 24.3% to 98.3%, and 47.1% to 97%, respectively (Table 8). None of these reviews pooled the results using meta-analysis due to the high heterogeneity.
Two other reviews examined the performance of AI classifiers in differentiating MCI from HC using neuropsychological data21,22. Four studies were mutual studies between the two reviews. Accuracy of the classifiers in these reviews ranged from 60% to 98% (Table 8). Only one of these reviews meta-analyzed sensitivities and specificities reported in nine studies and showed pooled sensitivity and specificity of 83% each22.
Three reviews examined the performance of AI classifiers in differentiating MCI converting to AD (MCIc) from MCI non-converting to AD (MCInc) using neuroimaging data17,19,20 (Table 9). The number of mutual studies was five between Pellegrini et al.17 and Ebrahimighahnavieh et al.20 and four between Pellegrini et al.17 and Sarica et al.19. Accuracy, sensitivity, and specificity of the classifiers in these three reviews ranged from 47% to 96.2%, 42.1% to 99%, and 51.2% to 95.2%, respectively (Table 9). None of these reviews pooled the results using meta-analysis due to the high heterogeneity.
Another review examined the performance of AI classifiers in differentiating MCIc from MCInc using neuropsychological data22. Accuracy, sensitivity, specificity, and AUC of the classifiers in this review ranged from 61% to 85%, 50% to 91%, 48% to 91%, and 67% to 93%, respectively. This review meta-analyzed sensitivities and specificities reported in ten studies and showed a pooled sensitivity of 73% and a pooled specificity of 69%.
Schizophrenia (SCZ) is a long-term serious mental disorder in which patients are not able to differentiate their thoughts from reality due to disturbances in cognition, emotional responsiveness, and behavior34. Two reviews investigated the performance of AI classifiers in differentiating SCZ from HC using neuroimaging data24,25. There are 15 mutual studies between the two reviews. Accuracy, sensitivity, and specificity of the classifiers in the two reviews ranged from 61% to 99.3%, 57.9% to 100%, and 40.9% to 98.6%, respectively (Table 10). Neither of these reviews pooled the results using meta-analysis. One review examined the performance of AI classifiers in differentiating SCZ from HC using genetic data26. Accuracy and AUC of the classifiers in this review ranged from 40% to 86% and 54% to 95%, respectively.
Bipolar disorder (BD) is a mood disorder that is characterized by mood fluctuations between symptoms of mania or hypomania and depression35. One review assessed the performance of AI classifiers in differentiating BD from HC using neuroimaging data27. Accuracy, sensitivity, and specificity of the classifiers ranged from 55% to 100%, 40% to 100%, and 49% to 100%, respectively (Table 11). This review also examined the performance of AI classifiers in differentiating BD from HC using neuropsychological data27. Accuracy of the classifiers ranged from 71% to 96.4% (Table 11). This review further investigated the performance of AI classifiers in differentiating BD from major depressive disorder using neuroimaging data. Accuracy, sensitivity, and specificity of the classifiers ranged from 54.76% to 92.1% (n = 7), 57.9% to 83% (n = 3), and 52.1% to 90.9% (n = 3), respectively. Another review used genetic data and AI classifiers to differentiate BD from HC26. Accuracy and AUC of the classifiers ranged from 54% to 77% and 48% to 65%, respectively (Table 11).
Autism spectrum disorder (ASD) is a group of disorders (e.g., autism, childhood disintegrative disorder, and Asperger’s disorder) that usually starts in the preschool period and is characterized by difficulties or impairment in communication and social interaction36. One review investigated the performance of AI classifiers in differentiating ASD from HC using neuroimaging data28. Accuracy, sensitivity, and specificity of the classifiers in the review ranged from 45% to 97%, 24% to 100%, and 21% to 100%, respectively (Table 12). The review meta-analyzed sensitivities and specificities of AI classifiers based on structural MRI (sMRI) in 11 studies. The review found a pooled sensitivity of 83%, a pooled specificity of 84%, and a pooled AUC of 90%28. The review also meta-analyzed sensitivities and specificities of deep neural network-based classifiers in one study (five samples) that used functional MRI (fMRI) as a predictor. The review found a pooled sensitivity of 69%, a pooled specificity of 66%, and a pooled AUC of 71%28.
The review assessed the performance of AI classifiers in differentiating ASD from HC using a neuropsychological test (behavior traits)28. Accuracy, sensitivity, and specificity of the classifiers in the review ranged from 78.1% to 100%, 64% to 100%, and 48% to 97%, respectively (Table 12). Further, the review tested the performance of AI classifiers in differentiating ASD from HC using biochemical features28. Accuracy, sensitivity, and specificity of the classifiers in the review ranged from 75% to 94%, 77% to 94%, and 67% to 93%, respectively (Table 12). The review also examined the performance of AI classifiers in differentiating ASD from HC using EEG measures28. Accuracy, sensitivity, and specificity of the classifiers in the review ranged from 85% to 100%, 94% to 97%, and 81% to 94%, respectively (Table 12). The review did not conduct a meta-analysis for the above-mentioned results due to heterogeneity between samples28.
Posttraumatic stress disorder (PTSD) refers to feelings of fear, anxiety, irritability, terror, or guilt that result from remembering very stressful, life-threatening, frightening, or distressing events that a patient lived through or witnessed in the past37. One review examined the performance of AI classifiers in differentiating PTSD from HC29. Accuracy of the classifiers using neuroimaging data ranged from 89.2% to 92.3% (n = 3). The review also assessed the performance of AI classifiers in differentiating PTSD from trauma-exposed controls29. Accuracy of the classifiers using neuroimaging data ranged from 67% to 83.6% (n = 4). Meta-analysis was not carried out in the review.
Obsessive-compulsive disorder (OCD) is a mental health condition in which an individual has frequent intrusive thoughts that lead him or her to perform repetitive behaviors, which may affect daily activities and cause severe distress38. One review assessed the performance of supervised machine learning classifiers in distinguishing OCD from HC using neuroimaging data30. Accuracy, sensitivity, and specificity of the classifiers in the review ranged from 66% to 100% (n = 11), 74.1% to 96.2% (n = 6), and 72.7% to 95% (n = 6), respectively. The review did not pool the results using meta-analysis.
Psychotic disorders are a group of mental disorders in which a patient has incorrect perceptions, thoughts, and inferences about external reality despite contrary evidence39. One review examined the performance of AI classifiers in differentiating patients with a high risk of developing psychotic disorders from HC using neuroimaging data or neuropsychological tests31. Sensitivity and specificity of the classifiers in the review ranged from 60% to 96% (n = 12) and 47% to 94% (n = 12), respectively. The review meta-analyzed sensitivities and specificities of AI classifiers in 12 studies and found a pooled sensitivity of 78% and a pooled specificity of 77%31.