Autism
CONTENTS
4 PROJECT DESCRIPTION
4.1 EXISTING SYSTEM
4.1.1 DRAWBACKS OF EXISTING SYSTEM
4.2 PROPOSED SYSTEM
4.3 MODULE DESCRIPTION
4.3.1 DATA COLLECTION
4.3.2 PRE-PROCESSING
4.3.3 IDENTIFICATION OF FXS
4.3.4 IDENTIFICATION OF ASD IN FXS
4.4 ADVANTAGES OF PROPOSED SYSTEM
5 IMPLEMENTATION AND RESULTS
6 CONCLUSION AND FUTURE ENHANCEMENT
APPENDIX
SOURCE CODE
REFERENCE
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
AI - Artificial Intelligence
ML - Machine Learning
SVM - Support Vector Machine
RF - Random Forest
ASD - Autism Spectrum Disorder
FXS - Fragile X Syndrome
CNN - Convolutional Neural Network
EEG - Electroencephalography
fMRI - Functional Magnetic Resonance Imaging
AQ - Autism Spectrum Quotient
CBCL - Child Behavior Checklist
CARS - Childhood Autism Rating Scale
RBS-R - Repetitive Behavior Scale – Revised
DSM - Diagnostic And Statistical Manual Of Mental Disorders
RBF - Radial Basis Function
ABIDE - Autism Brain Imaging Data Exchange
BASC - Bootstrap Analysis Of Stable Clusters
ROI - Region Of Interest
SDAE - Stacked Denoising Autoencoder
SMOTE - Synthetic Minority Over-Sampling Technique
GABA - Gamma-Aminobutyric Acid
IDDs - Intellectual And Developmental Disabilities
PCR - Polymerase Chain Reaction
ADHD - Attention Deficit Hyperactivity Disorder
CBT - Cognitive Behavioral Therapy
ROC - Receiver Operating Characteristic
TP - True Positive
FP - False Positive
TN - True Negative
FN - False Negative
CHAPTER 1
INTRODUCTION
1.1.1 MACHINE LEARNING (ML)
Support Vector Machine (SVM) is a supervised learning algorithm used for classification
and regression tasks, known for its ability to identify complex patterns and relationships in data.
It works by finding the optimal hyperplane that best separates different classes while
maximizing the margin between them. SVM is particularly useful in high-dimensional spaces
where data points are not easily separable using simple linear boundaries. To handle non-
linearly separable data, SVM utilizes kernel functions such as polynomial, radial basis function
(RBF), and sigmoid, which map input data into higher-dimensional spaces where classification
becomes more efficient. This capability makes SVM widely applicable in fields such as medical
diagnosis, image classification, speech recognition, and text categorization. In healthcare, SVM
has been extensively used to analyze medical datasets and predict disease outcomes by
distinguishing between normal and abnormal patterns in clinical data. It has also played a
crucial role in detecting early indicators of neurological and psychological conditions by
processing large-scale datasets and identifying hidden correlations. The precision of SVM in
classification tasks, coupled with its adaptability in different domains, has made it a preferred
choice for researchers and practitioners working with structured and unstructured data.
Furthermore, SVM has been effectively implemented in bioinformatics for gene expression
analysis and protein classification, enhancing advancements in medical research and
personalized treatment planning. With its ability to generalize well across different datasets,
SVM continues to be a key tool in predictive modeling and decision-making applications,
contributing significantly to technological advancements across various industries.
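To make the kernel idea concrete, the following is a minimal sketch of the RBF kernel written in plain Python. The function name `rbf_kernel`, the `gamma` value, and the sample points are illustrative assumptions and are not taken from the project's source code. The kernel returns a similarity close to 1 for nearby points and close to 0 for distant ones, which is what lets an SVM draw non-linear boundaries.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Illustrative points (hypothetical values, not project data).
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0])  # small distance
far = rbf_kernel([1.0, 2.0], [4.0, 6.0])   # large distance
print(same, near, far)
```

A larger `gamma` makes the similarity fall off faster with distance, which tends to produce a more tightly fitted decision boundary.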
Random Forest (RF) is an ensemble learning algorithm that constructs multiple decision
trees to improve prediction accuracy and reliability. By aggregating the outputs of numerous
trees, RF enhances classification and regression performance, making it a powerful tool for
handling complex datasets. Each tree in the forest is built using a randomly selected subset of
data and features, ensuring diversity in decision-making and improving the overall predictive
capability of the model. This approach allows RF to capture intricate patterns in data, making it
highly effective in domains such as medical research, finance, and cybersecurity. In healthcare,
RF has been widely used for disease prediction, medical image classification, and patient risk
assessment by analyzing various clinical and genetic factors. The model’s ability to handle large
datasets with multiple variables enables researchers to uncover meaningful insights and
correlations, aiding in the early detection of medical conditions and personalized treatment
recommendations. In financial analytics, RF plays a crucial role in fraud detection and risk
evaluation, providing accurate predictions based on historical transaction data. Additionally, RF
has been applied in environmental science for predicting climate trends, monitoring air quality,
and assessing ecosystem health. The method’s capability to deliver reliable and interpretable
results has led to its widespread adoption in research and industry. As machine learning
continues to evolve, RF remains one of the most widely utilized algorithms due to its strong
performance, adaptability, and ability to provide meaningful insights across diverse applications.
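The two sources of randomness described above (a random subset of rows and a random subset of features per tree) and the final aggregation step can be sketched as follows. This is a simplified illustration using only the standard library; the helper names and the toy rows are hypothetical, and the tree-fitting step itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree to consider."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Aggregate per-tree class labels into the forest's prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 1]]  # hypothetical data rows
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=3, k=2, rng=rng)
forest_label = majority_vote([1, 0, 1, 1, 0])  # votes from five hypothetical trees
print(sample, features, forest_label)
```

In a real forest each tree would be trained on its own `sample` and `features`, and `majority_vote` would combine the trees' outputs for a new record.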
1.1.4 AUTISM SPECTRUM DISORDER (ASD)
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder
characterized by impairments in social communication, difficulties in reciprocal interactions,
and certain types of repetitive behavior. A recent Centers for Disease Control and Prevention (CDC) report put U.S. prevalence at
one in every 44 children, underscoring the need for further research into diagnostics.
The causes are mostly unknown; the disorder resists simple categorization and is
thought to arise from an interplay of genetic, epigenetic, and
environmental factors. No link has been established between
vaccines and the development of ASD. Diagnosis mostly relies on
behavioral observation tools such as ADOS and M-CHAT, which can delay
early intervention. These diagnostic tools rely on behavioral assessments, which, although
effective, have inherent limitations such as subjective interpretation and variability in
administration across different practitioners. This necessitates ongoing efforts to refine and
improve diagnostic techniques to ensure greater accuracy and early identification. Recent
advances in AI and machine learning hold considerable promise. Al-Qazzaz et al. [1]
show that a hybrid CNN-SVM model achieves 87.8% accuracy in classifying ASD severity
from EEG features. Machine learning algorithms provide a
promising pathway for refining ASD diagnostics by leveraging vast datasets to detect patterns
beyond human perception. Thapa et al. [2] similarly show that machine learning
classification of ASD can accommodate the shifting diagnostic criteria of
DSM-IV and DSM-5. With the inclusion of AI, large-scale datasets from neuroimaging, genetic
markers, and behavioral patterns can be analyzed to provide more precise predictive models for
ASD identification. Communication difficulties vary widely: some individuals with
ASD are minimally verbal, while others have speech that is only mildly impaired in its
pragmatic use. Many are slow to read or interpret social cues, which impairs social
interaction and contributes to
their social isolation. Social withdrawal and difficulties in forming relationships often result in
increased challenges in academic and professional settings. Other characteristic features
include stereotyped behaviors, intense and narrowly focused interests,
and sensory sensitivities ranging from hyper- to hyposensitivity to sound. Routine
screening at 18 months and 24 months is paramount to early diagnosis and the possibility of
intervention. Early identification has been shown to significantly improve developmental
outcomes, particularly when paired with targeted interventions. Applications of machine
learning to EEG-based diagnosis and AI-driven autism classification have further improved
accuracy, proving useful in diagnosis. These methods help in
distinguishing ASD from other neurodevelopmental conditions with overlapping symptoms,
improving differential diagnosis. Specific therapeutic modalities include ABA, speech and
occupational therapy, as well as structured educational programs directed toward enhancing
social and adaptive skills. These interventions aim to foster independence and improve quality
of life by addressing core deficits in communication, behavior, and daily functioning.
Medications are also available for some co-occurring conditions, helping to manage symptoms
such as anxiety, hyperactivity, or sleep disturbances, thereby providing a more comprehensive
approach to ASD management.
FXS, allowing for potentially syndrome-effective therapies. Such interventions, including
neurofeedback and brain stimulation techniques, are being explored to improve cognitive and
social outcomes. Diagnosis is confirmed by molecular genetic testing, such as PCR and
Southern blot analysis, which assess CGG repeat expansion and methylation status, ensuring an accurate
and early diagnosis. While there is no cure yet, early intervention through behavioral therapies,
speech and occupational therapies, and pharmacological treatments is associated with significant
improvement in quality of life. Current medical treatment options include SSRIs for anxiety, and
stimulant medications for attention deficit disorder, offering symptomatic relief. Additionally,
research into targeted molecular treatments such as mGluR5 antagonists continues to be actively
pursued as promising therapeutic avenues. Advances in gene therapy and CRISPR-based
approaches are also being investigated, raising hope for potential future treatments. A clear
understanding of the distinct neurobiological mechanisms of FXS will
be key to developing targeted interventions aimed at both ASD-related and unique cognitive and
behavioral challenges.
1.2 PROBLEM STATEMENT
Although ASD and FXS have been studied separately, individuals with FXS have a
significantly higher risk of developing ASD.
Early prediction and intervention for ASD in FXS patients are crucial, as these
individuals not only suffer from the symptoms of ASD but also face additional internal
health issues and physical illnesses associated with FXS.
Accurate identification of ASD in FXS patients can help reduce the impact of these co-
occurring conditions, improving patient outcomes and minimizing the long-term effects
on both mental and physical health.
The primary aim of this project is to identify and diagnose ASD in individuals with
Fragile X Syndrome (FXS). FXS is the most common genetic cause of autism, and
individuals with FXS are at a higher risk for ASD.
To identify ASD in individuals with FXS at an early stage for effective management, as
ASD affects 1 in 68 individuals and FXS affects approximately 1 in 11,000 individuals.
To create individualized prediction tools that account for the variability in how ASD
presents within the FXS population.
Develop a model using Random Forest and SVM with AQ scores for non-invasive ASD
screening in FXS patients.
Improve diagnostic accuracy, enable timely intervention, and support clinical decision-
making.
Extend to other neurodevelopmental disorders and enhance accuracy with advanced
datasets and techniques.
1.5 ORGANIZATION OF THE REPORT
The basic organization of the report is given below: The project's introductory view and
terms are covered in chapter 1. Chapter 2 is devoted to literature review in order to have a better
grasp of current relevant initiatives. The software and hardware requirements for this project are
detailed in chapter 3. The existing system, its pros and cons, and the project's proposed system
are outlined in chapter 4. With thorough explanations, chapter 5 depicts the project's
implementation and results. The project's conclusion and potential improvements are discussed
in chapter 6.
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
N. K. Al-Qazzaz et al. utilize EEG datasets collected from children with ASD
(classified as mild, moderate, or severe) and normal controls. Various machine learning and
deep learning models were employed, including pre-trained Convolutional Neural Networks
(CNNs) such as AlexNet, ResNet18, GoogLeNet, MobileNetV2, SqueezeNet, ShuffleNet, and
EfficientNetb0. Transfer learning was applied using these CNN models, achieving a maximum
classification accuracy of 85.5% with SqueezeNet. Additionally, hybrid models integrating
Decision Tree (DT), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM)
classifiers with deep CNNs were evaluated. The best performance was obtained using
SqueezeNet combined with SVM, yielding an accuracy of 87.8%. Despite these advancements,
challenges remain due to dataset constraints and overlapping ASD symptoms, necessitating
further improvements in feature extraction and model generalization.
2.2.2 The Effect of Anxiety and Autism Symptom Severity on Restricted and
Repetitive Behaviors Over Time in Children with Fragile X Syndrome
Moskowitz, L. J. et al. investigate the effect of anxiety and autism symptom severity on
restricted and repetitive behaviors (RRBs) in children with Fragile X Syndrome (FXS) over
time. The dataset consists of 60 children with FXS, where anxiety was assessed using the Child
Behavior Checklist (CBCL), autism severity was measured using the Childhood Autism Rating
Scale (CARS), and RRBs were evaluated using the Repetitive Behavior Scale – Revised (RBS-
R). The methodology involved a longitudinal analysis with moderated regression models to
examine the interaction between anxiety and ASD symptoms in predicting RRB outcomes over
two years. The results showed that ASD and anxiety exert independent and combined effects on
sensory-motor RRBs, with increased severity leading to elevated RRB levels. Specifically, ASD
symptoms predicted sensory-motor RRBs only when anxiety levels were low, and vice versa.
The study highlights the complexity of ASD and anxiety interactions in FXS, emphasizing the
need for early intervention strategies tailored to individual symptom profiles.
F. Hajjej et al. introduce a two-phase system for Autism Spectrum Disorder (ASD)
identification and personalized education. The dataset comprises two ASD screening datasets of
toddlers, merged to improve diversity. The Synthetic Minority Over-sampling Technique
(SMOTE) is applied to balance the dataset, followed by feature selection methods. Various
machine learning models, including Random Forest, XGBoost, Decision Trees, Support Vector
Machine (SVM), and Gradient Boosting, were employed. An ensemble model combining
Random Forest and XGBoost using a voting mechanism achieved a 94% accuracy in ASD
classification. The second phase focuses on identifying optimal teaching methods for ASD
children by evaluating their physical, verbal, and behavioral performance. The proposed model
outperforms previous approaches, achieving a maximum accuracy of 99.99% with ML-based
feature selection techniques, highlighting its potential for early ASD detection and
individualized learning strategies.
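The balancing step mentioned above can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbors. This is a simplified, standard-library illustration and not the implementation used by Hajjej et al.; the function names, the value of `k`, and the sample rows are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def smote_sample(minority, k, rng):
    """Create one synthetic minority point by interpolating toward a near neighbor."""
    base = rng.choice(minority)
    # The k nearest minority neighbors of the base point (excluding itself).
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]  # hypothetical minority-class rows
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
print(synthetic)
```

Because each synthetic point is an interpolation, it always falls inside the region already occupied by the minority class rather than duplicating an existing row.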
2.2.4 Machine Learning Differentiation of Autism Spectrum Sub-
Classifications
Thapa, R. et al. utilize retrospective data from 38,560 individuals sourced from the
SPARK and ABIDE datasets, containing demographic, clinical, and assessment data. The
methodology employs machine learning (ML), specifically a gradient-boosted tree algorithm, to
classify individuals as having one of three autism spectrum disorders (ASD) under DSM-IV
(autistic disorder, Asperger’s disorder, PDD-NOS) or as non-spectrum. The model was trained
on 80% of the data and tested on the remaining 20%. The algorithm achieved AUROC scores
ranging from 0.863 to 0.980, with an overall classification accuracy of 80.5%. Among the
16.7% misclassified cases, the majority were incorrectly categorized into another ASD subtype
rather than non-spectrum. The study demonstrates that ML can effectively classify ASD using
minimal inputs, aiding in early diagnosis and intervention. However, the model’s performance
for PDD-NOS was lower due to its relatively lower prevalence in the dataset.
Qiuhong Wei et al. examined 24 studies involving 1,396 individuals, assessing the use of
eye-tracking data for identifying Autism Spectrum Disorder (ASD) using machine learning
(ML) models. The most commonly used ML algorithms were Support Vector Machine (SVM)
and Random Forest (RF), with studies reporting accuracy rates ranging from 55.5% to 93.8%.
The pooled accuracy across studies was 81%, with a sensitivity of 84% and specificity of 79%.
The highest classification accuracy (88%) was found in preschool-aged children. Eye-tracking
stimuli varied widely, including social, non-social, static, and dynamic stimuli, with most
studies failing to report standardized eye-tracking hardware details. Despite promising results,
the study highlights high inter-study heterogeneity, small sample sizes, and lack of test set-based
evaluations, suggesting the need for better standardization and validation in future research.
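Accuracy, sensitivity, and specificity, as reported throughout this survey, are all derived from the four confusion-matrix counts (TP, FP, TN, FN) listed in the abbreviations. The sketch below shows the standard formulas; the function name and the counts are hypothetical and do not correspond to any study cited here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # share of true ASD cases that were detected
    specificity = tn / (tn + fp)  # share of non-ASD cases correctly ruled out
    return accuracy, sensitivity, specificity

# Hypothetical counts, for illustration only.
accuracy, sensitivity, specificity = classification_metrics(tp=45, fp=5, tn=35, fn=15)
print(accuracy, sensitivity, specificity)
```

Reporting all three values matters for screening tools, since a model can reach high accuracy on an imbalanced dataset while still missing many true cases.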
2.2.6 Brief Intensive Social Gaze Training Reorganizes Functional Brain
Connectivity in Boys with Fragile X Syndrome
Manish Saggar et al. examined the effects of brief intensive social gaze training on
functional brain connectivity in boys with fragile X syndrome using resting-state functional
MRI. The dataset included 37 boys, 16 with fragile X syndrome and 21 with idiopathic autism
spectrum disorder, aged 7–18 years, who underwent pre- and post-training scans. The study
used connectome-based predictive modeling to analyze brain network reorganization. Results
showed significant hyper- and hypo-connectivity differences in fragile X syndrome compared to
autism spectrum disorder before training. After training, the fragile X syndrome group exhibited
a stabilization of functional connectivity, reducing hyperconnectivity and increasing
connectivity where it was previously lower. Behavioral improvements in social gaze avoidance
were noted, particularly in the fragile X syndrome group, suggesting potential for targeted
interventions. The study emphasized the need for larger sample sizes and extended training
durations for further validation.
Yang et al. explored ASD classification using resting-state fMRI data from the ABIDE
repository. The dataset included 871 samples (403 ASD, 468 TD), with functional connectivity
features extracted using various brain parcellation techniques. They evaluated three functional
connectivity metrics: correlation, partial correlation, and tangent-space. Multiple machine
learning models were applied, including logistic regression, SVM (linear and kernel), and deep
neural networks. The Bootstrap Analysis of Stable Clusters (BASC444) atlas combined with the
Radial Basis Function (RBF) kernel SVM produced the best results, achieving an accuracy of
69.43%, sensitivity of 64.57%, and specificity of 73.61%. The study highlighted the
significance of functional connectivity and machine learning in ASD diagnosis, emphasizing
SVM as the optimal classifier.
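As an illustration of the simplest of the three connectivity metrics named above, the sketch below computes Pearson correlation between every pair of ROI time series and flattens the upper triangle into a feature vector for a classifier. It is a standard-library sketch under stated assumptions: the function names and the three short time series are hypothetical, and real pipelines would use an atlas with hundreds of ROIs.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def connectivity_features(roi_series):
    """Correlations for every ROI pair (upper triangle of the connectivity matrix)."""
    n = len(roi_series)
    return [pearson(roi_series[i], roi_series[j])
            for i in range(n) for j in range(i + 1, n)]

# Hypothetical ROI time series.
roi_series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # rises with the first series
    [4.0, 3.0, 2.0, 1.0],  # falls as the first series rises
]
features = connectivity_features(roi_series)
print(features)
```

With `n` ROIs the feature vector has `n * (n - 1) / 2` entries, which is why connectivity-based classifiers operate in high-dimensional spaces.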
2.2.8 A Multimodal Approach for Identifying Autism Spectrum Disorders in
Children
Han et al. explored ASD classification in children using a multimodal approach that
integrates EEG and eye-tracking data. The dataset included 90 children (40 ASD, 50 TD), and a
stacked denoising autoencoder (SDAE) was employed for feature learning and fusion. EEG and
ET data were processed separately using SDAE models before being combined in a multimodal
fusion model. The study achieved an accuracy of 95.56%, sensitivity of 92.5%, and specificity
of 98%, outperforming unimodal methods. The results demonstrated the effectiveness of
multimodal fusion in enhancing ASD classification by capturing complementary
neurophysiological and behavioral characteristics.
2.2.10 A Deep Learning Approach to Predict Autism Spectrum Disorder
Using Multisite Resting-State fMRI
Subah et al. explored ASD classification using resting-state fMRI data from the ABIDE
dataset. The dataset included 866 subjects (402 ASD, 464 TD), with functional connectivity
features extracted using four predefined brain atlases: CC200, AAL, BASC, and Power. The
study applied a deep neural network classifier to differentiate ASD and control groups, utilizing
tangent space embedding for feature representation. The proposed model achieved a mean
accuracy of 88%, with sensitivity, F1-score, and AUC values of 90%, 87%, and 96%,
respectively. The results indicated that the BASC atlas with 122 ROIs had the highest predictive
power, outperforming other atlases and state-of-the-art methods.
Clodagh O’Keeffe et al. explored the disruption in postural control networks of Fragile
X premutation carriers using force plate posturography and high-density EEG. The dataset
included eight premutation carriers and six controls, who performed balance tasks under
different sensory and cognitive conditions. EEG data were collected using a 128-channel
Biosemi system, and postural sway was analyzed using force plate measurements. The study
applied statistical analysis to compare postural control and neural activity between groups.
Results showed that premutation carriers exhibited increased sway area under challenging
conditions (p=0.01) and significantly reduced frontal theta power compared to controls
(p=0.007). The correlation between theta power and postural stability was evident in controls
but absent in carriers, suggesting a disruption in neural mechanisms underlying balance control.
The findings highlight potential biomarkers for early detection of motor impairments associated
with FXTAS.
2.2.12 Hidden Markov Models to Estimate the Probability of Having Autistic
Children
Emerson A. Carvalho et al. explored the estimation of the probability of autistic parents
having autistic children using Hidden Markov Models (HMMs). The study utilized statistical
data on autism heritability and recurrence among siblings to model the hidden and observable
states of the HMMs. The methodology involved defining transition probabilities based on ASD
recurrence rates and genetic factors, while the model was trained using statistical datasets rather
than individual observations. The results suggested that autistic parents have an estimated
probability of 33% of having an autistic daughter and 80% of having an autistic son. The
findings emphasize the significant role of genetic factors in autism inheritance and could
contribute to early diagnosis and risk estimation for prospective parents.
Michelle Tang et al. explored the diagnosis of autism spectrum disorder using a deep
multimodal learning approach with functional magnetic resonance imaging (fMRI) data. The
dataset used was ABIDE-I, consisting of 1035 subjects (505 ASD, 530 controls). Two types of
activation maps were extracted: correlation matrices between fMRI voxel intensities and ROI
time series, and functional connectivity maps between ROI pairs. A multimodal deep learning
model was developed, combining a 3D ResNet-18 for fMRI data and a multilayer perceptron
(MLP) for ROI connectivity features. The final model achieved a classification accuracy of
74%, a recall of 95%, and an F1-score of 0.805, outperforming single-modality models. The
findings highlight the advantage of incorporating multiple neuroimaging features to improve
ASD classification.
2.2.14 An Intelligent Multimodal Framework for Identifying Children with
Autism Spectrum Disorder
Jingying Chen et al. explored an intelligent multimodal framework for identifying
children with autism spectrum disorder using behavioral and cognitive data. The dataset
included 50 ASD and 50 typically developing (TD) children aged 3–6 years. Eye fixation, facial
expression, and cognitive level data were collected using a Tobii Eye Tracker, a video camera,
and an interactive question-answer platform. The methodology involved an optimized random
forest (RF) classifier with weighted decision trees and a hybrid fusion method to integrate
different data modalities. The proposed framework achieved a classification accuracy of 91%,
outperforming support vector machine (SVM) and discriminant analysis models. The results
suggest that multimodal data fusion enhances ASD identification and may improve early
intervention strategies.
Dryburgh et al. explored the prediction of intelligence scores in ASD using functional
connectomic data from the ABIDE dataset. The study utilized resting-state fMRI data from 202
ASD and 226 neurotypical individuals, applying a connectome-based predictive model (CPM)
to identify brain connections associated with full-scale and verbal intelligence. The
methodology involved feature selection, summarization, and training of linear regression
models with leave-one-out cross-validation. The results showed that the model achieved
significant predictions, with verbal intelligence scores exhibiting stronger correlations than full-
scale intelligence scores. The study highlighted differences in functional connectivity patterns
between ASD and neurotypical populations, suggesting distinct neural mechanisms underlying
intelligence in ASD.
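Leave-one-out cross-validation with a linear regression model, as used in the study above, can be sketched as follows. Each sample is predicted by a model fitted on all the other samples. This is a simplified single-feature illustration using only the standard library; the function names and data values are hypothetical and do not reproduce the connectome-based model of Dryburgh et al.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions

# Hypothetical data: one summary feature (x) and a score (y), exactly y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
predictions = leave_one_out_predictions(xs, ys)
print(predictions)
```

Because the held-out sample never influences the fitted line, the correlation between `predictions` and `ys` gives an estimate of how the model performs on unseen individuals.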
TABLE 2.1 INFERENCES FROM LITERATURE REVIEWS
5 | Machine learning based on eye-tracking data to identify Autism Spectrum Disorder: A systematic review and meta analysis | Qiuhong Wei, Huiling Cao, Yuan Shi, Ximing Xu | Elsevier, 2023 | Dataset: Eye Tracking Data; Algorithm: SVM; Accuracy: 88%
6 | Brief intensive social gaze training reorganizes functional brain connectivity in boys with fragile X syndrome | Manish Saggar, Jennifer L. Bruno, Scott S. Hal | Elsevier, 2023 | Dataset: fMRI Data; Algorithm: CPA; Accuracy: N/A
7 | A study of brain networks for autism spectrum disorder classification using resting-state functional connectivity | Xin Yang, Ning Zhang, Paul Schrader | Elsevier, 2022 | Dataset: fMRI Data; Algorithm: SVM; Accuracy: 69.43%
8 | A Multimodal Approach for Identifying Autism Spectrum Disorders in Children | Junxia Han, Guoqian Jiang, Gaoxiang Ouyang, and Xiaoli Li | IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2022 | Dataset: EEG, ET; Algorithm: SDAE; Accuracy: 95.56%
9 | Bringing machine learning to research on intellectual and developmental disabilities | Chirag Gupta, Pramod Chandrashekar, Ting Jin, Chenfeng | IEEE Access, 2022 | Dataset: Clinical Data; Algorithm: N/A; Accuracy: N/A
10 | A Deep Learning Approach to Predict Autism Spectrum Disorder Using Multisite Resting-State fMRI | Subah, F.Z.; Deb, K.; Dhar, P.K.; Koshiba | IEEE Access, 2021 | Dataset: fMRI Data; Algorithm: DNN; Accuracy: 88%
11 | Decreased Theta Power Reflects Disruption in Postural Control Networks of Fragile X Premutation Carriers | Clodagh O’Keeffe, Manuel Carro Domínguez, Eugene O’Rourke, Tim Lynch, and Richard B. Reilly | IEEE Access, 2020 | Dataset: EEG; Algorithm: UTest; Accuracy: N/A
CHAPTER 3
SYSTEM REQUIREMENT SPECIFICATION
This section gives the details and specification of the hardware on which the system is
expected to work.
Processor: Intel(R) Core (TM) i5-1235U 1.30 GHz
RAM: 8.00 GB
Hard disk: 500 GB
Keyboard: Standard 102 keys
This section gives the details of the software used for development.
Operating System: Windows 11
Environment: PyCharm
Language: Python
3.2.1 PYTHON
3.2.2 FEATURES OF PYTHON
An efficient data handling and storage facility, allowing easy manipulation of various
data structures such as lists, dictionaries, and sets.
A vast collection of libraries and frameworks that support tasks like data analysis,
machine learning, web development, and automation, making it highly versatile.
3.2.3 PYCHARM
A key advantage of PyCharm over many other Python editors is its intelligent code
assistance, built-in debugging tools, and seamless integration with popular frameworks.
PyCharm can be installed on Windows, macOS, and Linux, making it a cross-platform
development tool.
PyCharm is available under JetBrains’ license for the Professional Edition, while the
Community Edition is open-source and distributed under the Apache License 2.0. According to
the principles of open-source software, PyCharm Community Edition grants the following
freedoms:
The freedom to use the software for any purpose, whether for learning, personal projects,
or commercial development.
The freedom to study and modify the source code to customize it according to individual
or organizational needs. The source code is publicly accessible.
The freedom to share and distribute copies of the software to help others in the
programming community.
The freedom to enhance the software and contribute improvements back to the
community, ensuring collective growth and better functionality for all users.
With its powerful features, ease of use, and support for modern Python development,
PyCharm remains one of the most popular IDEs for Python programmers worldwide.
CHAPTER 4
PROJECT DESCRIPTION
Simplified linear assumptions fail to capture the complex, multi-factorial
nature of ASD and FXS.
Findings often lack generalizability due to variations in demographics, data
collection, and study conditions.
The project focuses on the early identification of Autism Spectrum Disorder (ASD) in
individuals with Fragile X Syndrome (FXS), addressing a critical challenge in
neurodevelopmental research. Since ASD and FXS share overlapping traits, distinguishing
between the two conditions can be complex. Many individuals with FXS exhibit autistic-like
behaviors, but not all meet the diagnostic criteria for ASD. This overlap makes traditional
diagnostic methods less effective, leading to delayed interventions and missed opportunities for
early support.
By improving the accuracy of ASD detection within the FXS population, the project
aims to enhance clinical decision-making and enable timely therapeutic strategies. To achieve this,
the project employs a multi-modeling approach that integrates different machine learning
techniques for robust classification. The first step involves data preprocessing, ensuring that the
dataset is cleaned, structured, and optimized for model training. The Random Forest (RF)
algorithm is used as an initial classification model, leveraging its ability to handle high-
dimensional data and detect complex patterns. The RF results are then refined using Support
Vector Machines (SVM), which enhances classification accuracy by focusing on defining
optimal decision boundaries. This sequential approach ensures that the model achieves a high
level of precision and minimizes misclassification errors.
The Random Forest algorithm is chosen due to its capability to process large datasets
with intricate relationships between features. It is a powerful ensemble learning method that
combines multiple decision trees, making it robust against overfitting and effective in
identifying key predictors of ASD within FXS individuals. Additionally, Random Forest
provides feature importance rankings, helping researchers understand which variables contribute
most to the classification process.
Support Vector Machine (SVM) is employed in the final classification stage because of
its strength in separating classes with clear margins. SVM is particularly effective when dealing
with non-linear relationships between features, making it well-suited for refining predictions. By
optimizing decision boundaries, SVM ensures that the final classification is highly accurate and
reliable, improving the overall performance of the predictive model. The combination of
Random Forest and SVM allows for a comprehensive, data-driven approach to ASD detection in
FXS individuals, enhancing both diagnostic precision and early intervention strategies.
4.3 MODULE DESCRIPTION
The architecture diagram shown in Figure 4.1 comprises four phases: dataset
collection, dataset preprocessing, identification of FXS, and identification of ASD in FXS from
the output of the FXS identification stage.
4.3.1 DATA COLLECTION
The data, obtained from Kaggle, is a tabular survey dataset designed to measure Autism
Spectrum Disorder (ASD) characteristics based on the Autism Spectrum Quotient (AQ) score.
It comprises approximately 1,400 entries, incorporating behavioral, demographic, and medical
features that play a crucial role in predicting ASD in individuals with Fragile X Syndrome
(FXS). The core of the dataset consists of ten screening question responses (A1_Score to
A10_Score) from the Q-Chat-10 questionnaire, where certain answer patterns indicate
potential ASD traits. A cumulative AQ score exceeding 3 suggests an increased risk of ASD,
warranting further assessment by specialists. In addition to behavioral indicators, demographic
characteristics such as age, gender, and ethnicity provide insights into ASD and FXS
prevalence across different populations. Medical factors, including a history of neonatal
jaundice and family history of autism, help analyze the genetic and environmental influences
contributing to ASD risk. The dataset also captures country of residence, prior use of autism
screening applications, and the relationship of the respondent to the individual tested, offering
insights into global diagnostic trends and awareness. By leveraging this
comprehensive dataset, machine learning models can be developed to enhance early ASD
detection in individuals with FXS, improving early intervention strategies and clinical
decision-making.
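The scoring rule described above can be sketched in a few lines. The example below sums the ten binary answers of one record and flags totals above 3, the cut-off quoted in the text. The two records and the helper names `cumulative_score` and `flag_for_assessment` are hypothetical and serve only as an illustration.

```python
# Column names follow the dataset description (A1_Score to A10_Score).
SCORE_COLUMNS = [f"A{i}_Score" for i in range(1, 11)]
RISK_THRESHOLD = 3  # cut-off quoted in the text: a total above 3 suggests increased risk


def cumulative_score(record):
    """Sum the ten binary screening answers of one record."""
    return sum(record[col] for col in SCORE_COLUMNS)


def flag_for_assessment(record, threshold=RISK_THRESHOLD):
    """Return True when the cumulative score exceeds the threshold."""
    return cumulative_score(record) > threshold


# Hypothetical records (assumption): each answer is 0 or 1.
low = {col: 0 for col in SCORE_COLUMNS}
low.update({"A1_Score": 1, "A4_Score": 1})    # total = 2
high = {col: 1 for col in SCORE_COLUMNS}
high.update({"A2_Score": 0, "A9_Score": 0})   # total = 8

print(cumulative_score(low), flag_for_assessment(low))
print(cumulative_score(high), flag_for_assessment(high))
```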
4.3.2 PRE-PROCESSING
Numerical features were normalized with StandardScaler, which rescales each feature to
zero mean and unit standard deviation. This
scaling prevents any single attribute from dominating the learning process and improves model
performance, particularly for distance-based and gradient-based models. Categorical attributes,
such as gender, ethnicity, history of jaundice, and family history of autism, were encoded
through One-Hot Encoding to prevent misinterpretation of categorical data as numerical
hierarchies.
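One possible way to combine the two steps above is scikit-learn's `ColumnTransformer`, shown in the sketch below. The four toy rows, the column `aq_total`, and the split into numeric and categorical columns are assumptions for illustration; the project's actual column handling may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with hypothetical values (assumption).
df = pd.DataFrame({
    "age": [4.0, 7.0, 10.0, 15.0],
    "aq_total": [2.0, 5.0, 8.0, 9.0],   # hypothetical numeric column
    "gender": ["m", "f", "f", "m"],
    "jaundice": ["no", "yes", "no", "no"],
})
numeric_cols = ["age", "aq_total"]
categorical_cols = ["gender", "jaundice"]

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(df)
X = X.toarray() if hasattr(X, "toarray") else X  # densify if returned sparse
print(X.shape)  # 2 scaled columns + 2 gender columns + 2 jaundice columns
```

Fitting the transformer on the training split only, and reusing it on the test split, keeps test-set statistics out of the scaling step.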
Since AQ scores and other features have different ranges, we standardize them:
$$X' = \frac{X - \mu}{\sigma} \tag{1}$$
where:
X is the original value of the feature,
μ is the mean of the feature (e.g., mean AQ score across all participants),
σ is the standard deviation of the feature.
This ensures that AQ scores and other features are normalized before being fed into the
models.
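A small worked example of Equation (1) is given below, using only the Python standard library. The list of AQ totals is hypothetical and was chosen so that the mean is 5 and the standard deviation is 2; the helper name `standardize` is likewise an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Apply Equation (1): subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no spread
    return [(x - mu) / sigma for x in values]


aq_scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical AQ totals: mean 5, std 2
z = standardize(aq_scores)
print([round(v, 2) for v in z])
```

A score of 9 maps to 2.0, meaning it lies two standard deviations above the mean, while a score equal to the mean maps to 0.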
Missing values were handled by imputation: numerical attributes were replaced
with the median and categorical attributes with the mode. Since class imbalance can degrade
prediction accuracy, oversampling, undersampling, and SMOTE were applied to
balance the class distribution. Following preprocessing, the final dataset consisted of
normalized numerical features, encoded categorical features, no missing values, and a balanced
class distribution, rendering it ready for model training.
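The imputation and balancing steps can be sketched with the standard library alone. Note that the example below uses plain random oversampling, which duplicates existing minority rows; it is a simpler stand-in and not SMOTE, which creates new synthetic points by interpolating between minority neighbours. The helper names and the toy values are assumptions.

```python
import random
from statistics import median, mode


def impute(values, strategy):
    """Replace None entries with the median (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


ages = impute([4, None, 9, 6, None], "median")             # median of 4, 6, 9 is 6
genders = impute(["m", "f", None, "f", "m", "f"], "mode")  # mode is "f"
rows, labels = random_oversample([[1], [2], [3], [4], [5]], [0, 0, 0, 0, 1])
print(ages, genders, labels.count(0), labels.count(1))
```

Resampling is normally applied to the training split only, so that duplicated or synthetic rows never leak into the evaluation data.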
4.3.3 IDENTIFICATION OF FXS
By leveraging the power of multiple decision trees, the Random Forest model can capture
subtle variations in behavior and demographic trends that might indicate FXS risk. The input
features include AQ-based scores (A1–A10), sex, autism family history, and other individual
factors that help to identify patterns characteristic of FXS-positive cases. These factors
collectively play a crucial role in distinguishing between individuals at higher risk of FXS and
those who do not exhibit relevant behavioral markers.
Through learning from annotated training data, the model detects the primary behavioral
markers characteristic of FXS and distinguishes affected from non-affected individuals. This
ability to learn from labeled examples and generalize effectively makes Random Forest a
robust and reliable tool for predictive modeling, particularly in the context of early ASD
detection within FXS populations.
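The Random Forest (RF) prediction for FXS (Fragile X Syndrome) can be written as:
$$\hat{Y} = \arg\max_{c} \sum_{t=1}^{T} \mathbb{1}\left(h_t(X) = c\right) \tag{2}$$
where:
T is the number of decision trees in the forest,
h_t(X) is the class predicted by the t-th tree for input X,
c ranges over the candidate classes (FXS-positive or FXS-negative),
1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise.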
The final class is determined by majority voting across all trees. Since it can
process both categorical and numerical data without affecting model stability, Random Forest is a
reliable algorithm for classifying FXS and therefore an appropriate choice for
predictive modeling in this research.
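The majority vote of Equation (2) can be illustrated with a short sketch. The seven tree outputs below are hypothetical, and the helper name `majority_vote` is an assumption introduced for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Equation (2): return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(class, votes)] for the top class;
    # on a tie, the class encountered first is returned.
    return counts.most_common(1)[0][0]


# Hypothetical outputs h_t(X) of T = 7 trees for one individual (1 = FXS, 0 = non-FXS).
votes = [1, 0, 1, 1, 0, 1, 0]
print(majority_vote(votes))
```

Here four of the seven trees vote for class 1, so the forest outputs FXS-positive. Library implementations may instead average the per-tree class probabilities, which usually gives the same answer.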
4.3.4 IDENTIFICATION OF ASD IN FXS
The Support Vector Machine is a well-known supervised learning method used for both
classification and regression, and it performs well on high-dimensional
data. SVM attempts to find the optimal hyperplane that separates the classes in
the given data set with maximal margin, where the margin is the distance between the
hyperplane and the closest points of each class, called support vectors. This margin
maximization promotes strong generalization, making the model effective in real-world
applications.
SVM handles challenging, non-linearly separable data with the help of kernel
functions. Besides the linear kernel, polynomial and radial basis function (RBF) kernels
implicitly map the data to a high-dimensional space where separation becomes possible. The ability to
transform complex data distributions into a separable space gives SVM a distinct advantage in
handling datasets with intricate patterns and varying relationships between features. It is
known for its ability to generalize to previously unseen data thanks to its robustness
against overfitting, especially when dealing with few samples.
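To make the RBF kernel concrete, the sketch below evaluates the standard form k(x, z) = exp(-gamma * ||x - z||^2) for three pairs of points. The value gamma = 0.5 and the sample points are assumptions chosen for illustration; the source does not state the kernel parameters that were used.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); gamma is a tunable width parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 0.0], [1.0, 0.0])  # identical points, distance 0
near = rbf_kernel([1.0, 0.0], [1.0, 1.0])  # squared distance 1
far = rbf_kernel([1.0, 0.0], [4.0, 4.0])   # squared distance 25
print(same, near, far)
```

The kernel value acts as a similarity score: it equals 1 for identical points and decays toward 0 as points move apart, which is what lets the SVM draw curved decision boundaries in the original feature space.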
Unlike other models that may struggle with small datasets, SVM remains robust by
focusing on critical data points (support vectors) rather than relying on the entire dataset. In our
work, we use SVM to identify ASD in FXS-positive patients using the output of the Random
Forest model as input. If FXS = 1, we pass the features plus the FXS prediction to an SVM
model to predict ASD in FXS. This two-stage modeling approach enhances predictive accuracy
by leveraging the strengths of both Random Forest and SVM, ensuring that the classification of
ASD in FXS cases is both precise and reliable.
By incorporating SVM after the Random Forest model, we refine the decision-making
process, allowing for more accurate differentiation between ASD-positive and ASD-negative
cases within the FXS population. The combination of these models helps improve the
sensitivity and specificity of the prediction, making it a valuable approach for early ASD
identification in FXS individuals.
The Random Forest classifier initially handles behavioral screening answers and
demographic characteristics to decide if a person is FXS-positive or not. The subpopulation of
individuals that are FXS-positive is then input into the SVM model, which again examines
their responses and behavioral characteristics to forecast the presence or absence of ASD.
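The two-stage flow described in this section can be sketched as follows. This is a minimal illustration on synthetic data, not the project's implementation: the feature matrix, both label rules, and the use of -1 to mark rows that never reach the SVM are assumptions. For brevity the models are fitted and applied on the same rows; a real evaluation would use a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 10 binary answers plus 2 binary history flags.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 12)).astype(float)
y_fxs = (X[:, :5].sum(axis=1) >= 3).astype(int)    # hypothetical FXS label
y_asd = (X[:, 5:10].sum(axis=1) >= 3).astype(int)  # hypothetical ASD label

# Stage 1: Random Forest predicts FXS.
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X, y_fxs)
fxs_pred = rf.predict(X)

# Stage 2: keep FXS-positive rows only and append the FXS prediction as a feature.
positive = fxs_pred == 1
X_stage2 = np.column_stack([X[positive], fxs_pred[positive]])
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_stage2, y_asd[positive])

# Final output: -1 marks rows never passed to the SVM (predicted FXS-negative).
asd_pred = np.full(len(X), -1)
asd_pred[positive] = svm.predict(X_stage2)
print(positive.sum(), np.bincount(asd_pred[positive]))
```

Within the FXS-positive subset the appended FXS column is constant, so in this sketch it cannot influence the SVM; it is kept only to mirror the description above.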
4.4 ADVANTAGES OF PROPOSED SYSTEM
Machine learning models like RF and SVM capture complex, nonlinear patterns,
enhancing predictive accuracy.
Better Generalization: These models perform well on larger, diverse datasets, providing
robust results.
Data-Driven Insights: Machine learning uncovers hidden relationships in the data,
offering deeper insights.
Scalability: The methodology is adaptable and can be easily updated with new data
for continuous improvement.
CHAPTER 5
IMPLEMENTATION AND RESULTS
The implementation process for predicting ASD in individuals with FXS follows a structured
two-stage classification approach. The first stage involves predicting whether an individual has
Fragile X Syndrome (FXS) based on behavioral screening responses and demographic
attributes. Once an FXS classification is determined, the second stage refines the prediction by
identifying the presence of Autism Spectrum Disorder (ASD) within the FXS-positive group.
This sequential modeling strategy ensures that ASD classification is performed only within the
relevant subpopulation, improving predictive accuracy and reducing unnecessary computations.
The first stage of classification involved predicting Fragile X Syndrome (FXS) using the
Random Forest (RF) algorithm. RF, a powerful ensemble learning technique, builds multiple
decision trees and aggregates their outputs to improve classification accuracy and reduce
overfitting. Each tree in the forest is trained on a randomly chosen subset of data using bootstrap
sampling, and features for each split are selected randomly, ensuring the model does not overly
rely on specific attributes. This randomness enhances generalization, making the model highly
effective for mixed-type datasets with both numerical and categorical features. Additionally, RF
provides feature importance insights, allowing identification of key behavioral and demographic
indicators contributing to FXS classification. By leveraging a diverse set of decision trees, the
RF model efficiently learns patterns from the input features, which include AQ-based scores,
sex, family history of autism, and other behavioral traits, to determine whether an individual is
FXS-positive or FXS-negative.
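The two sources of randomness described above, bootstrap sampling of rows and random selection of candidate features, can be sketched with the standard library. The dataset size of 1,400 comes from the data description; the feature list, the square-root rule for the subset size, and the helper names are assumptions for illustration.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(feature_names, rng):
    """Pick about sqrt(p) candidate features, a common default for classification.

    In a Random Forest this draw is repeated at every split of every tree.
    """
    k = max(1, round(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)


rng = random.Random(42)
features = [f"A{i}_Score" for i in range(1, 11)] + ["gender", "austim"]
sample = bootstrap_indices(1400, rng)
candidates = random_feature_subset(features, rng)

# Rows never drawn are "out-of-bag"; on average about 37% of the rows.
out_of_bag = 1400 - len(set(sample))
print(len(sample), out_of_bag, candidates)
```

Because each tree sees a different sample and different candidate features, the trees make partly independent errors, which is why averaging their votes reduces overfitting.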
Once individuals were classified as FXS-positive by the Random Forest model, the next step
was to determine the presence of Autism Spectrum Disorder (ASD) within this subpopulation. A
Support Vector Machine (SVM) was employed for this classification, as it is well-suited for
handling high-dimensional data and ensuring strong generalization. SVM finds an optimal
hyperplane that maximizes the margin between different classes, enhancing classification
robustness. For non-linearly separable data, kernel functions like radial basis function (RBF)
were utilized to map data into higher-dimensional spaces where separation is possible. The input
features for the SVM model included AQ scores, behavioral traits, and the FXS prediction from
the RF model. By using this two-stage approach (first applying RF to detect FXS and then
SVM to classify ASD), the model improved classification accuracy, effectively distinguishing
ASD-positive and ASD-negative cases within the FXS population. This integration of ensemble
learning and margin-based classification enhances predictive reliability, supporting early ASD
identification among individuals with FXS.
Accuracy: The model correctly classified 91% of all examples, which indicates that the
ensemble model works well overall.
Precision: Precision is the fraction of predicted positive cases that are truly positive.
A precision of 80% means that 80% of the cases labeled positive were correct, reflecting the
model's ability to avoid false positives.
Recall: Recall (or sensitivity) measures how well the model captures actual positive instances.
With a recall of 82%, the model identified 82% of the actual positive cases,
demonstrating good sensitivity but leaving room for improvement in detecting all positives.
F1 Score: The F1 Score is the harmonic mean of precision and recall, giving a single score for
the overall performance of the model. At 81%, it indicates a good balance between
recognizing actual positives and limiting false positives.
ROC Curve: The ROC curve shows the model's classification capability by plotting the
True Positive Rate against the False Positive Rate. With an AUC of 0.96, the model has very
good discrimination capability. Figure 5.2 demonstrates strong classification, reducing false
positives while increasing true positives.
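The metric definitions above can be checked with a short sketch that computes them from confusion-matrix counts (TP, FP, TN, FN). The counts in the first call are hypothetical and serve only to exercise the formulas; the second call checks that a precision of 0.80 and a recall of 0.82 give an F1 score of about 0.81, as reported. The helper names are assumptions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to exercise the formulas.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))

# Consistency check of the reported precision (0.80) and recall (0.82).
print(round(f1_from(0.80, 0.82), 2))
```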
                       METRICS
MODEL       Accuracy   Precision   Recall   F1
RF & SVM    91%        79%         79%      79%
DT          90%        75%         82%      78%
KNN         77%        48%         20%      28%
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT
This two-stage machine learning system represents a promising step toward the early
detection of Autism Spectrum Disorder (ASD) among individuals with Fragile X Syndrome
(FXS). By employing a combination of Random Forest (RF) and Support Vector Machine
(SVM) classifiers, the model achieves a classification accuracy of 91%, demonstrating its
potential for clinical applications. The RF model, serving as the first stage of the system,
plays a crucial role in feature selection and pattern extraction. It efficiently identifies key
attributes that contribute to ASD classification, reducing noise and improving predictive
performance. The second stage, powered by the SVM classifier, further refines the classification
by distinguishing between ASD-positive and ASD-negative cases with high precision. This two-
tiered approach enhances the model’s reliability, ensuring robust performance even with
complex, high-dimensional data. One of the key advantages of this system is its ability to
process large datasets while maintaining accuracy and efficiency. Unlike traditional diagnostic
methods that rely on behavioral assessments, this machine learning-driven approach provides an
objective, data-driven solution that minimizes subjectivity.
APPENDIX
SOURCE CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_curve,
                             auc, log_loss)
# Load dataset
df = pd.read_csv("/content/drive/My Drive/train.csv")
# Encode categorical variables
label_encoders = {}
for col in df.columns:
    if df[col].dtype == "object":
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le
X = df.drop(columns=["Class/ASD"])
y = df["Class/ASD"]
intervals = [200, 400, 600, 800, 1000, 1200, 1400]
# Train RF+SVM
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
y_rf_pred = rf.predict_proba(X_test)[:, 1]
# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_prob = knn.predict_proba(X_test)[:, 1]
# Evaluate models
models = {
    "RF+SVM": (ensemble_pred_final, ensemble_pred),
    "Decision Tree": (dt_pred, dt_prob),
    "KNN": (knn_pred, knn_prob),
}
model_metrics[model]["Accuracy"].append(acc)
model_metrics[model]["Precision"].append(prec)
model_metrics[model]["Recall"].append(rec)
model_metrics[model]["F1 Score"].append(f1)
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, prob)
roc_curves[model].append((fpr, tpr))
roc_auc_values[model].append(auc(fpr, tpr))
# Plot separate graphs for each performance metric with all models
for metric in metrics:
    plt.figure(figsize=(10, 6))
    for model in model_metrics:
        plt.plot(intervals, model_metrics[model][metric], marker='o', linestyle='-',
                 label=f"{model}")
    plt.xlabel("Dataset Size")
    plt.ylabel(metric)
    plt.title(f"{metric} Comparison Across Models")
    plt.legend()
    plt.grid()
    plt.show()
plt.title("ROC Curves at Final Dataset Size")
plt.legend()
plt.grid()
plt.show()
# ======================
# Combined FXS & ASD Analysis
# ======================
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
# Load data
data_path = "/content/drive/My Drive/test.csv"
df = pd.read_csv(data_path)
# 1. Fragile X Detection
# ======================
fxs_features = [
    'A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',
    'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',
    'austim', 'jaundice'
]
# 2. ASD Prediction with FXS Consideration
# ======================
# Preprocessing for ASD
cat_columns = [
    'ethnicity', 'gender', 'contry_of_res',
    'used_app_before', 'age_desc', 'relation'
]
# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(asd_features)
base_predictions = classifier.predict(X)
for idx, (fx_prob, base_pred) in enumerate(zip(df['fxs_probability'], base_predictions)):
    if fx_prob >= 70:  # Threshold for FXS-induced ASD
        final_pred = "ad (FXS-induced)"
    else:
        final_pred = "ad" if base_pred == 1 else "non-ad"
    print(f"Row {idx}:")
    print(f"FXS Probability: {fx_prob:.1f}%")
    print(f"ASD Prediction: {final_pred}")
    print("-" * 50)
    final_predictions.append(final_pred)
# Add to dataframe
df['ASD_Final'] = final_predictions
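The threshold rule in the loop above can also be expressed as a small standalone function, which makes it easy to test in isolation. The sketch below reuses the 70% threshold and the three labels from the listing; the function name `combine_predictions` and the four sample rows are hypothetical.

```python
FXS_THRESHOLD = 70.0  # percent, as in the listing above


def combine_predictions(fxs_probability, base_prediction, threshold=FXS_THRESHOLD):
    """Return the final label for one row using the threshold rule."""
    if fxs_probability >= threshold:
        return "ad (FXS-induced)"
    return "ad" if base_prediction == 1 else "non-ad"


# Hypothetical (FXS probability in percent, base ASD prediction) pairs.
rows = [(85.0, 0), (70.0, 1), (42.5, 1), (10.0, 0)]
final_predictions = [combine_predictions(p, b) for p, b in rows]
print(final_predictions)
```

Note that under this rule a row at or above the threshold is labeled "ad (FXS-induced)" regardless of the base classifier's output.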