Moodangels
Abstract
1 Introduction
Mental diseases [1], with their high prevalence and profound societal impact, pose a major public
health challenge by severely impairing quality of life. Accurate diagnosis [2] is essential for timely
intervention and effective treatment [3], yet the complexity and variability of symptoms make it
particularly difficult [4], highlighting the need for advanced diagnostic tools to aid clinicians. Among
mental diseases, mood disorder, including conditions like depression and bipolar disorder, is critical
due to its high prevalence and the significant overlap of symptoms with other psychiatric conditions [5].
Correctly diagnosing mood disorders is crucial, as it influences the diagnostic process for other
disorders [6]; for example, symptoms like difficulty concentrating may signal neurodevelopmental
disorders only if they occur outside of depressive episodes. Given their prevalence and severe
consequences, including suicide risk and chronic disability, mood disorders represent a significant
burden on individuals and healthcare systems.
While large language models (LLMs) and LLM-based agents demonstrate strong capabilities in
medical domains through robust textual analysis and decision-making, their application to psychiatry
faces unique challenges. General medical diagnostic agents typically rely on concrete medical
records or biological test results [7, 8], resources largely unavailable in psychiatric practice. These
limitations interact with the field’s inherent uncertainties, including significant symptom overlap
across disorders [9–11] and the absence of definitive biomarkers [12, 13] that are standard in other
medical specialties, creating fundamental barriers for AI implementation. These diagnostic difficulties
are intensified by psychiatry’s reliance on nuanced interpretation of subjective clinical data, which
contrasts sharply with the structured evidence typically used to train LLM-based agents [7, 8].
Moreover, the situation faces additional complications from data accessibility constraints. Clinical
diagnostic information, while rich in potential insights, contains sensitive patient data that cannot
be publicly shared, creating a critical bottleneck for AI-driven psychiatric research. Taken together,
this dual challenge of diagnostic complexity and data scarcity underscores the urgent need for
specialized systems capable of emulating clinicians’ probabilistic reasoning under conditions of
imperfect information [14].
¹ Code and synthetic data sample are available at https://github.com/elsa66666/MoodAngels.
Figure 1: The MoodAngels framework. Diagnostic agents include Angel.R, Angel.D, Angel.C, and
multi-Angels.
To address these challenges, we propose MoodAngels (Figure 1), the first retrieval-augmented
multi-agent framework for mood disorder diagnosis, which enhances the diagnostic process through
granular-scale analysis and multi-step verification. The system tackles an inherent challenge of
psychiatric diagnosis, namely its reliance on potentially unreliable self-reported and clinician-estimated
scales (with optional medical records), by decomposing traditional scale scoring into item-level analysis.
Using Pearson correlation, we identify the top 5% most statistically significant questions for mood
disorders, categorizing them into five diagnostic groups (depression, suicidal ideation, energy/interest
loss, anxiety, and insomnia). By analyzing consistency within these groups, MoodAngels can
resolve diagnostic discrepancies (e.g., when self-reported depressive symptoms conflict with clinical
observations) through additional behavioral marker validation.
To address symptom overlap, we structure DSM-5 diagnostic criteria [15] into a retrievable knowledge
base, augmented with anonymized clinical data to incorporate expert judgment and handle ambiguous
cases. We develop three diagnostic variants to balance historical reliance with individual variability:
Angel.R (no reference), Angel.D (case display), and Angel.C (comparative analysis). Our final multi-
Angels model synthesizes their independent diagnoses through debate, combining computational
efficiency with clinical nuance.
Beyond diagnostic frameworks, clinical data scarcity presents another research challenge. Despite
longstanding interest in depression and bipolar detection, existing methods predominantly analyze
social media posts [16–18], which may be distorted by social norms or exaggeration. To enable
accurate AI-driven psychiatric diagnosis and early detection, we construct the open-source synthetic
dataset MoodSyn with 1,173 synthetic psychiatric cases, containing: selected five groups of top-
related scale data, total scores from 13 common mental disorder scales, and mood disorder labels.
Through a comprehensive evaluation of quality, ML efficiency, and privacy protection, we demonstrate
that this synthetic dataset maintains high fidelity while safeguarding visitor confidentiality.
We evaluate the effectiveness of our framework on 561 real-world clinical cases and 140 synthetic
cases. Experimental results on both real cases and synthetic data demonstrate the superiority of our
diagnostic process, with all versions of our agents outperforming bare LLMs that rely solely on
client performances as context. Notably, on real data, even our raw agent, Angel.R, achieves a 12.3%
higher diagnostic accuracy than our backbone LLM, GPT-4o. The multi-agent framework further
enhances performance, surpassing the accuracy of all single-agent variants and showing considerable
improvement on hard cases. Ablation studies confirm the individual contributions of our medical
record analysis and scale selection processes, highlighting their critical roles in improving diagnostic
precision.
Our contributions are summarized as follows:
The primary challenge in constructing the diagnostic framework involves accurately identifying
information that reliably reflects the visitor’s actual condition. This difficulty arises because psychiatric
diagnoses depend on self-reported data and clinician-estimated scales (supplemented by optional
medical records), which carry inherent risks of misjudgment or misrepresentation.
To address this challenge, we introduce an innovative granular-scale analysis approach. Rather
than evaluating scales solely through total scores, we decompose them into individual item-level
responses. We employed a comprehensive set of clinical scales to assess various aspects of mental
health, including eight self-reported scales² and five clinician-evaluated scales³. These tools provide
valuable insights into both subjective experiences and symptoms, as well as clinical observations and
structured evaluations of clients’ mental states and behaviors. All scales are routinely used in hospital
settings for clinical diagnosis, ensuring the reliability and validity of the collected data.
² The self-reported scales are CTQ, DAS, GAD-7, HCL-32, MDQ, NSSI, PHQ-9 and SHAPS.
³ The clinician-administered scales are BPRS, HAMA, HAMD-24, MCCB and YMRS. In this
study, we employ the Chinese version of the HAMD-24, which features a different question order compared to
the English version.
To identify the most relevant questions for mood disorder diagnosis, we computed the Pearson
correlation between each question’s score (and total score) and the presence of a mood disorder,
selecting the top 5% with the highest correlations. These questions naturally clustered into key
symptom groups: depressive mood, loss of interest, anxiety, insomnia, and suicidal tendencies.
These groups enhance diagnostic robustness through cross-validation and comprehensive symptom
coverage. To further refine our framework, we included clinically significant PHQ-9 questions, such
as phq9_Q2 (depressed mood) and phq9_Q1 (loss of interest), even if their correlation scores were
slightly below the threshold, ensuring a nuanced and reliable diagnostic process (see Appendix D).
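As an illustration of this selection step, the following minimal sketch ranks items by their Pearson correlation with a binary mood-disorder label and keeps the top fraction. It is not the authors' implementation: the function name `select_top_items`, the array layout, and the choice to rank by signed rather than absolute correlation are assumptions made here for illustration.

```python
import math

import numpy as np


def select_top_items(scores, labels, item_names, top_fraction=0.05):
    """Rank items by Pearson r with a binary label and keep the top fraction.

    scores: (n_cases, n_items) item scores; labels: (n_cases,), 1 = mood disorder.
    Ranking by signed r is an assumption; the source only says "highest correlations".
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    xc = x - x.mean(axis=0)          # center each item column
    yc = y - y.mean()                # center the label
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Items with zero variance get r = 0 instead of a division error.
    r = np.divide(xc.T @ yc, denom, out=np.zeros_like(denom), where=denom > 0)
    n_keep = max(1, math.ceil(top_fraction * len(item_names)))
    order = np.argsort(-r)[:n_keep]  # highest correlation first
    return [(item_names[i], float(r[i])) for i in order]
```

With a binary label, Pearson correlation coincides with the point-biserial correlation, so no separate formula is needed for the diagnosis column.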
By evaluating response consistency within each symptom group, we derive more accurate inferences
about the visitor’s probable condition. For instance, when a visitor reports frequent depressive
symptoms on a self-assessment scale but clinicians observe no corresponding depressive signs,
this discrepancy directs MoodAngels to investigate additional behavioral markers for diagnostic
validation.
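The consistency check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the agents' actual logic (which is carried out by an LLM): the `ItemResponse` structure, the normalization by each item's maximum score, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemResponse:
    group: str      # symptom group, e.g. "depressive mood"
    source: str     # "self" or "clinician"
    score: int      # raw item score
    max_score: int  # highest possible score for this item


def find_group_conflicts(responses, threshold=0.5):
    """Return {group: gap} where mean normalized self-report and clinician
    severity differ by more than `threshold` (a hypothetical cut-off)."""
    totals = {}
    for r in responses:
        key = (r.group, r.source)
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + r.score / r.max_score, count + 1)

    conflicts = {}
    for group in sorted({g for g, _ in totals}):
        if (group, "self") in totals and (group, "clinician") in totals:
            self_sum, self_n = totals[(group, "self")]
            clin_sum, clin_n = totals[(group, "clinician")]
            gap = self_sum / self_n - clin_sum / clin_n
            if abs(gap) > threshold:
                conflicts[group] = gap  # positive: self-report is more severe
    return conflicts
```

A flagged group would then be a cue to look at further evidence, such as behavioral markers or the medical record, rather than a diagnosis in itself.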
Since overlapping symptoms may correspond to multiple disorders, we extract and structure diagnostic
and differential criteria from the Diagnostic and Statistical Manual of Mental Disorders: DSM-5 [15],
a widely recognized authority in psychiatry, to build a retrievable knowledge base. The knowledge
base construction process is detailed in Appendix C.
To prevent MoodAngels from making arbitrary decisions based solely on symptom presentation, we
also incorporate clinicians’ diagnostic expertise by including anonymized clinical data for retrieval.
These experiences are also beneficial when a visitor’s symptoms are ambiguous, for historical
diagnostic precedents may offer additional interpretive insights. The clinical data used in our study
consists of anonymized real-world hospital cases, totaling 2804 entries. We partitioned the dataset
such that 80% of the cases are used as historical cases for retrieval, while the remaining 20% serve
as the test set. All clients in the dataset have completed scale assessments, although clients without
diagnosed conditions do not have medical records available, and our agents are not pre-informed about
this distinction. The dataset statistics are summarized in Table 1. Appendix E details the processing
steps applied to both medical records and scale data, which aim to optimize the performance of
LLM-based agents for more accurate analysis.
To mitigate overreliance on past cases (which could overlook individual variability in psychiatric
diagnoses), we develop three diagnostic variants with differing levels of historical dependence:
Angel.R (no reference to previous cases), Angel.D (displays retrieved cases as context), and Angel.C
(compares each retrieved case with the current query and returns an analysis as context).
By aggregating independent diagnoses from these three agents and facilitating debate among their
conclusions, our final diagnosis model, multi-Angels, bridges the gap between computational decision-
making and the nuanced understanding essential for accurate psychiatric evaluations.
The following parts introduce the main components of our agents:
Symptom Matching. To align client symptoms in medical records with DSM-5 diagnostic criteria,
we process records and compute relevance between records and criteria using dense vector encoding.
The BGE-M3 embedder [19] is employed for its strong semantic embedding capabilities. We retrieve
the top-5 most similar criteria, returning their text, classification, and similarity scores (detailed in
Appendix G.1). The tool does not diagnose but provides results for agent analysis, ensuring decisions
integrate quantitative data and clinical expertise, mitigating over-reliance on single metrics.
For cases with overlapping symptoms, an additional instruction prompts the agent to consider
differential diagnosis, guiding systematic evaluation of potential conditions. This enhances the
agent’s ability to distinguish between mood disorders and other diseases.
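The retrieval step of symptom matching can be sketched as a top-k cosine-similarity search over dense vectors. The sketch below assumes the record and the criteria have already been embedded (for example by BGE-M3) and does not call any embedding library; the function name `top_k_criteria` and the dictionary keys are illustrative assumptions.

```python
import numpy as np


def top_k_criteria(record_vec, criteria_vecs, criteria, k=5):
    """Return the k criteria most similar to the record by cosine similarity.

    record_vec: (d,) embedding of the processed medical record.
    criteria_vecs: (n, d) embeddings of the knowledge-base criteria.
    criteria: list of dicts with hypothetical keys "text" and "classification".
    """
    q = np.asarray(record_vec, dtype=float)
    m = np.asarray(criteria_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # unit-normalize the query
    m = m / np.linalg.norm(m, axis=1, keepdims=True)  # unit-normalize each row
    sims = m @ q                                      # cosine similarities
    order = np.argsort(-sims)[:k]                     # most similar first
    return [
        {
            "text": criteria[i]["text"],
            "classification": criteria[i]["classification"],
            "similarity": float(sims[i]),
        }
        for i in order
    ]
```

Returning the similarity score alongside the text and classification mirrors the tool's role as described: it supplies evidence for the agent to weigh and does not itself decide the diagnosis.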
Scale Performance Analysis. We diagnose the presence of mood disorder using 16 key questions
selected in Section 2.1 as the most mood-relevant items. Client performances are converted from
numeric scores to textual descriptions based on question content and options. For agent interpretability,
performances are reorganized into coherent descriptive paragraphs, enhancing analysis effectiveness.
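To make the score-to-text conversion concrete, here is a minimal sketch for two PHQ-9 items. The option wording follows the standard English PHQ-9 response scale; the item table, the sentence template, and the function name `describe_responses` are illustrative assumptions and may differ from the templates the authors actually use.

```python
# Standard PHQ-9 response options, phrased so they read naturally in a sentence.
PHQ9_OPTIONS = {
    0: "not at all",
    1: "on several days",
    2: "on more than half the days",
    3: "nearly every day",
}

# Only two items are shown; the table layout is hypothetical.
PHQ9_ITEMS = {
    "phq9_Q1": "little interest or pleasure in doing things",
    "phq9_Q2": "feeling down, depressed, or hopeless",
}


def describe_responses(responses):
    """Turn {item_id: numeric score} into one descriptive paragraph."""
    sentences = [
        f"Over the last two weeks, the client reported {PHQ9_ITEMS[item_id]} "
        f"{PHQ9_OPTIONS[score]}."
        for item_id, score in responses.items()
    ]
    return " ".join(sentences)
```

For example, `describe_responses({"phq9_Q2": 3})` yields a sentence stating that the client felt down, depressed, or hopeless nearly every day, which is easier for an LLM-based agent to reason over than the bare score 3.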
Similar Cases Retrieval. To leverage clinical experience from similar cases, we develop two optional
tools for retrieving medical records and scales with similar performance. After performing similarity
retrieval, our tools generate different outputs tailored to the type of diagnosis agent in use. For
Angel.R, this tool is intentionally excluded to minimize potential interference from the diagnostic
outcomes of other cases. For Angel.D, the tool directly returns the retrieved cases for reference,
enabling the agent to review and draw insights from them. For Angel.C, the tool conducts a detailed
comparison of similarities and differences among the retrieved cases and returns an analysis text
summarizing the findings. Consistent with the symptom matching tool, we employ BGE-M3 as the
retriever and select the top 5 most relevant cases as the retrieved records. This approach ensures that
the tool adapts to the specific needs of each diagnosis agent, enhancing the diagnostic process while
maintaining flexibility and precision.
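The agent-specific behaviour of the retrieval tool can be summarized in a short dispatch sketch. It assumes the top-5 cases have already been retrieved and that `compare` is a caller-supplied function (in practice an LLM call) that contrasts one retrieved case with the current query; both the function name `similar_cases_context` and this interface are hypothetical.

```python
def similar_cases_context(agent, query, cases, compare):
    """Build the context string the retrieval tool hands to each agent type.

    agent: "Angel.R", "Angel.D", or "Angel.C".
    cases: already-retrieved similar cases, as text.
    compare: callable(query, case) -> analysis text (hypothetical interface).
    """
    if agent == "Angel.R":
        # Angel.R deliberately receives no historical cases.
        return None
    if agent == "Angel.D":
        # Angel.D sees the retrieved cases verbatim.
        return "\n\n".join(f"Case {i}: {case}" for i, case in enumerate(cases, start=1))
    if agent == "Angel.C":
        # Angel.C sees a comparison of each case against the current query.
        return "\n\n".join(
            f"Case {i}: {compare(query, case)}" for i, case in enumerate(cases, start=1)
        )
    raise ValueError(f"unknown agent type: {agent}")
```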
Multi-agent Diagnosis. To integrate insights from all Angels and improve diagnostics, Angel.R,
Angel.D, and Angel.C first provide independent decisions and reasoning. A Judge Agent consolidates
their inputs. If consensus is reached, the Judge outputs the diagnosis and reasoning.
For disagreements, two Debate Agents are introduced: a Positive Agent, supporting a mood disorder
diagnosis, and a Negative Agent, opposing it. Both Debate Agents and the Judge access symptom
matching results, scale performances, relevant cases, and the Angels’ diagnosis and reasoning. In
each debate round, the Positive Agent speaks first, followed by the Negative Agent. After each round,
the Judge evaluates the arguments and decides whether to conclude the debate. If concluded, the
Judge delivers the final diagnosis and supporting reasons.
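The control flow of this procedure can be sketched as below. The agents are passed in as plain callables standing in for LLM calls, and the `max_rounds` cap, the `must_decide` flag, and all function signatures are assumptions introduced for the sketch; the source does not state a round limit.

```python
def run_multi_angels(opinions, positive, negative, judge, max_rounds=3):
    """Consensus-or-debate loop.

    opinions: list of (has_mood_disorder: bool, reasoning: str), one per Angel.
    positive / negative: callable(opinions, transcript) -> argument text.
    judge: callable(opinions, transcript, must_decide) -> verdict or None.
    max_rounds is a hypothetical safety cap, not taken from the source.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    transcript = []
    if len({diagnosis for diagnosis, _ in opinions}) == 1:
        # All Angels agree: the Judge consolidates without a debate.
        return judge(opinions, transcript, True)
    for round_index in range(max_rounds):
        # The Positive Agent speaks first, then the Negative Agent.
        transcript.append(("positive", positive(opinions, transcript)))
        transcript.append(("negative", negative(opinions, transcript)))
        must_decide = round_index == max_rounds - 1
        verdict = judge(opinions, transcript, must_decide)
        if verdict is not None:
            return verdict  # the Judge chose to conclude the debate
    raise RuntimeError("judge returned no verdict in the final round")
```

Keeping the Judge's decision to continue or stop inside the loop reflects the description above: the debate length is not fixed in advance but determined round by round.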
3 MoodSyn Dataset
The MoodSyn dataset addresses the critical need for clinically valid yet privacy-preserving data in
computational psychiatry by introducing an open-source collection of 1,173 synthetic cases; its
statistics are shown in Table 2. Each case captures the essential characteristics of psychiatric assessments
through 25 carefully selected features, including 16 diagnostic questions, 8 standard scale scores, and
expert-verified mood disorder labels. A data example is shown in Appendix F.1.
Table 2: Descriptive statistics of the synthetic MoodSyn dataset. Positive counts represent cases with
mood disorder diagnoses, while negative counts indicate an absence of mood disorders.
                      Positive   Negative   Total
Cases for retrieval        687        419    1106
Cases for test              73         67     140
learning applications. Specifically, our evaluation reveals high-density preservation of univariate
and multivariate distributions matching the original data’s feature patterns, robust performance in
downstream diagnostic prediction tasks comparable to real data, and near-indistinguishability in
rigorous detection tests. Furthermore, MoodSyn provides stronger privacy guarantees than traditional
anonymization approaches through its synthetic generation process. This unique combination of
statistical fidelity, clinical validity, and privacy protection establishes MoodSyn as both a reliable
analytical surrogate and a valuable resource for advancing AI applications in mental health research.
4 Experiment
Datasets. We assess the effectiveness of our agent framework on the test set described in Table
1, which comprises 561 cases. These include 315 normal cases, 56 cases of mood disorders, and
190 cases of other mental disorders. Cases without mental disorders contain only scale data and no
medical records, whereas the remaining cases include both medical records and scale data. All cases
are derived from real clinical data collected from our corporate hospital, ensuring the evaluation
reflects practical diagnostic scenarios.
Baselines. To evaluate the effectiveness of our agent, we compare it against four baseline
LLMs: LLaMA3-8B-Instruct [21], Mistral-7B-Instruct-v0.3 [22], GPT-4o (gpt-4o-2024-08-06) [23], and
DeepSeek-V3 [24]. Each LLM is provided with a standardized input consisting of (1) a combined
medical record, presented as a unified narrative, and (2) a summary of scale test results, listing the
client’s scores across multiple psychological scales along with their clinical implications. The LLMs
are prompted to determine whether the client has a mood disorder (including depression or bipolar
disorder) and to provide a structured explanation in JSON format. An example of the prompt used for
baseline models is shown in Figure 8.
Hyperparameters. For each model, we employ the default parameter settings, using the official
model releases of open-source LLMs from Hugging Face or the API from the official website. All
experiments run on a computational infrastructure consisting of two NVIDIA A800
Tensor Core GPUs, equipped with 80GB of memory.
Evaluation Metrics. Following previous methods [16], we utilize Recall, Accuracy (ACC), Matthews
Correlation Coefficient (MCC) [25], and Macro F1 to evaluate the performance of various versions
of MoodAngels and baseline LLMs on the mood disorder diagnosis.
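For reference, the four metrics can be computed from the binary confusion matrix as in the sketch below. The function name `binary_metrics` is illustrative, and treating recall as recall of the positive (mood disorder) class is an assumption, since the text does not specify the averaging used for recall.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, positive-class recall, MCC, and macro F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    acc = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0 if undefined.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Macro F1 is the unweighted mean of the per-class F1 scores.
    f1_pos = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    f1_neg = 2 * tn / (2 * tn + fn + fp) if 2 * tn + fn + fp else 0.0
    macro_f1 = (f1_pos + f1_neg) / 2

    return {"acc": acc, "recall": recall, "mcc": mcc, "macro_f1": macro_f1}
```

MCC and macro F1 are informative here because the real test set is imbalanced (56 mood disorder cases out of 561), so accuracy alone can look high for a model that rarely predicts the positive class.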
Our diagnostic framework improves on the baselines through two components. First, unlike
conventional approaches that interpret only the total scores of standardized scales, MoodAngels
performs granular item-level analysis that identifies and groups mood-relevant questions. This avoids
the information loss of aggregate scoring and exposes symptom patterns that total scores would
otherwise obscure. Second, the framework replaces the direct diagnosis generation used by baseline
models with a structured multi-step verification process. While Angel.R establishes core functionality
through DSM-5 criteria referencing, our more advanced variants incorporate clinical case retrieval and
inter-agent debate, a substantial departure from the direct inference approach employed by bare LLMs.
The experimental results presented in Tables 3 and 4 demonstrate the clear advantages of this
approach. All MoodAngels variants show substantial improvements over baseline LLMs, with even
our foundational agent achieving 12.3% greater accuracy than GPT-4o, underscoring the effectiveness
of our redesigned diagnostic process. A deeper examination of the agent comparisons reveals
important insights about psychiatric diagnosis. The performance advantage of Angel.D over Angel.R
confirms the clinical value of historical cases as reference points, while the slight dip observed with
Angel.C serves as a caution against over-reliance on past cases. Particularly telling are those instances
where only one agent succeeds in making the correct diagnosis while others fail, highlighting both
the complexity of psychiatric assessment and the complementary nature of different diagnostic
approaches.
These findings collectively validate our multi-agent framework’s design, which synthesizes granular
symptom analysis, structured diagnostic protocols, and balanced clinical experience integration
through its collaborative debate mechanism. The framework’s ultimate performance superiority
emerges from this sophisticated combination of innovations, each addressing specific limitations in
conventional psychiatric assessment methods while working in concert to achieve more accurate and
reliable diagnoses.
In this section, we apply Angel.R to real data to test the effectiveness of the processed medical
records and the selected scales.
facts, potentially overlooking nuanced symptoms. This makes it more challenging to identify patients
with atypical presentations of mood disorders. For example, a patient with severe depression may
not exhibit obvious signs of low mood or mania but instead feel a sense of hopeless calmness and
express suicidal ideation. This "hopeless calmness" is clearly conveyed in a narrative-style medical
record but may be less apparent in a structured format.
In summary, while structured medical records offer consistency and objectivity, they may lack the
nuanced details present in narrative-style records, which are crucial for identifying atypical or subtle
symptoms of mood disorders. This highlights the importance of balancing structured data with rich,
descriptive narratives to ensure accurate and comprehensive psychiatric diagnosis.
Table 5: Diagnosis accuracy of different medical record formats in symptom matching and agent
processing steps. The numbers indicate different combinations of experimental settings for ease of
reference. In the experimental analysis, these settings are directly referred to as Setting 1, Setting 2
and Setting 3.
Setting   Symptom Matching   Return to Agent   ACC               MCC
1         unstructured       unstructured      0.920             0.829
2         structured         unstructured      0.918 (0.002↓)    0.822 (0.007↓)
3         structured         structured        0.914 (0.006↓)    0.814 (0.015↓)
To demonstrate the robustness and diagnostic capabilities of our agents, we present two critical
scenarios: inter-scale conflicts and overlapping symptoms. In inter-scale conflict cases, clients
may misjudge their condition in self-reported scales or, due to certain personality traits or stress
responses, may not fully disclose their true feelings to clinicians. In such situations, our approach
of grouping and analyzing similar questions across different scales proves invaluable, enabling the
agent to identify inconsistencies and arrive at a more accurate diagnosis. Additionally, we examine
cases with overlapping symptoms, where a single symptom may be indicative of multiple diseases,
significantly complicating the diagnostic process. In these instances, simple symptom matching alone
may fail to pinpoint the accurate disease. However, by leveraging historical case experiences and
employing a multi-agent debate framework to explore various possibilities, MoodAngels achieves a
more nuanced and precise diagnosis, bringing it closer to the underlying truth. These case studies
highlight the strengths of our method in handling complex real-world diagnostic challenges. The
intuitive presentations of these cases are shown in Appendix I.
Inter-Scale Conflicts. We analyze a case where self-reported scales conflict with clinician-evaluated
performances. Although the client scored 11 on the PHQ-9, suggesting moderate depression, self-
reports are subjective. Clinician evaluations across multiple professional scales (e.g., HAMD, HAMA,
BPRS) revealed no depressive symptoms or energy decline, providing more objective diagnostic
assurance. Historical cases with similar self-reported depression but no clinician-confirmed symptoms
further support this conclusion. While the client reported mild to moderate anxiety, which correlates
with mood disorders, it was insufficient for a diagnosis, especially given the clinician’s negative
findings. The absence of clinician-confirmed symptoms led our agent to conclude no mood disorder,
demonstrating its ability to resolve inter-scale conflicts through comprehensive analysis of grouped
questions and historical data.
Overlapping Symptoms. We analyze a particularly challenging diagnostic scenario with overlapping
symptoms. While the client’s self-reports indicated negative emotions, suicidal tendencies, and loss
of interest, the clinician found none of these. However, the medical record confirmed the self-reports,
also noting delusions and self-talk. Symptom matching with DSM-5 criteria further complicated
the diagnosis, as the top five matched disorders spanned Mood Disorder, Personality Disorder, and
Neurocognitive Disorder, none of which included the actual diagnosis of Schizophrenia. Only
through historical case retrieval and multi-agent debate did the Judge Agent identify that the client
had concealed symptoms during clinical interviews, ultimately leading to the correct Schizophrenia
diagnosis. This case highlights the difficulty of diagnosing complex cases with overlapping symptoms
and underscores our agent’s strength in integrating diverse data sources to uncover hidden diagnostic
patterns.
We examined cases where the three single Angels disagreed and instances where the multi-agent
debate still resulted in incorrect diagnoses. These cases typically involved borderline symptoms or
conflicts between medical records and scale performances. The latter are particularly challenging,
as they often require additional information to make a more confident diagnosis.
Borderline Cases. Some clients’ self-reported and clinician-evaluated scale performances were
highly consistent, both indicating mild degrees of depression, anxiety, insomnia, and loss of interest.
After discussion with our coauthor expert, we concluded that while such clients do not currently meet
the diagnostic threshold for major depressive disorder, they fall on the borderline between normal
and mild depression. Such cases warrant close attention and monitoring.
Conflicts between Medical Record and Scale Performances. In some cases, the medical record
indicates severe symptoms, while both self-reported and clinician-evaluated scales show minimal or
no symptoms related to mood disorders, with most relevant questions scoring zero. This discrepancy
may arise when the client is in remission from a severe mood disorder, when medication temporarily
suppresses negative emotions, or when the client has recovered. Recognizing this challenge, we
enhanced our agents to flag such cases for clinicians, prompting further investigation. While the
agents may misjudge the presence of a mood disorder in these scenarios, this additional signal ensures
that critical cases are not overlooked, thereby improving diagnostic reliability and clinical utility.
5 Conclusion
In conclusion, we propose MoodAngels, the first specialized multi-agent framework for mood
disorder diagnosis that addresses key challenges in psychiatry through granular-scale analysis and
structured multi-step verification, achieving superior accuracy over existing methods. Our framework
is complemented by MoodSyn, an open-source synthetic dataset of 1,173 clinically validated cases
that enables research while preserving privacy. Experimental results demonstrate the effectiveness
of our approach, with the baseline agent outperforming GPT-4o by 12.3% and the full multi-agent
system achieving even greater accuracy. These contributions advance AI applications in mental health
by providing both an effective diagnostic framework and a valuable research resource that addresses
critical gaps in computational psychiatry.
Acknowledgments
This work is partially supported by the National Key Research and Development Program of China
(2024YFC3308400) and the National Natural Science Foundation of China (U23A20316), and funded by
Joint&Laboratory on Credit Technology. We gratefully acknowledge all clinicians who participated
in the controlled reader study for their valuable contributions to model evaluation.
References
[1] Dominic Murphy and Robert L Woolfolk. The harmful dysfunction analysis of mental disorder.
Philosophy, Psychiatry, & Psychology, 7(4):241–252, 2000.
[2] Steven E Hyman. The diagnosis of mental disorders: the problem of reification. Annual review
of clinical psychology, 6(1):155–179, 2010.
[3] Robert J Haggerty and Patricia J Mrazek. Reducing risks for mental disorders: Frontiers for
preventive intervention research. National Academies Press, 1994.
[4] Carolyn E Schwartz, Robert M Kaplan, John P Anderson, Troy Holbrook, and M Wilson
Genderson. Covariation of physical and mental symptoms across illnesses: Results of a factor
analytic study. Annals of Behavioral Medicine, 21(2):122–127, 1999.
[5] Jeffrey Rakofsky and Mark Rapaport. Mood disorders. CONTINUUM: Lifelong Learning in
Neurology, 24(3):804–827, 2018.
[6] Eiko I Fried, Claudia D van Borkulo, Angélique OJ Cramer, Lynn Boschloo, Robert A Schoevers,
and Denny Borsboom. Mental disorders as networks of problems: a review of recent insights.
Social psychiatry and psychiatric epidemiology, 52:1–10, 2017.
[7] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng
Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable
medical agents. arXiv preprint arXiv:2405.02957, 2024.
[8] Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. Beyond
direct diagnosis: Llm-based multi-specialist agent consultation for automatic diagnosis. arXiv
preprint arXiv:2401.16107, 2024.
[9] Omid Kohandel Gargari, Farhad Fatehi, Ida Mohammadi, Shahryar Rajai Firouzabadi, Arman
Shafiee, and Gholamreza Habibi. Diagnostic accuracy of large language models in psychiatry.
Asian Journal of Psychiatry, 100:104168, 2024.
[10] Daniel F Gros, Matthew Price, Kathryn M Magruder, and B Christopher Frueh. Symptom
overlap in posttraumatic stress disorder and major depression. Psychiatry research, 196(2-3):
267–270, 2012.
[11] Miriam K Forbes. Implications of the symptom-level overlap among dsm diagnoses for
dimensions of psychopathology. Journal of Emotion and Psychopathology, 1(1):104–112, 2023.
[12] John A Bilello. Seeking an objective diagnosis of depression. Biomarkers in medicine, 10(8):
861–875, 2016.
[13] Shitij Kapur, Anthony G Phillips, and Thomas R Insel. Why has it taken so long for biological
psychiatry to develop clinical tests and what to do about it? Molecular psychiatry, 17(12):
1174–1179, 2012.
[14] Yu He Ke, Rui Yang, Sui An Lie, Taylor Xin Yi Lim, Hairil Rizal Abdullah, Daniel Shu Wei
Ting, and Nan Liu. Enhancing diagnostic accuracy through multi-agent conversations: Using
large language models to mitigate cognitive bias. arXiv preprint arXiv:2401.14589, 2024.
[15] American Psychiatric Association. Diagnostic and statistical manual of mental disorders:
DSM-5, volume 5. American Psychiatric Association, Washington, DC, 2013.
[16] Nawshad Farruque, Randy Goebel, Sudhakar Sivapalan, and Osmar R Zaïane. Depression
symptoms modelling from social media text: an llm driven semi-supervised learning approach.
Language Resources and Evaluation, pages 1–29, 2024.
[17] Yuxi Wang, Diana Inkpen, and Prasadith Kirinde Gamaarachchige. Explainable depression
detection using large language models on social media data. In Proceedings of the 9th Workshop
on Computational Linguistics and Clinical Psychology (CLPsych 2024), pages 108–126, 2024.
[18] Daeun Lee, Hyolim Jeon, Sejung Son, Chaewon Park, Ji hyun An, Seungbae Kim, and Jinyoung
Han. Detecting bipolar disorder from misdiagnosed major depressive disorder with mood-aware
multi-task learning. In Proceedings of the 2024 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (Volume 1:
Long Papers), pages 4954–4970, 2024.
[19] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-
embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through
self-knowledge distillation, 2024. URL https://arxiv.org/abs/2402.03216.
[20] Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos
Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with
score-based diffusion in latent space. In The twelfth International Conference on Learning
Representations, 2024.
[21] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024.
URL https://arxiv.org/abs/2407.21783.
[22] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut
Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL
https://arxiv.org/abs/2310.06825.
[23] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[24] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao,
Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint
arXiv:2412.19437, 2024.
[25] Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage
lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.
[26] Zhuang Chen, Jiawen Deng, Jinfeng Zhou, Jincenzi Wu, Tieyun Qian, and Minlie Huang.
Depression detection in clinical interviews with llm-empowered structural element graph. In
Proceedings of the 2024 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages
8174–8187, 2024.
[27] Ana-Maria Bucur. Leveraging llm-generated data for detecting depression symptoms on social
media. In International Conference of the Cross-Language Evaluation Forum for European
Languages, pages 193–204. Springer, 2024.
Ethical Statement
All medical record de-identification was conducted in an offline environment, ensuring no data was
processed on external or networked systems. To further protect privacy, the medical records presented
in this paper have undergone event and symptom obfuscation, along with partial modification.
Additionally, all cases used in this study were included with the explicit consent of the clients, strictly
for academic research purposes.
Evaluators’ Background. Our evaluation team consists of coauthors who are experts in the field, led
by a professional attending physician and professor with over 20 years of clinical experience. This lead
evaluator is affiliated with one of the most authoritative hospitals in our country, bringing unparalleled
expertise and credibility to the evaluation process. Their deep understanding of psychiatric disorders
and extensive clinical background ensured a rigorous and reliable assessment of the knowledge base.
Limitations
Our study is limited by the available data, which only includes client medical records and scale scores,
restricting our ability to fully replicate the comprehensive diagnostic process employed by clinicians.
Clinician diagnoses typically involve around one hour of patient interviews, during which additional
clinician-evaluated scales are completed and more granular judgments are made based on real-time
interactions. While our approach cannot fully capture this in-depth process, it still provides valuable
insights by leveraging existing data to support mood disorder diagnosis, offering a promising tool for
assisting clinicians in cases where complete diagnostic information may not be available.
A Related Work
A.1 LLM-based Depression and Bipolar-disorder Detection
Recent advancements in depression detection have primarily focused on leveraging large language
models (LLMs) to address the challenges of psychiatric diagnosis. Chen et al. [26] proposed
structuring clinical interviews into a directed acyclic graph to enable automatic diagnosis, though
this approach may struggle with cases where clients misrepresent their conditions. Social media
platforms have also become a valuable resource for depression detection due to the abundance of user-
generated content. Farruque et al. [16] fine-tuned a pre-trained language model on specific datasets to
detect depressive symptoms from self-disclosed tweets, while Wang et al. [17] employed LLMs to
identify depression-related text and predict depression levels from Reddit posts. Additionally, Lee
et al. [18] utilized historical mood swings from users’ past social media activities to detect bipolar
disorder. Efforts to enhance data quality and model performance have further explored synthetic
data generation; for instance, Bucur [27] used LLMs to generate synthetic data for BDI-II symptoms,
enriching datasets with semantic diversity and emotional experiences unique to Reddit posts. These
approaches collectively highlight the potential of LLMs and social media data in advancing depression
detection, though challenges remain in ensuring accuracy and addressing client misrepresentations.
Recent studies have begun to explore the potential of LLM agents in medical diagnosis, though
these approaches often face limitations when applied to psychiatric contexts. Li et al. [7] proposed a
general disease diagnosis framework that leverages medical examination reports, such as blood tests
and cell staining, to retrieve similar cases and rules. A doctor agent then evaluates the information
to make a diagnosis, and correctly diagnosed cases are added to a historical database for future
retrieval. However, this method is unsuitable for psychiatric diagnosis due to its lack of patient data
anonymization, which could lead to privacy concerns and patient resistance. Additionally, psychiatric
diagnosis lacks objective diagnostic tools and often involves overlapping symptoms, making rule-
based approaches ineffective. Similarly, Wang et al. [8] enhanced agent expertise using external
knowledge from the National Institutes of Health for conditions like diarrhea and bronchitis. Their
agent-specialist outputs a probability distribution over possible diagnoses based on patient-reported
symptoms. However, this approach struggles with psychiatric cases where clients may misrepresent
or inaccurately estimate their mental state, highlighting the need for more nuanced frameworks
tailored to mental health diagnoses.
B Mental Disorders
B.1 Problem Definition
Mood disorder diagnosis is a binary classification task aimed at determining whether a client has a
mood disorder, such as depression or bipolar disorder, based on a given case of clinical data. The
input for each case includes structured clinical information, which consists of medical records (which
may be absent for first-time visitors) and scale performance data (which is always available). The
output of the task is a diagnostic result, represented as a binary decision (yes or no), along with
supporting reasons that justify the diagnosis. Formally, for a given case C = (M, S), where M
represents the medical records and S represents the scale performance data, the goal is to determine
f (C) = (ŷ, r). Here, ŷ ∈ {0, 1} is the predicted diagnosis (1 indicating the presence of a mood
disorder and 0 indicating its absence), and r is the reasoning that supports the prediction. This task is
particularly challenging due to the potential absence of medical records for some cases and the need
to provide interpretable reasoning for the diagnostic decision.
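The task interface defined above can be sketched in a few lines of Python. This is only an illustration of the signature f(C) = (ŷ, r): the names `Case`, `Diagnoser`, and `threshold_baseline`, and the cutoff value, are assumptions made for this example and are not the method proposed in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Case:
    """One case C = (M, S)."""
    medical_record: Optional[str]   # M: may be None for first-time visitors
    scale_scores: Dict[str, float]  # S: always available


# f(C) = (y_hat, r): a binary label plus the reasoning that supports it.
Diagnoser = Callable[[Case], Tuple[int, str]]


def threshold_baseline(case: Case, cutoff: float = 10.0) -> Tuple[int, str]:
    """Hypothetical stand-in for f, used only to show the input/output shape.

    It flags a case when an illustrative scale total reaches `cutoff`.
    Both the chosen scale key and the cutoff are assumptions.
    """
    score = case.scale_scores.get("phq9_total_score", 0.0)
    y_hat = 1 if score >= cutoff else 0
    record_note = (
        "no medical record" if case.medical_record is None
        else "medical record present"
    )
    reason = f"phq9_total_score={score} vs cutoff {cutoff}; {record_note}"
    return y_hat, reason


# A first-time visitor: scale scores only, no medical record.
first_visit = Case(medical_record=None, scale_scores={"phq9_total_score": 12})
prediction, reasoning = threshold_baseline(first_visit)
```

Keeping M optional in the type makes the missing-record situation explicit, so any function matching `Diagnoser` has to handle it rather than assume a record exists.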
In this section, we provide a brief overview of 18 common mental disorders, along with examples of
their typical symptoms as outlined in the DSM-5. It is important to note that mental disorders are
complex, and their manifestations often extend beyond the descriptions provided here. The following
content is intended to offer a general understanding of these conditions and should not be considered
exhaustive or definitive.
Neurodevelopmental Disorders. The neurodevelopmental disorders are a group of conditions with
onset in the developmental period. The range of developmental deficits varies from very specific
limitations of learning or control of executive functions to global impairments of social skills or
intelligence. For example, individuals with neurodevelopmental disorders may have difficulties in
speech and language development, problems with social communication and understanding social
cues, repetitive behaviors, and restricted interests.
Schizophrenia Spectrum and Other Psychotic Disorders. Schizophrenia spectrum disorders are
characterized by a range of symptoms that affect thinking, perception, emotional regulation, and
behavior. For example, individuals with auditory hallucinations hear voices that others do not.
Bipolar and Related Disorders. Bipolar and related disorders are characterized by significant
mood swings that include episodes of mania or hypomania (elevated mood) and depression. For
example, individuals with mania may experience an abnormally elevated, expansive, or irritable
mood, increased energy, reduced need for sleep, grandiosity, impulsivity, and excessive engagement
in risky behaviors.
Depressive Disorders. Depressive disorders are characterized by persistent feelings of sadness,
hopelessness, and a loss of interest or pleasure in most activities. For example, individuals with
depression may experience changes in appetite or weight, sleep disturbances (either insomnia or
excessive sleeping), fatigue, feelings of worthlessness or excessive guilt, difficulty concentrating, and
thoughts of death or suicide.
Anxiety Disorders. Anxiety disorders are characterized by excessive fear, worry, or nervousness that
is disproportionate to the actual threat or situation. For example, individuals with generalized anxiety
disorder (GAD) experience chronic, uncontrollable worry about various aspects of life, such as work,
health, or social interactions.
Obsessive-Compulsive and Related Disorders. Obsessive-Compulsive and Related Disorders are
characterized by the presence of obsessions (intrusive, unwanted thoughts) and/or compulsions
(repetitive behaviors or mental acts performed to reduce anxiety). For example, individuals with
Obsessive-Compulsive Disorder (OCD) experience persistent, distressing obsessions and feel com-
pelled to perform rituals or routines to alleviate the anxiety caused by these thoughts.
Trauma- and Stressor-Related Disorders. Trauma- and Stressor-Related Disorders are charac-
terized by the emotional and psychological response to traumatic or stressful events. For example,
individuals with Post-Traumatic Stress Disorder (PTSD) may experience intrusive memories of the trauma
(flashbacks, nightmares), emotional numbness, avoidance of reminders of the event, hypervigilance, and
heightened arousal, such as irritability, difficulty sleeping, and exaggerated startle responses.
Dissociative Disorders. Dissociative disorders are characterized by disruptions or breakdowns in
memory, consciousness, identity, or perception. For example, individuals with Dissociative Identity
Disorder (DID), previously known as Multiple Personality Disorder, exhibit two or more distinct
identities or personality states, each with its own pattern of thinking, feeling, and behaving. These
identities may take control of the person’s behavior and are often accompanied by gaps in memory or
awareness of time.
Somatic Symptom and Related Disorders. Somatic Symptom and Related Disorders are characterized
by the presence of physical symptoms that cause significant distress or impairment and are
not fully explained by a medical condition. For example, individuals with Illness Anxiety Disorder
(formerly known as hypochondriasis) may experience excessive worry about having or developing a
serious illness, despite having few or no physical symptoms.
Feeding and Eating Disorders. Feeding and Eating Disorders are characterized by persistent
disturbances in eating behaviors and related thoughts or emotions that negatively impact physical health,
emotional well-being, and daily functioning. For example, individuals with Anorexia Nervosa have
an intense fear of gaining weight and engage in restrictive eating, leading to significantly low body
weight. They may also have a distorted body image, perceiving themselves as overweight even when
underweight.
Elimination Disorders. Elimination Disorders are characterized by the inappropriate elimination of
urine or feces that is not due to a medical condition. For example, enuresis refers
to the repeated involuntary or intentional voiding of urine, typically at night (bedwetting), beyond the
expected age for bladder control.
Sleep-Wake Disorders. Sleep-Wake Disorders are characterized by disturbances in the quality, timing,
and amount of sleep, leading to significant impairment in daytime functioning and distress. These
disorders are not attributable to the physiological effects of a substance or another medical condition.
For example, individuals with insomnia disorder experience persistent difficulty falling asleep, staying
asleep, or achieving restorative sleep, despite adequate opportunity for sleep, resulting in fatigue,
mood disturbances, and impaired cognitive or social functioning.
Sexual Dysfunctions. Sexual Dysfunctions are characterized by a clinically significant inability
to participate in or experience satisfaction from sexual activity, often causing marked distress or
interpersonal difficulties. These dysfunctions are not better explained by a nonsexual mental disorder,
relationship distress, or the effects of a substance or medical condition. For example, individuals
with erectile disorder experience a persistent or recurrent inability to attain or maintain an adequate
erection during sexual activity, leading to significant distress or interpersonal strain.
Gender Dysphoria. Gender Dysphoria involves a strong and persistent discomfort with one’s assigned
gender at birth and a desire to be treated as the opposite gender. This condition is marked by significant
distress or impairment in functioning due to the incongruence between experienced or expressed
gender and assigned gender. For example, a person assigned female at birth may experience intense
discomfort with their body, wishing to transition to a male gender identity, often leading to emotional
and social distress.
Disruptive, Impulse-Control, and Conduct Disorders. These disorders are characterized by persistent
patterns of behavior in which the rights of others or societal norms are violated. Individuals with
these disorders often exhibit aggressive, antisocial, or impulsive behaviors that disrupt their social,
academic, or occupational functioning. For example, a child with conduct disorder may frequently
engage in theft, aggression toward others, or deliberate property destruction, showing little empathy
or remorse for their actions.
Substance-Related and Addictive Disorders. Substance-Related and Addictive Disorders encompass
a range of problems caused by the use of substances (e.g., alcohol, drugs) or behaviors (e.g., gambling)
that lead to addiction. These disorders are defined by a pattern of substance use or behavior that
leads to significant impairment, including physical, social, or psychological problems. For example,
an individual with alcohol use disorder may find themselves drinking excessively despite negative
consequences, such as relationship issues or health problems, and may experience withdrawal
symptoms when not drinking.
Neurocognitive Disorders. Neurocognitive Disorders involve a decline in cognitive function that
represents a significant change from a previous level of functioning. These disorders can affect
memory, learning, attention, executive function, and perception. For example, Alzheimer’s disease
is a common neurocognitive disorder where individuals experience progressive memory loss and
confusion, often leading to difficulty performing everyday tasks.
Personality Disorders. Personality Disorders are characterized by enduring patterns of behavior,
cognition, and inner experience that deviate markedly from the expectations of an individual’s culture.
These patterns are inflexible and pervasive, leading to distress or impairment in social, occupational,
or other important areas of functioning. For example, individuals with borderline personality disorder
may have intense and unstable relationships, fear of abandonment, and difficulty regulating their
emotions, often leading to impulsive actions or self-harming behaviors.
To gather professional diagnostic criteria, we rely on the Diagnostic and Statistical Manual of Mental
Disorders: DSM-5 [15], a comprehensive and authoritative guide widely used by clinicians and
researchers. The DSM-5 outlines the standardized criteria for the classification and diagnosis of
mental disorders, providing detailed descriptions of symptoms, diagnostic features, and associated
conditions. As a crucial tool in psychiatry, it ensures consistency and accuracy in mental health
diagnosis, making it an essential reference for both clinical practice and research.
Our knowledge base is built by extracting diagnostic criteria and symptoms from the DSM-5, focusing
on mood disorders and other conditions. We reference the diagnostic criteria section to identify
required symptoms and their severity, and the differential diagnosis section to distinguish mood
disorders from similar conditions. This ensures accurate diagnosis while addressing symptom
overlaps.
Diagnostic Criteria Extraction. Symptoms in the DSM-5 are listed individually. To facilitate
comparison with client records, we reframe these using GPT-4o for brevity and clarity, ensuring
independent, complete criteria. We extracted symptoms for 18 common mental disorders, including
all major symptoms of "bipolar and related disorders" and "depressive disorders."
For example, one symptom of bipolar disorder is as follows:
"A distinct period of abnormally and persistently elevated, expansive, or irritable mood and abnor-
mally and persistently increased goal-directed activity or energy, lasting at least 1 week and present
most of the day, nearly every day (or any duration if hospitalization is necessary)."
The extracted symptom description would be:
"Manic episode: Elevated, expansive, or irritable mood, accompanied by increased energy and
activity, lasting at least 1 week and present most of the day, nearly every day."
Differential Criteria Extraction. We use GPT-4o to decompose each complex differential diagnosis
description into distinct expressions. This process ensures precise differentiation between mood
disorders and other conditions.
For example, a differential symptom for bipolar disorder is described as:
"Attention-deficit/hyperactivity disorder. This disorder may be misdiagnosed as bipolar disorder,
especially in adolescents and children. Many symptoms overlap with the symptoms of mania, such
as rapid speech, racing thoughts, distractibility, and less need for sleep. The ’double counting’ of
symptoms toward both ADHD and bipolar disorder can be avoided if the clinician clarifies whether
the symptom(s) represent a distinct episode."
This description was refined into two distinct expressions:
"Attention-deficit/hyperactivity disorder: rapid speech, racing thoughts, distractibility, and less need
for sleep in adolescents and children."
"Bipolar disorder: rapid speech, racing thoughts, distractibility, and less need for sleep in adults."
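One way to picture the resulting knowledge base is as a flat list of short, self-contained expressions, each tagged with the disorder it points to and with whether it came from a diagnostic-criteria or a differential-diagnosis section. The sketch below is an assumption about a possible in-memory layout, not the storage format used in this work; the names `Criterion`, `KNOWLEDGE_BASE`, and `criteria_for` are hypothetical, and the entries are the examples quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criterion:
    disorder: str  # disorder the expression points to
    text: str      # concise, self-contained symptom expression
    kind: str      # "diagnostic" or "differential" (assumed labels)


# Entries copied from the examples in the text above.
KNOWLEDGE_BASE: List[Criterion] = [
    Criterion(
        "Bipolar disorder",
        "Manic episode: Elevated, expansive, or irritable mood, accompanied by "
        "increased energy and activity, lasting at least 1 week and present "
        "most of the day, nearly every day.",
        "diagnostic",
    ),
    Criterion(
        "Attention-deficit/hyperactivity disorder",
        "rapid speech, racing thoughts, distractibility, and less need for "
        "sleep in adolescents and children.",
        "differential",
    ),
    Criterion(
        "Bipolar disorder",
        "rapid speech, racing thoughts, distractibility, and less need for "
        "sleep in adults.",
        "differential",
    ),
]


def criteria_for(disorder: str, kind: str) -> List[Criterion]:
    """Return every stored expression for one disorder and one section kind."""
    return [c for c in KNOWLEDGE_BASE if c.disorder == disorder and c.kind == kind]


bipolar_differentials = criteria_for("Bipolar disorder", "differential")
```

Storing each expression as an independent record mirrors the goal stated above: every entry can be compared against a client record on its own, without needing the surrounding DSM-5 paragraph.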
C.2 Knowledge Base Evaluation
We conducted a comprehensive evaluation of the knowledge base to ensure the accuracy and com-
pleteness of the symptom descriptions for all entries, with a particular focus on the sections covering
"bipolar and related disorders" and "depressive disorders." The evaluation involved assessing the
correctness of each symptom description and verifying the completeness of the content within these
two critical sections. Any inaccuracies or gaps identified during this process were manually revised
to ensure the highest quality of information.
D Scales
D.1 Scales Definition
We calculated the correlation scores between each question and the total score across the 13 scales
mentioned above, as well as their correlation with the presence of mood disorders, using a training
set of 2,243 cases. The results are presented in Table 7.
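For reference, the Pearson coefficient between a question's scores and the binary mood-disorder label can be computed as in the minimal sketch below. The function name `pearson_r` and the toy data are assumptions for illustration and are not the study data. P-values are not computed here; a library routine such as `scipy.stats.pearsonr` returns both the coefficient and the p-value.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy example (not study data): one question's scores for six cases and the
# corresponding binary label (1 = mood disorder present, 0 = absent).
item_scores = [0, 1, 3, 2, 0, 3]
has_mood_disorder = [0, 0, 1, 1, 0, 1]
r = pearson_r(item_scores, has_mood_disorder)
```

With a binary label, this quantity is the point-biserial correlation, which is numerically identical to the Pearson coefficient.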
Table 7: Pearson Correlation (PC) coefficients and p-values between scale scores and the presence of
mood disorder.
Question ID PC p-value
ctq_Q5 0.0914 1.6e-05
ctq_Q6 0.1797 1.3e-17
ctq_Q7 0.2371 8.0e-30
ctq_Q8 0.4358 7.5e-104
ctq_Q9 0.1150 5.3e-08
ctq_Q10 0.3248 7.6e-56
ctq_Q11 0.2805 1.6e-41
ctq_Q12 0.2726 3.1e-39
ctq_Q13 0.2751 6.2e-40
ctq_Q14 0.3973 4.6e-85
ctq_Q15 0.2841 1.4e-42
ctq_Q16 -0.2860 3.7e-43
ctq_Q17 0.1306 6.3e-10
ctq_Q18 0.3640 1.0e-70
ctq_Q19 0.2959 3.2e-46
ctq_Q20 0.1276 1.5e-09
ctq_Q21 0.0810 1.3e-04
ctq_Q22 -0.2533 6.3e-34
ctq_Q23 0.1246 3.7e-09
ctq_Q24 0.1690 9.9e-16
ctq_Q25 0.3855 8.7e-80
ctq_Q26 0.1815 6.3e-18
ctq_Q27 0.0478 2.4e-02
ctq_Q28 0.3101 8.0e-51
2. DAS
das_Q1 0.2455 6.9e-32
das_Q2 0.0296 1.6e-01
das_Q3 0.2930 2.7e-45
das_Q4 0.3332 8.1e-59
das_Q5 0.2801 2.2e-41
das_Q6 0.1511 7.7e-13
das_Q7 0.3111 4.1e-51
das_Q8 0.3161 8.1e-53
das_Q9 0.3706 2.3e-73
das_Q10 0.3823 2.4e-78
das_Q11 0.3618 8.6e-70
das_Q12 0.1682 1.4e-15
das_Q13 0.2901 2.1e-44
das_Q14 0.3695 6.1e-73
das_Q15 0.3308 6.0e-58
das_Q16 0.3399 2.8e-61
das_Q17 -0.2876 1.2e-43
das_Q18 -0.0604 4.4e-03
das_Q19 0.3015 5.4e-48
das_Q20 0.2773 1.4e-40
das_Q21 0.1974 5.6e-21
das_Q22 0.2228 2.0e-26
das_Q23 0.3144 3.0e-52
das_Q24 0.0429 4.3e-02
das_Q25 -0.0428 4.3e-02
das_Q26 0.3093 1.6e-50
das_Q27 0.3224 5.7e-55
das_Q28 0.2095 1.7e-23
das_Q29 0.1205 1.2e-08
das_Q30 0.0003 9.9e-01
das_Q31 0.3770 4.3e-76
das_Q32 0.3108 4.9e-51
das_Q33 0.2986 4.8e-47
das_Q34 0.2978 8.5e-47
das_Q35 0.1615 1.8e-14
das_Q36 0.2307 2.9e-28
das_Q37 0.2021 6.1e-22
das_Q38 0.2937 1.6e-45
das_Q39 0.2404 1.2e-30
das_Q40 0.2141 1.7e-24
das_total_score 0.4389 2.0e-105
3. GAD-7
gad7_Q1 0.4377 1.1e-104
gad7_Q2 0.4492 8.3e-111
gad7_Q3 0.4393 1.6e-105
gad7_Q4 0.4522 2.0e-112
gad7_Q5 0.3891 3.3e-81
gad7_Q6 0.4661 2.9e-120
gad7_Q7 0.3493 8.9e-65
gad7_total_score 0.5116 1.7e-148
4. HCL-32
hcl32_Q1 0.0179 4.0e-01
hcl32_Q2 -0.1972 5.9e-21
hcl32_Q3 -0.2632 1.3e-36
hcl32_Q4 -0.2292 6.5e-28
hcl32_Q5 -0.1799 1.2e-17
hcl32_Q6 -0.1501 1.1e-12
hcl32_Q7 0.0251 2.4e-01
hcl32_Q8 0.2175 3.1e-25
hcl32_Q9 -0.0151 4.8e-01
hcl32_Q10 -0.2148 1.2e-24
hcl32_Q11 -0.1509 8.2e-13
hcl32_Q12 -0.2202 7.4e-26
hcl32_Q13 -0.2187 1.7e-25
hcl32_Q14 -0.0594 5.0e-03
hcl32_Q15 -0.2313 2.0e-28
hcl32_Q16 0.0177 4.0e-01
hcl32_Q17 -0.0184 3.9e-01
hcl32_Q18 -0.1769 4.2e-17
hcl32_Q19 -0.2283 1.0e-27
hcl32_Q20 -0.1844 1.8e-18
hcl32_Q21 0.1239 4.5e-09
hcl32_Q22 -0.2486 1.0e-32
hcl32_Q23 0.0968 4.7e-06
hcl32_Q24 -0.1985 3.2e-21
hcl32_Q25 0.2063 8.0e-23
hcl32_Q26 0.1971 6.2e-21
hcl32_Q27 0.1479 2.3e-12
hcl32_Q28 -0.2751 6.2e-40
hcl32_Q29 0.0465 2.8e-02
hcl32_Q30 -0.0205 3.3e-01
hcl32_Q31 0.0117 5.8e-01
hcl32_Q32 0.0905 1.9e-05
hcl32_total_score -0.0788 2.0e-04
5. MDQ
mdq_Q1 0.1832 3.0e-18
mdq_Q2 0.2916 7.3e-45
mdq_Q3 -0.1594 3.9e-14
mdq_Q4 0.2980 7.4e-47
mdq_Q5 0.0515 1.5e-02
mdq_Q6 -0.0556 8.8e-03
mdq_Q7 0.2141 1.8e-24
mdq_Q8 -0.1568 1.0e-13
mdq_Q9 -0.1437 9.7e-12
mdq_Q10 0.0857 5.2e-05
mdq_Q11 0.0907 1.8e-05
mdq_Q12 0.2022 5.8e-22
mdq_Q13 0.2920 5.6e-45
mdq_total_score 0.1742 1.3e-16
6. NSSI
nssi_Q1 0.4597 4.0e-116
nssi_Q2 0.3568 1.9e-67
nssi_Q3 0.0221 3.0e-01
nssi_Q4 0.1636 9.5e-15
nssi_Q5 0.4422 1.4e-106
nssi_Q6 0.2471 3.8e-32
nssi_Q7 0.3085 5.4e-50
nssi_Q8 0.3112 6.5e-51
nssi_Q9 0.4045 6.8e-88
nssi_Q10 0.3417 1.3e-61
nssi_Q11 0.2181 3.1e-25
nssi_Q12 0.2392 3.7e-30
nssi_Q13 0.2664 2.8e-37
nssi_Q14 0.2641 1.3e-36
nssi_Q15 0.0729 6.0e-04
nssi_Q16 -0.0825 1.0e-04
nssi_Q17 0.1995 2.7e-21
nssi_Q18 0.0267 2.1e-01
7. PHQ-9
phq9_Q1 0.5008 9.0e-143
phq9_Q2 0.5006 1.2e-142
phq9_Q3 0.4842 3.2e-132
phq9_Q4 0.5097 1.2e-148
phq9_Q5 0.4484 1.9e-111
phq9_Q6 0.4877 1.9e-134
phq9_Q7 0.4578 1.1e-116
phq9_Q8 0.4493 6.6e-112
phq9_Q9 0.5057 5.4e-146
phq9_total_score 0.5920 2.7e-212
8. SHAPS
shaps_Q1 0.1860 8.4e-19
shaps_Q2 0.2637 8.5e-37
shaps_Q3 0.2561 1.0e-34
shaps_Q4 0.2186 1.6e-25
shaps_Q5 0.1925 4.7e-20
shaps_Q6 0.1774 3.1e-17
shaps_Q7 0.2355 1.8e-29
shaps_Q8 0.1430 1.2e-11
shaps_Q9 0.2753 4.7e-40
shaps_Q10 0.1845 1.6e-18
shaps_Q11 0.2522 1.1e-33
shaps_Q12 0.2314 1.7e-28
shaps_Q13 0.1941 2.3e-20
shaps_Q14 0.1587 4.7e-14
shaps_total_score 0.2675 7.4e-38
9. BPRS
bprs_Q1 0.1675 1.4e-15
bprs_Q2 0.4255 2.2e-99
bprs_Q3 0.1035 8.9e-07
bprs_Q4 0.0232 2.7e-01
bprs_Q5 0.4761 2.7e-127
bprs_Q6 0.1560 1.1e-13
bprs_Q7 0.0003 9.9e-01
bprs_Q8 0.1869 4.3e-19
bprs_Q9 0.6200 1.8e-238
bprs_Q10 0.1341 1.8e-10
bprs_Q11 0.3040 3.4e-49
bprs_Q12 0.1088 2.4e-07
bprs_Q13 0.3225 1.8e-55
bprs_Q14 0.0363 8.5e-02
bprs_Q15 0.0224 2.9e-01
bprs_Q16 0.0859 4.6e-05
bprs_Q17 0.2906 6.5e-45
bprs_Q18 -0.0506 1.7e-02
bprs_total_score 0.4724 4.2e-125
10. HAMA
hama_Q1 0.4227 6.2e-98
hama_Q2 0.4177 1.9e-95
hama_Q3 0.2960 1.3e-46
hama_Q4 0.5099 8.7e-149
hama_Q5 0.4987 2.2e-141
hama_Q6 0.6304 7.4e-249
hama_Q7 0.2840 6.9e-43
hama_Q8 0.2837 8.2e-43
hama_Q9 0.3875 2.7e-81
hama_Q10 0.3994 1.0e-86
hama_Q11 0.3641 2.7e-71
hama_Q12 0.1479 1.9e-12
hama_Q13 0.2420 2.8e-31
hama_Q14 0.1529 3.3e-13
hama_total_score 0.5513 1.2e-178
11. HAMD
hamd_Q1 0.6021 8.1e-219
hamd_Q2 0.4958 6.5e-138
hamd_Q3 0.6024 4.1e-219
hamd_Q4 0.5247 4.3e-157
hamd_Q5 0.4385 7.2e-105
hamd_Q6 0.3932 6.9e-83
hamd_Q7 0.6070 2.8e-223
hamd_Q8 0.3943 2.4e-83
hamd_Q9 0.3444 9.0e-63
hamd_Q10 0.4053 2.1e-88
hamd_Q11 0.4846 6.5e-131
hamd_Q12 0.4143 1.1e-92
hamd_Q13 0.4448 3.3e-108
hamd_Q14 0.1405 3.0e-11
hamd_Q15 0.1536 3.6e-13
hamd_Q16 0.1968 8.5e-21
hamd_Q17 0.2599 1.5e-35
hamd_Q18 0.2671 1.5e-37
hamd_Q19 0.3108 7.6e-51
hamd_Q20 0.2943 1.6e-45
hamd_Q21 0.2604 1.1e-35
hamd_Q22 0.5236 2.3e-156
hamd_Q23 0.4595 3.2e-116
hamd_Q24 0.4293 3.9e-100
hamd_total_score 0.6348 2.2e-250
12. MCCB
TMT-A 0.2300 4.6e-28
TMT-B 0.1530 4.2e-13
BACS SC -0.2589 2.3e-35
HVLT-R.1 -0.0672 1.5e-03
HVLT-R.2 0.0157 4.6e-01
HVLT-R.3 0.0536 1.1e-02
WMS-III SS -0.1140 7.1e-08
NAB Mazes -0.1402 3.1e-11
BVMT-R.1 -0.0217 3.1e-01
BVMT-R.2 -0.0255 2.3e-01
BVMT-R.3 -0.0510 1.6e-02
Fluency -0.0489 2.1e-02
MSCEIT ME -0.2084 3.2e-23
CPT-IP.1 -0.2915 8.8e-45
CPT-IP.2 -0.3012 8.2e-48
CPT-IP.3 -0.2747 9.2e-40
13. YMRS
ymrs_Q1 0.2262 1.1e-20
ymrs_Q2 0.2100 5.3e-18
ymrs_Q3 0.0520 3.4e-02
ymrs_Q4 0.1925 2.6e-15
ymrs_Q5 0.3067 1.9e-37
ymrs_Q6 0.2548 5.3e-26
ymrs_Q7 0.1239 4.1e-07
ymrs_Q8 0.1106 6.4e-06
ymrs_Q9 0.1501 8.0e-10
ymrs_Q10 -0.0730 2.9e-03
ymrs_Q11 0.0900 2.4e-04
ymrs_total_score 0.3196 1.0e-40
To ensure the selection of highly relevant questions, we identified the top 5% of questions with the
highest correlation coefficients. The distribution of correlation scores and the selection threshold
are illustrated in Figure 2. Question IDs with correlation scores above the threshold are highlighted
in purple, and their corresponding correlation scores are bolded for emphasis. Upon reviewing the
original questions selected, we observed that they naturally clustered into five distinct categories:
five questions related to depressive mood, two related to suicidal tendencies, three related to loss of
interest, two related to anxiety, and two related to insomnia.
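The top-5% selection can be sketched as follows. This is a simplified illustration under stated assumptions: questions are ranked by the raw (signed) coefficient, and the cutoff returned is simply the lowest selected score. The reported threshold of 0.5032 does not coincide with any listed score, so it may have been obtained differently, for example as an interpolated percentile. The function name `select_top_fraction` is hypothetical, and the example uses a handful of values from Table 7 with a larger fraction so that the result is easy to check by hand.

```python
import math
from typing import Dict, List, Tuple


def select_top_fraction(
    scores: Dict[str, float], fraction: float = 0.05
) -> Tuple[List[str], float]:
    """Return the IDs in the top `fraction` by score, plus the lowest selected score."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # The small tolerance guards against floating-point error, e.g.
    # 0.05 * 280 evaluating to slightly more than 14.
    k = max(1, math.ceil(fraction * len(ranked) - 1e-9))
    selected = ranked[:k]
    cutoff = selected[-1][1]
    return [name for name, _ in selected], cutoff


# A few coefficients taken from Table 7; fraction=0.5 keeps the top three.
example_scores = {
    "hamd_total_score": 0.6348,
    "phq9_total_score": 0.5920,
    "phq9_Q9": 0.5057,
    "phq9_Q1": 0.5008,
    "das_Q2": 0.0296,
    "hcl32_Q3": -0.2632,
}
selected_ids, cutoff = select_top_fraction(example_scores, fraction=0.5)
```

Ranking by the signed coefficient means that strongly negative correlations are never selected. Ranking by absolute value would be the alternative if inverse predictors were also of interest.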
Figure 2: Distribution of question correlation scores; the dashed line marks the selection threshold (0.5032).
Notably, for questions related to depressive mood and loss of interest, we observed variations in
how similar questions were phrased across different scales. To ensure robustness, we consulted
our domain expert co-author and decided to include two additional questions from the self-reported
PHQ-9 depression scale: phq9_Q2 (depressed mood over the past two weeks) and phq9_Q1 (loss of
interest over the past two weeks). We highlight these two question IDs in teal in Table 7. Although
their correlation scores (0.5006 and 0.5008, respectively) were slightly below the threshold of 0.5032,
their inclusion provides valuable insights and enhances the diagnostic framework. This decision
underscores the importance of multi-scale assessment in minimizing misjudgments of clients’ mental
states, reducing the impact of short-term emotional fluctuations, and mitigating potential inaccuracies
due to inconsistent responses. By repeating similar questions across self-reported and clinician-rated
scales, we can achieve a more comprehensive and reliable evaluation. The detailed content and
options of selected questions are listed in Table 8.
Table 8: Question content and options of selected scale questions.
Question ID Correlation Question Content Options
Depression-related Performances
hamd_total_score 0.6348 The total score of Hamilton Depression 0-6 = no depression, 7-16 = may have depression, 17-23 =
Rating Scale (HAMD). must have depression, 24-76 = severe depression
hama_Q6 0.6304 Depressed mood. Loss of interest, lack 0 = Not present, 1 = Mild, 2 = Moderate, 3 = Severe, 4 =
of pleasure in hobbies, depression, early Very severe.
waking, diurnal swing.
bprs_Q9 0.6200 DEPRESSIVE MOOD. Despondency in 0 = not assessed, 1 = not present, 2 = very mild, 3 = mild, 4
mood, sadness. Rate only degree of de- = moderate, 5 = moderately severe, 6 = severe, 7 = extremely
spondency; do not rate on the basis of severe
inferences concerning depression based
upon general retardation and somatic com-
plaints.
hamd_Q1 0.6021 DEPRESSED MOOD (sadness, hopeless, 0 = Absent. 1 = These feeling states indicated only on
helpless, worthless) questioning. 2 = These feeling states spontaneously reported
verbally. 3 = Communicates feeling states non-verbally, i.e.
through facial expression, posture, voice and tendency to
weep. 4 = Patient reports virtually only these feeling states in
his/her spontaneous verbal and non-verbal communication.
phq9_total_score 0.592 The total score of Patient Health 0-4 = minimal depression, 5-9 = mild depression, 10-14 =
Questionnaire-9 (PHQ-9). moderate depression, 15-19 = moderately severe depression,
20-27 = severe depression
phq9_Q2 0.5006 Over the last 2 weeks, how often have 0 = Not at all, 1= Several days, 2 = More than half the days,
you been bothered by any of the following 3 = Nearly every day
problems? Feeling down, depressed, or
hopeless.
Suicide-related Performances
hamd_Q3 0.6024 SUICIDE 0 = Absent. 1 = Feels life is not worth living. 2 = Wishes
he/she were dead or any thoughts of possible death to self. 3
= Ideas or gestures of suicide. 4 = Attempts at suicide (any
serious attempt rate 4).
phq9_Q9 0.5007 Over the last 2 weeks, how often have 0 = Not at all, 1= Several days, 2 = More than half the days,
you been bothered by any of the following 3 = Nearly every day
problems? Thoughts that you would be
better off dead, or of hurting yourself.
Energy&Interest-related Performances
hamd_Q7 | 0.607 | Work and Activities | 0 = No difficulty, 1 = Thoughts and feelings of incapacity, fatigue, or weakness related to activities, work, or hobbies, only reported when asked, 2 = Spontaneously reports loss of interest in activities, work, or hobbies, either directly or indirectly, such as feeling listless, indecisive, or needing to push themselves to work or engage in activities, 3 = Decrease in actual time spent in activities or decrease in productivity; in a hospital setting, rate 3 if the patient does not spend at least three hours a day in activities, exclusive of ward chores, 4 = Stopped working due to the current illness; in a hospital setting, rate 4 if the patient engages in no activities except ward chores or if the patient fails to perform ward chores unassisted.
hamd_Q22 | 0.5236 | Feelings of Inadequacy or Reduced Ability | 0 = Absent, 1 = Subjective feelings of inadequacy only elicited on questioning, 2 = Patient spontaneously reports feelings of inadequacy, 3 = Needs encouragement, guidance, and reassurance to complete daily tasks or personal hygiene, 4 = Requires assistance from others for dressing, grooming, eating, making the bed, or personal hygiene.
phq9_Q4 | 0.5097 | Over the last 2 weeks, how often have you been bothered by any of the following problems? Feeling tired or having little energy. | 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day
phq9_Q1 | 0.5008 | Over the last 2 weeks, how often have you been bothered by any of the following problems? Little interest or pleasure in doing things. | 0 = Not at all, 1 = Several days, 2 = More than half the days, 3 = Nearly every day
Anxiety-related Performances
hama_total_score | 0.5513 | The total score of Hamilton Anxiety Rating Scale (HAM-A). | 0-6 = no anxiety, 7-13 = may have anxiety, 14-20 = must have anxiety, 21-28 = must have obvious anxiety, 29-56 = severe anxiety
gad7_total | 0.5185 | The total score of Generalized Anxiety Disorder-7 (GAD-7). | 0-4 = no anxiety, 5-9 = mild anxiety, 10-14 = moderate anxiety, 15-21 = severe anxiety
Insomnia-related Performances
hamd_Q4 | 0.5247 | INSOMNIA: EARLY IN THE NIGHT | 0 = No difficulty falling asleep. 1 = Complains of occasional difficulty falling asleep, i.e. more than half an hour. 2 = Complaints of nightly difficulty falling asleep.
hama_Q4 | 0.5099 | Insomnia. Difficulty in falling asleep, broken sleep, unsatisfying sleep and fatigue on waking, dreams, nightmares, night terrors. | 0 = Not present, 1 = Mild, 2 = Moderate, 3 = Severe, 4 = Very severe.
E Clinical Data Processing
E.1 Medical Record Processing
To ensure patient privacy, the medical records presented in this section have been anonymized, with
all event details adapted while preserving the patient’s actual symptoms. The original medical record
(adapted) and the corresponding processing steps are illustrated in Figure 3.
Effective psychiatric diagnosis relies on comprehensive and well-structured patient information.
However, raw medical records often contain scattered, redundant, or sensitive details that can hinder
accurate analysis. To address these challenges, we designed a systematic medical record processing
pipeline that ensures data security, enhances temporal reasoning, and improves symptom extraction for
LLM-based agents. This pipeline consists of four key steps: extracting essential diagnostic elements,
refining present illness descriptions to prevent data leakage, integrating structured information for
coherence, and an optional step of reorganizing records into a structured format for precise symptom
matching.
Step 1. Raw Data Extraction. To construct a comprehensive and secure data store, we first obtained
anonymized client information from the hospital database. To capture the full context of clients’
backgrounds, which are critical for accurate diagnosis, we extracted the following key elements from
the medical records: (1) Gender: Essential for identifying gender-specific symptoms (e.g., menstrual-
related mood changes). (2) Age: Some disorders manifest at specific life stages, such as adolescence.
(3) Occupation: Helps differentiate disorder-related symptoms from external factors (e.g., shift work
causing sleep issues). (4) Visit Date: Provides a reference for interpreting time-sensitive information.
(5) Chief Complaint: A concise summary of primary symptoms. (6) Present Illness: A detailed
symptom history, including medication usage.
Step 2. Present Illness Processing. To prevent data leakage in agent diagnosis, we removed explicit
disorder labels from the present illness section. Additionally, absolute dates were converted into
relative timeframes based on the visit date (e.g., “9 months ago” instead of a specific date), enhancing
data privacy while facilitating temporal reasoning for symptom progression analysis.
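The date conversion in Step 2 can be sketched as follows. This is a minimal illustration, not the authors' procedure (which uses GPT-4o prompts followed by manual revision); the function name `to_relative` and the granularity rules (years, months, weeks, days) are assumptions made for the example.

```python
from datetime import date


def to_relative(event: date, visit: date) -> str:
    """Express an event date relative to the visit date (hypothetical helper)."""
    days = (visit - event).days
    if days < 0:
        raise ValueError("event date is after the visit date")
    if days == 0:
        return "on the day of the visit"

    # Count whole calendar months elapsed between the two dates.
    months = (visit.year - event.year) * 12 + (visit.month - event.month)
    if visit.day < event.day:
        months -= 1

    def plural(n: int, unit: str) -> str:
        return f"{n} {unit}{'s' if n != 1 else ''} ago"

    if months >= 12:
        return plural(months // 12, "year")
    if months >= 1:
        return plural(months, "month")
    if days >= 7:
        return plural(days // 7, "week")
    return plural(days, "day")


# Illustrative dates only; they do not come from any real record.
print(to_relative(date(2023, 6, 15), date(2024, 3, 20)))  # 9 months ago
```

Coarse units are used on purpose: a relative expression keeps the ordering of events while making it harder to recover the exact date.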
Step 3. Key Elements Combination. To generate a coherent input for LLM-based analysis, we
integrate the structured elements from Step 1 with the processed present illness data from Step 2,
ensuring comprehensive contextual understanding.
Step 4. Medical Records Structurizing. Symptoms relevant to diagnosis are often dispersed across
different parts of medical records. For example, "suspecting others and perceiving malicious intent"
and "engaging in defensive behaviors" may appear separately, making direct symptom matching
difficult. To address this, we reorganize medical records into a structured format, presenting key
symptoms in JSON. We will compare whether coherent text or structured representation is more
effective for symptom matching and analysis in the ablation study (Section 4.3.1).
Data processing steps are conducted using GPT-4o prompts followed by manual revision.
Diagnostic Criteria Extraction. To facilitate comparison between diagnostic criteria and medical
records, please summarize the criteria for {disease name} from the DSM-5 diagnostic manual in
a point-by-point format. Each point should fully describe a symptom, explicitly include reference
information where necessary, and retain age or other relevant restrictions.
Below is the description of the diagnostic criteria for {disease name} from the DSM-5: {the original
criteria}.
Differential Criteria Extraction. Please split the following differential criteria into two separate di-
agnostic criteria. Each point should fully describe a symptom, explicitly include reference information
where necessary, and retain age or other relevant restrictions.
Below are the differential criteria to be split: {the original criteria}.
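The prompts above contain placeholders in curly braces. A minimal sketch of how such a template might be filled is given below; the helper `fill_prompt` and the shortened template string are hypothetical and only illustrate the substitution, not the authors' tooling.

```python
import re


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace each {placeholder} in the template with its value (hypothetical helper)."""
    filled = template
    for key, value in values.items():
        filled = filled.replace("{" + key + "}", value)
    # Fail loudly if a placeholder was left unfilled.
    # Note: this check also fires if a substituted value itself contains braces.
    leftover = re.findall(r"\{[^{}]*\}", filled)
    if leftover:
        raise KeyError(f"unfilled placeholders: {leftover}")
    return filled


# Shortened template for illustration; the full prompt text is given above.
template = (
    "Below is the description of the diagnostic criteria for {disease name} "
    "from the DSM-5: {the original criteria}."
)
prompt = fill_prompt(
    template,
    {"disease name": "major depressive disorder", "the original criteria": "<criteria text>"},
)
print(prompt)
```

Plain string replacement is used instead of `str.format` because the placeholder names contain spaces.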
Medical Record Processing.
Step 1 → Step 2: Replace specific dates in the present illness history with relative time expressions
based on the patient’s admission date. Avoid using exact dates after the replacement.
Figure 3: An example of the original medical record (adapted) and its corresponding processing
steps. Panels: Step 1. Raw Data Extraction; Step 2. Present Illness Processing.
Step 2 → Step 3: Organize the patient’s information into a single, coherent paragraph without altering
the content of the present illness section.
Step 3 → Step 4: Structure the medical record by extracting and summarizing the patient’s symptoms
and background. List each item separately, using "-" as a delimiter. Respond only in JSON format.
E.3 Scale Performance Description
An example of how we transform scale performances into a textual description is shown in Figure 4,
and a full example is shown in Figure 5.
"phq9_Q2_score": 2
Question Content: Over the past two weeks, how often have you been bothered by any of the following
problems? Feeling down, depressed, or hopeless.
0 = Not at all
1 = Several days
2 = More than half the days
3 = Nearly every day
Figure 4: An example of generating scale performance description from the score value of a relevant
question.
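The mapping from a score value to a textual description can be sketched as below. The function `describe_phq9_item`, its parameters, and the exact sentence template are assumptions for illustration; the PHQ-9 response options and the correlation value are those listed in this appendix.

```python
# PHQ-9 response options, as listed in Figure 4.
PHQ9_OPTIONS = {
    0: "not at all",
    1: "several days",
    2: "more than half the days",
    3: "nearly every day",
}


def describe_phq9_item(question_no: int, symptom: str, score: int, correlation: float) -> str:
    """Turn one PHQ-9 item score into a sentence (hypothetical template)."""
    if score not in PHQ9_OPTIONS:
        raise ValueError(f"PHQ-9 item scores range from 0 to 3, got {score}")
    return (
        f"[self-reported] In question {question_no} of the PHQ-9 questionnaire, "
        f"the client self-reported that over the past two weeks, they were "
        f"bothered by {symptom} {PHQ9_OPTIONS[score]}. The correlation score "
        f"between this question and the presence of mood disorder is {correlation}."
    )


print(describe_phq9_item(2, "feeling down, depressed, or hopeless", 2, 0.5006))
```

Keeping the option text in a lookup table means the same function covers every PHQ-9 item; clinician-rated scales would need their own option tables.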
Scale Performance Descriptions
The following are the performance of highly relevant questions in the [self-reported] and [clinician-evaluated] scales, as well as the Pearson correlation coefficient between this question and the presence of mood disorders in a statistical sense:
Depression-related Performances:
"hamd_total_score": 15
[clinician-evaluated] In the Hamilton Depression Rating Scale (HAMD) filled out by the clinician during the consultation, the client scored 15 points (out of 76 points), indicating that the client may have depressive symptoms. The correlation score between this and the presence of mood disorder is 0.6348.
"hama_Q6_score": 1
[clinician-evaluated] In the sixth question of the HAMA questionnaire, the clinician assessed the client's depressive mood (Loss of interest, lack of pleasure in hobbies, depression, early waking, diurnal swing.) as mild. The correlation score between this question and the presence of mood disorder is 0.6304.
"bprs_Q9_score": 1
[clinician-evaluated] In the ninth question of the BPRS scale, the clinician assessed the client's DEPRESSIVE MOOD (Despondency in mood, sadness. Rate only degree of despondency) as not present. The correlation score between this question and the presence of mood disorder is 0.62.
"hamd_Q1_score": 2
[clinician-evaluated] In the first question of the HAMD questionnaire, the clinician assessed the client's depressive mood as: These feeling states spontaneously reported verbally. The correlation score between this question and the presence of mood disorder is 0.6021.
"phq9_total_score": 8
[self-reported] In the self-rated Patient Health Questionnaire-9 (PHQ-9), the client scored 8 points (out of a total of 27 points), indicating that the client may have mild depression. The correlation score between this and the presence of mood disorder is 0.592.
"phq9_Q2_score": 2
[self-reported] In the second question of the PHQ-9 questionnaire, the client self-reported that over the past two weeks, they feel sad, depressed, or hopeless more than half the days. The correlation score between this question and the presence of mood disorder is 0.5006.
Suicide-related Performances:
"hamd_Q3_score": 0
[clinician-evaluated] In the third question of the HAMD questionnaire, the clinician assessed the client's suicidal tendencies as: none. The correlation score between this question and the presence of mood disorder is 0.6024.
"phq9_Q9_score": 0
[self-reported] In the ninth question of the PHQ-9 questionnaire, the client self-reported that over the past two weeks, they did not at all think that death or harming themselves in some way was a solution. The correlation score between this question and the presence of mood disorder is 0.5057.
Energy&Interest-related Performances:
"hamd_Q7_score": 1
[clinician-evaluated] In the 7th question of the HAMD questionnaire, the clinician rated the client's work and interests as: Thoughts and feelings of incapacity, fatigue, or weakness related to activities, work, or hobbies, only reported when asked. The correlation score between this question and the presence of mood disorder is 0.607.
"hamd_Q22_score": 2
[clinician-evaluated] In the 22nd question of the HAMD questionnaire, the clinician rated the client's Feelings of Inadequacy or Reduced Ability as: patient spontaneously reports feelings of inadequacy. The correlation score between this question and the presence of mood disorder is 0.5236.
"phq9_Q4_score": 2
[self-reported] In the 4th question of the PHQ-9 questionnaire, the client self-reported that over the past two weeks, they feel tired or have little energy more than half the days. The correlation score between this question and the presence of mood disorder is 0.5097.
"phq9_Q1_score": 2
[self-reported] In the 1st question of the PHQ-9 questionnaire, the client self-reported that over the past two weeks, they have little interest or pleasure in doing things more than half the days. The correlation score between this question and the presence of mood disorder is 0.5008.
Anxiety-related Performances:
"hama_total_score": 11
[clinician-evaluated] In the Hamilton Anxiety Rating Scale (HAMA) filled out by the doctor during the consultation, the client scored 11 points (out of a total of 56 points), indicating that the client may have anxiety. The correlation score between this question and the presence of mood disorder is 0.5513.
"gad7_total_score": 8
[self-reported] In the self-assessed Generalized Anxiety Disorder-7 (GAD-7), the client scored 8 points (out of 21 points), indicating that the client has mild anxiety. The correlation score between this question and the presence of mood disorder is 0.5185.
Insomnia-related Performances:
"hamd_Q4_score": 0
[clinician-evaluated] In the fourth question of the HAMD questionnaire, the clinician assessed the client's INSOMNIA: EARLY IN THE NIGHT as: No difficulty falling asleep. The correlation score between this question and the presence of mood disorder is 0.5247.
"hama_Q4_score": 1
[clinician-evaluated] In the fourth question of the HAMA questionnaire, the clinician assessed the client's Insomnia (Difficulty in falling asleep, broken sleep, unsatisfying sleep and fatigue on waking, dreams, nightmares, night terrors) as mild. The correlation score between this question and the presence of mood disorder is 0.5099.
Figure 5: A full example of scale performance descriptions derived from the score values of mood
disorder-related questions.
F Synthetic Data
We synthesize our dataset using a structured pipeline that begins with data preparation, followed by
model training, and concludes with rigorous post-processing. First, we construct the input data by
combining 16 mood disorder-related questions from Table 8, 8 baseline assessment scale total scores,
and binary mood disorder labels (1 for presence, 0 for absence), resulting in 25 features per case. This
tabular format serves as the foundation for our synthesis process using the TabSyn framework [20].
The training phase consists of two key components. We first pretrain a VAE to encode the tabular
data into a continuous latent space, employing column-wise tokenizers and Transformer-based
encoders/decoders to handle mixed data types. The model optimizes an adaptively weighted ELBO
loss, dynamically balancing reconstruction accuracy against KL divergence regularization to preserve
inter-column dependencies. Subsequently, we train a diffusion model in this latent space, where the
forward process gradually adds Gaussian noise following a linear schedule, while the reverse process
learns to denoise samples through a score-based SDE.
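The forward noising step can be illustrated with a short sketch. It assumes that the "linear schedule" means the noise level grows linearly with time, sigma(t) = t, so that a noised latent is z_t = z_0 + sigma(t) * eps with eps drawn from a standard Gaussian; the function names are hypothetical and this is not the TabSyn implementation.

```python
import random


def noise_level(t: float) -> float:
    """Assumed linear noise schedule: sigma(t) = t."""
    return t


def forward_sample(z0: list[float], t: float, rng: random.Random) -> list[float]:
    """Draw z_t = z_0 + sigma(t) * eps, with eps ~ N(0, 1) per coordinate."""
    sigma = noise_level(t)
    return [z + sigma * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]  # illustrative latent vector, not real data
for t in (0.0, 1.0, 10.0):
    print(t, forward_sample(z0, t, rng))
```

At t = 0 the latent is returned unchanged; as t grows, the sample is dominated by Gaussian noise, which is the prior the reverse process starts from.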
Our synthetic data generation initially follows a 1:1 ratio with the original dataset’s predefined splits
(for both retrieval and test sets). The generation process involves iterative denoising of Gaussian
priors to produce latent vectors, which are decoded into tabular form using the VAE’s detokenizer,
applying linear inverse transformations for numerical features and Softmax sampling for categorical
variables. We then apply rigorous post-processing, including value rounding, logical consistency checks
(e.g., ensuring question-level sums never exceed scale totals), and removal of illogical cases.
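A minimal sketch of such post-processing is shown below. The function `postprocess`, the range check against maximum scores, and the sample rows are assumptions for illustration; only the rounding step and the sum-versus-total check are named in the text.

```python
def postprocess(rows, totals, max_scores):
    """Round values, then drop rows that fail range or sum-vs-total checks.

    rows:       list of dicts mapping column name to a (possibly fractional) value
    totals:     dict mapping a total-score column to its item columns
    max_scores: dict mapping a column to its maximum valid score (assumed check)
    """
    cleaned = []
    for row in rows:
        # Python's round() rounds exact halves to the nearest even integer.
        rounded = {name: int(round(value)) for name, value in row.items()}
        in_range = all(
            0 <= rounded[name] <= max_scores[name]
            for name in rounded
            if name in max_scores
        )
        sums_ok = all(
            sum(rounded[item] for item in items) <= rounded[total]
            for total, items in totals.items()
        )
        if in_range and sums_ok:
            cleaned.append(rounded)
    return cleaned


items = ["PHQ9 Q1 Score", "PHQ9 Q2 Score", "PHQ9 Q4 Score", "PHQ9 Q9 Score"]
totals = {"PHQ9 Total Score": items}
max_scores = {**{name: 3 for name in items}, "PHQ9 Total Score": 27}

# Made-up generator outputs, used only to exercise the checks.
rows = [
    dict(zip(items + ["PHQ9 Total Score"], [2.7, 2.1, 1.8, 0.2, 13.6])),  # kept
    dict(zip(items + ["PHQ9 Total Score"], [3.0, 3.0, 3.0, 3.0, 5.0])),   # items exceed total
    dict(zip(items + ["PHQ9 Total Score"], [3.8, 1.0, 1.0, 0.0, 10.0])),  # item out of range
]
print(postprocess(rows, totals, max_scores))
```

Because only a subset of each scale's items is retained, the check is an inequality (item sum at most the total) rather than an equality.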
An example of our synthetic data is shown below.
{
  "HAMA Q4 Score": 3,
  "HAMA Q6 Score": 3,
  "HAMA Total Score": 28,
  "GAD7 Total Score": 18,
  "PHQ9 Q1 Score": 3,
  "PHQ9 Q2 Score": 2,
  "PHQ9 Q4 Score": 2,
  "PHQ9 Q9 Score": 0,
  "PHQ9 Total Score": 14,
  "HAMD Q1 Score": 2,
  "HAMD Q3 Score": 1,
  "HAMD Q4 Score": 1,
  "HAMD Q7 Score": 2,
  "HAMD Q22 Score": 1,
  "HAMD Total Score": 28,
  "BPRS Q9 Score": 3,
  "PSQI Total Score": 11,
  "SHAPS Total Score": 29,
  "HCL32 Total Score": 18,
  "DAS Total Score": 122,
  "SSRS Total Score": 40,
  "MDQ Total Score": 11,
  "BPRS Total Score": 34,
  "YMRS Total Score": 5,
  "Mood Disorder": 1
}
Building on the evaluation framework of Zhang et al. [20], we conduct a comprehensive evaluation
of our synthetic data across five critical dimensions: statistical density, data quality, machine learning
efficacy, privacy preservation, and logistic detectability. The results demonstrate that our synthetic
dataset achieves exceptional fidelity, accurately reproducing both the core statistical patterns and
complex relationships present in the original data while maintaining strong utility for machine
learning applications. Notably, our evaluation shows the synthetic data achieves: (1) high-density
preservation of univariate and multivariate distributions, (2) robust performance in downstream ML
tasks, and (3) near-indistinguishability from real data according to rigorous detection tests. These
collective findings position our synthetic data as a reliable surrogate for the original dataset across
analytical and modeling use cases.
F.2.1 Density Evaluation
The quality of synthetic data hinges on its ability to accurately replicate both individual feature
distributions and the complex relationships between variables in real-world data. The metrics, derived
from Shape, Trend, Coverage, and overall Density score, collectively measure the fidelity and utility
of synthetic data for downstream applications.
Our synthetic dataset demonstrates remarkable fidelity in this regard, achieving an excellent overall
density score of 0.86, well above the 0.8 threshold considered indicative of high-quality synthetic
data. The component scores reveal particularly strong performance in capturing feature relationships,
with Column Pair Trends reaching 0.93, while maintaining solid 0.79 fidelity in individual Column
Shapes. These results demonstrate that our synthesis process successfully preserves the nuanced
statistical patterns of the original data, with especially robust preservation of the critical multivariate
relationships that are often most challenging to replicate.
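The reported numbers are consistent with the overall density score being the unweighted mean of the two component scores. The text does not state the aggregation rule, so the averaging below is an assumption, and `overall_density` is a hypothetical helper.

```python
def overall_density(column_shapes: float, column_pair_trends: float) -> float:
    """Assumed aggregation: unweighted mean of the two component scores."""
    return (column_shapes + column_pair_trends) / 2


score = overall_density(0.79, 0.93)
print(round(score, 2))  # 0.86
```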
Metric descriptions and our detailed scores are illustrated in the following parts.
Shape (Column Distribution Shape). The Shape metric evaluates how well synthetic data replicates
individual feature distributions from the real data using two complementary measures: KSComplement
compares cumulative distribution functions to assess overall shape alignment (including
normality and skewness), while TVComplement evaluates precise probability mass matching for
categorical data. Higher scores (closer to 1.0) indicate better preservation of the original data’s
statistical properties. As shown in Table 9, our synthetic data demonstrates strong distributional
fidelity across all columns.
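The two Shape measures can be sketched in plain Python under their usual definitions: KSComplement as one minus the Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), and TVComplement as one minus the total variation distance between category frequencies. This is an illustrative re-implementation with made-up inputs, not the code used to produce Table 9.

```python
from collections import Counter


def ks_complement(real, synth):
    """1 - max |ECDF_real(x) - ECDF_synth(x)| over all observed values x."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(real) | set(synth))
    ks_stat = max(abs(ecdf(real, x) - ecdf(synth, x)) for x in points)
    return 1.0 - ks_stat


def tv_complement(real, synth):
    """1 - 0.5 * sum over categories of |freq_real - freq_synth|."""
    real_counts, synth_counts = Counter(real), Counter(synth)
    categories = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synth))
        for c in categories
    )
    return 1.0 - tvd


# Toy columns for illustration only.
print(ks_complement([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0
print(tv_complement([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```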
Trend (Column Pair Trends). This metric evaluates how faithfully synthetic data reproduces
the statistical relationships between variables. For numerical features (0-23), we assess linear
and nonlinear correlations (CorrelationSimilarity), examining whether directional patterns (e.g.,
positive/negative associations) are preserved. For categorical feature 24, we measure dependency
preservation through joint probability distributions (ContingencySimilarity). High scores (closer to
1.0) indicate the synthetic data successfully maintains the original data’s multivariate patterns, as
demonstrated in Figure 6.
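A minimal sketch of the two Trend measures follows. It assumes CorrelationSimilarity is computed as 1 - |r_real - r_synth| / 2 using the Pearson coefficient, and ContingencySimilarity as one minus the total variation distance between joint category frequencies; the inputs are toy values, not the study data.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_similarity(real_xy, synth_xy):
    """1 - |r_real - r_synth| / 2, so identical correlations score 1."""
    return 1.0 - abs(pearson(*real_xy) - pearson(*synth_xy)) / 2


def contingency_similarity(real_pairs, synth_pairs):
    """1 - total variation distance between joint frequencies of value pairs."""
    real_counts, synth_counts = Counter(real_pairs), Counter(synth_pairs)
    keys = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[k] / len(real_pairs) - synth_counts[k] / len(synth_pairs))
        for k in keys
    )
    return 1.0 - tvd


real = ([1, 2, 3, 4], [2, 4, 5, 9])
print(correlation_similarity(real, real))  # 1.0
```

Dividing the correlation gap by 2 maps the largest possible difference (+1 versus -1) to a score of 0.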
Coverage. The Coverage metric evaluates how comprehensively synthetic data captures the full
spectrum of variability in real data. For numerical features, RangeCoverage assesses whether extreme
values and distribution tails are properly reproduced, while CategoryCoverage verifies the faithful
representation of categorical frequencies and combinations. High scores (closer to 1.0) indicate the
synthetic data successfully encompasses the complete range of real-world scenarios present in the
original dataset, as shown in Table 10.
Table 10: Score of column distribution. RC is short for RangeCoverage; CC is short for CategoryCoverage.
Column  0     1     2     3     4     5     6     7     8     9     10    11    12
Metric  RC    RC    RC    RC    RC    RC    RC    RC    RC    RC    RC    RC    RC
Score   1.00  1.00  0.98  1.00  1.00  1.00  1.00  1.00  0.96  0.75  1.00  1.00  0.75
Column  13    14    15    16    17    18    19    20    21    22    23    24
Metric  RC    RC    RC    RC    RC    RC    RC    RC    RC    RC    RC    CC
Score   1.00  0.78  0.83  0.95  0.95  0.94  0.98  1.00  0.86  0.90  0.90  1.00
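The two Coverage measures can be sketched as follows, assuming RangeCoverage penalizes the fraction of the real value range that the synthetic column fails to reach at either end, and CategoryCoverage is the share of real categories that appear in the synthetic column. The inputs are toy values and the code is an illustration, not the evaluation script.

```python
def range_coverage(real, synth):
    """1 minus the uncovered fraction of the real range at both ends, floored at 0."""
    real_min, real_max = min(real), max(real)
    span = real_max - real_min
    if span == 0:
        raise ValueError("real column is constant; range coverage is undefined")
    low_gap = max((min(synth) - real_min) / span, 0.0)
    high_gap = max((real_max - max(synth)) / span, 0.0)
    return max(1.0 - (low_gap + high_gap), 0.0)


def category_coverage(real, synth):
    """Fraction of categories in the real column that also occur in the synthetic one."""
    real_categories = set(real)
    return len(real_categories & set(synth)) / len(real_categories)


# Toy columns for illustration only.
print(range_coverage([0, 1, 2, 3, 4], [0, 1, 2, 3]))  # 0.75
print(category_coverage([0, 1, 0, 1], [1, 1, 1, 1]))  # 0.5
```

In the first call the synthetic column never reaches the top value of a 0-4 item, so one quarter of the real range is uncovered.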
Score of Pair Trends between D