Paper Review

DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration
Summary:
• DrugAgent is a multi-agent framework designed to automate machine learning (ML) programming tasks in drug discovery. The framework comprises two primary components (sketched below):
• The LLM Planner comes up with different solution ideas and improves them based on results.
• The LLM Instructor turns those ideas into working code using drug-specific knowledge.
Key Findings:
• ADMET prediction: DrugAgent successfully automated the ML pipeline for predicting absorption using the PAMPA dataset, achieving an F1 score of 0.92.
• Drug-target interaction: in DTI tasks, DrugAgent outperformed the ReAct baseline with a 4.92% relative improvement in ROC-AUC.
Limitations:
• Depends heavily on the accuracy of the underlying language models.
• May lack deep domain knowledge in complex biomedical areas.
• Tested on a limited set of drug discovery tasks only.
• Not yet validated in real-world pharma workflows.
• Computationally expensive due to multiple iterations.
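
As a reading aid, here is a minimal Python sketch of the planner/instructor loop summarized above. The `call_llm` and `evaluate_pipeline` functions, the prompts, and the round count are placeholders, not DrugAgent's actual implementation.

```python
# Minimal sketch of a planner/instructor loop in the spirit of DrugAgent.
# `call_llm` and `evaluate_pipeline` are hypothetical stand-ins; the prompts
# and scoring are illustrative, not the paper's code.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError("plug in your own LLM client here")

def evaluate_pipeline(code: str) -> float:
    """Placeholder: run the generated ML pipeline and return a validation F1."""
    raise NotImplementedError

def drug_agent_loop(task: str, rounds: int = 3) -> str:
    best_code, best_score = "", float("-inf")
    feedback = "none yet"
    for _ in range(rounds):
        # Planner: propose/refine a solution idea given past feedback.
        idea = call_llm(f"Task: {task}\nPrevious feedback: {feedback}\n"
                        "Propose one concrete ML pipeline idea.")
        # Instructor: turn the idea into runnable, domain-aware code.
        code = call_llm(f"Write Python code implementing this idea for drug "
                        f"discovery (e.g., featurize SMILES, train a model):\n{idea}")
        score = evaluate_pipeline(code)
        if score > best_score:
            best_code, best_score = code, score
        feedback = f"last idea scored F1={score:.3f}"
    return best_code
```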

RAG-Enhanced Collaborative LLM Agents for Drug Discovery
Summary:
• Introduces CLADD, a system of collaborative LLM agents enhanced with Retrieval-Augmented Generation (RAG).
• Agents dynamically retrieve biomedical knowledge and coordinate to perform tasks like molecule property prediction and target identification.
• Avoids domain-specific fine-tuning by relying on real-time retrieval and inter-agent collaboration.
Key Findings:
• CLADD effectively handled drug discovery tasks using general-purpose LLMs.
• Showed promising results without the need for specialized training.
• Demonstrated the feasibility of RAG-enhanced agent collaboration for scientific applications.
Limitations:
• Difficulty in integrating varied data types (molecular, protein, disease data).
• Retrieval errors can lead to incorrect outputs.

LLM-Assisted Drug Discovery
Summary:
• Conducts a qualitative analysis of the role of Large Language Models (LLMs) in drug discovery.
• Examines how LLMs process biomedical literature, clinical trial data, and molecular databases to identify new therapeutic targets, predict drug efficacy, and streamline development workflows.
• Leverages case studies and existing research to argue LLMs' potential, rather than implementing or evaluating a new model.
Key Findings:
• LLMs can accelerate target identification and drug efficacy prediction.
• They can reduce costs and time-to-market.
• They can improve the accuracy of preclinical insights.
Limitations:
• The study is theoretical and conceptual, lacking empirical validation or real-world application results.

Generating Novel Leads for Drug Discovery Using LLMs with Logical Feedback
Summary:
• Proposed LMLF (Language Models with Logical Feedback), a novel, iterative framework to guide LLMs in generating drug-like molecules.
• The method separates the prompt into two parts (sketched below):
1. A domain-specific logical constraint (e.g., molecular weight, binding affinity).
2. A domain-independent query (e.g., "Generate valid molecules").
Key Findings:
• GPTLF++ and PaLMLF++ produced molecules with higher binding scores than baselines.
• 15-40% of generated molecules contained selective functional groups (JAK2, DRD2).
• Many generated leads were novel (Tanimoto similarity < 0.75 to known drugs).
Limitations:
• Relies on existing tools (RDKit, GNINA) for molecular property validation and docking, so results depend on their accuracy.
• Generalization is limited to numeric constraints.
• The feedback loop is constrained to iterative logic updates and might not capture broader chemical intuition.

KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models
Summary:
• Proposed KRAGEN, a novel tool that enhances standard Retrieval-Augmented Generation (RAG) by integrating knowledge graphs and Graph-of-Thoughts (GoT) prompting.
• Knowledge graphs are converted into a vector database for retrieval, enhancing factual grounding (sketched below).
• The GoT technique breaks complex biomedical problems into smaller subproblems (nodes in a graph), and each is solved individually using LLMs and retrieved knowledge.
Key Findings:
• KRAGEN improves the logical structure, transparency, and factual accuracy of LLM responses in biomedical domains.
• The use of GoT allows for traceable reasoning paths, helping users understand and verify the logic behind outputs.
Limitations:
• Lacks extensive benchmarking against traditional RAG or clinical decision systems.
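
A small sketch of the "knowledge graph into vector database" step, assuming sentence-transformers for embeddings; the triples, model name, and retrieval function are illustrative, not KRAGEN's released code.

```python
# Sketch: flatten KG triples into sentences, embed them, and retrieve by cosine
# similarity so each Graph-of-Thoughts subproblem can be grounded in facts.
import numpy as np
from sentence_transformers import SentenceTransformer

triples = [
    ("imatinib", "inhibits", "BCR-ABL"),
    ("BCR-ABL", "associated_with", "chronic myeloid leukemia"),
]
sentences = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples]

model = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
emb = model.encode(sentences, normalize_embeddings=True)  # unit vectors -> cosine = dot

def retrieve(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = emb @ q
    top = np.argsort(-scores)[:k]
    return [(sentences[i], float(scores[i])) for i in top]

print(retrieve("Which target does imatinib act on?"))
```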

Computational Strategies for Drug Discovery: Harnessing Indian Medicinal Plants
Summary:
• Data collection: bioactive compounds were sourced from databases such as IMPPAT, IMPPAT 2.0, and COCONUT, guided by a literature review.
• ADMET analysis: tools such as pkCSM and ADMET 3.0 predicted absorption, distribution, metabolism, excretion, and toxicity profiles to screen drug-like compounds.
• Network pharmacology: constructed interaction networks to understand multi-target effects and mechanisms of action within biological pathways.
• Molecular docking: performed virtual screening using tools like AutoDock Vina, PyRx, CB-Dock, etc., to predict the binding affinity of compounds with disease-related proteins.
• Molecular dynamics (MD) simulations: used GROMACS, NAMD, and Desmond to simulate the stability and behavior of protein-ligand complexes; RMSD/RMSF and binding free energies were evaluated.
Key Findings:
• The study identified multiple lead compounds with strong binding affinity, favorable ADMET profiles, and stability in molecular simulations.
• Demonstrated that Indian medicinal plants are a rich reservoir of potential drug candidates.
• Highlighted that integrating traditional medicinal knowledge with modern computational techniques can accelerate the drug discovery pipeline.
Limitations:
• The study relies heavily on in silico predictions without experimental (wet-lab) validation.
• Traditional knowledge integration depends on available databases, which may not capture the full ethnopharmacological scope.

Molecular Simulations for Ayurvedic Phytochemicals
Summary:
• Offers a review and conceptual framework for using CADD (Computer-Aided Drug Discovery) tools, particularly molecular docking and molecular dynamics (MD) simulations, to study Ayurvedic phytochemicals.
• Molecular docking: described as a first-pass screening technique to evaluate ligand-protein interactions using structure-based approaches.
• Molecular dynamics (MD) simulations: proposed to overcome the limitations of docking by modelling ligand flexibility, solvent effects, and complex interactions such as allosteric or competitive binding.
Key Findings:
• Molecular simulations (especially MD) have the potential to reveal mechanistic insights into Ayurvedic formulations.
• Highlights the urgent need for phytochemical-specific force fields, expanded compound-target association databases, and collaborative efforts between computational researchers and Ayurveda experts.
Limitations:
• Lack of experimental data to validate in silico predictions for Ayurvedic phytochemicals.
• No comprehensive phytochemical-target database exists for Ayurvedic plants.
• Current force field models require improvement before molecular dynamics simulations can be applied reliably to these phytochemicals.

A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
Summary:
• This is a survey paper, not an experimental or implementation-based study. It provides a comprehensive review of:
• Vector database architecture, particularly how data is stored, retrieved, and queried.
• Four key approximate nearest neighbour search (ANNS) approaches (contrasted with exact search in the sketch below):
1. Hash-based
2. Tree-based
3. Graph-based
4. Quantization-based
• Storage techniques including sharding, partitioning, caching, and replication.
• Retrieval techniques using NNS and ANNS strategies.
Key Findings:
• Vector databases are essential for storing and retrieving the high-dimensional, unstructured data used in modern AI.
• They support efficient similarity search through techniques like ANNS and scalable storage through sharding and replication.
• Integration with LLMs can enhance semantic understanding in search and RAG pipelines.
Limitations:
• The survey focuses mostly on architectural and algorithmic overviews; it lacks empirical benchmarks or performance comparisons.
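
To make the survey's categories concrete, here is a small sketch contrasting exact nearest-neighbour search with a graph-based ANNS index (HNSW) in FAISS; the data are synthetic and the parameters illustrative.

```python
# Exact NNS (IndexFlatL2) vs. graph-based ANNS (IndexHNSWFlat) on random vectors.
import numpy as np
import faiss

d, n = 128, 10_000
xb = np.random.rand(n, d).astype("float32")   # database vectors
xq = np.random.rand(5, d).astype("float32")   # query vectors

flat = faiss.IndexFlatL2(d)                   # exact nearest-neighbour search
flat.add(xb)
_, I_exact = flat.search(xq, 10)

hnsw = faiss.IndexHNSWFlat(d, 32)             # graph-based ANNS (HNSW)
hnsw.add(xb)
_, I_ann = hnsw.search(xq, 10)

# Recall@10 of the approximate index against the exact results.
recall = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(I_ann, I_exact)])
print(f"HNSW recall@10 vs exact search: {recall:.2f}")
```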

IMPPAT: A curated database of Indian Medicinal Plants, Phytochemistry And Therapeutics
Summary:
• Developed IMPPAT, a manually curated database featuring:
1. 1,742 Indian medicinal plants
2. 9,596 phytochemicals (with 2D and 3D structures)
3. 1,124 therapeutic uses
• Data sources included traditional medicine books, databases (e.g., PubMed), and the Traditional Knowledge Digital Library (TKDL).
• Chemical properties (e.g., logP, molecular weight) and ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) predictions were computed using cheminformatics tools such as FAF-Drugs4 and admetSAR.
Key Findings:
• IMPPAT is the largest curated digital repository of phytochemicals from Indian medicinal plants.
• 960 phytochemicals were identified as druggable, with 591 being novel (no similarity to FDA-approved drugs).
• The chemical space of IMPPAT phytochemicals is more complex and diverse (stereochemistry, Fsp³) than that of commercial compound libraries.
• The database provides tools for chemical filtering, drug-likeness evaluation, structure downloads, and network visualization.
Limitations:
• Incomplete phytochemical data for many plants due to limitations in the literature and digitized sources.

IMPPAT 2.0: An Enhanced and Expanded Phytochemical Atlas of Indian Medicinal Plants
Summary:
• IMPPAT 2.0 is a major update to the original IMPPAT database. It now includes:
1. 4,010 Indian medicinal plants (up from 1,742)
2. 17,967 phytochemicals (a 90% increase)
3. 1,659 therapeutic uses
• Added chemical structures in formats such as SDF, PDB, and MOL2.
Key Findings:
• Identified 960 druggable phytochemicals, many of which are structurally novel compared to known drugs.
• Revealed the chemical uniqueness of Indian phytochemicals compared to Chinese phytochemicals and FDA-approved drugs.
Limitations:
• Phytochemical-target associations are not comprehensively mapped.

Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine
Summary:
• The paper introduces Prompt-RAG, a retrieval-augmented generation (RAG) system that does not rely on vector embeddings.
• Instead of traditional dense embedding searches, it uses natural language prompts directly to retrieve relevant documents (sketched below).
• The approach is designed for niche domains like Korean Medicine (KM), where embedding models often poorly capture semantic relationships.
• Comparative experiments were conducted between:
1. Vector-based RAG (using traditional embeddings)
2. Prompt-RAG (using natural language querying)
Key Findings:
• Prompt-RAG outperformed traditional vector-based RAG and ChatGPT baselines in terms of relevance and informativeness.
• Showed that vector embeddings are not always optimal for niche or culturally specific knowledge domains.
• Demonstrated the potential for prompt-based retrieval systems to enhance RAG pipelines, especially in specialized fields.
Limitations:
• Prompt-RAG may struggle with retrieving highly structured or hierarchically organized data.
• Natural language retrieval could become inefficient with very large corpora compared to ANN (Approximate Nearest Neighbor) methods.
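
A minimal sketch of embedding-free, prompt-driven retrieval in the spirit of Prompt-RAG: the LLM is shown a table of contents and asked which headings to fetch. The `call_llm` stub, corpus, and prompts are assumptions for illustration only.

```python
# Prompt-driven retrieval: no embeddings, the LLM selects relevant headings.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

corpus = {
    "Sasang constitutional typology": "...",
    "Herbal formulas for digestion": "...",
    "Acupuncture point selection": "...",
}

def prompt_rag_answer(question: str) -> str:
    toc = "\n".join(f"- {h}" for h in corpus)
    picked = call_llm(
        f"Question: {question}\nTable of contents:\n{toc}\n"
        "Return the headings (comma-separated) most relevant to the question."
    )
    context = "\n\n".join(corpus[h] for h in corpus if h in picked)
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```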

Explainable Biomedical Hypothesis Generation via Retrieval Augmented Generation enabled Large Language Models
Summary:
• The authors propose RUGGED (Retrieval Under Graph-Guided Explainable Disease Distinction), a workflow for generating biomedical hypotheses using RAG-enabled LLMs.
• Combines three main steps:
1. Text mining: extracts disease, drug, and molecular associations from biomedical literature.
2. Graph prediction models: create explainable graphs that forecast potential links among diseases and therapeutics.
3. RAG-based LLM interaction: supports human users in generating and refining biomedical hypotheses based on retrieved evidence and graph suggestions.
• Evaluated using a clinical case study involving Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM) to suggest repurposed therapeutics.
Key Findings:
• RUGGED enhances hypothesis generation by combining RAG with explainable knowledge graphs.
• It reduced hallucination risk compared to purely generative LLM approaches.
• Successfully identified potential therapeutic targets and disease linkages in the ACM/DCM case study.
• Demonstrated that structured retrieval plus graph-based explainability can strengthen biomedical research support systems.
Limitations:
• RUGGED was evaluated on a single case study (ACM vs. DCM); broader validation is pending.
• The effectiveness of hypothesis generation depends heavily on the relevance and accuracy of the retrieved documents.

Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval (KARE)
Summary:
• KARE constructs a multi-source medical knowledge graph (KG) by integrating:
1. Biomedical databases (e.g., UMLS)
2. Clinical literature (e.g., PubMed)
3. LLM-generated domain-specific insights
• It detects graph communities hierarchically (using the Leiden algorithm, sketched below) and summarizes them.
• Patient EHR data is augmented with relevant community summaries (context augmentation).
• A small local LLM is fine-tuned on the augmented EHRs to generate reasoning chains (step-by-step explanations) and prediction labels (e.g., mortality, readmission).
• Datasets used: the MIMIC-III and MIMIC-IV electronic health record (EHR) datasets.
Key Findings:
• KARE outperforms traditional RAG and other baseline models on all tasks:
1. Up to 10.8%-15.0% improvement on MIMIC-III tasks.
2. Up to 12.6%-12.7% improvement on MIMIC-IV tasks.
• Significantly improved interpretability by producing reasoning chains for each clinical prediction.
Limitations:
• Scalability to more complex healthcare tasks is left as future work; current validation is limited to mortality and readmission prediction.
• Fine-grained clinical concepts may be lost due to the reliance on code mappings.
• For extremely large graph communities, summaries are not generated because of LLM context-window limits.
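
A small sketch of the community-detection and summarization step, using python-igraph's Leiden implementation; the toy graph and the `summarize_community` stub (an LLM call in KARE) are assumptions for illustration.

```python
# Detect Leiden communities over a toy medical concept graph, then summarize each
# community; summaries would later augment a patient's EHR context.
import igraph as ig

edges = [("metformin", "type 2 diabetes"), ("type 2 diabetes", "HbA1c"),
         ("warfarin", "atrial fibrillation"), ("atrial fibrillation", "stroke")]
names = sorted({v for e in edges for v in e})
g = ig.Graph()
g.add_vertices(names)
g.add_edges(edges)

communities = g.community_leiden(objective_function="modularity")

def summarize_community(concepts: list) -> str:
    """Placeholder: in KARE this summary is generated by an LLM."""
    return "Community about: " + ", ".join(concepts)

summaries = [summarize_community([names[i] for i in comm]) for comm in communities]
print(summaries)
```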

Y-Mol: A Multiscale Biomedical Knowledge-Guided Large Language Model for Drug Development
Summary:
• Y-Mol is a domain-specific LLM built on LLaMA2 and fine-tuned for drug development tasks.
• Constructs a multiscale biomedical knowledge dataset from:
1. Publications (PubMed corpus)
2. Biomedical knowledge graphs (for interaction relationships)
3. Expert-designed synthetic data from small models (e.g., ADMET predictions, drug repurposing tools)
• The instruction dataset is crafted into three types (the third is sketched below):
1. Description-based prompts (extracted from publications)
2. Semantic-based prompts (capturing relations in knowledge graphs)
3. Template-based prompts (simulating expert domain knowledge)
Key Findings:
• Y-Mol successfully establishes the first multiscale knowledge-guided LLM pipeline for drug development.
• Shows that integrating biomedical domain knowledge at scale can improve LLM reasoning, generalization, and predictive ability in drug R&D.
• Y-Mol offers a blueprint for future biomedical-specific LLMs leveraging structured and unstructured knowledge sources.
Limitations:
• Y-Mol is built on synthesized instruction datasets; real-world noise and variation may impact downstream performance.
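
A minimal sketch of template-based instruction generation from knowledge-graph relations, in the spirit of Y-Mol's third prompt type; the triples and templates are illustrative, not the paper's released dataset.

```python
# Turn KG relations into instruction-style prompt/answer pairs via templates.
kg_triples = [
    ("aspirin", "inhibits", "COX-1"),
    ("aspirin", "treats", "inflammation"),
]

TEMPLATES = {
    "inhibits": "Question: Which protein does {h} inhibit?\nAnswer: {t}",
    "treats":   "Question: Which condition can {h} be used to treat?\nAnswer: {t}",
}

instructions = [TEMPLATES[r].format(h=h, t=t) for h, r, t in kg_triples if r in TEMPLATES]
for example in instructions:
    print(example, end="\n\n")
```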

KEDRec-LM: A Knowledge-distilled Explainable Drug Recommendation Large Language Model
Summary:
• Drug-disease pair extraction: important drug-disease associations are gathered from the Drug Repurposing Knowledge Graph (DRKG).
• Evidence retrieval: a Retrieval-Augmented Generation (RAG) system fetches supporting biomedical documents from PubMed and Clinical Trials.
• Teacher model reasoning: a GPT-based teacher model answers clinical questions about each drug-disease pair by reasoning over the retrieved evidence.
• Knowledge distillation: a smaller student model (based on fine-tuned LLaMA) is trained to predict drug recommendations and generate human-readable rationales from the teacher's responses.
Key Findings:
• KEDRec-LM achieves strong performance in generating accurate, explainable drug recommendations.
• Distilled models retain high reasoning ability while being more efficient than the teacher LLMs.
• Demonstrates that integrating knowledge graphs, literature retrieval, and instruction fine-tuning can lead to practical, explainable biomedical LLMs.
• The authors open-sourced both the dataset and the fine-tuned model to foster further research in explainable drug recommendation.
Limitations:
• The model's scope is restricted to drug-disease pairs drawn from DRKG; it is not comprehensive across all biomedical domains.
• The retrieval stage relies heavily on PubMed and Clinical Trials; missing or low-quality literature can affect rationale generation.

Harnessing the Power of Knowledge Graphs to Enhance LLM Explainability in the Biomedical Domain
Summary:
• Proposes a model that enhances biomedical reasoning by fusing Knowledge Graph (KG) and LLM representations through a Cross-Modal Attention (CMA) mechanism (sketched below).
• The process follows these steps:
• First, relevant triples are extracted from the UMLS knowledge graph based on each MedQA question and its answer choices.
• Two parallel encodings are then performed: KG triples are encoded using a Graph Neural Network (GNN), while textual data (questions and answers) are encoded using a pre-trained BioBERT model.
• A single-layer cross-modal transformer is employed, in which CMA fuses the KG embeddings and text embeddings, replacing traditional self-attention.
• The model then uses the CLS token output from the transformer to make the final answer prediction.
• To enhance explainability, the attention scores linking the CLS token and KG nodes are visualized, providing local, interpretable rationales for each prediction.
Key Findings:
• The CMA model (cross-modal attention between LLM and KG embeddings) outperforms fine-tuned BioBERT on the MedQA biomedical reasoning task.
• Combines improved task performance with plausible local explainability through attention visualization.
• Demonstrates that structured biomedical knowledge (from KGs) can significantly enhance both the accuracy and interpretability of LLM predictions.
• Provides a promising step toward integrating knowledge graphs and LLMs for transparent biomedical AI systems.
Limitations:
• Although the model uses attention mechanisms for explainability, the paper acknowledges that attention-based explanations are debated and not always fully faithful.
• Explainability and reasoning improvements are based on a single biomedical KG (UMLS); generalization to other biomedical knowledge bases is not evaluated.
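
A minimal PyTorch sketch of single-layer cross-modal attention fusion, assuming precomputed BioBERT text embeddings and GNN-encoded KG embeddings; dimensions, the classifier head, and the random inputs are illustrative, not the paper's exact architecture.

```python
# Cross-modal attention: text tokens query KG node embeddings; the CLS-position
# output drives the answer prediction, and the attention weights can be visualized.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, num_answers: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_emb, kg_emb):
        # Queries come from text tokens (CLS first); keys/values from KG nodes.
        fused, attn_weights = self.cross_attn(query=text_emb, key=kg_emb, value=kg_emb)
        cls = fused[:, 0]                           # CLS-position representation
        return self.classifier(cls), attn_weights   # weights support local rationales

model = CrossModalFusion()
text = torch.randn(2, 64, 768)   # [batch, text tokens (CLS first), hidden]
kg = torch.randn(2, 32, 768)     # [batch, KG node/triple embeddings, hidden]
logits, weights = model(text, kg)
print(logits.shape, weights.shape)  # torch.Size([2, 4]) torch.Size([2, 64, 32])
```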

Leveraging AI in ayurvedic agriculture: A RAG chatbot for comprehensive medicinal plant insights using hybrid deep learning approaches
Summary:
• Developed a hybrid deep learning model combining DeiT (Data-efficient Image Transformer) and a VGG16 CNN (sketched below).
• Dataset used: the Indian Medicinal Plant dataset (4,161 training images and 893 testing images across 40 classes).
• Preprocessing steps included image resizing, random cropping, flipping, and normalization.
• Models were trained using a 70:15:15 (training:validation:testing) split with batch size 32 and learning rate 1e-4 using the AdamW optimizer.
• The final DeiT+VGG16 hybrid model concatenated local (VGG16) and global (DeiT) features.
• This model was integrated into a Retrieval-Augmented Generation (RAG) chatbot built with LangChain and the OpenAI API to generate insights about the identified medicinal plants in English and Nepali.
• The Googletrans API was used for bilingual translation support.
Key Findings:
• The DeiT+VGG16 hybrid model achieved 96.75% testing accuracy, outperforming the individual DeiT (95.97%) and VGG16 (90.26%) models.
• Successfully developed an offline-capable, bilingual, farmer-accessible RAG chatbot.
• The system allows users to scan plants, identify them, and receive generated medicinal and economic insights, supporting farmer empowerment and Ayurvedic research.
• Future work: extend to real-time mobile apps, improve the chatbot UI, support more languages, and develop IoT-based plant tracking.
Limitations:
• The model sometimes misidentified medicinal plants, especially among visually similar species.
• The RAG chatbot occasionally produced misinformation about plants' uses and economic values.
• Translation inaccuracies occurred due to reliance on the GoogleTrans API for Nepali-English switching.
• Performance was not strong enough to handle medicinal plants beyond those present in the dataset.
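
A minimal PyTorch sketch of the DeiT+VGG16 feature-concatenation idea, assuming timm for the DeiT backbone and torchvision for VGG16; the pooling, layer choices, and 40-class head follow the dataset description, not the authors' released code.

```python
# Concatenate global DeiT features with pooled local VGG16 features for classification.
import torch
import torch.nn as nn
import timm
from torchvision import models

class HybridPlantClassifier(nn.Module):
    def __init__(self, num_classes: int = 40):
        super().__init__()
        # Global features: DeiT backbone with its classification head removed (768-d).
        self.deit = timm.create_model("deit_base_patch16_224", pretrained=False, num_classes=0)
        # Local features: VGG16 convolutional trunk pooled to a 512-d vector.
        self.vgg_features = models.vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(768 + 512, num_classes)

    def forward(self, x):
        global_feat = self.deit(x)                                # [B, 768]
        local_feat = self.pool(self.vgg_features(x)).flatten(1)   # [B, 512]
        return self.head(torch.cat([global_feat, local_feat], dim=1))

model = HybridPlantClassifier()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 40])
```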

Drug discovery from plant sources: An integrated approach
Summary:
• Plant selection is prioritized using:
1. Traditional documented use
2. Tribal/ethnomedicinal undocumented use
3. Exhaustive literature search
4. Ayurvedic pharmacological attributes (Rasa, Guna, Veerya, Vipaka, Dosha Karma)
• Extraction strategy suggested:
1. Use parallel extraction (multiple solvents simultaneously) for plants with known activity.
2. Use sequential extraction (polarity-based) for plants with unknown activity.
• Bioassay-guided fractionation is performed to isolate and standardize bioactive compounds.
• Go/No-Go criteria based on the potency of extracts versus pure compounds are applied to decide development paths.
Key Findings:
• Systematic integration of Ayurvedic knowledge into plant selection significantly improves success rates in drug discovery.
• Applying Ayurvedic concepts (such as Rasa, Guna, Veerya) allows strategic shortlisting of potential plants, reducing time, costs, and development risks.
• Proposes that modern pharmaceutical industries should embrace herbal extracts and standardized botanical drugs for faster, safer drug development.
• Reinforces that combining traditional medicine insights with modern biological screening creates a cost-effective and scientifically sound pathway for developing new plant-derived drugs.
Limitations:
• Extensive use of natural sources for drug development could threaten biodiversity.
• Not all bioactive plant compounds are easily or completely synthetically reproducible.
• Access and benefit-sharing rules under the Convention on Biological Diversity (CBD) complicate drug commercialization from plant sources.
• Many bioactive natural compounds violate Lipinski's Rule of Five and may therefore face bioavailability issues, requiring alternative drug-likeness criteria.
• When plants are selected randomly, success rates in drug discovery programs are low compared to systematic, Ayurveda-guided selection.

Integrating Retrieval-Augmented Generation with Large Language Model Mistral 7b for Indonesian Medical Herb
Summary:
• Fine-tuned the Mistral 7b model and combined it with the Retrieval-Augmented Generation (RAG) method.
• Dataset: 9 academic journals focused exclusively on Indonesian medicinal herbs were collected and processed.
• Preprocessing: text was split into chunks of size 500 (512 tokens per chunk), embedded with Sentence-Transformers, and indexed using FAISS.
• System architecture (sketched below): user query → semantic search over the vector DB → retrieved context + query fed to Mistral 7b → answer generated.
Key Findings:
• Mistral 7b with RAG achieved a higher METEOR score (0.22) than LLaMA2 7b (0.14).
• The RAG-Mistral model generated more creative, context-grounded herbal recommendations than LLaMA2 7b.
• Precision was slightly lower due to the creative outputs, but relevance and factual correctness were better.
• The authors recommend further research with more experts and an expanded journal dataset to improve reliability and real-world application in herbal medicine Q&A systems.
Limitations:
• Only 9 journals were used for model fine-tuning, limiting herbal plant knowledge coverage.
• Validation was done with only one expert (from the ethnobotany field), lacking broader clinical validation.
• Evaluation covered only 6 herb-related conditions (e.g., headache, diabetes, hypertension, fever, rheumatism, heartburn).
• Mistral 7b outputs were more creative but had lower precision than LLaMA2 7b.
• Overfitting is a risk, since training was narrowly focused on Indonesian plants and may not generalize beyond this scope.
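
A minimal sketch of the retrieval stage (chunk, embed, index, prompt), assuming sentence-transformers and FAISS as described above; the sample chunks, embedding model name, and the `generate` stub standing in for the fine-tuned Mistral 7b are assumptions.

```python
# Chunk -> embed -> FAISS index -> retrieve -> assemble prompt for the LLM.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "Temulawak (Curcuma zanthorrhiza) is traditionally used for digestive complaints.",
    "Sambiloto (Andrographis paniculata) is reported in studies on fever and diabetes.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vecs = embedder.encode(chunks, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(vecs.shape[1])   # inner product == cosine on unit vectors
index.add(vecs)

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in the fine-tuned Mistral 7b inference call here")

def answer(query: str, k: int = 2) -> str:
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, idx = index.search(q, k)
    context = "\n".join(chunks[i] for i in idx[0])
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```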

VAIV bio-discovery service using transformer model and retrieval augmented generation
Summary:
• Developed VAIV Bio-Discovery, a biomedical neural search service.
• The system combines transformer-based models (such as T5slim_dec) for relation extraction with Retrieval-Augmented Generation (RAG) for natural language search and summarization.
• It uses PubMed abstracts and Therapeutic Target Database (TTD) documents to recognize entities (chemicals, genes/proteins, diseases) and their interactions (e.g., drug-drug, chemical-protein, chemical-disease).
• Neural search (using RoBERTa embeddings) is combined with BM25 keyword search to retrieve relevant documents (sketched below).
Key Findings:
• VAIV Bio-Discovery significantly improves biomedical information retrieval by combining neural search, entity recognition, relation extraction, and RAG-based summarization.
• It provides user-friendly interfaces supporting basic search, entity and interaction search, and natural language queries.
• Outperforms traditional databases in discovering new interactions and providing richer summaries.
• Achieved high QA performance: a ROUGE-1 F-score of 0.912 and a BLEU score of 0.795.
• Positioned as a powerful tool for hypothesis development, database curation, and biomedical research support.
Limitations:
• Currently covers only PubMed abstracts, not full-text articles, limiting the depth of information.
• The number of recognized interactions is lower than in curated databases such as CTD because the system uses only direct extraction without inferred associations.
• Limited coverage for chemical-disease relations (CDR) because of small training datasets.
• Some issues remain in named entity recognition (due to synonyms and abbreviations) and in relation extraction granularity (especially for underrepresented classes such as modulators and cofactors).
• Still needs frequent updates and expansion to other resources (e.g., full texts, arXiv biomedical papers).
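
A small sketch of hybrid keyword-plus-dense retrieval in the spirit of the system above, using rank_bm25 and sentence-transformers as stand-ins for the service's BM25 and RoBERTa components; the documents and fusion weights are illustrative.

```python
# Fuse BM25 keyword scores with dense-embedding cosine similarity.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Imatinib inhibits the BCR-ABL fusion protein in chronic myeloid leukemia.",
    "Metformin lowers hepatic glucose production in type 2 diabetes.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5):
    kw = np.array(bm25.get_scores(query.lower().split()))
    kw = kw / (kw.max() + 1e-9)                     # normalize keyword scores
    dense = doc_vecs @ embedder.encode([query], normalize_embeddings=True)[0]
    scores = alpha * kw + (1 - alpha) * dense       # simple weighted fusion
    return sorted(zip(docs, scores), key=lambda x: -x[1])

print(hybrid_search("Which drug targets BCR-ABL?"))
```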

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Summary:
• New metrics: introduces Boundary Clarity and Chunk Stickiness for direct evaluation of chunking quality (instead of indirect QA accuracy).
• MoC framework (sketched below):
1. A granularity-aware router dynamically selects lightweight, specialized chunkers.
2. Each chunker outputs regular expressions to segment the text efficiently.
3. An edit-distance recovery algorithm fixes potential hallucination errors from LLM chunkers by comparing generated chunks with the original text.
• The system optimizes the balance between high chunking precision (important for RAG quality) and low computational overhead (only a small, lightweight model is active at a time).
Key Findings:
• Boundary Clarity and Chunk Stickiness are effective at directly measuring chunking quality, providing better insight than downstream QA evaluation alone.
• MoC improves both:
1. Chunking precision (better content retrieval for RAG)
2. Computational efficiency (low resource cost per chunking operation)
• Achieved superior QA results compared to baseline chunking strategies on multiple datasets.
• Demonstrates that better chunking strategies can significantly enhance overall RAG system performance without heavy LLM usage.
Limitations:
• The model assumes that document structures can be captured using regular patterns (regex), which may not generalize to highly unstructured texts.
• While computationally efficient, smaller specialized chunkers may lack general reasoning ability for complex or ambiguous inputs.
• Introducing multiple chunkers, routing, and recovery adds system complexity and may pose deployment challenges.
• Focuses primarily on QA datasets; performance on non-QA RAG tasks (such as summarization or multi-hop reasoning) remains untested.
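
A minimal sketch of two MoC mechanics: a chunker expressed as a regular expression and an edit-distance check that snaps drifted chunks back to the source text; the regex, threshold, and example are illustrative, not the paper's values.

```python
# Regex-based chunking plus edit-distance recovery of drifted chunks.
import re
import difflib

def regex_chunker(text: str, pattern: str = r"(?<=[.!?])\s+") -> list:
    """A lightweight chunker expressed as a regex (here: sentence-boundary split)."""
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

def recover(chunk: str, source: str, min_ratio: float = 0.85) -> str:
    """Snap a chunk that deviates from the source back to its closest matching span
    (a stand-in for MoC's edit-distance recovery)."""
    candidates = regex_chunker(source)
    best = max(candidates, key=lambda s: difflib.SequenceMatcher(None, chunk, s).ratio())
    ratio = difflib.SequenceMatcher(None, chunk, best).ratio()
    return best if ratio >= min_ratio else chunk

source = "RAG retrieves passages. Chunking decides their boundaries. Bad chunks hurt recall."
hallucinated = "Chunking decides there boundaries."   # slightly corrupted LLM output
print(recover(hallucinated, source))                  # -> "Chunking decides their boundaries."
```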

Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
Summary:
• The paper proposes Meta-Chunking, a new text segmentation method for Retrieval-Augmented Generation (RAG) systems, targeting better logical coherence than traditional rule- or similarity-based chunking.
• Two core strategies are introduced (the second is sketched below):
1. Margin Sampling Chunking: uses an LLM to perform binary classification on consecutive sentences (merge or segment) based on margin probability sampling.
2. Perplexity (PPL) Chunking: calculates the perplexity of each sentence given the preceding sentences, identifying chunk boundaries where the perplexity distribution shows minima (logical break points).
Key Findings:
• Meta-Chunking outperformed rule-based and similarity-based chunking methods on 11 benchmarks, including single-hop and multi-hop QA tasks.
• Achieved a 1.32% improvement on 2WikiMultihopQA while reducing chunking time to 45.8% of previous LLM-based chunkers such as LumberChunker.
• The approach balances logical consistency, retrieval relevance, and efficiency, making it well suited to practical RAG pipelines.
Limitations:
• Assumes reliable sentence splitting before chunking, which may not be trivial for noisy or OCR-derived datasets.
• The effectiveness of Margin Sampling relies on how well the LLM can perceive logical connections; smaller or weaker LLMs might degrade performance.
• PPL Chunking requires sentence-by-sentence perplexity calculation, which adds computational cost during chunking pre-processing (although it is cheaper than the Gemini-based LumberChunker).
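
A minimal sketch of perplexity-based boundary detection, using GPT-2 from Hugging Face transformers as a small stand-in LLM; the boundary rule here is a simple perplexity-ratio threshold rather than the paper's exact minima criterion.

```python
# Start a new chunk when the running chunk barely lowers a sentence's perplexity,
# i.e., when the preceding context provides little predictive benefit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_ppl(context: str, sentence: str) -> float:
    """Perplexity of `sentence`, optionally conditioned on `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids if context else None
    sent_ids = tok(" " + sentence, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, sent_ids], dim=1) if ctx_ids is not None else sent_ids
    labels = input_ids.clone()
    if ctx_ids is not None:
        labels[:, : ctx_ids.shape[1]] = -100   # score only the candidate sentence
    with torch.no_grad():
        loss = lm(input_ids, labels=labels).loss
    return float(torch.exp(loss))

def ppl_chunk(sentences, ratio: float = 0.9):
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if sentence_ppl(" ".join(current), sent) > ratio * sentence_ppl("", sent):
            chunks.append(" ".join(current))   # weak contextual link: logical break
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```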

ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems
Summary:
1. Semantic chunking: documents are first segmented into logical, meaningful units (sentences or coherent sections) using semantic-based tokenization.
2. Relevance scoring via LLM: a large language model evaluates each chunk individually, assigning a relevance score based on how well the chunk aligns with the user query.
3. Chunk filtering: only the most relevant chunks (above a dynamic threshold) are selected and passed forward into the retrieval pipeline (sketched below).
4. Dynamic thresholding: instead of a fixed relevance cut-off, the threshold is adjusted dynamically based on document complexity and query specificity.
5. Critic model feedback (optional): a second LLM (the "Critic") re-assesses the selected chunks to further improve relevance filtering and catch false positives.
Key Findings:
• ChunkRAG substantially reduced hallucination and irrelevance compared to standard RAG setups.
• Improved factual accuracy on knowledge-intensive tasks (e.g., complex QA benchmarks).
• Achieved better retrieval precision by preventing irrelevant information from "leaking" into final generations.
• Demonstrated that chunk-level filtering is more effective than document-level or passage-level methods for high-stakes reasoning tasks.
Limitations:
• Running an LLM to assess each chunk adds significant computational cost compared to traditional retrieval pipelines.
• Relevance-scoring quality depends strongly on how well the primary LLM understands subtle, domain-specific context.
• For extremely large documents, even semantic chunking and per-chunk evaluation can become inefficient if not batched or optimized.
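
A minimal sketch of chunk-level filtering with a dynamic threshold; the `score_chunk` stub stands in for the LLM relevance judge, and the mean-plus-spread rule is just one simple way to adapt the cut-off to the score distribution.

```python
# Score every chunk, derive a threshold from the score distribution, keep the rest.
import statistics

def score_chunk(query: str, chunk: str) -> float:
    """Placeholder: ask an LLM to rate chunk relevance to the query on [0, 1]."""
    raise NotImplementedError

def filter_chunks(query: str, chunks: list, spread: float = 0.5) -> list:
    scored = [(c, score_chunk(query, c)) for c in chunks]
    scores = [s for _, s in scored]
    # Dynamic threshold: adapts to how concentrated or spread out the scores are.
    threshold = statistics.mean(scores) + spread * statistics.pstdev(scores)
    kept = [c for c, s in scored if s >= threshold]
    # An optional critic LLM could re-check `kept` here before generation.
    return kept
```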