DCVBin: A Novel Metagenomic Binning Tool

Overview

DCVBin is a metagenomic binning tool that combines compositional and semantic features to improve the recovery of high-quality Metagenome-Assembled Genomes (MAGs). It uses a DNA language model (DNABERT_S) to extract semantic features and integrates them with compositional data, which helps with binning when coverage information is limited.

DCVBin provides two analysis modes:

Standard Mode

Uses the pre-trained DNABERT_S model to extract sequence features and combines them with k-mer frequency and sample-specific coverage features. The features are fused with a Variational Autoencoder (VAE), and a two-step adaptive clustering then bins the sequences.
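To make the "k-mer frequency" feature concrete, here is a minimal sketch that counts k-mers in a single sequence. It is an illustration only, not DCVBin's own code: the function name `kmer_counts`, the default `k=4`, and the choice to skip windows containing non-ACGT characters are all assumptions. A frequency vector would divide each count by the total.

```shell
# Illustrative only: count overlapping k-mers in one DNA sequence.
# kmer_counts SEQUENCE [K]   (K defaults to 4, an assumed value)
kmer_counts() {
  local seq="$1" k="${2:-4}"
  printf '%s\n' "$seq" | awk -v k="$k" '
    {
      s = toupper($0)
      # Slide a window of width k across the sequence.
      for (i = 1; i + k - 1 <= length(s); i++) {
        kmer = substr(s, i, k)
        # Skip windows with ambiguous bases such as N.
        if (kmer ~ /^[ACGT]+$/) count[kmer]++
      }
    }
    END { for (m in count) print m, count[m] }
  ' | sort
}
```

For example, `kmer_counts ACGTACGT 4` prints `ACGT 2`, `CGTA 1`, `GTAC 1` and `TACG 1`, one per line.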

Enhanced Mode

Builds on the standard process by adding domain-adaptive pretraining (continued pretraining, CPT) of DNABERT_S on your own data. This adapts feature extraction to a specific microbiome and improves the identification of new species, but requires more computational resources.

Both modes automatically complete the entire pipeline, from raw data to quality assessment. Choose a mode based on your research goals and available computational resources.

Installation

To install DCVBin, follow these steps:

Clone the repository to your local machine:

git clone https://github.com/dengdengf/dcvbin.git
cd dcvbin

Create the conda environments from the files in the envs directory, then activate the main environment:

conda env create -f envs/checkm2.yaml
conda env create -f envs/copygen.yaml
conda env create -f envs/dnaberts.yaml
conda create --name vambnew --file envs/vambnew.lock
conda activate dcvbin_env

Usage

1. Basic Mode (without CPT)

Basic mode corresponds to the Standard Mode described above. Run the my_method.sh script to process your input FASTA files and generate MAGs.

Prepare your data: Ensure your input data is in FASTA format (.fasta). If the data is in a different format, convert it to FASTA first.
Run the script:
```
bash my_method.sh input_dir output_dir
```
- input_dir: Directory containing your .fasta files.
- output_dir: Directory where the output MAGs will be saved.
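
Before starting a long run, it can help to confirm that the input directory really contains FASTA files. The following is a rough sketch under stated assumptions: `check_fasta_dir` is a hypothetical helper, not part of DCVBin, and it only checks that each `.fasta` file begins with a `>` header character, which is a heuristic rather than full format validation.

```shell
# Hypothetical pre-flight check (not part of DCVBin).
# Succeeds only if DIR holds at least one .fasta file and every
# such file starts with a '>' header character.
check_fasta_dir() {
  local dir="$1" found=0 f first
  for f in "$dir"/*.fasta; do
    [ -e "$f" ] || continue          # glob matched nothing
    found=1
    first=$(head -c 1 "$f")
    if [ "$first" != ">" ]; then
      echo "not FASTA (no '>' header): $f" >&2
      return 1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no .fasta files in $dir" >&2
    return 1
  fi
}
```

One way to use it is `check_fasta_dir input_dir && bash my_method.sh input_dir output_dir`, so the pipeline starts only when the check passes.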

2. CPT Mode (with Continued Pretraining)

CPT mode corresponds to the Enhanced Mode described above. Use the my_method_fur.sh script to run the pipeline with Continued Pretraining (CPT) of the DNA language model.

Run the script:
```
bash my_method_fur.sh input_dir output_dir
```
- input_dir: Directory containing your .fasta files.
- output_dir: Directory where the output MAGs will be saved.

The DNABERT_S-main directory is too large for GitHub and has been uploaded to Figshare:

Download link: https://figshare.com/articles/figure/DCVBin_-_Training_Data_and_Models_for_Metagenomic_Binning/30400258

To use CPT mode, you must:

- Download the compressed file from the link above.
- Extract the archive into your project root directory:
```
tar -xzf DNABERT_S-main.tar.gz
```
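
After extraction, a quick check that the model directory is in place can save a failed CPT run. This sketch assumes the archive unpacks to a directory named `DNABERT_S-main` (the name used in this README) directly under the project root. The helper name `have_cpt_model` is hypothetical.

```shell
# Hypothetical helper (not part of DCVBin).
# Succeeds only if the extracted model directory exists under ROOT.
# ROOT defaults to the current directory. The directory name
# "DNABERT_S-main" is assumed from this README.
have_cpt_model() {
  local root="${1:-.}"
  [ -d "$root/DNABERT_S-main" ]
}
```

For example, `have_cpt_model . || echo "DNABERT_S-main not found; extract the archive first"` reports a missing directory before CPT mode is started.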

Example:

To run in basic mode:

bash my_method.sh ./data ./data/output

To run in CPT mode:

bash my_method_fur.sh ./data ./data/output
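
The two commands above differ only in the script name, so a small wrapper can select the script from a mode name. This is a sketch, not something shipped with DCVBin: the function names `script_for_mode` and `run_dcvbin` and the mode labels `basic` and `cpt` are assumptions. Only the script names come from this README.

```shell
# Map a mode label to the script this README documents for it.
# The labels "basic" and "cpt" are assumed names for illustration.
script_for_mode() {
  case "$1" in
    basic) echo "my_method.sh" ;;
    cpt)   echo "my_method_fur.sh" ;;
    *)     echo "unknown mode: $1 (expected basic or cpt)" >&2
           return 1 ;;
  esac
}

# Hypothetical wrapper: run_dcvbin MODE INPUT_DIR OUTPUT_DIR
run_dcvbin() {
  local script
  script=$(script_for_mode "$1") || return 1
  bash "$script" "$2" "$3"
}
```

With these definitions, `run_dcvbin basic ./data ./data/output` is equivalent to the basic-mode example, and an unrecognized mode stops with an error before any script is run.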

Contact

For any questions, concerns, or issues with DCVBin, please contact LiuYF via email at 17837410609@163.com, or create an issue on the GitHub Issues page.

Thank you for using DCVBin!
