MetagenomeScope

MetagenomeScope is an interactive visualization tool designed for (meta)genome assembly graphs.

This version of the tool is still a work in progress; more to come soon.

Installation

Using mamba:

mamba create -n mgsc -c conda-forge "python >= 3.8" pygraphviz
mamba activate mgsc
pip install git+https://github.com/marbl/MetagenomeScope.git

(... Eventually we'll put this on bioconda or something.)

Usage

Activate the mamba environment we just created and run:

mgsc -g graph.gfa

... where graph.gfa is a path to the assembly graph you want to visualize (see information below on supported graph filetypes).

This will start a server using Dash. Navigate to localhost:8050 in a web browser to access the visualization. (If you want to use a different port number than 8050, you can use the -p/--port parameter of mgsc.)

Supported assembly graph filetypes

| Filetype | Tools that output this filetype | Notes |
| --- | --- | --- |
| GFA (.gfa) | (meta)Flye, LJA, miniasm, hifiasm, hifiasm-meta, ... | Both GFA v1 and GFA v2 files are accepted, but currently only the raw structure (segments and links) is included. |
| FASTG (.fastg) | SPAdes, MEGAHIT | Expects SPAdes-"dialect" FASTG files. See pyfastg's documentation for details. |
| DOT (.dot, .gv) | (meta)Flye, LJA | Expects DOT files produced by Flye or LJA. See "What filetype should I use for de Bruijn graphs?" in the FAQs below. |
| GML (.gml) | MetaCarvel | Expects MetaCarvel-"dialect" GML files. |
| LastGraph (.LastGraph) | Velvet | Only the raw structure (nodes and arcs) is included. |

Should you run into additional assembly graph filetypes you'd like us to support, feel free to open a GitHub issue.

FAQs

Reverse-complementary sequences

FAQ 1. How do you handle reverse complement nodes/edges?

The answer to this depends on the filetype of the graph you are using.

"Explicit" graph filetypes (FASTG, DOT, GML)

When MetagenomeScope reads in FASTG, DOT, and GML files, it assumes that these files explicitly describe all of the nodes and edges in the graph. So, let's say you give MetagenomeScope the following DOT file:

digraph g {
  1 -> 2 [label="A99(2.4)"];
}

We will interpret this as a graph with two nodes (1, 2) and one edge (1 -> 2).

"Implicit" graph filetypes (GFA, LastGraph)

However, for GFA and LastGraph files, MetagenomeScope cannot make the assumption that these files explicitly describe all of the nodes and edges in the graph. In these files, each declaration of a node / edge (in GFA parlance, "segment" / "link"; in LastGraph parlance, "node" / "arc") also declares this node / edge's reverse complement.

So, let's say you give MetagenomeScope the following GFA file (based on this example):

H	VN:Z:1.0
S	1	CGATGCAA
S	2	TGCAAAGTAC
L	1	+	2	+	5M

We will interpret this as a graph with four nodes (1, -1, 2, -2) and two edges (1 -> 2, -2 -> -1). The presence of node X "implies" the existence of the reverse complement node -X, and the presence of edge X -> Y "implies" the existence of the reverse complement edge -Y -> -X. Interpreting the graph file in this way is analogous to how "double mode" works in Bandage.
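
To make the "implies" rule concrete, here is a minimal Python sketch that expands the S and L lines of a GFA 1 file into the implied node and edge sets. This is only an illustration, not MetagenomeScope's actual parser; the function names (negate, oriented, expand_gfa) are hypothetical, and the sketch assumes tab-separated GFA 1 lines and ignores everything except segments and links.

```python
def negate(node_id):
    """Return the ID of a node's reverse complement (X <-> -X)."""
    return node_id[1:] if node_id.startswith("-") else "-" + node_id


def oriented(name, orientation):
    """Map a GFA segment name plus its orientation (+ or -) to a node ID."""
    return name if orientation == "+" else negate(name)


def expand_gfa(gfa_text):
    """Collect the nodes and edges implied by the S and L lines (hypothetical helper)."""
    nodes, edges = set(), set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Declaring segment X also declares its reverse complement -X.
            nodes.update({fields[1], negate(fields[1])})
        elif fields[0] == "L":
            src = oriented(fields[1], fields[2])
            tgt = oriented(fields[3], fields[4])
            # Declaring X -> Y also declares -Y -> -X.
            edges.add((src, tgt))
            edges.add((negate(tgt), negate(src)))
    return nodes, edges


gfa = "H\tVN:Z:1.0\nS\t1\tCGATGCAA\nS\t2\tTGCAAAGTAC\nL\t1\t+\t2\t+\t5M\n"
nodes, edges = expand_gfa(gfa)
print(sorted(nodes))  # ['-1', '-2', '1', '2']
print(sorted(edges))  # [('-2', '-1'), ('1', '2')]
```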

Based on the FASTG specification, shouldn't FASTG be an "implicit" instead of an "explicit" filetype?

It's complicated. The way I interpret the FASTG specification, each declaration of an edge sequence implicitly also declares this edge sequence's reverse complement; however, this is not the case for "adjacencies" between edge sequences.

In any case, the "dialect" of FASTG files produced by SPAdes and MEGAHIT lists edge sequences and their reverse complements (as well as adjacencies between edge sequences and their reverse complements) separately. Because of this, we consider FASTG to be an "explicit" filetype. (See pyfastg's documentation for details on how we handle reverse complements in FASTG files.)

FAQ 2. Why does my graph have node X and -X in the same component?

The short answer is "probably palindromes." Below is a more detailed answer.

Strand-separated components

Consider the following example GFA file from FAQ 1:

H	VN:Z:1.0
S	1	CGATGCAA
S	2	TGCAAAGTAC
L	1	+	2	+	5M

There are four nodes and two edges in this graph, but they form two (weakly) connected components -- that is, the graph contains one "island" of 1 and 2 (which are connected to each other), and another "island" of -1 and -2 (which are also connected to each other). You can think of these entire components as "reverse complements" of each other: although MetagenomeScope will visualize both of them (at least right now), you don't really need to analyze them separately. These "strand-separated" components describe the same (or mostly the same) sequences, just in different directions.
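
The two "islands" can be reproduced with a short Python sketch that groups nodes into weakly connected components, i.e. ignoring edge direction. The weak_components function is a hypothetical helper written for this example and is not taken from MetagenomeScope's code; the node and edge sets are the ones described above.

```python
def weak_components(nodes, edges):
    """Group nodes into weakly connected components (edge direction is ignored)."""
    neighbors = {n: set() for n in nodes}
    for src, tgt in edges:
        neighbors[src].add(tgt)
        neighbors[tgt].add(src)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search from each node we have not visited yet.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


nodes = {"1", "-1", "2", "-2"}
edges = {("1", "2"), ("-2", "-1")}
components = weak_components(nodes, edges)
print([sorted(c) for c in components])  # [['-1', '-2'], ['1', '2']]
```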

Strand-mixed components

Sometimes a node and its reverse complement will end up being in the same component, due to things like palindromic sequences gluing them together. The following GFA file is the same as the one we just saw, but it now contains an extra "link" line from 1 to -2:

H	VN:Z:1.0
S	1	CGATGCAA
S	2	TGCAAAGTAC
L	1	+	2	+	5M
L	1	+	2	-	0M

This graph contains four edges: 1 -> 2 and -2 -> -1 (which we've already seen), and 1 -> -2 and 2 -> -1. The introduction of these last two edges has caused the graph to become a single "strand-mixed" component, containing both a node X and its reverse-complementary node -X.

This often happens with the big ("hairball") component in an assembly graph.
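
As a rough check of the example above, the following Python sketch merges nodes with a small union-find structure and then flags any component that contains both a node and its reverse complement. The names find, negate, and is_strand_mixed are hypothetical and exist only for this illustration; the edge list is the four edges listed above.

```python
def negate(node_id):
    """Return the ID of a node's reverse complement (X <-> -X)."""
    return node_id[1:] if node_id.startswith("-") else "-" + node_id


def find(parent, x):
    """Return the representative of x's group, shortening paths as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_strand_mixed(component):
    """True if the component contains some node X together with -X."""
    return any(negate(n) in component for n in component)


# The four edges of the strand-mixed example.
edges = [("1", "2"), ("-2", "-1"), ("1", "-2"), ("2", "-1")]

parent = {}
for src, tgt in edges:
    for n in (src, tgt):
        parent.setdefault(n, n)
    # Each edge joins the groups of its two endpoints.
    parent[find(parent, src)] = find(parent, tgt)

components = {}
for n in list(parent):
    components.setdefault(find(parent, n), set()).add(n)

for component in components.values():
    print(sorted(component), is_strand_mixed(component))
# ['-1', '-2', '1', '2'] True
```

Dropping the last two edges from the list splits the graph back into two strand-separated components, for which is_strand_mixed returns False.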

FAQ 3. What happens if an edge is its own reverse complement?

(This assumes that you have read FAQ 1.)

This can happen if an edge exists from X -> -X or from -X -> X in an "implicit" graph file (GFA / LastGraph). Consider this GFA file, c/o Shaun Jackman:

H	VN:Z:1.0
S	1	AAA
S	2	ACG
S	3	CAT
S	4	TTT
L	1	+	1	+	2M
L	2	+	2	-	2M
L	3	-	3	+	2M
L	4	-	4	-	2M

Since this GFA file contains four "link" lines, we might think at first that the corresponding graph contains 4 × 2 = 8 edges. However, the graph only contains 6 unique edges. This is because the reverse complement of 2 -> -2 is itself: we know from above that X -> Y implies -Y -> -X, but -(-2) -> -(2) is equal to 2 -> -2! The same goes for -3 -> 3: -(3) -> -(-3) is equal to -3 -> 3. Both of these edges "imply" themselves as their own reverse complements!

How do we handle this situation? As of writing, when MetagenomeScope visualizes these graphs it will only draw one copy of these "self-implying" edges. This matches the original visualization of this graph, and also matches Bandage's visualization of this GFA file.
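
The count of 6 can be verified with a few lines of Python. This is a sketch of the counting argument only, not of how MetagenomeScope stores edges internally; the negate helper and the tuple representation of links are assumptions made for the example. Storing edges in a set means that an edge equal to its own reverse complement is kept only once.

```python
def negate(node_id):
    """Return the ID of a node's reverse complement (X <-> -X)."""
    return node_id[1:] if node_id.startswith("-") else "-" + node_id


# The four L lines from the GFA file above: (from, orientation, to, orientation).
links = [
    ("1", "+", "1", "+"),
    ("2", "+", "2", "-"),
    ("3", "-", "3", "+"),
    ("4", "-", "4", "-"),
]

edges = set()
self_implying = set()
for src_name, src_orient, tgt_name, tgt_orient in links:
    src = src_name if src_orient == "+" else negate(src_name)
    tgt = tgt_name if tgt_orient == "+" else negate(tgt_name)
    edge = (src, tgt)
    rc_edge = (negate(tgt), negate(src))  # X -> Y implies -Y -> -X
    edges.update({edge, rc_edge})
    if edge == rc_edge:
        self_implying.add(edge)

print(len(edges))             # 6
print(sorted(self_implying))  # [('-3', '3'), ('2', '-2')]
```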

Notably, since we assume that "explicit" graph files (FASTG / DOT / GML) explicitly define all of the nodes and edges in their graph, MetagenomeScope doesn't do anything special for this case for these files. (If your DOT file describes one edge from X -> -X, then that's fine; if it describes two or more edges from X -> -X, then that's also fine, and we'll visualize all of them.)

Graph structure

FAQ 4. What do you mean by a component's "size rank"?

Given a graph with N connected components, we sort these components by the number of nodes they contain, from high to low. We then assign each component a size rank, a number from 1 to N: the component with size rank #1 is the largest component, and the component with size rank #N is the smallest component.

Often, we only care about looking at individual components in a graph -- laying out and drawing the entire graph is not always a good idea when the graph is massive. Component size ranks are a nice way of formalizing this.

Some details about component size ranks, if you are interested:

  • The numbers shown in the treemap (accessible in the "Graph info" dialog) correspond exactly to component size ranks. So, the rectangle labelled #1 in the treemap corresponds to the largest component, the rectangle labelled #2 corresponds to the second-largest component, etc.

  • Component sorting breaks ties using four criteria, applied in the following order. When two components tie on one criterion, the next criterion is used to break the tie.

    • the number of "full" nodes in the component (treating a pair of split nodes 40-L → 40-R as a single node)
    • the number of "total" nodes in the component (treating a pair of split nodes 40-L → 40-R as two nodes)
    • the number of "total" edges in the component (including both real edges and "fake" edges between pairs of split nodes like 40-L → 40-R)
    • the number of patterns in the component
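
The tie-breaking order above maps naturally onto sorting by a tuple key. The Python sketch below illustrates this under two assumptions: the ComponentStats class and its field names are hypothetical (they are not MetagenomeScope's real data structures), and every criterion is sorted from high to low, like the node count.

```python
from dataclasses import dataclass


@dataclass
class ComponentStats:
    """Hypothetical per-component counts, one field per sorting criterion."""
    name: str
    full_nodes: int
    total_nodes: int
    total_edges: int
    patterns: int


def size_ranks(components):
    """Map each component name to its size rank (1 = largest)."""
    ordered = sorted(
        components,
        # Tuples compare element by element, so later fields only break ties.
        key=lambda c: (c.full_nodes, c.total_nodes, c.total_edges, c.patterns),
        reverse=True,
    )
    return {c.name: rank for rank, c in enumerate(ordered, start=1)}


components = [
    ComponentStats("a", full_nodes=5, total_nodes=5, total_edges=4, patterns=0),
    ComponentStats("b", full_nodes=5, total_nodes=6, total_edges=6, patterns=1),
    ComponentStats("c", full_nodes=9, total_nodes=9, total_edges=12, patterns=2),
]
ranks = size_ranks(components)
print(ranks)  # {'c': 1, 'b': 2, 'a': 3}
```

Here "a" and "b" tie on full nodes, so the total node count decides that "b" ranks above "a".
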

FAQ 5. Can my graphs have parallel edges?

Yes! MetagenomeScope supports multigraphs. If your assembly graph file describes more than one edge from X -> Y, then MetagenomeScope will visualize all of these "parallel" edges. (This is mostly useful when visualizing de Bruijn graphs.)

Notably, parallel edges are currently supported only for some filetypes. The parsers MetagenomeScope uses for GFA and FASTG files do not allow multigraphs -- this means that, at the moment, trying to use MetagenomeScope to visualize a GFA or FASTG file containing parallel edges will cause an error. I would like to address this (at least for GFA files) at some point, but it doesn't seem like a very important issue.
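
If you want to check a GFA file for parallel edges before loading it, something like the following Python sketch may help. It is not part of MetagenomeScope: parallel_links is a hypothetical helper, it assumes tab-separated GFA 1 link lines, and it only detects links that repeat the same oriented source and target. Whether the GFA parser rejects exactly these cases is an assumption.

```python
from collections import Counter


def parallel_links(gfa_text):
    """Return oriented (source, target) pairs declared by more than one L line."""
    counts = Counter()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "L" and len(fields) >= 5:
            # Key each link on segment name plus orientation, e.g. ('1+', '2+').
            counts[(fields[1] + fields[2], fields[3] + fields[4])] += 1
    return {pair: n for pair, n in counts.items() if n > 1}


gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tCGATGCAA",
    "S\t2\tTGCAAAGTAC",
    "L\t1\t+\t2\t+\t5M",
    "L\t1\t+\t2\t+\t3M",
])
duplicates = parallel_links(gfa)
print(duplicates)  # {('1+', '2+'): 2}
```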

FAQ 6. What filetype should I use for de Bruijn graphs?

If you are using LJA (and probably also if you are using Flye), you may want to use a DOT file instead of a GFA / FASTG file as input.

This is because GFA and FASTG are not ideal for representing graphs in which sequences are stored on edges rather than nodes (i.e. de Bruijn / repeat graphs). The DOT files output by Flye and LJA should contain the original structure of these graphs (in which edges and nodes in the visualization actually correspond to edges and nodes in the original graph, respectively); the GFA / FASTG files usually represent altered versions in which nodes and edges have been swapped, which is not always an ideal representation.

That being said, please note that -- if you are using an assembler that outputs graphs in different filetypes -- these files may have additional differences beyond the usual filetype differences. For example, Flye's GFA and DOT files can have slightly different coverages, since Flye produces them at different times in its pipeline.

Development documentation

See CONTRIBUTING.md.

License

MetagenomeScope is licensed under the GNU GPL, version 3.

MetagenomeScope's code is distributed with Bootstrap and Bootstrap Icons. Please see the metagenomescope/assets/vendor/licenses/ directory for copies of these tools' licenses.

Acknowledgements

Thanks to various people in the Pop, Knight, and Pevzner Labs over the years for feedback and suggestions on the tool.

Contact

Please open a GitHub issue if you have any questions or suggestions.
