Showing posts with label computational biology. Show all posts
Wednesday, May 6, 2015
Pattern and process: computation and biology
It is obvious that there is a big cultural difference between biologists and computationalists, whether or not we think this is a good thing. It follows simply from the nature of the activities in the two professions — the activities are different, and therefore different personalities are attracted to those professions.
Some of these differences are well known. For example, computations require algorithmic repeatability, along with proof that the algorithms achieve the explicitly stated goal. This means that computationalists have to be pedants in order to succeed. On the other hand, no-one can be pedantic and succeed in biology. Biodiversity is a concept that makes it clear that there are no rules to biological phenomena — any generalization that you can think of will turn out to have numerous exceptions. In the biological sciences we do not look for universal "laws" (as in the physical sciences), because there are none; and if you can't handle that fact then you should not try to become a biologist.
This leads to a further difference between the two professions that I think is sometimes poorly appreciated. In general, computationalists focus on patterns, whereas biologists focus on processes. Many processes can produce the same patterns, and therefore the same computations can be used to detect those patterns; this is of interest to people who develop algorithms. On the other hand, in biology any one process can produce many different patterns, so that patterns are often unpredictable. Biologists are aware that patterns and processes can be poorly connected, and the biological interest lies primarily in understanding the processes, because these are frequently more generalizable than are the patterns.
As a simple example of this dichotomy, consider the following diagram (from Loren H. Rieseberg and Richard D. Noyes. 1998. Genetic map-based studies of reticulate evolution in plants. Trends in Plant Science 3: 254-259). It shows the eight haploid chromosomes of a particular plant species.
Perusal of the figure will lead you to identify the pattern, and this is straightforward to detect computationally. Each chromosomal segment is triplicated, but the triplicates are arranged arbitrarily and are sometimes segmented.
On its own this is of little biological interest. The interest lies in the processes that led to the pattern. These processes could produce an infinite number of similar patterns, and so predicting the exact pattern in this species is impossible. We use abduction to proceed from the pattern to the processes (see What we know, what we know we can know, and what we know we cannot know).
We appear to be looking at a case of allopolyploidy (the nuclear genome is hexaploid) followed by recombination. Neither of these processes necessarily produces patterns that can be predicted in detail.
So, the computation focuses on the pattern and the biology on the process. Sometimes biologists forget this, and naively interpret patterns as inevitably implying a particular process. And sometimes computationalists naively expect patterns to be predictable when they are not.
Labels:
computational biology
Monday, February 16, 2015
An Hennigian analysis of the Eukaryotae
As usual at the beginning of the week, this blog presents something in a lighter vein.
Homologies lie at the heart of phylogenetic analysis. They express the historical relationships among the characters, rather than the historical relationships of the taxa. As such, homology assessment is the first step of a phylogenetic analysis, while building a tree or network is the second step.
With a colleague (Mike Crisp, now retired), I once wrote a tongue-in-cheek article about how to misinterpret homologies, and the consequences of this for any subsequent tree-building analysis. This article appeared in 1989 in the Australian Systematic Botany Society Newsletter 60: 24–26. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy [1.2 MB] of the paper:
An Hennigian analysis of the Eukaryotae
Labels:
computational biology,
inference,
Philosophy
Monday, August 25, 2014
The evolution of statistical phylogenetics
For those of you who do not understand the notation:
Homo apriorius ponders the probability of a specified hypothesis, while Homo pragmaticus is interested in the probability of observing particular data. Homo frequentistus wishes to estimate the probability of observing the data given the specified hypothesis, whereas Homo sapiens is interested in the joint probability of both the data and the hypothesis. Homo bayesianis estimates the probability of the specified hypothesis given the observed data.
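In symbols (writing H for the hypothesis and D for the data), the five quantities can be summarized as:

```latex
\begin{align*}
P(H)        &\quad \text{Homo apriorius: the prior probability of the hypothesis} \\
P(D)        &\quad \text{Homo pragmaticus: the probability of the data} \\
P(D \mid H) &\quad \text{Homo frequentistus: the likelihood} \\
P(D, H)     &\quad \text{Homo sapiens: the joint probability} \\
P(H \mid D) &= \frac{P(D \mid H)\,P(H)}{P(D)} \quad \text{Homo bayesianis: the posterior}
\end{align*}
```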
Labels:
Cartoons,
computational biology,
inference
Wednesday, January 22, 2014
Blogs about phylogenetics
I have occasionally been asked about what blogs currently exist in phylogenetics, because there seem to be very few. There are blogs in related areas, such as phyloinformatics, evolutionary biology, and systematics, but very few blogs dedicated primarily to phylogenetics (not just occasionally mentioning it).
Below is a list of the current and former blogs that I know about. In each case I have provided basic information taken from the blog itself. Please let me know about any suitable blogs that have been missed. [Updated 15 October 2014]
Current General Blogs
The Genealogical World of Phylogenetic Networks
Biology, computational science, and networks in phylogenetic analysis. This blog is about the use of networks in phylogenetic analysis, as a replacement for (or an adjunct to) the usual use of trees. This topic has received considerable attention in the biological literature, not least in microbiology (where horizontal gene transfer is often considered to be rampant) and botany (where hybridization has always been considered to be common). It has also received increasing attention in the computational sciences.
Contributors: David Morrison, Steven Kelk, Leo van Iersel, Mike Charleston, Jesper Jansson
Started: 25 February 2012
TreeThinkers
TreeThinkers is a blog devoted to phylogenetic and phylogeny-based inference. We aim to use it as a place to discuss recent research and methods; to ask and answer questions; and serve as a general resource for news and trivia in phylogenetics. Although the blog is associated with the Bodega workshop, we welcome posts and participation from the entire phylogenetics community.
Contributors: Bastien Boussau, Gideon Bradburd, Jeremy Brown, Rich Glor, Tracy Heath, David Hillis, Sebastian Höhna, Luke Mahler, Mike May, Brian Moore, Samantha Price, Peter Wainwright
Editor: Bob Thomson
Started: 2 October 2012
Open Tree of Life
The tree of life links all biodiversity through a shared evolutionary history. This project will produce the first online, comprehensive first-draft tree of all 1.8 million named species, accessible to both the public and scientific communities. Assembly of the tree will incorporate previously-published results, with strong collaborations between computational and empirical biologists to develop, test and improve methods of data synthesis. This initial tree of life will not be static; instead, we will develop tools for scientists to update and revise the tree as new data come in.
Contributors: Robin Blom, Karen Cranston, Karl Gude, Mark Holder, Rosemary Keane, Rick Ree
Started: April 8, 2012
EvoPhylo
Evolution, phylogenetics, bioinformatics, stuff.
Contributor: Dave Lunt
Started: 30 January 2008
The Bayesian Kitchen
Statistical inference and evolutionary biology. Undoubtedly, since its introduction in phylogenetics in the late 90's, Bayesian inference has become an essential part of current applied statistical work in evolutionary sciences. However, there are still many problems, computational, theoretical and even foundational. After ten years of applied Bayesian work in phylogenetics and in evolutionary genetics, I feel the need to step back and re-think the whole thing.
Contributor: Nicolas Lartillot
Started: 24 December 2013
Phylogenetics...
Musings on eukaryote evolution.
Contributor: Marko Prous
Started: 31 December 2013
Current Program Blogs
Phylogenetic Tools for Comparative Biology
This web-log chronicles the development of new tools for phylogenetic analyses in the phytools R package. Unless you are reading a very recent page of the blog, I recommend that you install the latest CRAN version of phytools (or latest beta release) before attempting to replicate any of the analyses of this site. That is because the linked functions may be archived, and very likely have been replaced by newer versions.
Contributor: Liam Revell
Started: 11 December 2010 (at Blogspot)
Osiris Phylogenetics
Accessible and reproducible phylogenetics using the Galaxy workflow system.
Contributor: Todd Oakley
Started: 7 September 2012
Announces the introduction of new tools for phylogenetic analyses in the Beast 2 package, as well as discussing usage issues with the current version, along with tips and tricks.
Contributor: Remco Bouckaert
Started: 18 March 2014
Blogs Currently in Limbo
Dechronization
Dechronization is authored by evolutionary biologists interested in the development and application of methods for estimating phylogeny and making phylogeny-based inferences. The goal of the blog is to provide a forum for discussion of the latest research and methods, while also providing anecdotes, tidbits of natural history, and other related information.
Contributors: Rich Glor, Luke Harmon, Brian Moore, Tom Near, Dan Rabosky, Liam Revell
Started: 29 April 2008 Last post: 6 June 2011
CYPHY - Cybertaxonomy and Phylogenetics
Mostly harmless pointing at things pertaining to cybertaxonomy and phylogenetics.
Contributor: Matt Yoder
Started: 6 November 2007 Last post: 23 February 2011
Phylogeny etc.
Meditations on phylogenetic inference.
Contributor: Bruce Rannala
Started: 6 March 2014 Last post: 6 March 2014
Fish Phylogenetics
I created this new blog to share thoughts on work from my research group on the phylogenetics and evolutionary biology of fishes. This will provide a forum to share insight about the studies that we publish, discuss important scientific aspects of fish diversity, reflect on my experiences teaching ichthyology (the study of fishes), and to comment and review contributions by other researchers.
Contributor: Tom Near
Started: 23 August 2012 Last post: 15 September 2012
Taxonomy Phylogeny
Taxonomies group organisms according to phenotype, while phylogenetic systems groups organisms according to shared evolutionary heritage.
Contributor: ???
Started: 1 January 2008 Last post: 31 December 2010
Phylogenetic Geek
A bag of info on phylogenetics.
Contributor: ???
Started: 5 August 2011 Last post: 16 September 2011
Labels:
computational biology,
Introduction,
News,
Phylogeny
Monday, January 13, 2014
Bioinformatics and inter-disciplinary work
Bioinformaticians are sometimes seen as multi-disciplinary workers (see the previous post on Results of some bioinformatics polls). If so, then the results of a recent study may be of interest:
Kevin M. Kniffin and Andrew S. Hanks (2013) Boundary spanning in academia: antecedents and near-term consequences of academic entrepreneurialism. Cornell Higher Education Research Institute Working Paper 158.
Kniffin and Hanks used data from the Survey of Earned Doctorates (conducted by the National Science Foundation), covering all people who earned PhDs in the U.S.A. between July 1 2009 and June 30 2010 (c. 43,000 people). Of these, 14,000 people (32.5%) reported that their doctoral work spanned academic boundaries.
Two of their main findings are: (i) individuals who complete an interdisciplinary dissertation display short-term income risk, since they tend to earn nearly $1,700 less in the year after graduation; and (ii) the probability that non-citizens pursue interdisciplinary dissertation work is 4.7% higher when compared with U.S. citizens. Sadly, but not unexpectedly, women tend to earn less compared to men upon completion of the doctorate. Perhaps less expectedly, European American individuals also earn less in their first year after graduation than those in other racial groups.
For us, some of the more interesting data are:
| Discipline | % of all Research Doctorates | % Interdisciplinary |
| Agricultural & Life Sciences | 2.3 | 44.5 |
| Biological Sciences | 17.6 | 41.1 |
| Health Sciences | 4.4 | 29.9 |
| Computer Sciences & Mathematics | 7.0 | 22.7 |
In the regression models, adjusting for all other factors, the "influence of interdisciplinary research upon salary" was positive for Computer Sciences & Mathematics as well as for Health Sciences, but was negative for Biological Sciences. However, the "influence of interdisciplinary research upon employment as postdoctoral researcher" was negative for Computer Sciences & Mathematics as well as for Health Sciences, but was positive for Biological Sciences.
Make of this what you will.
Labels:
Bioinformatics,
computational biology,
Polls
Monday, December 9, 2013
Results of some bioinformatics polls
In 2008, Michael Barton conducted a Bioinformatics Career Survey. Since then, various groups have updated some of that information by conducting polls of their own. Below, I have included some of the more recent results, for your edification.
This first one comes from the Bioinformatics Organization, in response to the question: What is your undergraduate degree in? It is interesting to note that more bioinformaticians are biologists by training, rather than computational people.
The next one is actually an ongoing poll at BioCode's Notes, in response to the question: Which are the best programming languages for a bioinformatician? R is an interesting choice as the most useful language, given the more "traditional" use of Perl and Python.
That leads logically to another of the Bioinformatics Organization's questions: Which computer language are you most interested in learning (next) for bioinformatics R&D? I guess that if you already know R, then either Python or Perl is a useful thing to learn next.
Furthermore, the Bioinformatics Organization also asked: Which math / statistics language / application do you most frequently use? The choice of R here is more obvious, given that it is free, which most of the others are not. I wonder what the answer "none of the above" refers to.
Labels:
Bioinformatics,
computational biology,
Polls,
software
Wednesday, November 20, 2013
Bioinformaticians look at bioinformatics
Bioinformatics as a term dates back to the 1970s, usually credited to Paulien Hogeweg, of the Bioinformatics group at Utrecht University, in The Netherlands, although it apparently did not make it into print until 1988 (Paulien Hogeweg. 1988. MIRROR beyond MIRROR, puddles of Life. In: Artificial Life, C. Langton, ed. Addison Wesley, pp. 297-315.).
In the 1990s the field expanded rapidly and became recognized as a discipline of its own, as a subset of computational science. However, Christos A. Ouzounis (2012. Rise and demise of bioinformatics? Promise and progress. PLoS Computational Biology 8: e1002487) has noted a distinct decrease in the use of the term itself, as shown by this graph.
Ouzounis recognizes three (admittedly artificial) periods in the history: Infancy (1996-2001), Adolescence (2002-2006) and Adulthood (2007-2011). Along the way, the practice of bioinformatics has received a lot of criticism. I have noted some of this before, in previous blog posts:
Poor bioinformatics?
Archiving of bioinformatics software
What is perhaps most important is that much of this criticism comes from bioinformaticians themselves, rather than from biologists. Moreover, this criticism does not seem to have had much effect on how bioinformatics is practiced, given the length of time over which it has been made.
For example, Carole Goble (2007. The seven deadly sins of bioinformatics. Keynote talk at the Bioinformatics Open Source Conference Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007) produced this list of what she called "intractable problems in bioinformatics":
1. Parochialism and insularity.
2. Exceptionalism.
3. Autonomy or death!
4. Vanity: pride and narcissism.
5. Monolith megalomania.
6. Scientific method sloth.
7. Instant gratification.
More recently, Manuel Corpas, Segun Fatumo & Reinhard Schneider (2012. How not to be a bioinformatician. Source Code for Biology and Medicine 7: 3) pointed out what they call "a series of disastrous practices in the bioinformatics field", which look very similar:
1. Stay low level at every level.
2. Be open source without being open.
3. Make tools that make no sense to biologists.
4. Do not provide a graphical user interface: command line is always more effective.
5. Make sure the output of your application is unreadable, unparseable and does not comply to any known standards.
6. Be unreachable and isolated.
7. Never maintain your databases, web services or any information that you may provide at any time.
8. Blindly believe in the predictions given, P-values or statistics.
9. Do not ever share your results and do not reuse.
10. Make your algorithm or analysis method irreproducible.
Labels:
Bioinformatics,
computational biology,
software
Wednesday, July 3, 2013
Archiving of bioinformatics software
Some months ago I wrote a blog post about what is perceived to be the rather poor quality of many computer programs in bioinformatics (Poor bioinformatics?), noting that many bioinformaticians aren't taking seriously the need to properly engineer software, with full documentation and standard programming development and versioning.
An obvious follow-up to that post is to consider the archiving of bioinformatics software. If programs are written well, then they should be permanently archived for future reference. A number of bloggers have commented on what is perceived to be the poor current state of affairs here, as well, and I thought that I might draw your attention to a few of the posts.
In many ways, this issue is the computational equivalent of storing biological data, about which I have also written recently (Releasing phylogenetic data). My comments about this were:
- There is a difference between storing / releasing the original data (eg. raw DNA sequences) and the data as analyzed (eg. aligned sequences)
- There are sustainable and accessible archiving facilities for raw data that are almost universally used (eg. GenBank)
- Many people do not release the processed data as analyzed (some of them will if directly asked to do so)
- Many of the people who do release their analyzed data do so on the homepage of one of the authors, which is better than nothing but is rarely sustainable
- There are sustainable and accessible archiving facilities for processed data, such as TreeBASE and Dryad.
The first question to ask is this: what proportion of the bioinformatics software referred to in publications is actually stored in sustainable and accessible archives? A corollary to this question is: what archive facilities are being used? Casey Bergman, at the I Wish You'd Made Me Angry Earlier blog, has attempted to answer both of these questions (Where Do Bioinformaticians Host Their Code?).
In answer to the first question, Casey notes:
of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract.
While many papers may have the code URL in the Methods or Results sections but not the Abstract, this does suggest that repository archiving is not the mode actually employed by bioinformaticians. Instead, they are archiving (if at all) on personal or institutional homepages.
Sadly, the reported rate of decay of URLs ("Error 404: Page not found") indicates that this is rarely a sustainable approach to archiving (eg. see the Google+ comment by Dave Lunt). The relevance of the similar situation with the TreeBase / Dryad type of repository has not gone unnoticed, for example by Hilmar Lapp. These repositories require and enforce standards of data and software archiving, as well as providing persistence.
The answer to the second question, about which repositories, seems to be (see also the data provided by MRR in the comments to Casey's blog post):
- SourceForge has been vastly predominant
- Google Code has a large number of projects, but many of them have never made it to publication
- GitHub has had a rapid recent growth rate, and therefore appears to be becoming the preferred repository.
This leads to the issue of how permanent the archiving is at the major repositories. It turns out that there is a major difference in policies, as noted by Casey Bergman:
SourceForge has a very draconian policy when it come to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) GitHub are too permissive in terms of allowing projects to be deleted.
In a follow-up post (On the Preservation of Published Bioinformatics Code on GitHub), Casey expands on this theme:
A clear trend emerging in the bioinformatics community is to use GitHub as the primary repository of bioinformatics code in published papers. While I am a big fan of Github and I support its widespread adoption, I have concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released, and this can only be done by SourceForge itself, deleting a repository on GitHub takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.
This is an important issue, as exemplified by Christopher Hogue in the comments section of that blog post:
In my case SourceForge preserved the SLRI toolkit my group made in Toronto. As the intellectual property underlying the code was sold to Thompson-Reuters in 2007, my host institution and the dealmakers pressured me to delete the repository. SourceForge policy kept it on the site ... [However,] the aftermath of all this is that, of everything my group did under the guise of open source, only about 30% is preserved and online, and the rest is buried in an intellectual property shoebox at Thompson-Reuters. Host institutions have a lot of power of ownership over your intellectual property. If you win the right to post work into open-source, the GitHub delete policy means that your host institution can over-ride this, and require you to take your code out of circulation. GitHub is great, but for the sake of preservation, SourceForge has the right policy, protecting your decision to go open source from later manipulations by your host institution when it becomes "valuable".
Casey Bergman's response to this issue has been to create the Bioinformatics Archive on GitHub. This is based on the idea used by the journal Computers & Geosciences, in which the journal editor forks the GitHub code into a journal "organization" for all accepted papers — this creates a permanent repository, which is necessary because deleting a private GitHub repository will delete all forks of the repository but deleting a public repository will not do so. So, Casey has been personally forking the code for all publications that come to hand (currently 147 repositories) into the Bioinformatics Archive, thus creating a public repository for all of the relevant GitHub code.
However, this is clearly a stopgap measure. Dave Lunt, at the EvoPhylo blog, has listed three desiderata for a more permanent solution to the issue (How can we ensure the persistence of analysis software?):
- A publisher driven version of the Bioinformatics Archive; journals should have a policy for the hosting of published code in a sustainable and accessible archive in a standardized manner
- Redundancy to ensure persistence in the worst case scenario; archive persistence is the key requirement, and this can only happen in public repositories, with the published URL and/or DOI pointing to a public copy of the code
- The community to initiate actual action; authors need to pressure the publishers to adopt a Dryad-like strategy, in which a large group of ecology and evolutionary biology journals agreed to require the use of a public database for storing the biological data associated with their publications.
At a minimum, a persistent public repository is a snapshot of the code at the time of publication, just as a sequence alignment is a snapshot of the processed data at the time of its publication. This does not preclude further work on the code, and further publications based on the newly modified code, just as new sequence alignments can be created by adding newly acquired sequences. Open-source code can still be newly forked, and there can be user-contributed updates and public issue tracking. Multiple snapshots of code related to different publications through time is not necessarily an issue, but it will need to be handled in some sensible manner.
The main reason for requiring the public archiving of code is to deal with the all-too-common situation when code is no longer being maintained (the scholarship ran out, the grant ended, the author retired, etc). For example, Jamie Cuticchia & Gregg Silk (2004, Bioinformatics needs a software archive, Nature 429: 241) mention the loss of part of the code when the multi-million dollar Genome Database lost funding in 1998. These two authors seem to be the first to have proposed a Bioinformatics Software Archive, "in which an archival copy of bioinformatics software would be maintained in a secure central repository supported by public funding." Personal and institutional homepages are too ephemeral (suffering what is known as URL decay) and too prone to politics to be considered acceptable for the storage of data and software in high-quality science.
Labels:
computational biology,
Database,
software
Monday, April 22, 2013
Personal Type I error rates
As usual at the beginning of the week, this blog presents something in a lighter vein. However, this week we depart from phylogenetic networks in the strict sense, and take a humorous look at the broader statistical life of biologists.
Statistics is a curious thing, which allows scientists to make probability errors of two types: Type I (also known as false positives) and Type II (also known as false negatives). Importantly, these errors can accumulate across the multiple tests in any one experiment, so that we can also recognize an Experimentwise Error Rate: the probability of making at least one Type I error among all of the hypothesis tests in the experiment (bounded above by the sum of the individual error rates).
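As a rough illustration (assuming independent tests, each conducted at a significance level of 0.05), the chance of at least one false positive grows quickly with the number of tests:

```python
# Experimentwise Type I error rate for n independent tests at alpha = 0.05.
# For independent tests this is 1 - (1 - alpha)^n.
alpha = 0.05
for n in (1, 10, 100):
    experimentwise = 1 - (1 - alpha) ** n
    print(f"{n:4d} tests: {experimentwise:.3f}")
# prints 0.050, 0.401 and 0.994 for 1, 10 and 100 tests
```

So a biologist who runs a hundred tests in a career has all but guaranteed at least one false positive, which is the point of the article linked below.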
However, what is not widely recognized is that these errors apply in life, as well. In particular, biologists accumulate statistical errors throughout their lives, so that we all have a Personal Lifetime Error Rate.
I once wrote a tongue-in-cheek article about the accumulation of Type I errors throughout the working life of a biological scientist, and the consequences for the experiments conducted by that scientist. This article appeared in 1991 in the Bulletin of the Ecological Society of Australia 21(3): 49–53, which means that I used an ecologist as my specific example of a biologist. The principle applies to all biologists, however.
Since this issue of the Bulletin is not online, presumably no-one has read this article since 1991, although it has recently been referenced on the web (see the sixth comment on this blog post).** You, too, should read it, and so I have linked to a PDF copy [1.7 MB] of the paper:
Personal Type I error rates in the ecological sciences
** Note that I am alternately referred to as an "inveterate mischief maker" and "a very wise man"!
Labels:
computational biology,
inference
Monday, December 10, 2012
Data enrichment in phylogenetics
Since this is post #100 in this blog, I thought that we might celebrate with something humorous. Since evolutionists often have a tough time, this post is about how to get more out of your phylogenetic analyses than you previously thought was possible.
In 1957, Henry R. Lewis published an article about The Data-Enrichment Method (Operations Research 5: 551-554). This method was intended "to improve the quality of inferences drawn from a set of experimentally obtained data ... without recourse to the expense and trouble of increasing the size of the sample data." This distinguishes the method from similarly named techniques, such as the likelihood method of Data Augmentation, which require actual data.
Clearly, such a method is of great interest to all empirical scientists, especially those without much grant money. Indeed, The Data Enrichment Method was immediately expanded by other interested parties (see Operations Research 5: 858-859, and 6: 136), who pointed out that it can be applied iteratively to great effect, and that it can be used to support an hypothesis and also its opposite.
The important requirements for the Data Enrichment Method are: (i) a nested set of data patterns, and (ii) an a priori expectation about what should be the answer to the experimental question. All scientists should have the latter, of course, since they are supposed to be testing the expectation by calling it an "hypothesis".
Most interestingly for us, phylogeneticists will often be able to meet requirement (i), as well, because their data often form a nested set, representing the shared derived character states from which a phylogenetic tree will be derived. I therefore once wrote an article examining the application of The Data Enrichment Method to phylogenetics, where it does indeed work very well. You do need at least some data to start with, and so it does not free you entirely from the inconvenience and embarrassment of uncontrollable empirical results.
This article appeared in 1992 in the Australian Systematic Botany Society Newsletter 71: 2–5. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy of the paper:
A new method for increasing the robustness of cladistic analyses
After reading it, you might like to think about how to apply this method to phylogenetic networks. The mixture of horizontal gene flow with vertical descent breaks the simple nested data pattern of a phylogenetic tree, which complicates the application of data enrichment to networks.
Labels:
computational biology,
inference,
Philosophy
Thursday, October 11, 2012
An open question about computational complexity
This is a guest blog post by:
Jesper Jansson
Laboratory of Mathematical Bioinformatics, Kyoto University, Japan
Here is an open problem for people interested in computational complexity issues related to phylogenetic networks.
In a recent paper we introduced a parameter called the "minimum spread" that measures a kind of structural complexity of phylogenetic networks:
T. Asano, J. Jansson, K. Sadakane, R. Uehara, G. Valiente (2012) Faster computation of the Robinson-Foulds distance between phylogenetic networks. Information Sciences 197: 77-90.
The definition is as follows:
- The "minimum spread" of a rooted phylogenetic network N is the smallest integer x such that the leaves of N can be relabeled by distinct positive integers in a way that at every node u in N, the set of all leaf descendants of u forms at most x consecutive intervals.
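To make the definition concrete, here is a small Python sketch (illustrative only, and not code from the paper): given the children of each node and a leaf labeling, it counts, for each node, the maximal runs of consecutive integers among the node's leaf-descendant labels; the spread of that labeling is the maximum such count over all nodes.

```python
def count_intervals(labels):
    """Number of maximal runs of consecutive integers in a set of labels."""
    s = sorted(labels)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)

def spread(children, labeling):
    """Spread of one particular leaf labeling of a rooted network.
    children: dict mapping each internal node to its list of children
              (leaves do not appear as keys).
    labeling: dict mapping each leaf to a distinct positive integer."""
    memo = {}
    def leaves(u):
        if u not in children:                      # u is a leaf
            return {labeling[u]}
        if u not in memo:
            memo[u] = set().union(*(leaves(c) for c in children[u]))
        return memo[u]
    return max(count_intervals(leaves(u)) for u in children)
```

For example, a small tree labeled left-to-right gives a spread of 1; the minimum spread of the network is then the minimum of this quantity over all possible labelings, which is the hard part to compute.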
Any level-k phylogenetic network has minimum spread at most k+1 (see our paper for the proof). Moreover, any "leaf-outerplanar network" has minimum spread 1, where a "leaf-outerplanar network" is a network that admits a non-crossing layout in the plane with the root (if any) and all leaves lying on the outer face. Today's existing software typically outputs such networks. So, for certain classes of phylogenetic networks, we automatically get a nice upper bound on the minimum spread.
Having a small minimum spread means that the phylogenetic network is "tree-like" in the sense that its cluster collection has a space-efficient representation. But are compact representations of the clusters in a network useful?
Well, they can be employed to compare phylogenetic networks quickly, for example when using the Robinson-Foulds distance to measure the dissimilarity between two phylogenetic networks. There may be other applications, too. On the other hand, if a phylogenetic network is "chaotic" and non-tree-like then the minimum spread will not be a helpful parameter when looking for a compact encoding of its branching information.
At this point in time, not much is known about how to compute the minimum spread efficiently. As an example, consider the class of level-k networks for any fixed k > 1. According to Lemma 6 in our paper, we can always find a leaf relabeling function that yields spread at most k+1 in linear time, but that might not be the minimum possible for some particular level-k network.
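Since the k+1 bound from Lemma 6 need not be tight, the only fully general fallback at present is exhaustive search over labelings. A brute-force sketch follows (exponential in the number of leaves, so usable on toy inputs only; this is my illustration, not an algorithm from the paper):

```python
from itertools import permutations

def min_spread(children):
    """Brute-force minimum spread: try every leaf labeling.
    Exponential in the number of leaves -- a sanity check for
    tiny networks only, consistent with the NP-hardness result."""
    leaves = [v for v, cs in children.items() if not cs]
    def leaf_labels(v, labels):
        if not children[v]:
            return {labels[v]}
        return set().union(*(leaf_labels(c, labels) for c in children[v]))
    def runs(values):
        xs = sorted(values)
        return 1 + sum(1 for a, b in zip(xs, xs[1:]) if b - a > 1)
    best = len(leaves)
    for perm in permutations(range(1, len(leaves) + 1)):
        labels = dict(zip(leaves, perm))
        best = min(best, max(runs(leaf_labels(v, labels))
                             for v in children))
    return best

net = {"r": ["a", "b"], "a": ["x", "y"], "b": ["y", "z"],
       "x": [], "y": [], "z": []}
print(min_spread(net))  # 1
```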
As observed by Sylvain Guillemot and Philippe Gambette (independently of each other), a related result for the k-Consecutive Ones Problem implies that computing the minimum spread of an arbitrary phylogenetic network is NP-hard in the general case, although we can expect it to be easier when restricted to special cases:
P. W. Goldberg, M. C. Golumbic, H. Kaplan, R. Shamir (1995) Four strikes against physical mapping of DNA. Journal of Computational Biology 2: 139-152.
In summary, the following is still open:
- What is the computational complexity of computing the minimum spread when restricted to particular classes of phylogenetic networks?
Saturday, September 8, 2012
Poor bioinformatics?
There have been a number of recent posts in the blogosphere about what is perceived to be the rather poor quality of many computer programs in bioinformatics. Basically, many bioinformaticians are not taking seriously the need to properly engineer software, with full documentation and standard development and versioning practices.
I thought that I might draw your attention to a few of the posts here, for those of you who write code. Most of the posts have a long series of comments, which are themselves worth reading, along with the original post.
At the Byte Size Biology blog, Iddo Friedberg discusses the nature of disposable programs in research, which are written for one specific purpose and then effectively thrown away:
Can we make accountable research software?
Such programs are "not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories." I have much sympathy for this point of view, since all of my own programs are of this throw-away sort.
However, Deepak Singh, at the Business|Bytes|Genes|Molecules blog, fails to see much point to this sort of programming:
Research code
He argues that disposable code creates a "technical debt" from which the programmer will not recover.
Titus Brown, at the Living in an Ivory Basement blog, extends the discussion by considering the consequences of publishing (or not publishing) this sort of code:
Anecdotal science
He considers that failure to properly document and release computer code makes the work anecdotal bioinformatics rather than computational science. He laments the pressure to publish code that is not yet ready for prime-time, and the fact that computational work is treated as secondary to the experimental work. Having myself encountered this latter attitude from experimental biologists (the experiment gets two pages of description and the data analysis gets two lines), I entirely agree. Titus concludes with this telling comment: "I would never recommend a bioinformatics analysis position to anyone — it leads to computational science driven by biologists, which is often something we call 'bad science'." Indeed, indeed.
Back at the Business|Bytes|Genes|Molecules blog, Deepak Singh also agrees that "a lot of computational science, at least in the life sciences, is very anecdotal and suffers from a lack of computational rigor, and there is an opaqueness that makes science difficult to reproduce":
Titus has a point
This leads to Iddo Friedberg's post (the Byte Size Biology blog) that mentions this concept:
The Bioinformatics Testing Consortium
This group intends to act as testers for bioinformatics software, providing a means to validate the quality of the code. This is a good, if somewhat ambitious, idea.
Finally, Dave Lunt, at the EvoPhylo blog, takes this to the next step, by considering the direct effect on the reproducibility of scientific research:
Reproducible Research in Phylogenetics
He notes that bioinformatics workflows are often complex, using pipelines to tie together a wide range of programs. This makes the data analysis difficult to reproduce if it needs to be done manually. Hence, he champions "pipelines, workflows and/or script-based automation", with the code made available as part of the Methods section of publications.
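One lightweight form of such automation is to express the whole analysis as a single script in which each step is a function and every run logs the steps and parameters used, so the Methods section can point at one runnable file. A minimal sketch (the step names and parameters here are hypothetical placeholders, not any particular published pipeline):

```python
# Each analysis step is a plain function; placeholders stand in
# for real tools such as an aligner or a network-building program.

def align(seqs, gap_penalty=-2):
    return sorted(seqs)                           # placeholder step

def build_network(alignment, level=1):
    return {"leaves": alignment, "level": level}  # placeholder step

def run_pipeline(seqs, log):
    """Run all steps in order, recording each step and its parameters."""
    steps = [("align", align, {"gap_penalty": -2}),
             ("build_network", build_network, {"level": 1})]
    data = seqs
    for name, fn, params in steps:
        data = fn(data, **params)
        log.append((name, params))
    return data

log = []
result = run_pipeline(["taxonB", "taxonA"], log)
print(log)  # the full, reproducible record of what ran
```

The point is not the placeholder steps but the habit: the log (or the script itself) is exactly the artefact that could be published alongside the paper.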
Labels:
computational biology,
software,
Validation
Wednesday, March 7, 2012
RECOMB-AB
Last week my attention was drawn to the forthcoming conference RECOMB-AB 2012: First RECOMB Satellite Conference on Open Problems in Algorithmic Biology:
“RECOMB-AB brings together leading researchers in the mathematical, computational, and life sciences to discuss interesting, challenging, and well-formulated open problems in algorithmic biology.”
As someone working in the field of “algorithmic biology” (which, I guess, could be defined as the application of techniques from computer science, discrete mathematics, combinatorial optimization and operations research to computational biology problems) I was, predictably, immediately enthusiastic about the conference. However, what really caught my attention was the following paragraph:
“The discussion panels at RECOMB-AB will also address the worrisome proliferation of ill-formulated computational problems in bioinformatics. While some biological problems can be translated into well-formulated computational problems, others defy all attempts to bridge biology and computing. This may result in computational biology papers that lack a formulation of a computational problem they are trying to solve. While some such papers may represent valuable biological contributions (despite lacking a well-defined computational problem), others may represent computational 'pseudoscience.' RECOMB-AB will address the difficult question of how to evaluate computational papers that lack a computational problem formulation.”
Calls-for-participation rarely strike such a negative tone. However, in this case I think the conference organizers have highlighted an extremely important point. Problems arising in computational biology are inherently complex, and this entails a bewildering number of parameters and degrees of freedom in the underlying models. Furthermore, it is commonplace for computational biology articles to utilize a large number of intermediate algorithms and software packages to perform auxiliary processing, and this further compounds the number of unknowns (and the inaccuracies) in the system.
All this is, to a certain extent, inevitable. However, this complexity sometimes seems to have become an end in itself. This would be harmless except for the fact that scientists subsequently attempt to draw biological conclusions from this mass of data. Rarely is the question asked: is there actually any “biological signal” left amongst all those numbers? Would we have obtained similar results if we had just fed random noise into the system?
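One generic way to pose the random-noise question is a permutation baseline: rerun the same analysis on shuffled copies of the data, in which any biological signal has been destroyed, and see how often the noise does as well as the real data. A minimal sketch with a toy scoring function (everything here is a hypothetical stand-in for a real pipeline):

```python
import random

def noise_baseline(score, data, trials=100, seed=0):
    """Score the real data, then score shuffled copies of it, and
    report the fraction of noise runs that do at least as well."""
    rng = random.Random(seed)
    real = score(data)
    hits = 0
    for _ in range(trials):
        copy = list(data)
        rng.shuffle(copy)            # destroy any signal
        if score(copy) >= real:
            hits += 1
    return real, hits / trials

# Toy score: how many adjacent pairs are in order (a stand-in
# for whatever "fit" a real analysis optimizes).
score = lambda xs: sum(a <= b for a, b in zip(xs, xs[1:]))
real, p = noise_baseline(score, list(range(20)))
print(real, p)  # noise almost never matches the real-data score
```

If p is not small, the "result" is indistinguishable from what random noise produces, and no biological conclusion should be drawn from it.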
The fact that these questions are not posed is directly linked to the lack of a clear and explicitly articulated optimization criterion. In other words: just what are we trying to optimize exactly? What makes one solution “better” than another? What, at the end of the day, is the question that we are trying to answer? This is exactly what RECOMB-AB is getting at with the sentence, “This may result in computational biology papers that lack a formulation of a computational problem they are trying to solve”. The articulation might be slightly formal, but the point they raise is nevertheless fundamental.
It remains to be seen what kind of a role phylogenetic networks will play at RECOMB-AB, if any. For sure, the field of phylogenetic networks continues to generate a vast number of fascinating open algorithmic problems. However, are the underlying biological models precise enough to allow us to say that we are actually producing biologically meaningful output? Overall, I think the answer is still no. However, I think that there is reason for optimism. The field is young and evolving, and it is likely that both biologists and algorithmic scientists will have a significant role in shaping its future. Hopefully this interplay will allow us to move forward on the biological front without losing sight of the need for explicit optimization criteria.