After All These Years, I Finally Have a Story to Tell
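This year marks the tenth anniversary of the journal Methods in Ecology and Evolution. As part of the celebration, one article was picked from each year (that is, each volume) as a representative work, and its authors were invited to tell the story behind it. So, after all these years in research, I finally got a chance to tell a story.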
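In 2014 I switched supervisors and joined a lab that works on infectious diseases. I had no background in the area at all. The big boss said to me: you sit in the office all day waiting for other people to feed you data, and you have no idea how hard that data is to come by (field work really is tough), so you will start from the most basic work. So I did a round of it, and found that my labmates were doing all sorts of things by hand that could be at least partly automated. I wrote a package for internal use in the lab that simplified many operations. That won the big boss's approval, and I became the only student in the lab with the same 27-inch iMac as the big boss.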
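I started writing ggtree in December 2014, and within that one month I had written every feature it looked like it ought to have, so it already looked the part. My plan at the time was to get a paper out quickly and fire the first shot; I had been holding back for too long. However, my direct supervisor said it would be even better if I could parse the output of software such as codeml and BEAST, so that ggtree could display the analysis results of that software directly on the tree. So over the Christmas break, back in Guangzhou, I took my laptop to a Pacific Coffee and wrote all kinds of parsers (my wife kept me company in the café; yes, I am showing off a little here). By the time the New Year holiday was over, this big chunk of functionality was done too. Although the total time I have spent developing ggtree is very long, what came afterwards was mostly patching, extending, and polishing. I had a good idea, and in that first month, by concentrating my firepower, I built all of the core infrastructure.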
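Then, still set on getting a paper out quickly, I wrote the manuscript. But my direct supervisor went off to have a baby, and then I went off to have a baby too, and so it dragged on and on. It was finally submitted in 2016 and published in 2017. If I had not worked fast, I might not have seen the paper in print before graduating. Back then my wife often said that, for me, getting a paper out was harder than having a baby.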
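In 2017, because of this paper, I was invited to give a talk at the Field Museum, paid for by the Tree of Life project. Spending the American people's money, I made no effort to save them anything on meals, and ended up exceeding their reimbursement limit. The best part was that they booked me a suite with a river view, which was simply wonderful.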
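It is a small thing, but I kept at it until, within one small area, it became the best tool (or perhaps one of the best). Later, after I started my independent career at Southern Medical University, I still had the same drive to work overtime, grind away, and rush papers out (though these days, fretting while I watch my students gets me nowhere). I published a paper in MBE in 2018 and another paper in 2020. When something keeps producing output year after year, people take notice. In 2019 I received an invitation from William Pearson to write an article introducing ggtree for Current Protocols in Bioinformatics. This is a journal with no submission channel that publishes by invitation only, so I take it as enormous recognition. William Pearson is also a well-known professor in the field: the FASTA format that everyone uses is also called Pearson format, because it is the format he used in his FASTA software, which could hold its own against BLAST. Had NCBI not been so dominant, it is not clear which would have won out.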
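Then, this year, I received the invitation for the MEE tenth-anniversary celebration to tell the story behind the paper. Before I knew it, I had been writing ggtree for six years. I am quite happy that six years of effort has been seen and recognized.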
https://methodsblog.com/2020/11/19/ggtree-tree-visualization/
The team behind the ggtree paper works in the field of emerging infectious diseases. In particular, the corresponding author, Tommy Lam (TL), has been advocating the integration of different biological and epidemiological information in studies of fast-evolving viral pathogens. The lead author, Guangchuang Yu (GY), joined The University of Hong Kong to pursue his doctoral degree under the supervision of TL and Yi Guan (a co-author of the paper), as he was very curious about the application of genomics and phylogenetics to the study of emerging infectious diseases.
At one point, while TL was guiding another student in a project on swine influenza evolution, GY was asked to help modify the Newick tree string to incorporate some additional information (such as amino acid substitutions and the number of glycosylation sites) in the internal node labels of the phylogeny for visualization and comparative analysis. He wrote an R script to do that, but he soon realized that most phylogenetic tree viewing software could display only one type of node label at a time. Therefore, to produce tree graphs displaying different types of branch- or node-associated information side-by-side, such as bootstrap values and substitutions, people mostly relied on post-processing with image-editing software. Such manual edits often took hours and had to be redone whenever the tree was updated with new sequences or other data. At the time, GY tried to find a programming library that could flexibly display different variables directly on the tree figure for visualization and publication-ready graphics. However, none could do so in a way that was robust, efficient, and generalizable to different data. This was the motivation behind the development of the ggtree package in R.
There are several ggplot2 extensions that are able to draw tree diagrams, including ggphylo, phyloseq, and OutbreakTools. However, the most valuable part of the ggplot2 syntax, adding layers of annotations, is not supported in these packages. GY thought long and hard about the design of a user interface that would fully embrace ggplot2's grammar of graphics. He first extended the ggplot() function to support phylogenetic tree objects and added the geom_tree() layer to calculate the positions of the line segments for drawing a tree. Then a set of geometric layers was developed to allow annotation layers to be added to the tree. Another important issue was how to link external data to the tree structure efficiently. For this, GY designed the new ‘%<+%’ operator, which attaches external data to the tree so that all the variables stored in the data are visible to the geom_tree() function and can be used directly to annotate the tree. The operator thus allows users to integrate external datasets and display them on different annotation layers. TL also proposed implementing parser functions in ggtree to import tree data and other evolutionary inference results from external software packages such as BEAST, HyPhy, and PAML, so that different analysis results can be displayed and analyzed on the tree collectively in R. Although it took a long time to implement these and other important functionalities in ggtree, GY deeply enjoyed the process and learned much from it. The effort was soon rewarded with the publication of the ggtree paper in Methods in Ecology and Evolution, now among the most cited publications in the journal.
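The idea of attaching external data to a tree can be illustrated outside of R. The following is a minimal sketch in plain Python, not ggtree code and not a description of how the ‘%<+%’ operator is implemented. It assumes the tree has already been flattened into one row per node, and it left-joins an external table onto those rows by tip label. The function name `attach`, the column names, and the toy data are all hypothetical.

```python
def attach(tree_rows, external, key="label"):
    """Left-join external records onto tree rows by label (hypothetical helper).

    tree_rows: list of dicts, one per node of the tree.
    external:  list of dicts, each carrying the join key plus extra variables.
    Returns new row dicts; nodes without external data get None for each
    extra variable. The input rows are not modified.
    """
    lookup = {rec[key]: rec for rec in external}
    extra_cols = sorted({col for rec in external for col in rec if col != key})
    joined = []
    for row in tree_rows:
        rec = lookup.get(row[key], {})
        merged = dict(row)  # copy so the original tree rows stay untouched
        for col in extra_cols:
            merged[col] = rec.get(col)
        joined.append(merged)
    return joined


# Toy tree ((B,C),A): nodes 1-3 are tips, 4 is the root, 5 is an internal node.
tree_rows = [
    {"node": 1, "parent": 4, "label": "A"},
    {"node": 2, "parent": 5, "label": "B"},
    {"node": 3, "parent": 5, "label": "C"},
    {"node": 4, "parent": 4, "label": None},
    {"node": 5, "parent": 4, "label": None},
]

# External data in arbitrary order, with no record for tip B.
external = [
    {"label": "C", "host": "swine"},
    {"label": "A", "host": "human"},
]

annotated = attach(tree_rows, external)
```

After the join, every node row carries a `host` field, so any later drawing step could map that variable to colour or shape without consulting the external table again. That is the sense in which the attached variables become "visible" to the annotation layers.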
The ggtree package has remained actively maintained and extended since the MEE paper was published. As more functions were implemented, maintenance became more difficult. Therefore, GY started to split the ggtree package into multiple packages, including tidytree for manipulating trees with data using the tidy interface; treeio for importing and exporting trees with richly annotated data (recently published); and ggimage for overlaying silhouette images. This way, the ggtree package can focus on tree visualization and annotation. The ggtree package provides the gheatmap() function to plot a tree with a heatmap. There are also other tools that support visualizing a tree with a barplot or dotplot. However, there was no general tool for aligning a tree with a graph, such as a histogram of the species data. The two graphs cannot simply be put side-by-side, since the data used to produce the graph must be re-ordered to match the taxa order of the phylogenetic tree. Only in this way can the data visualized in the graph be interpreted in an evolutionary context.
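The re-ordering requirement can be made concrete with a small sketch in plain Python. This is an illustration of the idea only, not ggtree code. It assumes a simple Newick string with unquoted labels and no comments, and it uses the order in which tips appear in that string as a stand-in for the plotted taxa order. The helper names `tip_order` and `reorder_by_tips` are hypothetical.

```python
import re


def tip_order(newick):
    """Return tip labels in the order they appear in a simple Newick string.

    Assumes unquoted labels and no comments. A tip label directly follows
    '(' or ','; internal node labels follow ')' and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def reorder_by_tips(newick, values):
    """Align a {label: value} mapping to the tip order of the tree.

    Tips without a value are kept and paired with None, so every row of the
    side panel still lines up with a tip of the tree.
    """
    return [(tip, values.get(tip)) for tip in tip_order(newick)]


newick = "((B:1,C:1):1,A:2);"
values = {"A": 3, "B": 7}  # arbitrary input order, no value for C

aligned = reorder_by_tips(newick, values)
# aligned == [("B", 7), ("C", None), ("A", 3)]
```

Plotting `aligned` row by row next to the tree keeps each value beside its own tip. Plotting `values` in its original order would pair A's value with tip B.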
Functions for aligning a phylogeny with different graphs for varied purposes had been continuously developed, but no one had proposed a general solution with a high-level abstraction. The central question is: what is the objective when one incorporates data into a phylogenetic tree? Eventually, GY figured out that most of what we are trying to do can be divided into two categories: one for mapping data to the tree structure, either to display the data directly or to use the data as information to create a tree visualization, and another for presenting data and a phylogeny side-by-side, employing the tree structure to help interpret the graph in an evolutionary context. For the first category of visualization, one can use ggtree’s ‘%<+%’ operator, introduced in an earlier version of the package. For side-by-side visualization, the geom_facet() layer was developed as a general solution. It automatically re-orders the input data according to the tree structure and allows the data to be plotted in a separate panel using a user-provided geometric layer. These two methods were published in a 2018 Molecular Biology and Evolution paper. The limitation of the geom_facet() layer is that it currently supports only a rectangular layout. To support presenting data in outer rings for a circular layout, the ggtreeExtra package was developed. The output of ggtree is a serialized data object that maintains the input tree, the associated data, and the visualization directives, making it an ideal data structure for publishing phylogenetic trees with almost all the information incorporated.
ggtree is a programming library, so documentation to guide users is very important. An online book, entitled “Data integration, manipulation, and visualization of phylogenetic trees”, is being drafted to document almost all aspects of the ggtree package in detail. As ggtree gained considerable recognition in the scientific community, GY was invited by William Pearson to publish a protocol paper in the journal Current Protocols in Bioinformatics.