Inferring a Tree of Blobs with ECToBlob

library(MSCquartets)
#> Loading required package: ape
#> Loading required package: phangorn

Why use ECToBlob?

ECToBlob infers a tree of blobs under the network multispecies coalescent (NMSC) model. The resulting unrooted tree represents only the cut edges of the underlying species network, that is, edges whose removal disconnects the network.

More complex structures, such as “blobs,” are collapsed into multifurcations (polytomies). In this way, the tree highlights where evolutionary relationships are tree-like and where they are not.

This makes the tree of blobs useful for network inference. It isolates regions where reticulation has occurred. Researchers can then investigate each blob separately, possibly focusing on smaller subsets of taxa. While current methods may not be able to resolve complex blobs, ECToBlob is statistically consistent under the NMSC model, regardless of the unknown internal blob structure.

Preparing the input data

ECToBlob requires two inputs:

(1) Gene trees (genedata)

A collection of gene trees on a common set of taxa. These are typically obtained by:

  1. Aligning sequences for each locus
  2. Inferring gene trees using phylogenetic software (e.g., IQ-TREE, RAxML)

Although gene trees are themselves inferred, ECToBlob treats them as observed data. The root and branch lengths (if present) are ignored.

Example:

# Read gene trees and compute quartet counts
gts <- read.tree(file = "genetreefile")
tableLeopardusLescroart <- quartetTable(gts)

Notes on gene trees

(2) A tree (tree)

A tree assumed to be a resolution of the true tree of blobs. Such a tree can be produced by ASTRAL or TREE-QMC since, for sufficiently large datasets, trees inferred by these methods have been shown to be resolutions of the true tree of blobs.

MSCquartets contains the ASTRAL tree from the gene trees of Lescroart et al. (2023):

AstralLeopard_tree <-   read.tree(text = ASTRALtreeLeopardusLescroart)

Notes on the input tree

If the user wants to explore the package but has not yet computed a suitable reference tree (for e.g., from ASTRAL or TOB-QMC), one can construct a tree for exploratory purposes using a quartet distance implemented in the MSCquartets package:

gts <- read.tree(file = "genetreefile")
alternative_tree <- quartetDistTree(gtrees)

Note that, while there is no guarantee that the tree obtained from the function quartetDistTree is a resolution of the tree of blobs, in practice, it often gives the same output as ASTRAL. We therefore suggest that the quartetDistTree tree should only be used for exploratory purposes, and not for final analyses.

Initial analysis

ECToBlob begins by counting quartet topologies across gene trees and applying hypothesis tests to each quartet. These tests evaluate fit to:

A basic analysis is run by:

out <- ECToBlob(tableLeopardusLescroart, AstralLeopard_tree, alpha = 0.05)
#> Applying hypothesis tests for T1 model to 1820 quartets.
#> Applying hypothesis test for star tree model to 1820 quartets.
#> Contracting edges with star test at level beta = 0.8

pTable <- out$pTable

Output plots

The function produces several plots:

  1. The input tree → A plot of the input tree.
  2. The tree after collapsing internal edges deemed unresolved by the combined star test (p_star) → A plot of the input tree after collapsing those edges for which the data does not support a resolution. (This tree may be the same as the input tree.)
  3. Trees with progressively collapsed internal edges based on combined T1 test edge p-values → These trees are obtained by contracting edges in order from smallest to largest combined p-value. In case of ties, multiple edges will be contracted simultaneously.
  4. A final star tree → This tree always results from collapsing all internal edges.
  5. A plot of the number of edges contracted vs. \(-\log_{10}(p\text{-value})\), where the p-value is for the combined T1 test values for all edges remaining in the tree → This plot shows the relationship between the p-value for a null hypothesis of the tree’s resolution (in the same order they appeared in the previous plots) and the number of edges contracted to obtain that tree. The y-axis is plotted on a logarithmic scale, and y-values typically decrease as more edges are contracted. The user may want to focus on trees where the p-value shows a sharp drop.

Note on computation

For large datasets (many taxa or gene trees), computing the quartet table is the most time-consuming step. Saving pTable for reuse is recommended.

Combining quartet p-values

Each edge in the input tree is assigned a p-value by combining the p-values of quartets that support it. The set of quartet p-values combined for a given edge can be determined in three ways:

  1. "bi": bipartition quartets
  2. "quad": quadripartition quartets
  3. "mul": multipartition quartets (updated during contraction)

Multiple testing correction

Once the quartet p-values defining an edge have been determined, they can be combined in four different ways to account for multiple testing:

  1. Bon (Bonferroni)
  2. Cauchy (Cauchy)
  3. BBC (Bonferroni + Cauchy, combined via Bonferroni)
  4. CBC (Bonferroni + Cauchy, combined via Cauchy)

Examples

Here we show examples of different choices of both multiple testing correction and quartet p-value selection.

out_1 <- ECToBlob(pTable, AstralLeopard_tree, alpha=0.05, qType="mul",  testCorrection="Bon",    plot=0)
#> Not recomputing T1 p-values for displayed trees.
#> Not recomputing p_star values.
#> Contracting edges with star test at level beta = 0.8
out_2 <- ECToBlob(pTable, AstralLeopard_tree, alpha=0.05, qType="quad", testCorrection="Cauchy", plot=0)
#> Not recomputing T1 p-values for displayed trees.
#> Not recomputing p_star values.
#> Contracting edges with star test at level beta = 0.8
out_3 <- ECToBlob(pTable, AstralLeopard_tree, alpha=0.05, qType="bi",   testCorrection="BBC",    plot=0)
#> Not recomputing T1 p-values for displayed trees.
#> Not recomputing p_star values.
#> Contracting edges with star test at level beta = 0.8
out_4 <- ECToBlob(pTable, AstralLeopard_tree, alpha=0.05, qType="mul",  testCorrection="CBC",    plot=0)
#> Not recomputing T1 p-values for displayed trees.
#> Not recomputing p_star values.
#> Contracting edges with star test at level beta = 0.8

These choices may yield different results and should be explored.

Selecting a tree using the \(\alpha\) threshold

The parameter \(\alpha\) (alpha) determines which tree in the sequence will be selected. In contrast, the choice of \(\beta\) (beta, test level for collapsing edges showing no resolution) influences which edges will initially be contracted as not consistent with any resolution. Succinctly,

To select a tree based on \(\alpha\), one has two options, though often they will agree:

  1. indexEarly: first tree where the tree p-value exceeds \(\alpha\)
  2. indexLate: first tree where the tree p-value and all later tree p-values exceeds \(\alpha\)

One can extract the trees associated with these indices from the output as follows:

early_tree <- out_2$treeList[[out_2$indexEarly]]$tree
plot(read.tree(text = early_tree))

late_tree <- out_2$treeList[[out_2$indexLate]]$tree
plot(read.tree(text = late_tree))

Reporting recommendations

Any reported ECToBlob analysis should include:

Exploring multiple parameter settings is strongly recommended.

References