High-resolution and Deep Phylogenetic Reconstruction of Ancestral States from Large Transcriptomic Data Sets

Phylogenetics is an important area of evolutionary biology that helps to understand the origin and divergence of genes, genomes and species. Building meaningful phylogenetic trees is needed for the accurate reconstruction of the past. To achieve a correct phylogenetic understanding of genes or proteins, reliable and robust methods are needed to construct meaningful trees. With the rapidly increasing availability of genome and transcriptome sequencing data, there is a need for efficient and accurate methodologies for ancestral state reconstruction. Currently available methods are mostly specific for certain gene families, and require substantial adaptation for their application to other gene families. Hence, a generalized framework is essential to utilize large transcriptome resources such as OneKP and MMETSP. Here, we have developed a flexible yet efficient method, based on core strengths such as emphasis on being inclusive in homolog selection, and defining orthologs based on multi-layered inferences. We illustrate how specific steps can be modified to fit the needs of any protein family under consideration. We also demonstrate the success of this protocol by studying and testing the orthologs in various gene families. Taken together, we present a protocol for reconstructing the ancestral states of various domains and proteins across multiple kingdoms of eukaryotes, using thousands of transcriptomes.

the majority of the land plants and algal groups, whereas MMETSP covers majority of the SAR group and other (unidentified) phyla in Chromista.
From their inception, diverse approaches have been developed and applied to these transcriptomes and estimate the ancestral states of various genes across multiple classes, families and even phyla (Li et al., 2014;Wickett et al., 2014;Yerramsetty et al., 2016). The majority of these methods focus on one gene family, and need substantial modifications in methodology to apply them to other gene families.
Moreover, the methods used are neither inclusive nor robust in terms of multi-layered inferences. The orthologous inferences are based on only one evidence, Best Bi-directional Hit or protein domains or simple phylogenies based on few genomes. To overcome these disadvantages, we developed a unified framework to build high-resolution phylogenies that utilize the rich OneKP and MMETSP transcriptome resources. This new method is not only inclusive, but also utilizes multi-layered orthology to interpret phylogenies with high confidence, leading to the identification of new (sub-)classes of orthologs.

Overview of the protocol
The current protocol is developed to reconstruct ancestral states and high-resolution phylogenetic trees of various gene families using transcriptomes and/or proteomes. Ancestral state represents the minimal gene complement at each evolutionary node, where species-specific gene duplications and (or) losses would have modified the gene complement in individual species. Hence, selecting the correct, orthologous as well as diverse, sequences is a crucial step in such a deep phylogenetic tree construction.
This protocol is built on three core strengths: (1) Inclusive: Include more sequences at the start with liberal parameters, and remove sequences as one goes through various steps in the pipeline, resulting in a high-quality logical sequence set for phylogenetic tree construction.

Linux machine
Computer set-up: Majority of the mentioned programs in Software section run only on Linux environment; hence it is recommended to perform the analysis on a Linux machine with access to the BASH shell (terminal). The average time needed to perform the analysis for a gene family is 1-1.5 days on a generic Linux workstation with 64 GB RAM and 8-core processor setup. The disk space needed for this analysis is less than 1 GB.

Procedure
Commands used, along with the parameters used in each step of the protocol, with step numbers corresponding to Figure 1 are given below. Before starting the protocol, we first created a BLAST database for each transcriptome or proteome. This was carried out only once for each transcriptome or proteome using the makeblastdb function, where '-in' takes a FASTA file of the transcriptome, or the proteome and '-dbtype' is the database type with nucl and prot for transcriptomes and proteomes, respectively.   A. Homolog identification 1. To perform a BLAST search to the respective database(s), we created a query protein sequence file (in FASTA format), with sequences from (relatively) well-annotated genomes and from a diverse range of species, if present, across multiple kingdoms. A list of various species used along with a link to the sequence data resource is available in Appendix-1.
2. Using the query sequence file (-query) perform the BLAST search with tblastn and blastp modules, against transcriptome and proteome databases (-db), respectively. When the E-value cut-off (-evalue) is less than 0.01, save the output (-out) in a tab-delimited text file indicated with -outfmt 6 . The remainder of the parameters are kept at default settings. 6 www.bio-protocol.org/e3566 $ tblastn -query filename.fa -db transcriptome.fasta -out output.blast -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq' $ blastp -query filename.fa -db proteome.fasta -out output.blastevalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq' 3. The BLAST output contains all the scoring information about the subject (transcript/protein) sequence that has a similarity to the corresponding query sequence. To retrieve the subject sequence identifiers from the BLAST output, we have used the 'cut', 'sort' and 'uniq' functions of a Linux BASH shell (terminal). 'cut' takes the BLAST output (output.blast) from the previous step, and takes the second column (-f2), i.e., subject sequence identifiers and sends/pipes them (|) to the 'sort' function. After sorting, they are passed on to the 'uniq' function to remove the duplicates and the output is written to the file (SubjectIdentifiers.txt).
$ cut -f2 output.blast | sort | uniq > SubjectIdentifiers.txt 4. Using these identifiers (SubjectIdentifiers.txt) to extract the corresponding transcript (SelectedTranscripts.fasta) or protein sequences (SelectedProteins.fasta) from the respective transcriptome or proteome by running the 'faSomeRecords' program.  InterProScan is a time-consuming step, hence we used pre-annotated data where available, or reduced the number of databases to scan (using -appl Pfam,CDD setting), in order to save time.
In some cases, we split the data in smaller batches and ran on multiple processors.   As expected, various proteins evolve differently, leading to different models of evolution.
ModelFinder is run as a part of IQ-TREE, hence it does not require any additional steps.  shape parameter estimated from maximum likelihood (-a e) and the topology was searched based on the sub-tree pruning and re-grafting approach (-s SPR). After running these multiple programs, the trees obtained were compared to understand the overall topology based on the congruent branches (see next step). We have also tried and tested various Bayesian approaches (using MrBayes), but the trees never converged even after months of computation, and provided various incongruent topologies. Hence, all the analyses were performed with Maximum Likelihood approaches.  14. Once the trees are obtained, they are manually checked for errors. Manually remove the branches with long branch attraction, or partial sequences or any misplaced taxa. If the proportion of these misplaced branches is too high, re-analyze the phylogeny with more sequences from other species, as well as by removing the spurious sequences. These steps are repeated until obtain better trees that are not only supported by good bootstraps but also obeys the taxonomy of those phyla.

Limitations and Conclusions
Due to the generalized nature of the method, it was difficult to automate the complete protocol.
Hence, wherever possible, the method was simplified with scripts/commands dedicated for fast and parallel processing. On the other hand, it gave control over the decision-making process based on the protein under consideration. When dealing with highly redundant protein families, we removed highly similar proteins (> 90% similarity), prior to phylogeny, which reduced the (computational) time without losing accuracy. In many cases we observed that the best-hit in reciprocal-BLAST is not really a BBH, as sometimes a second hit was still the best one due to one or few amino acid difference(s) (especially in proteins with common domains e.g. bHLH or PB1). Hence, in those cases we considered two best hits and used both for phylogeny construction. The false positive orthologs were eventually placed in the outgroup (or at least separate from the ingroup) in the phylogenetic tree. As we were dealing with transcriptomes, we could not predict the actual gene copy number in each species, but only the ancestral copy number for that class or phylum, by comparing the ancestral copies across the majority of the species in that phylum. Another issue of dealing with (low-depth) transcriptomes was that we found many partial transcripts leading to the truncated proteins/domains, or we might fail to identify the transcripts that were not expressed in that particular tissue or condition. In that regard, combining ortholog sequence information from multiple transcriptomes or species of various families is mandatory to confirm the ancestral state for each class or phylum.
Based on this protocol and the guidelines mentioned above, we have reconstructed the ancestral states of various protein families along with their orthologs in a 'deep' phylogenetic space, across multiple kingdoms. We demonstrated how this method was implemented for proteins that are welldefined with known domains, novel proteins with unknown domains, poorly conserved domains and phylum/kingdom-specific proteins that (dis)appeared at various stages in evolution. This approach was successfully applied for the core proteins of the auxin signalling (Nuclear Auxin Pathway (NAP)) 11 www.bio-protocol.org/e3566