PASTEC was developed within the REPET package [7]. In this context, we used PASTEC to classify the consensus TE sequences found de novo in a genome. PASTEC uses several features of TEs to classify TE consensus sequences. It searches for structural evidence and sequence similarities, which are obtained during a preprocessing step and stored in a MySQL database. The structural features considered are TE length; the presence of an LTR (long terminal repeat) or TIR (terminal inverted repeat), detected with a custom-built tool (minimum length of 10 bp, minimum identity of 80%, reciprocal orientations of the terminal repeats taken into account, and a maximal length of 7000 bp); the presence of SSRs (simple sequence repeats, detected with the Tandem Repeats Finder (TRF) tool [8]); a polyA tail; and an ORF (open reading frame). The blastx and tblastx routines are used to search for similarities to known TEs in Repbase Update, and the hmmer3 package [9] is used to search HMM profile databanks (TE-specific or not) after translation in all six frames. Sequence similarities are also identified by blastn searches against known rDNA sequences, known host genes and known helitron ends. The databanks used are preprocessed and formatted. The Repbase Update for PASTEC can be downloaded from http://www.girinst.org/repbase/index.html, whereas the HMM profile databank formatted for PASTEC is available from the REPET download directory (http://urgi.versailles.inra.fr/download/repet/).
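To make the phrase "translation in all six frames" concrete, the following minimal Python sketch translates a nucleotide sequence in the three forward and three reverse-complement frames using the standard genetic code. It is an illustration only and is not taken from PASTEC or REPET; the function names and the frame labels ("+1" to "-3") are assumptions introduced here.

```python
# Illustrative sketch only: not PASTEC code. All names are hypothetical.
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each of the six protein strings could then be compared with a profile databank; stop codons are reported as "*" so that ORF boundaries remain visible.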
PASTEC classifies TEs by testing all classifications from Wicker's hierarchical TE classification system. Each possible classification is weighted according to the evidence available for it. TEs are currently classified to the class and order levels. PASTEC can also determine whether a TE is complete on the basis of the following criteria: sequence coverage for known TEs, profile coverage, presence of terminal repeats for certain classes, presence of a polyA or SSR tail for LINEs and SINEs, and the length of the TE with respect to expectations for the class concerned.
We designed PASTEC as a modular multi-agent classifier. The system is composed of four types of agents: retrievers, classifiers, filter agents and a super-agent (Figure 1). The retriever agents retrieve the pre-computed analysis results stored in the MySQL database. They act at the request of the classifier or filter agents, filtering, formatting and supplying the results. Each classifier or filter agent is specialized to recognize a particular category. For example, the LTR agent can determine only whether or not the TE is an LTR retrotransposon. The classifier and filter agents act at the request of the super-agent, deciding whether or not they can classify the TE. For example, the LTR agent decides whether the consensus TE is an LTR retrotransposon on the basis of the following evidence: the presence of the ENV (envelope protein) profile (a condition sufficient for classification); the presence of INT (integrase), RT (reverse transcriptase), GAG (capsid protein), AP (aspartate proteinase) and RH (RNase H) profiles together with the detection of an LTR (long terminal repeat); and a blast match with the sequence of a known LTR retrotransposon. The super-agent resolves classification conflicts and formats the output file. It resolves conflicts by using a confidence index normalized to 100. For example, the LTR agent calculates a confidence index according to the following rules: +2 for the presence of an ENV profile (because this condition is sufficient for classification); +1 for each INT, GAG, RT, RH or AP profile found in combination with a long terminal repeat; and +1 for each profile (ENV, AP, RT, RH and GAG) found in the same frame in the same ORF. If the consensus matches at least one known LTR retrotransposon, the LTR agent adds 2 to the confidence index for each type of blast (blastx or tblastx). Finally, the length of the TE is taken into account: 1 is added if the TE without the long terminal repeat is between 4000 and 15000 bp long, and the confidence index is decreased by 1 if the TE without the long terminal repeat is less than 1000 bp or more than 15000 bp long.
The super-agent uses the maximum confidence index defined for each classifier agent to normalize the confidence index for each classification to 100, and then compares the different classifications. Advanced users can edit all decision rules and maximum confidence indices in the Decision_rules.yaml file.
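The scoring rules described for the LTR agent can be sketched in a few lines of Python. This is a hedged illustration, not the PASTEC implementation: the class and function names are hypothetical, the treatment of the length boundaries as inclusive is an assumption, clamping the result to the 0-100 range is an assumption, and the maximum index used for normalization is assumed here to be the sum of all positive rules (the actual value is whatever is configured in Decision_rules.yaml).

```python
from dataclasses import dataclass, field

# Illustrative sketch only: names and defaults below are hypothetical.
LTR_PAIRED_PROFILES = {"INT", "GAG", "RT", "RH", "AP"}   # +1 each, if an LTR is present
SAME_ORF_PROFILES = {"ENV", "AP", "RT", "RH", "GAG"}     # +1 each, same frame and ORF
BLAST_TYPES = {"blastx", "tblastx"}                      # +2 per blast type with a match

# Assumption: maximum = sum of every positive rule; PASTEC reads its own
# maxima from Decision_rules.yaml, which may differ from this value.
ASSUMED_MAX_INDEX = (
    2 + len(LTR_PAIRED_PROFILES) + len(SAME_ORF_PROFILES) + 2 * len(BLAST_TYPES) + 1
)


@dataclass
class LtrEvidence:
    profiles: set = field(default_factory=set)           # all profiles detected
    same_orf_profiles: set = field(default_factory=set)  # profiles sharing one ORF/frame
    has_ltr: bool = False                                 # long terminal repeat detected
    blast_hits: set = field(default_factory=set)          # blast types matching a known LTR TE
    length_without_ltr: int = 0                           # bp, terminal repeats excluded


def ltr_raw_index(ev):
    """Apply the additive rules described in the text to one consensus."""
    score = 0
    if "ENV" in ev.profiles:
        score += 2
    if ev.has_ltr:
        score += len(ev.profiles & LTR_PAIRED_PROFILES)
    score += len(ev.same_orf_profiles & SAME_ORF_PROFILES)
    score += 2 * len(ev.blast_hits & BLAST_TYPES)
    if 4000 <= ev.length_without_ltr <= 15000:
        score += 1
    elif ev.length_without_ltr < 1000 or ev.length_without_ltr > 15000:
        score -= 1
    return score


def normalized_index(raw, max_index=ASSUMED_MAX_INDEX):
    """Scale a raw index to 0-100 (clamping is an assumption of this sketch)."""
    return max(0, min(100, round(100 * raw / max_index)))
```

Because every agent is normalized by its own maximum, indices produced by agents with different numbers of rules become comparable, which is what allows the super-agent to rank competing classifications.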
The output can be read by humans and is biologist-friendly. Each TE is described on a single line specifying its name, length, status, class, order, completeness, confidence index and all the features characterizing it. A status of “potential chimeric” or “OK” is assigned to the TE. If the TE is not considered to be “OK”, users must apply their own expertise. A TE is declared “potential chimeric” when at least two classifications are possible. In this case, PASTEC chooses the best classification based on the available evidence, or does not classify the TE if no decision is possible. In this last case, all possible classifications are given (separated by a pipe symbol “|”). We present an example of PASTEC output in Table S1. PASTEC output is a tabular file, with the columns from left to right indicating the name of the TE, its length, the orientation of the sequence, chimeric/non-chimeric status (OK indicating that the element is not potentially chimeric), class (class I in this case) and order. In the first line of the example provided, the TE is an LTR retrotransposon. We presume that the element is complete because we have no evidence to suggest that it is incomplete, and the confidence index is 71/100. The last column summarizes all the evidence found: coding sequence evidence, such as the results of tblastx queries against the Repbase database (TE_BLRtx evidence), blastx queries against the Repbase database (TE_BLRx evidence) and profiles. A blast match is taken into account if its coverage exceeds 5%, and a profile is taken into account if its coverage exceeds 20% (these parameters can be edited in the configuration file). For each item of coding sequence evidence, the coverage of the subject is specified. The structural evidence is also detailed: “>4000 bp” indicates that the TE length without terminal repeats is between 4000 and 15000 bp. The next item of information presented in the comments column is the presence of terminal repeats: we have an LTR in this case, with an LTR length of 433 bp. Two long ORFs have been identified, the last of which contains four profiles in the same frame and is up to 3000 bp long. Other evidence provided for this example includes a partial match with a Drosophila melanogaster gene (coverage 16.55%) and an SSR content of 18%. The super-agent determines whether a TE is complete based on whether it is sufficiently long, whether the expected terminal repeats or polyA tail are present, whether blast match coverage exceeds 30% and whether profile coverage exceeds 75%. The second line of the example corresponds to a potentially chimeric TE, for which human expertise is required.
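The coverage thresholds quoted above (5% and 20% for accepting evidence, 30% and 75% for completeness) can be expressed as a short Python sketch. This is an illustration under stated assumptions and not PASTEC code: the function and constant names are hypothetical, and because the text does not say how the individual completeness criteria are combined into a final verdict, the sketch only reports each criterion separately.

```python
# Illustrative sketch only: names are hypothetical; thresholds are the
# percentages quoted in the text (editable in PASTEC's configuration file).
EVIDENCE_MIN_COVERAGE = {"blast": 5.0, "profile": 20.0}
COMPLETE_MIN_COVERAGE = {"blast": 30.0, "profile": 75.0}


def keep_evidence(kind, coverage):
    """Return True if a blast match or profile hit is retained as evidence."""
    if kind not in EVIDENCE_MIN_COVERAGE:
        raise ValueError(f"unknown evidence kind: {kind!r}")
    return coverage > EVIDENCE_MIN_COVERAGE[kind]


def completeness_checks(length_ok, terminal_feature_present,
                        blast_coverage, profile_coverage):
    """Evaluate each completeness criterion independently.

    How the criteria are combined into a final call is NOT specified in
    the text, so that decision is left to the caller.
    """
    return {
        "length": length_ok,
        "terminal_feature": terminal_feature_present,
        "blast_coverage": blast_coverage > COMPLETE_MIN_COVERAGE["blast"],
        "profile_coverage": profile_coverage > COMPLETE_MIN_COVERAGE["profile"],
    }
```

For instance, a host gene match with 16.55% coverage would be retained as evidence under the 5% blast threshold, whereas a profile hit with exactly 20% coverage would not, because the thresholds are treated here as strict inequalities ("exceeds").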
Hoede C., Arnoux S., Moisset M., Chaumier T., Inizan O., Jamilloux V., & Quesneville H. (2014). PASTEC: An Automatic Transposable Element Classification Tool. PLoS ONE, 9(5), e91929.