The sequences of all the PPRs were identified with reference to the 11,938 sequences of Orthohepevirus A (including 338 complete HEV genomes) available in the Virus Pathogen Resource (VIPR) database. Selected sequences were systematically searched for insertions so that these could be used, together with those identified by PacBio sequencing, for further analysis.
The compositions of the HEV PPR insertions/duplications were determined and their post-translational modifications were predicted by analyzing a range of parameters. Potential ubiquitination sites were identified using the BDM-PUB server with a threshold of >0.3 for the average potential score. Potential phosphorylation sites were identified using the NetPhos 3.1 server with a threshold of >0.5. Potential acetylation sites were identified using the Prediction of Acetylation on Internal Lysines (PAIL) server with a threshold of >0.2. Potential N-linked glycosylation sites were identified using the NetNGlyc 1.0 server with a threshold of >0.5. Potential methylation sites were identified using the BPB-PPMS server with a threshold of >0.5. Nuclear export signal (NES) sites were identified using the Wregex server with the parameters NES/CRM1 and Relaxed. Nuclear localization signal (NLS) sites were identified using SeqNLS with a cut-off of 0.86.
The amino acid composition (proportions of amino acids), physico-chemical composition, and net charge were analyzed with R. Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set. PCA identifies new variables, the principal components, which are linear combinations of the original variables (Ringner, 2008).
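As an illustration of the idea that a principal component is a linear combination of the original variables, the following minimal sketch computes the first principal component of a small table using only the Python standard library. It is not the authors' R analysis: the feature names and values are hypothetical, the variables are standardized before the decomposition, and power iteration is used in place of a full eigendecomposition.

```python
import math

# Hypothetical feature table (made-up values, not data from the study).
# Columns: net charge, hydrophobic fraction, predicted phosphorylation sites.
rows = [
    [2.0, 0.30, 4],
    [3.5, 0.35, 6],
    [-1.0, 0.20, 2],
    [0.5, 0.28, 3],
    [4.0, 0.40, 7],
]

def standardize(table):
    """Center each column and scale it to unit sample standard deviation.
    Assumes no column is constant (a zero SD would divide by zero)."""
    n = len(table)
    cols = list(zip(*table))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in table]

def covariance(z):
    """Sample covariance matrix of an already centered table."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def first_component(cov, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector.
    The uneven start vector lowers the chance of starting orthogonal
    to the leading eigenvector."""
    v = [1.0 / (i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

z = standardize(rows)
cov = covariance(z)            # equals the correlation matrix of `rows`
eigenvalue, loadings = first_component(cov)
explained = eigenvalue / len(cov)   # share of total variance on PC1
scores = [sum(a * b for a, b in zip(r, loadings)) for r in z]

print("PC1 loadings:", [round(x, 3) for x in loadings])
print("Variance explained by PC1:", round(explained, 3))
print("PC1 scores:", [round(s, 3) for s in scores])
```

Each score is the weighted sum of one sequence's standardized variables, with the loadings as weights, which is what makes the component a linear combination of the original variables.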
PCA was performed (excluding the amino acid composition, which is redundant with the physico-chemical properties) to summarize and visualize the information on the variables in our data set (Abdi and Williams, 2010); each variable was then studied independently. An in-house R pipeline based on the amino acid sequences and the results of each analysis was used to generate bar plots of amino acid composition. The amino acid compositions were assigned to one of two categories: sequences with insertions/duplications (including insertions from the human genome and duplications of the HEV genome) and sequences without insertions/duplications. The other parameters were assigned to one of three categories: sequences with insertions, sequences with duplications, and sequences without insertions/duplications.
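The two-category comparison of amino acid composition can be sketched as follows. This is a hedged Python illustration rather than the in-house R pipeline: the peptide strings are toy examples and not HEV sequences, the category labels are hypothetical, and `net_charge` is a deliberately crude estimate that counts only K/R as positive and D/E as negative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Proportion of each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}

def net_charge(seq):
    """Crude net charge: +1 per K/R, -1 per D/E.
    Histidine, termini, and pH dependence are ignored (an assumption)."""
    positive = sum(seq.count(a) for a in "KR")
    negative = sum(seq.count(a) for a in "DE")
    return positive - negative

def mean_composition(seqs):
    """Average the per-sequence proportions within one category."""
    comps = [composition(s) for s in seqs]
    return {a: sum(c[a] for c in comps) / len(comps) for a in AMINO_ACIDS}

# Toy sequences grouped into the two categories used for composition.
groups = {
    "with_insertion_or_duplication": ["PPSPAPTRKLE", "PAPPSKKRSPD"],
    "without_insertion_or_duplication": ["PPSPAPDTSE", "PAPPSPTDAE"],
}

for label, seqs in groups.items():
    mean_comp = mean_composition(seqs)
    # Keep only residues that occur, sorted by decreasing proportion;
    # these values are what a bar plot per category would display.
    bars = sorted(((a, p) for a, p in mean_comp.items() if p > 0),
                  key=lambda item: -item[1])
    print(label)
    print("  composition:", [(a, round(p, 3)) for a, p in bars])
    print("  net charges:", [net_charge(s) for s in seqs])
```

The other parameters would be grouped the same way, with a three-key dictionary (insertions, duplications, neither) in place of the two keys shown here.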