HERV proviruses and other repeat regions were annotated as previously described (ref. 58). In brief, hidden Markov models (HMMs) representing known human repeat families (Dfam 2.0 library v.150923) were used to annotate GRCh38 using RepeatMasker, configured with nhmmer. RepeatMasker annotates long terminal repeats (LTRs) and internal regions separately; thus, tabular outputs were parsed to merge adjacent annotations for the same element. A list of HERV proviruses with functional env ORFs was compiled (Supplementary Table 1), and RNA-seq reads from TCGA, GTEx and TRACERx were mapped and counted using a custom transcriptome assembled from a subset of the TCGA RNA-seq data, as previously described (ref. 58). Transcripts per million (TPM) values were calculated for all transcripts in the transcript assembly with a custom Bash pipeline using GNU parallel and Salmon (v.0.12.0; ref. 59). TPM values were then imported into Qlucore Omics Explorer v.3.3 (Qlucore) for downstream differential expression analysis and visualization. Where multiple transcripts were transcribed from a given HERV provirus, data were collapsed by summing the expression of all transcripts overlapping the env ORF of that provirus. Patient-level mean values were calculated across multiple primary tumour regions, where applicable.
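The collapsing and averaging steps can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes that overlap is judged on whole transcript spans using half-open coordinates, that strand is ignored, and that a sample-to-patient mapping is available. All function names, identifiers and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def collapse_to_provirus(tpm, transcripts, env_orfs):
    """Sum, per sample, the TPM of every transcript overlapping each env ORF."""
    # Assumption: overlap is tested on the full transcript span, not per exon.
    members = {
        provirus: [t for t, span in transcripts.items() if overlaps(span, orf)]
        for provirus, orf in env_orfs.items()
    }
    return {
        sample: {
            provirus: sum(values.get(t, 0.0) for t in ts)
            for provirus, ts in members.items()
        }
        for sample, values in tpm.items()
    }


def patient_means(provirus_tpm, sample_to_patient):
    """Average provirus-level TPM across all tumour regions of each patient."""
    grouped = defaultdict(lambda: defaultdict(list))
    for sample, values in provirus_tpm.items():
        patient = sample_to_patient[sample]
        for provirus, value in values.items():
            grouped[patient][provirus].append(value)
    return {
        patient: {provirus: mean(vals) for provirus, vals in per_provirus.items()}
        for patient, per_provirus in grouped.items()
    }


# Hypothetical toy data: transcript spans, one env ORF, and per-region TPM values.
transcripts = {
    "t1": ("chr1", 100, 500),
    "t2": ("chr1", 400, 900),
    "t3": ("chr1", 2000, 2500),  # same chromosome, no overlap with the ORF
    "t4": ("chr2", 100, 500),    # different chromosome
}
env_orfs = {"HERV_A": ("chr1", 450, 800)}
tpm = {
    "P1_R1": {"t1": 1.0, "t2": 2.0, "t3": 5.0, "t4": 7.0},
    "P1_R2": {"t1": 3.0, "t2": 4.0},
    "P2_R1": {"t1": 0.5, "t2": 0.5},
}
sample_to_patient = {"P1_R1": "P1", "P1_R2": "P1", "P2_R1": "P2"}

per_region = collapse_to_provirus(tpm, transcripts, env_orfs)
per_patient = patient_means(per_region, sample_to_patient)
```

In this toy example, only `t1` and `t2` contribute to `HERV_A`, and the two regions from patient `P1` are averaged after summation. Summing before averaging mirrors the order in which the two steps are described above.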