Human long non-coding and protein-coding genes were obtained from the manually curated GENCODE version 24 annotation (EnsEMBL v83, corresponding to the GRCh38 human genome assembly), selecting the ‘lincRNA’ and ‘antisense’ biotypes for long non-coding genes and ‘protein_coding’ for coding genes. From each of these sets, 10 000 transcripts were extracted and further divided into two sets of 5000 transcripts, used for the learning and testing steps and denoted the HL and HT data sets, respectively. Importantly, only one transcript per locus was extracted for all biotypes, so as not to create a bias by introducing two isoforms of the same gene into both the HL and HT sets. For mouse, we used the GENCODE version M4 annotation (EnsEMBL v79) and derived the learning and testing sets in the same way as for human (denoted ML and MT). Due to the lower number of GENCODE lncRNAs annotated in mouse compared to human, each mouse set contains ∼2000 lncRNAs and 5000 mRNAs. For ‘non-model organisms’, lncRNAs belonging to the lincRNA and antisense classes (NONCODE codes 0001 and 1000, respectively) were downloaded from the latest version of the NONCODE database (NONCODE 2016) (28), while mRNAs were retrieved from the EnsEMBL database (v84). A summary of the number of mRNAs/lncRNAs per species is available in
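The construction of the learning and testing sets described above (one transcript per locus, followed by an even split) can be illustrated with a minimal Python sketch. The function name `split_one_per_locus`, the input format of `(gene_id, transcript_id)` pairs and the toy identifiers are hypothetical. The text does not state how the single transcript per locus was chosen, so the random choice used here is an assumption, not the authors' procedure.

```python
import random
from collections import defaultdict


def split_one_per_locus(transcripts, n_total, seed=0):
    """Keep one transcript per gene, sample n_total of them, split in half.

    transcripts: iterable of (gene_id, transcript_id) pairs (hypothetical format).
    Returns (learning, testing), two disjoint lists of transcript IDs.
    """
    rng = random.Random(seed)

    # Group transcript IDs by their gene (locus).
    by_gene = defaultdict(list)
    for gene_id, transcript_id in transcripts:
        by_gene[gene_id].append(transcript_id)

    # One representative per locus; picking it at random is an assumption.
    representatives = [
        rng.choice(sorted(ids)) for _, ids in sorted(by_gene.items())
    ]
    if n_total > len(representatives):
        raise ValueError("not enough loci to sample n_total transcripts")

    # Sample without replacement, then split into two equal halves.
    sampled = rng.sample(representatives, n_total)
    half = n_total // 2
    return sampled[:half], sampled[half:]


# Toy example: genes G1 and G4 each have two isoforms.
toy = [
    ("G1", "T1a"), ("G1", "T1b"),
    ("G2", "T2"),
    ("G3", "T3"),
    ("G4", "T4a"), ("G4", "T4b"),
]
learning, testing = split_one_per_locus(toy, n_total=4, seed=1)
```

Because each gene contributes at most one transcript before the split, no gene can have isoforms in both the learning and the testing set, which is the property the one-transcript-per-locus rule is meant to guarantee.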
Whole transcriptome sequencing of dog RNA samples (n = 20) was performed by the LUPA consortium. These biological samples, corresponding to 16 unique tissues and 7 breeds, were obtained from the ‘Cani-DNA CRB’ biobank at the University Rennes1, CNRS-IGDR, France (