The following strategies were used when recruiting variants from the databases into our positive group: (i) based on the National Center for Biotechnology Information Reference Sequence (RefSeq) database release 59 (23), we only included variants within the splicing consensus regions (−3 to +8 at the 5′ splice site and −12 to +2 at the 3′ splice site) at the exon/intron boundaries of protein-coding genes; (ii) within the consensus regions, all variants at GT-AG sites were excluded, because these sites are so invariant that almost all mutations at these sites affect splicing and most tools can predict their impact with very high accuracy (22,24); (iii) only single nucleotide substitutions (i.e. SNVs) were retained; (iv) variants were excluded if the information provided by the database did not contain biological evidence (e.g. if it consisted merely of computational predictions or statistical associations); and (v) to avoid duplication, variants present in more than one database were counted only once. The first three criteria were also applied to the recruitment of negative variants from the 1000 Genomes Project phase 1 data. Furthermore, additional filtering strategies were implemented: we chose variants within genes that have only one annotated transcript in RefSeq database release 59 (23) (this applies only to the recruitment of negative variants), and we only chose variants with minor allele frequency >0.05 in combined populations of European ancestry. The rationale is that, because individuals of European ancestry are the most commonly studied subjects, if a common variant alters splicing, the alternatively spliced transcript is highly likely to have been reported in this population. In contrast, a common variant in a gene without reported alternative transcripts is unlikely to alter splicing of that gene. For the additional test set, we chose the variants reported in the work of Houdayer et al. that are (i) within the splicing consensus regions defined above; (ii) single nucleotide substitutions; and (iii) not in our dataset (22). All variants were annotated using ANNOVAR, a software package that performs functional annotation of genetic variants from high-throughput sequencing data (25), based on human reference sequence assembly GRCh37/hg19.
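The positional and variant-type filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work. All function names are hypothetical, and the sketch assumes the common splice-site numbering convention in which there is no position 0: at the 5′ splice site, −1 is the last exonic base and +1 is the first intronic base (so the invariant GT occupies +1 and +2), and at the 3′ splice site, −1 is the last intronic base and +1 is the first exonic base (so the invariant AG occupies −2 and −1).

```python
# Illustrative sketch only: names and the numbering convention are assumptions,
# not the implementation used in the study.

DONOR_WINDOW = (-3, 8)         # 5' splice site consensus region
ACCEPTOR_WINDOW = (-12, 2)     # 3' splice site consensus region
DONOR_INVARIANT = {1, 2}       # assumed positions of the invariant GT
ACCEPTOR_INVARIANT = {-2, -1}  # assumed positions of the invariant AG
NUCLEOTIDES = {"A", "C", "G", "T"}


def donor_offset(pos, last_exon_base, strand="+"):
    """Offset of a genomic position relative to a 5' splice site (no zero)."""
    step = 1 if strand == "+" else -1
    d = (pos - last_exon_base) * step
    # d > 0 is intronic (+1, +2, ...); d <= 0 is exonic (-1, -2, ...).
    return d if d > 0 else d - 1


def acceptor_offset(pos, first_exon_base, strand="+"):
    """Offset of a genomic position relative to a 3' splice site (no zero)."""
    step = 1 if strand == "+" else -1
    d = (pos - first_exon_base) * step
    # d >= 0 is exonic (+1, +2, ...); d < 0 is intronic (-1, -2, ...).
    return d + 1 if d >= 0 else d


def passes_shared_filters(site, offset, ref, alt):
    """Criteria (i)-(iii): consensus region, not GT-AG, single nucleotide substitution."""
    if site == "donor":
        window, invariant = DONOR_WINDOW, DONOR_INVARIANT
    else:
        window, invariant = ACCEPTOR_WINDOW, ACCEPTOR_INVARIANT
    in_window = offset != 0 and window[0] <= offset <= window[1]
    is_snv = ref in NUCLEOTIDES and alt in NUCLEOTIDES and ref != alt
    return in_window and offset not in invariant and is_snv


def passes_negative_filters(site, offset, ref, alt, n_transcripts, maf_eur):
    """Shared criteria plus the two filters applied to negative variants."""
    return (
        passes_shared_filters(site, offset, ref, alt)
        and n_transcripts == 1   # gene has a single annotated transcript
        and maf_eur > 0.05       # common in populations of European ancestry
    )
```

For example, under these assumptions a G>A substitution five bases into the intron downstream of an exon ending at position 100 on the plus strand has `donor_offset(105, 100)` equal to 5 and passes the shared filters, whereas a substitution at position 101 (offset +1, within the GT dinucleotide) is excluded.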