RNA-seq data were processed using a uniform pipeline. First, we assessed RNA-seq data quality using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We removed Illumina adapters and poor-quality reads (reads < 36 bp long, leading or trailing bases with a Phred score < 3, and allowing a maximum of 2 mismatches per read) using Trimmomatic (version 0.39) [19]. Then, we aligned trimmed reads to either the human hg19 genome or the rhesus macaque mmul_10 genome using STAR aligner version 2.5.3a [20]. We followed the guidelines outlined by leafcutter (https://davidaknowles.github.io/leafcutter) to align RNA-seq reads and prepare data for differential splicing analyses. RNA-seq read alignment yielded a mean of 78,955,738 paired-end reads in humans (s.d. = 29,804,777; mean alignment rate = 86.16%; mean read size = 188.36) and a mean of 34,551,920 paired-end reads in primates (s.d. = 8,202,258; mean alignment rate = 79.71%; mean read size = 127.59).
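As an illustration of the trimming step, the sketch below assembles a Trimmomatic paired-end command line from the thresholds stated above (minimum length 36, leading and trailing quality 3, 2 mismatches). It is not the authors' actual invocation. The function name, file names, jar path and output naming are hypothetical. Treating the 2 mismatches as the ILLUMINACLIP seed-mismatch setting is an assumption, and the palindrome and simple clip thresholds (30 and 10) are not reported in the text and are assumed here from common Trimmomatic usage.

```python
from typing import List


def build_trimmomatic_command(
    fwd_in: str,
    rev_in: str,
    out_prefix: str,
    adapters_fasta: str,
    jar_path: str = "trimmomatic-0.39.jar",  # hypothetical location of the jar
    seed_mismatches: int = 2,                # assumed mapping of "2 mismatches"
    palindrome_clip_threshold: int = 30,     # assumption, not stated in the text
    simple_clip_threshold: int = 10,         # assumption, not stated in the text
    leading: int = 3,                        # trim leading bases below Phred 3
    trailing: int = 3,                       # trim trailing bases below Phred 3
    min_len: int = 36,                       # drop reads shorter than 36 bp
) -> List[str]:
    """Return the argument list for a Trimmomatic paired-end run (not executed)."""
    # Paired-end mode writes four files: paired and unpaired reads per mate.
    outputs = [
        f"{out_prefix}_1P.fastq.gz",
        f"{out_prefix}_1U.fastq.gz",
        f"{out_prefix}_2P.fastq.gz",
        f"{out_prefix}_2U.fastq.gz",
    ]
    illuminaclip = (
        f"ILLUMINACLIP:{adapters_fasta}:{seed_mismatches}:"
        f"{palindrome_clip_threshold}:{simple_clip_threshold}"
    )
    return [
        "java", "-jar", jar_path, "PE",
        fwd_in, rev_in, *outputs,
        illuminaclip,
        f"LEADING:{leading}",
        f"TRAILING:{trailing}",
        f"MINLEN:{min_len}",
    ]


if __name__ == "__main__":
    cmd = build_trimmomatic_command(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample", "adapters.fa"
    )
    print(" ".join(cmd))
```

The command is returned as a list rather than run, so it can be inspected or passed to `subprocess.run` once the paths match a real installation.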
DNA genotypes from human RNA-seq data were ascertained via the SAMtools mpileup function, as done previously [21]. Human genotypes derived from RNA-seq data were phased and imputed with Beagle version 5.1, which uses a probabilistic hidden Markov model that performs well for sequencing data with sparse genomic coverage [22]. We caution the reader that Beagle was originally developed for genome-wide DNA variant data and not RNA-sequencing data. Our analyses used several quality control (QC) criteria, including: genotyping rate > 95%, minor allele frequency > 0.10, Hardy–Weinberg equilibrium P value > 1e-6, > 5 reads per sample, Phred score > 20 and an imputation score > 0.3. The input for imputation was 40,878 called genotypes that were common among all samples and passed initial QC. These variants were imputed against the full 1000 Genomes Phase III data, which resulted in 570,755 SNPs, 178,598 of which passed QC. These 178,598 variants were used for polygenic score and sQTL analyses. Note that 91.9% of these SNPs were present in the AUD GWAS, but that GWAS has 77.9 times more SNPs than the current study. Thus, we encourage the reader to use caution in interpreting our polygenic score and sQTL analyses, given the limited number of individuals and SNPs used.
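The QC thresholds listed above can be expressed as a simple per-variant filter. The following is a minimal sketch, not the authors' implementation. The `VariantStats` record and its field names are hypothetical, and reading "> 5 reads per sample" as the minimum read depth across samples is an assumption. The imputation score is optional because it exists only for variants that have been through imputation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantStats:
    """Hypothetical per-variant summary; field names are illustrative only."""
    genotyping_rate: float          # fraction of samples with a called genotype
    minor_allele_freq: float
    hwe_p_value: float              # Hardy-Weinberg equilibrium test P value
    min_reads_per_sample: int       # assumed: lowest read depth across samples
    phred_score: float
    imputation_score: Optional[float] = None  # None before imputation


def passes_qc(v: VariantStats) -> bool:
    """Apply the thresholds reported in the text; every check must hold."""
    checks = [
        v.genotyping_rate > 0.95,
        v.minor_allele_freq > 0.10,
        v.hwe_p_value > 1e-6,
        v.min_reads_per_sample > 5,
        v.phred_score > 20,
    ]
    # The imputation score is only checked when one is available.
    if v.imputation_score is not None:
        checks.append(v.imputation_score > 0.3)
    return all(checks)


if __name__ == "__main__":
    common = VariantStats(0.99, 0.25, 0.5, 12, 35.0, imputation_score=0.8)
    rare = VariantStats(0.99, 0.05, 0.5, 12, 35.0, imputation_score=0.8)
    print(passes_qc(common), passes_qc(rare))
```

In this example the second variant fails only on minor allele frequency (0.05 is below the 0.10 cutoff). All comparisons are strict, matching the ">" signs in the text.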