We applied a second approach to identify sex-specific sequences and their location in the genome, as described in Akagi et al. (2014) and Böhne et al. (2019), for 10 of the 11 populations. We excluded the Ruzizi River population because of its low number of female samples (Fig. 1). Starting from trimmed sequencing reads (see above), we generated per-population catalogs of all 37-bp k-mers starting with “AG” that were present in at least five specimens of the population, using a Python script provided in Akagi et al. (2014). We divided the k-mers of each catalog into four categories: Y-k-mers (male-specific), Z-k-mers (male-biased), X-k-mers (female-biased), and W-k-mers (female-specific). To this end, we fitted a linear regression to the k-mer counts of each population and retained outliers from the general distribution by calculating studentized (i.e. jack-knifed) residuals from the linear model. Outliers were defined as all k-mers with an absolute studentized residual of 3 or greater, as an observation with an absolute value of 3 is deemed to be an outlier (Belsley et al. 1980; Hettmansperger 1987; Atkinson 1994). Subsequently, sex-specific k-mers (i.e. Y- or W-k-mers) were defined as k-mers with zero counts in one sex but non-zero counts in the other. Sex-biased k-mers were defined by the ratio of counts between males and females, expecting larger counts for the homogametic sex (e.g. X-k-mers: female count/male count > 4, with the threshold depending on the population analyzed). In summary, from the outlier k-mers of the linear regression we retained (i) sex-specific k-mers (either Y- or W-k-mers) and (ii) sex-biased k-mers with count ratios greater than 4 in all populations except Kalambo River 1, for which the ratio threshold for Z-k-mers was set to 12 because of the lower number of female samples in this population (see Fig. 1).
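The cataloging and classification steps described above can be illustrated with a minimal Python sketch. It is not the script of Akagi et al. (2014). The function names, the input layout (reads grouped per specimen, and a pair of male and female counts per k-mer), and the choice to regress female counts on male counts are assumptions made for illustration only. The studentized residuals are computed with the standard leave-one-out formula for a simple linear regression.

```python
import math
from collections import Counter


def build_catalog(reads_by_specimen, k=37, prefix="AG", min_specimens=5):
    """Return the k-mers starting with `prefix` found in >= min_specimens specimens."""
    specimens_with_kmer = Counter()
    for reads in reads_by_specimen.values():
        seen = set()  # count each k-mer at most once per specimen
        for read in reads:
            for i in range(len(read) - k + 1):
                if read.startswith(prefix, i):
                    seen.add(read[i:i + k])
        specimens_with_kmer.update(seen)
    return {kmer for kmer, n in specimens_with_kmer.items() if n >= min_specimens}


def studentized_residuals(x, y):
    """Externally studentized (jack-knifed) residuals of the fit y = a + b * x."""
    n = len(x)
    if n < 4:
        raise ValueError("need at least four observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx  # leverage of this observation
        # Residual variance of the fit with this observation left out
        # (two fitted parameters, hence n - 3 degrees of freedom).
        s2 = (sse - e * e / (1.0 - h)) / (n - 3)
        if s2 <= 0.0:
            # All other points lie exactly on a line: any deviation is extreme.
            out.append(math.copysign(math.inf, e) if e != 0 else 0.0)
        else:
            out.append(e / math.sqrt(s2 * (1.0 - h)))
    return out


def classify_kmers(counts, ratio_threshold=4.0, outlier_cutoff=3.0):
    """Assign outlier k-mers to Y, W, X or Z; `counts` maps k-mer -> (male, female)."""
    kmers = list(counts)
    male = [counts[kmer][0] for kmer in kmers]
    female = [counts[kmer][1] for kmer in kmers]
    # Assumption: female counts are regressed on male counts.
    resid = studentized_residuals(male, female)
    classes = {}
    for kmer, m, f, t in zip(kmers, male, female, resid):
        if abs(t) < outlier_cutoff:
            continue  # follows the general distribution, not retained
        if m > 0 and f == 0:
            classes[kmer] = "Y"  # male-specific
        elif f > 0 and m == 0:
            classes[kmer] = "W"  # female-specific
        elif m > 0 and f / m > ratio_threshold:
            classes[kmer] = "X"  # female-biased
        elif f > 0 and m / f > ratio_threshold:
            classes[kmer] = "Z"  # male-biased
    return classes
```

In this sketch a single `ratio_threshold` applies to both X- and Z-k-mers. A population-specific threshold, such as the value of 12 used for Z-k-mers in Kalambo River 1, would require either a separate call or a second parameter.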
Next, we tested for an excess of sex-specific k-mers in each population with a Wilcoxon test, aiming to identify the heterogametic sex of that population. Additionally, we identified the k-mers shared among populations in each category with UpSetR (Conway et al. 2017) in R.
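The sharing analysis was done with UpSetR in R. The following Python sketch only illustrates the underlying tabulation: for each combination of populations, it counts the k-mers found in exactly those populations, which is the kind of intersection size an UpSet plot displays. The function name, the input layout, and the population label `PopC` are hypothetical.

```python
from collections import Counter


def intersection_sizes(kmers_by_population):
    """Count k-mers per exact combination of populations that contain them."""
    membership = {}
    for population, kmers in kmers_by_population.items():
        for kmer in kmers:
            membership.setdefault(kmer, set()).add(population)
    # Each k-mer contributes to exactly one combination of populations.
    return Counter(frozenset(pops) for pops in membership.values())


# Toy example for one category (e.g. Y-k-mers); k-mers are shortened for readability.
y_kmers = {
    "Ka2": {"AGAAA", "AGCCC", "AGGGG"},
    "Ch1": {"AGCCC", "AGGGG", "AGTTT"},
    "PopC": {"AGGGG"},
}
sizes = intersection_sizes(y_kmers)
```

Running the tabulation once per category (Y, W, X, and Z) gives one table per category.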
For the Kalambo River (Ka2) and Chitili River (Ch1) populations, we extracted the sequencing reads containing Y-k-mers of the respective population, together with their mates. Next, we assembled the extracted reads with MEGAHIT (Li et al. 2015) with --k-max 12. We also mapped the resulting contigs to the Nile tilapia reference genome with BWA and compared the contig data sets to the NR database using BLASTX in Blast2GO (Götz et al. 2008) to retrieve functional annotations.
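The read extraction step can be sketched as follows. This is a minimal illustration, not the procedure used in the study. It assumes that read pairs are already loaded in memory as tuples of sequence strings, and it scans only the forward strand of each read, so k-mers present as reverse complements are not detected. The function names are hypothetical.

```python
def contains_catalog_kmer(read, kmers, k=37, prefix="AG"):
    """Return True if any k-mer of `read` that starts with `prefix` is in `kmers`."""
    for i in range(len(read) - k + 1):
        if read.startswith(prefix, i) and read[i:i + k] in kmers:
            return True
    return False


def extract_pairs(read_pairs, kmers, k=37, prefix="AG"):
    """Keep every pair in which at least one mate carries a catalog k-mer."""
    return [
        (mate1, mate2)
        for mate1, mate2 in read_pairs
        if contains_catalog_kmer(mate1, kmers, k, prefix)
        or contains_catalog_kmer(mate2, kmers, k, prefix)
    ]
```

Because a pair is kept as soon as either mate matches, both mates of a matching read are passed on to the assembly.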