Read filtering, OTU clustering, and annotation were performed with the MASQUE pipeline (https://github.com/aghozlane/masque), as described by Quereda et al. (63 (link)). A total of 2851 OTUs were obtained at 97% sequence identity threshold. The statistical analyses were performed with SHAMAN (shaman.c3bi.pasteur.fr) based on R software (v3.1.1) and bioconductor packages (v2.14). Because bacterial communities were expected to differ substantially between mosquito midguts and water samples, the normalization of OTU counts was performed at the OTU level by sample type (midgut or water) using the DESeq2 normalization method. All samples including the negative controls and technical replicates were included in the normalization step. The technical replicates were removed from the data set before analysis. To account for possible contamination at various steps in the sample-processing pipeline, the OTU counts were corrected with the reads from the negative controls (see above). All OTUs found in the negative control samples were removed from the normalized OTU table unless the count in a real sample was >10 times higher than the mean OTU count in the negative controls. This operation was performed with a homemade script in R (64 (link)). This normalized OTU count table with the OTUs found in the negative controls removed (file S8) was used for the richness, Shannon index, Venn diagrams, abundance heat map, and NMDS analysis. Observed richness, Shannon index, and Bray-Curtis distances were calculated with the vegan package in R (65 ). The effects of sample types and ecotypes on the bacterial richness were tested by fitting a generalized linear model (GLM) with a Poisson distribution. The SEs were corrected for overdispersion using a quasi-GLM model where the variance is given by the mean multiplied by the dispersion parameter. A χ2 test was applied to compare the significance of deviance shift after adding the covariates sequentially. The effects of sample type, ecotype, and their interaction on Shannon index were tested by fitting a linear model with a normal error distribution. The response variable was power-transformed to satisfy the model assumptions. The significance of each variable was tested with an analysis of variance (ANOVA) after adding the covariates sequentially. The results of the two models were confirmed by the convergence of backward and forward selection based on the Akaike information criterion. The Bray-Curtis distances were plotted with an NMDS method constrained in two dimensions. The Spearman correlation with real distances and stress value was estimated with the vegan R package (65 ). Effects of habitat and sample type on β diversity were tested with the betadisper and adonis permutational multivariate ANOVA methods from the vegan R package with 999 permutations of the Bray-Curtis distance matrix derived from OTU counts.
In SHAMAN, a GLM was fitted and vectors of contrasts were defined to determine the significance in abundance variation between sample types. The GLM included the main effect of habitat (sylvatic or domestic), the main effect of sample type (midgut or water), and their interaction. The resulting P values were adjusted for multiple testing according to the Benjamini and Hochberg procedure. All OTUs that were present in the negative controls were excluded from the final list of differentially abundant OTUs.
To confirm the OTU-based results with an OTU-independent method, a dissimilarity matrix was generated with the SIMKA software (66 ). Reads with a positive match against the sequences assembled from the negative controls were removed using Bowtie v2.2.9 (67 (link)). Then, k-mers of size 32 and occurring at least greater than two times were identified with SIMKA. Bray-Curtis dissimilarity was estimated between each sample.