BarSeq reads were converted to a table of the number of times that each bar code was seen in each sample using a custom Perl script (MultiCodes.pl). Depending on the primer design (see “BarSeq” above), the script either requires an exact match to the 8 nucleotides at the beginning of the read that identify the sample (“inline” indexes) or relies on Illumina software for demultiplexing (TruSeq P7 indexes). The script also requires an exact match to the 9 nucleotides upstream of the bar code. We did not filter on the quality scores of the bar code or of the sequence downstream of the bar code (the -minQuality 0 option). However, bar codes that do not exactly match an expected bar code are ignored in later stages of the analysis.
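The counting step for inline indexes can be illustrated with a minimal Python sketch. This is not the MultiCodes.pl implementation. The function names, the bar code length of 20, and the choice to search for the upstream flank anywhere after the index (rather than at a fixed offset) are assumptions made for illustration, and any sequences passed in are placeholders.

```python
from collections import Counter

# Assumed values for illustration only; the text above specifies the 8-nt
# inline index and the 9-nt upstream flank, but not the bar code length.
INDEX_LEN = 8      # inline sample index at the start of the read
BARCODE_LEN = 20   # assumed bar code length (hypothetical)


def count_barcodes(reads, index_to_sample, upstream_flank, barcode_len=BARCODE_LEN):
    """Count bar codes per sample, requiring exact index and flank matches."""
    counts = {sample: Counter() for sample in index_to_sample.values()}
    for read in reads:
        # Exact match to the inline index that identifies the sample.
        sample = index_to_sample.get(read[:INDEX_LEN])
        if sample is None:
            continue
        # Exact match to the flank upstream of the bar code (searched after the index).
        flank_start = read.find(upstream_flank, INDEX_LEN)
        if flank_start < 0:
            continue
        bc_start = flank_start + len(upstream_flank)
        barcode = read[bc_start:bc_start + barcode_len]
        # Skip reads that end before a full-length bar code; no quality filter is applied.
        if len(barcode) == barcode_len:
            counts[sample][barcode] += 1
    return counts


def keep_expected(counts, expected_barcodes):
    """Later-stage filter: drop bar codes that do not exactly match an expected one."""
    return {
        sample: Counter({bc: n for bc, n in c.items() if bc in expected_barcodes})
        for sample, c in counts.items()
    }
```

For TruSeq P7 indexes, the sample lookup would be skipped, because demultiplexing has already been done by the Illumina software.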
Given a table of bar codes, where they map in the genome, and their counts in each sample, we estimate strain fitness and gene fitness values, along with their reliability, using a custom R script (FEBA.R). Roughly, strain fitness is the normalized log2 ratio of counts between the treatment sample (i.e., after growth in a certain medium) and the reference “time-zero” sample. Gene fitness is the weighted average of the strain fitness values for that gene, and a t score is computed based on the consistency of the strain fitness values for each gene. Ideally, the time-zero and treatment samples are sequenced in the same lane. Also, we usually have multiple replicates of each time-zero sample, with independent extraction of genomic DNA and independent PCR with a different index. We sum the per-strain counts across replicate time-zero samples.
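The rough definitions above can be sketched in Python as follows. This is an illustration only and does not reproduce FEBA.R: the pseudocount, the median centering used as the normalization, the caller-supplied weights, the variance floor, and the exact form of the t-like score are all assumptions, as are the function names.

```python
import math
from statistics import median


def sum_time_zeros(replicates):
    """Sum per-strain counts across replicate time-zero samples."""
    total = {}
    for rep in replicates:
        for strain, n in rep.items():
            total[strain] = total.get(strain, 0) + n
    return total


def strain_fitness(treatment, time_zero, pseudocount=1.0):
    """Log2 ratio of treatment to time-zero counts per strain.

    The pseudocount and the median centering are assumed stand-ins
    for the normalization, which the text does not spell out.
    """
    raw = {
        s: math.log2((treatment.get(s, 0) + pseudocount) / (t0 + pseudocount))
        for s, t0 in time_zero.items()
    }
    center = median(raw.values())
    return {s: v - center for s, v in raw.items()}


def gene_fitness(fitness, weights, strains, min_var=1e-6):
    """Weighted average of strain fitness for one gene, plus a t-like score.

    The score divides the gene fitness by a standard error derived from the
    weighted spread of the strain values (an assumed form, with an assumed
    variance floor to avoid division by zero).
    """
    w = [weights[s] for s in strains]
    f = [fitness[s] for s in strains]
    wsum = sum(w)
    mean = sum(wi * fi for wi, fi in zip(w, f)) / wsum
    var = sum(wi * (fi - mean) ** 2 for wi, fi in zip(w, f)) / wsum
    se = math.sqrt(max(var, min_var) / len(strains))
    return mean, mean / se
```

In this sketch, a gene whose strains agree closely gets a small standard error and therefore a t-like score of large magnitude, which mirrors the idea that the score reflects the consistency of the strain fitness values.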