(i) Copy number profiles are calculated largely as described in our previous publication (Boeva et al., 2010). The most important features of the procedure are: (a) the option to use GC-content and mappability profiles to normalize read counts if a control sample is unavailable; (b) proper characterization of overdiploid genomes; (c) correction for possible contamination by normal cells when constructing the copy number profile of a tumor genome. The new tool Control-FREEC can also be used on non-mammalian genomes and includes many new user control settings, such as (a) defining the program's behavior in low mappability regions (
(ii) We characterize the allelic content via the B allele frequency (BAF) introduced previously for SNP arrays (Popova et al., 2009). To evaluate allelic content, we consider only genomic positions that are known SNPs (Sherry et al., 2001). By the B allele, we mean the alternative variant in the SNP database (dbSNP). SNPs that are homozygous in the genome being considered give no information about allelic content (in SNP arrays they are denoted as non-informative); therefore, putatively homozygous positions are discarded. A position is discarded if the probability of having variation due to sequencing errors under the condition of actual homozygosity is greater than a specified threshold (
We calculate the total coverage and B-allele coverage for each known putatively heterozygous SNP position. For each window j, we calculate the median absolute deviation of the BAF values from 0.5: $\mathrm{Med}_j = \operatorname{median}_i\,|x_{ij} - 0.5|$, where $\{x_{ij}\}$ are the BAF values of the remaining SNP positions i in window j. We segment $\{\mathrm{Med}_j\}$ using the same lasso-based algorithm as used for copy numbers (Harchaoui and Lévy-Leduc, 2008).
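The per-window statistic described above can be illustrated with a minimal Python sketch. It assumes fixed-size, non-overlapping windows indexed by integer division of the SNP coordinate; the function name `window_baf_medians` and its parameters are hypothetical and are not part of Control-FREEC. Homozygous filtering and segmentation are not included.

```python
from statistics import median


def window_baf_medians(positions, total_cov, b_cov, window_size):
    """Return {window index: median(|BAF - 0.5|)} over the SNPs in each window.

    positions   -- SNP coordinates on one chromosome (assumed heterozygous)
    total_cov   -- total read depth at each SNP
    b_cov       -- number of reads supporting the B allele at each SNP
    window_size -- width of the fixed, non-overlapping windows (an assumption)
    """
    deviations = {}
    for pos, total, b in zip(positions, total_cov, b_cov):
        if total == 0:
            continue  # BAF is undefined without coverage
        baf = b / total
        deviations.setdefault(pos // window_size, []).append(abs(baf - 0.5))
    # One median per window that contains at least one usable SNP
    return {w: median(vals) for w, vals in sorted(deviations.items())}


if __name__ == "__main__":
    meds = window_baf_medians(
        positions=[100, 200, 300, 1100, 1200],
        total_cov=[10, 10, 10, 20, 20],
        b_cov=[5, 7, 2, 10, 15],
        window_size=1000,
    )
    print(meds)
```

Taking the absolute deviation from 0.5 folds the two symmetric BAF bands (for example 1/3 and 2/3) into one value, so a single median summarizes the allelic imbalance of the window.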
(iii) We predict genotype status for each genomic segment independently, by choosing the allelic content that corresponds to the maximal log-likelihood, given the copy number detected previously.
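As an illustration of choosing the allelic content by maximal log-likelihood for a given copy number, the sketch below scores each candidate number of B copies with a symmetric two-component binomial model of the B-allele read counts. This likelihood model, the error floor `eps`, and the function names are assumptions made for the example; the text above does not specify the model that Control-FREEC uses.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def segment_log_likelihood(snps, b_copies, copy_number, eps=0.01):
    """Log-likelihood of (B reads, depth) pairs given b_copies out of copy_number.

    Each SNP is modeled as an equal mixture of expected BAF p and 1 - p,
    because the B allele may sit on either haplotype (assumed model).
    """
    # Clip p away from 0 and 1 so that sequencing errors do not give log(0)
    p = min(max(b_copies / copy_number, eps), 1 - eps)
    total = 0.0
    for b_reads, depth in snps:
        l1 = log_binom_pmf(b_reads, depth, p)
        l2 = log_binom_pmf(b_reads, depth, 1 - p)
        m = max(l1, l2)  # log-sum-exp for numerical stability
        total += m + math.log(0.5 * math.exp(l1 - m) + 0.5 * math.exp(l2 - m))
    return total


def predict_genotype(snps, copy_number):
    """Return the genotype string (e.g. 'AAB') with the highest log-likelihood."""
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    # By symmetry, only 0..copy_number // 2 copies of B need to be considered
    best = max(range(copy_number // 2 + 1),
               key=lambda k: segment_log_likelihood(snps, k, copy_number))
    return "A" * (copy_number - best) + "B" * best


if __name__ == "__main__":
    segment = [(10, 30), (20, 30), (11, 30), (19, 30)]  # (B reads, depth)
    print(predict_genotype(segment, copy_number=3))
```

Restricting candidates to at most half of the copies being B reflects the A/B symmetry: AAB and ABB produce the same folded BAF pattern and cannot be distinguished without phasing.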
Input and output: the input consists of a SAM pileup (