Synthetic Generation of Realistic HIV Sequence Data

These methods were applied to population-based HIV sequences from chronically infected, antiretroviral-naïve, HLA-typed individuals from two cohorts: the HOMER cohort from British Columbia, Canada, consisting of 567 predominantly clade B gag sequences [9], and the Durban cohort from Durban, South Africa, consisting of 522 predominantly clade C p17/p24 gag sequences [10],[44]. Individuals in the HOMER and Durban cohorts were HLA-typed to two- and four-digit resolution, respectively. Here, we truncate the Durban data to two digits for comparison with the HOMER cohort. Viral sequences were determined by nested reverse-transcriptase polymerase chain reaction (RT-PCR) amplification of extracted plasma HIV RNA followed by bulk sequencing, as previously described [8]–[10]. Phylogenies were constructed from these sequences using PHYML [50] with the general time reversible model, estimating all parameters by maximum likelihood.
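The truncation of four-digit HLA types to two-digit resolution can be illustrated with a minimal sketch. The allele spellings handled here (for example "B*5701" or "B*57:01") and the function name truncate_hla are assumptions made for illustration; the source does not state how alleles were encoded.

```python
import re

def truncate_hla(allele: str) -> str:
    """Reduce an HLA allele name to two-digit resolution.

    Assumes names of the form LOCUS*DIGITS, with or without colon
    separators, e.g. "B*5701" or "B*57:01" (an assumption of this sketch).
    """
    match = re.fullmatch(r"([A-Za-z0-9]+)\*(\d{2})[:\d]*", allele.strip())
    if match is None:
        raise ValueError(f"unrecognised HLA allele name: {allele!r}")
    locus, first_field = match.groups()
    return f"{locus}*{first_field}"

# Four-digit types collapse onto the two-digit type; two-digit types are unchanged.
for name in ["B*5701", "B*57:01", "Cw*0702", "A*02"]:
    print(name, "->", truncate_hla(name))
```

After truncation, alleles that differ only in the second field map to the same two-digit label, which is what makes the two cohorts comparable.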
Synthetic datasets were designed to mimic the real datasets as closely as possible. We first fit a specified model to the real data to identify parameters and q-values for each predictor-target pair. We then planted associations based on the significant (q≤0.2) predictor-target pairs identified from the real data. Specifically, we generated a synthetic target amino acid for each consensus amino acid in the sequence, such that (1) if the amino acid had no significant (q≤0.2) associations, then the amino acid was generated according to the parameters of the independent evolution model (the null model from the univariate case), and (2) if the amino acid had M>0 associations, then the amino acid was generated according to the given multivariate model with the predictor parameters s^1, …, s^M taken from the real data. When an observation was missing in the real data, the corresponding observation in the synthetic data was also made to be missing. We treated amino acid insertions/deletions and mixtures as missing data.
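The branching logic of this procedure can be sketched as follows. This is only an outline under stated assumptions: the two samplers are hypothetical stand-ins for the independent evolution model and the multivariate model (which in the paper operate along a phylogeny and are not reproduced here), and the single-letter encoding of observations is also assumed.

```python
import random
from typing import Callable, List, Optional, Sequence

Q_THRESHOLD = 0.2                          # significance cutoff quoted in the text
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumed encoding of observed residues

def to_observation(symbol: str) -> Optional[str]:
    """Treat indels, mixtures and any non-standard symbol as missing (None)."""
    return symbol if symbol in STANDARD_AA else None

def synthesize_site(
    real_column: Sequence[Optional[str]],
    associations: List[dict],
    sample_null: Callable[[int, random.Random], List[str]],
    sample_multivariate: Callable[[int, List[dict], random.Random], List[str]],
    rng: random.Random,
) -> List[Optional[str]]:
    """Generate one synthetic column and copy the real missingness pattern."""
    significant = [a for a in associations if a["q"] <= Q_THRESHOLD]
    n = len(real_column)
    if significant:
        # Case (2): M > 0 associations, use the multivariate model.
        drawn = sample_multivariate(n, significant, rng)
    else:
        # Case (1): no significant associations, use the null model.
        drawn = sample_null(n, rng)
    return [None if obs is None else value for obs, value in zip(real_column, drawn)]

# Toy stand-in samplers, purely hypothetical.
def toy_null(n: int, rng: random.Random) -> List[str]:
    return [rng.choice("AG") for _ in range(n)]

def toy_multivariate(n: int, assoc: List[dict], rng: random.Random) -> List[str]:
    return ["K"] * n

real = [to_observation(s) for s in ["A", "-", "G", "X", "A"]]
print(synthesize_site(real, [{"predictor": "B*57", "q": 0.01}],
                      toy_null, toy_multivariate, random.Random(0)))
```

Each sampler draws a whole column at once because evolution along a tree couples the individuals; the missingness mask is applied afterwards so that the synthetic column is missing exactly where the real one is.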
Our goal was to generate data that is as realistic as possible, both in the values of the parameters used and the number of predictors deemed correlated with the target. Because our recall rate is less than 100% (see section on synthetic results), planting only those associations that are found in the real data would result in a smaller proportion of synthetic predictor-target pairs called significant than real predictor-target pairs called significant. We therefore planted two associations for every observed significant association in the real data and reduced the number of independently evolving codons accordingly. For the Noisy Add model, this procedure planted 72 HLA-codon and 612 codon-codon associations in the HOMER cohort and 114 HLA-codon and 952 codon-codon associations in the combined HOMER-Durban cohort. In hindsight, doubling the number of planted associations was an overcompensation, as experiments on this synthetic data yielded a 75% recall rate. Nonetheless, the doubling produced a reasonable result, as Noisy Add declared 0.56% of all synthetic predictor-target pairs significant at q≤0.2 compared to 0.65% of all predictor-target pairs in the real data for the combined HOMER-Durban cohort.
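The counts quoted above can be checked with a few lines of arithmetic. The halved counts are an inference from the statement that two associations were planted per observed significant association, and the 1/recall compensation factor is a back-of-the-envelope assumption that ignores false discoveries; neither figure is reported in the source.

```python
PLANT_FACTOR = 2  # two planted associations per observed significant association

# Planted counts for the Noisy Add model, as quoted in the text.
planted = {
    "HOMER": {"HLA-codon": 72, "codon-codon": 612},
    "HOMER-Durban": {"HLA-codon": 114, "codon-codon": 952},
}

def implied_real_counts(planted_counts: dict, factor: int = PLANT_FACTOR) -> dict:
    """Observed significant associations implied by the planting factor."""
    return {kind: n // factor for kind, n in planted_counts.items()}

def compensation_factor(recall: float) -> float:
    """Planting factor that would offset a given recall, ignoring false positives."""
    return 1.0 / recall

for cohort, counts in planted.items():
    print(cohort, "planted:", sum(counts.values()),
          "implied observed:", implied_real_counts(counts))

# At the reported 75% recall, roughly 1.33 planted associations per observed
# one would have offset the loss under this simplification, rather than 2.
print("compensation factor at 75% recall:", round(compensation_factor(0.75), 2))

# Ratio of synthetic to real fractions declared significant (0.56% vs 0.65%).
print("synthetic/real significant ratio:", round(0.56 / 0.65, 2))
```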

Carlson J.M., Brumme Z.L., Rousseau C.M., Brumme C.J., Matthews P., Kadie C., Mullins J.I., Walker B.D., Harrigan P.R., Goulder P.J, & Heckerman D. (2008). Phylogenetic Dependency Networks: Inferring Patterns of CTL Escape and Codon Covariation in HIV-1 Gag. PLoS Computational Biology, 4(11), e1000225.

Publication 2008

Corresponding Organization : University of KwaZulu-Natal

Variable analysis

independent variables

HLA type of individuals
Consensus amino acid sequence

dependent variables

Amino acid generated for each consensus amino acid in the sequence

control variables

Individuals in the HOMER and Durban cohorts were HLA-typed to two- and four-digit resolution, respectively; the Durban data were truncated to two digits for comparison with the HOMER cohort.
When an observation was missing in the real data, the corresponding observation in the synthetic data was also made to be missing.
Amino acid insertions/deletions and mixtures were treated as missing data.
