Comprehensive Manual Gene Annotation Workflow

The GENCODE gene set is created by merging the results of manual and computational gene annotation methods. Manual gene annotation has two major modes of operation: clone-by-clone and targeted annotation. ‘Clone-by-clone’ annotation involves ‘walking’ across a genomic region, investigating the sequence, aligned expression data and computational predictions for each BAC clone. In doing so, an expert annotator investigates all possible genic features and considers all possible annotations and biotypes simultaneously. We believe this approach carries substantial advantages. For example, the decision to annotate a locus as protein-coding or pseudogenic benefits from being able to weigh both possibilities in light of all available evidence. This process helps prevent false positive and false negative misclassifications. Targeted annotation is designed to answer specific questions such as ‘is there an unannotated protein-coding gene in this position?’ Ranked target lists are generated by computational analysis based, for example, on transcriptomic data, shotgun proteomic data or conservation measures. Over the last two years, mouse annotation has been dominated by the clone-by-clone approach, while the human genome has been refined via targeted reannotation, with one exception: the annotation of human assembly patches and haplotypes released by the Genome Reference Consortium (15), which follows a clone-by-clone approach.
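The idea of a ranked target list can be illustrated with a minimal sketch. It assumes each candidate region carries three made-up evidence values (transcriptomic support, shotgun proteomic peptide hits and a conservation score) that are combined by a simple weighted sum. The `Candidate` fields, the `priority` function, the weights and the cap on peptide hits are all hypothetical choices for illustration; the source does not describe how the actual ranking is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    region: str          # placeholder region label (not real coordinates)
    rna_support: float   # hypothetical transcriptomic evidence score in [0, 1]
    peptide_hits: int    # hypothetical count of shotgun proteomic peptide matches
    conservation: float  # hypothetical conservation score in [0, 1]


def priority(c, w_rna=1.0, w_pep=1.0, w_cons=1.0):
    """Combine the evidence into one score (assumed weighted sum)."""
    # Cap peptide hits at 5 so a single evidence type cannot dominate the score.
    pep = min(c.peptide_hits, 5) / 5.0
    return w_rna * c.rna_support + w_pep * pep + w_cons * c.conservation


def rank_targets(candidates, **weights):
    """Return candidates sorted from highest to lowest priority."""
    return sorted(candidates, key=lambda c: priority(c, **weights), reverse=True)


candidates = [
    Candidate("regionA", rna_support=0.2, peptide_hits=0, conservation=0.9),
    Candidate("regionB", rna_support=0.8, peptide_hits=3, conservation=0.7),
    Candidate("regionC", rna_support=0.5, peptide_hits=0, conservation=0.1),
]
ranked = rank_targets(candidates)
print([c.region for c in ranked])
```

In this toy input, the region supported by all three evidence types is ranked first, which is the sort of locus an annotator would examine earliest.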
Over the last two years, we have focused on two broad areas: completing the first pass manual annotation across the entire mouse reference genome and a dedicated effort to improve the annotation of protein-coding genes in human and mouse.
We have completed the annotation of novel protein-coding genes, lncRNAs and pseudogenes, plus QC and updating previous annotation where necessary for mouse chromosomes 9, 10, 11, 12, 13, 14, 15, 16 and 17. These updates bring the fraction of the mouse genome with completed first pass manual annotation to approximately 97%. In addition, we have continued to work with the NCBI and Mouse Genome Informatics project at the Jackson Laboratory to resolve annotation differences for protein-coding, pseudogene and lncRNA loci. For protein-coding genes this is under the umbrella of the Consensus Coding Sequence (CCDS) project (16).
We have also manually investigated unannotated regions of high protein-coding potential identified by whole genome analysis using PhyloCSF (17) (a tool described in more detail below). In human, this led to the addition of 144 novel protein-coding genes and 271 pseudogenes (of which 42 were unitary pseudogenes). In mouse, we annotated orthologous loci for all but 11 of the 144 human protein-coding genes. We have also revisited the annotation of all olfactory receptor loci in both human and mouse, using RNA-seq data to define 5′ and 3′ UTR sequences for ∼1400 loci. In human we have also targeted a ‘deep dive’ manual reannotation of genes on clinical panels for paediatric neurological disorders to identify missing functional alternative splicing. Incorporating second and third generation transcriptomic data, we reannotated ∼190 genes and added more than 3600 alternatively spliced transcripts, including ∼1400 entirely novel exons and an additional ∼30 kb of CDS. We have also completed an effort to capture all recently described unannotated microexons (18) into GENCODE, and added a further 146 novel microexons mined from public SLRseq data (19).
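Selecting unannotated regions from a set of high-scoring candidates comes down to an interval comparison against the existing annotation, which can be sketched as follows. This is not PhyloCSF and not the GENCODE pipeline: the `Interval` type, the half-open coordinate convention and the example coordinates are assumptions made for illustration, and the candidate regions are taken as already scored elsewhere.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def unannotated(candidates, annotated):
    """Keep candidate regions that overlap no annotated interval.

    Quadratic in the worst case; fine for a sketch, whereas a genome-scale
    run would normally use sorted intervals or an interval tree.
    """
    return [c for c in candidates if not any(overlaps(c, g) for g in annotated)]


# Made-up coordinates for illustration only.
annotated_genes = [Interval("chr1", 100, 500), Interval("chr2", 0, 50)]
scored_regions = [
    Interval("chr1", 450, 600),  # overlaps an annotated gene on chr1
    Interval("chr1", 500, 700),  # starts exactly where the gene ends: no overlap
    Interval("chr2", 10, 20),    # contained in an annotated gene on chr2
    Interval("chr3", 5, 15),     # chromosome with no annotation in this toy set
]
novel = unannotated(scored_regions, annotated_genes)
print(novel)
```

Regions that survive the filter are candidates for manual inspection only; whether each one is a protein-coding gene or a pseudogene remains the annotator's decision, as described above.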
As part of the CCDS collaboration with RefSeq, we have checked a large subset of human loci where there was disagreement over gene biotype. Similarly, we have checked all UniProt manually annotated and reviewed (i.e. Swiss-Prot) accessions that lack an equivalent in GENCODE. As a result, we added 32 novel protein-coding loci to GENCODE and rejected more than 200 putative coding loci. Finally, we are manually reviewing genes previously annotated as protein-coding, but with weak or no support based on a method incorporating UniProt, APPRIS, PhyloCSF, Ensembl comparative genomics, RNA-seq, mass spectrometry and variation data (20,21). Of the 821 loci investigated to date, 54 have had their coding status removed while a further 110 potentially dubious cases remain under review.
The approach taken is reflected in the kinds of updates captured in the annotation. For example, the targeted reannotation in human leads to the annotation of few novel protein-coding loci but many novel transcripts at updated protein-coding and lncRNA loci. Conversely, in mouse the emphasis on clone-by-clone annotation identifies many more novel loci and transcripts across a broader range of biotypes (Figure 1).

Frankish A., Diekhans M., Ferreira A.M., Johnson R., Jungreis I., Loveland J., Mudge J.M., Sisu C., Wright J., Armstrong J., Barnes I., Berry A., Bignell A., Carbonell Sala S., Chrast J., Cunningham F., Di Domenico T., Donaldson S., Fiddes I.T., García Girón C., Gonzalez J.M., Grego T., Hardy M., Hourlier T., Hunt T., Izuogu O.G., Lagarde J., Martin F.J., Martínez L., Mohanan S., Muir P., Navarro F.C., Parker A., Pei B., Pozo F., Ruffier M., Schmitt B.M., Stapleton E., Suner M.M., Sycheva I., Uszczynska-Ratajczak B., Xu J., Yates A., Zerbino D., Zhang Y., Aken B., Choudhary J.S., Gerstein M., Guigó R., Hubbard T.J., Kellis M., Paten B., Reymond A., Tress M.L., & Flicek P. (2018). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47(Database issue), D766-D773.

Publication 2018

Corresponding Organization : European Bioinformatics Institute

Other organizations : University of California, Santa Cruz, University of Lausanne, University Hospital of Bern, University of Bern, Massachusetts Institute of Technology, Broad Institute, Brunel University of London, Yale University, Institute of Cancer Research, Centre for Genomic Regulation, Barcelona Institute for Science and Technology, Spanish National Cancer Research Centre, University of Warsaw, The Ohio State University, Whitney Museum of American Art, Pompeu Fabra University, Guy's Hospital, King's College London
