The VEP’s caches are built for each of Ensembl’s primary species (70 species as of Ensembl version 84); the files are updated in concert with Ensembl’s release cycle, ensuring access to the latest annotation data. Cache files for all previous releases remain available on Ensembl’s FTP archive site [91] to facilitate reproducibility. For 15 of these species there are three types of cache files: one with the Ensembl transcripts, a “refseq” one with the RefSeq transcripts, and a “merged” one that contains both. Caches for both the latest GRCh38 and previous GRCh37 (hg19) human genome builds are maintained. The human GRCh38 cache file is around 5 gigabytes in size, including transcript, regulatory, and variant annotations as well as pathogenicity algorithm predictions. Performance using the cache is substantially faster than using the database; analyzing a small VCF file of 175 variants takes 5 seconds using the cache versus 40 seconds using the public Ensembl variation database over a local network (performance can be expected to be slower when using a remote database connection).
The VEP can use FASTA format files of genomic sequence for sequence retrieval. This functionality is needed to generate HGVS notations and to quality check input variants against the reference genome. The VEP uses either an htslib-based indexer [92] or BioPerl’s FASTA DB interface to provide fast random access to a whole genome FASTA file. Sequence may alternatively be retrieved from an Ensembl core database, with corresponding performance penalties.
Cache and FASTA files are automatically downloaded and set up using the VEP package’s installer script, which utilizes checksums to ensure the integrity of downloaded files. The installer script can also download plugins by consulting a registry. The VEP package also includes a script, gtf2vep.pl, to build custom cache files. This requires a local GFF or general transfer format (GTF) file that describes transcript structures and a FASTA file of the genomic sequence.
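The fast random access to a whole genome FASTA file mentioned above relies on an index that maps each sequence name to a byte offset, so a region can be read with a single seek instead of a scan of the whole file. The following is a minimal pure-Python sketch of that general idea, not the VEP's, htslib's, or BioPerl's implementation. The function names build_index and fetch, the demo file, and its sequences are all hypothetical, and the sketch assumes that every line of a record except the last has the same width.

```python
import os
import tempfile


def build_index(fasta_path):
    """Scan the FASTA once and record where each sequence's bases start.

    Returns {name: (length, offset, bases_per_line, bytes_per_line)}.
    Assumes all lines of a record except the last share one width.
    """
    index = {}
    name = None
    length = offset = line_bases = line_bytes = 0
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                # A new header or end of file closes the previous record.
                if name is not None:
                    index[name] = (length, offset, line_bases, line_bytes)
                if not line:
                    break
                name = line[1:].split()[0].decode()
                length, offset = 0, handle.tell()
                line_bases = line_bytes = 0
            else:
                bases = len(line.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_bytes = bases, len(line)
                length += bases
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) using one seek and one read."""
    length, offset, line_bases, line_bytes = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError("coordinates outside sequence")

    def byte_pos(i):
        # i is a 0-based base position; skip whole lines, then step within the line.
        return offset + (i // line_bases) * line_bytes + i % line_bases

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Demo on a tiny two-record file (made-up sequences, not real genome data).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.fa")
with open(demo_path, "w", newline="\n") as out:
    out.write(">chr1 demo\nACGTACGTAC\nGGGTTTCCCA\nAT\n>chr2\nTTTT\n")
demo_index = build_index(demo_path)
region = fetch(demo_path, demo_index, "chr1", 9, 12)
print(region)
```

The index is built once and can be reused for any number of lookups, which is why this approach scales to genome-sized files where rereading the file for every variant would not.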
independent variables
Ensembl's primary species (70 species as of Ensembl version 84)
Ensembl's release cycle
Genome builds (GRCh38 and GRCh37 (hg19) for human)
dependent variables
Cache file size (around 5 gigabytes for human GRCh38 cache file)
Performance (5 seconds using the cache versus 40 seconds using the public Ensembl variation database)
control variables
Ensembl's FTP archive site for cache files of all previous releases
FASTA format files of genomic sequence for sequence retrieval
Htslib-based indexer or BioPerl's FASTA DB interface for fast random access to a whole genome FASTA file
Ensembl core database for sequence retrieval
Checksums to ensure the integrity of downloaded files
Registry for downloading plugins
Local GFF or GTF file that describes transcript structures and a FASTA file of the genomic sequence for building custom cache files
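The checksum item in the list above refers to comparing a digest of each downloaded file against a published value before the file is used. The source does not say which checksum algorithm or file layout the installer relies on, so the sketch below is only an illustration of the general check: SHA-256, the function names file_digest and verify_download, and the demo file are all assumptions.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large downloads need little memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_hex, algorithm="sha256"):
    """Return True only if the file's digest matches the published value."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Demo with a small stand-in file (hypothetical content, not a real cache file).
check_dir = tempfile.mkdtemp()
check_path = os.path.join(check_dir, "download.bin")
with open(check_path, "wb") as out:
    out.write(b"example cache payload")
published = hashlib.sha256(b"example cache payload").hexdigest()
print(verify_download(check_path, published))
```

A mismatch would indicate a truncated or corrupted download, in which case the file should be fetched again before use.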