The data collection process is a pipeline that begins by defining the full set of GeneCards genes, which is obtained from three primary sources. First, the complete current snapshot of HGNC-approved symbols (4) is used as the core gene list. Next, human Entrez Gene (5) entries that are distinct from the HGNC genes are added. Finally, human Ensembl (6) records are matched against the emerging gene list via our GeneLoc exon-based unification algorithm (12); those not found to be equivalent to any gene already in the set are included as novel Ensembl-based GeneCards gene entries. These primary sources provide annotations for aliases, descriptions, previous symbols, gene category, location, summaries, paralogs and ncRNA details. Once the gene list is in place with these core annotations, over 80 data sources, including those noted above and others (12,18,22,36,46,47), are mined for thousands of additional descriptors.
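The three-step assembly of the gene list can be illustrated with a minimal Python sketch. It is not the GeneCards implementation: the record layout (`hgnc_id`, `symbol`, `ensembl_id`, `exons`), the names `GeneEntry`, `Exon` and `build_gene_list`, and the use of an HGNC cross-reference to decide whether an Entrez Gene entry is already covered are all assumptions made for illustration. In particular, `is_equivalent` is a toy exon-overlap predicate standing in for GeneLoc's exon-based unification algorithm, whose actual rules are described in (12).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Exon:
    """A single exon as a closed interval on one strand (hypothetical layout)."""
    chrom: str
    strand: str
    start: int
    end: int


@dataclass
class GeneEntry:
    """One entry in the emerging gene list, tagged with its originating source."""
    symbol: str
    source: str
    exons: List[Exon] = field(default_factory=list)


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two exons share a chromosome and strand and their intervals intersect."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def is_equivalent(candidate_exons: List[Exon], entry: GeneEntry) -> bool:
    """Toy stand-in for exon-based unification (assumption, not GeneLoc's rules):
    a candidate matches an entry if any pair of their exons overlaps."""
    return any(exons_overlap(x, y) for x in candidate_exons for y in entry.exons)


def build_gene_list(hgnc_records: List[Dict],
                    entrez_records: List[Dict],
                    ensembl_records: List[Dict]) -> List[GeneEntry]:
    genes: List[GeneEntry] = []

    # Step 1: every HGNC-approved symbol forms the core of the list.
    hgnc_ids = set()
    for rec in hgnc_records:
        genes.append(GeneEntry(rec["symbol"], "HGNC", rec.get("exons", [])))
        hgnc_ids.add(rec["hgnc_id"])

    # Step 2: add Entrez Gene entries not already covered by an HGNC gene.
    # Assumption: coverage is decided by an HGNC cross-reference on the record.
    for rec in entrez_records:
        if rec.get("hgnc_id") not in hgnc_ids:
            genes.append(GeneEntry(rec["symbol"], "Entrez Gene", rec.get("exons", [])))

    # Step 3: add Ensembl records that match nothing already in the list.
    for rec in ensembl_records:
        exons = rec.get("exons", [])
        if not any(is_equivalent(exons, g) for g in genes):
            genes.append(GeneEntry(rec["ensembl_id"], "Ensembl", exons))

    return genes
```

The order of the steps matters in this sketch: because each source is compared only against what is already in the list, HGNC entries take precedence over Entrez Gene entries, which in turn take precedence over Ensembl records.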
The data collection and integration process runs periodically (typically every 3–5 months) to ensure ongoing access to recent updates. It culminates in an integrated database, which is available as plain text and XML files, as well as MySQL dumps.