The largest database of trusted experimental protocols

Gene Duplication

Gene Duplication is a fundamental genetic process in which a gene or segment of DNA is replicated, resulting in the creation of one or more additional copies of the original genetic material.
This process plays a crucial role in the evolution of genomes, allowing for the acquisition of new functions and the expansion of gene families.
Gene Duplication can occur through various mechanisms, such as unequal crossing-over, retrotransposition, or chromosomal duplications.
The duplicated genes may undergo subfunctionalization, neofunctionalization, or pseudogenization, contributing to the diversity and complexity of living organisms.
Understanding the mechanisms and consequences of Gene Duplication is essential for researchers studying evolutionary biology, genomics, and genetic disease.
PubCompare.ai, an AI-driven platform, can enhance your Gene Duplication research by providing access to relevant protocols from literature, pre-prints, and patents, and enabling AI-driven comparisons to identify the best protocols and products, streamlining your research process and improving reproducibility and accuracy.

Most cited protocols related to «Gene Duplication»

The rooted species tree is required in order to identify the correct out-group in each orthogroup tree, as correct gene tree rooting is critical for the orthology assessment from that tree [22]. Since orthogroups can potentially contain any subset of the species in the analysis, it is not sufficient to simply know the out-group for the complete species set. Instead, the complete rooted species tree is required. If the user knows the rooted species tree for the set of species being analyzed, then it is recommended to specify this tree manually at the command line to remove the possibility of species tree inference error. Such a tree can be provided as a Newick format text file. In the event that a species tree is not provided (or not known), then OrthoFinder automatically infers it.
Sets of one-to-one orthologs that are present in all species are often used for species tree inference; however, in real-world large-scale analyses, these can be rare [33]. A new algorithm, Species Tree from All Genes (STAG), was developed to allow species tree inference even for species sets with few or no complete sets of one-to-one orthologs present in all species [33]. Without this algorithm, species tree inference could fail if there were no sets of one-to-one orthologs present in all species. STAG infers the species tree using the most closely related genes within single-copy or multi-copy orthogroups. In benchmark tests, STAG [24] had higher accuracy than other leading methods for species tree inference, including maximum likelihood species tree inference from concatenated alignments of protein sequences, ASTRAL [38] and NJst [39].
The Species Tree Root Inference from Duplication Events (STRIDE) algorithm [22] is used to root the species tree in OrthoFinder. STRIDE was developed to enable the rooting of the species tree using only information available in the set of gene trees. STRIDE does this by identifying the set of well-supported in-group gene duplication events in the complete set of unrooted orthogroup trees, and using these events to infer a probability distribution over an unrooted STAG species tree for the location of its root. Similarly to STAG, STRIDE has been shown to identify the correct root of the species tree in multiple large-scale molecular phylogenetic data sets spanning a wide range of time scales and taxonomic groups [22]. In some cases, it is possible that there could be few duplications within the gene trees, and so STRIDE will not be able to identify the root of the species tree, or will only be able to exclude the root from clades in which gene duplication events are observed. In this case, ortholog inference should still not be significantly impacted since the rooting of the gene tree only affects ortholog inference in cases where gene duplication events are present [22]. This makes the STRIDE approach particularly suited to gene tree rooting for ortholog inference.
Publication 2019
Aster Plant Gene Duplication Genes Genes, vif Plant Roots Proteins Sequence Alignment Species Specificity Trees Wakerobin
A gene tree is the canonical representation of the evolutionary relationships between the genes in a gene family. Thus, ortholog inference from gene trees is an important goal. However, no automated software tools are available that provide genome-wide ortholog inference from gene trees. A number of challenges had to be addressed to enable this. These included the efficient partitioning of genes into small, non-overlapping sets such that all orthologs of a gene are contained in the same set as the original gene; scalable and accurate inference of gene trees from these gene sets; automatic rooting of these gene trees without a user-provided species tree; and robust ortholog inference in the presence of imperfect gene tree inference. The OrthoFinder workflow was designed to address each of these challenges and is described in detail below.
By default, OrthoFinder infers orthologs from the orthogroup trees (a gene tree for the orthogroup) using the steps shown in Fig. 2. Input proteomes are provided by the user using one FASTA file per species. Each file contains the amino acid sequences for the proteins in that species. Orthogroups are inferred using the original OrthoFinder algorithm [10]; an unrooted gene tree is inferred for each orthogroup using DendroBLAST [24]; the unrooted species tree is inferred from this set of unrooted orthogroup trees using the STAG algorithm [33]; this STAG species tree is then rooted using the STRIDE algorithm by identifying high-confidence gene duplication events in the complete set of unrooted orthogroup trees [22]; the rooted species tree is used to root the orthogroup trees; orthologs and gene duplication events are inferred from the rooted orthogroup trees by a novel hybrid algorithm that combines the “species-overlap” method [31] and the duplication-loss-coalescent model [32] (described below); and comparative statistics are calculated. All major steps of the algorithm are parallelized to allow optimal use of computational resources. Only the orthogroup inference was provided in the original implementation of OrthoFinder [10]; all other subsequent steps are new and described below.
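The species-overlap rule named in the workflow above can be illustrated with a minimal sketch. This is not OrthoFinder's hybrid algorithm, which also draws on the duplication-loss-coalescent model; it shows only the basic overlap test, in which a node of a rooted gene tree is called a duplication when the species sets on its two sides share at least one species. The nested-tuple tree encoding, the "species|gene" leaf naming, and the function names are assumptions made for illustration.

```python
from typing import List, Set, Tuple, Union

# Hypothetical encoding: a leaf is "species|gene"; an internal node is a
# 2-tuple of subtrees (a rooted, binary gene tree).
Tree = Union[str, tuple]
Event = Tuple[str, List[str], List[str]]


def species_of(leaf: str) -> str:
    """Extract the species name from a 'species|gene' leaf label."""
    return leaf.split("|", 1)[0]


def label_nodes(tree: Tree, events: List[Event]) -> Set[str]:
    """Return the species below `tree`; record one event per internal node.

    Events are appended in post-order (children before their parent).
    """
    if isinstance(tree, str):
        return {species_of(tree)}
    left, right = tree
    left_species = label_nodes(left, events)
    right_species = label_nodes(right, events)
    # Species-overlap test: shared species on both sides imply a duplication.
    kind = "duplication" if left_species & right_species else "speciation"
    events.append((kind, sorted(left_species), sorted(right_species)))
    return left_species | right_species


# Two paralogous copies (a1, a2) present in both human and mouse.
example_tree = (("human|a1", "mouse|a1"), ("human|a2", "mouse|a2"))
example_events: List[Event] = []
label_nodes(example_tree, example_events)
for kind, left, right in example_events:
    print(kind, left, right)
```

In this toy tree the two inner nodes separate human from mouse and are labeled speciations, while the root has both species on each side and is labeled a duplication.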
Publication 2019
Amino Acid Sequence Biological Evolution Gene Duplication Genes Genes, vif Genome Hybrids Proteins Proteome Trees
The tests for gene duplication event inference accuracy were performed on the simulated “flies” and “primates” datasets from [32] and a simulated “metazoa” dataset from [34]. To model real data, the flies and primates datasets used known species trees and parameters for divergence times, duplication rates, loss rates, population sizes, and generation times. Trees were simulated with varying effective population sizes and duplication rates so as to model incomplete lineage sorting [32, 34]. The flies dataset consisted of 12,000 trees with 12 species and 12,032 gene duplication events. The primates dataset consisted of 7500 trees with 17 species and 16,066 gene duplication events. The metazoa dataset was intended to emulate the complexity of real data by using heterogeneity in rates of duplication and loss, a complex model of sequence evolution, and then inferring trees with a homogeneous, simple model [34]. It consisted of 2000 gene trees with 40 species and 4967 gene duplication events. For comparison, Forester [29], DLCpar (full), DLCpar (search) [32], and the overlap algorithm (i.e., without OrthoFinder’s tree resolution) were also tested.
All methods were provided with the input rooted gene tree and, where appropriate, the rooted species tree (Forester and DLCpar). No other parameters required specification for any of the other methods. The rooted gene trees were provided as part of the simulated data for the flies and primates datasets. Multiple sequence alignment (MSA) files were provided for the metazoa dataset. For this dataset, gene trees were inferred from the MSAs using FastTree so as to also include a potential level of tree inference error and were rooted with reconroot [32]. The OrthoFinder rooting algorithm was not used so as to avoid inadvertently biasing the results in favor of OrthoFinder. All methods were provided with the same input rooted gene trees. The complete set of gene duplication events identified by each of the methods was compared against the ground truth gene duplication events. An inferred gene duplication was identified as correct if the two sets of genes observed post-duplication exactly matched the two sets of genes post-duplication from the ground truth data.
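The correctness criterion stated above lends itself to a short sketch: each duplication is represented as an unordered pair of gene sets, and an inferred event counts as correct only when both sets match a ground-truth event exactly. The function names are hypothetical, and reporting precision and recall is an assumption about how such a comparison might be summarized, not something stated in the protocol.

```python
from typing import FrozenSet, Iterable, Tuple

GeneSets = Tuple[Iterable[str], Iterable[str]]


def as_event(genes_a: Iterable[str], genes_b: Iterable[str]) -> FrozenSet[FrozenSet[str]]:
    """Represent a duplication as an unordered pair of post-duplication gene sets."""
    return frozenset((frozenset(genes_a), frozenset(genes_b)))


def score_duplications(inferred: Iterable[GeneSets], truth: Iterable[GeneSets]):
    """Count exact matches and return (n_correct, precision, recall)."""
    inferred_events = {as_event(a, b) for a, b in inferred}
    true_events = {as_event(a, b) for a, b in truth}
    correct = inferred_events & true_events  # exact match of both gene sets
    precision = len(correct) / len(inferred_events) if inferred_events else 0.0
    recall = len(correct) / len(true_events) if true_events else 0.0
    return len(correct), precision, recall


# Toy data: the first inferred event matches (order of sets and genes is
# irrelevant); the second has an extra gene on one side and so does not.
truth = [(["a1", "b1"], ["a2", "b2"]), (["c1"], ["c2"])]
inferred = [(["a2", "b2"], ["b1", "a1"]), (["c1"], ["c2", "d2"])]
print(score_duplications(inferred, truth))
```

Using frozensets makes the comparison independent of the order in which genes, or the two sides of a duplication, are listed.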
The performance testing of the methods for identifying gene duplication events was performed on the orthogroup trees from the 4- to 128-species Fungi datasets as inferred by OrthoFinder with default parameters. The commands for Forester and DLCpar were run in parallel using GNU Parallel [42] using 16 threads on these gene trees. The OrthoFinder method was run via the “scripts/resolve.py” program included as part of the OrthoFinder distribution. To allow testing, the species-overlap method was also implemented in OrthoFinder and was run using the same program with the option “--no_resolve”.
Publication 2019
Biological Evolution Diptera Fungi Gene Duplication Genes Genetic Heterogeneity Genetic Testing Homozygote Metazoa Primates Sequence Alignment Trees
CNV burden was compared between cases and controls for rare CNVs (<1%), using CNV length excluding gaps and regions annotated as segmental duplications (hg18). The distribution of these CNVs is indicated in Supplementary Figure 6. Burden was defined using only the largest CNV to account for the large number of bases encompassed in small CNVs and the significant difference in array resolutions between cases and controls. Statistical comparisons utilized the Peto & Peto modification of the Gehan-Wilcoxon test (due to non-proportional hazard ratios) to assess overall burden. For significance at specific thresholds we utilized Fisher's exact test. Significance for CNV enrichment was enumerated for all RefSeq genes (NCBI36). All isoforms for each gene were combined into a single entry representing all possible coding bases. Rare CNVs from cases and all control CNVs were then enumerated for only cases where the CNV intersects an exon. The resulting counts were then compared using the one-tailed Fisher's exact test. Likelihood ratios were calculated as per standard formulae, and confidence bounds were estimated by using the binomial confidence interval for case and control counts calculated by the Clopper–Pearson exact tail area method as described in Rosenfeld et al. [59]. Additionally, we calculated an empirical p-value for genes affected by rare CNVs. To do so we first excluded CNVs residing in regions with elevated mutation rates or unreliable CNV detection. These include subtelomeric CNVs initiating in the first 1.5 Mbp of each chromosome, CNVs with over 75% of bases intersecting hotspots (145.1 Mbp across 58 sites) and segmental duplications (130.4 Mbp across 7,264 sites), and CNVs initiating or terminating in a centromere gap region.
All CNVs under 10 Mbp were then randomly shuffled under these constraints for cases and controls (chromosome selection was weighted by the number of bases not filtered), and Fisher's exact tests were calculated for deletions and duplications of each gene 20,000 times. The empirical p-value was defined as the number of simulations more significant than observed plus one divided by the number of simulations plus one. CNV burden for regions was also enumerated using a windowed analysis of rare case CNVs over 250 kbp. Window starts/ends were defined based on all unique breakpoints in the signature array. Breakpoint pairs under 50 kbp were then filtered as these represent the uncertainty in edges of Signature calls. Counts for p-values are based on 40% coverage of each window by cases (over 250 kbp) or controls (all CNVs). Significance was calculated using the one-tailed Fisher's exact test, and Supplementary Figure 2 shows the negative logarithm of the p-value. In many cases the critical region may represent multiple subregions that individually reach significance. Here, we report the larger region where smaller subregions are indicated by a number of additional CNVs over the background preventing refinement to a single candidate gene. Due to high prior probability of pathogenicity for large CNVs, the lack of independence between genes disrupted by CNVs, and the high odds ratio for most pathogenic loci, we have chosen to report nominal significance in all cases in addition to the Benjamini-Hochberg q-value, which represents an overestimate of the false discovery rate in our analyses [60]. Please see the Supplementary Note for details on our interpretation of q-values in this study.
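Two of the calculations described above can be sketched briefly: a one-tailed Fisher's exact test on case and control counts, and the empirical p-value formula. This is an illustrative sketch, not the authors' code. The function names and toy counts are hypothetical, the test is written as a plain hypergeometric sum using only the standard library, and "more significant" is assumed to mean a strictly smaller p-value.

```python
from math import comb
from typing import Sequence


def fisher_one_tailed(case_hits: int, case_total: int,
                      control_hits: int, control_total: int) -> float:
    """One-tailed Fisher's exact test for enrichment of hits among cases.

    Returns P(X >= case_hits), where X is hypergeometric: the number of hit
    carriers that fall among the cases when all margins are held fixed.
    """
    hits = case_hits + control_hits
    total = case_total + control_total
    denominator = comb(total, case_total)
    upper = min(hits, case_total)
    numerator = sum(
        comb(hits, k) * comb(total - hits, case_total - k)
        for k in range(case_hits, upper + 1)
    )
    return numerator / denominator


def empirical_p(observed_p: float, simulated_ps: Sequence[float]) -> float:
    """(simulations more significant than observed + 1) / (simulations + 1)."""
    more_significant = sum(1 for p in simulated_ps if p < observed_p)
    return (more_significant + 1) / (len(simulated_ps) + 1)


# Toy example: 3 of 3 cases and 0 of 3 controls carry a CNV over a gene.
print(fisher_one_tailed(3, 3, 0, 3))                 # 0.05
# One of four shuffled datasets is more significant than the observed one.
print(empirical_p(0.01, [0.5, 0.005, 0.2, 0.9]))     # 0.4
```

Adding one to both numerator and denominator keeps the empirical p-value from ever being exactly zero, so with 20,000 shuffles the smallest attainable value is 1/20,001.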
Publication 2014
Centromere Chromosomes Chromosomes, Human, Pair 5 Exons Gene Deletion Gene Duplication Genes Genetic Background HIVEP1 protein, human pathogenesis Pathogenicity Protein Isoforms Segmental Duplications, Genomic

Most recent protocols related to «Gene Duplication»

Raw PacBio long reads were assembled de novo using Flye v2.9 (Kolmogorov et al. 2019) with the “keep haplotypes” option. Assessment of the resulting genomic contigs with Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.1.4 (Simão et al. 2015) and the Actinopterygii-lineage dataset (actinopterygii_odb10) identified high levels of gene duplication. Therefore, duplicates were removed from the initial Flye assembly using purge_dups v0.03 (Guan et al. 2020). The chromosome-scale genome assembly was generated by Phase Genomics using the de novo assembly, FALCON-phase (Kronenberg et al. 2018), Hi-C sequencing reads, and Phase Genomics’ Proximo algorithm based on Hi-C chromatin contact maps (as described in Bickhart et al. 2017). Error correction of this chromosome-scale assembly was conducted with Illumina short reads and Pilon v1.23 (Walker et al. 2014). Illumina short reads were quality-trimmed with Trimmomatic v0.39 (Bolger et al. 2014) using the parameters “ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:keepBothReads LEADING:3 TRAILING:3 MINLEN:36” and aligned to the genome using Bowtie2 v2.4.1 (Langmead and Salzberg 2012) with the default parameters, and the resulting SAM files were converted to BAM format using SAMtools v1.10 (Li et al. 2009). BAM files were then used as input for error correction with Pilon. The quality and completeness of the final assembly were assessed using Quast v5.0.2 (Mikheenko et al. 2018) and BUSCO v4.1.4 (actinopterygii_odb10) (Simão et al. 2015), and base-level accuracy (QV) was assessed using trimmed Illumina short reads, Merqury v1.3 (Rhie et al. 2020), and a k-mer value of 20.
Publication 2023
Chromatin Chromosomes Gene Duplication Genome Haplotypes Microtubule-Associated Proteins mismatch repair protein 1, human Walkers
To evaluate the temporal dynamics of expanded gene families during the evolution of C. bisecta, the nucleotide substitution rates of bivalves were calculated as the branch distance divided by the estimated divergence time using MCMCtree. All paralogs of the target gene family were aligned using MAFFT with default settings and trimmed using trimAl v1.4 [139] with the “-automated1” option in order to date the gene duplications. The Nei-Gojobori pairwise codeml method was used to determine the dN values for all aligned pairs. Divergence times of gene pairs were estimated using the equation T = K/2r [140], where T is the divergence time, K is the pairwise divergence between the two gene copies, and r is the nucleotide substitution rate. The relationships between different gene pairs were determined following the DupGen_finder (https://github.com/qiao-xin/DupGen_finder) pipeline, using Nematostella vectensis as a reference genome.
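The equation T = K/2r above can be worked through with a short sketch. The numbers below are invented for illustration and are not taken from the study. K is assumed here to be a pairwise distance in substitutions per site (for example, the dN value of a paralog pair) and r a rate in substitutions per site per year; the function name is hypothetical.

```python
def divergence_time(k: float, r: float) -> float:
    """Estimate the divergence time of two gene copies as T = K / (2r).

    The factor of 2 reflects that substitutions accumulate independently
    along both copies after the duplication.
    """
    if r <= 0:
        raise ValueError("substitution rate r must be positive")
    if k < 0:
        raise ValueError("pairwise divergence K cannot be negative")
    return k / (2 * r)


# Hypothetical values: K = 0.1 substitutions/site, r = 2.5e-9 per site per year.
t_years = divergence_time(0.1, 2.5e-9)
print(f"{t_years / 1e6:.1f} million years")
```

With these made-up inputs the estimate is roughly 20 million years. Because T scales inversely with r, any error in the substitution rate carries over proportionally to the date.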
Publication 2023
Biological Evolution Bivalves Gene Duplication Genes Genetic Drift Genome Nucleotides
The MeAMT2 genes were mapped to the chromosomes using TBtools software (Chen et al., 2020) according to the information obtained from the cassava genome database. Gene duplication of MeAMT2 genes was analyzed using MCScanX software (Wang et al., 2012) and illustrated with TBtools. The nucleotide substitution parameters Ks (synonymous) and Ka (non-synonymous) of the duplicated genes were assessed using TBtools, and then the Ka/Ks ratio was calculated. In addition, the gene duplication information from cassava, soybean, and A. thaliana was analyzed using MCScanX software, followed by integral visualization of synteny with TBtools software (Wang et al., 2012; Chen et al., 2020).
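The Ka/Ks calculation above can be sketched as follows. This is a generic illustration with made-up values, not the TBtools implementation. The interpretation thresholds follow the common convention (a ratio below 1 suggests purifying selection, above 1 positive selection, and equal to 1 neutral evolution); the function names are hypothetical.

```python
from typing import Optional


def ka_ks_ratio(ka: float, ks: float) -> Optional[float]:
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks <= 0:
        return None
    return ka / ks


def selection_regime(ratio: Optional[float]) -> str:
    """Map a Ka/Ks ratio to the conventional interpretation."""
    if ratio is None:
        return "undefined"
    if ratio < 1:
        return "purifying selection"
    if ratio > 1:
        return "positive selection"
    return "neutral evolution"


# Hypothetical duplicated gene pairs with (Ka, Ks) estimates.
pairs = {"pairA": (0.25, 0.5), "pairB": (0.75, 0.5), "pairC": (0.1, 0.0)}
for name, (ka, ks) in pairs.items():
    ratio = ka_ks_ratio(ka, ks)
    print(name, ratio, selection_regime(ratio))
```

Returning None when Ks is zero avoids a division error for identical or nearly identical copies, where the ratio carries no information.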
Publication 2023
Chromosomes Gene Duplication Genes Genes, Duplicate Genome Manihot Nucleotides Soybeans Synteny
To illustrate the removal of paralogs using taxonomic information, we assumed that the analyzed species belong to either phylogeny X, phylogeny Y, or phylogeny Z. Herein, if a phylogenetic tree of the candidate ortholog group was inferred, the presence or absence of paralogs could be determined based on the species information in the clades located at both ends of a certain branch. For example, if one clade contains only sequences of the species belonging to lineage X and another one contains only sequences of the species belonging to lineages Y and Z, no phylogenetic overlap would occur between the clades. Therefore, no paralogs would be included. However, if both clades contained sequences of species belonging to lineages X, Y, and Z, the clades would have diverged owing to gene duplication before the divergence of the three lineages (supplementary fig. S6A, Supplementary Material online). In this case, the paralogs would be removed by retaining only the sequences in one clade (supplementary fig. S6B, Supplementary Material online). Therefore, in the phylogenetic tree of a candidate ortholog group, OrthoPhy obtained information about the lineage to which the species in the clades located at both ends of each branch belong and determined whether the aforementioned conditions are met. If they did, the sequences in the clades with fewer sequences were removed as paralogs.
Paralogs are detected in a bottom–up approach, tracing from each leaf of the phylogenetic tree to its parent nodes. If any of the conditions (i)–(iii) are met at each node, the paralogs are successively removed. These processes removed all paralogs from the candidate ortholog group, resulting in an ortholog group consisting only of sequences that are orthologous to each other.
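The bottom-up pruning described above might look roughly like the following sketch. It is a simplified illustration, not OrthoPhy's implementation: conditions (i)–(iii) are not reproduced in this excerpt, so a single stand-in condition is assumed (the two clades at a node share at least one lineage). The nested-tuple tree encoding, the "species|sequence" leaf labels, the species-to-lineage mapping, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Union

# Hypothetical encoding: a leaf is "species|sequence"; an internal node is a
# 2-tuple of subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """List the leaf labels below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lineages(tree: Tree, lineage_of: Dict[str, str]) -> set:
    """Set of lineages represented by the species below a node."""
    return {lineage_of[leaf.split("|", 1)[0]] for leaf in leaves(tree)}


def prune_paralogs(tree: Tree, lineage_of: Dict[str, str]) -> Tree:
    """Resolve children first, then drop the smaller clade if lineages overlap."""
    if isinstance(tree, str):
        return tree
    left, right = (prune_paralogs(child, lineage_of) for child in tree)
    if lineages(left, lineage_of) & lineages(right, lineage_of):
        # Assumed stand-in condition: overlap implies a duplication, so keep
        # the clade with more sequences (ties keep the left clade).
        return left if len(leaves(left)) >= len(leaves(right)) else right
    return (left, right)


lineage_of = {"sp1": "X", "sp2": "Y", "sp3": "Z"}
tree = ((("sp1|a", "sp2|a"), "sp3|a"), ("sp1|b", "sp2|b"))
print(leaves(prune_paralogs(tree, lineage_of)))
```

In this toy tree the root separates a clade covering lineages X, Y, and Z from a clade covering X and Y, so the smaller two-sequence clade is removed and one sequence per species remains.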
Publication 2023
Gene Duplication Parent Plant Leaves Trees
According to the GFF annotations of the sweetpotato genomes, 43 IbDof genes were mapped to the chromosomes. To conduct a synteny analysis between IbDofs and genes from other plant species, the genome sequence and annotation data of sweetpotato and eight representative species (Ipomoea triloba, Ipomoea trifida, Arabidopsis, rice, tomato, pepper, cabbage, and Brassica oleracea) were downloaded from databases such as Ipomoea Genome Hub, TAIR, Ensembl (http://plants.ensembl.org/index.html), and Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html). MCScanX software was used to evaluate the association of gene duplication and collinearity using default settings. Circos and TBtools were used to display the results, with the minimum block size set to 30 (Krzywinski et al., 2009; Chen et al., 2020; Guo et al., 2022).
Publication 2023
Arabidopsis Brassica Cabbage Chromosomes Gene Duplication Genes Genes, Plant Genome Ipomoea Ipomoea batatas Lycopersicon esculentum Piper nigrum Plants Rice Synteny

Top products related to «Gene Duplication»

Sourced in United States, United Kingdom, Japan, Germany, Canada
The ABI 3130xl Genetic Analyzer is a capillary electrophoresis instrument designed for DNA sequencing and fragment analysis. It employs laser-induced fluorescence detection to analyze DNA samples. The instrument can process multiple samples simultaneously and provides high-resolution data for various genetic applications.
Sourced in Germany, Spain, United States
The EZ1 DNA blood kit is a laboratory equipment product from Qiagen designed for the automated extraction and purification of DNA from blood samples. The core function of this kit is to efficiently isolate and concentrate DNA from blood, providing a reliable and consistent DNA sample for further downstream applications.
Sourced in United States, China, Canada, United Kingdom, Germany, Spain, Japan, Israel
The HiSeq 2000 system is a high-throughput DNA sequencing platform developed by Illumina. It is designed for large-scale genomic research projects, providing rapid and accurate DNA sequencing capabilities. The system utilizes Illumina's proprietary sequencing-by-synthesis technology to generate high-quality sequence data.
Sourced in United States, Germany, Japan, China, India, United Kingdom, Switzerland
The ABI 3500 Genetic Analyzer is a capillary electrophoresis instrument designed for DNA sequencing and fragment analysis applications. It utilizes laser-induced fluorescence detection and a 24-capillary array to provide high-throughput analysis of genetic samples.
Sourced in Germany, United States, France, United Kingdom, Netherlands, Spain, Japan, China, Italy, Canada, Switzerland, Australia, Sweden, India, Belgium, Brazil, Denmark
The QIAamp DNA Mini Kit is a laboratory equipment product designed for the purification of genomic DNA from a variety of sample types. It utilizes a silica-membrane-based technology to efficiently capture and purify DNA, which can then be used for various downstream applications.
Sourced in United States, Germany
Illustrator CS6 is a vector graphics editing software that allows users to create and manipulate vector-based images, such as logos, illustrations, and graphics. It provides a range of tools and features for designing and editing vector artwork.
Sourced in United States, Netherlands
Coffalyser is a software application developed by MRC-Holland for the analysis of data generated from their MLPA (Multiplex Ligation-dependent Probe Amplification) assays. The software is designed to process and interpret the raw data files produced by MLPA experiments, providing users with a comprehensive analysis of the target DNA sequences.
Sourced in Germany, United States, Netherlands, Canada, France, Spain
The QIAsymphony is an automated sample processing platform designed for high-throughput nucleic acid isolation and purification. It provides a fully integrated and standardized workflow for processing a wide range of sample types, enabling efficient and reliable sample preparation for downstream applications.
Sourced in Netherlands, United States
Coffalyser.Net is a software application developed by MRC-Holland. It is designed to analyze data generated from their MLPA (Multiplex Ligation-dependent Probe Amplification) experiments. The software provides tools for data normalization, result interpretation, and report generation.
The SALSA MLPA P128-C1 Cytochrome P450 Probemix kit is a laboratory product designed for the detection and analysis of cytochrome P450 genes. The kit utilizes the Multiplex Ligation-dependent Probe Amplification (MLPA) technique to provide a comprehensive assessment of genetic variations in the cytochrome P450 gene family.

More about "Gene Duplication"

Gene duplication is a fundamental genetic process where a gene or DNA segment is replicated, creating one or more additional copies of the original genetic material.
This evolutionary mechanism plays a crucial role in the expansion of gene families and the acquisition of new functions.
The duplication process can occur through various mechanisms, such as unequal crossing-over, retrotransposition, or chromosomal duplications.
The resulting duplicated genes may then undergo subfunctionalization, neofunctionalization, or pseudogenization, contributing to the diversity and complexity of living organisms.
Understanding gene duplication is essential for researchers studying evolutionary biology, genomics, and genetic diseases.
Instruments such as the ABI 3130xl Genetic Analyzer, HiSeq 2000 system, and ABI 3500 Genetic Analyzer can be used to analyze gene sequences and identify duplication events.
Complementary tools like the QIAamp DNA Mini Kit, Coffalyser software, and QIAsymphony can facilitate DNA extraction and data analysis.
Leveraging AI-driven platforms like PubCompare.ai can further enhance gene duplication research.
This platform provides access to relevant protocols from literature, pre-prints, and patents, and enables AI-driven comparisons to identify the best protocols and products.
This streamlines the research process, improves reproducibility, and enhances the accuracy of findings.
By incorporating synonyms, related terms, and key subtopics, researchers can optimize their gene duplication studies and stay ahead of the curve in this rapidly evolving field of genomics and evolutionary biology.
The inclusion of relevant techniques and tools, such as the EZ1 DNA blood kit, Illustrator CS6, and Coffalyser.Net software, can further enrich the research process and support the understanding of gene duplication mechanisms and their implications.