Repeat elements were identified de novo with the RepeatModeler pipeline and masked with RepeatMasker. Perfect, imperfect and compound SSRs were identified using the SciRoKo SSR-search module (http://kofler.or.at/bioinformatics/SciRoKo). Satellite motifs were discovered using TRF⁵⁶ (link) and filtered for a minimum length of 80 bp and a minimum of 4 repeated monomers. Sequences were then clustered at 90% identity using CD-HIT⁵⁷ (link) to retrieve representative monomers; only clusters with at least 5 elements were retained. Occurrences on the assembly were obtained via BLASTn alignment of the representative sequences with a minimum identity of 90%; only independent, non-consecutive matches separated by at least 50 bp were counted for statistics.
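The counting of independent satellite occurrences can be illustrated with a minimal Python sketch. It is not the authors' script: the function name `count_independent_hits`, the input tuple layout (subject ID, subject start, subject end, percent identity, as found in BLAST tabular output) and the rule that matches closer than 50 bp are merged into a single locus are all assumptions made for illustration. Distance is measured here as the start of a match minus the end of the previous locus.

```python
from collections import defaultdict


def count_independent_hits(hits, min_identity=90.0, min_gap=50):
    """Count independent satellite loci per subject sequence.

    hits: iterable of (subject_id, sstart, send, pident) tuples.
    Matches below min_identity are discarded; matches closer than
    min_gap bp to the previous locus are merged into it (assumed rule).
    """
    by_subject = defaultdict(list)
    for subject_id, sstart, send, pident in hits:
        if pident < min_identity:
            continue
        # Minus-strand BLAST hits report sstart > send, so normalise.
        lo, hi = sorted((sstart, send))
        by_subject[subject_id].append((lo, hi))

    counts = {}
    for subject_id, intervals in by_subject.items():
        intervals.sort()
        loci = 0
        last_end = None
        for lo, hi in intervals:
            if last_end is None or lo - last_end >= min_gap:
                loci += 1          # far enough away: a new independent locus
                last_end = hi
            else:
                last_end = max(last_end, hi)  # consecutive match: extend locus
        counts[subject_id] = loci
    return counts


example_hits = [
    ("scaf1", 1, 100, 95.0),
    ("scaf1", 120, 220, 92.0),   # 20 bp after the previous match: merged
    ("scaf1", 300, 400, 91.0),
    ("scaf1", 500, 450, 99.0),   # minus-strand hit
    ("scaf2", 10, 90, 85.0),     # below the identity threshold
]
independent = count_independent_hits(example_hits)
```

With the example data, `scaf1` yields three independent loci and `scaf2` is absent because its only match falls below 90% identity.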
Gene prediction relied on iterative runs of the MAKER suite⁵⁸ (link). Both EST sequences and RNAseq data were used to guide gene annotation. RNAseq data from eight globe artichoke and cardoon genotypes were retrieved from the Sequence Read Archive (SRA; PRJNA72327). EST sequences for C. cardunculus and other available Compositae species were downloaded from the NCBI, as was the non-redundant (nr) protein database for Viridiplantae. RNAseq reads were aligned to the reference assembly using the TopHat2 aligner, and de novo transcripts were assembled using the Cufflinks package with default parameters. A first run of the MAKER annotation pipeline was carried out using only transcript assemblies, ESTs and protein alignments to retrieve candidate gene predictions. After filtering for high-quality preliminary predictions, HMM models were trained for the Augustus and SNAP ab initio gene prediction algorithms. Then, using these ad hoc HMM models, along with EST and protein alignments as supporting evidence, final gene models were obtained in a second run of MAKER.
Predicted protein sequences were functionally annotated using InterProScan 5 (ref. 59; link) against all the available databases (ProDom-2006.1, Panther-7.2, SMART-6.2, PrositeProfiles-20.89, TIGRFAM-12.0, PrositePatterns-20.89, PfamA-26.0, SuperFamily-1.75, PRINTS-42.0, Gene3d-3.5.0, PIRSF-2.83, HAMAP-201207.4, Coils-2.2). In parallel, the same protein set was clustered using OrthoMCL⁶⁰ with default parameters. Putative functions were assigned to each protein cluster based on the InterProScan functional predictions for all the members of the cluster. Protein datasets for Arabidopsis thaliana, Brassica rapa, Fragaria vesca and Solanum lycopersicum (September 2014) were downloaded from Phytozome V9 (ref. 61; link). A predicted proteome was used for Lactuca sativa (unpublished data). All these proteins were clustered together with those of C. cardunculus using OrthoMCL with default parameters to generate orthologous clusters.
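The text does not state how member-level predictions were combined into a single cluster-level function. The sketch below shows one plausible rule, a majority vote over the terms annotated on the cluster members, and should be read as an assumption rather than the procedure actually used. The function name `assign_cluster_function`, the `min_fraction` threshold and the example identifiers are hypothetical.

```python
from collections import Counter


def assign_cluster_function(members, annotations, min_fraction=0.5):
    """Return the most widely shared annotation term in a cluster.

    members: list of protein IDs in one cluster.
    annotations: dict mapping protein ID -> iterable of annotation terms.
    min_fraction: hypothetical minimum share of members that must carry
    the winning term; None is returned if no term reaches it.
    """
    if not members:
        return None
    counts = Counter()
    for protein in members:
        # Count each term at most once per protein.
        counts.update(set(annotations.get(protein, ())))
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    term, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return term if n / len(members) >= min_fraction else None


annotations = {
    "p1": ["PF00069", "PF07714"],
    "p2": ["PF00069"],
    "p3": [],
}
cluster_function = assign_cluster_function(["p1", "p2", "p3"], annotations)
```

Here two of three members carry `PF00069`, so that term is assigned to the cluster; a cluster whose members have no predictions is left unannotated.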
OrthoMCL clusters in which a species showed an expanded number of genes were identified using a chi-square test comparing the gene counts per species against an expected value. Only clusters with a mean number of genes above five were considered. These clusters were analysed for a significant deviation from the mean gene count per species, adopting a Bonferroni-corrected p-value threshold (p < 0.05). GO enrichment in the globe artichoke-specific clusters was calculated with agriGO (http://bioinfo.cau.edu.cn/agriGO) and visualized with the REVIGO suite (http://revigo.irb.hr). Genomic regions carrying clusters of genes were highlighted by aligning full-length genomic gene sequences to the reference genome using BWA (bwa-sw algorithm; http://bio-bwa.sourceforge.net).
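The expansion test can be sketched as a chi-square goodness-of-fit test followed by a Bonferroni correction. The sketch assumes that the expected value for every species is the mean gene count of the cluster and that the correction is applied over all clusters passing the mean filter; neither detail is specified in the text. The names `chi2_sf` and `expanded_clusters` and the example counts are illustrative only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, k = math.exp(-half), 2           # closed form for df = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1  # closed form for df = 1
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def expanded_clusters(cluster_counts, alpha=0.05, min_mean=5.0):
    """Return {cluster_id: Bonferroni-adjusted p} for significant clusters.

    cluster_counts: dict mapping cluster ID -> list of gene counts,
    one entry per species (the same species order in every cluster).
    """
    tested = {
        cid: counts
        for cid, counts in cluster_counts.items()
        if sum(counts) / len(counts) > min_mean
    }
    n_tests = len(tested)
    significant = {}
    for cid, counts in tested.items():
        expected = sum(counts) / len(counts)  # assumed: uniform expectation
        stat = sum((obs - expected) ** 2 / expected for obs in counts)
        p_adj = min(1.0, chi2_sf(stat, len(counts) - 1) * n_tests)
        if p_adj < alpha:
            significant[cid] = p_adj
    return significant


example_counts = {
    "c1": [40, 5, 5, 5, 5, 6],   # one species strongly expanded
    "c2": [6, 6, 7, 6, 6, 6],    # roughly even counts
    "c3": [1, 1, 1, 1, 1, 1],    # mean below the threshold, not tested
}
expanded = expanded_clusters(example_counts)
```

In the example only `c1` is reported. A significant result indicates an uneven distribution across species; identifying which species is expanded requires inspecting the individual counts.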
Scaglione D., Reyes-Chin-Wo S., Acquadro A., Froenicke L., Portis E., Beitel C., Tirone M., Mauro R., Lo Monaco A., Mauromicale G., Faccioli P., Cattivelli L., Rieseberg L., Michelmore R., & Lanteri S. (2016). The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny. Scientific Reports, 6, 19427.