The tentative genomic sequence (TGS) was subjected to gene prediction and modeling by the Kazusa Annotation PipelinE for Lotus japonicus (KAPSEL) [5]. KAPSEL employs ab initio gene-finding software and similarity searches to generate the elements for gene model production. The ab initio gene-finding software used in the pipeline includes GeneMark.hmm [24], Genscan [25] and Grail [26], each using the A. thaliana-trained matrix. Splice-site candidates were deduced by NetGene2 [27] and SplicePredictor [28]. Similarity searches to detect potential protein-coding exons were performed using the BLASTX function of BLAST against the UniProtKB database [29]. The assigned exon candidates were extracted from the original sequence library, then mapped onto the TGS more precisely using the dps and nap programs in the analysis and annotation tool (AAT) package [30]. Similarity searches of transcript sequences were performed by aligning the TGS against the Gene Indices [31] for legume species including L. japonicus, M. truncatula and Glycine max. The assigned transcript sequences were mapped onto the TGS using the dds and gap2 programs in AAT to confirm working models of protein-encoding genes. The automated annotation process assigned a total of 19 848 partial and 10 951 complete models as protein-encoding genes in the TGS, excluding those related to TEs. The 76.4-Mb sequences in the HGS were edited and annotated manually to ensure high-quality gene prediction.
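The BLASTX step compares a nucleotide sequence with a protein database by first translating the nucleotide sequence in all six reading frames. The following minimal sketch illustrates only that translation step, using the standard genetic code. It is not part of KAPSEL, and the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical names chosen for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon. A real search would then score each translated frame against database proteins, which this sketch does not attempt.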
The genes thus assigned were denoted by IDs consisting of the clone name (LjT**** for TACs and LjB**** for BACs) or contig name (CM****) followed by a sequential number counted from one end of the sequence to the other. Of these, the IDs of manually annotated genes on the HGS were followed by “.nc”, and all others were followed by “.nd”. The genes assigned on the SGA sequences were denoted by IDs consisting of the assembly consensus name (LjSGA_****) followed by a sequential number counted from one end of the insert to the other.
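The naming scheme above can be sketched as a small parser. The exact ID layout is not given in the text, so this example assumes a hypothetical form `<name>.<serial>[.nc|.nd]` with a dot as separator. The example IDs in the usage note are invented for illustration, and `GeneId` and `parse_gene_id` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class GeneId(NamedTuple):
    source: str              # kind of sequence the gene was assigned on
    name: str                # clone, contig or consensus name
    serial: int              # sequential number along the sequence
    curation: Optional[str]  # "manual" (.nc), "automated" (.nd) or None


PREFIX_SOURCE = {
    "LjT": "TAC clone",
    "LjB": "BAC clone",
    "CM": "contig",
    "LjSGA_": "SGA consensus",
}
SUFFIX_CURATION = {"nc": "manual", "nd": "automated", None: None}

# ASSUMPTION: dot-separated layout; the source does not specify separators.
ID_PATTERN = re.compile(
    r"^(?P<prefix>LjSGA_|LjT|LjB|CM)(?P<rest>[A-Za-z0-9]+)"
    r"\.(?P<serial>\d+)(?:\.(?P<suffix>nc|nd))?$"
)


def parse_gene_id(gene_id):
    """Split an ID into source, name, serial number and curation status."""
    match = ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError(f"unrecognised gene ID: {gene_id!r}")
    return GeneId(
        source=PREFIX_SOURCE[match["prefix"]],
        name=match["prefix"] + match["rest"],
        serial=int(match["serial"]),
        curation=SUFFIX_CURATION[match["suffix"]],
    )
```

Under these assumptions, `parse_gene_id("CM0001.10.nc")` would report a manually annotated gene, number 10 on contig CM0001. If the real IDs use a different separator or carry extra fields, only `ID_PATTERN` would need to change.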
A global alignment of the genome sequences and ESTs was performed using the NEEDLE program [32,33] provided at the EMBOSS site (http://emboss.sourceforge.net/). To identify a possible TATA box-like motif for recognition by RNA polymerase II, a search against the plant cis-acting regulatory DNA elements (PLACE) database [34] (http://www.dna.affrc.go.jp/PLACE/) was carried out.
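NEEDLE computes global alignments with the Needleman-Wunsch dynamic-programming algorithm. The sketch below shows the core recurrence in a deliberately simplified form: it returns only the optimal score and uses a single linear gap penalty, whereas NEEDLE distinguishes gap opening from gap extension. The scoring values and the function name `global_alignment_score` are assumptions made for illustration, not the settings used in this study.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] is the best score for aligning the processed prefix of a
    # with the first j characters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * gap]  # first i characters of a against an empty prefix of b
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            gap_in_b = prev[j] + gap      # char_a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # char_b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

With the default scores, aligning `"ACGT"` with `"AGT"` yields 1 (three matches and one gap). Keeping only two rows of the matrix limits memory to the length of the shorter dimension, at the cost of not being able to trace back the alignment itself.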