The tentative genomic sequence was subjected to gene prediction and modeling by the Kazusa Annotation PipelinE for Lotus japonicus (KAPSEL).5 KAPSEL combines ab initio gene-finding software with similarity searches to generate the elements used for gene model production. The ab initio gene-finding software used in the pipeline includes GeneMark.hmm,24 Genscan25 and Grail26 using the A. thaliana-trained matrix. Splice-site candidates were deduced by NetGene227 and SplicePredictor.28 Similarity searches to detect potential protein-coding exons were performed using the BLASTX function of BLAST against the UniProtKB database.29 The assigned exon candidates were extracted from the original sequence library and then mapped more precisely onto the TGS using the dps and nap programs of the analysis and annotation tool (AAT) package.30 Similarity searches of transcript sequences were performed by aligning the TGS against the Gene Indices31 for legume species including L. japonicus, M. truncatula and Glycine max. The assigned transcript sequences were mapped onto the TGS using the dds and gap2 programs in AAT to confirm working models of protein-encoding genes. As a result of the automated annotation process, a total of 19 848 partial and 10 951 complete models were assigned as protein-encoding genes in the TGS, excluding those related to TEs. The 76.4-Mb sequences in the HGS were edited and annotated manually to ensure high-quality gene prediction.
The genes thus assigned were denoted by IDs consisting of the clone name (LjT**** for TACs and LjB**** for BACs) or contig name (CM****) followed by sequential numbers from one end to the other. Of these, IDs of manually annotated genes on the HGS were suffixed with “.nc”, and all others were suffixed with “.nd”. The genes assigned on the SGA sequences were denoted by IDs consisting of the assembly consensus name (LjSGA_****) followed by sequential numbers from one end of the insert to the other.
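As an illustration of this naming scheme, the following is a minimal Python sketch that classifies a gene ID by its prefix and suffix. The text does not specify the exact separators or the characters behind "****", so the dot-separated layout (name, sequential number, optional suffix), the alphanumeric name body, and the example IDs are assumptions made for illustration only; `GeneId` and `parse_gene_id` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class GeneId(NamedTuple):
    source: str              # "TAC", "BAC", "contig", or "SGA"
    name: str                # clone, contig, or assembly consensus name
    serial: int              # sequential number along the sequence
    manual: Optional[bool]   # True for ".nc", False for ".nd", None if no suffix


# ASSUMPTION: IDs look like <name>.<number>[.<nc|nd>]; the source text only
# states the prefixes, the sequential numbering, and the two suffixes.
_PATTERN = re.compile(
    r"^(?P<name>LjT\w+|LjB\w+|CM\w+|LjSGA_\w+)"
    r"\.(?P<serial>\d+)"
    r"(?:\.(?P<suffix>nc|nd))?$"
)


def parse_gene_id(gene_id: str) -> GeneId:
    """Split a gene ID into its source type, name, serial number, and suffix."""
    m = _PATTERN.match(gene_id)
    if m is None:
        raise ValueError(f"unrecognized gene ID: {gene_id!r}")
    name, suffix = m["name"], m["suffix"]
    if name.startswith("LjSGA_"):
        source = "SGA"
    elif name.startswith("LjT"):
        source = "TAC"
    elif name.startswith("LjB"):
        source = "BAC"
    else:
        source = "contig"
    # Clone- and contig-based IDs are described as always carrying a suffix.
    if source != "SGA" and suffix is None:
        raise ValueError(f"missing .nc/.nd suffix: {gene_id!r}")
    manual = None if suffix is None else suffix == "nc"
    return GeneId(source, name, int(m["serial"]), manual)
```

Under these assumptions, a made-up ID such as `CM0001.10.nc` would be read as gene number 10 on contig CM0001 with manual annotation, whereas `LjSGA_000123.1` would be read as the first gene on an SGA consensus with no annotation suffix.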
A global alignment of the genome sequences and ESTs was performed using the NEEDLE program32,33 provided at the EMBOSS site (http://emboss.sourceforge.net/). To identify possible TATA box-like motifs for recognition by RNA polymerase II, a search against the plant cis-acting regulatory DNA elements (PLACE) database34 (http://www.dna.affrc.go.jp/PLACE/) was carried out.
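NEEDLE computes a Needleman-Wunsch global alignment. The following Python sketch shows the underlying dynamic-programming idea only; it is not the EMBOSS implementation. It assumes a linear gap penalty and illustrative match, mismatch and gap scores, whereas NEEDLE itself uses a substitution matrix with affine gap penalties. The function name `needleman_wunsch` and its parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    ASSUMPTION: linear gap penalty and simple match/mismatch scores,
    chosen for illustration rather than taken from the source.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the alignment is global, every base of both sequences appears in the output, which is why this approach suits comparing an EST against the genomic region it is expected to span end to end.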