Eukaryotic proteins are processed using the general pipeline depicted in Figure 1. The pipeline is organized as a directed rooted graph in which each node corresponds to the execution of a specific tool. The root is the query protein sequence, and the leaves are the predicted subcellular localizations, represented as GO terms of the cellular component ontology. The path from the root to a leaf is determined by the outcomes of the tools along the way. In Figure 1, GO terms and tools highlighted in green are applied only to plant proteins.
At the first level, the query sequence is scanned for a signal peptide using the DeepSig predictor (4). If a signal peptide is found (suggesting that the protein is sorted through the secretory pathway), the mature protein sequence is obtained by cleaving the predicted signal peptide and is then analyzed by the subsequent tools. First, PredGPI (6) determines the presence of GPI-anchors. If an anchor is found, the sequence is classified as Membrane anchored component (GO:0046658). Otherwise, the sequence is scanned for α-helical transmembrane (TM) domains using ENSEMBLE3.0 (7). If at least one TM domain is found, the protein is predicted to be a membrane protein and passed to MemLoci (10), which assigns the final membrane localization among Endomembrane system (GO:0012505), Plasma membrane (GO:0005886) and Organelle membrane (GO:0031090). If no TM domain is found, the protein is predicted to be localized in the Extracellular space (GO:0005615).
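The secretory branch described above can be sketched as a small decision function. This is only an illustration, not the BUSCA implementation: the function names `cleave` and `secretory_branch` and their parameters are hypothetical, and the outcomes of DeepSig, PredGPI, ENSEMBLE3.0 and MemLoci are assumed to have been computed elsewhere and passed in as plain values.

```python
from typing import Optional

# GO terms quoted in the text for the secretory branch.
GPI_ANCHORED = "GO:0046658"   # Membrane anchored component
EXTRACELLULAR = "GO:0005615"  # Extracellular space
MEMLOCI_TERMS = {
    "GO:0012505",  # Endomembrane system
    "GO:0005886",  # Plasma membrane
    "GO:0031090",  # Organelle membrane
}


def cleave(sequence: str, peptide_length: int) -> str:
    """Drop an N-terminal peptide of the given length (hypothetical helper)."""
    if not 0 <= peptide_length <= len(sequence):
        raise ValueError("peptide length out of range")
    return sequence[peptide_length:]


def secretory_branch(has_gpi_anchor: bool,
                     tm_domain_count: int,
                     memloci_term: Optional[str] = None) -> str:
    """Map tool outcomes on the mature sequence to a GO term.

    All three arguments stand in for predictor outputs (PredGPI,
    ENSEMBLE3.0, MemLoci); they are assumptions made for this sketch.
    """
    if has_gpi_anchor:
        return GPI_ANCHORED
    if tm_domain_count >= 1:
        # Membrane protein: the final call is delegated to MemLoci.
        if memloci_term not in MEMLOCI_TERMS:
            raise ValueError("a MemLoci prediction is required for membrane proteins")
        return memloci_term
    return EXTRACELLULAR
```

The order of the checks mirrors the text: the GPI-anchor test comes before the TM-domain test, so an anchored protein is never passed to MemLoci.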
Proteins not directed to the secretory pathway (no signal peptide predicted by DeepSig) are analyzed for potential organelle localization using TPpred3 (5), which predicts the presence of organelle-targeting peptides and, for plant proteins, distinguishes between mitochondrial and chloroplast sorting.
If no targeting peptide is detected with TPpred3, ENSEMBLE3.0 is used to discriminate membrane from globular proteins: MemLoci and BaCelLo (9) are then applied to predict the localization of membrane and globular proteins, respectively. In particular, BaCelLo distinguishes among five cellular compartments (four for animal or fungal proteins): Nucleus (GO:0005634), Cytoplasm (GO:0005737), Extracellular space (GO:0005615), Mitochondrion (GO:0005739) and, for plant proteins, Chloroplast (GO:0009507). Moreover, since BaCelLo adopts different optimized models for animals and fungi, the taxonomic origin of the input is also provided as a parameter to the predictor.
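As a rough illustration of this step, the snippet below selects the downstream predictor from the ENSEMBLE3.0 outcome and lists the compartments BaCelLo can return for a given kingdom. The names `next_tool` and `bacello_compartments`, and the kingdom labels, are hypothetical and chosen only for this sketch.

```python
# Compartments BaCelLo can assign for every kingdom, as listed in the text.
BACELLO_COMMON = {
    "Nucleus": "GO:0005634",
    "Cytoplasm": "GO:0005737",
    "Extracellular space": "GO:0005615",
    "Mitochondrion": "GO:0005739",
}
# Additional compartment available for plant proteins only.
BACELLO_PLANT_ONLY = {"Chloroplast": "GO:0009507"}

KINGDOMS = ("animal", "fungi", "plant")  # labels assumed for this sketch


def next_tool(tm_domain_count: int) -> str:
    """Membrane proteins go to MemLoci, globular ones to BaCelLo."""
    return "MemLoci" if tm_domain_count >= 1 else "BaCelLo"


def bacello_compartments(kingdom: str) -> dict:
    """Return the name -> GO term map BaCelLo can predict for a kingdom."""
    if kingdom not in KINGDOMS:
        raise ValueError(f"unknown kingdom: {kingdom!r}")
    compartments = dict(BACELLO_COMMON)
    if kingdom == "plant":
        compartments.update(BACELLO_PLANT_ONLY)
    return compartments
```

The kingdom is passed explicitly because, as the text notes, the predictor needs the taxonomic origin to pick the appropriate model.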
When a mitochondrial targeting signal is detected, it is cleaved off to determine the mature protein sequence. ENSEMBLE3.0 is then used to determine whether the mature protein is localized in a Mitochondrial membrane (GO:0031966) or, more generally, in the Mitochondrion (GO:0005739).
For plant proteins, TPpred3 is also able to detect potential chloroplast-targeting peptides. If detected, they are cleaved and the mature sequence is submitted to SChloro (11), which discriminates among six sub-chloroplast localizations: Outer membrane (GO:0009707), Inner membrane (GO:0009706), Plastoglobule (GO:0010287), Thylakoid lumen (GO:0009543), Thylakoid membrane (GO:0009535) and Stroma (GO:0009570).
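The two organelle branches can be sketched in the same style. The function `organelle_branch` and its arguments are hypothetical stand-ins for the TPpred3, ENSEMBLE3.0 and SChloro outcomes. The sketch also assumes that one or more predicted TM domains in the mature sequence means Mitochondrial membrane; the text does not state this threshold explicitly for the mitochondrial case.

```python
from typing import Optional

MITO_MEMBRANE = "GO:0031966"  # Mitochondrial membrane
MITOCHONDRION = "GO:0005739"  # Mitochondrion
SCHLORO_TERMS = {
    "GO:0009707",  # Outer membrane
    "GO:0009706",  # Inner membrane
    "GO:0010287",  # Plastoglobule
    "GO:0009543",  # Thylakoid lumen
    "GO:0009535",  # Thylakoid membrane
    "GO:0009570",  # Stroma
}


def organelle_branch(targeting: Optional[str],
                     kingdom: str,
                     tm_domain_count: int = 0,
                     schloro_term: Optional[str] = None) -> Optional[str]:
    """Resolve the organelle branch from precomputed tool outcomes.

    targeting: None, "mitochondrion" or "chloroplast" (assumed encoding
    of the TPpred3 result). Returns None when no targeting peptide was
    found, meaning the MemLoci/BaCelLo branch applies instead.
    """
    if targeting is None:
        return None
    if targeting == "mitochondrion":
        # Assumption: at least one TM domain means a membrane localization.
        return MITO_MEMBRANE if tm_domain_count >= 1 else MITOCHONDRION
    if targeting == "chloroplast":
        if kingdom != "plant":
            raise ValueError("chloroplast targeting applies to plant proteins only")
        if schloro_term not in SCHLORO_TERMS:
            raise ValueError("a SChloro prediction is required")
        return schloro_term
    raise ValueError(f"unknown targeting outcome: {targeting!r}")
```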
Overall, BUSCA can predict sixteen different compartments for plants and nine for animals and fungi.
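These totals can be checked by collecting the GO terms reachable at the leaves of each branch and counting the distinct ones. The grouping into four branches and the names `BRANCH_LEAVES`, `PLANT_ONLY` and `compartments` are choices made for this sketch; the GO terms themselves are those quoted in the text.

```python
# Leaves of the graph, grouped by the branch that reaches them.
BRANCH_LEAVES = {
    "secretory": {
        "GO:0046658",  # Membrane anchored component
        "GO:0012505",  # Endomembrane system
        "GO:0005886",  # Plasma membrane
        "GO:0031090",  # Organelle membrane
        "GO:0005615",  # Extracellular space
    },
    "no_targeting_peptide": {
        "GO:0012505", "GO:0005886", "GO:0031090",  # MemLoci
        "GO:0005634",  # Nucleus
        "GO:0005737",  # Cytoplasm
        "GO:0005615",  # Extracellular space
        "GO:0005739",  # Mitochondrion
        "GO:0009507",  # Chloroplast (plants only)
    },
    "mitochondrial": {
        "GO:0031966",  # Mitochondrial membrane
        "GO:0005739",  # Mitochondrion
    },
    "chloroplast": {
        "GO:0009707", "GO:0009706", "GO:0010287",
        "GO:0009543", "GO:0009535", "GO:0009570",
    },
}

# Terms that are only reachable for plant proteins.
PLANT_ONLY = {"GO:0009507"} | BRANCH_LEAVES["chloroplast"]


def compartments(kingdom: str) -> set:
    """Distinct GO terms that can be predicted for the given kingdom."""
    terms = set().union(*BRANCH_LEAVES.values())
    if kingdom != "plant":
        terms -= PLANT_ONLY
    return terms


print(len(compartments("plant")))   # 16
print(len(compartments("animal")))  # 9
```

Several terms (the three MemLoci classes, Extracellular space and Mitochondrion) are reachable through more than one branch, which is why the totals are smaller than the sum of the per-branch counts.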