All LLPS data in LLPSDB were curated manually from the literature. Literature mining was performed by searching ‘PubMed’ and ‘Web of Science’ with the keywords: phase separation, phase transition, liquid, protein, de-mixing, assembly, condensate, condensation, coacervate, segregate and segregation. Of the 3603 papers retrieved that were published up to July 2019, 154 articles were selected. Only original research articles were collected, and only those containing in vitro LLPS data were retained. Review papers were excluded to avoid redundancy and confusion. We considered the proteins and nucleic acids involved in LLPS as ‘main components’ and constructed entries accordingly. Each entry is defined by a specific protein sequence and nucleic acid type; for example, the wild type and a mutant of FUS belong to different entries, as do an RNA component of 15 nt and one of 30 nt. Other molecules such as salts, buffer molecules and crowding agents, along with temperature, pH, etc., were treated as experimental conditions. These conditions, in the original units of measurement used in the paper, together with the observed phase behavior (a detailed phase diagram, or a tag ‘Yes’ or ‘No’ indicating whether the system phase separates), were extracted manually from the selected articles. All data were checked at least twice. Any incomplete or ambiguous entry was resolved either by contacting the authors of the article or by tracing related references. All the specific information extracted from the literature is listed in the top box of Figure 1.
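The entry definition described above can be illustrated with a minimal Python sketch. It is not the LLPSDB schema: the class names, field names and the `entry_key` helper are hypothetical, and the sequences in the usage note are toy strings, not real FUS sequences. The sketch only shows how keying an entry on protein sequence and nucleic acid type separates a wild type from a mutant, and a 15 nt RNA from a 30 nt RNA, while conditions stay attached to individual observations in their original units.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MainComponent:
    """A protein or nucleic acid taking part in LLPS (hypothetical model)."""
    kind: str    # "protein", "RNA" or "DNA"
    name: str    # e.g. "FUS"
    detail: str  # protein sequence, or nucleic acid type/length such as "15nt"


@dataclass(frozen=True)
class Condition:
    """One experimental condition, kept in the unit reported by the paper."""
    name: str   # e.g. "NaCl", "temperature", "pH"
    value: str  # stored as reported, without conversion
    unit: str   # original unit of measurement


@dataclass(frozen=True)
class Observation:
    """Phase behavior observed under one set of conditions."""
    conditions: Tuple[Condition, ...]
    phase_separates: Optional[bool]  # True = 'Yes', False = 'No', None = phase diagram only
    pmid: str = ""


def entry_key(components):
    """Return an order-independent key built from all main components.

    Two systems share an entry only if every protein sequence and every
    nucleic acid type/length matches.
    """
    return tuple(sorted((c.kind, c.name, c.detail) for c in components))
```

With toy inputs, `entry_key([MainComponent("protein", "FUS", "AAAY")])` differs from the key for the same protein with sequence `"AAAS"`, so the two variants would be stored as separate entries.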
In LLPSDB, some protein annotations (mainly for natural proteins), such as localization and Gene Ontology (GO) terms, as well as the sequences of some proteins (if not provided in the literature), were obtained from UniProt/NCBI. Because IDRs and LCRs in proteins have been shown to be generally critical in LLPS, their sequences are visualized in the database. IDRs were identified by searching MobiDB (28) or, for sequences not deposited in MobiDB, predicted with the PONDR VL3-BA algorithm (33); only segments of no fewer than 15 residues were taken into account (33). LCR data were likewise taken from MobiDB or predicted with the SEG algorithm (34) using default parameters. Additionally, to organize the data methodically, other information was annotated, such as protein type (natural or designed), protein structure type (IDR, IDR-fold or Fold) and main components (whether DNAs or RNAs are included). Other databases related to the corresponding protein, such as DisProt (27), OMIM (29), IDEAL (30), AmyPro (31) and FuzzDB (32), as well as the PMID/DOI of the source article, are also linked from LLPSDB.
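The 15-residue length filter applied to predicted IDRs can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of per-residue disorder scores, the function name `disordered_segments` is hypothetical, and the 0.5 score cutoff is an assumed value that is not given in the text above. Only the minimum segment length of 15 residues comes from the described procedure.

```python
def disordered_segments(scores, threshold=0.5, min_len=15):
    """Return (start, end) pairs, 1-based and inclusive, for runs of residues
    whose score is >= threshold and whose length is at least min_len.

    threshold is an assumed cutoff; min_len=15 follows the text.
    """
    segments = []
    start = None  # 0-based index where the current disordered run began
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at residue i - 1; keep it only if long enough.
            if i - start >= min_len:
                segments.append((start + 1, i))
            start = None
    # Handle a run that extends to the C-terminus.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start + 1, len(scores)))
    return segments
```

A run of 14 disordered residues is therefore discarded, while a run of exactly 15 residues is retained.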
It is important to emphasize that LLPSDB focuses on cases in which a protein, alone or together with other components (proteins or nucleic acids), was validated to undergo LLPS (or NOT) in vitro. Because many of the investigated systems used RNA mixtures such as total mRNA (35,36) rather than a specific nucleic acid, nucleic acid sequences are not presented in LLPSDB. Systems with only nucleic acids as the main component (that is, with no protein in solution) were not included. In addition, systems involving segregation of antibodies (such as IgG) and materials designed as drug carriers were excluded. LLPSDB contains systems in which the condensates were observed to flow, fuse, drip, wet and reverse (behaviors characteristic of liquid-like droplets), or in which a liquid morphology was identified by FRAP (fluorescence recovery after photobleaching), EM (electron microscopy) or other techniques. Systems in which assemblies change from a liquid morphology during ripening, such as droplet-to-gel or droplet-to-aggregate transitions, were recorded in the database, but those that form gels or aggregates directly from homogeneous solution were not deposited.
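The inclusion and exclusion rules above can be summarized as a small predicate. Curation in LLPSDB was manual, so this sketch is not part of the actual workflow. The `SystemRecord` class, its field names and the string labels are all hypothetical and serve only to restate the criteria compactly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemRecord:
    """Hypothetical summary of one in vitro system reported in an article."""
    has_protein: bool
    is_antibody: bool = False
    is_drug_carrier: bool = False
    # First assembly formed from homogeneous solution:
    # "droplet", "gel", "aggregate", or None if no phase separation occurred.
    first_assembly: Optional[str] = None
    # Later morphology after ripening, e.g. "gel" or "aggregate", if any.
    matured_to: Optional[str] = None


def is_in_scope(rec):
    """Apply the stated criteria to decide whether a system is deposited."""
    if not rec.has_protein:
        return False  # nucleic-acid-only systems are not included
    if rec.is_antibody or rec.is_drug_carrier:
        return False  # antibody segregation and drug carriers are excluded
    if rec.first_assembly in ("gel", "aggregate"):
        return False  # gels/aggregates formed directly are not deposited
    # Liquid droplets (including those that later ripen) and validated
    # negative results (no LLPS) are both in scope.
    return True
```

Under this sketch a droplet that later matures into a gel is in scope, whereas a gel that forms directly from homogeneous solution is not.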