Given an arbitrary data set as an input, TMAP encompasses four phases: (I) LSH forest indexing [37 , 38 ], (II) construction of a -approximate -nearest neighbor graph, (III) calculation of a minimum spanning tree (MST) of the -approximate -nearest neighbor graph [39 (link)], and (IV) generation of a layout for the resulting MST [40 ].
During phase I, the input data are indexed in an LSH forest data structure, enabling -approximate -nearest neighbor (k-NN) searches with a time complexity sub-linear in . Text and binary data are encoded using the MinHash algorithm, while integer and floating-point data are encoded using a weighted variation of the algorithm [41 –43 ]. The LSH Forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions used in encoding the data, and the number of prefix trees . An increase in the values of both parameters led to an increase in main memory usage; however, higher values for also decrease query speed. The effect of parameters and on the final visualization is shown in Additional file1 : Fig. S1. The use of a combination of (weighted) MinHash and LSH Forest, which supports fast estimation of the Jaccard distance between two binary sets, has been shown to perform very well for molecules [44 (link)]. Note that other data structures and algorithms implementing a variety of different distance metrics may show better performance on other data and can be used as drop-in replacements of phase I.
In phase II, an undirected weighted -approximate -nearest neighbor graph ( – -NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks is used to increase efficiency [45 (link)]. The – -NNG construction phase takes two arguments, namely , the number of nearest-neighbors to be searched for, and , the factor used by the augmented query algorithm. The variant of the query algorithm increases the time complexity of a single query from to , resulting in an overall time complexity of , where practically , for the – -NNG construction. The edges of the – -NNG are assigned the Jaccard distance of their incident vertices as their weight. Depending on the distribution and the hashing of the data, the – -NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to highly connected clusters of size in the Jaccard space, connected components are created. However, the following phases are agnostic to whether this phase yields a disconnected graph. The effect of parameters and on the final visualization is shown in Additional file1 : Fig. S2. Alternatively, an arbitrary undirected graph can be supplied to the algorithm as a (weighted) edge list.
During phase III, a minimum spanning tree (MST) is constructed on the weighted – -NNG using Kruskal’s algorithm, which represents the central and differentiating phase of the described algorithm. Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low dimensional embedding. The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage—properties which are also desirable in data visualization. The time complexity of Kruskal’s algorithm is , rendering this phase negligible compared to phase II in terms of execution time. In the case of a disconnected – -NNG, a minimum spanning forest is created.
Phase IV lays out the tree on the Euclidean plane. As the MST is unrooted and to keep the drawing compact, the tree is not visualized by applying a tree but a graph layout algorithm. In order to draw MSTs of considerable size (millions of vertices), a spring-electrical model layout algorithm with multilevel multipole-based force approximation is applied. This algorithm is provided by the open graph drawing framework (OGDF), a modular C++ library [40 ]. In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements. Whereas several parameters can be configured for the layout phase, only parameter must be adjusted based on the size of the input data set (Additional file1 : Fig. S3). This phase constitutes the bottleneck regarding computational complexity.
During phase I, the input data are indexed in an LSH forest data structure, enabling -approximate -nearest neighbor (k-NN) searches with a time complexity sub-linear in . Text and binary data are encoded using the MinHash algorithm, while integer and floating-point data are encoded using a weighted variation of the algorithm [41 –43 ]. The LSH Forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions used in encoding the data, and the number of prefix trees . An increase in the values of both parameters led to an increase in main memory usage; however, higher values for also decrease query speed. The effect of parameters and on the final visualization is shown in Additional file
In phase II, an undirected weighted -approximate -nearest neighbor graph ( – -NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks is used to increase efficiency [45 (link)]. The – -NNG construction phase takes two arguments, namely , the number of nearest-neighbors to be searched for, and , the factor used by the augmented query algorithm. The variant of the query algorithm increases the time complexity of a single query from to , resulting in an overall time complexity of , where practically , for the – -NNG construction. The edges of the – -NNG are assigned the Jaccard distance of their incident vertices as their weight. Depending on the distribution and the hashing of the data, the – -NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to highly connected clusters of size in the Jaccard space, connected components are created. However, the following phases are agnostic to whether this phase yields a disconnected graph. The effect of parameters and on the final visualization is shown in Additional file
During phase III, a minimum spanning tree (MST) is constructed on the weighted – -NNG using Kruskal’s algorithm, which represents the central and differentiating phase of the described algorithm. Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low dimensional embedding. The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage—properties which are also desirable in data visualization. The time complexity of Kruskal’s algorithm is , rendering this phase negligible compared to phase II in terms of execution time. In the case of a disconnected – -NNG, a minimum spanning forest is created.
Phase IV lays out the tree on the Euclidean plane. As the MST is unrooted and to keep the drawing compact, the tree is not visualized by applying a tree but a graph layout algorithm. In order to draw MSTs of considerable size (millions of vertices), a spring-electrical model layout algorithm with multilevel multipole-based force approximation is applied. This algorithm is provided by the open graph drawing framework (OGDF), a modular C++ library [40 ]. In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements. Whereas several parameters can be configured for the layout phase, only parameter must be adjusted based on the size of the input data set (Additional file
Full text: Click here