The Liston-Dooley Laboratory

Top

Navigation

Public engagement

What is a t-SNE?

Wednesday, December 8, 2021 at 1:12PM

t-Distributed Stochastic Neighbor Embedding (t-SNE) is the most commonly used non-linear dimensionality reduction algorithm for single cell biology. In its common usage for visualising high-dimensionality single cell data, the algorithm starts with the single cells distributed at random points, along a Gaussian distribution, in transformed space. In an iterative process the cells move along a cost gradient, which provides a penalty for mismatch between the distances between two cells in the original high-dimensional space versus the representational low-dimensional space. In its common usage for visualising high-dimensionality single cell data, the cost gradient of t-SNE places greater weight on pairs of cells close to each other, with medium- and long-range pairs ignored. When sufficient iterations have occurred to reach stability, the outcome produces clusters of similar cells, based on the input data. Membership of a cluster indicates shared properties, however the non-linear nature of the penalty cost does not allow relationships to be inferred by the relative positioning of the clusters.

Running the same dataset through a t-SNE multiple times results in a visually distinct stable states, owing to the random placement of the input data at the first stage. The high cost of violation of local distances ensures that local clusters are maintained across runs, while the low cost of medium- and long-range pairs permits multiple stable states with rotational symmetry to develop. An under-appreciated aspect of the t-SNE algorithm is the early exaggeration of the penalty for violating local distances for the first 50 iterations. Visualising t-SNE runs at each iteration demonstrates that the early exaggeration phase involves a sharp contraction of all points, which then expand out into separate clusters when the exaggeration factor is removed. This early exaggeration is integral to the t-SNE calculation, as maintaining the high penalty throughout results in dense overlapping clusters, while maintaining the low penalty throughout permits splitting of clusters, as cells close together in high-dimensional space do not come in close enough proximity in representational space to drive clustering. It is important to allow both phases and sufficient iterations for the t-SNE to reach stability for consistency in results.

Iteration-by-iteration visualisation of the t-SNE for two unique seeds of the same dataset. Note the initial collapse of the sample into very small distances, and the rotational symmetry observed between the two runs as the samples slowly expand with extra iterations.

Iteration-by-iteration visualisation of the t-SNE for two unique seeds of the same dataset, with no initial exaggeration of the penalty. As the initial collapse of the sample is reduced, similar cells can avoid coming into close enough contact to drive cluster formation. As a result, biological clusters are split in the final representation and repeat runs vary greatly.

Adrian Liston |

1 Comment |

tagged

data science

Reader Comments (1)

What a nice post! Thank you so much and I am really looking forward to reading more and more articles from you.

July 27, 2023 |

fnaf

Post a New Comment

Enter your information below to add a new comment.

Author:

Author Email (optional):

Author URL (optional):

Post:

↓ | ↑

Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>