Here’s a relevant recent lecture from Alex Townsend, as part of Gil Strang’s 18.065 course, “Rapidly Decreasing Singular Values” https://www.youtube.com/watch?v=9BYsNpTCZGg
Currently watching the lecture series, and it's a great experience. In fact, this made me understand the blog post to some level I wouldn't have thought possible only a few weeks ago.
Good stuff. A lot of clustering methods can be viewed as matrix factorization (see, e.g., https://arxiv.org/abs/1512.07548) so it would be interesting to see if you could spend more computation on the creating the low dimensional approximation to achieve a better result. E.g. perhaps optimizing the absolute error (L1 norm) would be better than the squared error (L2 norm)?