Hacker News

Deep learning also works on very small data sets by means of embeddings. A large model trained on large data sets can be used as a feature extraction tool when training on small data sets.
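
To make the idea concrete, here is a minimal sketch of "frozen model as feature extractor, tiny model on top". Everything in it is an assumption for illustration: `PRETRAINED` is a made-up lookup table standing in for a real large model, and the nearest-centroid classifier is just one simple choice for the small-data step.

```python
import math

# Stand-in for a large pre-trained model: a frozen table of word vectors.
# The values are invented for illustration, not taken from any real model.
PRETRAINED = {
    "great": (0.9, 0.1),
    "good": (0.8, 0.2),
    "love": (0.7, 0.1),
    "bad": (0.1, 0.9),
    "awful": (0.2, 0.8),
    "hate": (0.1, 0.7),
}

def embed(text):
    """Average the frozen vectors of known words; unknown words are skipped."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

def fit_centroids(examples):
    """'Train' on the small data set: one mean embedding per label."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
        for label, vecs in by_label.items()
    }

def predict(centroids, text):
    """Return the label whose centroid is closest to the text's embedding."""
    e = embed(text)
    return min(centroids, key=lambda label: math.dist(e, centroids[label]))

# Only two labeled examples; the embedding table is never updated.
small_train = [("great product", "pos"), ("awful service", "neg")]
centroids = fit_centroids(small_train)
```

Calling `predict(centroids, "I love it")` then classifies a word that never appeared in the two training examples, which only works because the frozen vectors already place it near "great".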


Re-using an existing model to generate embeddings doesn’t work well for auxiliary tasks with very small data. Even with no fine-tuning at all, you still need a large data set for the auxiliary task itself.

For example, consider needing to train hundreds of unique small models every day, based on new customer inputs that change the causal effects for that day (I had to do this for ad forecasting in a past job).

Generating embeddings via pre-trained models essentially produced gibberish and performed far worse than custom feature engineering + simple logistic models.
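
For comparison, a rough sketch of what "custom feature engineering + simple logistic model" can look like on a handful of rows. The field names (`weekday`, `holiday`, `budget`), the data, and the hyperparameters are all hypothetical; this is not the setup from the job described above, just the general shape of the approach.

```python
import math

def features(row):
    """Hand-engineered features from one raw record (field names are hypothetical)."""
    return [
        1.0,                                  # bias term
        1.0 if row["weekday"] >= 5 else 0.0,  # weekend flag
        1.0 if row["holiday"] else 0.0,       # holiday flag
        math.log1p(row["budget"]),            # log-scaled spend
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def log_loss(w, X, y):
    """Mean negative log-likelihood, without the L2 penalty."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def fit(X, y, lr=0.1, steps=500, l2=0.01):
    """Plain batch gradient descent on L2-regularized logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [l2 * wj for wj in w]
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Made-up rows: label 1 means the day over-performed.
rows = [
    {"weekday": 0, "holiday": False, "budget": 100.0},
    {"weekday": 1, "holiday": False, "budget": 120.0},
    {"weekday": 5, "holiday": False, "budget": 100.0},
    {"weekday": 6, "holiday": False, "budget": 90.0},
    {"weekday": 2, "holiday": True, "budget": 110.0},
    {"weekday": 3, "holiday": False, "budget": 95.0},
]
labels = [0, 0, 1, 1, 1, 0]

X = [features(r) for r in rows]
w = fit(X, labels)
```

With only four weights there is very little to overfit, which is much of the appeal when each model sees just a few rows.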


I’ve seen this mentioned before, including a blog post by the fast.ai folks. Any idea where I can get details? If my tabular data set is small, what kind of embedding can I get out of it? Or is the idea that a larger data set is used for embeddings of categorical data?


Pre-trained embeddings are only helpful if they are trained on a different (ideally larger) dataset or even a different task, but with the same kind of input data. So you would need to find out where else something similar to the data in your tables appears. If some of the data is text, word embeddings may be applicable. Or if you're trying to analyze user activity by time and location, you might try to transfer what can be learned about the influence of holidays from publicly observable activity e.g. on Twitter (just a random idea that popped into my head, no guarantee that it can actually work).

Of course if all you have are numbers without context, there isn't a lot you can do to improve the situation.
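
A very rough sketch of the holiday idea above, with the same caveat that it may not work in practice: estimate a holiday "lift" from a large external series, then attach it as a feature to the small data set. The numbers, the `expected_lift` field, and the helper names are all made up for illustration.

```python
from statistics import mean

# Hypothetical large public series: (is_holiday, activity level) pairs.
public = [
    (False, 100), (False, 104), (False, 96),
    (True, 150), (True, 130), (False, 100),
]

def holiday_lift(series):
    """Ratio of mean holiday activity to mean non-holiday activity."""
    hol = [v for is_h, v in series if is_h]
    reg = [v for is_h, v in series if not is_h]
    return mean(hol) / mean(reg)

def add_transferred_feature(rows, lift):
    """Attach the externally estimated lift to each row of the small data set."""
    return [dict(r, expected_lift=lift if r["holiday"] else 1.0) for r in rows]

lift = holiday_lift(public)

# The small data set only knows whether a day is a holiday.
small = [{"day": "d1", "holiday": False}, {"day": "d2", "holiday": True}]
enriched = add_transferred_feature(small, lift)
```

The small model then gets a prior about holidays it could never have estimated from its own few rows, assuming the external signal really behaves like the target data.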


I think this is mainly a thing for perception (images and sounds). Tabular data would have to match up with the training dataset, and most interesting tabular models are the sorts of things guarded like piles of gold by the businesses that build them...



