
Dask sidesteps the problem by splitting the data into smaller pandas DataFrames and processing them individually rather than all at once. But the fundamental issue of pandas storing strings as Python objects (which is very inefficient) carries over to Dask, because a Dask DataFrame is just a collection of smaller pandas DataFrames. The technologies that really address this problem are Polars and PySpark, which use much more memory-efficient string representations.
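A minimal sketch of the overhead being described, assuming only pandas is installed: with `object` dtype, each cell holds a pointer to a separate boxed Python `str`, so the true footprint (`deep=True`) is several times larger than the pointer array alone (`deep=False`).

```python
import pandas as pd

n = 100_000
s = pd.Series(["hello"] * n, dtype="object")

shallow = s.memory_usage(deep=False)  # just the 8-byte pointers per element
deep = s.memory_usage(deep=True)      # plus each boxed Python str object

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
print(f"ratio:   {deep / shallow:.1f}x")
```

Arrow-backed string columns (as used by Polars, and available in recent pandas via `dtype="string[pyarrow]"` when pyarrow is installed) instead pack the bytes into one contiguous buffer, eliminating the per-string object overhead.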

