I'll just throw it in the discussion: pandas could just interface with and leave...

isoprophlex · on July 6, 2020

In my opinion, pandas is fundamentally broken and unsuitable for any production workload.

The heavy lifting should be left to an RDBMS, as you say: something with a sensible, battle-hardened query planner. I've written and debugged too many lines of manual pd joins/merges; something declarative like SQL is much nicer because the query planner is almost always right.
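
To make the contrast concrete, here is a rough sketch of the same join-and-aggregate written both ways. The tables, column names, and the in-memory SQLite database are made up for illustration; SQLite just stands in for whatever RDBMS you actually have.

```python
import sqlite3

import pandas as pd

# Hypothetical toy tables, invented for this example.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "customer_id": [10, 20, 10], "amount": [5.0, 7.5, 2.0]}
)
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["ann", "bob"]})

# Manual pandas version: each step (join, group, sort) is spelled out by hand.
manual = (
    orders.merge(customers, on="customer_id", how="inner")
    .groupby("name", as_index=False)["amount"]
    .sum()
    .sort_values("name")
    .reset_index(drop=True)
)

# Declarative version: state the result you want and let the database plan it.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
customers.to_sql("customers", conn, index=False)
declarative = pd.read_sql_query(
    """
    SELECT c.name, SUM(o.amount) AS amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """,
    conn,
)
conn.close()
```

Both `manual` and `declarative` end up with one row per customer name; on data this small the planner has nothing to optimize, so this only shows the difference in style.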

Furthermore, as a user, I've always found the pandas API to be very confusing. I'm always having to interrupt my workflow to figure out boring details about the API (is it df.groupBy().rolling(center=True).median() or any other permutation?), whereas e.g. PySpark or SQL are so much more ergonomic.
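
For the record, a sketch of how that particular chain is spelled in pandas, assuming a frame with hypothetical `key` and `value` columns. The method is lowercase `groupby` (camelCase `groupBy` is the PySpark name), and `rolling` needs a window size.

```python
import pandas as pd

# Hypothetical data: two groups with a handful of values each.
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "a", "b", "b", "b"],
        "value": [1, 5, 3, 4, 10, 30, 20],
    }
)

# Centered rolling median of width 3, computed separately within each group.
# The result is indexed by (key, original row label); rows without a full
# window on both sides come back as NaN.
rolling_median = df.groupby("key")["value"].rolling(window=3, center=True).median()
```

With these values, only the interior rows of each group have a full window, so the edges of every group are NaN.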

Finally, typing inside pd dataframes is a complete and utter nightmare: the default int64 having no null value, or the idiocy around datetimes expressed as epoch nanoseconds...
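
A small sketch of both complaints, assuming a reasonably recent pandas (the nullable `Int64` extension dtype is opt-in and not the default). The values are arbitrary.

```python
import pandas as pd

# A missing value silently turns an integer column into float64.
with_default = pd.Series([1, 2, None])
print(with_default.dtype)  # float64

# The opt-in nullable extension dtype keeps integers and stores pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)  # Int64

# Datetimes held at nanosecond resolution are integers counting
# nanoseconds since the Unix epoch under the hood.
ts = pd.Series(pd.to_datetime(["2020-07-06"])).astype("datetime64[ns]")
epoch_ns = ts.astype("int64")
```

Here `epoch_ns` holds the timestamp as a plain integer count of nanoseconds since 1970-01-01.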

Pandas is nice for noodling around in notebooks. But for me, it should never be used beyond that.

disgruntledphd2 · on July 6, 2020

Pandas combines the intuitive nature of base-R with the excellent missing data model of Python.

alpineidyll3 · on July 7, 2020

Aye aye!

prepend · on July 6, 2020

I use pandas with multiple sources and try to keep as much as possible in the db.

But many sources are outside the db, or run on multiple, disconnected dbs.

Loading data into a db is impractical and I only do it if necessary.

Also, different people run and manage dbs so it’s frequently easier to run in pandas.
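
A minimal sketch of that workflow, with two in-memory SQLite connections standing in for disconnected databases and an in-memory CSV standing in for a file outside any db. All table and column names are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Two separate databases that cannot be joined against each other directly.
db_a = sqlite3.connect(":memory:")
db_a.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
db_a.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

db_b = sqlite3.connect(":memory:")
db_b.execute("CREATE TABLE events (user_id INTEGER, clicks INTEGER)")
db_b.executemany("INSERT INTO events VALUES (?, ?)", [(1, 3), (1, 4), (2, 5)])

# A third source that lives outside any database.
csv_source = io.StringIO("user_id,region\n1,eu\n2,us\n")

# Keep as much work as possible in each db: aggregate there, pull only results.
users = pd.read_sql_query("SELECT user_id, name FROM users", db_a)
clicks = pd.read_sql_query(
    "SELECT user_id, SUM(clicks) AS clicks FROM events GROUP BY user_id", db_b
)
regions = pd.read_csv(csv_source)

# Only the final cross-source join happens in pandas.
combined = users.merge(clicks, on="user_id").merge(regions, on="user_id")

db_a.close()
db_b.close()
```

The aggregation stays in the database that owns the data, and pandas only does the last join across sources that no single db can see.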