> Maybe you mentioned it in your demo and I missed it, but how does this differ from pasting the log messages to ChatGPT / Claude / another LLM? Is it mainly that yours can iterate over a large logfile without blowing up the context window?
We do quite a bit of aggregation over the log file, generate summary stats, and choose which bits to stuff into the LLM. We plan to support more platforms than just Spark.
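To make the "aggregate first, then pick what goes to the LLM" idea concrete, here is a minimal sketch. It is not the tool's actual implementation: the log format, the `summarize` and `template_of` names, and the character budget are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <message>".
LINE_RE = re.compile(r"^(\S+)\s+(INFO|WARN|ERROR)\s+(.*)$")


def template_of(message: str) -> str:
    """Collapse numbers so that similar messages group together."""
    return re.sub(r"\d+", "<n>", message)


def summarize(lines, max_chars=200):
    """Aggregate log lines and pick a bounded excerpt for an LLM prompt."""
    levels = Counter()
    templates = Counter()
    examples = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        _, level, message = m.groups()
        levels[level] += 1
        if level != "INFO":
            key = template_of(message)
            templates[key] += 1
            examples.setdefault(key, line)  # keep the first raw example

    # Most frequent WARN/ERROR templates first, until the budget is spent.
    excerpt, used = [], 0
    for key, count in templates.most_common():
        entry = f"{count}x {examples[key]}"
        if used + len(entry) > max_chars:
            break
        excerpt.append(entry)
        used += len(entry)
    return {"levels": dict(levels), "excerpt": excerpt}
```

The point of the budget is that the prompt size stays bounded no matter how large the log file is; only the counts and a few representative lines reach the model.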
> Does it suffer from the same issue as other LLMs, where it will always identify potential optimizations or improvements even if none are truly needed?
Funnily enough, instructing sonnet-3.7 to not suggest unnecessary optimisations seems to have done the trick!
Thanks for the feedback! The first version had a lot more detailed code, but we decided to link to our GitHub rather than copy all the code. We wanted to illustrate the core touch points involved in extending DF.
DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python sdk.
> It'd also be important/useful to support Python udfs (think numpy/jax/etc.).
Yep, that's our long-term game plan.
> It'd be very cool if you could collaborate with or even tap into the polars frontend. If you could execute polars logical plans but with a streaming source, that would be huge.
Are there examples of projects that do this? I'd be very much interested in looking into this.
> Are there examples of projects that do this? I'd be very much interested in looking into this.
Nope, I don't believe there are. Unfortunately, they don't seem interested in exporting their logical plans to Substrait, so there's no obvious way forward.
> DataFusion is primarily a batch OLAP system, so we should be able to support hybrid workloads as well. And definitely agree with you re: Polars dev exp. That is something we are aiming for with our forthcoming Python sdk.
Ah, since this is the case, it might also make sense to tap into the DataFusion Python bindings, which recently got a massive overhaul to bring their dev experience closer to Polars (though the docs are still quite a bit behind).
I'm looking forward to seeing what the result will be! I know Ibis also is an option, but with my little bit of playing around with it, I've found it's just the lowest common denominator and doesn't provide as nice of an experience as directly using polars (or whatever query engine api is provided).
We absolutely do; the library itself is designed to be extensible. We are currently working on adding webhooks as one of our sources. Are there any specific connectors/sources you'd be interested in?
I have lots of HTTP endpoints that we poll with a cursor, but the underlying data is very large (we work with snapshots of it) and updates very frequently. Eventually we'll move to something else (e.g. interacting directly with the underlying services with capnproto), so really it would just be useful to be able to define these sources ourselves. I'm currently doing full-stack engineering at an HFT firm, and we were thinking of using DataFusion to let users join, query and aggregate the data in realtime. I haven't attempted this yet, and doing so means integrating with what currently exists, as I don't have time to rewrite all of the services.
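For anyone following along, the "poll with a cursor" pattern looks roughly like the sketch below. This is an illustration only, not Denormalized's source API: `poll_source`, the `FetchPage` signature, and the `max_polls` bound are hypothetical, and the HTTP call is injected as a plain function so the shape of the loop is visible without a network.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Any]
# Hypothetical fetch signature: takes the last cursor (None on the first call)
# and returns (records, next_cursor).
FetchPage = Callable[[Optional[str]], Tuple[List[Record], Optional[str]]]


def poll_source(
    fetch_page: FetchPage,
    max_polls: int,
    cursor: Optional[str] = None,
) -> Iterator[Record]:
    """Yield records from a cursor-paginated endpoint, one poll at a time."""
    for _ in range(max_polls):
        records, next_cursor = fetch_page(cursor)
        yield from records
        if next_cursor is not None:
            # Advance only when the server hands back a new cursor, so an
            # empty poll retries from the same position next time.
            cursor = next_cursor
```

A long-running version would loop indefinitely and sleep between polls; the bounded `max_polls` here just keeps the sketch easy to run and test.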
While we haven't checked out Fluvio yet, we are fans of Arroyo. Regarding the latter, my understanding is that the team is going for a SQL-first, complete replacement for Flink. Denormalized is meant to be an embeddable engine you can import within your project. Our plan is to focus on the developer experience for users building with Python and TypeScript in particular.
Thanks @ztratar. We would love to hear about your workloads at Embra; that would be very helpful vis-a-vis the direction of our TypeScript experience. Feel free to drop us an email: hello@denormalized.io
Thanks for the encouraging words @ethegwo. Tonbo looks very cool and is potentially something we could use for our state backend (we currently use RocksDB, which we aren't that happy about). Would love to chat about how we can work together. Feel free to reach out to me - amey@denormalized.io