We have a Slack channel with them; these are the versions they mentioned:
posthog-node 4.18.1
posthog-js 1.297.3
posthog-react-native 4.11.1
posthog-docusaurus 2.0.6
We're fixing it! For some reason this happens on only _some_ phones in our office, so it was hard to repro. I think it has to do with Safari rendering. We'll tone down our WebGPU usage.
A high-level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and has a tendency to hallucinate OCR text and table structure, and to drop content.
... And? We're judging it on the merits of the technology it purports to be, not the pockets of the people who bankroll it. Probably not fair, sure, but when I pick my OCR, I want to pick SOTA. These comparisons and announcements help me find those.
We’ve generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.
A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example: we’ve found bounding-box-based attribution to be non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.
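To make that concrete, here’s a minimal sketch of what a bounding-box-attributed extraction result could look like (the class and field names are illustrative, not our actual API):

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    page: int    # zero-indexed page number
    x0: float    # normalized coordinates in [0, 1]
    y0: float
    x1: float
    y1: float

@dataclass
class ExtractedField:
    name: str
    value: str
    source: BoundingBox   # the exact region the value was read from
    confidence: float

# Every answer carries a pointer back to the region of the document it came
# from, so a human (or an automated check) can verify it rather than trusting
# the model's output blindly.
field = ExtractedField(
    name="policy_number",
    value="PN-48213",
    source=BoundingBox(page=2, x0=0.12, y0=0.33, x1=0.41, y1=0.36),
    confidence=0.97,
)
```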
Whether that matters in your product is ultimately use-case dependent, but the more important challenge for us has been reliability in outputs. RD-TableBench currently uses a single table image on a page, but when testing with real-world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it’s a more serious case such as hallucinating large sets of content.
The more extreme case is that internally we fine-tuned a version of Gemini 1.5, along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn’t prevent frequent checkbox hallucination on both the Flash (+17% error rate) and Pro (+8% error rate) models. Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team’s directive is to get as close as we can to that ideal state.
We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close to 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table parsing pipeline, and we’ll have more to share in terms of updates there soon.
For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term as inference becomes more efficient we want to move the needle on cost as well.
Overall though, I’m very excited about the progress here.
---
One small comment on your footnote: the evaluation script using the Needleman-Wunsch algorithm doesn’t actually consider the headers output by the models; it looks only at the table structure itself.
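For anyone unfamiliar with the approach, here’s a rough sketch of what a Needleman-Wunsch alignment over flattened table cells can look like; this is an illustration of the general idea under my own assumptions, not the actual RD-TableBench evaluation script:

```python
def needleman_wunsch(pred_cells, gold_cells, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two flattened sequences of table cells."""
    n, m = len(pred_cells), len(gold_cells)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (
                match if pred_cells[i - 1] == gold_cells[j - 1] else mismatch
            )
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Toy tables as lists of rows; the first row is the header and is dropped,
# so only the body cells are compared.
pred_table = [["Name", "Qty"], ["Widget", "3"], ["Gadget", "5"]]
gold_table = [["Name", "Qty"], ["Widget", "3"], ["Gadget", "4"]]

pred_body = [cell for row in pred_table[1:] for cell in row]
gold_body = [cell for row in gold_table[1:] for cell in row]
print(needleman_wunsch(pred_body, gold_body))  # higher score = closer alignment
```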
Love the PubTables work! It's a really useful dataset. Their data comes from existing annotations in scientific papers, so in our experience it doesn't include many of the hardest cases that methods still fail on today. The annotations are computer-generated rather than manually labeled, so you don't get things like scanned and rotated images, or much diversity in languages.
I'd encourage you to take a look at some of our data points to compare for yourself! Link: huggingface.co/spaces/reducto/rd_table_bench
In terms of the overall importance of table extraction, we've found it to be a key bottleneck for folks looking to do document parsing. It's up there amongst the hardest problems in the space alongside complex form region parsing. I don't have the exact statistics handy, but I'd estimate that ~25% of the pages we parse have some hairy tables in them!
Valid concern; security and safety are essential for anything that can access a production system. We use k8s RBAC to ensure that the access is read-only, so even if the LLM hallucinates and tries to destroy something, it can't.
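As a rough illustration of the read-only setup (the role name, resource list, and use of the Python kubernetes client are my assumptions here, not necessarily how it's actually deployed):

```python
from kubernetes import client, config

config.load_kube_config()

# A ClusterRole that only allows get/list/watch: no create, update, patch, or
# delete verbs, so a hallucinated destructive call is rejected by the
# Kubernetes API server itself, regardless of what the LLM asks for.
read_only_role = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="llm-agent-readonly"),  # hypothetical name
    rules=[
        client.V1PolicyRule(
            api_groups=["", "apps", "batch"],
            resources=["pods", "pods/log", "events", "deployments", "jobs", "nodes"],
            verbs=["get", "list", "watch"],
        )
    ],
)

client.RbacAuthorizationV1Api().create_cluster_role(read_only_role)
```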
As we eventually move towards write access, we're closely following the work on LLM safety. There has been some interesting work on using smaller models to evaluate tool calls/completions against a set of criteria to ensure safety.
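A minimal sketch of that gating pattern, assuming an OpenAI-style chat API; the judge model name and criteria are placeholders, not a specific system we run:

```python
import json
from openai import OpenAI

llm = OpenAI()

SAFETY_CRITERIA = (
    "The tool call must be read-only, must not touch secrets, "
    "and must stay within the namespace the user asked about."
)

def is_tool_call_allowed(tool_name: str, arguments: dict) -> bool:
    """Ask a small judge model whether a proposed tool call meets the criteria."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any small, cheap judge model
        messages=[
            {
                "role": "system",
                "content": (
                    "Evaluate the tool call against these criteria:\n"
                    f"{SAFETY_CRITERIA}\nReply with only 'allow' or 'deny'."
                ),
            },
            {"role": "user", "content": json.dumps({"tool": tool_name, "arguments": arguments})},
        ],
    )
    return resp.choices[0].message.content.strip().lower() == "allow"

# The agent's proposed call only executes if the judge allows it.
if is_tool_call_allowed("kubectl_get", {"resource": "pods", "namespace": "prod"}):
    pass  # execute the call
```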
The other problem is that you become an extremely big target for bad actors, since you have read/write (or even just read) access to all these k8s clusters. Obviously you can mitigate that to a fairly high degree with on-prem, but for users not on that...