Oh yes, I'm very familiar with these. All these do though is extract information...

otoburb · on Aug 28, 2022

Even with Amazon Textract, the middle step to curate extracted information into some form of meaning is still missing. Didn't realize this is still an unsolved problem.

chaps · on Aug 28, 2022

Very much so. Here's an example document I work with that has information that's difficult to extract from: https://s3.documentcloud.org/documents/6929951/CRID-1061543....

These sheets are missing a lot of context that has to be interpreted (i.e., how do you taxonomize each field of information?). Asking questions of these documents is then a further step on top of that: "Is the allegation about sexual violence?", "What is the name and rank of the person being accused?", "Is anything anomalous in the review process?", "Has this person's rank changed in the past 5 years?", etc.

Now expand this problem to hundreds of thousands of different types of document.

pani5ue · on Aug 28, 2022

If you haven't yet, you should look at the full-text search capabilities of SQLite and PostgreSQL. They could simplify the search part of your workflow a bit.
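
For illustration, here is a minimal sketch of what the SQLite side might look like, using Python's built-in sqlite3 module and the FTS5 extension. It assumes your SQLite build has FTS5 compiled in (most recent ones do, but not all). The table name, column names, helper functions, and sample rows are all made up for the example. It only covers keyword search over text that has already been extracted, not the interpretation step discussed above.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 index from (doc_id, body) pairs.

    Hypothetical helper: the table and column names are illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    # doc_id is stored but not tokenized; body is the searchable text.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")
    conn.executemany("INSERT INTO docs (doc_id, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query):
    """Return doc_ids matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


# Made-up sample text standing in for OCR/Textract output.
SAMPLE_DOCS = [
    ("doc-1", "Allegation of excessive force reviewed by the supervising sergeant"),
    ("doc-2", "Complaint about a missed court appearance, closed without finding"),
    ("doc-3", "Officer rank changed from sergeant to lieutenant after review"),
]

conn = build_index(SAMPLE_DOCS)
print(search(conn, "sergeant"))              # documents mentioning "sergeant"
print(search(conn, "allegation AND force"))  # documents containing both terms
```

PostgreSQL offers comparable functionality through its tsvector and tsquery types, so the same idea should carry over if the documents already live there.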