
Oh yes, I'm very familiar with these. All they do, though, is extract information; they don't make it immediately useful. So there's a massive gulf in the middle step that nobody has bridged yet. Textract gets close...ish to that, but it's prohibitively expensive.


Even with Amazon Textract, the middle step to curate extracted information into some form of meaning is still missing. Didn't realize this is still an unsolved problem.


Very much so. Here's an example document I work with that has information that's difficult to extract from: https://s3.documentcloud.org/documents/6929951/CRID-1061543....

There's a lot of missing context in these sheets that has to be interpreted (i.e., how do you taxonomize each field of information?). Then asking questions of these documents is another step on top: "Is the allegation about sexual violence?", "What is the name and rank of the person being accused?", "Is anything anomalous in the review process?", "Has this person's rank changed in the past 5 years?", etc.

Now expand this problem to hundreds of thousands of different types of documents.


If you haven't yet, you should look at the full-text search capabilities of SQLite and Postgres. They could simplify the search part of your workflow a bit.
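
For what it's worth, here's a rough sketch of the SQLite side in Python, using the FTS5 extension through the standard sqlite3 module. It assumes your SQLite build has FTS5 compiled in (most do, but not all). The table name, column names, helper functions, and sample rows are all made up for illustration and have nothing to do with the real documents.

```python
import sqlite3

# Hypothetical sample rows: (doc_id, extracted text). Not real data.
SAMPLE_DOCS = [
    ("doc-1", "Allegation of excessive force against an officer, rank sergeant"),
    ("doc-2", "Review board summary, no allegation sustained"),
    ("doc-3", "Officer promoted to lieutenant in 2019"),
]


def build_index(docs):
    """Create an in-memory FTS5 table and load (doc_id, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    # doc_id is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")
    conn.executemany("INSERT INTO docs (doc_id, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query):
    """Return doc_ids matching an FTS5 query string, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(SAMPLE_DOCS)
    print(search(conn, "allegation AND officer"))
```

This only covers keyword lookup over text you've already extracted. It doesn't do anything about the interpretation step discussed upthread, and the default tokenizer does no stemming, so "allegation" won't match "allegations" unless you configure a different one.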



