Does this work well on KVPs and tables? That is where I typically have the most trouble with Tesseract and where the cloud-provider OCR systems really shine.
tl;dr: NER models work best when they process data similar in structure to their training data. This notebook shows a few examples of how one can use an NER model not trained on JSON to identify and redact both structured and unstructured values within a JSON blob. The NER model used is freely available but closed source, but one can expect similar results with open source NER models such as spaCy.
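For anyone curious what the naive starting point looks like, here is a minimal sketch that walks a JSON blob and runs NER over every string value using spaCy. It is not the notebook's code; the model name, entity labels, and sample data are assumptions for illustration, and the notebook's trick of adapting to JSON structure (which is the whole point above) is not reproduced here.

    import json
    import spacy

    # Assumes spaCy's small English model is installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    REDACT_LABELS = {"PERSON", "ORG", "GPE"}  # illustrative choice of entity types

    def redact_text(text):
        # Replace recognized entities with [LABEL] placeholders.
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ in REDACT_LABELS:
                out.append(text[last:ent.start_char])
                out.append("[" + ent.label_ + "]")
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    def redact_json(node):
        # Walk the parsed JSON and redact every string value, whether it is a
        # short key/value pair or a long free-text field.
        if isinstance(node, dict):
            return {k: redact_json(v) for k, v in node.items()}
        if isinstance(node, list):
            return [redact_json(v) for v in node]
        if isinstance(node, str):
            return redact_text(node)
        return node  # numbers, booleans, null pass through untouched

    blob = '{"name": "Jane Smith", "notes": "Met Jane Smith at Acme Corp in Atlanta."}'
    print(json.dumps(redact_json(json.loads(blob)), indent=2))

Short KVP values with little surrounding context (like the "name" field above) are exactly where a context-trained NER model tends to struggle, which is what the notebook's examples try to work around.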
There are also way too many people with access to non-anonymized data, e.g., the development team that has read privileges on the production database, or employees at Uber spying on customers (https://www.theguardian.com/technology/2016/dec/13/uber-empl...).
edit:
Shameless plug: check out tonic.ai for a solution to the above problem.
Tonic AI | Atlanta or SF | Software Engineer | ONSITE
At Tonic we are building tools to help people create synthetic data that looks, feels, and acts like their real data, without compromising security, privacy, or regulatory compliance.
Looking for a full-stack engineer, with a preference for someone stronger in back-end technologies.
Hi nagarjun. My company (https://tonic.ai) builds dev tools, and we've noticed recently that companies with remote teams have been using our product for an unexpected use case. We are trying to investigate it further. Would you be willing to chat with me? If so, I'll drop you a way to get in touch.
Thanks. Our landing page is currently in a constant state of flux; I'll make sure that gets fixed. I'm a bit surprised, but my previous reply actually generated a not-insignificant amount of traffic to our site, so we also opened up app.tonic.ai for anyone who wants to give the product a whirl.
Eddie, I think it would be neat if we could build vendor images by just supplying Docker containers, with maybe some type of config.
At Tonic (https://tonic.ai) we do on-prem deploys with Docker containers and docker-compose. It's seamless, and it would be great to use that same flow for the DigitalOcean Marketplace.
We (Redash) have a similar setup (Docker Compose based) and we used Packer to build the DigitalOcean image. Our setup is public on GitHub, in case you want to copy:
We have some vendors building images with a variety of methods, Packer for example (blog post coming soon). I _want_ to say there is someone building out of a container. We've got a repo [0] with our current process, but we're definitely looking for ways to improve it. You should fill out the vendor form and we'll be in touch [1]!
As others have said, we've found a lot of smaller companies will test with production data because of their need/desire to move quickly. But we've also seen much, much larger companies use production data in their dev/staging environments. Sometimes there will be production-like safeguards and security measures in place, but not always. People shy away from practices that slow down development and testing.
We think synthetic data is the right solution for a few reasons. Most importantly, we believe it provides the right level of security while still allowing your team to be productive, i.e., your business logic and test cases still work. It also allows you to scale really easily, since you effectively have a ruleset for generating data of any size. Finally, it's a great way to share data throughout your organization and can help facilitate sales and partnerships. If you're curious about scaling, check this post out: https://www.tonic.ai/blog/condenser-a-database-subsetting-to...
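To make the "ruleset" point concrete, here is a hypothetical sketch in plain Python, not our actual product: each column maps to a generator function, so producing more data is just producing more rows. Every name and rule in it is made up for illustration.

    import random

    # Hypothetical ruleset, purely for illustration (not Tonic's implementation).
    FIRST_NAMES = ["Ada", "Grace", "Alan", "Barbara"]

    ruleset = {
        "user_id": lambda i: i,                                        # stable primary key
        "name":    lambda i: random.choice(FIRST_NAMES),
        "email":   lambda i: "user{}@example.com".format(i),
        "balance": lambda i: round(random.uniform(0, 10000), 2),
        "api_key": lambda i: "".join(random.choices("0123456789abcdef", k=32)),
    }

    def generate_rows(ruleset, n):
        # Yield n synthetic rows that match the schema but contain no real data.
        for i in range(1, n + 1):
            yield {col: gen(i) for col, gen in ruleset.items()}

    # The same ruleset works for 10 rows or 10 million; only n changes.
    for row in generate_rows(ruleset, 3):
        print(row)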