I'm not familiar with the topic. How do you measure the quality of the synthetic data? That is, how close are the synthetic samples to the real ones? Moreover, can you control this quality while generating synthetic samples?
From the perspective of using synthetic imagery to train machine vision systems, I think that the idea of fidelity (i.e. how similar synthetic images are to real images) is less than half the story, and has the potential to be dangerously misleading.
Of greater concern are quality measures that look across the entire dataset. Here are some hypothetical metrics which (although impossible to compute in practice) will help get you thinking in the right way; a rough computational proxy is sketched after the list.
- How does the synthetic image manifold compare to the natural image manifold?
- Are there any points on the synthetic image manifold where the local number of dimensions is significantly less than at the corresponding point on the natural image manifold? (Would indicate an inability to generalise across that particular mode of variation in that part of feature space).
- Are there any points where the distance between the synthetic image manifold and the natural image manifold is large AND the variance of the synthetic image manifold in the direction of that difference is small? (Would indicate an inability to generalise across the synthetic-to-real gap at that point on the manifold).
- Does your synthetic data systematically capture the correlations that you wish your learning algorithm to learn?
- Does your synthetic data systematically eliminate the confounding correlations that may be present in nature but which do not necessarily indicate the presence of your target of interest?
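The exact metrics above aren't computable, but as a rough illustration of the flavour, here is a minimal sketch that estimates the local intrinsic dimension of each set via PCA on k-nearest-neighbour neighbourhoods. It assumes `real_emb` and `synth_emb` are embedding arrays from some pretrained feature extractor; the names and the random stand-in data are purely illustrative.

```python
# Rough proxy sketch (not the uncomputable metrics above): estimate the local
# intrinsic dimension of synthetic vs. real image sets in an embedding space,
# using PCA on each point's k nearest neighbours.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def local_dims(emb, k=50, var_threshold=0.95):
    """For each point, count the PCA components of its k-NN neighbourhood
    needed to explain `var_threshold` of the local variance."""
    nn = NearestNeighbors(n_neighbors=k).fit(emb)
    _, idx = nn.kneighbors(emb)
    dims = np.empty(len(emb), dtype=int)
    for i, neighbours in enumerate(idx):
        local = emb[neighbours]
        pca = PCA().fit(local)
        cum = np.cumsum(pca.explained_variance_ratio_)
        dims[i] = int(np.searchsorted(cum, var_threshold) + 1)
    return dims

rng = np.random.default_rng(0)
real_emb = rng.normal(size=(1000, 64))           # stand-in for real embeddings
synth_emb = rng.normal(size=(1000, 64)) * 0.5    # stand-in for synthetic embeddings

# If the synthetic set's neighbourhoods are consistently "thinner" than the
# real set's, that is a hint that some modes of variation are missing.
print("real  local dim (median):", np.median(local_dims(real_emb)))
print("synth local dim (median):", np.median(local_dims(synth_emb)))
```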
Engineering with synthetic data is not data mining. It is much more akin to feature engineering.
The naive way to do it would be to determine the (approximate) distribution and parameters of your data, then generate similar data that conforms to the same distribution under the same parameters to a very high level of confidence (ideally over 99%). The confidence interval then also gives you error bars with which to control and tune the quality of the synthetic data. That's not perfect, though: you'd also want to make sure you're conforming to other important features particular to your data (such as sparsity and dimensionality).
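As a concrete (and deliberately naive) illustration of that fit-then-generate-then-verify loop, here's a minimal 1-D sketch using a lognormal family and a two-sample Kolmogorov-Smirnov test as the goodness-of-fit gate. The distribution family, sample sizes, and thresholds are assumptions, not a recipe.

```python
# Minimal sketch of the "fit, generate, verify" approach for 1-D data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
real = rng.lognormal(mean=0.0, sigma=0.4, size=2_000)   # stand-in for real data

# Fit the assumed family (here lognormal) to the real data.
shape, loc, scale = stats.lognorm.fit(real, floc=0)

# Generate synthetic data under the fitted parameters.
synthetic = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                              size=10_000, random_state=rng)

# Two-sample KS test: a large p-value means we cannot distinguish the
# synthetic sample from the real one at this sample size -- the "error bars"
# you would use to tune and gate the generator.
stat, p_value = stats.ks_2samp(real, synthetic)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.3f}")
```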
There's also a common pitfall: in many cases where you'd like to use synthetic data, you're doing it because you lack sufficient real data. This is very dangerous, because that might also mean you have a fundamental misunderstanding of the distribution and parameters of the real data (or those might be simply unknown). This is tantamount to extrapolating from limited data.
What another commenter said about how synthetic data is useful for providing analysts with good-quality dummy data instead of confidential real data is correct. I think that's a great use case for synthetic data. But in general, I disagree with using synthetic data to make up for a dearth of real-world data unless you have reasonable certainty that your data conforms to a certain distribution with certain features and parameters.
One such area is financial simulation. You can generally be reasonably certain that price data will conform to a lognormal distribution. So it's okay to generate synthetic lognormal price data in place of real price data for certain types of analysis. But again, I would still stress that you can't use that to measure (for example) how profitable an actual trading strategy would be. You need real data for that (to analyze order fills, counter survivorship bias, etc).
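For instance, a minimal sketch of generating such synthetic price data might simulate geometric Brownian motion, under which prices at each horizon are lognormally distributed. The drift and volatility values below are illustrative, not calibrated to any real market.

```python
# Sketch: synthetic price paths under geometric Brownian motion, so that
# prices at each horizon are lognormally distributed.
import numpy as np

def gbm_paths(s0=100.0, mu=0.05, sigma=0.2, days=252, n_paths=1000, seed=0):
    """Simulate daily closing prices for `n_paths` independent GBM paths."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / 252.0  # one trading day in years
    # Log-returns are i.i.d. normal under GBM.
    log_returns = rng.normal((mu - 0.5 * sigma**2) * dt,
                             sigma * np.sqrt(dt),
                             size=(n_paths, days))
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

paths = gbm_paths()
print("terminal price quantiles:",
      np.round(np.quantile(paths[:, -1], [0.05, 0.5, 0.95]), 2))
```

Data like this is fine for exercising analytics, but, as above, it tells you nothing about order fills or survivorship bias in a real strategy.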
Another area is computer vision. As others have pointed out, since our understanding of roads is very good, it's very effective to generate synthetic data for training self-driving vehicle models. But it's still tricky, and it can be extremely confounding if misused.
I don't actually think you want to mirror the natural data distribution, but rather to provide a distribution which has a sufficiently high variance in the right directions so that the resulting NN polytope has a chance of being approximately 'correct'.
Because you have this piecewise-linear sort of warping of the feature space going on, the NN is basically a whole bunch of lever-arms. The broader the support that you can give those lever arms, the less they will be influenced by noise and randomness ... hence my obsession with putting enough variance into the dataset along relevant dimensions.
To put this another way, I think that the synthetic data manifold has to be 'fat' in all the right places.
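One way to make "fat in the right places" semi-concrete: in an embedding space, compare the synthetic set's spread along the direction of the synthetic-to-real gap with its typical spread. This is only a sketch with random stand-in arrays; `real_emb` and `synth_emb` are assumed embeddings, not anything from a specific pipeline.

```python
# Sketch: is the synthetic manifold "fat" along the synthetic-to-real gap?
import numpy as np

rng = np.random.default_rng(1)
real_emb = rng.normal(size=(2000, 128))
synth_emb = rng.normal(size=(2000, 128)) + 0.3  # offset stands in for the domain gap

gap = real_emb.mean(axis=0) - synth_emb.mean(axis=0)
gap_dir = gap / np.linalg.norm(gap)

# Spread of the synthetic data projected onto the gap direction,
# versus its root-mean-square per-direction spread.
std_along_gap = (synth_emb @ gap_dir).std()
typical_std = np.sqrt(synth_emb.var(axis=0).mean())

print(f"gap size            : {np.linalg.norm(gap):.2f}")
print(f"synth std along gap : {std_along_gap:.2f}")
print(f"synth typical std   : {typical_std:.2f}")
# A large gap combined with a small std along the gap direction is the
# warning sign described above: little support with which to generalise
# across the synthetic-to-real difference.
```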
You have a good point, and I probably should have been more clear. When I said same distribution and same parameters, the parameters I was thinking of were things like mean and variance. Though to be fair, mean and variance aren't formal parameters of every distribution.
Can you give an example of successful synthetic data generation which doesn't need to map to the same distribution? I'm surprised at that idea.
Well, in a sensing-for-autonomous-vehicles type problem, it's actually more important to have simple, easy-to-specify data distributions than ones which map to reality, which in any case may be so poorly or incompletely understood that it's impossible to write a requirement for it.
So, as a simple example, the illumination in a real data-set might be strongly bimodal, with comparatively few samples at dawn and dusk, but we might in a synthetic dataset want to sample light levels uniformly across a range that is specified in the requirements document.
Similarly, on the road, the majority of other vehicles are seen either head-on or tail-on, but we might want to sample uniformly over different target orientations to ensure that our performance is uniform, easily understood, and does not contain any gaps in coverage.
Similarly, operational experience might highlight certain scenarios as being particularly high risk. We might want to over-sample in those areas as part of a safety strategy in which we use logging to identify near-miss or elevated-risk scenarios and then bolster our dataset in those areas.
In general, the synthetic dataset should cover the real distribution ... but you may want it to be larger than the real distribution and to focus more on edge cases which may not occur all that often but which either simplify things for your requirements specification or provide extra safety assurance.
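To make that sampling strategy concrete, here is a hedged sketch of a scene-parameter sampler: uniform over an illumination envelope and target orientation, with deliberate over-sampling of scenarios flagged as high risk from operational logging. All parameter names, ranges, and weights are invented for illustration.

```python
# Sketch of the sampling strategy above: uniform coverage of scene parameters
# from a requirements-style envelope, plus oversampling of high-risk scenarios.
import random

HIGH_RISK_SCENARIOS = ["low_sun_glare", "pedestrian_occluded", "wet_road_night"]

def sample_scene(rng, p_high_risk=0.3):
    """Draw one synthetic-scene specification."""
    scene = {
        # Uniform over the specified envelope, not over the (bimodal)
        # real-world distribution of light levels.
        "illuminance_lux": rng.uniform(10, 100_000),
        # Uniform over target orientation instead of the head-on/tail-on
        # bias seen in real driving data.
        "target_heading_deg": rng.uniform(0, 360),
        "scenario": "nominal",
    }
    # Oversample scenarios identified as high risk from operational logging.
    if rng.random() < p_high_risk:
        scene["scenario"] = rng.choice(HIGH_RISK_SCENARIOS)
    return scene

rng = random.Random(7)
for spec in (sample_scene(rng) for _ in range(5)):
    print(spec)
```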
Also, given that it's impossible to make synthetic data that's exactly photo-realistic, you also want enough variation in enough different directions to ensure that you can generalize over the synthetic-to-real gap.
Also, I'm not sure how much sense the concepts of mean and variance make in these very very high dimensional spaces.
In the physical sciences there are plenty of domains where accurate measurements are sparse. In a case close to home for me, it's measurements of water depth off coasts (accurate to centimeters on a grid with a spacing of meters). The places where you have measurements like that in the real world can be counted on one hand. But now you want to train an ML algorithm to guess water depth in environments all over the world, so you need your data to be representative of a bunch of possible cases that fall outside the real data. This differs slightly from the GP, who I think is talking about creating data that isn't represented in the real world at all but that would help an algorithm predict real-world data anyway. But they are fairly related topics.