I recently heard Anjney Midha, General Partner at Andreessen Horowitz, talking about synthetic data. I could sense that synthetic data has to do with AI, but not much beyond that. So what does any self-respecting product person do in such a situation? Ask ChatGPT, of course!
My main takeaway from reading ChatGPT’s answer is that synthetic data doesn’t contain actual, personally identifiable records. It is created artificially (via algorithms or generative AI) instead of being collected directly from real-world events or behaviours. In practice, an algorithm analyses information from a data source and detects its structures and patterns. That analysis is then used to create a new dataset that mimics the main characteristics of the original data.
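To make that idea concrete, here is a minimal Python sketch of the analyse-then-generate loop described above. It is an illustration under simplifying assumptions, not how any particular tool works: the sample records, the function names `fit_profile` and `generate_synthetic`, and the choice of a per-column normal distribution are all my own. Real generators also model relationships between columns, which this sketch ignores.

```python
import random
import statistics


def fit_profile(rows):
    """Analyse the original rows: learn each column's mean and standard deviation."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def generate_synthetic(profile, n_rows, seed=0):
    """Create new rows by sampling from the learned profile (assumed normal per column)."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in profile)
        for _ in range(n_rows)
    ]


# Hypothetical "real" records: (age, yearly spend). The values are made up.
real_rows = [(34, 1200.0), (45, 1850.0), (29, 900.0), (52, 2100.0), (41, 1600.0)]

profile = fit_profile(real_rows)
synthetic_rows = generate_synthetic(profile, n_rows=1000)
```

Each synthetic row is sampled from the learned profile rather than copied from an original record, so the overall averages and spread resemble the source data while the individual rows are new.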
These are the main benefits of using synthetic data as an input to AI modelling over real or “de-identified” data:
- Protecting privacy — Because synthetic data is generated from patterns rather than copied from records, the patterns and properties of the original data are preserved whilst the private and sensitive data itself is masked.
- Reduced cost — Creating synthetic data is often cheaper than traditional data collection and curation. Collecting and curating real-world data can be expensive because it involves finding the right sources and building data collection mechanisms. Because synthetic data is generated through algorithms, the level of cost and consideration required can be much lower.
- Increased speed — Especially when there’s a need for large volumes of data to train machine learning models, generating synthetic data can save a lot of time.
- Experimentation — Using synthetic data makes it easier for researchers to generate new datasets and test how different data models respond to a variety of scenarios.
At the same time, there are several important downsides to synthetic data that we need to consider:
- Need for real-world data — Synthetic data doesn’t remove the value of real-world data. Real-world data remains critical for validating machine learning models and testing their effectiveness in real-world environments.
- Increased inaccuracy — If the accuracy of the underlying algorithms is lacking, the synthetic data can’t be trusted to form an accurate representation of reality.
- Biased data — Worse, you can end up in a situation where it’s hard to even detect the inaccuracy of the data generated. Synthetic data can be misleading or prone to bias when it lacks the variability of the real-world data it is meant to represent.
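One way to catch the lack of variability mentioned above is to compare simple summary statistics of the synthetic data against the real data it was derived from. The sketch below is only a starting point under my own assumptions: the helper names `variability_ratio` and `missing_categories`, the sample values, and the 0.8 threshold are hypothetical and not taken from any of the sources.

```python
import statistics


def variability_ratio(real_values, synthetic_values):
    """Spread of the synthetic values relative to the real ones (1.0 means equal spread)."""
    return statistics.stdev(synthetic_values) / statistics.stdev(real_values)


def missing_categories(real_labels, synthetic_labels):
    """Categories that occur in the real data but never in the synthetic data."""
    return set(real_labels) - set(synthetic_labels)


# Made-up example: the synthetic ages cluster around the middle of the real range.
real_ages = [22, 35, 47, 58, 63, 71]
synthetic_ages = [40, 42, 45, 47, 49, 51]

ratio = variability_ratio(real_ages, synthetic_ages)
needs_review = ratio < 0.8  # hypothetical threshold, tune it for your own data

# Made-up example: a rare category that the generator never produced.
dropped = missing_categories(["web", "app", "store"], ["web", "app", "web"])
```

A low ratio or a non-empty set of missing categories doesn’t prove the synthetic data is biased, but it is a cheap signal that rare cases may have been lost and that a closer look against real-world data is needed.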
Finally, there’s the question of the use cases for synthetic data:
- Recognising patterns at scale — Researchers can use synthetic data to better detect and understand certain patterns at scale. Amazon, for example, uses synthetic data to train its Alexa virtual assistant and improve its multi-speech recognition capabilities.
- Limited access to real-world data — When access to real-world data is limited, sensitive or expensive, synthetic data can be used to simulate real-world scenarios.
- Augmenting a self-play system — Self-play agents are AI systems that learn and improve by playing against themselves. AlphaGo, by Google DeepMind, is a well-known example of a self-play agent. Synthetic data can be used to build or augment the training data for such self-play systems.
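To illustrate how a self-play loop produces its own training data, here is a toy Python sketch. It bears no resemblance to how AlphaGo works internally: the stick game, the random policy and the names `self_play_game` and `generate_dataset` are assumptions I picked to keep the example small. Two copies of the same random player take 1 to 3 sticks in turn, and whoever takes the last stick wins.

```python
import random


def self_play_game(rng, start_sticks=10):
    """Play one game against itself and return labelled (sticks, move, won) examples."""
    sticks = start_sticks
    player = 0
    history = []  # (player, sticks before the move, move)
    while sticks > 0:
        move = rng.randint(1, min(3, sticks))
        history.append((player, sticks, move))
        sticks -= move
        player = 1 - player
    winner = history[-1][0]  # the player who took the last stick
    return [(s, m, 1 if p == winner else 0) for p, s, m in history]


def generate_dataset(n_games, seed=0):
    """Collect examples from many self-play games into one synthetic dataset."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_games):
        examples.extend(self_play_game(rng))
    return examples


dataset = generate_dataset(n_games=500)
```

Every example in `dataset` was generated by the system playing itself, with no human game records involved. A real self-play agent would then train on such examples and use the improved policy to play the next round of games.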
Main learning point: Synthetic data is generated by analysing real-world data, which provides access to large volumes of data without having to use the real-world records themselves.
Related links for further learning:
- https://www.pwc.com/gx/en/issues/technology/synthetic-data.html
- https://www.amazon.science/blog/tools-for-generating-synthetic-data-helped-bootstrap-alexas-new-language-releases
- https://kozyrkov.medium.com/the-pros-and-cons-of-synthetic-data-f44ebb4d9e98
- https://a16z.simplecast.com/episodes/the-quest-for-agi-q-self-play-and-synthetic-data-prSPQ4lm
- https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-and-how-can-it-help-you-competitively
- https://news.mit.edu/2020/real-promise-synthetic-data-1016
- https://www.reddit.com/r/datasets/comments/15p7974/what_advantages_or_disadvantages_does_synthetic/
- https://research.aimultiple.com/synthetic-data-vs-real-data/