What is Synthetic Data?
Synthetic data is data that is artificially generated rather than captured from real life, and it grew out of machine learning’s need for training data. Originally, training data had to be collected for every possible scenario in order to train AI models accurately. If a scenario had not occurred or had not been captured, there was no data, leaving a large gap in the model’s ability to understand that scenario. Synthetic data is created by a computer program to fill those gaps. A wider variety of datasets in turn makes it possible to train a wider variety of models for products and services across multiple industries.
While the concept of synthetic data seems fairly new, it has been around for quite a while. The term is said to have been coined by Donald Rubin in a 1993 article titled “Discussion: Statistical Disclosure Limitation,” published in the Journal of Official Statistics. The article focused on data privacy and stated, “the proposal offered here is to release no actual microdata but only synthetic microdata, constructed using multiple imputations so that they can be validly analyzed using standard statistical software.” The end result was a dataset that did not contain any real-world records, which continues to be a key benefit of synthetic data today.
Demand for synthetic data spans multiple industries and has been driven (pun intended) in particular by the autonomous vehicle industry. That industry demonstrated the benefits of synthetic data, which has since spread across industries that use computer vision, such as drones, security cameras, retail, and consumer electronics.
How Does Synthetic Data Help AI?
As the demand for AI training data has grown, so has the demand for synthetic data, which helps businesses obtain reliable training data to improve their products and services. Real-world data has limitations: it is based on scenarios that have already happened, and it contains personally identifiable information (PII). PII can be removed from data before it is used for training, but it is much harder to orchestrate specific scenarios in the real world so that they can be captured for training. These scenarios, also known as edge cases, are what really set synthetic data apart from human-collected data.
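To make the point about edge cases concrete, here is a minimal Python sketch. The scenario names and every weight in it are hypothetical and chosen only for illustration; it simply shows that a generator lets you decide how often a rare scenario appears, whereas collected data is limited to how often the scenario happened to occur.

```python
import random
from collections import Counter

# Hypothetical scenario labels, for illustration only.
SCENARIOS = ["clear_day", "rain", "pedestrian_crossing_at_night"]
RARE = "pedestrian_crossing_at_night"


def sample_scenarios(weights, n, seed=0):
    """Draw n scenario labels with the given relative weights."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return Counter(rng.choices(SCENARIOS, weights=weights, k=n))


# Assumed real-world frequencies: the edge case is seen about 1% of the time.
collected = sample_scenarios([0.70, 0.29, 0.01], n=1000)

# In a synthetic dataset the mix is a design choice, so the edge case
# can be requested as often as the model needs to see it.
synthetic = sample_scenarios([0.34, 0.33, 0.33], n=1000)

print("edge cases in collected data:", collected[RARE])
print("edge cases in synthetic data:", synthetic[RARE])
```

The synthetic mix here is deliberately unrealistic, which is one reason synthetic data is normally combined with real-world data rather than used in its place.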
How Can Synthetic Data Help You?
The main benefits of using synthetic data are:
Increased speed of data collection
Data that is free from PII
Access to data for rare events (edge cases)
Advanced and accurate annotations
All of these are great reasons to leverage synthetic data, but it’s also important to recognize that humans still have a role in the Data for the AI Lifecycle. Real-world data is used in combination with synthetic data to ensure the model is functioning properly. Real-world data also contains outliers that synthetic data does not naturally account for. While you can program your synthetic data to account for certain scenarios or edge cases, it will not contain those natural outliers.
Synthetic data will always need human-generated data to succeed. Human-generated data is the starting point for the computer program that generates the synthetic data, so it needs to be of high quality if the generated data is to be of the same caliber. Once the synthetic data is created, it goes through quality control, in which it is tested against high-quality human-annotated data to catch mistakes. This partnership comes with two additional benefits: lower cost and time savings. Because a portion of the data is generated by a computer, you can increase your sample size with lower-cost data that uses fewer resources, which allows companies to invest in further research. The time savings come from completing the human-annotated portion of the data more quickly.
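The workflow described above (start from human-generated data, generate synthetic data from it, then test the result against human data) can be sketched in a few lines of Python. This is a deliberately simple illustration and not how any particular platform works: it assumes a single numeric feature that is roughly normally distributed, the seed values are made up, and the function names and the 0.8 threshold are hypothetical.

```python
from statistics import NormalDist

# Hypothetical human-collected seed measurements (illustrative values only).
seed_values = [162.0, 170.5, 158.2, 175.1, 168.4, 181.0, 165.3, 172.8]


def fit_generator(values):
    """Fit a normal distribution to the human-generated starting data."""
    return NormalDist.from_samples(values)


def generate_synthetic(model, n, seed=0):
    """Draw n synthetic values from the fitted model."""
    return model.samples(n, seed=seed)


def passes_quality_check(synthetic, reference, min_overlap=0.8):
    """Compare the synthetic and reference distributions.

    Uses the overlap coefficient of the two fitted normal distributions
    (0.0 = no overlap, 1.0 = identical). The threshold is an assumption.
    """
    synthetic_model = NormalDist.from_samples(synthetic)
    reference_model = NormalDist.from_samples(reference)
    return synthetic_model.overlap(reference_model) >= min_overlap


model = fit_generator(seed_values)
synthetic_values = generate_synthetic(model, n=1000)

print("synthetic sample size:", len(synthetic_values))
print("passes check:", passes_quality_check(synthetic_values, seed_values))
```

Because the generator is fitted to the seed data, any error or bias in that data carries straight into the synthetic output, which is why the quality of the human-generated starting point matters so much.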
Also worth noting is that datasets can be more inclusive. Synthetic data can be generated from a neutral, unified viewpoint, one that can be free from biased opinions and other influential factors, and it can include appropriate diversity. PII is also less of a concern, as all synthetic data contains simulated values.
A less obvious but very important benefit is safety. Aside from protecting privacy, which keeps people’s identities safe, the edge cases that are generated can also improve safety. Those scenarios can, for example, help smart cars improve their driving and parking abilities without the need for a driver, which means fewer accidents during on-road testing each year. Banks can test against mock fraud programs to check that their security settings protect against potential attacks, bringing peace of mind to clients and customers.
The Future is Bright – Synthetic AI Predictions
While the current usage of synthetic data is low, Gartner predicts that it will grow to be more prevalent by 2030. Currently, synthetic data accounts for only 1% of all market data, and by 2025 it will account for approximately 10%. This increase will expand the use cases for AI applications and, in turn, increase jobs in the AI industry. By 2027 the data market segment is forecast to grow to $1.15B, which represents a 48% compound annual growth rate (CAGR). These forecasts are why we included the rise of synthetic data as part of our “Top 5 Predictions About the Future of AI and Data.”
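For readers who want to see what a 48% CAGR implies, the short Python sketch below applies the standard compound growth formula, end value = start value × (1 + rate) ^ years. The forecast above does not state its base year, so the five-year window used here is purely an assumption, and the function name is hypothetical.

```python
def implied_start_value(end_value, cagr, years):
    """Back out the starting value implied by an end value and a CAGR."""
    return end_value / (1.0 + cagr) ** years


END_VALUE_USD_BN = 1.15  # 2027 forecast quoted above, in billions of dollars
CAGR = 0.48              # 48% compound annual growth rate quoted above
ASSUMED_YEARS = 5        # assumption: the forecast window is not stated

start = implied_start_value(END_VALUE_USD_BN, CAGR, ASSUMED_YEARS)
print(f"implied starting market size: ${start:.3f}B")

# Growing the implied starting value forward recovers the forecast.
value = start
for year in range(1, ASSUMED_YEARS + 1):
    value *= 1.0 + CAGR
    print(f"after year {year}: ${value:.3f}B")
```

Changing `ASSUMED_YEARS` shows how sensitive the implied starting point is to the length of the forecast window.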
As mentioned earlier, the main markets for synthetic data today are those that use AI-enhanced computer vision. As synthetic data becomes more prevalent, we’ll see use cases expand into industries such as finance for fraud protection, healthcare for diagnostic models, and marketing to ensure the right customers are targeted with the right message or products.
Because we believe synthetic data will play a large role in the future, we’ve partnered with Mindtech to bring its synthetic data capabilities to our clients. Mindtech is the developer of the world’s leading end-to-end synthetic data creation platform for the training of AI vision systems, which is accomplished through the creation of accurate neural networks.
Mindtech’s proprietary platform, Chameleon, both creates and curates data by building a virtual world that reflects the real world structurally and statistically. An intuitive UI is used to place assets (actors, vehicles, items) into this virtual world, and actions are defined to create the required scenarios. Behavior-led simulations are then run to produce data for the desired scenarios, including hard-to-come-by edge cases. Chameleon’s simulations focus on both human-to-human and human-to-world interactions. Some of the focus markets that benefit from this are retail, smart homes, and smart cities.
“We’re excited about this strategic partnership with Appen,” shared Steve Harris, CEO at Mindtech. “By working in partnership, we’ll accelerate the development of AI systems that better understand how humans interact with each other and the world around them.”
To read more about our partnership with Mindtech, click here.