So many aspects of modern life have become digitized that the amount of data generated every day across the globe is truly massive. We reached a significant landmark in 2012 when the total amount of digital data in our world reached one zettabyte, and another in 2016 when global IP traffic exceeded the same figure. And this trend is accelerating.
According to an article published by the World Economic Forum, the entirety of our digital universe is expected to reach 44 zettabytes by 2020. To put this into perspective, that number is 40 times more bytes than there are stars in the observable universe. With a growing need for storing and analyzing these staggering volumes of data – that come from a multitude of sources in countless formats – it’s not surprising that certain sectors of the IT world are facing some serious challenges.
Where is data generated?
More and more people across the globe are digitally connected every year, with more than one million people coming online for the first time each day since January 2018. Consider these statistics from a February 2019 article on Nextweb for additional perspective:
- There are 5.11 billion unique mobile users globally, up 100 million (2%) in the past year.
- There are 4.39 billion internet users in 2019, an increase of 366 million (9%) vs January 2018.
- There are 3.48 billion social media users in 2019, with the worldwide total growing by 288 million (9%) since this time last year.
- 3.26 billion people used social media on mobile devices as of January 2019, with growth of 297 million new users representing a year-on-year increase of more than 10%.
All of these internet users are generating incredible amounts of data: Transactional data from online purchases, mobile data, social media data, search engine data, and more. And don’t forget the ever-increasing amounts of data generated by IoT devices like cameras and sensors used in manufacturing as well as connected cars. For additional context, let’s break down daily data generation statistics from the same World Economic Forum article cited above:
- 500 million tweets are sent
- 294 billion emails are sent
- 4 petabytes of data are created on Facebook
- 4 terabytes of data are created from each connected car
- 65 billion messages are sent on WhatsApp
- 5 billion searches are made
- By 2025, it’s estimated that 463 exabytes of data will be created each day globally – the equivalent of 212,765,957 DVDs per day
How do we refer to all these different types of data?
We’ve discussed how much data is flying about on a daily and yearly basis, but let’s go deeper into what kind of data is out there. While there are scores of formats and classifications, here’s a brief overview of some data types you should know about.
Structured, unstructured, semi-structured data
All data falls into one of these categories. Delineating between structured and unstructured data comes down to whether the data has a pre-defined data model and whether it’s labeled and organized in a pre-defined way. Semi-structured data is data that hasn’t been organized into a repository such as a database, but nevertheless has accompanying information, such as metadata, that makes it more amenable to processing than raw, unstructured data.
While structured data is preferable for big data analytics purposes, estimates from IDC suggest that 80 percent of all data generated globally will be unstructured by 2025. This is because much of this data, which can include text, photo, audio, and other file types, comes from outside the organization, from sources like social media and IoT smart devices. Unstructured data creates a unique challenge for organizations that wish to use it in their big data initiatives because it can’t easily and automatically be labeled and stored in a database. In short, unstructured data needs context. This becomes even more critical when you want to use your data to drive machine learning and artificial intelligence (AI) projects.
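A quick sketch can make the three categories concrete. The records below are invented for illustration: a structured row fits a pre-defined schema, a semi-structured record carries self-describing metadata, and unstructured text has neither and must be annotated before a machine can use it.

```python
import json

# Structured: rows that conform to a pre-defined data model (e.g. a database table).
structured = [
    {"user_id": 1, "amount": 19.99, "currency": "USD"},
    {"user_id": 2, "amount": 5.00, "currency": "EUR"},
]

# Semi-structured: no rigid schema, but accompanying metadata (here, JSON tags)
# makes it more amenable to processing than raw text.
semi_structured = json.loads(
    '{"post": "Loving the new phone!", "meta": {"lang": "en", "source": "social"}}'
)

# Unstructured: raw content with no model or labels at all --
# it needs added context (annotation) before analysis.
unstructured = "Loving the new phone! Battery lasts all day."

def needs_annotation(record):
    """A record with no schema or metadata must be labeled before analysis."""
    return not isinstance(record, (dict, list))
```

Here `needs_annotation(structured[0])` is `False` while `needs_annotation(unstructured)` is `True`: the schema-backed row is already machine-readable, but the raw text is not.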
Data and Artificial Intelligence
The sheer amount of data we’ve discussed requires companies to maximize their data flow capabilities so they can enable seamless data sharing and information collection for the endlessly increasing volumes of data they generate. This means smart investment in data collection, security, storage, and analysis tools, along with adoption of data management strategies nimble enough to adapt and scale with changing formats and operational needs. Organizations that want to use their data to train machine learning algorithms for AI projects face another challenge: making the data usable.
An October 2018 article in Forbes asserts that a data-first approach to AI projects is critical for success and that “Any application of AI and ML will only be as good as the quality of data collected.” At the beginning of such a project, most organizations come to the difficult realization that their data exists in several disparate formats stored across an array of siloed systems. Converting the data to a common format and importing it into a common system is a requirement before it can be used to train a machine learning algorithm.
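As a minimal sketch of what that conversion step looks like, the snippet below maps two hypothetical source formats (a CSV export and a JSON log; the field names and `to_common` helper are invented for illustration) onto one shared schema:

```python
import csv
import io
import json

# Hypothetical records in two disparate formats from siloed systems.
csv_export = "id,text\n1,Great service\n2,Slow shipping\n"
json_log = '[{"record_id": 3, "body": "Great service"}]'

def to_common(record_id, text, source):
    """Map any source record onto one shared schema before training."""
    return {"id": int(record_id), "text": text.strip(), "source": source}

rows = [to_common(r["id"], r["text"], "csv")
        for r in csv.DictReader(io.StringIO(csv_export))]
rows += [to_common(r["record_id"], r["body"], "json")
         for r in json.loads(json_log)]
```

Once every record shares the same fields, downstream labeling and training code only has to handle one shape of data, regardless of where it originated.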
Adopt a training data strategy that’s right for your project
Without a well-defined strategy for collecting and structuring the data you need to train, test, and tune your AI systems, you run the risk of delayed projects, an inability to scale appropriately, and ultimately, competitors outpacing you. Since better outcomes are more likely when your training data captures as much nuance as possible, many machine learning initiatives require large volumes of high-quality training data, delivered fast and at scale. To achieve this, you need to build a data pipeline that delivers sufficient volume at the speed needed to refresh your models. That’s why choosing the right data annotation technology is a key piece of your training data strategy.
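One minimal way to picture such a pipeline (the record stream and batch size here are hypothetical) is a generator that hands annotated records to retraining in fixed-size batches, so labeling and model refreshes can proceed continuously instead of waiting for the full dataset:

```python
from typing import Iterable, Iterator, List

def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size batches so annotation and retraining can overlap."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Hypothetical annotated records flowing in from a labeling tool.
stream = ({"id": i, "label": "pos" if i % 2 else "neg"} for i in range(10))
chunks = list(batches(stream, size=4))
```

With ten records and a batch size of four, the pipeline emits two full batches and one partial batch of two, illustrating how fresh labeled data can reach the model in steady increments.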
In our white paper, How to Develop a Training Data Strategy for Machine Learning, we discuss how to create a solid machine learning training data strategy, including budgeting, options for data sourcing, how to ensure data quality and security, and how outsourcing the collection and labeling of training data can help scale your AI initiatives.
Download our whitepaper to learn how to develop the right training data strategy for your project.
As an industry leader, Appen has the expertise and resources to help you quickly scale data annotation for a variety of data types, including text, audio, speech, image, and video in over 180 languages and dialects. Contact us to learn more.