Data Science and Machine Learning Automation: What to Know About the State of Automation in AI
In the last decade, there have been many developments in automating the work of building artificial intelligence (AI). In discussions about the future of AI, you may hear the terms data science automation and machine learning automation used interchangeably. In reality, they have distinct definitions: automated machine learning (known as AutoML) currently aims specifically at automating model building, but a data scientist’s work encompasses a broader range of tasks than that. At the simplest level, data scientists extract knowledge from data to solve real-world problems; machine learning is only one tool in their arsenal.
We’re seeing automation occur at each stage of the data science lifecycle, from data preprocessing all the way through deployment of solutions. AutoML certainly contributes valuable developments toward automation in this lifecycle, particularly in the modeling stage. In most cases, automation targets the most time-consuming, complex tasks to make them faster and easier. With these advances, data scientists have more time to do what they’re trained to do: use data insights to develop differentiating solutions for their organizations.
Automation in the Data Science Lifecycle
The data science lifecycle includes each of the tasks data scientists complete as part of solution development. For our purposes, we’ll look at the tasks a data scientist would complete for creation of an AI model. Each step of the cycle includes at least some level of automation—an unsurprising fact considering the time-intensive nature of several steps in the AI build process.
Assuming a data scientist already has a problem to solve, the first task is to collect and prepare the data. Generally, data preparation requires converting the data into the right format, identifying errors, and fixing anomalies. Currently, this step is partially automated. Data scientists can use simple heuristics or third-party data cleaning tools to clean up data. For example, a heuristic could specify that any numbers outside of a realistic range are automatically deleted. Data cleaning tools automatically clean schemas, perform statistical profiling, and complete other preparation steps as needed.
Why is data cleaning not yet fully automated? A key roadblock is the fact that data scientists often need to make subjective decisions about data. Also, a data set may include many edge cases; tools or heuristics may not accommodate those easily.
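To make the range heuristic concrete, here is a minimal Python sketch. The column (age), the bounds, and the function name are hypothetical choices for illustration only. Instead of deleting out-of-range values outright, the sketch sets them aside for review, since deciding what to do with them is often one of the subjective calls described above.

```python
from typing import Optional

# Hypothetical plausible range for an "age" column; the bounds are an assumption.
AGE_RANGE = (0, 120)

def apply_range_heuristic(values: list[Optional[float]], low: float, high: float):
    """Split values into (kept, flagged) using a simple range check."""
    kept, flagged = [], []
    for value in values:
        if value is None:
            # Edge case: a missing value is flagged for review, not silently dropped.
            flagged.append(value)
        elif low <= value <= high:
            kept.append(value)
        else:
            flagged.append(value)
    return kept, flagged

raw_ages = [34, 29, -5, 41, None, 230, 57]
kept, flagged = apply_range_heuristic(raw_ages, *AGE_RANGE)
print(kept)     # [34, 29, 41, 57]
print(flagged)  # [-5, None, 230]
```

Whether a flagged value is a typo to correct, a unit mismatch, or a genuine outlier is exactly the kind of decision a rule like this cannot make on its own.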
The next step in the data science lifecycle is data exploration. At this stage, data scientists use visualization tools to obtain an overview of the data. Like the first step, this stage can only be partially automated. Data scientists can automate the creation of graphics, but analyzing those graphics still requires their expertise.
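As a small illustration of automated graphic creation, the sketch below builds a text histogram from a list of integers. The function name and the sample ages are made up for this example, and real projects would typically use a plotting library instead. The point is that producing the chart is mechanical, while reading it is not.

```python
from collections import Counter

def text_histogram(values, bin_width, bar_char="#"):
    """Return one line per non-empty bin, e.g. '  20-29   | ## 2'.

    Assumes integer values so that bin edges print cleanly.
    """
    # Map each value to the lower edge of its bin and count occurrences.
    bins = Counter((value // bin_width) * bin_width for value in values)
    lines = []
    for start in sorted(bins):
        count = bins[start]
        end = start + bin_width - 1
        lines.append(f"{start:>4}-{end:<4} | {bar_char * count} {count}")
    return lines

ages = [23, 25, 31, 35, 36, 38, 42, 44, 47, 61]
for line in text_histogram(ages, bin_width=10):
    print(line)
```

The code can show that the 50 to 59 bin is empty, but deciding whether that gap reflects the real population or a data collection problem still takes a person who knows the domain.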
Feature engineering is gradually becoming a part of AutoML’s purview and will likely be the next area of opportunity for further automation in ML. Feature engineering is the creation of new input variables, relevant to the problem you’re trying to solve, from existing inputs. Done correctly, feature engineering improves model performance by drawing the model’s attention to important variables not explicitly present in the data.
With automation, tools can derive features from multiple tables, text, and geospatial and time-series data, among other sources. These tools quickly evaluate anywhere from hundreds to millions of candidate features and output the most relevant ones for your model. What has traditionally been a manual selection process for data scientists is becoming faster and more efficient with automation.
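The generate-then-rank idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular tool works: the only derived features are pairwise products, relevance is measured with absolute Pearson correlation, and the column names and data are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def generate_features(columns):
    """Keep the original columns and add every pairwise product."""
    features = dict(columns)
    for a, b in combinations(columns, 2):
        features[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return features

def rank_features(features, target, top_k=3):
    """Score each feature by |correlation with target| and keep the top_k."""
    scored = [(name, abs(pearson(vals, target))) for name, vals in features.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy data: the target depends on area (width * length), which is not a column.
columns = {
    "width": [3, 4, 5, 6, 8, 10],
    "length": [10, 5, 8, 4, 9, 6],
    "age": [30, 12, 25, 8, 40, 15],
}
target = [w * l * 100 for w, l in zip(columns["width"], columns["length"])]

ranked = rank_features(generate_features(columns), target)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Here the derived width*length feature ranks first because the target is an exact multiple of it, even though neither base column captures that relationship alone. Production tools search far larger spaces of transformations and use more robust relevance measures.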
Model building includes model selection, validation, and hyperparameter optimization (HPO). This is where AutoML really shines: full automation is available. AutoML tools can cycle through a variety of models for one set of input data, selecting the model that performs best. Tools can automatically tune the model to improve accuracy using hyperparameter optimization and repeated validation. Note that AutoML models still score well on accuracy and confidence metrics; quality isn’t sacrificed for efficiency.
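The loop of trying several models, tuning each one, and validating repeatedly can be sketched with nothing but the Python standard library. This is a deliberately tiny illustration, assuming a k-nearest-neighbors classifier, a majority-class baseline, an exhaustive grid over k, and interleaved cross-validation folds; all names and data are hypothetical, and real AutoML tools use far more sophisticated search strategies.

```python
from collections import Counter

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training rows."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], point))
    nearest = sorted(train, key=sq_dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def majority_predict(train, point, _param=None):
    """Baseline model: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def cross_val_accuracy(predict, data, param, folds=3):
    """Mean accuracy over `folds` interleaved train/validation splits."""
    scores = []
    for i in range(folds):
        valid = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        hits = sum(predict(train, x, param) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / folds

def auto_select(data, search_space):
    """Evaluate every (model, hyperparameter) pair; return the best by CV accuracy."""
    results = []
    for name, (predict, params) in search_space.items():
        for param in params:
            results.append((cross_val_accuracy(predict, data, param), name, param))
    return max(results, key=lambda result: result[0])

# Toy data: two well-separated clusters labeled "a" and "b".
data = [
    ((1.0, 1.2), "a"), ((5.1, 4.9), "b"), ((0.8, 0.9), "a"), ((4.8, 5.2), "b"),
    ((1.3, 0.7), "a"), ((5.3, 5.1), "b"), ((0.9, 1.4), "a"), ((4.7, 4.6), "b"),
    ((1.1, 1.0), "a"), ((5.0, 5.4), "b"), ((1.4, 1.1), "a"), ((5.2, 4.8), "b"),
]
search_space = {
    "knn": (knn_predict, [1, 3, 5]),        # hyperparameter grid over k
    "majority": (majority_predict, [None]),  # baseline has nothing to tune
}
best_score, best_model, best_param = auto_select(data, search_space)
print(best_model, best_param, best_score)
```

On data this clean, every k-nearest-neighbors setting beats the baseline, so the search simply returns the first top-scoring configuration. The structure is what matters: one search space, one validation routine, and no manual comparison between candidates.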
For more information on automated model building, see our article on everything you need to know about AutoML.
The data science lifecycle doesn’t end at deployment. Every AI model requires continuous maintenance while in production, so setting up a retraining pipeline is paramount for success. In this area, we’re seeing the emergence of automated tools that run regular maintenance checks on models, verifying that they still meet accuracy and confidence thresholds. While it’s still helpful to keep a human in the loop at this stage, automation replaces an otherwise fully manual process and speeds up issue resolution.
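A maintenance check of this kind can be as simple as comparing recent performance against thresholds. The sketch below is a minimal illustration: the threshold values, the tuple layout, and the names HealthReport and check_model_health are assumptions made for this example, not the interface of any specific monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class HealthReport:
    accuracy: float
    mean_confidence: float
    needs_retraining: bool

def check_model_health(predictions, min_accuracy=0.90, min_confidence=0.70):
    """Evaluate a recent batch of labeled predictions against thresholds.

    predictions: list of (predicted_label, true_label, confidence) tuples.
    The default thresholds are illustrative, not recommendations.
    """
    if not predictions:
        raise ValueError("need at least one labeled prediction to evaluate")
    n = len(predictions)
    accuracy = sum(pred == true for pred, true, _ in predictions) / n
    mean_confidence = sum(conf for _, _, conf in predictions) / n
    needs_retraining = accuracy < min_accuracy or mean_confidence < min_confidence
    return HealthReport(accuracy, mean_confidence, needs_retraining)

# Hypothetical batch of production predictions that later received true labels.
recent_batch = [
    ("cat", "cat", 0.95),
    ("dog", "dog", 0.88),
    ("cat", "dog", 0.61),
    ("dog", "dog", 0.92),
]
report = check_model_health(recent_batch)
print(report)
if report.needs_retraining:
    print("Below threshold: flag for human review and retraining.")
```

Running a check like this on a schedule turns a manual audit into an alert, while the decision about what to do next can stay with a person.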
An Example of Automation in Action: GitHub Copilot
The recently launched GitHub Copilot is a real-life example of automation in AI. Powered by OpenAI Codex, it is an AI pair programmer that assists engineers with writing code. Using contextual clues from the code you’re developing, GitHub Copilot suggests lines or whole functions as you type. The goal is to help you work faster and more easily by offering alternative solutions and test cases. GitHub Copilot is simply one of the latest exciting applications of machine learning automation for greater efficiency in AI and engineering.
The Future of Automation in AI
When we look at the future of AI, what can data science automation and AutoML tell us? For one, they tell us that building AI is challenging, but it’s getting easier. The demand for automation no doubt stems from the fact that launching an AI solution is resource-intensive, requiring a significant investment of time, money, and expertise that’s often prohibitive for smaller organizations. With the advent of automation tools, these barriers to entry will fall, allowing more participants in the space to experiment and innovate.
With the evolution of AI and AutoML, one fact remains: the need for high-quality training data continues to grow. AI practitioners will require more and more data to improve and prune their machine learning models, as well as to maintain their performance in production. Seeking assistance from an external data provider can equip teams with the right tools, expertise, and processes to create scalable data pipelines for long-term AI goals. As the most advanced AI-assisted data platform available, Appen’s solution is the most reliable source for obtaining sufficient high-quality data to meet these growing needs.
And what about data scientists? Will machines eliminate the need for their role? It’s unlikely. Data scientists have highly specialized domain knowledge that machines can’t match. Defining and understanding problems and making assumptions about data are tasks that require subjective expertise. As we’ve seen with software engineering, demand for engineers only went up as the work became easier; data science will likely be no exception.