August 30, 2021 - The term data-centric AI/ML has gained significant traction over the past few months. At its core, it involves a paradigm shift in the way AI systems are developed. In model-centric AI, you clean the data, hold it constant, and focus on finding the best algorithm to improve performance. It is a one-and-done approach to data. Kaggle competitions, where many practitioners have cut their teeth, are designed this way.
In data-centric AI, the focus is on systematically iterating on the data to improve its quality, and thereby performance; it is a continuous process, carried out not just at the start of a project but even after deployment to production.
What does iterating on data mean?
It involves taking a systematic approach to iteratively improving your data (inputs x and labels y). Methods we can implement today include making labeling conventions (the mapping from x to y) consistent, tossing out noisy examples in x, and using error analysis to focus improvement on a specific subset of the data. Andrew Ng has a very accessible lecture on this, with results from multiple projects showing the delta in performance (+16.9% from a data-centric approach vs. +0% from a model-centric approach in one case).
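To make one of these methods concrete, here is a minimal sketch of label consolidation: when several labelers annotate the same examples, unanimous labels are kept and disagreements are flagged for review under a consistent convention. The function name, input shape, and agreement threshold are all illustrative, not from any particular tool.

```python
from collections import Counter

def consolidate_labels(annotations, min_agreement=1.0):
    """annotations: dict mapping example_id -> list of labels from different labelers.
    Returns (clean, flagged): consensus labels vs. examples needing review.
    min_agreement=1.0 keeps only unanimous examples (an illustrative threshold)."""
    clean, flagged = {}, []
    for example_id, labels in annotations.items():
        top_label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            clean[example_id] = top_label   # labelers agree: keep consensus label
        else:
            flagged.append(example_id)      # labelers disagree: send back for relabeling
    return clean, flagged

clean, flagged = consolidate_labels({
    "img1": ["scratch", "scratch", "scratch"],  # unanimous
    "img2": ["scratch", "dent", "scratch"],     # inconsistent convention
})
```

Lowering `min_agreement` trades review effort for label volume; the point is that the decision is now systematic rather than left to each labeler.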
Different labelers could have different labeling conventions (and thus draw different bounding boxes) as seen in the diagram above. This alone can lead to inconsistent labels which will hurt the model’s performance.
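For bounding boxes specifically, one common way to quantify this kind of inconsistency is intersection-over-union (IoU) between two labelers' boxes for the same object; low IoU signals a conventions problem worth fixing. This is a generic sketch, not code from any labeling product.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max).
    Values near 1.0 mean the labelers agree; low values flag inconsistent boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A team could, for example, flag any pair of annotations with IoU below some agreed threshold (say 0.5) for a conventions discussion, turning "labelers disagree" into a measurable quantity.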
We have seen manifestations of a data-centric approach in industry, though it might not have been termed as such. During the height of the pandemic, when lockdowns were in force, some autonomous driving companies could not go out to collect new data (they deploy two safety drivers per car to prevent crashes due to a lapse by a single driver, and it would have been impossible to social distance). These companies repurposed their drivers as labelers: they revisited their existing datasets and worked on relabeling data to improve performance instead.
In general, the more experienced an ML engineer, the more likely they are to spot aberrations in data and iteratively make improvements. However, the thrust of this data-centric effort is to create principles, metrics (what constitutes high quality?), and tools that can make this process systematic rather than ad hoc, leading to more efficient and effective AI/ML development across industries.
Data-centric approach brings significant mileage for industry use cases with small labeled datasets
As AI proliferates across industries, a data-centric approach is particularly useful in use cases where the corpus of available labeled data is small. In contrast to internet companies that build large monolithic applications and can easily collect and label tons of data across billions of users, industries like healthcare, manufacturing, and agriculture often have to contend with very limited datasets, or large unlabeled datasets and limited domain experts. A data-centric approach is also most relevant when dealing with unstructured data (images, video, text, voice), which is common in the above-mentioned industries. (On the other hand, improved performance on structured data usually involves feature engineering, where outperformance depends on the ability to curate features well.)
In practice, we are unlikely to see a 180-degree shift away from innovation in algorithms. We will probably see greater complementarity: model-based approaches that relax the dependence on labeled data, combined with tools that enhance the quality of the small labeled dataset you have. For example, a typical manufacturing firm could have hundreds of potential defect detection use cases, each of which, if solved well, could result in millions of dollars in impact. However, these companies do not have enough ML/AI resources to curate each and every small dataset to ensure high quality. This is where methods like multi-task learning (having a model learn a more generalised representation across a set of related tasks) can be combined with data-centric approaches to achieve the desired performance.
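The multi-task idea above can be sketched in a few lines: a shared backbone learns a representation common to all related tasks (e.g., related defect types), while each task gets only a small head of its own, so each small labeled dataset has less to learn from scratch. This is a toy numpy forward pass under assumed dimensions, not a real training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8 input features, a 4-unit shared representation, 3 related tasks.
n_features, n_hidden, n_tasks = 8, 4, 3
W_shared = rng.normal(size=(n_features, n_hidden))              # shared backbone, trained on all tasks
task_heads = [rng.normal(size=(n_hidden, 1)) for _ in range(n_tasks)]  # one small head per task

def forward(x, task_id):
    """Shared ReLU representation, then a task-specific linear head."""
    h = np.maximum(0.0, x @ W_shared)   # representation shared across all tasks
    return h @ task_heads[task_id]      # only this part is task-specific

x = rng.normal(size=(1, n_features))
preds = [forward(x, t) for t in range(n_tasks)]
```

Because the backbone is shared, the per-task labeled data only has to fit a small head, which is exactly the regime where careful, data-centric curation of a few hundred examples can pay off.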
A new landscape of innovation
Over the past few years, about 99% of academic research in deep learning has been in model architecture and associated improvements. For many data scientists, pre-processing/cleaning and labeling data is also considered menial and mundane. The promulgation of a data-centric approach within the deep learning community aims to encourage more innovation, in academia and industry, in areas like data collection/generation, labeling, augmentation, data quality evaluation, data debt, and data governance. There will be a new NeurIPS track this year for datasets and benchmarks, focusing on what makes a good dataset, what the metrics for data quality should be, potential pitfalls, ethics, etc.
Some early, very promising innovations in this space include labeling companies Prodigy and Snorkel, data curation company Aquarium, and synthetic data generation company Synthetaic, among many others. AutoML itself is ripe for data-centric techniques. The state-of-the-art solutions in AutoML have some automated data cleaning capabilities (pre-processing and feature engineering aspects) but still do not deal well with noise, small datasets, and inadequate data coverage. In addition, despite the explosion in tooling startups in recent years, MLOps is still considered a nascent field. There is definitely room for tools that ensure data quality across the entire lifecycle, from collection to modification to tracking of data drift in production.
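To give a flavour of what "tracking data drift in production" can look like, here is a minimal sketch of the Population Stability Index (PSI), a simple statistic comparing a feature's training-time distribution against its live distribution. The function and the commonly cited 0.2 alert threshold are illustrative; production tools are considerably more sophisticated.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time feature distribution
    (expected) and its live distribution (actual). A common rule of thumb treats
    PSI > 0.2 as notable drift; treat that threshold as illustrative."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run nightly against each model input, a check like this turns silent distribution shift into an alert long before it shows up as degraded accuracy.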
When you combine the outlook for this space with the evolution of the enterprise data stack (itself a whole topic, with emerging solutions in reverse ETL, feature stores, metric stores, data observability, new data discovery platforms, etc.), they point to very exciting developments that will help us build a highly performant data stack, and by extension AI stack, in the months to come.