Corona is Making Training Data-Gathering Harder: Key Industries and Challenges

COVID-19 has affected every corner of the globe, taking lives, destroying livelihoods, and shaping our world for the foreseeable future. Passenger traffic at airports has plummeted and, with 81% of the global workforce living in countries with mandatory or recommended workplace closures, we have all seen the striking images of empty, deserted cities around the world. This seismic shift in human activity is making manually collected training data slower and harder to acquire, and less representative than ever before. In this post, we’ll explore a few of the industries affected by these changes.

Autonomous vehicles

Autonomous vehicle companies rely on large teams of drivers and technicians to test vehicles and collect training data. The pandemic has hit these operations in two ways. First, stay-at-home orders and the related logistical challenges have made staffing difficult, forcing companies to lay off significant portions of their teams. They are simply unable to get their teams on the road to collect data. Second, empty streets, with fewer cars and pedestrians, make driving data less representative and more biased. It has become nearly impossible for these companies to capture data at “normal” (aka pre-Corona) traffic volumes. Some of these companies have refocused on detailed data labeling, 3D mapping, and identifying overlooked scenarios from previous road sessions that can be used to train their systems. Out of necessity, they have shifted to more innovative approaches and long-term experimentation.


While car companies are struggling on the ground, drone companies are struggling in the air. Smart traffic analysis based on drone footage is becoming a reality in many cities and states. But in today’s world, with unprecedented drops in traffic volume, collecting training datasets with enough variance is a nearly impossible task. Drone applications in other industrial settings, such as shipping, warehousing, and airports, are also hindered by the broader economic slowdown. When activity in these outdoor environments grinds to a halt, companies have a hard time collecting unbiased data. With economists predicting a slow economic recovery, these challenges are likely to persist.

Human-Environment and Human-Human Interactions

The new reality of lockdowns has made life hard for companies that generate datasets by “crowd acting”. With this method, teams generate data by instructing workers, customers, models, or volunteers to record short video clips based on defined descriptions. But asking them to do crowd acting outside their homes has become untenable and, in some cases, downright dangerous. Additionally, crowd acting of face-to-face interactions may be impossible while complying with social distancing guidelines.

Smart Stores

In February, Amazon opened its first Amazon Go Supermarket in Seattle using the retailer’s cashierless “Just Walk Out” technology, which previously powered 25 Amazon Go convenience stores in a handful of major U.S. metros. Expanding this type of technology requires sophisticated computer vision systems that can interpret complex human-object interactions: grocery shoppers may examine produce carefully only to put it back, or leave items on shelves or in aisles where they don’t belong. For the technology to work optimally, huge amounts of training data with myriad variants and edge cases are needed. But with online grocery shopping skyrocketing and fewer shoppers in stores, those variants and edge cases will be harder to capture.

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022


VR/AR is an exciting field, and user trials have been one of the most important stages in developing new hardware. In a COVID-19 world, user trials have suddenly become complicated affairs. Having dozens of people don the same headset in succession has become truly unimaginable. Recruiting and scanning hundreds of people on premises is out of the question.

Age Bias

COVID-19 has disproportionately affected certain segments of society. Across the world, the elderly have been significantly more at risk and, in some countries, have been ordered to minimize leaving their homes. Even in countries without government orders limiting movement for seniors, many public health organizations have recommended that they stay at home if possible. This raises the risk that datasets collected manually during the Corona period will have significant age biases.
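One rough way to check a collected dataset for the kind of age bias described above is to compare its age distribution with a reference population. The sketch below is illustrative only: the bucket boundaries, the `REFERENCE_SHARES` values, and the function names are all assumptions made for this example, and real reference shares would need to come from demographic data for the region where the system will be deployed.

```python
from collections import Counter

# Hypothetical reference shares per age bucket (illustrative numbers only;
# real values would come from census data for the deployment region).
REFERENCE_SHARES = {"0-17": 0.22, "18-39": 0.30, "40-64": 0.32, "65+": 0.16}


def age_bucket(age):
    """Map an age in years to one of the illustrative buckets above."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"


def representation_gaps(ages, reference=REFERENCE_SHARES):
    """Return (dataset share - reference share) for each age bucket.

    Negative values mean the bucket is under-represented in the dataset.
    """
    if not ages:
        raise ValueError("ages must not be empty")
    counts = Counter(age_bucket(a) for a in ages)
    total = len(ages)
    return {
        bucket: counts.get(bucket, 0) / total - share
        for bucket, share in reference.items()
    }


def underrepresented(ages, tolerance=0.05, reference=REFERENCE_SHARES):
    """List buckets whose share falls short of the reference by more than `tolerance`."""
    gaps = representation_gaps(ages, reference)
    return sorted(b for b, gap in gaps.items() if gap < -tolerance)


if __name__ == "__main__":
    # Toy sample: seniors make up only 2% of the recorded subjects.
    sample_ages = [10] * 22 + [25] * 40 + [50] * 36 + [70] * 2
    print(underrepresented(sample_ages))
```

A check like this only flags a gap; deciding what tolerance is acceptable, and how to fill the gap, depends on the application.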


Masks will be with us for a while, but many facial recognition systems have not been trained to recognize masked faces. So new training data must be collected, a process that is difficult for the reasons mentioned above. But what happens when mask use decreases? Systems trained on data collected during Corona, which is likely to show a huge number of people wearing masks, may not perform well on mask-less faces.
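One simple way to surface this kind of gap is to report accuracy separately for masked and unmasked faces instead of as a single overall number. The following is a minimal sketch under stated assumptions: the record layout `(group, predicted, actual)` and the group labels `"masked"` and `"unmasked"` are hypothetical choices for illustration, not part of any particular facial recognition system.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group.

    `records` is an iterable of (group, predicted, actual) tuples, where
    `group` is a label such as "masked" or "unmasked" (hypothetical labels).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


if __name__ == "__main__":
    # Toy evaluation records: identities are arbitrary strings.
    records = [
        ("masked", "alice", "alice"),
        ("masked", "bob", "bob"),
        ("masked", "carol", "carol"),
        ("masked", "dave", "erin"),
        ("unmasked", "alice", "alice"),
        ("unmasked", "bob", "carol"),
    ]
    for group, acc in sorted(accuracy_by_group(records).items()):
        print(f"{group}: {acc:.2f}")
```

A large difference between the two numbers suggests that the training data over-represents one condition.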


In our follow-on post about the impacts of Corona on data-collection, we’ll dig deeper into the field of robotics and how some teams are combating the challenges we’ve discussed here.

A Path Forward

At its core, Corona is creating two intertwined challenges. First, it is making it hard to collect new data. Second, it is increasing the likelihood that any new data will be biased in a number of ways. The current difficulty of acquiring data in the wild offers a significant opportunity for simulated and synthetic data to fuel advances in computer vision.

Synthetic data allows new datasets to be generated without compromising the health and safety of workers or navigating the logistical hurdles erected by Corona. And because it is customizable, datasets can be built without “Corona biases”. As the challenges posed by the COVID-19 era continue to reshape a range of industries, simulated visual data can help accelerate breakthroughs in computer vision and make sure that this virus does not stop training data collection in its tracks.
