Laborers of the artificial intelligence industry: data annotators

CGTN 2020-08-26

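In today's fast-growing artificial intelligence industry, there is a group of people who are often forgotten, yet they lay the foundation on which algorithms learn. Their job is data annotation.

According to industry estimates, there are now 100,000 full-time data annotators, and as many as 1 million part-time ones. They classify vast amounts of data and draw boxes around objects, teaching algorithms to recognize them. The data they label is turned from raw data into high-quality data, which in turn drives the deep learning of algorithmic models.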

Twenty-four-year-old Liu Xueyan has never seen a self-driving car, but her work has helped to develop an artificial intelligence (AI) algorithm that could power autonomous driving.

At a data annotation base two hours away from central Beijing, Liu was marking objects in hundreds of images shown on her computer screen. She zoomed in, drew a square around the shape of a bus, and added a label "bus" to the specific zone, before moving on to mark sidewalks, pedestrians or traffic signs on the images.

Liu is among the thousands of young workers at the data annotation base operated by Testin, a Chinese tech company founded in 2011 that offers AI data collection and annotation services. Data processed by the young workers will power applications as diverse as autonomous driving, public security cameras, medical diagnosis and retail.

Liu marks an image of a road used for autonomous driving. /CGTN Photo

Contrary to what many believe, AI cannot learn on its own; it has to be taught. A large data set is needed to train an algorithm to find patterns and then draw conclusions in future scenarios. But machines cannot recognize raw data. Scientists need clean, annotated data to train machines to learn.

"We can think of annotated data as the textbook for the machines. If content in the textbook is bad, the algorithm that is developed will have low accuracy," said Xu Kun, president of Testin, in an interview with CGTN. An algorithm with low accuracy may pose security risks, for example by making it easier for others to falsify an identity in facial recognition applications, he added.

Given the widespread application of AI across industries, the quality requirement for data annotation is on the rise – most industries now require data annotation to achieve a 99.9-percent accuracy rate. This means a left eye cannot be identified as a right eye in an image used for facial recognition, and a liver cannot be categorized as a lung in a CT scan image.
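To make the 99.9-percent figure concrete, here is a minimal sketch of how an annotation accuracy rate could be computed by comparing a worker's labels against a reviewed reference set. It is an illustration only, not Testin's actual quality-control process; the function name, the label names and the sample batch are all hypothetical.

```python
def annotation_accuracy(submitted, reference):
    """Return the fraction of submitted labels that match the reviewed reference labels."""
    if len(submitted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("no labels to compare")
    correct = sum(1 for s, r in zip(submitted, reference) if s == r)
    return correct / len(reference)


# Hypothetical batch: 3,000 labels, three of them wrong
# (a left eye tagged as a right eye).
reference = ["left_eye"] * 3000
submitted = ["right_eye"] * 3 + ["left_eye"] * 2997

rate = annotation_accuracy(submitted, reference)
print(f"{rate:.1%}")  # 99.9%
meets_target = rate >= 0.999
```

Under these assumptions, a batch of 3,000 annotations can contain no more than three mistakes and still meet a 99.9-percent target.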

Liu teaches her colleagues to do 3D Point Cloud labeling. /CGTN Photo

This is having wider implications for an industry traditionally populated by small data annotation farms in remote and impoverished areas of China. Employing mostly low-wage workers with little educational background and minimal job training, those data annotation farms operate like assembly lines of the digital age.

But since AI companies now demand highly accurate data annotation, more professional service providers that have a reliable workforce are popping up across China.

At Testin, one of the companies dedicated to providing professional data annotation services, training can last for weeks. While general projects such as facial recognition and natural language processing require data annotation engineers to have graduated from secondary school, highly specialized projects, such as those in the insurance, finance and medical industries, require a college degree.

AI software developers of Testin in Beijing. /CGTN Photo

The first time Liu took on a data annotation project more than one year ago, it took her only three days to master basic tagging. All she needed to do was draw circles and tag objects. It's a repetitive task that has a low skill threshold, she recalled.

Her next project, tagging objects in road scenes, was more challenging. It required her to differentiate double yellow lines from dotted white lines so that a self-driving car knows when to make a turn. She also needed to accurately tag people on foot, bicycles, motorcycles and electric scooters so that the autonomous driving software knows how to respond when it encounters them in real life.

"What we did matters a lot to the application of AI software," said Liu. "If an object is tagged wrongly, it might cause a traffic accident."

Liu tags a car in an image. /CGTN Photo

Workload varies with the nature of the project. For a simple AI tagging project, a worker is required to draw around 3,000 circles a day. For a road scene tagging project, the figure is around 2,600. For the more complicated task of labeling 3D point cloud models, the number of images processed each day is much lower.

For Liu and most of her colleagues who are in their 20s, the data labeling job is a satisfactory one, at least for now. She follows a 9 to 6 work schedule, enjoys her weekends off – unless there are urgent tasks – and has a salary ranging from 3,500 yuan (507 U.S. dollars) to 6,000 yuan (869 U.S. dollars) depending on her experience and work performance.

Despite the sometimes repetitive nature of the job, AI is far from taking over the industry, according to Xu. AI in China is still in its infancy, but demand for AI applications that increase efficiency and reduce costs will soar in the near future, and demand for data annotation will skyrocket, he said.

Lunch time for Liu and her colleagues. /CGTN Photo

But there are signs that performance can be improved by having humans and machines work together. Scale AI, a San Francisco-based data labeling firm, pioneered a model in which algorithms do the labeling first and data annotation engineers then perform a final check on the results.
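A workflow of this kind, in which an algorithm proposes labels and a person confirms or corrects them, can be sketched roughly as follows. This is an illustration only, not Scale AI's or Testin's actual pipeline: the `PreLabel` type, the `review_all` function, the confidence scores, the lowest-confidence-first ordering and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreLabel:
    image_id: str
    label: str         # label proposed by the algorithm
    confidence: float  # model's confidence in its proposal, 0.0 to 1.0


def review_all(pre_labels: List[PreLabel],
               human_check: Callable[[PreLabel], str]) -> Dict[str, str]:
    """Have a human confirm or correct every machine-proposed label.

    Items are queued lowest-confidence first (an assumption of this sketch),
    so the proposals most likely to be wrong are looked at earliest.
    """
    final = {}
    for item in sorted(pre_labels, key=lambda p: p.confidence):
        final[item.image_id] = human_check(item)
    return final


# Hypothetical data: the model mistakes an electric scooter for a bicycle.
proposals = [
    PreLabel("img_001", "bus", 0.98),
    PreLabel("img_002", "bicycle", 0.55),
    PreLabel("img_003", "pedestrian", 0.91),
]

# Stand-in for the annotator: a lookup of what a person would actually see.
seen_by_human = {
    "img_001": "bus",
    "img_002": "electric scooter",
    "img_003": "pedestrian",
}
final_labels = review_all(proposals, lambda p: seen_by_human[p.image_id])

# Which machine proposals did the human have to correct?
corrected = [p.image_id for p in proposals if final_labels[p.image_id] != p.label]
print(corrected)  # ['img_002']
```

In this sketch the human still sees every item, as in the final check described above; the machine only saves the effort of labeling from scratch.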

So far, most companies are using AI and humans in a complementary manner. While AI is deployed to take over repetitive tasks, jobs that require teamwork, creativity and social skills still demand human input.

For 24-year-old Liu, the idea that her work will one day be taken over by AI still seems far-fetched. "If AI products are like the newborns, the software developers are like the parents, and we are the people who cook for the newborns," said Liu. "Without the food we provide, the newborns cannot survive."
