Dr. Alok Aggarwal

India and the Meteoric Rise of Annotation Services

Introduction

Like the previous three industrial revolutions, the fourth and current one will undoubtedly upend the status quo, with countless jobs lost and new ones created. This revolution is particularly worrisome because AI systems and other automation techniques are likely to render millions of people unemployed and some even unemployable. In 2013, Frey and Osborne published an extremely influential article stating that “47% of total US employment is in the high-risk category, meaning that associated occupations are potentially automatable over some unspecified number of years, perhaps in a decade or two” [1]. Although very few jobs have been destroyed by AI so far, many doomsayers have repeated the forecast that millions of jobs will be lost to AI within the next few years. In contrast, as discussed in the book “The Fourth Industrial Revolution and 100 Years of AI (1950-2050),” although AI may destroy around 395 million jobs by 2050, the global population will by then have proportionately fewer workers and more aging people, so we may in fact need AI and robots to help us do a substantial amount of work [2]. Also, very few pundits have discussed the new jobs that are likely to be created because of AI. This article deals with one such emerging job category, the “data annotator,” which has largely emerged because of AI and has seen a meteoric rise during the last decade.

1. Machine Learning, Data Annotation, and Ground Truth Checking

In October 1950, Alan Turing posed the following question: can a machine imitate human intelligence? In his seminal paper, “Computing Machinery and Intelligence,” he formulated a game, called the “Imitation Game,” in which a man, a computer, and a human interrogator (“judge”) are in three different rooms. The interrogator’s goal is to distinguish the man from the computer by asking them a series of questions and reading their typewritten responses; the computer’s goal is to convince the interrogator that it is the man [3]. See Figure 1.

Later, in a 1952 BBC interview, Turing mentioned that it would be at least 100 years before any computer could convince a judge and pass the test [4]. Interestingly, many AI systems today are trained in a manner similar to the way we educate our children; this process is called Supervised Machine Learning and is briefly explained below:

Supervised Machine Learning: Suppose we are given one million pictures of the faces of dogs and cats, and we would like to partition them into two groups: one containing dogs and the other cats. Rather than doing this manually, a Machine Learning expert writes a computer program (a “black box” program) that includes various features differentiating dog faces from cat faces (e.g., length of whiskers, droopy ears, angular faces, round eyes).

Also, another professional (a “data annotator”) labels a sample of 10,000 of these pictures, marking each as having either a dog face or a cat face. After enough features have been included in this “black-box” program and after all 10,000 sampled pictures have been labeled (or annotated), these labeled pictures are divided into two groups: a training group (with, say, 9,500 pictures) and a testing group (with 500 pictures).

Next, the first training picture is given to this “black box” program. If the program’s output does not match the label (e.g., if its output is a dog face but the label says it is a cat face), the program modifies some of its internal structure to ensure that its answer for this picture matches the label provided by the annotator. On the other hand, if the program’s output already matches the label, it does not modify its internal structure. See Figure 2.

After going through 9,500 such pictures and modifying itself accordingly, this black box program is tested for accuracy on the remaining 500 pictures. If the accuracy is not sufficient, then the Machine Learning expert either rewrites the program or uses more labeled data to train the current program.

On the other hand, if the trained black box program achieves the desired accuracy, then it can be used to classify the remaining 990,000 (i.e., 1,000,000 minus 10,000) pictures into the two required groups. However, during this classification process, the accuracy could drop because some of the incoming pictures may be blurry, noisy, etc. Hence, the user would need to periodically check the accuracy of the system, and while doing so, the user (or a data annotator) will have to annotate some of the incoming pictures and create the “ground truth” against which the output of this trained black-box program can be checked.

To recap, data labeling is first required for training the Machine Learning program and then required periodically to create the ground truth to ensure the continued accuracy of the trained program.
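To make this workflow concrete, below is a minimal sketch in Python using scikit-learn. The data, features, and labels are synthetic placeholders (real features would come from images and real labels from human annotators); the sketch only illustrates the labeled training/testing split and the later ground-truth check.

```python
# Minimal sketch of the supervised-learning workflow described above.
# All data is synthetic; in practice the features would come from images
# and the labels from human annotators.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# 10,000 annotated pictures: each row is a feature vector
# (e.g., whisker length, ear droop, face angularity); label 0 = cat, 1 = dog.
X = rng.normal(size=(10_000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

# Split the labeled sample: 9,500 pictures for training, 500 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=500, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # the "black box" program
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Later, in production: annotators label a small sample of incoming pictures
# ("ground truth") and the trained model is checked against it.
X_new = rng.normal(size=(200, 4))
y_new_ground_truth = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)
print("ground-truth check:", accuracy_score(y_new_ground_truth, model.predict(X_new)))
```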

Undoubtedly, data labeling is the most arduous and time-consuming human part of building and maintaining AI systems. In particular:

  • Contrary to popular belief, labels do not come only in black and white but in all shades of grey. In fact, two labelers may not even agree on whether a given picture shows a cat or a civet, or whether a specific picture has a bluish or a greenish tint (a simple way to quantify such disagreement is sketched after this list).
  • Since datasets are domain and context-dependent, data labelers may need to work together with data scientists, domain experts, and business owners to label and cleanse the data.
  • In many cases, data scientists and domain experts need to verify that this cleansing and labeling meets their requirements. For example, a labeler may misinterpret a “-” as noise, but the use case may require that “-” be learned by the AI system because it represents a zero. In such a text-based use case, this may require labeling important paragraphs, sentences, and terms in, say, a sample of 10,000 collected articles, 1,000 of which will be used for training and testing the AI system.
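One common way to quantify such labeler disagreement is an inter-annotator agreement statistic such as Cohen's kappa. The sketch below, with made-up labels, shows how it can be computed with scikit-learn; it is an illustration rather than a prescribed part of any particular annotation workflow.

```python
# Quantifying labeler disagreement with Cohen's kappa (labels are illustrative).
from sklearn.metrics import cohen_kappa_score

# Two annotators label the same 10 pictures (1 = cat, 0 = not a cat).
annotator_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```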

Indeed, data annotation or data labeling is critical both for training the AI system and for checking its accuracy. Keeping this in view, in 2009, Prof. Fei-Fei Li (of Stanford University) created ImageNet, which became the largest database of its kind, with more than 14 million URLs of images, almost all of which were hand-labeled (or annotated) to indicate what they contained [5]. The next section discusses the importance of data annotation and ground truth checking in detail.

2. Importance of Data Annotation and Ground Truth Checking

The more accurate an AI system, the less human labor is required. Hence, accuracy is the most important characteristic of AI systems. Unfortunately, because of the debilitating limitations of AI models and because the incoming data changes over time, the accuracy of these AI models deteriorates, thereby forcing data annotators to label new data and AI professionals to retrain the existing AI models. Given below are a few scenarios that may require the AI system to be modified because of deterioration in accuracy:

(1) After operationalizing (i.e., putting into production) the AI system, professionals may realize that the incoming data has a different distribution, which would require additional labeling of data and more retraining, validating, and testing of AI models. For example, the developers of an AI system that detects vehicles on a highway may not have included enough pictures taken when it was raining, snowing, foggy, or misty. Even worse could be the realization that the incoming data distribution is substantially different, which may require a reformulation of the problem and the use case. For example, while training the AI model, practitioners may have assumed that their model would only be used for detecting pedestrians in California, which they believed does not have snowfall. However, they later realize that even California has locations with snowy winters, and they hence need to collect data from those locations and then retrain on the combined dataset.
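As an illustration of how such a distribution shift might be caught in practice, the sketch below compares one feature (here, a hypothetical image-brightness value) between the training data and incoming production data using a two-sample Kolmogorov-Smirnov test; the feature, data, and significance threshold are assumptions made only for illustration.

```python
# Checking whether incoming production data still matches the training
# distribution, using a two-sample Kolmogorov-Smirnov test on one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_brightness = rng.normal(loc=0.6, scale=0.10, size=5_000)  # e.g., clear-weather images
prod_brightness = rng.normal(loc=0.4, scale=0.15, size=1_000)   # e.g., rain/fog lowers brightness

stat, p_value = ks_2samp(train_brightness, prod_brightness)
if p_value < 0.01:
    print("Distribution shift detected: collect and annotate new data, then retrain.")
else:
    print("No significant drift detected on this feature.")
```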

(2) After operationalizing the AI system, AI experts may recognize that the training and testing data has a bias, or that data governance did not include standard practices of consent management, privacy, auditability, or lineage. For example, after training the AI system, these experts may realize that they were using features that included personal information (e.g., age and gender) but that the data gatherers obtained consent only from males and not from females. In this case, they must either remove all data regarding females or obtain consent from the females whose data has been included, which they may not receive or may receive only partially. In either case, they will need to redo the entire AI pipeline (i.e., relabeling or removing the affected data, re-extracting features, and retraining the AI algorithms).

(3) While operationalizing the AI system, AI professionals may determine that their labeling was incorrect, thereby triggering re-labeling, feature re-engineering, and a rerun of the model training pipeline. Similarly, while doing feature engineering, developers may have deprioritized some features (e.g., those related to geography or demographics), a decision that may turn out to be incorrect and may need to be reworked. Also, sometimes the business goal may change (e.g., the conditions for approval or denial of a loan may change), which may require the decision thresholds to change appropriately. Since these thresholds are usually set manually, the corresponding AI system would need to be updated end to end.
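The point about decision thresholds can be illustrated with a tiny sketch: when the business rule changes, the manually set threshold applied to the model's predicted probabilities changes, even if the model itself is untouched. The probabilities and threshold values below are purely illustrative.

```python
# Adjusting a manually set decision threshold when the business rule changes
# (probabilities are illustrative; a real model would produce them).
import numpy as np

loan_default_probabilities = np.array([0.05, 0.12, 0.33, 0.48, 0.71])

old_threshold = 0.50   # approve if predicted default probability < 0.50
new_threshold = 0.30   # tighter lending policy: approve only below 0.30

print("approved under old rule:", int((loan_default_probabilities < old_threshold).sum()))
print("approved under new rule:", int((loan_default_probabilities < new_threshold).sum()))
```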

(4) After operationalizing the system, a change in government rules and regulations may necessitate relabeling of data and redoing the entire pipeline. Alternatively, experts may realize that the AI model is not fair or interpretable, even though government regulations require it to be so.

3. LLMs, Machine Hallucinations, Machine Endearment, and Ground Truth Checking

Although Generative Pretrained Transformers (GPTs) and Large Language Models (LLMs), which are a subset of Deep Learning Networks, are currently all the rage, they suffer from devastating limitations, namely Machine Hallucinations and Machine Endearment, which are briefly discussed below with examples.

Machine Hallucinations: Falcon-40B is an open-source LLM that was unveiled recently. It usually provides fairly accurate and relevant answers. However, like most other LLMs of its ilk, it occasionally provides the right answer the first time and a wrong one the second time (or vice versa), which is just one example of machine hallucination (a minimal sketch for checking such consistency appears after the example below). For example, we asked the following question:

  • Question: “Which of the following is a mammal: a) snake, b) eagle, c) dolphin, or d) frog?”
  • Falcon-40B’s first answer is correct: “A mammal is a member of the class of vertebrate animals that have hair or fur and nourish their young with milk produced by mammary glands. Therefore, the correct answer is ‘c) dolphin’.”
  • Falcon-40B’s second answer is incorrect: “All of the animals listed are mammals except for the eagle. Eagles are birds.”
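A minimal version of the consistency check mentioned above might look like the following sketch, in which the model is asked the same question several times and its answers are compared with one another and with an annotator-provided answer key. The `ask_llm` function is a hypothetical placeholder for whichever model or API is actually being checked.

```python
# A minimal ground-truth / consistency check for an LLM's answers.
# `ask_llm` is a hypothetical stand-in for the deployed model or API;
# the annotator-provided answer key serves as the "ground truth".
from collections import Counter

def ask_llm(question: str) -> str:
    # Placeholder: in practice this would call the deployed model.
    # A canned answer is returned here only to keep the sketch runnable.
    return "dolphin"

question = "Which of the following is a mammal: a) snake, b) eagle, c) dolphin, or d) frog?"
ground_truth = "dolphin"  # supplied by a human annotator

answers = [ask_llm(question).strip().lower() for _ in range(5)]
votes = Counter(answers)

consistent = len(votes) == 1                       # did the model contradict itself?
correct = votes.most_common(1)[0][0] == ground_truth
print(f"consistent: {consistent}, majority answer correct: {correct}")
```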

Unfortunately, the World Wide Web is now replete with examples of Machine Hallucinations from almost all GPTs and LLMs (e.g., ChatGPT, Gemini, and Claude 3) that are in operation today. Since all categories of current Deep Learning Networks, including GPTs, are unexplainable and uninterpretable, they become highly untrustworthy when they provide inconsistent or wrong answers. These characteristics force ground truth checkers to spend substantial time verifying the answers provided by LLMs, or they are duped into accepting wrong answers.

Machine Endearment: Regardless of the validity of their content, the output from most GPTs is confident, syntactically coherent, polite, and eloquent. Since they are trained on vast troves of human-written data, they usually produce outputs that also appear convincingly human. For example:

In June 2023, Jonas Simmerlein, a theologian from the University of Vienna, used ChatGPT to create a 40-minute sermon for Protestants (see Chapter 11 in [2] for details). According to him, 98% of the content came from ChatGPT, and the entire service was conducted by two male and two female avatars on the screen. About 300 people attended this service, and some eagerly videotaped the event. One of the attendees, Marc Jansen, a 31-year-old Lutheran pastor, was impressed and remarked, “I had actually imagined it to be worse. But I was positively surprised by how well it worked. Also, the language of the AI worked well, even though it was still a bit bumpy at times.”

The above-mentioned communication style is reminiscent of an endearing advisor, to whom we often turn for direction or assistance. Over time, we begin to rely on such advisors because they seem endearing and appear to have a stake in our well-being. We therefore call this characteristic of AI systems “Machine Endearment”; it refers to the broad notion of people trusting AI systems because of their human-like responses, irrespective of their validity.

Devastating Outcomes Due to Machine Hallucinations and Machine Endearment: Unfortunately, although the arguments by GPTs are seemingly persuasive, they are sometimes Machine Hallucinations. Given below is one such example:

In 2023, two lawyers, Schwartz and LoDuca, used ChatGPT to find prior legal cases to strengthen their client’s lawsuit. In response, this Transformer provided six nonexistent cases. Since these cases were fabricated, the presiding judge fined Schwartz and LoDuca 5,000 US Dollars. According to an affidavit filed in the court, Schwartz eventually acknowledged that ChatGPT invented the cases, but he had been “unaware of the possibility that its content could be false” and therefore believed that it had produced genuine citations.

Since there is no easy way of knowing whether these GPTs and LLMs are providing correct results or ones caused by Machine Hallucination, “ground truth checking” is required, and this is where many data annotators are being used. Of course, some data annotators working in specific domains (e.g., healthcare) require a considerable amount of subject matter expertise, whereas others may not. Finally, although GPTs and LLMs were discussed above, almost all contemporary AI systems (including other Generative AI models such as Generative Adversarial Networks and Diffusion Models, as well as traditional models such as Support Vector Machines) are brittle (i.e., they break easily) and hence require ground truth checking on a regular basis.

4. The Meteoric Rise of Data Annotation and Its Market Size Estimate

Fifteen years have gone by since the creation of ImageNet (by Fei-Fei Li), and today there are more than 1,000 use cases of AI that have already been operationalized. Almost all of them rely on Supervised Machine Learning and hence require gigantic amounts of data to be annotated (or labeled). The original trend started around 2003-04 with Amazon, Walmart, Target, and other e-commerce companies, which initially used workers in India to label their products and create catalogs. In fact, Amazon Mechanical Turk (MTurk) was one of the first crowdsourcing websites with which businesses could hire remotely located “crowd workers” to perform discrete on-demand tasks such as labeling. And many of these “crowd workers” were based in India.

Since every organization is different from others, it also has some datasets that are different from others'. For example, if two companies are using Large Language Models to answer customer queries, then the answers given by each company will depend upon its own dataset. This implies that each company will need annotators to cleanse and label most of the data it owns. Of course, this is true not only of Large Language Models but of all AI systems (wherein each organization will need its datasets to be cleansed and annotated). And, since the number of organizations using AI systems is growing exponentially, the need to hire data annotators is growing accordingly. Of course, organizations may hire full-time annotators in-house or may outsource this task to a third party, but in all such cases they will need such annotators.

As an employment category, data annotation is currently being defined broadly. For example, it includes:

  • Medical transcriptionists verify and correct the output of AI speech systems that transcribe notes from medical doctors and other healthcare workers.
  • Annotators check the output of AI-based language translation systems against the ground truth.
  • Labelers verify the accuracy of AI-based speech recognition and speaker recognition systems.

Hence, during the last decade, Annotation as a Service has been on a meteoric rise. Already, there are more than 400,000 annotators worldwide, and by 2050, this field is likely to have around 15 million. Currently, its market size is roughly 11.5 billion US Dollars and is expected to be around 975 billion US Dollars by 2050 (assuming a global annual inflation of 3%).
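As a rough sanity check on these estimates, the short calculation below uses only the figures quoted in this article; the assumption that “currently” corresponds to roughly 2024 (and hence about 26 years until 2050) is ours.

```python
# Back-of-the-envelope check of the market-size figures quoted above,
# using only numbers stated in this article.
current_revenue, current_annotators = 11.5e9, 400_000
future_revenue, future_annotators = 975e9, 15_000_000

per_annotator_now = current_revenue / current_annotators    # ~ $28,750 per annotator
per_annotator_2050 = future_revenue / future_annotators     # ~ $65,000 per annotator

inflation_factor = 1.03 ** 26                               # ~2.16 at the assumed 3% per year
print(f"Implied revenue per annotator now:  ${per_annotator_now:,.0f}")
print(f"Implied revenue per annotator 2050: ${per_annotator_2050:,.0f}")
# Inflating today's per-annotator figure at 3% per year gives roughly $62,000,
# i.e., in the same ballpark as the implied 2050 figure above.
print(f"Current figure inflated to 2050:    ${per_annotator_now * inflation_factor:,.0f}")
```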

Annotation as a Service and the Crucial Role Being Played by India: More than 55% of the labor involved in creating an AI system goes into cleansing and annotating data. Hence, this task needs to be done inexpensively but accurately. Furthermore, these annotators need to work very closely with AI experts, Data Scientists, and Data Engineers. Since India currently has more than 5.4 million software professionals, many of whom are AI experts, Data Scientists, and Data Engineers, the proximity of annotators in India obviously helps.

Below are a few reasons why India has become a hotbed for providing such services:

  • Wages in India are about one-fourth of those in the United States and other developed countries.
  • Since broadband and Wi-Fi connectivity in second-tier and third-tier cities in India has become quite good, annotators can work from these cities and reduce costs further.
  • Unlike in large cities (such as Bangalore and Gurgaon), there is very little attrition among such workers in second- and third-tier cities, which not only reduces human-resources churn but also ensures that annotators become more adept over time.
  • Written English is a common medium of communication across India.
  • For about 75% of all AI-related use cases, annotators only need a high school education, and most can be trained to become good annotators within three months.

Given the above-mentioned reasons and since India has a large workforce that is growing rapidly, our analysis shows:

  • Currently, there are 400,000 annotators worldwide (with an annual revenue of approximately 12 billion US Dollars) out of which 200,000 are based in India (with an annual revenue of 4.8 billion US Dollars).
  • By 2030, around 1,000,000 annotators are likely to be working globally (with an annual revenue of 36 billion US Dollars) out of which 550,000 will be in India (with an annual revenue of 15.8 billion US Dollars).
  • By 2050, around 15,000,000 annotators are likely to be working globally (with an annual revenue of 900 billion US Dollars) out of which 8,000,000 will be in India (with an annual revenue of 384 billion US Dollars).

China has around 100,000 data annotators and is second in this emerging area. China also has some of the same advantages as India, but Chinese annotators are generally not as proficient in English, and hence most of their labeling is related to the Chinese language. Other low-wage countries like the Philippines, Pakistan, Bangladesh, and a few African countries are also providing some annotators, but because they do not have many AI experts, Data Scientists, or Data Engineers, their numbers are relatively small. Finally, a substantial number of annotators are based in high-wage countries.

Annotation as a Service is a sub-segment of the Business and Knowledge Process Outsourcing Sectors: As mentioned above, many data annotation tasks only require basic proficiency in the English language and no other subject matter expertise. Hence, such tasks do not require more than a high-school education, and some of them are being performed by college students working part-time, thereby making such work a part of the gig economy.

Since LLMs are likely to reduce the number of call center agents in India from approximately 1.5 million currently to around 750,000 during the next decade, many such people can be redeployed as data annotators. Hence, to some extent, Annotation as a Service belongs to the broader category of “Business Process Outsourcing.”

Finally, as mentioned in Section 2, some of the data annotation work requires considerable subject matter expertise. For example, a data annotator who is labeling the location in X-ray pictures where Melanoma (skin cancer) is present needs to have domain expertise in this field. Similarly, a data annotator who labels different kinds of fishes and other sea creatures needs to have substantial knowledge of this area. Hence, unsurprisingly, such data annotators usually have a college degree in their area of specialization and charge substantially more than those who are labeling data that does not require any subject matter expertise.

Indeed, just like the number of call center agents is expected to decrease (due to LLMs), the number of medical transcriptionists is also likely to decrease (due to AI and automated speech recognition systems). It is quite likely that several such people will be redeployed as data annotators. Hence, some areas within Annotation as a Service belong to the category of “Knowledge Process Outsourcing” [6].

Blog Written by

Dr. Alok Aggarwal

CEO, Chief Data Scientist at Scry AI
Author of the book The Fourth Industrial Revolution
and 100 Years of AI (1950-2050)