
Data and AI are keys to digital transformation – how can you ensure their integrity?




If data is the new oil of the digital economy, artificial intelligence (AI) is the steam engine. Companies that take advantage of the power of data and AI hold the key to innovation — just as oil and steam engines fueled transportation and, ultimately, the Industrial Revolution.

In 2022, data and AI have set the stage for the next chapter of the digital revolution, increasingly powering companies across the globe. How can companies ensure that responsibility and ethics are at the core of these revolutionary technologies?

Defining responsibility in data and AI

Arguably, one of the largest contributing factors to bias in AI is the lack of diversity among the annotators and data labelers who create the training data that AI models ultimately learn from.

Saiph Savage, a panelist at VentureBeat’s Data Summit and assistant professor and director of the Civic AI Lab at the Khoury College of Computer Sciences at Northeastern University, says that responsible AI begins with groundwork that’s inclusive from the start.

“One of the critical things to think about is, on the one hand, being able to get different types of workforces to conduct the data labeling for your company,” Savage said during VentureBeat’s Data Summit conference. “Why? Let’s say that you only recruit workers from New York. It’s very likely that the workers from New York might even have different ways of labeling information than a worker from a rural region, based on their different types of experiences and even different types of biases that the workers can have.”
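Savage’s point lends itself to a simple check: have two worker pools label the same items, then compare their label distributions and their raw agreement. Below is a minimal sketch of that comparison; the pool names and labels are hypothetical, purely for illustration.

```python
from collections import Counter

# Hypothetical labels assigned to the same six items by two worker pools.
labels_nyc   = ["toxic", "ok", "toxic", "toxic", "ok", "toxic"]
labels_rural = ["ok",    "ok", "toxic", "ok",    "ok", "toxic"]

def label_distribution(labels):
    """Fraction of items assigned each label by one worker pool."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def raw_agreement(a, b):
    """Share of items on which the two pools chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(label_distribution(labels_nyc))    # {'toxic': 0.67, 'ok': 0.33} (approx.)
print(label_distribution(labels_rural))  # {'ok': 0.67, 'toxic': 0.33} (approx.)
print(raw_agreement(labels_nyc, labels_rural))  # 0.67 (approx.)
```

Systematic gaps between pools on identical items are exactly the kind of experience-driven labeling difference Savage describes.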


Industry experts understand that a large swath of the AI models in production today must learn from annotated, labeled data, which bolsters the model’s intelligence and, ultimately, the machine’s overall capabilities.

The technologies involved, such as natural language processing (NLP), computer vision and sentiment analysis, are intricate, and with that complexity the margin for error in how the AI is trained can unfortunately be quite large.

Research shows that even well-known NLP language models contain racial, religious, gender and occupational biases. Similarly, researchers have documented pervasive biases in computer vision algorithms, showing that these models automatically learn bias from the way groups of people (by ethnicity, gender, weight, etc.) are stereotypically portrayed online. Sentiment analysis models face the same challenges.
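One common way researchers surface such biases is template probing: scoring sentences that differ only in a demographic term and comparing the model’s outputs. Here is a minimal sketch of that technique; `score_sentiment`, the templates and the groups are all illustrative stand-ins, not any particular model’s API.

```python
# Template probing: score sentence variants that differ only in a demographic
# term and compare the sentiment a model assigns to each.
# `score_sentiment` is a hypothetical stand-in for whatever model you use.

TEMPLATES = [
    "The {group} engineer finished the project.",
    "My {group} neighbor greeted me this morning.",
]
GROUPS = ["young", "elderly", "immigrant", "local"]

def bias_report(score_sentiment):
    """Print per-group mean sentiment; large gaps between groups on
    otherwise identical sentences hint at learned bias."""
    for group in GROUPS:
        scores = [score_sentiment(t.format(group=group)) for t in TEMPLATES]
        print(f"{group:>10}: mean sentiment = {sum(scores) / len(scores):+.3f}")

# Usage with any callable mapping text to a score in [-1, 1]:
# bias_report(my_model.predict)
```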

“Responsible AI is a very important topic, but it is only as good as it is actionable,” said Olga Megorskaya, Data Summit panelist and CEO of Toloka AI, a global data labeling platform. “If you are a business, applying AI responsibly means constantly monitoring the quality of the models that you have deployed in production at every moment and understanding where the decisions made by AI come from. [You must] understand the data on which these models were trained and constantly update the training models to the current context in which the model is operating. Secondly, responsible AI means responsible treatment of the people who are actually acting behind the scenes of training AI models. And this is where we tightly cooperate with many researchers and universities.”

Explainability and transparency

If responsible AI is only as good as it is actionable, then AI’s explainability and transparency are only as good as the transparency and information extended to the annotators and labelers working with the data, and to the customers of the companies using services like Toloka.

Toloka, which launched in 2014, positions itself as a crowdsourcing and microtasking platform that sources diverse individuals worldwide to quickly mark up large amounts of data, which is then used for machine learning and improving search algorithms.

Over the past eight years, Toloka has expanded; today, the project boasts upwards of 200,000 users from more than 100 countries contributing to data annotation and labeling. The company also develops tools to help detect biases in datasets, as well as tools that provide rapid feedback on labeling-project issues that could affect the requesting company’s interfaces, projects or tools. Toloka also works closely with researchers at laboratories like the Civic AI Lab at the Khoury College of Computer Sciences at Northeastern University, where Savage works.

According to Megorskaya, companies in the AI and data labeling market should work toward transparency and explainability in a way that “… match[es] both the interests of the workers and of the businesses to make it a win-win situation where everybody gets the advantage of common development.”

Megorskaya recommends enterprises stay attuned to the following to ensure transparency and explainability on internal and external fronts:

  • Constantly adjust the data on which AI is trained to reflect current, real-life situations.
  • Measure the quality of your models and build metrics from those measurements so you can track each model’s improvement and performance over time (see the sketch after this list).
  • Stay nimble. Think of transparency as visibility into the guidelines that data labelers follow when conducting annotations.
  • Make feedback accessible and prioritize addressing it.
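As a minimal sketch of the second recommendation, the snippet below logs a quality metric for each evaluation run so a model’s performance can be tracked over time. The metric (plain accuracy) and the in-memory log are illustrative assumptions; in practice this would feed a metrics store or dashboard.

```python
import datetime

# Illustrative quality log: plain accuracy per evaluation run, kept in memory.
quality_log: list[dict] = []

def evaluate(model_name: str, predictions: list, labels: list) -> float:
    """Record simple accuracy for one evaluation run and return it."""
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    quality_log.append({
        "model": model_name,
        "when": datetime.datetime.now(datetime.timezone.utc),
        "accuracy": accuracy,
    })
    return accuracy

evaluate("intent-classifier", ["a", "b", "a"], ["a", "b", "b"])  # 0.67
# Re-evaluate on fresh, current data regularly; a falling trend in
# `quality_log` flags a model that no longer matches its operating context.
```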

For example, Toloka’s platform offers visibility into available tasks, as well as the guidelines for the labelers doing the work. This creates a direct, rapid feedback loop between the workers doing the labeling and the companies that request the work. If a labeling rule or guideline needs to be adjusted, that change can be made at a moment’s notice. This process lets teams of labelers approach the rest of the data labeling process in a more unified, accurate and up-to-date manner, leaving room for a human-centric approach to addressing biases as they arise.

Bringing the ‘humanity’ to the forefront of innovation

Both Megorskaya and Savage agree that when a company leaves its data labeling and annotation to third parties or outsourcing, that decision itself creates a crack in the responsible development of the AI that data will eventually train. Oftentimes, companies that outsource the labeling and training of their AI models have no way to interact directly with the individuals who are actually labeling the data.

By focusing on removing bias from the AI production sphere and breaking the cycle of disconnected systems, Toloka says AI and machine learning will become more inclusive and representative of society.

Toloka hopes to pave the way for this change and aims to have development engineers at requesting companies meet the data labelers face-to-face, so engineers can see the diversity of the end-users their data and AI will eventually impact. Engineering without visibility into the real people, places and communities a company’s technology will ultimately affect creates a gap; removing that gap in this manner adds a new layer of responsible development for teams.

“In the modern world, no effective AI model can be trained on some data collected by a narrow group of preselected people who are spending their lives only doing this annotation,” Megorskaya said.

Toloka is building data sheets to showcase the biases that workers can have. “When you’re doing data labeling, these sheets display information such as what type of backgrounds the workers have, what backgrounds might be missing,” Savage said. “This is particularly helpful for developers and researchers to see so they can make decisions to obtain the backgrounds and perspectives that may be missing in the next run to make the models more inclusive.”
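A minimal sketch of what such a datasheet summary could compute is below; the metadata fields, values and target regions are hypothetical, not Toloka’s actual format.

```python
from collections import Counter

# Hypothetical annotator metadata for one labeling project.
annotators = [
    {"region": "North America", "gender": "woman"},
    {"region": "North America", "gender": "man"},
    {"region": "South Asia",    "gender": "man"},
]
# Illustrative target mix the project hoped to cover.
TARGET_REGIONS = {"North America", "South Asia", "Latin America", "Africa"}

def datasheet(annotators):
    """Summarize who labeled the data and which backgrounds are missing."""
    regions = Counter(a["region"] for a in annotators)
    print("Regions represented:", dict(regions))
    print("Regions missing:    ", sorted(TARGET_REGIONS - set(regions)))

datasheet(annotators)
# Regions represented: {'North America': 2, 'South Asia': 1}
# Regions missing:     ['Africa', 'Latin America']
```

The "missing" list is what lets developers recruit the absent perspectives before the next labeling run, as Savage describes.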

Though including the globe’s countless ethnicities, backgrounds, experiences and upbringings in every dataset and model may seem a daunting endeavor, Savage and Megorskaya stress that for enterprises, researchers and developers alike, the most important way to keep climbing toward equitable and responsible AI is to involve, from the start, as many of the major stakeholders your technology will impact as possible, because correcting biases later down the road becomes much more difficult.

“It may be difficult to say AI can be absolutely responsible and ethical, but it’s important to approach this aim as closely as you possibly can,” Megorskaya said. “It is critical to have as wide and inclusive representation as possible to give engineers the best tools to effectively build AI as responsibly as possible.”
