
How To Get Data For AI Applications - Tricks and Tactics

By Christopher Steiner  •  Jun 20, 2017

Any engineer who has taken the first steps of learning to work with AI methods has confronted the foremost challenge of the space: sourcing enough high quality data to make a project viable. Sample sets of data can be had, of course, but working with these isn't much fun for the same reason that solving a machine problem for computer science class isn't much fun: quite simply, it's not real.

In fact, using fake data is somewhat anathema to the spirit of independently developing software: we do it because fixing real problems, even if they're trivial or just our own, is quite satisfying. 

Using the example dataset from AWS allows a developer to understand how Amazon's Machine Learning API works, which is the point, of course, but most engineers won't dig too deeply into the problems and methods here, as it's not interesting to keep grinding on something that's been solved by thousands of people before and to which the engineer has no stake. 

So the real challenge for an engineer then becomes: how and where to get data—enough of it—to hone one's AI skills and to build the desired model? 

"When on the prowl for the newest AI developments, it may be helpful to remember that data comes first, not the other way around," says Michael Hiskey, the CMO of Semarchy, which makes data management software.

This first hurdle, where to get the data, tends to be the most bedeviling. For those who don't own an application that's throwing off deep troves of data, or who don't have access to a historical base of data upon which to build a model, the challenge can be daunting. 

Most great ideas in the AI space die right here, because would-be founders conclude that the data doesn't exist, that getting it is too hard, or that what little of it that does exist is too corrupted to use for AI. 

Climbing over this challenge, however, is what separates rising AI startups from those who merely talk about doing it. Here are some tips to make it happen:

The highlights (more details below):

  • Multiply the power of your data 

  • Augment your data with those that are similar 

  • Scrape it

  • Look to the burgeoning TDaaS space 

  • Leverage your tax dollars and tap the government

  • Look to open-sourced data repositories 

  • Utilize surveys and crowdsourcing 

  • Form partnerships with industry stalwarts who are rich in data

  • Build a useful application, give it away, use the data

Multiply the power of your data 

Some of these problems can be solved via simple intuition. If a developer seeks to make a deep learning model that will recognize images that contain the face of William Shatner, enough pictures of the Star Trek legend and Priceline pitchman could be scraped from the web—along with even more random images that don't include him (the model will require both, of course). 

Beyond tinkering with data already in hand, however, data seekers need to get creative.

For AI models being trained to identify dogs and cats, one picture can effectively be turned into four: a single photo of a dog and cat can be rotated and mirrored into several distinct training images.
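A minimal sketch of this trick, assuming images are plain NumPy arrays: rotations and mirror flips turn one labeled picture into several, and the label carries over unchanged.

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Turn one image array into several by rotating and mirroring.

    Each variant is still a valid picture of the same subject, so the
    original label applies to every copy for free.
    """
    variants = [image]
    for k in (1, 2, 3):                   # 90, 180, and 270 degree rotations
        variants.append(np.rot90(image, k))
    variants.append(np.fliplr(image))     # horizontal mirror
    return variants

# A tiny 2x2 stand-in "image": one sample becomes five labeled samples.
img = np.arange(4).reshape(2, 2)
print(len(augment(img)))  # 5
```

Real pipelines add crops, brightness shifts, and small random distortions on the same principle, but even these two transforms multiply a scarce dataset several times over.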

Augment your data with those that are similar

Brennan White, the CEO of Cortex, which helps companies formulate content and social media plans through AI, found a clever solution when coming up short on data. 

"For our customers looking at their own data, the amount of data is never enough to solve the problem we're focused on," he says. 

White solved the issue by sampling social media data of his customers' closest competitors. Adding that data to the set increased the sample by enough multiples to give him a critical mass with which to build an AI model. 
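As a rough sketch of this pooling idea (the field names and records below are hypothetical, not Cortex's actual pipeline), combining a customer's data with competitors' can be as simple as concatenating the records and tagging each row with its source, so a model or analyst can weight borrowed samples differently:

```python
# Hypothetical social media records with simple engagement metrics.
own_posts = [{"text": "Launch day!", "likes": 120, "source": "own"}]

competitor_posts = [
    {"text": "Our spring sale", "likes": 340},
    {"text": "Behind the scenes", "likes": 95},
]

# Tag borrowed rows so downstream code knows where each sample came from.
training_set = own_posts + [
    {**post, "source": "competitor"} for post in competitor_posts
]

print(len(training_set))  # 3
```

The source tag is the important part: it preserves the option to down-weight or exclude the borrowed data later without rebuilding the set.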

Scrape it

Scraping is how applications get built. It's how half the web came to be. We'll insert the canned warning here about violating websites' terms of service by crawling their sites with scripts and recording what you might find—many sites frown on this, but not all of them. 

Assuming founders are operating above board here, there are nearly endless roads of information that can be traveled by building code that crawls and parses the web. The smarter the crawler, the better the data. 
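A minimal, standard-library sketch of the parsing half of a crawler: in a real scraper the HTML would arrive over the network via urllib or similar, but a literal string keeps the example self-contained.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href from anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In practice this string would come from urllib.request.urlopen(url).read().
page = '<html><body><a href="/data.csv">data</a><a href="/about">about</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/data.csv', '/about']
```

From here, a crawler is a loop: fetch a page, collect its links and data, queue the links it hasn't visited, and repeat, while respecting robots.txt and rate limits.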

This is how a lot of applications and datasets get started. For those afraid of scraping errors, or of being blocked by cloud servers or ISPs that notice what you're up to, there are human-based options. In addition to Amazon's Mechanical Turk, which it playfully refers to as "Artificial Artificial Intelligence," there exists a bevy of options: Upwork, Fiverr, and Elance. There is also a similar kind of platform aimed directly at data, dubbed TDaaS, covered next.

Look to the burgeoning TDaaS space

Beyond all of this, there are now startups that help companies, or other startups, solve the data problem. The clunky acronym that has sprouted up around these shops is TDaaS—training data as a service. Companies like this give startups access to a labor force that's trained and ready to help in gathering, cleaning and labeling data, all part of the critical path to building a working model.

Training data as a service (TDaaS): startups such as CrowdFlower provide training data across domains ranging from visual data (images and videos for object recognition) to text data (used for natural language processing tasks). 

Think of this process as similar to using Amazon's Mechanical Turk, with much of the explicit AI-related instructions and standards abstracted away. Through these channels, there's also less of a burden on the startup to vet workers and dig through completed jobs to sort for quality. That's what the platforms do for founders.

Leverage your tax dollars and tap the government

It can be helpful to look first to governments, federal and state, for data on a given topic, as public bodies make more and more of their data troves available for download in useful formats. The open data movement within government is real, and it has a website that's a great place for engineers looking to get a project started.

Open-source data repositories

As machine learning methods become more prevalent, the infrastructure and communities that support them have grown up as well. Part of that ecosystem includes publicly accessible stores of data that cover a multitude of topics and disciplines. 

Gurudatt Bhobe, the COO and co-founder of SupplyAI, which uses AI to help prevent retail returns, advises founders to look to these repos before building a scraper or running in circles trying to scare up data from sources that are less likely to be cooperative. There is an expanding set of subjects on which data is available through these repos. 

Some repos to check out:

The UCI Machine Learning Repository (University of California, Irvine)

Data Science Central

Free datasets on GitHub

Utilize surveys and crowdsourcing

Stuart Watt, the CTO of Turalt, which uses AI to help companies introduce more empathy into their communications, has had success with crowdsourcing data. He notes that it's important to be detailed and explicit in instructions to the people sourcing the data. Some users, he notes, will try to speed through the required tasks and surveys, clicking merrily away. But almost all of those cases can be spotted by instituting a few tests for speed and variance, Watt says, and discarding results that don't fall within normal ranges.
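A rough sketch of the speed-and-variance screen Watt describes; the thresholds here are illustrative assumptions, not values from the article.

```python
from statistics import mean, pvariance

def looks_honest(times, answers, min_seconds=5.0, min_variance=0.25):
    """Flag respondents whose timing or answers suggest click-through.

    times   -- seconds spent on each question
    answers -- numeric answers (e.g. 1-5 Likert responses)
    """
    too_fast = mean(times) < min_seconds          # raced through the survey
    too_flat = pvariance(answers) < min_variance  # same answer every time
    return not (too_fast or too_flat)

# A careful respondent vs. someone mashing "3" as fast as possible.
print(looks_honest([12, 9, 15], [2, 4, 3]))  # True
print(looks_honest([1, 1, 2], [3, 3, 3]))    # False
```

In practice the thresholds would be calibrated from the distribution of all respondents rather than fixed in advance.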

Andrew Hearst, a unified search engineer at Bloomberg, also thinks that crowdsourced data can be quite useful and economical—as long as there are controls for quality. He recommends constantly testing the quality of responses. 

Respondents’ goals in crowdsourced surveys are simple: complete as many units as possible in the shortest period of time in order to make money. However, this doesn’t align with the goal of the engineer who is working to get lots of good data. To ensure that respondents provide good data, Hearst says, they should first pass a test that mimics the actual task. For those who do pass, additional test questions should be randomly given throughout the task, unbeknownst to them, for quality assurance.

"Eventually respondents learn which units are tests and which ones are not, so engineers will need to constantly create new test questions," Hearst adds.

Form partnerships with industry stalwarts who are rich in data

For startups looking for data in a particular field or market, it can be beneficial to form partnerships with the industry's core places to get relevant data. Forming partnerships will cost startups precious time, of course, but the proprietary data gained will build a natural barrier to any rivals looking to do similar things, points out Ashlesh Sharma, who holds a PhD in computer vision and is co-founder and CTO of Entrupy, which uses machine learning to authenticate high-end luxury products (like Hermès and Louis Vuitton handbags). 

Build a useful application, give it away, use the data

A more passive method than going out and building partnerships is simply giving away access to a cloud application that's useful to customers. The data that makes it into the app, if it gets traction, can be used to build machine learning models. Google has leveraged this method for years via Google Photos, YouTube, and even versions of CAPTCHA.

Christopher Steiner is a New York Times Bestselling Author and the founder of ZRankings, and the co-founder of Aisle50 (YCS11), which was acquired by Groupon in 2015.