Link To TurkPrime

How to run successful experiments and get the most out of Amazon's Mechanical Turk

Wednesday, August 15, 2018

TurkPrime Tools to Help Combat Responses from Suspicious Geolocations

Last week, the research community was struck with concern that “bots” were contaminating data collection on Amazon’s Mechanical Turk (MTurk). We wrote about the issue and conducted our own preliminary investigation into the problem using the TurkPrime database. In this blog, we introduce two new tools TurkPrime is launching to help researchers combat suspicious activity on MTurk and reiterate some of the important takeaways from this conversation so far.

TurkPrime’s Tools to Deal with Suspicious Activity

As we announced last week, we’ve created two new tools to help researchers fight fraud in their data collection:

1. Block Suspicious Geolocations
2. Block Duplicate Geolocations

The Block Suspicious Geolocations tool is a Free Feature that allows researchers to block submissions from a list of suspicious geolocations. In our investigation last week, we identified several geolocations that were responsible for a majority of duplicate submissions. Our Block Suspicious Geolocations tool will prevent any MTurk Worker from submitting a HIT from these locations. As mentioned in last week’s blog, once we removed these locations from our analyses, we saw that the rate of duplicate submissions from the same geolocation across studies this summer fell to 1.7%—a number well within the range of what we’ve identified as normal across the life of our platform. The screenshot below shows our new Block Suspicious Geolocations tool, found in Tab 6 “Worker Requirements” when you design a study.

Our second tool, the Block Duplicate Geolocations tool, is a Pro Feature that allows researchers to block multiple submissions from any geolocation. The Block Duplicate Geolocations tool casts a much wider net than the Block Suspicious Geolocations tool and should ensure that responses collected in any one survey come from a more distributed set of locations. By restricting the number of submissions from each geolocation, researchers can be more confident that the responses they collect are coming from unique participants. When using this tool, data collection may be a little slower, especially if the target sample is concentrated in a small geographic area (e.g., one particular state). The screenshot below shows our new Block Duplicate Geolocations Tool, found in Tab 8 “Pro Features” when you design a study.
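Conceptually, blocking duplicate geolocations amounts to accepting only the first submission seen from each location within a study. The sketch below illustrates that idea; the field names and the strict one-submission-per-location rule are our assumptions for illustration, not TurkPrime's actual implementation:

```python
# Illustrative sketch: accept only the first submission from each
# (latitude, longitude) pair within a single study.
# Field names ("lat", "lon", "worker_id") are hypothetical.

def block_duplicate_geolocations(submissions):
    seen = set()
    accepted = []
    for sub in submissions:
        loc = (sub["lat"], sub["lon"])
        if loc in seen:
            continue  # block: this geolocation has already submitted
        seen.add(loc)
        accepted.append(sub)
    return accepted

subs = [
    {"worker_id": "A1", "lat": 42.886, "lon": -78.878},
    {"worker_id": "A2", "lat": 40.713, "lon": -74.006},
    {"worker_id": "A3", "lat": 42.886, "lon": -78.878},  # same location as A1
]
kept = block_duplicate_geolocations(subs)
# kept contains A1 and A2; A3 is blocked as a duplicate
```

A real implementation would need to allow for some legitimate sharing (e.g., workers behind the same internet service provider), which is why this wider net can slow data collection.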

Moving Forward

Understanding what has caused the recent increase in low quality responses on MTurk and the corresponding increase in submissions from the same geolocation is a matter of ongoing research. As we learn more details we will share them with the research community and continue to develop tools that ensure the highest quality of research data.

More immediately, we have identified a list of worker IDs that have repeatedly been associated with suspicious geolocations. In addition to the tools described above, we will create an internal exclusion list based on the worker IDs of suspicious accounts over the next several days. This exclusion list will create an additional layer of protection on our system by blocking worker accounts that have a high likelihood of being involved in fraud. We will write another blog to provide more detail about this issue in the coming days. In the meantime, however, researchers already have two powerful tools for eliminating fraud in their data collection. These tools should increase researchers’ confidence that they are obtaining genuine responses from unique workers.
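At its core, an exclusion list of this kind is a set-membership check at submission time. The sketch below is purely hypothetical (the worker IDs are made up, and TurkPrime's internal list and matching logic are not public):

```python
# Hypothetical sketch of a worker-ID exclusion list.
# The IDs here are invented placeholders, not real MTurk worker IDs.
EXCLUDED_WORKER_IDS = {"WORKER_FAKE_001", "WORKER_FAKE_002"}

def allow_submission(worker_id, excluded=EXCLUDED_WORKER_IDS):
    """Return True if the worker is not on the exclusion list."""
    return worker_id not in excluded

allow_submission("WORKER_FAKE_001")  # False: account is blocked
allow_submission("WORKER_OK_123")    # True: account is allowed
```

Because the check is per-account rather than per-location, it catches suspicious accounts even when they move to a new geolocation.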

Friday, August 10, 2018

Concerns about Bots on Mechanical Turk: Problems and Solutions

Data quality on online platforms
When researchers collect data online, it’s natural to be concerned about data quality. Participants aren’t in the lab, so researchers can’t see who is taking their survey, what those participants are doing while answering questions, or whether participants are who they say they are. Not knowing is unsettling.

Recently, the research community has been consumed with concern that workers on Amazon’s Mechanical Turk (MTurk) are cheating requesters by faking their location or using “bots” to submit surveys. These concerns originated and have been driven by reports from researchers that there are more nonsensical and low-quality responses in recent studies conducted on MTurk. In these studies, researchers have noticed that several low-quality responses are pinned to the same geolocation. In this blog, we’d like to add some context to the conversation, share the findings from our internal inquiry, and inform researchers what TurkPrime is doing to address the issue.

Concern about Bots
The recent concern about bots appears to have begun on Tuesday, August 7th, 2018, when a researcher asked the PsychMap Facebook group if anyone had experienced an increase in low quality data. In just the third response to that thread, another researcher suggested, “maybe a machine?” Soon, other researchers were reporting an increase in nonsense responses and low-quality data, although at least a few reported no increase in junk responses to their studies. The primary piece of evidence causing researchers to suspect bots was that most of the low-quality responses were tagged to the same geolocations, with a few places appearing in particular—Niagara Square in Buffalo, NY; a lake in Kansas; and a forest in Venezuela. What’s more, many respondents from these geolocations provided suspicious responses to open-ended questions, often answering with “GOOD STUDY,” or “NICE.”

Although this activity raises concerns, the conversation, so far, has overlooked some important points. Most critically, while it is clear some researchers have unfortunately collected several bad responses, the research community does not yet know how widespread this problem is. Diagnosing the issue requires knowing how many studies don’t fit the pattern, as well as how many do.

Scope of problem
At TurkPrime, we track the geolocation of all surveys submitted in studies run on our platform. In the last 24 hours, we have worked to determine whether there is a growing problem of multiple submissions from the same geolocation. In reviewing over 100,000 studies that have been launched on TurkPrime, we see that the rate of submissions from duplicate geolocations typically bounced from less than 1% to 2.5% within a study—a number that could be explained by people submitting surveys from the same building, office, internet service provider, or even the same city. Geolocations are not precise, an issue we will discuss in more detail in a future blog post.

Based on this analysis, we set 2.5% as the threshold for detecting suspicious activity. Over 97% of studies have not reached this threshold, showing that the overwhelming majority of studies have not been affected by data coming from the same geolocation.

However, when we look at the rate of duplicate submissions based on geolocation over time, we see that in March of this year the percentage of duplicate submissions began edging up. Clearly, this is a problem, but a problem that has emerged only recently.

What TurkPrime is Doing
At TurkPrime, we are developing tools that will help researchers combat suspicious activity. We have identified that all suspicious activity is coming from a relatively small number of sources. We have additionally confirmed that blocking those sources completely eliminates the problem. In fact, once the suspicious locations were removed, we saw that the number of duplicate submissions had actually dropped over the summer to a rate of just 1.7% in July 2018.

What you can do to eliminate the problem
In the coming days, we will launch a Free feature that allows researchers to block workers from suspicious geolocations, excluding submissions from those locations from their data collection. We will also launch a Pro feature that allows researchers to block multiple submissions from the same geolocation within a study. This feature will cast a wider net and may block well-intentioned workers using the same internet service provider, or working in the same library. This tool will give researchers greater confidence that they are not receiving submissions from anyone using the same location to submit junk responses.

Our data, and the work of multiple researchers, show there has been a recent increase in the number of low quality responses submitted on Mechanical Turk. Data from the TurkPrime database show that the vast majority of all studies, and the vast majority of recent studies, have never been affected by the current concern of bots. What we still don’t know about the recent issue is whether these responses are coming from bots or foreign workers using a VPN to disguise their location and submit surveys intended to sample US workers. Either way, in the coming days TurkPrime will release tools that allow researchers to block workers from suspicious locations and to decide how narrowly they would like to set the exclusion criteria. Concerns about bots and low quality data on MTurk are not new. But at TurkPrime we will continue to look for ways to ensure quality data and to make conducting online research easier for researchers.