Wow. This post by Tim Ryan says the bot problem is far, far worse than anticipated. Read it. How can academics use our Requester leverage to make AMT clean up its act? I may be running all my future studies at Prolific Academic.
With this excellent post from the Tips for Requesters blog:
1. Use a 99% approval percentage and above. Pay a fair wage and you will get enough workers to take the study.
TurkPrime just published a blog post about the bot problem, and it is interesting. First, they question whether these weird responses are from bots or from foreign workers using VPNs. They acknowledge, though, that some people are seeing wonky data.
And they’ve taken some steps:
“At TurkPrime, we are developing tools that will help researchers combat suspicious activity. We have identified that all suspicious activity is coming from a relatively small number of sources. We have additionally confirmed that blocking those sources completely eliminates the problem. In fact, once the suspicious locations were removed, we saw that the number of duplicate submissions had actually dropped over the summer to a rate of just 1.7% in July 2018.”
Mechanical Turk needs to do this as well. And they need to do it right away, lest they lose all credibility.
The people at Prolific Academic have made a blog post (click on the link) about the bot issue with MTurk. I’m really flummoxed by this–I talked with one of my research colleagues today and he definitely found evidence of bots (when given an open-ended question about binge watching, he got answers that did things like repeat his name or other information from the informed consent form).
I checked two studies I’ve worked on: one with the author mentioned above, and another I did on my own. For the one I did on my own, I paid to restrict the sample to women, and I saw no evidence there. For the study we did together, I found two ‘matching’ GPS coordinates, but since Qualtrics can only narrow location down to a city, these might be OK. I just checked a third study, and while I do find the ‘bad’ number (88639831) in the latitude field, those responses have different longitudes. I do have several pairs of repeats, though, with different IP addresses. Hmm. Unfortunately I just submitted a paper based on that data, so I guess I’ll go back and look at it a bit more closely.
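The duplicate-coordinate check boils down to counting repeated latitude/longitude pairs in the survey export. Here’s a minimal Python sketch of that idea; the column names (LocationLatitude, LocationLongitude, IPAddress) follow a typical Qualtrics export, and the rows here are made-up illustrations, so adapt it to your own file:

```python
from collections import Counter

# Hypothetical rows, as you'd get from reading a Qualtrics CSV export,
# keeping only the columns relevant to the duplicate-GPS check.
rows = [
    {"IPAddress": "203.0.113.5",  "LocationLatitude": "41.8863983", "LocationLongitude": "-87.6298"},
    {"IPAddress": "198.51.100.7", "LocationLatitude": "41.8863983", "LocationLongitude": "-87.6298"},
    {"IPAddress": "192.0.2.44",   "LocationLatitude": "40.7128",    "LocationLongitude": "-74.0060"},
]

# Count exact (latitude, longitude) pairs. Repeated pairs are worth a
# closer look, though Qualtrics geolocation is only city-level, so some
# collisions are expected in large cities.
coords = Counter((r["LocationLatitude"], r["LocationLongitude"]) for r in rows)
suspects = {pair: n for pair, n in coords.items() if n > 1}
print(suspects)  # → {('41.8863983', '-87.6298'): 2}
```

This only flags exact duplicates; it won’t catch the pattern above where the latitudes repeat but the longitudes differ, and a repeated pair is a reason to look closer, not proof of a bot.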
Prolific on their blog has a great list of things that they do:
- Every account needs a unique non-VoIP phone number for verification.
- Prolific restricts signups based on IP and ISP (e.g., they allow common residential ISPs but block low-trust IPs/ISPs).
- Prolific limits the number of accounts that can use the same IP address and machine, to prevent duplicate accounts.
- Prolific limits the number of unique IPs per “HIT” (study).
- PayPal and Circle accounts used for getting paid must be unique to a participant account. This means that in order to have two participant accounts that get paid, you would also need two PayPal accounts. PayPal and Circle also have their own steps to prevent duplicate accounts.
- They take data quality reports very seriously: whenever researchers have suspicions about accounts, they can report the relevant participant IDs, and Prolific investigates the individual accounts as well as any shared patterns between them.
- They analyse their internal data to monitor for unusual usage patterns, along with data quality reports from researchers.
This bot thing is worrisome. What I would suggest is to set higher standards (maybe 10,000 completed HITs? Although certainly a bot could do those) or, if you can at all afford it, add a premium qualification that you pay a fee for (such as gender, income, or the like). They are listed here. The cheapest ones are that the worker has an account on Twitter, Reddit, YouTube, or Facebook: five cents each. Voted in the US presidential election in 2012 or 2016: ten cents (I like that one).
I’ve read a few complaints about low data quality on MTurk recently; read about it at this Facebook page. Researchers at Minnesota are worried that people are creating bots to do surveys. This is against AMT’s terms of service, and frankly I don’t really know how this works, but one way to check is to look at the GPS data of respondents and see if there is duplicate information. That might suggest a bot is at work. I just checked my most recent data and I’m free of bots, and I’ll check some other stuff too.
There’s a survey where you can report your experiences.
And here’s more information: from the blog of Hui Bai.
According to this report, low-paid workers on MTurk experienced cognitive dissonance from the low payment, and this led them to overstate the importance of the study, since doing so is associated with more enjoyment and less tension in their minds. At the same time, though, people may answer in a way that pleases the researcher instead of giving their authentic answers.
Per the report, it is forthcoming in Computers in Human Behavior.
There were two experiments. In the first, 145 people were recruited, with half paid $1.50 and the other half $0.50 for a 15-minute experiment. In the second, 149 participants were recruited, with half paid $3.00 and the other half $0.25.
I’m sure there are valuable learnings about pay from this study, and I truly hope the researchers bonused the people who got shafted on the pay scale.
I am guessing that the answer, found in this new study, is no. They studied 1,000 workers (including a bunch of Masters workers) and compared them to a national health study.
“Adjusting for covariates, MTurk users were less likely to be vaccinated for influenza, to smoke, to have asthma, to self-report being in excellent or very good health, to exercise, and have health insurance but over twice as likely to screen positive for depression relative to a national sample. Results were fairly consistent among different age groups.”
Hmmm. I think the Masters threw them off. (Kidding). But really, why so many Masters workers, when we don’t know how someone attains this status?
Citation: Walters K, Christakis DA, Wright DR (2018) Are Mechanical Turk worker samples representative of health status and health behaviors in the U.S.? PLoS ONE 13(6): e0198835. https://doi.org/10.1371/journal.pone.0198835
According to this study, apparently not. The study looked at how paying four different hourly rates ($2, $4, $6 and $8) affected things like attention as well as answers.
“Looking at demographics and using measures of attention, engagement and evaluation of the candidates, we find no effects of pay rates upon subject recruitment or participation. We conclude by discussing implications and ethical standards of pay.”
They do find some indication that lower-paid workers do not do as well on some attention checks. They also note that they didn’t have problems getting people to do the study, although each condition was capped at 99 people. Two things are important to note:
“Our larger concern is for things that we were not able to measure, such as Turker experience. It is possible that more experienced Turkers may gravitate toward higher pay rates, or studies that they feel have a higher pay-to-effort ratio. This is, regrettably, something that we were not able to measure. However, since experimental samples do not tend to seek representative samples on Mechanical Turk, we feel that the risk of any demographic or background differences in who we recruit is that it could then lead to differences in behavior, either through attention to the study or in reaction to the various elements of the study. ”
They also clearly state: “Paying a fair wage for work done does still involve ethical standards.”
So advocates Alexandra Samuel in this excellent article that describes MTurk as part of the ‘golden age’ of research.
And good for her, she writes:
“The use of crowdsourced survey platforms is likely to increase in the years ahead, so now is the time to entrench research practices that ensure fair wages for online survey respondents. Peer-reviewed journals, academic publishers, and universities can all play a part in promoting ethical treatment of online respondents, simply by requiring full disclosure of payment rate and task time allocation as part of any study that uses a crowdsourced workforce. We already expect academic researchers to disclose their sample size; we should also expect them to disclose whether their respondents earned a dollar for a five-minute survey, or a quarter for a half-hour survey.”
This new study by Hulland and Miller compares MTurk to Google Surveys for survey research. Interestingly, the second author works for a commercial research company and the piece starts out by saying that the use of MTurk is fairly non-existent in the commercial world. Hmmm. Not sure about that. I’ve heard anecdotal evidence that large panel companies turn to the Turk when they can’t get enough responses in some categories. So I start out reading this document with a bit of skepticism.
The authors review the good parts of MTurk (calling Turkers ‘agreeable’ which makes me smile) and then move on to the bad parts. There’s non-representativeness, self-selection (a problem with panels as well), non-naivete, and participant misrepresentation (i.e., lying on answers). The authors suggest this is most problematic when screening for specific populations. That may be true, but that may be more on the researcher writing the screener than on the audience.
The authors then sing some praises of Google Surveys: respondents want to be there, response rates are high, and self-selection isn’t a problem since respondents read an article of their choice and are then asked to answer questions in exchange. The maximum number of questions one can ask is ten, by the way, and Google Surveys uses an algorithm to infer the demographics of the respondent.
The authors compare four samples (GS, MTurk, Burke research firm employees, and SSI), asking about mobile phone purchases. They conclude:
“For example, our results suggest that surveys about shopping behavior incidence rates should be placed neither with an Amazon audience nor with a convenience sample of relatively educated and affluent respondents (e.g., the Burke internal sample), whereas Google Surveys may prove adequate for providing reliable estimates of behavioral incidence. Yet use of MTurk may be completely suitable for studies regarding different types of attitudes or behaviors, or for research studying effect differences across experimental conditions. (Much of the existing work in Marketing making use of MTurk workers has been experimental.) ”
It’s an interesting study.
Citation: Hulland, J., & Miller, J. (2018). Journal of the Academy of Marketing Science. https://doi.org/10.1007/s11747-018-0587-4