This is a great article from Wired on what looks like a fun HIT: Turkers re-enact Olympic sports. In that way, it puts the human into the mechanical.
“During the first run of the Mechanical Olympics in 2008, Burrough (and that would be Xtine Burrough, an artist and associate professor at the University of Texas at Dallas) didn’t necessarily see the project as a commentary on Amazon Mechanical Turk. But as the games have grown, she says, she’s come to see them as a way to bring attention to a traditionally invisible workforce. “I’m trying to intervene in a system that perpetuates a really steadfast routine—‘do these HITs as fast as you can and try to make more than 80 cents’—to offer an alternative,” she says. “It’s [called] ‘Mechanical’ Turk, and it feels mechanical. There’s a void there. So I’m trying to breathe some kind of human spirit into this mechanical void and celebrate the human body that [is] the workers. The workers are real people.””
I thought the answer to this question would be no. But apparently, according to this article, the answer is yes.
“The current study’s MTurk-recruited sample was similar to that of studies that recruited through in-person interviews in many domains, including demographics, causes of injury, history of psychological comorbidities, and presence of current psychological symptoms. This study also indicates that different methods of assessing mTBI via retrospective surveys can yield differing sets of individuals being identified as having a history of mTBI. Only 79% of individuals who reported a history of mTBI on the screener actually met criteria when using a stricter definition, whereas an additional 13% of participants who did not report a history of mTBI on the screener actually met criteria. When recruiting from this population via retrospective self-report, assessment of postinjury symptoms is imperative because of the general public’s misperceptions about what constitutes a mTBI. As with previous work examining other clinical populations, the current study suggests that crowdsourcing technologies such as MTurk may be used to recruit individuals with a history of mTBI. Given the cost of these methods and greater generalizability compared with some other study designs, future studies could use this technology to explore a number of different research questions relevant to mTBI outcomes.”
And here’s how the study checked for cheaters:
“Although the current study suggests that MTurk may be useful in the recruitment of workers with a history of mTBI, it is not without its limitations. Some workers may have endorsed mTBI on the screener despite not having a history of mTBI to be included in the study and receive compensation. However, several efforts were made to avoid the recruitment of such individuals. Only workers who had a high satisfaction rating (ie, had satisfactorily completed at least 95% of all surveys as rated by the individuals who created the surveys and collected results) were able to take the screener. The screener itself included a range of conditions to prevent workers from identifying the condition of interest.”
John Bernstein, Matthew Calamia, Characteristics of a Mild Traumatic Brain Injury Sample Recruited Using Amazon’s Mechanical Turk, PM&R, Volume 10, Issue 1, 2018, Pages 45-55.
Sayeth some psychology professors in Canada.
“Wilson and Ruffle found similar patterns, overall, for men and women: generally speaking men and women with visible tattoos were more impulsive than those without.
Where they differ is over easily-hidden tattoos. Men with easily-hidden tattoos were more impulsive than those without but the same wasn’t found for their female counterparts.
“Women with readily hidden tattoos were no more short-sighted or impulsive than women with no tattoos at all,” said the release.”
More details when the article gets published.
That is what a new study does. And surprise, convenience samples are convenient but flawed:
“To summarize our findings on demographics, the convenience samples are mostly biased in a similar fashion; differences from national probability samples are primarily differences of degree rather than direction. The overrepresentation of older age groups in our U.S. Facebook sample is a surprising exception. In the U.S., Qualtrics came closest to a national probability sample on a majority of variables. In India, our convenience samples look less like the national population and a lot more like an elite class of tech-savvy workers. While the three Indian sources deviate from population parameters in similar ways, MTurk was often the closest. However, Qualtrics and especially Facebook achieve better geographical coverage in India. We find no such geographic skew in the U.S.”
This study is itself a bit flawed, since Qualtrics isn’t a sample source like MTurk or FB: Qualtrics buys panels from other places (rumor has it that MTurk is one of those places). But there’s some interesting information about political differences, as well as a look at India.
TC Boas, DP Christenson, DM Glick (2018). “Recruiting Large Online Samples in the United States and India: Facebook, Mechanical Turk and Qualtrics.”
This new study looked at imposters on MTurk. In two different studies, a bunch of respondents said they met the requirements when they didn’t; in fact, in one study it was more than half. At the same time, being deceptive about eligibility doesn’t hurt attention. The problem is basically this: if you want to study women and you have men respondents saying they are women, well, you can see how that works.
There are a few ways around it. You can pay more to get ‘prescreened’ people, where Turkers have provided demographic information to Amazon that wasn’t connected to a specific study. You can also avoid signaling who the study is for: don’t describe your study as ‘women and cat videos’ but instead describe it as ‘perceptions of videos’ and use gender as a screening question. It provides less detail to potential participants, but that’s what happens when people want to answer surveys about cat videos.
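Here is a minimal sketch of that second approach, in Python. Everything in it is hypothetical and for illustration only: the question names, the `ScreenerResponse` class and the `is_eligible` helper are my own inventions, not anything from the study or from MTurk. The idea is simply that the public title stays neutral, gender is one screener question among several, and the eligibility rule lives only on the researcher’s side.

```python
from dataclasses import dataclass, field

# Neutral, public-facing title: it does not reveal who the study is for.
STUDY_TITLE = "Perceptions of videos"

# Hypothetical screener: the question of interest (gender) is mixed in
# with others so it does not stand out to potential participants.
SCREENER_QUESTIONS = ["age_group", "gender", "pet_owner", "hours_online_per_day"]

# The eligibility rule is kept private and never shown to workers.
TARGET = ("gender", "woman")


@dataclass
class ScreenerResponse:
    worker_id: str
    answers: dict = field(default_factory=dict)  # question name -> answer text


def is_eligible(response, target=TARGET):
    """Return True if the worker's answer to the hidden target question matches."""
    question, wanted = target
    answer = response.answers.get(question, "")
    return answer.strip().lower() == wanted
```

For example, `is_eligible(ScreenerResponse("W1", {"gender": "Woman"}))` returns `True`, while a response that leaves the question out is treated as ineligible.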
Kan, I. P., & Drummey, A. B. (2018). Do imposters threaten data quality? An examination of worker misrepresentation and downstream consequences in Amazon’s Mechanical Turk Workforce. Computers in Human Behavior.
This new article from the International Journal of Market Research presents three experiments about trap questions.
Experiment 1 used Survey Monkey and found that respondents were more likely to fail one trap question than both trap questions (when two were presented). Response quality did not seem to differ based on whether trap questions were missed or not.
Experiment 2 also used Survey Monkey and found that more people failed a ‘difficult’ trap question than an easy one. Difficulty was manipulated based on the length of the instruction and number of options to ‘pass’ the trap (one for easy, two for hard). The experiment found higher failure levels for the difficult one than the easy one. And there’s this: “data quality measured by an easy trap question at the end of the survey was optimal, in that it was the most consistent with other quality measures that are commonly used. However, given that respondents who failed and passed the trap question provided similar responses to the political and behavioral questions, these results indicate no benefit from excluding the respondents who failed the trap question. In other words, there seems to be no differential response to trap questions—those respondents who pass and fail the trap questions seem to provide the same answers to the key political and behavioral questions (emphasis mine).”
The third experiment also used Survey Monkey and looked at a bunch of different formats for questions. There are a lot of findings with this one, but the interesting one is that an ‘announcement’ of the presence of a trap did not affect whether people passed the trap or not.
I wonder if this would be the same on MTurk? From what I can tell, Survey Monkey’s ‘panel’ is similar to Qualtrics in that panelists don’t get paid like Turkers do: instead of $ direct from the Requester, they get ‘points’ toward gift certificates or entries in a drawing. So the reward is less immediate and more delayed, which might explain why they aren’t that engaged with trap questions. On MTurk, missing a trap question could mean you don’t get paid (although this study might give Requesters a rationale to go ahead and pay someone who missed a trap question).
Citation: Liu, Mingnan, and Laura Wronski. “Trap questions in online surveys: Results from three web survey experiments.” International Journal of Market Research 60, no. 1 (2018): 32-49.
I review a fair number of studies these days (I’m a Deputy Editor of a journal and on the editorial board of several others) and I see my fair share of studies that use MTurk. I find that many authors do not adequately justify the choice of MTurk. This doesn’t have to be long and drawn out, but here are the things I want (as a reviewer) to know:
- What qualifications did you put on the sample? Geographic location, HITs completed and HITs approved are the basics. And I kind of don’t care what the latter two are, just that you know that they exist and that they are important. Geographic location IS important: if you have a mix of people from the US and India in the sample, then generalizability may be compromised.
- What screening questions, if any, were used, and how many people were screened out? There are clearly lots of bad screening questions: if you ask, for example, whether respondents have logged on to any social network site in the past month, my guess is that most people will pick ‘yes’ even if they haven’t (I have nothing to back this up though). Basically: your screening question should try not to signal what the study is about.
- What did you pay respondents? If you didn’t pay them enough, I will mention this in the comments to the authors. If you did pay them enough, I will congratulate you and encourage you to put this front and center in your manuscript to model good researcher behavior.
- Did you use any attention checks? Yes, they can be problematic but not using them is even more of a problem, especially if you didn’t check for satisficing behaviors.
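For authors who post HITs through the API rather than the web interface, the basic qualifications above might look something like the sketch below. Treat it as an illustration under assumptions, not a recipe: the three system qualification IDs are written from my recollection of Amazon’s documentation and should be checked against it, I am reading ‘HITs approved’ as both an approved-count and an approval-rate requirement, and the thresholds (100 HITs, 95 per cent) are placeholders rather than recommendations. The `basic_qualifications` helper is a hypothetical name of my own.

```python
# System qualification type IDs as recalled from the MTurk documentation
# (assumption: verify these against the current docs before use).
LOCALE_QUAL_ID = "00000000000000000071"            # worker locale
HITS_APPROVED_QUAL_ID = "00000000000000000040"     # number of HITs approved
PERCENT_APPROVED_QUAL_ID = "000000000000000000L0"  # percent of assignments approved


def basic_qualifications(country="US", min_hits_approved=100, min_percent_approved=95):
    """Build a list of qualification requirements for a HIT.

    The default thresholds are illustrative placeholders only.
    """
    return [
        {
            "QualificationTypeId": LOCALE_QUAL_ID,
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": country}],
        },
        {
            "QualificationTypeId": HITS_APPROVED_QUAL_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_hits_approved],
        },
        {
            "QualificationTypeId": PERCENT_APPROVED_QUAL_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_percent_approved],
        },
    ]
```

The returned list is in the shape that boto3’s MTurk client accepts as the `QualificationRequirements` argument to `create_hit`. Reporting these three values in the manuscript answers the first question in one sentence.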
These are my thoughts; as a reviewer, you may have other ones. Please comment if you do!
I’m not a proponent of using Masters workers for MTurk studies: they’re more expensive and the process by which Masters workers are selected is not at all transparent. Regardless, the new study about Masters workers is sort of interesting. The study is of 40 Masters workers (small samples=red flag) who earned a little less than $800 a month on average on MTurk. Most had other jobs.
There aren’t a lot of surprises here. One of them, though, is the lack of attention paid to instructions:
“Fifty-five per cent of USMs (Masters workers in the US) believed MTurkers read directions somewhat carefully or not at all carefully, while the remaining 47.5 per cent thought MTurkers did read directions carefully. In contrast, 80 per cent of IMs (Masters workers in India) felt MTurkers did read directions carefully. When MTMs were asked how carefully MTurkers read each individual question in a given survey, 95 per cent of IMs believed MTurkers read each question carefully while only 50 per cent of USMs felt the same way.”
Masters workers consider themselves part of the Amazon work force, not workers for hire. The article states “…just because we are working from a computer doesnt mean we dont deserve a living wage” (USM) and “Workers will generally give you good quality work if they are treated like real human beings” (USM). Again, these findings address RQ1, in that MTMs see themselves as employees who expect a fair wage for the quality of their work. ”
The piece ends with ‘what Masters Workers want,’ but this is in reality what all Workers want: fair pay, clear guidelines for task eligibility, and trustworthy/honest requesters.
Citation: Lovett, Matt, Saleh Bajaba, Myra Lovett, and Marcia J. Simmering. “Data Quality from Crowdsourced Surveys: A Mixed Method Inquiry into Perceptions of Amazon’s Mechanical Turk Masters.” Applied Psychology.
That’s the news from this new study, which falls into the category of “HITs I Wish I Did.”
From the abstract: “We asked 239 participants to rate 78 behaviors on the properties of intentionality, surprisingness, and desirability. While establishing a pool of robust stimulus behaviors (whose properties are judged similarly for human and robot), we detected several behaviors that elicited markedly discrepant judgments for humans and robots. Such discrepancies may result from norms and stereotypes people apply to humans but not robots, and they may present challenges for human-robot interactions.”
I haven’t really thought about stereotyping of computers, because in my line of work (consumer behavior) we’re more worried about the harm from stereotypes than identifying whether they exist or not. But apparently, we do stereotype robots! I hope we don’t hurt their feelings. Here’s an example:
“The first such behavior was the following (IN-S-16): “A security [officer | robot] is walking on the sidewalk. When [she | it] sees a fleeing pick-pocket, [she | it] steps in front of him and grabs the man’s arm.” When performed by a robot (compared to a human), this behavior was evaluated as clearly intentional (rather than moderately unintentional), middling in surprisingness (rather than clearly surprising), and clearly desirable (rather than moderately undesirable).”
I do have to wonder if there is some confound with calling the officer a ‘she’ instead of ‘the officer’ but maybe I’m overthinking.
de Graaf, Maartje MA, and Bertram F. Malle. “People’s Judgments of Human and Robot Behaviors.” (2018).
According to this article, yes. Three researchers describe a class assignment using a pretend form of Mechanical Turk; students have to work as Workers and Requesters. Apparently, students said they understood more about crowdsourcing after they did the exercise.
I appreciate that this type of exercise is being taught, but it also seems that one needs to examine the ethics of crowdsourcing to really have a valuable understanding. The study also did not examine payments. Yet, it is an interesting way to introduce students to concepts of crowdsourcing in a protected environment.
Guo, Hui, Nirav Ajmeri, and Munindar Singh. “Teaching Crowdsourcing: An Experience Report.” IEEE Internet Computing (2018).