Making sure ratings are accurate on MTurk

An Thanh Nguyen, Matthew Halpern, Byron C. Wallace, and Matthew Lease have published a paper titled “Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings” in the Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2016.

The problem studied in the paper was that of ‘objective’ ratings provided by Turkers–how do we know that Turkers aren’t messing with Requesters in these ratings?

“The key observation we make in our paper is that the worker data for these partially-subjective tasks, where worker labels are partially ordered (i.e. scores from one to five), are heteroscedastic in nature. Therefore we propose a probabilistic, heteroscedastic model where the means and variances of worker responses are modeled as functions of instance attributes. In other words, the variability of scores can itself vary across the different parameters. Consider the results as the font size of a logo is varied. We would expect that most workers would give the logos with the smallest and largest font sizes low scores. However, the range of scores for the middle range of fonts is going to be more varied.”

The proposed model can predict user ratings and also identify outliers. Now there’s a whole lot of statsy talk there, but I implicitly trust these authors: Matt Lease was one of the researchers who reported that Turkers weren’t really anonymous.
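To make the statsy talk a little more concrete, here is a toy Python sketch of the heteroscedastic idea from the quote. It is not the authors' model or their shared code. The font sizes, the simulated scores, the per-size estimates, and the z-score cutoff are all assumptions made up for illustration. The only point is that the spread of scores, not just the average, is allowed to depend on the font size.

```python
import numpy as np

FONT_SIZES = np.array([8, 12, 16, 20, 24])  # hypothetical logo font sizes (pt)

def simulate_scores(sizes, workers_per_size=500, seed=0):
    """Made-up data: mid-range sizes score higher AND workers disagree more."""
    rng = np.random.default_rng(seed)
    t = (sizes - sizes.min()) / (sizes.max() - sizes.min())  # rescale to 0..1
    bump = 4 * t * (1 - t)        # 0 at the extremes, 1 in the middle
    mean = 2.0 + 2.0 * bump       # assumed average score per size
    sd = 0.2 + 0.6 * bump         # assumed spread per size (the heteroscedastic part)
    scores = rng.normal(mean, sd, size=(workers_per_size, len(sizes)))
    return np.clip(scores, 1, 5)  # keep scores on the one-to-five scale

def fit_per_size(sizes, scores):
    """Estimate a separate mean and standard deviation for each font size."""
    return {int(s): (scores[:, j].mean(), scores[:, j].std(ddof=1))
            for j, s in enumerate(sizes)}

def is_outlier(score, size, fitted, z_cutoff=3.0):
    """Flag a score that is far from the mean relative to that size's own spread."""
    mean, sd = fitted[size]
    return abs(score - mean) / sd > z_cutoff

scores = simulate_scores(FONT_SIZES)
fitted = fit_per_size(FONT_SIZES, scores)
for size, (mean, sd) in fitted.items():
    print(f"{size:>2} pt: mean={mean:.2f}  sd={sd:.2f}")

# The same score is judged against a different spread at each font size.
print("score 3.0 at  8 pt flagged:", is_outlier(3.0, 8, fitted))
print("score 3.0 at 16 pt flagged:", is_outlier(3.0, 16, fitted))
```

In this toy data a score of 3 gets flagged for the smallest font, where workers agree closely, but not for the middle font, where scores are more spread out. A single fixed cutoff for all font sizes would miss that difference.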

The linked text above will take you to a blog post describing the study, which in turn links to the full text of the paper and to the shared code.
