Sequential dependence in online restaurant reviews

For this Featured Project, we're talking to Dave Vinson, a Ph.D. candidate in Cognitive and Information Sciences at the University of California, Merced.  Some of his recent work explored whether a basic psychophysiological phenomenon -- sequential dependence -- might be predictive of higher-level behavior in more naturalistic settings. He and his coauthors decided to tap into the massive Yelp dataset to understand the real-world effects of sequential dependence. In this interview, he shared his thoughts on the challenges and rewards of working with big data as a cognitive scientist.


What's your one-sentence summary of the project? 

Star ratings on Yelp show sequential dependence: a reviewer’s current review rating depends on his or her prior experience.

What overarching cognitive question were you trying to answer with this project? 

We were interested to see whether echoes of a well-known psychophysical laboratory effect—that one’s previous judgments influence one’s current judgments—affect natural behavior such as Yelp reviews.

How did you discover (or build) this dataset, and what made you decide to work with it to answer your question? 

The Yelp dataset is free, available online, and neatly parsed into JSON files for anyone to use. Yelp continues to release more and more reviews (currently 2.2 million). I saw this as a free goldmine of data, and I use it to answer a variety of questions related to cognitive science.

What skills were most valuable to you during the project? Do you have any suggestions for how others might acquire those skills? 

The most valuable skill I have when it comes to big data is knowing how to write code that won’t overload my RAM. In this project, for instance, the files themselves are very large, and extracting the relevant pieces (star ratings and dates) means reading through essentially the entire set of Yelp reviews. To do this efficiently, you can open a connection to the file and loop through it in chunks using base functions in R:

library(jsonlite)  # provides fromJSON()

yelp <- file("~/Dir/Yelpreviews.json", "r", blocking = FALSE)
chunks <- list()
for (i in 1:224) {  # 224 chunks of 10,000 lines covers the ~2.2M reviews
     revs <- readLines(con = yelp, n = 10000)
     if (length(revs) == 0) break  # stop once the file is exhausted
     data <- matrix(ncol = 4, nrow = length(revs))
     for (k in 1:length(revs)) {
          x <- fromJSON(revs[k])
          data[k, ] <- c(x$user_id,
                         x$business_id,
                         as.numeric(as.POSIXct(x$date)),
                         x$stars)
     }
     chunks[[i]] <- data  # keep each chunk rather than overwriting it
}
close(yelp)
reviews <- do.call(rbind, chunks)  # one row per review: user, business, time, stars

Once the data are extracted, their size is substantially more manageable. From there, you can build functions that lag star ratings relatively easily, as in the sketch below.
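For concreteness, here is a minimal sketch of that lagging step, assuming the reviews matrix assembled above; the data frame layout and the simple lm() check are illustrative assumptions, not the exact analysis from the paper:

df <- data.frame(user  = reviews[, 1],
                 time  = as.numeric(reviews[, 3]),
                 stars = as.numeric(reviews[, 4]))
df <- df[order(df$user, df$time), ]  # order each reviewer's reviews by date

# Previous star rating by the same reviewer; NA marks a reviewer's first review
df$prev_stars <- ave(df$stars, df$user, FUN = function(s) c(NA, head(s, -1)))

# A quick look at sequential dependence: does the previous rating predict the current one?
summary(lm(stars ~ prev_stars, data = df))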

In other projects I have had to develop new tools, such as the R package cmscu, to analyze more complex data (such as the language within a review itself) before being able to move forward. If you don’t know something, or are having problems with preprocessing data, talk to computer scientists and applied mathematicians. Interdisciplinary collaborations are invaluable and should be the standard in academia today.

What prior experience did you have working with big data before this project? 

I had been working with the Yelp dataset, but in more manageable chunks. Before this project, I had no experience using or analyzing big data at this scale.

How did ethical considerations for this study differ from laboratory studies? Did the IRB or ethics board have any new concerns? 

Since the data are freely available and reviewers are anonymously coded with unique identifiers, I had no problems.

What objections or obstacles did you have to overcome in the review process that were unique to working with big data? 

I’ve come across two major concerns, one from scientists and the other from those in industry: (1) Some scientists worry that looking for laboratory effects in natural data is misguided, since natural data are noisier and laboratory studies are designed to remove that noise. (2) Searching for echoes of cognition in large datasets typically does not yield strong machine learning predictions: while our effects are robust, they sometimes account for only a small amount of variability.

I find the first easier to address, as it highlights a standard tradeoff between observing behavior outright and watching how behavior changes with intervention. I think there are benefits to both. In this case, we use findings from laboratory interventions as guides to what we expect to observe in natural data. The second is conceptually harder: a lot of variance in natural behavior can be accounted for by sophisticated machine learning tools that throw tons of variables at the problem. Large feature spaces are common, and small effects for any given feature are thus the norm. Selecting small feature spaces -- perhaps even single variables -- in a theoretically guided way may help us unpack the role of different features in a meaningful way. Even if the predictive power of one or two theoretically interesting features is small, analyzing theoretically motivated variables helps us understand both those features themselves and their participation in the broader feature set. While we might be showing only subtle echoes of cognitive effects, we’re accounting for variance using only a single variable, making our models significantly more efficient and specific to cognition -- at the cost, of course, of some accuracy.

Did the recent movement toward open science and reproducibility play any role in planning or executing this project? If so, how? 

Not particularly, but perhaps generally. The open science movement has facilitated the release of larger datasets, and the desire for discovery leads companies such as Yelp to release data for free, sometimes with incentives for scientists such as cash rewards (check out the Yelp Dataset Challenge).

What did a big data perspective afford you for this project that a more traditional perspective might not have? 

The data are naturally noisier, so I look for subtle echoes of cognitive influences that leak out into natural behavior. This mindset allows me to use traditional theories developed in highly controlled laboratory settings as a guide to what we might expect to find in natural behavior.

Do you have any advice for those interested in using big data for cognitive science? 

Before you start playing with big data, think about its scale and what tools you’ll need to process the data efficiently. That includes finding the most efficient way to answer your question: if you don’t need all the data, don’t use it! Sometimes you’ll be fine, but sometimes you’ll need to do a little extra reading before you know which tools to choose.
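For example, you can gauge the scale of a dataset before loading any of it. A small base R sketch (the file path is a placeholder):

# Size on disk, in gigabytes
file.size("~/Dir/Yelpreviews.json") / 1024^3

# Count reviews (one JSON object per line) without reading the file into memory
n <- 0L
con <- file("~/Dir/Yelpreviews.json", "r")
while (length(chunk <- readLines(con, n = 10000)) > 0) n <- n + length(chunk)
close(con)
n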


Project publication: 

Vinson, D. W., Dale, R., & Jones, M. N. (2016). Decision contamination in the wild: Sequential dependencies in Yelp review ratings. In Papafragou, A., Grodner, D., Mirman, D., & Trueswell, J.C. (Eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

For more, contact: 
Dave Vinson
Cognitive and Information Sciences
University of California, Merced
dave@davevinson.com
Coauthors: 
Michael N. Jones (Department of Psychological and Brain Sciences, Indiana University)
Rick Dale (Cognitive and Information Sciences, University of California, Merced)