Sequential dependence in online restaurant reviews
For this Featured Project, we're talking to Dave Vinson, a Ph.D. candidate in Cognitive and Information Sciences at the University of California, Merced. Some of his recent work explored whether a basic psychophysiological phenomenon -- sequential dependence -- might be predictive of higher-level behavior in more naturalistic settings. He and his coauthors decided to tap into the massive Yelp dataset to understand the real-world effects of sequential dependence. In this interview, he shared his thoughts on the challenges and rewards of working with big data as a cognitive scientist.
Star ratings on Yelp show sequential dependence: A reviewer’s current review rating is dependent on his or her prior experience.
We were interested to see whether echoes of a well-known psychophysical laboratory effect—that one’s previous judgments influence one’s current judgments—affect natural behavior such as Yelp reviews.

The Yelp dataset is free, available online, and neatly parsed into JSON files for anyone to use. Yelp continues to release more and more reviews (currently 2.2 million). I saw this as a free goldmine of data, and I use it to answer a variety of questions related to cognitive science.
The most valuable skill I have when working with big data is knowing how to write code that won’t overload my RAM. In this project, for instance, the files themselves are very large, and extracting the relevant pieces (star ratings and dates) essentially means reading the entire set of Yelp reviews. To do this efficiently, you can simply open a connection to the file and loop through it in chunks with base functions in R:
library(jsonlite)  # provides fromJSON()

yelp <- file("~/Dir/Yelpreviews.json", "r", blocking = FALSE)
chunks <- vector("list", 224)
for (i in 1:224) {  # some number*10000 larger than file size
  revs <- readLines(con = yelp, n = 10000)
  if (length(revs) == 0) break  # stop at end of file
  data <- matrix(ncol = 4, nrow = length(revs))
  for (k in seq_along(revs)) {
    x <- fromJSON(revs[k])
    data[k, ] <- c(x$user_id,
                   x$business_id,
                   as.numeric(as.POSIXct(x$date)),
                   x$stars)
  }
  chunks[[i]] <- data  # keep each chunk rather than overwriting it
}
close(yelp)
Once the data are extracted, the size is substantially more manageable. Afterward, you can build functions that lag star ratings relatively easily.
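As a minimal sketch of that lagging step (the data frame and column names here, user_id, date, and stars, are illustrative assumptions, not the project's actual code), you might pair each rating with the same reviewer's previous rating like this:

```r
# Hypothetical example: lag each reviewer's star ratings by one review.
# Column names (user_id, date, stars) are illustrative, not from the paper.
lag_ratings <- function(df) {
  df <- df[order(df$user_id, df$date), ]             # sort within reviewer by time
  df$prev_stars <- ave(df$stars, df$user_id,
                       FUN = function(s) c(NA, head(s, -1)))  # previous rating; NA for a reviewer's first
  df
}

toy <- data.frame(user_id = c("a", "a", "b", "a"),
                  date    = c(1, 2, 1, 3),
                  stars   = c(4, 5, 3, 2))
lag_ratings(toy)$prev_stars  # NA 4 5 NA (rows sorted by user, then date)
```

With prev_stars in hand, testing for sequential dependence reduces to comparing each rating against the one before it.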
In other projects I have had to develop new tools, such as the R package cmscu, to analyze more complex data (such as the language within a review itself) before being able to move forward. If you don’t know something, or are having problems with preprocessing data, talk to computer scientists and applied mathematicians. Interdisciplinary collaborations are invaluable and should be the standard in academia today.
I had been working with the Yelp dataset, but in more manageable chunks. I had no prior experience using or analyzing big data up to this point.
Since the data are freely available and reviewers are anonymously coded with unique identifiers, I had no problems.
I’ve come across two major concerns, one from scientists and the other from those in industry: (1) Some scientists express concerns that looking for laboratory effects in natural data is misguided since natural data is noisier and laboratory studies are intended to remove that noise. (2) Searching for echoes of cognition in large datasets typically does not result in strong machine learning predictions. While our effects are robust, sometimes they only account for a small amount of variability.
I find the first easier to address, as it highlights a standard tradeoff between observing behavior outright and watching how behavior changes with intervention. I think there are benefits to both. In this case, we use findings from laboratory interventions as a guide to what we expect to observe in natural data. The second is conceptually harder: sophisticated machine learning tools account for a lot of variance in natural behavior by throwing tons of variables at the problem. Large feature spaces are common, and small effects for any given feature are thus the norm. Selecting small feature spaces -- perhaps even single variables -- in a theoretically guided way may help us unpack the role of different features in a meaningful way. Even if the predictive power of one or two theoretically interesting features is small, analyzing theoretically motivated variables helps us understand not only those features themselves but also their participation in the broader feature set. While we might be showing only subtle echoes of cognitive effects, we account for variance using only a single variable, making our models significantly more efficient and specific to cognition at, of course, the cost of accuracy.
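The single-variable approach described above can be sketched with a deliberately tiny model: one lagged predictor and nothing else. (The data below are fabricated toy values, not Yelp data; this is an illustration of the modeling style, not the paper's analysis.)

```r
# Illustrative sketch: predict the current rating from the previous one alone.
# prev_stars would come from a lagging step; these values are invented.
ratings <- data.frame(prev_stars = c(5, 4, 4, 3, 5, 2, 1, 3),
                      stars      = c(4, 4, 5, 3, 5, 2, 2, 3))
fit <- lm(stars ~ prev_stars, data = ratings)
summary(fit)$r.squared  # variance explained by the single lagged predictor
```

A positive slope on prev_stars is the signature of sequential dependence; the R-squared of such a one-variable model is modest by machine learning standards, but every bit of it is attributable to a theoretically motivated feature.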
Not particularly, but perhaps generally. The theme of open science has facilitated the release of larger datasets, and the desire for discovery leads companies such as Yelp to release data for free, sometimes with incentives for scientists such as cash rewards (check out the Yelp Dataset Challenge).
The data are naturally noisier, so I look for subtle echoes of cognitive influences that leak out into natural behavior. This mindset allows me to use traditional theories developed in highly controlled laboratory settings as a guide toward understanding what we might expect to find in natural behavior.
Before you start playing with big data, think about its scale and what tools you’ll need to process it efficiently. This includes the most efficient way to answer your question: If you don’t need all the data, don’t use it! Sometimes you’ll be fine, but sometimes you’ll need to do a little extra reading before you know which tools to choose.
Vinson, D. W., Dale, R., & Jones, M. N. (2016). Decision contamination in the wild: Sequential dependencies in Yelp review ratings. In Papafragou, A., Grodner, D., Mirman, D., & Trueswell, J.C. (Eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.