Workshop Topics and Schedule
This workshop brings together experts in skills that are essential for working with big data and naturally occurring datasets. Below you'll find our daily schedule. Click on a tutorial's name for a brief description of the tutorial and a bit of information about the instructor.
The Jupyter notebook is a widely used tool for performing reproducible, interactive data analysis. As a "digital lab notebook", Jupyter weaves together code, prose, images, and more, allowing users to iterate on analyses and view the results all within a single document. This tutorial will give a hands-on overview of some of the notebook's capabilities using Python, including how the notebook integrates with Python packages such as matplotlib, pandas, and statsmodels. Additionally, this tutorial will cover more advanced uses, including "widgets" for interactively exploring data, conversions to other formats such as HTML, using the notebook with R, and reproducibility.
Instructor: Jessica B. Hamrick (University of California, Berkeley)
Jessica Hamrick is a Ph.D. candidate in the Psychology department at the University of California, Berkeley working with Tom Griffiths. Previously, she received her M.Eng. in Computer Science from MIT working with Josh Tenenbaum, and did research for a summer at Google DeepMind. Jessica is a recipient of the NSF Graduate Fellowship, the Berkeley Fellowship, and the Outstanding Graduate Student Instructor Award. In addition to research, Jessica is a core contributor to the IPython/Jupyter notebook, and is a member of the Project Jupyter Steering Council. [website]
The availability of data on the web has opened up many resources for cognitive scientists who know how to deal with "medium data" - big enough to crash Excel but small enough to load into memory. R is a powerful tool for statistical data analysis and reproducible research, and the "tidyverse" - an ecosystem of R packages for manipulating, analyzing, and visualizing data - provides many tools for working with this kind of data quickly and easily. In this tutorial, I'll walk through how to go from a database or tabular data file to an interactive plot with surprisingly little pain (and less code than you'd imagine). My focus will be on introducing a workflow that uses a wide variety of tools and packages, including readr, dplyr, tidyr, and shiny. I'll assume basic familiarity with R and will use (but not spend too much time teaching) ggplot2.
Instructor: Michael C. Frank (Stanford University)
Michael C. Frank is Associate Professor of Psychology at Stanford University. He earned his BS from Stanford University in Symbolic Systems in 2005 and his PhD from MIT in Brain and Cognitive Sciences in 2010. He studies both adults' language use and children's language learning and how both of these interact with social cognition. His work uses behavioral experiments, computational tools, and novel measurement methods including large-scale web-based studies, eye-tracking, and head-mounted cameras. He is a recipient of the FABBS Early Career Impact award, his dissertation received the Glushko Prize from the Cognitive Science Society, and he has been recognized as a "rising star" by the Association for Psychological Science. [website]
This workshop will explore different ways to collect data from the web with Python. Have you ever needed to copy and paste hundreds (or thousands!) of tables on different web pages? Or click through combinations of dropdown menu selections and download files? Are you interested in collecting social media or news data? After this workshop, you'll be well on your way to automating these processes. We will first consider getting data from RESTful APIs. We will walk through using the documentation to build a query for a GET request. We'll then write our response to a CSV spreadsheet. We'll then discuss web scraping, keeping in mind the Terms of Service for websites and ensuring we are not in violation. We will look at two ways of scraping: 1) parsing the HTML response of a GET request using BeautifulSoup, and 2) utilizing the Selenium web driver to interact directly with dynamic web content.
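The scrape-then-save step described above can be sketched as follows. This is only an illustration, not the tutorial's own material: the tutorial uses BeautifulSoup, while this sketch swaps in the standard library's `html.parser` so that it runs without extra packages. The sample HTML, the `TableParser` class, and the `html_table_to_csv` helper are all hypothetical; in practice the HTML would be the body of a GET response.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built, or None outside <tr>
        self._cell = None  # text chunks of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content standing in for a downloaded HTML response.
SAMPLE_HTML = """
<table>
  <tr><th>word</th><th>count</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>5</td></tr>
</table>
"""


def html_table_to_csv(html):
    """Parse the table rows out of an HTML string and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

To write a file rather than a string, the same `csv.writer` call can be pointed at a file opened with `newline=""`.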
Instructor: Christopher Hench (University of California, Berkeley)
Christopher is a PhD Candidate in German Literature and Medieval Studies at UC Berkeley. He’s interested in computational approaches to formal analyses of lyric and epic poetry, and is currently working on soundscapes. He also teaches Python workshops at Berkeley’s D-Lab and is the Program Development Lead for Digital Humanities at Berkeley. [website][GitHub]
R is an open source programming language for statistical modeling and computation. The first goal of this talk will be to give a general introduction to statistical modeling in R, specifically in the context of linear models. Most psychological data sets incorporate repeated measures (i.e., multiple observations from the same unit of observation) or multiple independent sources of random variation (i.e., crossed random effects). Unfortunately, ordinary linear models are not suitable for such data as they assume independence of the data points conditional on the statistical model. The second goal of the talk is to introduce linear mixed models, a powerful model class that can handle all the aforementioned issues and is becoming more and more popular. This part of the talk will focus on the R packages lme4 and afex.
Instructor: Henrik Singmann (University of Zurich, Switzerland)
Henrik is a postdoc at the University of Zurich in the lab of Klaus Oberauer and interested in cognitive and statistical models for psychology and related disciplines. He is author of a number of R packages such as afex, rtdists, and bridgesampling. Most of his substantive work is on aspects of higher-level cognition such as reasoning or memory. [website]
People love to talk on the web. How can we listen to what they're telling us, and why would we want to? This workshop will discuss some methods of collecting social media data to construct larger (and in some cases, more naturalistic) datasets than laboratory-based experiments yield. We'll cover methods for building datasets through Python-accessible Twitter APIs, and structuring both the search query and the experimental question to obtain data that is appropriate in both content and amount. We'll also discuss connections between data and metadata, with a focus on geolocation, as well as ways to collect online conversations and interactions.
Instructor: Gabriel Doyle (Stanford University)
Gabe Doyle is a postdoc in Psychology at Stanford, and a soon-to-be Assistant Professor in Linguistics and Digital Humanities at San Diego State. He's interested in better understanding the psycholinguistics, sociolinguistics, and pragmatics of conversation. To figure that out, he combines computational models of communication with emerging big data sources, like Twitter and e-mail databases. Some cool results of this work include showing that Twitter can be used to map American dialectal syntax, that the amount of information provided in a tweet trades off with how exciting and unexpected the event it discusses is, and that employees who eventually stay at a company use “we” differently in their work emails than those who eventually leave.
The analysis of real-world data often involves acquiring, cleaning, and merging heterogeneous datasets from disparate sources. Often, our behavioral questions require combining datasets that differ in format, timescale, level of geographic granularity, and even dimensionality (e.g., spatial versus temporal data). Using a series of worked examples from my ongoing research examining the risk-taking behavior of New York City residents, this tutorial outlines common challenges and approaches for dealing with these heterogeneous datasets, and explores some public data sources that can be critically useful for elucidating real-world behavioral questions. This tutorial will assume some familiarity with Python, ‘pandas’ data structures, and basic web queries.
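As a rough illustration of the timescale problem, the sketch below aggregates event-level records to daily totals and then joins them to a second table that is already daily. It is not taken from the tutorial: the records, field names, and helper functions are invented for the example, and plain Python dictionaries stand in for the pandas data structures the tutorial assumes.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical inputs: timestamped events and a table keyed by calendar day.
events = [
    {"timestamp": "2017-03-01 09:15", "amount": 4.0},
    {"timestamp": "2017-03-01 18:40", "amount": 6.0},
    {"timestamp": "2017-03-02 11:05", "amount": 5.0},
]
daily_weather = {
    "2017-03-01": {"sunny": True},
    "2017-03-02": {"sunny": False},
}


def aggregate_by_day(records):
    """Sum the 'amount' field of timestamped records within each calendar day."""
    totals = defaultdict(float)
    for record in records:
        stamp = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")
        totals[stamp.date().isoformat()] += record["amount"]
    return dict(totals)


def merge_daily(totals, daily):
    """Inner join on the date key: keep only days present in both sources."""
    return [
        {"date": day, "total_amount": totals[day], **daily[day]}
        for day in sorted(totals)
        if day in daily
    ]


merged = merge_daily(aggregate_by_day(events), daily_weather)
for row in merged:
    print(row)
```

Bringing both sources to the coarser timescale before joining avoids duplicating the daily values across every event, at the cost of discarding within-day detail.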
Instructor: Ross Otto (McGill University)
Ross Otto is an Assistant Professor of Psychology at McGill University. He obtained his BS in Cognitive Science in 2005 from UCLA, and his PhD in Psychology from UT Austin in 2012. He completed postdoctoral work at NYU’s Center for Neural Science prior to beginning his position at McGill. Otto’s work relies on a combination of computational, behavioral, psychophysiological, and (more recently) “big data” techniques to understand how people make decisions both in the laboratory and in the real world. His lab’s work is supported by the Natural Sciences and Engineering Research Council (NSERC) and the Fonds de recherche du Québec – Nature et technologies (FRQ-NT). [website]
Massive natural language datasets are now widely available for public use. Given the size of these datasets, even the simplest language models, such as n-gram analyses, require considerable computational power. These computational requirements impose soft limits on access to these rich datasets: although they are free to use, in practice they are available only to those trained in computational efficiency. To help bridge computational efficiency with behavioral research agendas, my colleagues and I developed the R package cmscu, a replacement for the standard DocumentTermMatrix function in R’s tm package. I will show how cmscu can be used to implement some of the most sophisticated n-gram algorithms.
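For readers unfamiliar with n-gram analyses, a minimal sketch of exact n-gram counting is given below. It is written in Python rather than R and is not how cmscu works internally; the function names and sample sentence are made up for illustration. It stores one counter entry per distinct n-gram, which is the memory cost that becomes prohibitive on massive corpora.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, n=2):
    """Count n-grams in whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Hypothetical toy corpus; a real corpus would contain millions of tokens.
sample = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(sample, n=2)

print(bigrams[("the", "cat")])  # occurrences of the bigram "the cat"
print(len(bigrams))             # number of distinct bigrams held in memory
```

Even in this ten-word example, eight of the nine bigrams are distinct, so the table grows almost as fast as the text itself.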
Instructor: David W. Vinson (University of California, Merced)
I am a PhD candidate in the Cognitive and Information Sciences group at UC Merced (Aug., 2017). I am interested in information: what it is and how it is transferred. I use theories and models from cognitive and social sciences to inform computational models used to predict naturally occurring behavior. This involves the analysis of large datasets, specifically Amazon and Yelp reviews. I am currently an intern at IBM.
Instructor: Todd Gureckis (New York University)
Todd Gureckis is an Associate Professor of Psychology at New York University and an affiliate of the NYU Center for Data Science. His research interests focus on how people actively explore their environment when learning. He hopes that the study of human psychology can influence future methods in machine learning and artificial intelligence. On nights and weekends he moonlights as the program lead for the psiTurk project (http://psiturk.org), an open-source framework for conducting scientific experiments on Amazon Mechanical Turk. [website]
As experimentation in the behavioral and social sciences moves from brick-and-mortar laboratories to the web, new opportunities arise in the design of experiments. By taking advantage of the new medium, experimenters can write complex computationally mediated adaptive procedures for gathering data: algorithms. This tutorial introduces participants to Dallinger, a software platform for algorithmic experimentation that uses crowdsourcing to automate the full pipeline of experimentation, from participant recruitment through data handling. Participants will get hands-on experience with the platform through examples of psychological methods such as the transmission chain and minimal group paradigm. The session will include discussion of machine learning techniques, such as active learning, that can be applied to psychological experimentation through Dallinger.
Instructor: Jordan Suchow (University of California, Berkeley)
Jordan W. Suchow is a postdoctoral fellow at the University of California, Berkeley. He received a B.S. in computer science from Brandeis University and a PhD in psychology from Harvard University. Suchow’s research focuses on the computational underpinnings of vision, learning, and memory, and the development of next-generation technologies for automating behavioral and social science research. His work is supported by the Defense Advanced Research Projects Agency (DARPA).