Data cleaning & visualization practice - Part 2

Overview

This file walks you through visualizing the results of an acceptability judgment experiment. Some of the code is the same as in Part 1 of the tutorial, so please refer to that file if you get stuck.

Install and import required packages

library(tidyverse)

Import and clean the data

data = read.csv('./fakedata_2.csv')

Let’s take a look at the dataset:

glimpse(data)

## Rows: 512
## Columns: 8
## $ Movement    <chr> "WH", "WH", "WH", "WH", "WH", "WH", "WH", "WH", "WH", "WH"…
## $ Island_Type <chr> "whe", "whe", "whe", "whe", "whe", "whe", "whe", "whe", "w…
## $ Island      <chr> "non", "non", "non", "non", "non", "non", "non", "non", "i…
## $ Distance    <chr> "sh", "sh", "sh", "sh", "sh", "sh", "sh", "sh", "sh", "sh"…
## $ Item        <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
## $ Sentence    <chr> "Who thinks that Paul stole the necklace?", "Who thinks th…
## $ Subj_id     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Score       <int> 6, 2, 3, 7, 2, 4, 7, 2, 5, 2, 3, 5, 5, 2, 4, 4, 2, 3, 2, 7…

There are 4 conditions in this experiment, formed by crossing Island (“isl” vs. “non”) with Distance (“lg” = long vs. “sh” = short):

Island = “isl”, Distance = “lg”: A wh-word (“what”) moves across a clause introduced by “whether”, which acts as an island (a structure out of which movement is not possible) in English. This condition is predicted to have the lowest acceptability.
Island = “isl”, Distance = “sh”: A wh-word moves, but NOT across the island introduced by “whether”.
Island = “non”, Distance = “lg”: A wh-word moves across a clause introduced by “that”, which does NOT act as an island in English.
Island = “non”, Distance = “sh”: A wh-word moves, but NOT across the clause introduced by “that”. This condition is predicted to have the highest acceptability.
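
The four conditions are simply every combination of the two values of Island and the two values of Distance. As a small illustration (the object name `conditions` is made up for this sketch and is not part of the dataset), the design can be enumerated like this:

```r
# Enumerate the 2 x 2 design: every Island value paired with every Distance value
conditions <- expand.grid(
  Distance = c('lg', 'sh'),
  Island   = c('isl', 'non'),
  stringsAsFactors = FALSE
)

# One row per condition
print(conditions)
print(nrow(conditions))
```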

Review Q: Conduct z-score conversion on raw acceptability scores

#write your answers here

Answers

# z-score each rating within its subject, then drop the per-subject grouping
data = data %>% group_by(Subj_id) %>% mutate(Z_score = (Score - mean(Score)) / sd(Score)) %>% ungroup()
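
To see why the conversion is done within each subject, here is a minimal sketch on made-up ratings (the data frame `toy` and its values are hypothetical, not taken from fakedata_2.csv). Two subjects use different parts of the 1-7 scale, yet they end up with the same z-scores because each rating is compared to that subject’s own mean and standard deviation:

```r
library(dplyr)

# Hypothetical ratings: subject 1 uses the low end of the scale, subject 2 the high end
toy <- data.frame(
  Subj_id = c(1, 1, 1, 2, 2, 2),
  Score   = c(1, 2, 3, 5, 6, 7)
)

# Same recipe as in the answer: center and scale within each subject
toy_z <- toy %>%
  group_by(Subj_id) %>%
  mutate(Z_score = (Score - mean(Score)) / sd(Score)) %>%
  ungroup()

print(toy_z)
```

Both subjects come out with z-scores of -1, 0 and 1, even though their raw scores do not overlap.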

Make a summary table that contains the average acceptability (in z-score) for each of the 4 conditions

To make a summary, group_by and summarize functions come in handy. Inside the summarize function, calculate the mean acceptability for each group using mean() and name the column “Mean”.

data_summary = data %>% group_by(Island, Distance) %>% summarize(Mean = mean(Z_score))
data_summary

## # A tibble: 4 × 3
## # Groups:   Island [2]
##   Island Distance    Mean
##   <chr>  <chr>      <dbl>
## 1 isl    lg       -0.608 
## 2 isl    sh       -0.0708
## 3 non    lg        0.117 
## 4 non    sh        0.561
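
The “Groups: Island [2]” line in the output means that the summary table is still grouped by Island: by default, summarize drops only the last grouping variable. If an ungrouped table is preferred, one option is the `.groups` argument of summarize (available in dplyr 1.0.0 and later). The sketch below uses made-up numbers in a hypothetical data frame `toy`:

```r
library(dplyr)

# Hypothetical z-scores, two observations per condition
toy <- data.frame(
  Island   = c('isl', 'isl', 'isl', 'isl', 'non', 'non', 'non', 'non'),
  Distance = c('lg', 'lg', 'sh', 'sh', 'lg', 'lg', 'sh', 'sh'),
  Z_score  = c(-1.0, 0.0, -0.5, 0.5, 0.0, 1.0, 0.5, 1.5)
)

# .groups = 'drop' returns a plain, ungrouped tibble
toy_summary <- toy %>%
  group_by(Island, Distance) %>%
  summarize(Mean = mean(Z_score), .groups = 'drop')

print(toy_summary)
print(is_grouped_df(toy_summary))
```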

Exercise: Calculate the standard deviation for each condition

Create a similar summary table with the mean AND standard deviation for each condition.

#write your answers here

Answers

data_summary = data %>% group_by(Island, Distance) %>% summarize(Mean = mean(Z_score), SD = sd(Z_score))

Exercise: Add standard error of each condition to the summary table

Standard error is not the same as standard deviation (see https://towardsdatascience.com/standard-deviation-vs-standard-error-5210e3bc9c04): it estimates how precisely the mean is measured, and we’ll need it for each condition when we plot the acceptability scores with error bars.

The formula is pretty simple: standard deviation / square root of the number of subjects.

Write the code below to get a summary dataset consisting of the mean z-score, standard deviation, and standard error of each condition.

#write your answers here

Answers

data_summary = data %>% group_by(Island, Distance) %>% 
  summarize(Mean = mean(Z_score),
            SD = sd(Z_score),
            # n_distinct() counts the unique subject IDs in the full dataset
            SE = SD/sqrt(n_distinct(data$Subj_id)))
data_summary

## # A tibble: 4 × 5
## # Groups:   Island [2]
##   Island Distance    Mean    SD    SE
##   <chr>  <chr>      <dbl> <dbl> <dbl>
## 1 isl    lg       -0.608  0.684 0.171
## 2 isl    sh       -0.0708 0.658 0.164
## 3 non    lg        0.117  1.12  0.281
## 4 non    sh        0.561  1.02  0.254
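
As a quick sanity check on this table, recall that the answer code divides SD by the square root of the number of subjects, so SD / SE should equal that square root. The sketch below uses the rounded values printed in the first row; the subject count it recovers is an inference from those printed values, not something stated elsewhere in this file.

```r
# Rounded values copied from the first row of the summary table (isl, lg)
sd_isl_lg <- 0.684
se_isl_lg <- 0.171

# SD / SE = sqrt(number of subjects), so squaring the ratio recovers the count
ratio      <- sd_isl_lg / se_isl_lg
n_subjects <- round(ratio^2)

# Recompute the SE from the SD and the recovered subject count
recomputed_se <- sd_isl_lg / sqrt(n_subjects)

print(n_subjects)
print(recomputed_se)
```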

Plot the data

Let’s make a simple plot to see what happens.

data_summary %>% ggplot(aes(x=Distance, y=Mean))+
  geom_point()+
  geom_path(aes(group = Island))

The plot looks okay, but it’s not easy to tell what’s going on there. We will make a number of edits to the plot to make it look nicer.

Exercise: Make the first round of edits

Do the following to improve the plot. You should be able to find the hints in the previous tutorial.

Make the line for the non-island conditions a dotted line
Since the condition means all fall between -1 and 1, expand the upper and lower limits of the y-axis to -1 and 1.
Remove the grey background and grids
Advanced: reverse the positions of “lg” and “sh” conditions on the x-axis

#write your answers here

Answers

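data_summary %>% ggplot(aes(x=Distance, y=Mean))+
  geom_point()+
  # map linetype to Island directly: wrapping it in rev() would swap the
  # values across rows, so the legend would no longer match the lines.
  # With the default scale, "isl" is solid and "non" is dashed.
  geom_path(aes(group = Island, linetype = Island))+
  expand_limits(y = c(-1, 1))+
  theme_classic()+
  scale_x_discrete(limits = rev(levels(as.factor(data_summary$Distance))))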
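
The last line of the plot code is the least obvious one, so here is what the expression inside `limits` evaluates to. This sketch uses a stand-alone vector `distance` (a hypothetical stand-in for `data_summary$Distance`) so that it can be run on its own:

```r
# Stand-in for the Distance column of the summary table
distance <- c('lg', 'sh', 'lg', 'sh')

# as.factor() orders the levels alphabetically: "lg" first, then "sh"
default_order <- levels(as.factor(distance))

# rev() flips that order, which is what scale_x_discrete(limits = ...) receives
reversed_order <- rev(default_order)

print(default_order)
print(reversed_order)
```

Writing `limits = c('sh', 'lg')` by hand would have the same effect here; the `rev(levels(...))` form just avoids typing the labels.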

Improve axis labels and the legend, and add error bars

Right now, it’s not clear what the axis labels and the legend stand for (what’s “sh”? what’s “isl”?). The labels are kind of small so we want to fix that as well. In addition, it’s not clear how variable the data can be with those single points, which is why we might want to add error bars to the plot.

First, let’s relabel the x-axis and legend. To do so, we will make a separate dataset and combine it with a part of the summary dataset.

data_summary

## # A tibble: 4 × 5
## # Groups:   Island [2]
##   Island Distance    Mean    SD    SE
##   <chr>  <chr>      <dbl> <dbl> <dbl>
## 1 isl    lg       -0.608  0.684 0.171
## 2 isl    sh       -0.0708 0.658 0.164
## 3 non    lg        0.117  1.12  0.281
## 4 non    sh        0.561  1.02  0.254

columns = data.frame(Island=c('whether_island', 'whether_island', 'non-island', 'non-island'),
                     Distance=c('long','short','long','short'))
data_summary = cbind(columns, data_summary[,c(3:5)])
data_summary

##           Island Distance        Mean        SD        SE
## 1 whether_island     long -0.60802871 0.6842833 0.1710708
## 2 whether_island    short -0.07079371 0.6576115 0.1644029
## 3     non-island     long  0.11745032 1.1247666 0.2811916
## 4     non-island    short  0.56137210 1.0151507 0.2537877
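
Note that the cbind approach works only because the rows of `columns` are typed in the same order as the rows of the summary table. An alternative that does not depend on row order is a named lookup vector. This is a sketch with hypothetical names (`toy_summary`, `island_labels`, `distance_labels`) and the rounded means from the table, not the method used in this tutorial:

```r
# Stand-in for the summary table, with the original short labels
toy_summary <- data.frame(
  Island   = c('isl', 'isl', 'non', 'non'),
  Distance = c('lg', 'sh', 'lg', 'sh'),
  Mean     = c(-0.608, -0.0708, 0.117, 0.561),
  stringsAsFactors = FALSE
)

# Lookup vectors: names are the old labels, values are the new ones
island_labels   <- c(isl = 'whether_island', non = 'non-island')
distance_labels <- c(lg = 'long', sh = 'short')

# Index the lookup vectors by the old labels to get the new labels row by row
toy_summary$Island   <- unname(island_labels[toy_summary$Island])
toy_summary$Distance <- unname(distance_labels[toy_summary$Distance])

print(toy_summary)
```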

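data_summary %>% ggplot(aes(x=Distance, y=Mean))+
  geom_point()+
  # map linetype to Island directly (rev(Island) would mislabel the legend)
  # and set the linetypes by name so that the non-island line is dashed
  geom_path(aes(group = Island, linetype = Island))+
  scale_linetype_manual(values = c('non-island' = 'dashed', 'whether_island' = 'solid'))+
  expand_limits(y = c(-1, 1))+
  theme_classic()+
  scale_x_discrete(limits = rev(levels(as.factor(data_summary$Distance))))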

Second, let’s add an error bar to each data point. It’s pretty simple; we will use the geom_errorbar function.

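data_summary %>% ggplot(aes(x=Distance, y=Mean))+
  geom_point()+
  # map linetype to Island directly (rev(Island) would mislabel the legend)
  # and set the linetypes by name so that the non-island line is dashed
  geom_path(aes(group = Island, linetype = Island))+
  scale_linetype_manual(values = c('non-island' = 'dashed', 'whether_island' = 'solid'))+
  expand_limits(y = c(-1, 1))+
  # each bar runs from one standard error below the mean to one above it
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=0.1)+
  theme_classic()+
  scale_x_discrete(limits = rev(levels(as.factor(data_summary$Distance))))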
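
To make the `ymin` and `ymax` arguments concrete, the sketch below computes the two ends of each error bar by hand, using the rounded means and standard errors printed in the summary table (the data frame name `bars` is made up for this example). It also confirms that every bar fits inside the -1 to 1 range set by expand_limits:

```r
# Rounded values copied from the summary table
bars <- data.frame(
  Island   = c('whether_island', 'whether_island', 'non-island', 'non-island'),
  Distance = c('long', 'short', 'long', 'short'),
  Mean     = c(-0.608, -0.0708, 0.117, 0.561),
  SE       = c(0.171, 0.164, 0.281, 0.254),
  stringsAsFactors = FALSE
)

# Lower and upper ends of each bar, as in aes(ymin = Mean - SE, ymax = Mean + SE)
bars$ymin <- bars$Mean - bars$SE
bars$ymax <- bars$Mean + bars$SE

print(bars)

# TRUE if no bar extends beyond the y-axis limits
print(all(bars$ymin >= -1 & bars$ymax <= 1))
```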

Exercise: Make the second round of edits

We’re almost there! Let’s make a few more edits to the plot to make it camera-ready.

Change the title of the y-axis to be something more descriptive (e.g., “mean acceptability in z-score”)
Make the axis titles and labels bigger
Remove the legend title
Move the legend so that it is displayed inside the graph (rather than next to the graph)

#write your answers here

Answers

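data_summary %>% ggplot(aes(x=Distance, y=Mean))+
  geom_point()+
  # map linetype to Island directly (rev(Island) would mislabel the legend)
  # and set the linetypes by name so that the non-island line is dashed
  geom_path(aes(group = Island, linetype = Island))+
  scale_linetype_manual(values = c('non-island' = 'dashed', 'whether_island' = 'solid'))+
  expand_limits(y = c(-1, 1))+
  geom_errorbar(aes(ymin=Mean-SE, ymax=Mean+SE), width=0.1)+
  theme_classic()+
  scale_x_discrete(limits = rev(levels(as.factor(data_summary$Distance))))+
  ylab('mean z-score')+
  theme(legend.title = element_blank(),
        # in ggplot2 3.5.0 and later, the preferred form is
        # legend.position = 'inside' with legend.position.inside = c(0.85, 0.85)
        legend.position = c(0.85, 0.85),
        legend.text = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15))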
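
Once the plot is camera-ready, you will probably want it as a file. One way to do that is ggsave from ggplot2. The sketch below is not part of the original exercise: it builds a simplified plot from the rounded table values, and the file name, location and dimensions are arbitrary choices made for illustration.

```r
library(ggplot2)

# Stand-in for the relabeled summary table (rounded values)
toy_summary <- data.frame(
  Island   = c('whether_island', 'whether_island', 'non-island', 'non-island'),
  Distance = c('long', 'short', 'long', 'short'),
  Mean     = c(-0.608, -0.0708, 0.117, 0.561),
  stringsAsFactors = FALSE
)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(toy_summary, aes(x = Distance, y = Mean)) +
  geom_point() +
  geom_path(aes(group = Island, linetype = Island)) +
  theme_classic()

# Hypothetical output path in the session's temporary directory; width and height are in inches
out_file <- file.path(tempdir(), 'island_plot.pdf')
ggsave(out_file, plot = p, width = 6, height = 4)
```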
