Data cleaning & visualization practice - Part 1

Overview

This file walks you through importing, cleaning, and analyzing a dataset that resembles the real data (with stimuli in Japanese) that we obtain from an acceptability judgment experiment.

Install and import required packages

The tidyverse is a collection of R packages (“libraries”) designed for data science. Uncomment and run the following code if the package is not installed in your environment:

#install.packages('tidyverse')

Load the installed package so that we can use it in this file:

library(tidyverse)

Import and clean the data

data = read.csv('./fakedata.csv')
data = data[,-1] #removes unnecessary index column
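
To see what the negative index does, here is a minimal sketch on a made-up data frame (the object toy and its column names are hypothetical, not taken from fakedata.csv). A negative column index drops that column and keeps the rest; selecting by name is an alternative that keeps working if the column order changes.

```r
# toy data frame standing in for a freshly imported CSV (hypothetical values)
toy = data.frame(X = 1:3, Subj_id = c(1, 1, 2), Score = c(6, 5, 3))

# drop the first column by position, as in the code above
toy_by_position = toy[, -1]

# drop the same column by name instead of by position
toy_by_name = toy[, names(toy) != 'X']

names(toy_by_position)
names(toy_by_name)
```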

Check the data type of each column:

glimpse(data)

## Rows: 512
## Columns: 5
## $ Sentence  <chr> "統計的機械学習での分析が応用可能で、より精度の高いデータと…
## $ Lex_set   <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, …
## $ Condition <int> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, …
## $ Subj_id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Score     <int> 6, 5, 3, 1, 7, 6, 1, 1, 7, 5, 1, 3, 6, 5, 3, 4, 6, 5, 4, 2, …

The key columns are the following:

Lex_set: Serial number of the lexicalization set (a set of sentences that belong to different conditions but share certain words and phrases)
Condition: Serial number of the condition. As you can see, each lexicalization set consists of four sentences, one for each condition
Score: Acceptability score given by the participant

Note: If you see Unicode escape sequences (e.g., \u9ad8) instead of actual characters in the Sentence column, try updating R (not RStudio).

Let’s convert some of the columns from numerical to categorical. Variables like Condition are numbered, but we do not intend to apply mathematical operations to them.

cols = c('Lex_set', 'Condition', 'Subj_id')
data[cols] = lapply(data[cols], factor)
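
A small sketch of what the conversion changes, using a made-up vector (cond is hypothetical). As a numeric vector, the labels can be averaged even though the result is meaningless; as a factor, they are treated as categories that can be listed and counted.

```r
cond = c(1, 2, 3, 4, 1, 2)   # hypothetical condition numbers
mean(cond)                   # numeric: R averages the labels without complaint

cond_f = factor(cond)
levels(cond_f)               # the distinct categories, stored as text labels
table(cond_f)                # number of observations per category
```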

We can subset the data (i.e., extract certain rows and/or columns) using square brackets:

#subset rows to show acceptability scores from subject 1
data[data$Subj_id == 1, 5] #5 is the column index of the Score column

##  [1] 6 5 3 1 7 6 1 1 7 5 1 3 6 5 3 4 6 5 4 2 7 5 3 2 7 5 2 1 6 5 3 3
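
Column names can be used in place of numeric indices, which avoids miscounting columns. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
toy = data.frame(Subj_id = factor(c(1, 1, 2, 2)),
                 Condition = factor(c(1, 2, 1, 2)),
                 Score = c(6, 5, 3, 1))

# rows for subject 1, with the Score column selected by name
toy[toy$Subj_id == 1, 'Score']

# rows for subject 2, with two columns selected by name
toy[toy$Subj_id == 2, c('Condition', 'Score')]
```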

Alternatively, we can use the which() function, which returns the indices of the rows that satisfy a condition:

#getting row numbers where the subject ID is ten
idx = which(data$Subj_id == 10)
idx

##  [1] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307
## [20] 308 309 310 311 312 313 314 315 316 317 318 319 320
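
One practical difference between the two approaches shows up when the tested values contain NA. In this sketch (the score vector is made up), the logical test yields NA for the missing element and indexing with it returns an NA entry, whereas which() keeps only the positions where the test is TRUE.

```r
score = c(6, NA, 3, 7)     # hypothetical scores with one missing value

score > 5                  # TRUE NA FALSE TRUE
score[score > 5]           # 6 NA 7: the NA is carried along
which(score > 5)           # 1 4: only positions where the test is TRUE
score[which(score > 5)]    # 6 7
```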

Exercise 1: Find & replace rows with NA values

Suppose that I forgot to turn on the feature in my experiment that requires participants to select an answer before proceeding. As a result, there may be some rows with a missing acceptability score. Find the rows with a missing Score value and replace the missing values with 3 (in real life, though, it’s common to discard all data from participants with missing values).

Advanced: Find the rows with a missing Score value and replace the missing values with the mean acceptability score.

#write your answers here

Answers

#find rows with NA values
data[is.na(data$Score),]

##                                                                                                       Sentence
## 159   最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454     世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
##     Lex_set Condition Subj_id Score
## 159       8         3       5    NA
## 232       2         4       8    NA
## 454       2         2      15    NA

#replace NA values with 3
idx = which(is.na(data$Score)) #get row number of missing values
data[idx, ]$Score = 3
data[idx, ]

##                                                                                                       Sentence
## 159   最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454     世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
##     Lex_set Condition Subj_id Score
## 159       8         3       5     3
## 232       2         4       8     3
## 454       2         2      15     3

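#replace NA values with mean score
#don't forget to include na.rm=TRUE, or the returned value will be NA
#note: the NAs were already replaced with 3 above, so those 3s count toward this mean;
#to use the mean of the observed scores only, run this on freshly imported data
data[idx, ]$Score = mean(data$Score, na.rm=TRUE)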
data[idx, ]

##                                                                                                       Sentence
## 159   最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454     世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
##     Lex_set Condition Subj_id    Score
## 159       8         3       5 4.439453
## 232       2         4       8 4.439453
## 454       2         2      15 4.439453
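
The same kind of replacement can be written with dplyr verbs. The sketch below is one possible variant, not the method used above: it fills each missing score with that participant's own mean rather than the overall mean. The data frame toy and its values are hypothetical.

```r
library(dplyr)

toy = data.frame(Subj_id = factor(c(1, 1, 1, 2, 2, 2)),
                 Score = c(6, NA, 2, 5, 3, NA))

toy_filled = toy %>%
  group_by(Subj_id) %>%
  # within each participant, replace NA with the mean of the observed scores
  mutate(Score = ifelse(is.na(Score), mean(Score, na.rm = TRUE), Score)) %>%
  ungroup()

toy_filled
```

Whether to impute with the overall mean or a per-participant mean is a design choice that depends on the analysis; the answer above uses the overall mean.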

Exercise 2: Find & modify impossible scores

Suppose also that there are some rows whose acceptability score falls outside the possible range (1 to 7). Find such rows.

Advanced: Replace the score with 7 if it is over 7, and with 1 if it is under 1. Hint: Use the ifelse() function, whose arguments are the condition, the value to return if the condition is met, and the value to return if it is not: ifelse(condition, value if TRUE, value if FALSE)
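
As an illustration of the hint, here is ifelse() applied to a made-up vector unrelated to the exercise (temp is hypothetical). The function is vectorized: it checks the condition element by element and returns a vector of the same length.

```r
temp = c(12, 25, 31, 8)               # hypothetical temperatures
ifelse(temp > 20, 'warm', 'cold')     # "cold" "warm" "warm" "cold"
```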

#write your answers here

Answers

#find rows with the score of over 7 or under 1 and modify them
idx = which(data$Score > 7 | data$Score < 1)
data[idx, ]$Score = ifelse(data[idx, ]$Score > 7, 7, 1)
data[idx, ]

##                                                                                   Sentence
## 188                  会社でトラブルの原因を究明する作業員と、その社員が協力して、2人の男性
## 377 会社でトラブルの原因を追及していた。しかし、この事故は、ボーイングが事故機について検査
##     Lex_set Condition Subj_id Score
## 188       7         4       6     1
## 377       7         1      12     7
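
Clamping values to a range can also be expressed with pmin() and pmax(), which take element-wise minima and maxima. A sketch on a made-up vector (score is hypothetical); note that NA values pass through unchanged.

```r
score = c(6, 9, 0, 3, NA)     # hypothetical scores, two of them out of range

# cap at 7 from above, then at 1 from below
clamped = pmax(pmin(score, 7), 1)
clamped                       # 6 7 1 3 NA
```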

Get the summary of results

Grouping the data by condition and averaging the acceptability scores with the summarize() function is a quick and easy way to get a snapshot of the results.

data %>% group_by(Condition) %>% summarize(ave = mean(Score))

## # A tibble: 4 × 2
##   Condition   ave
##   <fct>     <dbl>
## 1 1          6.07
## 2 2          5.89
## 3 3          2.57
## 4 4          2.49
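
summarize() can compute several statistics in one call. A sketch on a made-up data frame (toy and its values are hypothetical) that adds the standard deviation and the number of observations per condition:

```r
library(dplyr)

toy = data.frame(Condition = factor(c(1, 1, 2, 2, 2)),
                 Score = c(6, 7, 2, 3, 1))

toy_summary = toy %>%
  group_by(Condition) %>%
  summarize(ave = mean(Score),   # mean score per condition
            sd = sd(Score),      # standard deviation per condition
            n = n())             # number of rows per condition

toy_summary
```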

Exercise 3: Get an average score per participant

#write your answers here

Answer

data %>% group_by(Subj_id) %>% summarize(ave = mean(Score))

## # A tibble: 16 × 2
##    Subj_id   ave
##    <fct>   <dbl>
##  1 1        4.06
##  2 2        4.06
##  3 3        4   
##  4 4        4.59
##  5 5        4.26
##  6 6        4.06
##  7 7        4.25
##  8 8        4.42
##  9 9        4.16
## 10 10       4   
## 11 11       4.31
## 12 12       4.5 
## 13 13       4.75
## 14 14       4.28
## 15 15       4.26
## 16 16       4.06

Convert raw acceptability scores to z-scores

Participants use the acceptability scale in different ways; some may use 1 and 7 almost exclusively, while others may stick to 3 through 5. To put their scores on a common scale, it is common to convert each participant's raw acceptability scores to z-scores (standard scores), using that participant's own mean and standard deviation. The formula for z-score conversion is as follows:

z-score = (raw score - sample mean) / standard deviation of sample

data = data %>% group_by(Subj_id) %>% mutate(Z_score = (Score - mean(Score)) / sd(Score))
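
A worked example may make the effect of the conversion concrete. In this sketch (toy and its scores are made up), two hypothetical participants give the same ranking but use different parts of the scale; after conversion within each participant, their z-scores coincide.

```r
library(dplyr)

toy = data.frame(Subj_id = factor(c(1, 1, 1, 2, 2, 2)),
                 Score = c(1, 4, 7, 3, 4, 5))   # subject 1 uses the extremes

toy = toy %>%
  group_by(Subj_id) %>%
  # subtract the participant's mean, then divide by the participant's sd
  mutate(Z_score = (Score - mean(Score)) / sd(Score)) %>%
  ungroup()

toy$Z_score    # -1 0 1 -1 0 1
```

Subject 1 has mean 4 and standard deviation 3, subject 2 has mean 4 and standard deviation 1, so both sets of scores map to -1, 0, and 1.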

Plot the results

We will use the ggplot2 package (loaded as part of tidyverse) for visualization. ggplot2 syntax differs slightly from base R syntax, for example in its use of + to add plot features. First, we will create a summary data frame to be plotted.

data_summary = data %>% group_by(Condition) %>% summarize(Ave = mean(Z_score))

#the code below converts condition numbers to actual condition names
columns = data.frame(Extraction=c('no_extraction', 'no_extraction', 'extraction', 'extraction'),
                     RC=c('non_RC','RC','non_RC','RC'))
data_summary = cbind(columns, data_summary)
data_summary

##      Extraction     RC Condition        Ave
## 1 no_extraction non_RC         1  0.9122053
## 2 no_extraction     RC         2  0.8148336
## 3    extraction non_RC         3 -0.8437908
## 4    extraction     RC         4 -0.8832481

data_summary$Extraction = factor(data_summary$Extraction, levels = c('no_extraction', 'extraction'))
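
Without explicit levels, factor() orders categories alphabetically, which would put 'extraction' before 'no_extraction' on the x-axis. A quick check on a made-up vector (x is hypothetical):

```r
x = c('no_extraction', 'extraction')

# default: levels are sorted alphabetically
levels(factor(x))

# explicit levels: the order given here is the order used on the axis
levels(factor(x, levels = c('no_extraction', 'extraction')))
```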

data_summary %>% ggplot(aes(x=Extraction, y=Ave))+
  geom_point()+
  #one line per RC type; the line type matches the RC value of the points it connects
  geom_path(aes(group = RC, linetype = RC))+
  guides(linetype = guide_legend(reverse = TRUE))+
  expand_limits(y = c(-1, 1))

Exercise 4: Make the graph look better

Please do the following:

Change the title of the y-axis to something more descriptive (e.g., “mean acceptability in z-score”)
Remove the x-axis title (“Extraction”)
Make the axis titles and labels bigger
Remove the legend title (“RC”)
Remove the grey background and grid lines (use the function theme_classic() or theme_bw())
Advanced: Move the legend so that it is displayed inside the graph (rather than next to the graph)

#write your answers here

Answer

data_summary %>% ggplot(aes(x=Extraction, y=Ave))+
  geom_point()+
  geom_path(aes(group = RC, linetype = RC))+
  guides(linetype = guide_legend(reverse = TRUE))+
  expand_limits(y = c(-1, 1))+
  theme_classic()+
  ylab('mean z-score')+
  #in ggplot2 3.5.0 and later, use legend.position = 'inside' with legend.position.inside = c(0.9, 0.9)
  theme(legend.title = element_blank(),
        legend.position = c(0.9, 0.9),
        legend.text = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15))
