This file walks you through importing a dataset that resembles the real data we obtain from an acceptability judgment experiment (the stimuli are in Japanese), cleaning it, and analyzing it.
tidyverse is a collection of R packages (“libraries”) designed for data science. Uncomment and run the following code if the package is not installed in your environment:
#install.packages('tidyverse')
Call the installed library so that we can use it in this file.
library(tidyverse)
data = read.csv('./fakedata.csv')
data = data[,-1] #removes unnecessary index column
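The index column removed above is typical of files saved with write.csv() and its default settings. The sketch below reproduces that situation with a made-up data frame and a temporary file; the names toy, path, and raw are hypothetical, and it is only an assumption that fakedata.csv was created this way.

```r
# Made-up data standing in for the experiment file (not the real data)
toy <- data.frame(Subj_id = c(1, 1, 2), Score = c(6, 5, 3))

# write.csv() writes row names by default, which adds an unnamed first column
path <- tempfile(fileext = ".csv")
write.csv(toy, path)

raw <- read.csv(path)
names(raw)  # the row names come back as a column named "X"

# Option 1: drop the first column by position, as done above
clean_by_position <- raw[, -1]

# Option 2: avoid creating the column when the file is written
write.csv(toy, path, row.names = FALSE)
clean_at_source <- read.csv(path)
```

Dropping by position is fine for a one-off script, but it silently removes a real column if the file ever lacks the index, so checking names(raw) first is a cheap safeguard.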
Check the data type of each column.
glimpse(data)
## Rows: 512
## Columns: 5
## $ Sentence <chr> "統計的機械学習での分析が応用可能で、より精度の高いデータと…
## $ Lex_set <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, …
## $ Condition <int> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, …
## $ Subj_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Score <int> 6, 5, 3, 1, 7, 6, 1, 1, 7, 5, 1, 3, 6, 5, 3, 4, 6, 5, 4, 2, …
The dataset includes the following columns:
Lex_set: Serial number of the lexicalization set (a set of sentences that belong to different conditions but share certain words and phrases)
Condition: Serial number of the condition. As you can see, each lexicalization set consists of four sentences, each belonging to a different condition
Score: Acceptability score given by the participant

Note: If you see Unicode escape codes (e.g., \u9ad8) instead of the actual characters in the Sentence column, try updating R (not RStudio).
Let’s convert some of the columns from numerical to categorical. Variables like Condition are numbered, but we do not intend to apply mathematical operations to them.
cols = c('Lex_set', 'Condition', 'Subj_id')
data[cols] = lapply(data[cols], factor)
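As a small illustration of what the conversion changes, the sketch below uses a made-up data frame (toy is a hypothetical name). It also shows a common pitfall when converting a factor back to numbers.

```r
# Made-up data: condition labels stored as numbers
toy <- data.frame(Condition = c(1, 2, 3, 4, 1, 2),
                  Score     = c(6, 5, 3, 1, 7, 6))

is.numeric(toy$Condition)  # TRUE: R would happily average these labels

# After conversion, R treats the values as category labels
toy$Condition <- factor(toy$Condition)
levels(toy$Condition)      # the distinct labels, stored as text
table(toy$Condition)       # number of rows per condition

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the original numbers
f <- factor(c(10, 20, 10))
codes  <- as.numeric(f)                # 1 2 1
values <- as.numeric(as.character(f))  # 10 20 10
```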
We can subset the data (i.e., extract certain rows and/or columns) using square brackets.
#subset rows to show acceptability scores from subject 1
data[data$Subj_id == 1, 5] #5 is the column index of the Score column
## [1] 6 5 3 1 7 6 1 1 7 5 1 3 6 5 3 4 6 5 4 2 7 5 3 2 7 5 2 1 6 5 3 3
Alternatively, we can get the matching row numbers with the which() function.
#getting row numbers where the subject ID is ten
idx = which(data$Subj_id == 10)
idx
## [1] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307
## [20] 308 309 310 311 312 313 314 315 316 317 318 319 320
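The two subsetting styles behave differently when the tested column contains missing values. The following sketch, on a made-up data frame (toy is a hypothetical name), shows that a logical index keeps a placeholder row for each NA, while which() drops it.

```r
# Made-up data with one missing score
toy <- data.frame(Subj_id = c(1, 1, 2, 2),
                  Score   = c(6, NA, 3, 7))

# Logical index: the comparison NA > 5 is NA, which yields a row filled with NA
logical_rows <- toy[toy$Score > 5, ]

# which() returns only the positions where the condition is TRUE
which_rows <- toy[which(toy$Score > 5), ]

nrow(logical_rows)  # 3 (rows 1 and 4 plus one all-NA row)
nrow(which_rows)    # 2 (rows 1 and 4)
```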
Suppose that I forgot to turn on the feature in my experiment that requires participants to select an answer before proceeding. As a result, there may be some rows with a missing acceptability score. Find the rows with a missing Score value and replace the missing values with 3 (in real life, though, it is common to discard all data from participants with missing values).
Advanced: Find the rows with a missing Score value and replace the missing values with the mean acceptability score.
#write your answers here
#find rows with NA values
data[is.na(data$Score),]
## Sentence
## 159 最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454 世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
## Lex_set Condition Subj_id Score
## 159 8 3 5 NA
## 232 2 4 8 NA
## 454 2 2 15 NA
#replace NA values with 3
idx = which(is.na(data$Score)) #get row number of missing values
data[idx, ]$Score = 3
data[idx, ]
## Sentence
## 159 最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454 世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
## Lex_set Condition Subj_id Score
## 159 8 3 5 3
## 232 2 4 8 3
## 454 2 2 15 3
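#replace NA values with the mean score
#na.rm=TRUE makes mean() skip NA values; without it, the returned value is NA
#note: if the block above has already been run, the three missing scores are already 3 and are included in this mean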
data[idx, ]$Score = mean(data$Score, na.rm=TRUE)
data[idx, ]
## Sentence
## 159 最新のニュース映像や番組が放送されるが、番組本編ではノンスクランブルかつ完全ノンスクランブルで放送される
## 232 世界中の商品情報を発信するとともに、各都道府県の情報を提供している。また、各種地域情報を発信するため、全国
## 454 世界中の商品情報を発信するために日本を代表するローカル紙「週刊東洋経済」を発行していた。創刊当初から、
## Lex_set Condition Subj_id Score
## 159 8 3 5 4.439453
## 232 2 4 8 4.439453
## 454 2 2 15 4.439453
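Because the two replacements above run one after the other, the order of operations affects the imputed value. A minimal sketch with made-up scores (all names here are hypothetical) shows how to compute the mean from the observed values only, before anything is overwritten.

```r
# Made-up scores with two missing values
scores  <- c(6, 5, NA, 1, 7, NA)
missing <- which(is.na(scores))

# Mean of the observed values only: (6 + 5 + 1 + 7) / 4 = 4.75
observed_mean <- mean(scores, na.rm = TRUE)

imputed <- scores
imputed[missing] <- observed_mean

# For comparison: filling with 3 first pulls the mean toward 3
filled_with_3 <- scores
filled_with_3[missing] <- 3
mean_after_fill <- mean(filled_with_3)  # 25 / 6
```

Keeping the indices in missing also makes it easy to report how many values were imputed.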
Suppose also that some rows have an acceptability score outside the possible range (1 to 7). Find such rows.
Advanced: Replace the score with 7 if it is over 7, and with 1 if it is under 1. Hint: Use the ifelse() function, whose arguments are a condition, the value to return if the condition is met, and the value to return if it is not: ifelse(condition, value if condition=T, value if condition=F).
#write your answers here
#find rows with the score of over 7 or under 1 and modify them
idx = which(data$Score > 7 | data$Score < 1)
data[idx, ]$Score = ifelse(data[idx, ]$Score > 7, 7, 1)
data[idx, ]
## Sentence
## 188 会社でトラブルの原因を究明する作業員と、その社員が協力して、2人の男性
## 377 会社でトラブルの原因を追及していた。しかし、この事故は、ボーイングが事故機について検査
## Lex_set Condition Subj_id Score
## 188 7 4 6 1
## 377 7 1 12 7
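The same clamping can be written without first selecting the offending rows. The sketch below, on a made-up vector, compares a nested ifelse() with the base R functions pmax() and pmin(); this is one possible alternative, not the only way to do it.

```r
# Made-up scores, some outside the 1-7 range
scores <- c(6, 9, 0, 3, 7, -2)

# Nested ifelse(): cap at 7, floor at 1, otherwise keep the score
clamped_ifelse <- ifelse(scores > 7, 7, ifelse(scores < 1, 1, scores))

# pmax() raises values below 1 to 1; pmin() lowers values above 7 to 7
clamped_pminmax <- pmin(pmax(scores, 1), 7)

clamped_pminmax  # 6 7 1 3 7 1
```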
Grouping the data by condition and averaging the acceptability score with the summarize function is a quick and easy way to get a snapshot of the results.
data %>% group_by(Condition) %>% summarize(ave = mean(Score))
## # A tibble: 4 × 2
## Condition ave
## <fct> <dbl>
## 1 1 6.07
## 2 2 5.89
## 3 3 2.57
## 4 4 2.49
#write your answers here
data %>% group_by(Subj_id) %>% summarize(ave = mean(Score))
## # A tibble: 16 × 2
## Subj_id ave
## <fct> <dbl>
## 1 1 4.06
## 2 2 4.06
## 3 3 4
## 4 4 4.59
## 5 5 4.26
## 6 6 4.06
## 7 7 4.25
## 8 8 4.42
## 9 9 4.16
## 10 10 4
## 11 11 4.31
## 12 12 4.5
## 13 13 4.75
## 14 14 4.28
## 15 15 4.26
## 16 16 4.06
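summarize() can compute several statistics per group in one call. The sketch below uses a made-up data frame (toy and the column names sd_score, n, and se are hypothetical) to add the standard deviation, the number of observations, and the standard error alongside the mean.

```r
library(dplyr)

# Made-up data: two conditions with two scores each
toy <- data.frame(Condition = factor(c(1, 1, 2, 2)),
                  Score     = c(6, 7, 2, 3))

toy_summary <- toy %>%
  group_by(Condition) %>%
  summarize(ave      = mean(Score),        # mean score per condition
            sd_score = sd(Score),          # standard deviation per condition
            n        = n(),                # number of rows per condition
            se       = sd_score / sqrt(n)) # standard error of the mean

toy_summary
```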
Participants use the acceptability scale in different ways; some may use 1 and 7 exclusively, while others may stick to 3 through 5. To normalize the scores, it is common to convert each participant's raw acceptability scores to z-scores (standard scores). The formula for z-score conversion is as follows:
$$\text{z-score} = \frac{\text{raw score} - \text{sample mean}}{\text{standard deviation of sample}}$$
data = data %>% group_by(Subj_id) %>% mutate(Z_score = (Score - mean(Score)) / sd(Score))
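To see what the by-subject conversion does, consider two made-up participants: one uses the whole scale (1, 4, 7) and the other stays in the middle (3, 4, 5). The sketch below (toy and toy_z are hypothetical names) shows that both end up with the same z-scores.

```r
library(dplyr)

# Made-up data: subject 1 uses the extremes, subject 2 stays near the middle
toy <- data.frame(Subj_id = factor(c(1, 1, 1, 2, 2, 2)),
                  Score   = c(1, 4, 7, 3, 4, 5))

toy_z <- toy %>%
  group_by(Subj_id) %>%
  mutate(Z_score = (Score - mean(Score)) / sd(Score)) %>%
  ungroup()  # drop the grouping so later steps are not computed per subject

# Subject 1: mean 4, sd 3; subject 2: mean 4, sd 1
toy_z$Z_score  # -1 0 1 -1 0 1
```

Calling ungroup() at the end is optional here, but it avoids surprises if further mutate() or summarize() calls follow.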
We will use the package ggplot2 (loaded as part of tidyverse) for visualization. The syntax of ggplot2 differs slightly from the rest of R, such as the use of + to add plot features. First, we will create a summary data frame to be plotted.
data_summary = data %>% group_by(Condition) %>% summarize(Ave = mean(Z_score))
#the code below converts condition numbers to actual condition names
columns = data.frame(Extraction=c('no_extraction', 'no_extraction', 'extraction', 'extraction'),
RC=c('non_RC','RC','non_RC','RC'))
data_summary = cbind(columns, data_summary)
data_summary
## Extraction RC Condition Ave
## 1 no_extraction non_RC 1 0.9122053
## 2 no_extraction RC 2 0.8148336
## 3 extraction non_RC 3 -0.8437908
## 4 extraction RC 4 -0.8832481
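cbind() attaches the condition names purely by row position, so it gives the right result only when the summary rows are in the order 1 to 4. A sketch of an alternative that matches on the Condition value is shown below; condition_key, toy_summary, and labelled are hypothetical names, and the Ave values are made up.

```r
library(dplyr)

# Lookup table: one row per condition number, with its factor labels
condition_key <- data.frame(
  Condition  = factor(1:4),
  Extraction = c('no_extraction', 'no_extraction', 'extraction', 'extraction'),
  RC         = c('non_RC', 'RC', 'non_RC', 'RC')
)

# Made-up summary whose rows are deliberately out of order
toy_summary <- data.frame(Condition = factor(c(3, 1, 4, 2), levels = 1:4),
                          Ave       = c(-0.8, 0.9, -0.9, 0.8))

# left_join() matches rows by the value of Condition, not by position
labelled <- left_join(toy_summary, condition_key, by = "Condition")
labelled
```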
data_summary$Extraction = factor(data_summary$Extraction, levels = c('no_extraction', 'extraction'))
data_summary %>% ggplot(aes(x=Extraction, y=Ave))+
  geom_point()+
  #map linetype to RC itself so that the legend labels match the lines
  geom_path(aes(group = RC, linetype = RC))+
  guides(linetype = guide_legend(reverse = TRUE))+
  expand_limits(y = c(-1, 1))
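The + operator adds one component at a time to a plot object, so a plot can also be built up step by step and inspected along the way. The following sketch uses made-up values and hypothetical names (toy, p); it is meant only to illustrate the layering idea.

```r
library(ggplot2)

# Made-up summary values for the four conditions
toy <- data.frame(
  Extraction = factor(c('no_extraction', 'extraction', 'no_extraction', 'extraction'),
                      levels = c('no_extraction', 'extraction')),
  RC  = c('non_RC', 'non_RC', 'RC', 'RC'),
  Ave = c(0.9, -0.8, 0.8, -0.9)
)

# Start with the data and the axis mappings only (no layers yet)
p <- ggplot(toy, aes(x = Extraction, y = Ave))

# Each + adds a layer or a setting and returns a new plot object
p <- p + geom_point()
p <- p + geom_line(aes(group = RC, linetype = RC))
p <- p + expand_limits(y = c(-1, 1))

length(p$layers)  # 2: the point layer and the line layer
```

Printing p (or calling print(p) inside a script) draws the plot.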
Please do the following:
Apply a theme to the plot (theme_classic() or theme_bw())
#write your answers here
data_summary %>% ggplot(aes(x=Extraction, y=Ave))+
  geom_point()+
  #map linetype to RC itself so that the legend labels match the lines
  geom_path(aes(group = RC, linetype = RC))+
  guides(linetype = guide_legend(reverse = TRUE))+
  expand_limits(y = c(-1, 1))+
  theme_classic()+
  ylab('mean z-score')+
  theme(legend.title = element_blank(),
        legend.position = c(0.9, 0.9),
        legend.text = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15))