Linguistics PhD
import pandas as pd
sample = pd.read_csv('./sampledata.csv')
sample.head(3)
| | Movement | Island_Type | Island | Distance | Item | Sentence | Subj_id | List | Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WH | whe | non | sh | 1 | Who thinks that Paul stole the necklace? | 1 | 1 | 6 |
| 1 | WH | whe | non | sh | 2 | Who thinks that Matt chased the bus? | 1 | 1 | 2 |
| 2 | WH | whe | non | sh | 3 | Who thinks that Tom sold the television? | 1 | 1 | 3 |
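The following code groups the data by the two conditions (Island, Distance) and uses transform to write each group's mean and standard deviation of the acceptability scores back onto every row of that group. Because transform returns one value per original row, the data frame keeps its original shape and no separate ungrouping step is needed. Note that, despite its name, sd_answer_z holds the standard deviation of the raw scores, not of z-scored ones: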
sample['mean_response'] = sample.groupby(['Island','Distance'])['Score'].transform('mean')
sample['sd_answer_z'] = sample.groupby(['Island','Distance'])['Score'].transform('std')
sample.iloc[:5, [2, 3, -3, -2, -1]]  # first five rows, showing only the relevant columns
| | Island | Distance | Score | mean_response | sd_answer_z |
|---|---|---|---|---|---|
| 0 | non | sh | 6 | 4.585938 | 1.709295 |
| 1 | non | sh | 2 | 4.585938 | 1.709295 |
| 2 | non | sh | 3 | 4.585938 | 1.709295 |
| 3 | non | sh | 7 | 4.585938 | 1.709295 |
| 4 | non | sh | 2 | 4.585938 | 1.709295 |
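To make the difference between transform and an aggregation concrete, here is a minimal sketch on a made-up four-row frame. The toy data frame and its values are hypothetical and are not taken from sampledata.csv; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data (not from sampledata.csv)
toy = pd.DataFrame({
    'Island':   ['non', 'non', 'isl', 'isl'],
    'Distance': ['sh', 'sh', 'sh', 'sh'],
    'Score':    [6, 2, 3, 4],
})

grouped = toy.groupby(['Island', 'Distance'])['Score']

# transform: one value per original row, aligned to toy's index
row_means = grouped.transform('mean')

# agg: one value per group, indexed by the grouping keys
group_means = grouped.agg('mean')

# Attach the row-aligned means as a new column
toy['mean_response'] = row_means

print(toy)
print(group_means)
```

The transform result has as many entries as the frame has rows, which is why it can be assigned directly as a new column, whereas the agg result has one entry per (Island, Distance) combination.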
Next, I aggregate the data by the two conditions into a summary data frame that shows the mean, the standard deviation, and the standard error of the mean (SEM) of the acceptability scores in each group.
data_summary = sample.groupby(['Island','Distance'])['Score'].agg(['mean','std','sem']).reset_index()
data_summary
| | Island | Distance | mean | std | sem |
|---|---|---|---|---|---|
| 0 | isl | lg | 2.593750 | 1.180134 | 0.104310 |
| 1 | isl | sh | 3.531250 | 1.128985 | 0.099789 |
| 2 | non | lg | 3.890625 | 1.920607 | 0.169759 |
| 3 | non | sh | 4.585938 | 1.709295 | 0.151082 |
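The three summary columns are related: with pandas defaults (ddof=1 for both std and sem), the SEM is the sample standard deviation divided by the square root of the group size. The sketch below checks this on hypothetical toy data; the toy frame, its values, and the sem_by_hand column are illustrative assumptions and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from sampledata.csv)
toy = pd.DataFrame({
    'Island': ['non'] * 4 + ['isl'] * 4,
    'Score':  [6, 2, 3, 7, 1, 2, 4, 3],
})

# Same aggregation as above, plus the group size
summary = toy.groupby('Island')['Score'].agg(['mean', 'std', 'sem', 'count'])

# SEM recomputed by hand: sample SD (ddof=1) divided by sqrt(n)
summary['sem_by_hand'] = summary['std'] / np.sqrt(summary['count'])

print(summary)
```

Applied to the summary table above, the ratio std / sem is about 11.31 in every row, which is consistent with 128 scores per Island-by-Distance cell.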