Linguistics PhD Student at UC San Diego
Acknowledgement: I thank my RA Sidney Ma for coming up with the code.
PennController for IBEX is a great tool for running psycholinguistic experiments such as sentence acceptability studies, but one downside of the platform is that its output is not very clean. In particular, the raw data has no header row of column names.
And although pandas assigns column labels automatically on import, those labels are just integers, which are far from descriptive.
import pandas as pd
sample = pd.read_csv('/your/path/results.csv', header=None)
sample.head(3)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | # Results on Sun | 12 Feb 2023 17:53:32 GMT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | # USER AGENT: Mozilla/5.0 (Macintosh; Intel Ma... | like Gecko) Chrome/109.0.0.0 Safari/537.36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | # Design number was non-random = 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 26 columns
Here, I’d like to point out that the results file already describes its own columns: each comment line that begins with "#" followed by a number (I’ll call these index columns) names one column of the data. We can use them to label the columns, so we don’t need to work out manually which column is which, nor do we need to type out new names.
The strategy is to collect these index columns and then assign them as the new column headers.
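# The index columns all start with "# " followed by a number, e.g. "# 1. ...".
# Return True if a value fits that description.
def is_index_col(s):
    # Non-string cells (e.g. NaN) and very short strings cannot be index columns.
    if isinstance(s, str) and len(s) > 2:
        # Position 2 holds the first digit of the column number.
        return s.startswith("#") and s[2].isnumeric()
    return False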
index_cols = list(sample[sample[0].apply(is_index_col)][0].unique())
index_cols
['# 1. Results reception time.',
"# 2. MD5 hash of participant's IP address.",
'# 3. Controller name.',
'# 4. Order number of item.',
'# 5. Inner element number.',
'# 6. Label.',
'# 7. Latin Square Group.',
'# 8. PennElementType.',
'# 9. PennElementName.',
'# 10. Parameter.',
'# 11. Value.',
'# 12. EventTime.',
'# 13. id.',
'# 14. lang_comf.',
'# 15. comf.',
'# 16. lang.',
'# 17. parents.',
'# 18. age_US.',
'# 19. birth.',
'# 20. gender.',
'# 21. age.',
'# 22. Comments.',
'# 22. LIST.',
'# 23. ITEM.',
'# 24. CONDITION.',
'# 25. SENTENCE.',
'# 26. Comments.']
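As a side note, the same check can be written with a regular expression. The sketch below is only an illustration: the name `is_index_col_regex` and the pattern are my own assumptions, and the pattern assumes every index column looks like "# <number>. <description>", as in the list above.

```python
import re

# Assumed shape of an index column: "# <number>. <description>"
INDEX_COL_PATTERN = re.compile(r"^#\s*(\d+)\.\s+(.*)$")

def is_index_col_regex(value):
    """Return True if value is a string shaped like '# 12. EventTime.'."""
    return isinstance(value, str) and INDEX_COL_PATTERN.match(value) is not None

examples = [
    "# 1. Results reception time.",        # an index column
    "# Design number was non-random = 0",  # a comment line, but not an index column
    float("nan"),                          # pandas represents empty cells as NaN
]
flags = [is_index_col_regex(x) for x in examples]
print(flags)
```

This version is slightly stricter than checking a single character, since it also requires the period after the number.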
# For some reason, "Comments" and "LIST" are both labelled as #22. I'll remove the "# 22. Comments." entry manually.
index_cols.remove("# 22. Comments.")
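I spotted the clash at #22 by eye, but it could also be found programmatically. Here is a minimal sketch, assuming the index columns follow the "# <number>. <description>" shape shown above; `find_number_clashes` is a hypothetical helper, not part of pandas or PennController.

```python
import re
from collections import defaultdict

def find_number_clashes(descriptions):
    """Group index columns by their leading number and keep numbers used more than once."""
    by_number = defaultdict(list)
    for text in descriptions:
        match = re.match(r"#\s*(\d+)\.", text)
        if match:
            by_number[int(match.group(1))].append(text)
    return {n: texts for n, texts in by_number.items() if len(texts) > 1}

# A short excerpt of the list above, used here as demo input.
demo = ["# 21. age.", "# 22. Comments.", "# 22. LIST.", "# 23. ITEM."]
clashes = find_number_clashes(demo)
print(clashes)
```

Which of the clashing entries to drop is still a manual decision; the helper only reports them.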
# Remove irrelevant rows: keep only rows whose label (column 5) is one we care about
sample = sample[sample[5].isin(["experiment-filler",
                                "experiment-critical",
                                "background"])]
# Replace columns
sample.columns = index_cols
sample.head(3)
| | # 1. Results reception time. | # 2. MD5 hash of participant's IP address. | # 3. Controller name. | # 4. Order number of item. | # 5. Inner element number. | # 6. Label. | # 7. Latin Square Group. | # 8. PennElementType. | # 9. PennElementName. | # 10. Parameter. | ... | # 17. parents. | # 18. age_US. | # 19. birth. | # 20. gender. | # 21. age. | # 22. LIST. | # 23. ITEM. | # 24. CONDITION. | # 25. SENTENCE. | # 26. Comments. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | 1676224412 | eeba289e3ac0463f13af3ea14e757415 | PennController | 52.0 | 0.0 | experiment-filler | NaN | PennController | 53 | _Trial_ | ... | undefined | undefined | undefined | undefined | undefined | 1 | 9.0 | intermediate | Which teachers are the administrator firing at... | NaN |
66 | 1676224412 | eeba289e3ac0463f13af3ea14e757415 | PennController | 52.0 | 0.0 | experiment-filler | NaN | PennController | 53 | _Header_ | ... | undefined | undefined | undefined | undefined | undefined | 1 | 9.0 | intermediate | Which teachers are the administrator firing at... | NaN |
67 | 1676224412 | eeba289e3ac0463f13af3ea14e757415 | PennController | 52.0 | 0.0 | experiment-filler | NaN | PennController | 53 | _Header_ | ... | undefined | undefined | undefined | undefined | undefined | 1 | 9.0 | intermediate | Which teachers are the administrator firing at... | NaN |
3 rows × 26 columns
Now we won’t have to figure out which column holds the list number, the acceptability judgment, and so on. Of course, feel free to rename the columns if you find the default names unwieldy (like the second column, which serves as a unique identifier for participants).
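If you do want shorter names, one option is to strip the "# <number>. " prefix and the trailing period. This is a rough sketch rather than a fixed recipe: `clean_column_name` is a hypothetical helper, and the three names in `demo_cols` are copied from the list above for illustration.

```python
import re

def clean_column_name(description):
    """Turn '# 6. Label.' into 'Label' (hypothetical helper)."""
    name = re.sub(r"^#\s*\d+\.\s*", "", description)  # drop the "# 6. " prefix
    return name.rstrip(".").strip()                   # drop the trailing period

demo_cols = [
    "# 2. MD5 hash of participant's IP address.",
    "# 6. Label.",
    "# 22. LIST.",
]
# Map each old name to its cleaned version.
renamed = {old: clean_column_name(old) for old in demo_cols}
print(renamed)
```

A mapping like `renamed` can be passed to `sample.rename(columns=renamed)`. Note that two columns would both end up as "Comments" if the "# 22. Comments." entry had not been removed first.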