Maho Takahashi

Linguistics PhD

mtakahas[at]ucsd[dot]edu

How to assign column names to data output by PCIbex

Acknowledgement: I thank my RA Sidney Ma for coming up with the code.

PennController for IBEX (PCIbex) is a great tool for running psycholinguistic experiments such as sentence acceptability studies, but one downside of the platform is that the data it outputs is not very clean. In particular, the raw results file has no header row, so the columns have no names.

And while pandas assigns default column names when the file is imported, those names are just integers (0, 1, 2, ...), which are far from descriptive.

import pandas as pd
sample = pd.read_csv('/your/path/results.csv', header=None)
sample.head(3)
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 # Results on Sun 12 Feb 2023 17:53:32 GMT NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 # USER AGENT: Mozilla/5.0 (Macintosh; Intel Ma... like Gecko) Chrome/109.0.0.0 Safari/537.36 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 # Design number was non-random = 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 26 columns

Here, I’d like to point out that the results file documents its own columns: each entry is preceded by lines such as "# 1. Results reception time.", one per column. I’ll call these lines index columns. We can use them to work out which column is which, so we don’t need to figure that out manually or type the new names ourselves.

The strategy is to collect these index columns and then assign them as the new column headers.

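# The index columns all start with "#", a space, and a number (e.g. "# 1. ...").
# Return True if a value fits that description.
def is_index_col(s):
    # Non-string cells (e.g. NaN) and very short strings can never be index columns
    if not isinstance(s, str) or len(s) <= 2:
        return False
    # s[2] is the first character after "# "
    return s.startswith("#") and s[2].isnumeric()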

index_cols = list(sample[sample[0].apply(is_index_col)][0].unique())
index_cols
['# 1. Results reception time.',
 "# 2. MD5 hash of participant's IP address.",
 '# 3. Controller name.',
 '# 4. Order number of item.',
 '# 5. Inner element number.',
 '# 6. Label.',
 '# 7. Latin Square Group.',
 '# 8. PennElementType.',
 '# 9. PennElementName.',
 '# 10. Parameter.',
 '# 11. Value.',
 '# 12. EventTime.',
 '# 13. id.',
 '# 14. lang_comf.',
 '# 15. comf.',
 '# 16. lang.',
 '# 17. parents.',
 '# 18. age_US.',
 '# 19. birth.',
 '# 20. gender.',
 '# 21. age.',
 '# 22. Comments.',
 '# 22. LIST.',
 '# 23. ITEM.',
 '# 24. CONDITION.',
 '# 25. SENTENCE.',
 '# 26. Comments.']
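
Since the number of labels has to match the number of columns, it can help to check whether any column number appears more than once before using the labels as headers. Below is a minimal sketch of such a check using only the standard library. The helper name `duplicate_numbers` is hypothetical (it is not part of pandas or PCIbex), and the sketch assumes every label has the form "# <number>. <name>." as in the list above.

```python
import re
from collections import defaultdict

# Hypothetical helper: group index labels by their column number and
# return only the numbers that carry more than one label.
def duplicate_numbers(labels):
    by_number = defaultdict(list)
    for label in labels:
        # Assumed label format: "# <number>. <name>."
        match = re.match(r"#\s*(\d+)\.\s*(.*?)\.?\s*$", label)
        if match:
            by_number[int(match.group(1))].append(match.group(2))
    return {n: names for n, names in by_number.items() if len(names) > 1}

# A shortened version of the list above, for illustration
example = ['# 21. age.', '# 22. Comments.', '# 22. LIST.', '# 23. ITEM.']
print(duplicate_numbers(example))
# {22: ['Comments', 'LIST']}
```

On the full list, the same call would be `duplicate_numbers(index_cols)`. An empty dictionary means every column number has exactly one label.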
# "Comments" and "LIST" are both labelled #22, presumably because some entries record fewer columns than others. I'll remove "# 22. Comments." manually.
index_cols.remove("# 22. Comments.")
# Remove irrelevant rows: keep only rows whose label (column 5) is one of the trial types
sample = sample[sample[5].isin(["experiment-filler", "experiment-critical", "background"])]
# Replace columns
sample.columns = index_cols
sample.head(3)
# 1. Results reception time. # 2. MD5 hash of participant's IP address. # 3. Controller name. # 4. Order number of item. # 5. Inner element number. # 6. Label. # 7. Latin Square Group. # 8. PennElementType. # 9. PennElementName. # 10. Parameter. ... # 17. parents. # 18. age_US. # 19. birth. # 20. gender. # 21. age. # 22. LIST. # 23. ITEM. # 24. CONDITION. # 25. SENTENCE. # 26. Comments.
65 1676224412 eeba289e3ac0463f13af3ea14e757415 PennController 52.0 0.0 experiment-filler NaN PennController 53 _Trial_ ... undefined undefined undefined undefined undefined 1 9.0 intermediate Which teachers are the administrator firing at... NaN
66 1676224412 eeba289e3ac0463f13af3ea14e757415 PennController 52.0 0.0 experiment-filler NaN PennController 53 _Header_ ... undefined undefined undefined undefined undefined 1 9.0 intermediate Which teachers are the administrator firing at... NaN
67 1676224412 eeba289e3ac0463f13af3ea14e757415 PennController 52.0 0.0 experiment-filler NaN PennController 53 _Header_ ... undefined undefined undefined undefined undefined 1 9.0 intermediate Which teachers are the administrator firing at... NaN

3 rows × 26 columns

Now we won’t have to figure out which column stands for list number, acceptability judgment, etc. Of course, feel free to rename the columns if you think that the default names are unwieldy (like the second column, which is a unique identifier of participants).
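
If the "# 22." style prefixes get in the way, they can be stripped with a regular expression. The following is a minimal sketch, not a definitive recipe: `clean_label` is a hypothetical helper name, and it assumes the labels follow the "# <number>. <name>." pattern shown earlier. Anything that does not match is returned unchanged.

```python
import re

# Hypothetical helper: turn "# 22. LIST." into "LIST".
def clean_label(label):
    # Assumed label format: "# <number>. <name>."
    match = re.match(r"#\s*\d+\.\s*(.*?)\.?\s*$", label)
    # Leave anything that does not look like an index label untouched
    return match.group(1) if match else label

labels = ["# 2. MD5 hash of participant's IP address.", "# 22. LIST.", "# 25. SENTENCE."]
print([clean_label(s) for s in labels])
# ["MD5 hash of participant's IP address", 'LIST', 'SENTENCE']
```

With the DataFrame from this post, the same idea would be `sample.columns = [clean_label(c) for c in sample.columns]`. A single unwieldy name can then be shortened with `sample.rename(columns={"MD5 hash of participant's IP address": "participant"})`, where "participant" is just an illustrative choice.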