How to assign column names to data output by PCIbex

Acknowledgement: I thank my RA Sidney Ma for coming up with the code.

PennController for IBEX (PCIbex) is a great tool for running psycholinguistic experiments such as sentence acceptability studies, but one downside of the platform is that the data it outputs is not very clean. In particular, the raw results file has no header row, so its columns are unnamed.

Importing the file with pandas does assign column names automatically, but they are just integers (0, 1, 2, ...), which are far from descriptive.

import pandas as pd

sample = pd.read_csv('/your/path/results.csv', header=None)
sample.head(3)

	0	1	2	3	4	5	6	7	8	9	...	16	17	18	19	20	21	22	23	24	25
0	# Results on Sun	12 Feb 2023 17:53:32 GMT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	# USER AGENT: Mozilla/5.0 (Macintosh; Intel Ma...	like Gecko) Chrome/109.0.0.0 Safari/537.36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	# Design number was non-random = 0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3 rows × 26 columns

Notice, though, that the results file describes its own columns: at the start of each entry there are comment lines such as "# 1. Results reception time.", one per column. I’ll call these lines index columns. We can use them to label the data, so we don’t need to work out by hand which column is which, nor rename each column ourselves.

The strategy is to collect these index columns and then use them as the new column headers.

# Each index column is a string that starts with "# " followed by a number,
# e.g. "# 1. Results reception time."
# Return True if a value fits that description (NaN and other non-strings do not).
def is_index_col(s):
    return (
        isinstance(s, str)
        and len(s) > 2
        and s.startswith("# ")
        and s[2].isnumeric()
    )

index_cols = list(sample[sample[0].apply(is_index_col)][0].unique())
index_cols

['# 1. Results reception time.',
 "# 2. MD5 hash of participant's IP address.",
 '# 3. Controller name.',
 '# 4. Order number of item.',
 '# 5. Inner element number.',
 '# 6. Label.',
 '# 7. Latin Square Group.',
 '# 8. PennElementType.',
 '# 9. PennElementName.',
 '# 10. Parameter.',
 '# 11. Value.',
 '# 12. EventTime.',
 '# 13. id.',
 '# 14. lang_comf.',
 '# 15. comf.',
 '# 16. lang.',
 '# 17. parents.',
 '# 18. age_US.',
 '# 19. birth.',
 '# 20. gender.',
 '# 21. age.',
 '# 22. Comments.',
 '# 22. LIST.',
 '# 23. ITEM.',
 '# 24. CONDITION.',
 '# 25. SENTENCE.',
 '# 26. Comments.']
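
Two entries in the list above share the number 22. Here is a minimal sketch for catching such clashes programmatically rather than by eye. It assumes every entry has the form "# <number>. <name>."; the helper name `find_number_clashes` and the shortened example list are mine, for illustration only.

```python
import re
from collections import defaultdict

def find_number_clashes(index_cols):
    """Map each column number that occurs more than once to its entries."""
    by_number = defaultdict(list)
    for col in index_cols:
        match = re.match(r"#\s*(\d+)\.", col)  # capture the column number
        if match:
            by_number[int(match.group(1))].append(col)
    return {num: cols for num, cols in by_number.items() if len(cols) > 1}

# Shortened, illustrative list (not the full output above)
example_cols = ["# 21. age.", "# 22. Comments.", "# 22. LIST.", "# 23. ITEM."]
clashes = find_number_clashes(example_cols)
print(clashes)
# {22: ['# 22. Comments.', '# 22. LIST.']}
```

Checking that `len(index_cols)` equals `sample.shape[1]` before assigning the headers is another cheap safeguard, since pandas raises an error when the two lengths differ.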

# For some reason, "Comments" and "LIST" are both numbered 22. I'll remove the "# 22. Comments." entry manually; "# 26. Comments." stays as the last column.
index_cols.remove("# 22. Comments.")

# Keep only the rows whose label (column 5, i.e. "# 6. Label.") marks a trial of interest
sample = sample[sample[5].isin(["experiment-filler", "experiment-critical", "background"])].copy()

# Replace the integer column names with the index columns
sample.columns = index_cols
sample.head(3)

	# 1. Results reception time.	# 2. MD5 hash of participant's IP address.	# 3. Controller name.	# 4. Order number of item.	# 6. Label.	# 7. Latin Square Group.	# 8. PennElementType.	# 9. PennElementName.	# 10. Parameter.	...	# 17. parents.	# 18. age_US.	# 19. birth.	# 20. gender.	# 21. age.	# 22. LIST.	# 23. ITEM.	# 24. CONDITION.	# 25. SENTENCE.	# 26. Comments.
65	1676224412	eeba289e3ac0463f13af3ea14e757415	PennController	52.0	experiment-filler	NaN	PennController	53	_Trial_	...	undefined	undefined	undefined	undefined	undefined	1	9.0	intermediate	Which teachers are the administrator firing at...	NaN
66	1676224412	eeba289e3ac0463f13af3ea14e757415	PennController	52.0	experiment-filler	NaN	PennController	53	_Header_	...	undefined	undefined	undefined	undefined	undefined	1	9.0	intermediate	Which teachers are the administrator firing at...	NaN
67	1676224412	eeba289e3ac0463f13af3ea14e757415	PennController	52.0	experiment-filler	NaN	PennController	53	_Header_	...	undefined	undefined	undefined	undefined	undefined	1	9.0	intermediate	Which teachers are the administrator firing at...	NaN

3 rows × 26 columns

Now we no longer have to work out which column holds the list number, the acceptability judgment, and so on. Of course, feel free to rename any columns whose default names are unwieldy, such as the second one ("# 2. MD5 hash of participant's IP address."), which serves as a unique identifier of participants.
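
If you do want shorter names, one option is to strip the "# <number>." prefix and the trailing period from every header in one go. The sketch below assumes all headers follow that pattern; the helper `clean_name` is hypothetical, and the short `headers` list stands in for the real column names.

```python
import re

def clean_name(col):
    """Turn '# 6. Label.' into 'Label'."""
    name = re.sub(r"^#\s*\d+\.\s*", "", col)  # drop the leading "# <number>. "
    return name.rstrip(".").strip()            # drop the trailing period

# A few of the headers from above, used here as an illustration
headers = [
    "# 2. MD5 hash of participant's IP address.",
    "# 6. Label.",
    "# 24. CONDITION.",
]
cleaned = [clean_name(h) for h in headers]
print(cleaned)
# ["MD5 hash of participant's IP address", 'Label', 'CONDITION']
```

Because `DataFrame.rename` accepts a function for its `columns` argument, `sample = sample.rename(columns=clean_name)` would apply the same cleanup to the whole table. To rename a single column, pass a dict instead, mapping the old name to a new one of your choosing.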