Linguistics PhD
import pandas as pd
sample = pd.read_csv('./sampledata1.csv')
sample.head(3)
| Unnamed: 0 | Movement | Island_Type | Island | Distance | Item | Sentence | Subj_id | List | Score |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | WH | whe | non | sh | 1 | Who thinks that Paul stole the necklace? | 31WPPC | 1 | 6 |
1 | 1 | WH | whe | non | sh | 2 | Who thinks that Matt chased the bus? | 31WPPC | 1 | 2 |
2 | 2 | WH | whe | non | sh | 3 | Who thinks that Tom sold the television? | 31WPPC | 1 | 3 |
sample['Subj_id'].unique()
array(['31WPPC', 'MLOT0C', 'QUCYBY', '3HM9R4', 'TNZ93A', 'RE7119',
'IKH3NF', '0R04SW', 'S7VOS9', 'JO1B7Q', '0HY4IC', 'MNSV2I',
'IOEK50', 'LXP23M', '7NXUBG', '4EQFWR'], dtype=object)
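Let’s simplify the subject ids in this dataset by replacing them with sequential integers. One way to do this is to group the rows by Subj_id and call ngroup(), which assigns a number to each group; with sort=False, the groups are numbered in order of first appearance, starting from 0.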
sample['Subj_id'] = sample.groupby('Subj_id', sort=False).ngroup()
sample['Subj_id'] = sample['Subj_id'] + 1  # add 1 to each id if you do not want the first id to be 0
sample['Subj_id'].unique()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
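If you would like to keep the original ids around instead of overwriting them, a possible alternative is pd.factorize, which also numbers values in order of first appearance and returns the original values alongside the codes. The sketch below is only an illustration: the toy DataFrame, the Subj_num column, and the id_lookup dictionary are made-up names for this example and are not part of the dataset above.

```python
import pandas as pd

# Toy data standing in for the Subj_id column (hypothetical, for illustration only).
toy = pd.DataFrame({'Subj_id': ['31WPPC', '31WPPC', 'MLOT0C', 'QUCYBY', 'MLOT0C']})

# factorize returns integer codes, numbered from 0 in order of first appearance,
# together with the unique original values those codes refer to.
codes, originals = pd.factorize(toy['Subj_id'])

# Store the new ids in a separate column, shifted so numbering starts at 1.
toy['Subj_num'] = codes + 1

# Lookup table from each new number back to the original id.
id_lookup = dict(enumerate(originals, start=1))

print(toy)
print(id_lookup)
```

Writing the numbers to a new column leaves Subj_id untouched, and the lookup table lets you trace any number back to the participant's original id later on.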