Using `iterrows()` in Pandas is often considered a bad idea for performance reasons (for instance, `iterrows()` converts each row to an index and Series pair, which slows down execution).
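To make the point about index and Series pairs concrete, here is a minimal sketch. The two-row `demo` frame and its `seats` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
demo = pd.DataFrame({'subscriber_id': ['a1', 'b2'], 'seats': [1, 3]})

# iterrows() yields one (index, Series) tuple per row.
pairs = list(demo.iterrows())
first_index, first_row = pairs[0]

# A new Series object is built for every row, which is part of the overhead.
print(type(first_row).__name__)
print(first_index, first_row['subscriber_id'])
```

Building a Series for every row is cheap on a tiny frame like this one, but the cost adds up as the number of rows grows.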
I recently ran into a situation where I needed to (i) check whether the value in a certain column (“subscription_type”) is NaN for each row, (ii) if so, grab the parent subscriber id from another column of the same row (“is_plus_1_of”), and (iii) replace the NaN with the subscription type of the parent subscriber. Here’s how I managed to do this without relying on `iterrows()`: I wrote a function and called it for each row with `apply()`.
import pandas as pd
import numpy as np
df = pd.DataFrame({'subscriber_id': ['21daf8cd', '3393eee6', 'e0c9b302', '1f0c4dbc', '7c49a7e8'],
                   'subscription_type': [np.nan, 'standard', np.nan, 'pro', 'standard'],
                   'is_plus_1_of': ['1f0c4dbc', np.nan, '3393eee6', np.nan, np.nan],
                   'account_create_date': ['12/2/2023', '12/4/2023', '12/8/2023', '12/10/2023', '12/15/2023']})
def mark_plus1_type(row):
    if row['subscription_type'] != row['subscription_type']:  # check for NaN (NaN is never equal to itself)
        # look up the parent subscriber's row and take its subscription type
        parent_type = df[df['subscriber_id'] == row['is_plus_1_of']]['subscription_type'].iloc[0]
        return parent_type
    else:
        return row['subscription_type']
df['subscription_type'] = df.apply(mark_plus1_type, axis=1)
df
| | subscriber_id | subscription_type | is_plus_1_of | account_create_date |
|---|---|---|---|---|
| 0 | 21daf8cd | pro | 1f0c4dbc | 12/2/2023 |
| 1 | 3393eee6 | standard | NaN | 12/4/2023 |
| 2 | e0c9b302 | standard | 3393eee6 | 12/8/2023 |
| 3 | 1f0c4dbc | pro | NaN | 12/10/2023 |
| 4 | 7c49a7e8 | standard | NaN | 12/15/2023 |
You can see now that the subscription types of the first and third rows reflect those of their parent subscribers.
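For larger frames, the same fill can probably be done without calling a Python function per row. The following is a sketch of a different, vectorized technique (not the `apply()` approach above), assuming that every `subscriber_id` is unique and that only one level of parent lookup is needed. The names `type_by_id` and `parent_type` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'subscriber_id': ['21daf8cd', '3393eee6', 'e0c9b302', '1f0c4dbc', '7c49a7e8'],
                   'subscription_type': [np.nan, 'standard', np.nan, 'pro', 'standard'],
                   'is_plus_1_of': ['1f0c4dbc', np.nan, '3393eee6', np.nan, np.nan]})

# Lookup Series: subscriber_id -> subscription_type.
# Assumption: subscriber ids are unique, otherwise map() cannot use this as a lookup.
type_by_id = df.set_index('subscriber_id')['subscription_type']

# For each row, the parent's subscription type (NaN where there is no parent).
parent_type = df['is_plus_1_of'].map(type_by_id)

# Fill only the missing subscription types; existing values are left untouched.
df['subscription_type'] = df['subscription_type'].fillna(parent_type)

print(df['subscription_type'].tolist())
```

On the sample data this gives the same column as the `apply()` version. If a parent's own subscription type were missing, the child's value would stay NaN, so chained plus-ones would need an extra pass.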