Merging Pandas data frames is covered extensively in a StackOverflow article Pandas Merging 101. However, my experience of grading data science take-home tests leads me to believe that left joins remain to be a challenge for many people. In this post, I show how to properly handle cases when the right table (data frame) in a Pandas left join contains nulls.
Let’s consider a scenario where we have a table transactions
containing transactions performed by some users and a table users
containing some user properties, for example, their favorite color. We want to annotate the transactions with the users’ properties. Here are the data frames:
import numpy as np
import pandas as pd
np.random.seed(0)
# transactions
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
'user_id': ['Peter', 'John', 'John', 'Anna'],
'value': np.random.randn(4),
})
# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
'favorite_color': ['blue', 'blue', 'red', np.NaN],
})
Note that Peter is not in the users
table and Anna
doesn’t have a favorite color.
>>> left
transaction_id user_id value
0 A Peter 1.867558
1 B John -0.977278
2 C John 0.950088
3 D Anna -0.151357
>>> right
user_id favorite_color
0 Paul blue
1 Mary blue
2 John red
3 Anna NaN
Adding the user’s favorite color to the transaction table seems straightforward using a left join on the user id:
>>> left.merge(right, on='user_id', how='left')
transaction_id user_id value favorite_color
0 A Peter 1.867558 NaN
1 B John -0.977278 red
2 C John 0.950088 red
3 D Anna -0.151357 NaN
We see that Peter and Anna have NaN
s in the favorite_color
column. However, the missing values are there for two different reasons: Peter’s record didn’t have a match in the users
table, while Anna didn’t have a value for the favorite color. In some cases, this subtle difference is important. For example, it can be critical to understanding the data during initial exploration and to improving data quality.
Here are two simple methods to track the differences in why a value is missing in the result of a left join. The first is provided directly by the merge
function through the indicator
parameter. When set to True
, the resulting data frame has an additional column _merge
:
>>> left.merge(right, on='user_id', how='left', indicator=True)
transaction_id user_id value favorite_color _merge
0 A Peter 1.867558 NaN left_only
1 B John -0.977278 red both
2 C John 0.950088 red both
3 D Anna -0.151357 NaN both
The second method is related to how it would be done in the SQL world and explicitly adds a column representing the user_id
in the right table. We note that if the join columns in the two tables have different names, both columns appear in the resulting data frame, so we rename the user_id
column in the users
table before merging.
>>> left.merge(right.rename({'user_id': 'user_id_r'}, axis=1),
left_on='user_id', right_on='user_id_r', how='left')
transaction_id user_id value user_id_r favorite_color
0 A Peter 1.867558 NaN NaN
1 B John -0.977278 John red
2 C John 0.950088 John red
3 D Anna -0.151357 Anna NaN
An equivalent SQL query is
select
t.transaction_id
, t.user_id
, t.value
, u.user_id as user_id_r
, u.favorite_color
from
transactions t
left join
users u
on t.user_id = u.user_id
;
In conclusion, adding an extra column that indicates whether there was a match in the Pandas left join allows us to subsequently treat the missing values for the favorite color differently depending on whether the user was known but didn’t have a favorite color or the user was missing from the users
table.
Photo by Ilona Froehlich on Unsplash.