1. Assign the same data type for the merge keys
Given two pandas.DataFrames:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# a_df
# amount
# CODE
# 20262 3.0
# 20270 1.0
# 20280 4.0
# 20281 1.0
# 20299 5.0
# b_df
# CODE limit
# 0 20299 4
# 1 20262 5
# 2 20270 6
# 3 20280 4
# 4 20281 3
# 5 30001 3
|
When I try to do outer join with pandas.merge
function, it fails occasionaly.
1
2
3
4
5
6
7
8
9
|
ab_df = pd.merge(
a_df,
b_df,
how='right',
left_index=True,
right_on=['CODE']
).set_index(
'CODE'
)
|
ab_df, a merged dataframe
1
2
3
4
5
6
7
8
|
# ab_df
# amount1 amount2
# CODE
# 20299 5.0 4
# 20262 3.0 5
# 20270 1.0 6
# 20280 4.0 4
# 20281 1.0 3
|
As this is right outer join, above ab_df is missing below expected record.
1
2
3
4
|
# ab_df
# amount1 amount2
# CODE
# 30001 NaN 3
|
Why this happens? In this case, it’s because the type of column in Pandas DataFrame seems to be different each other.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
a_df.dtypes
# amount1 float64
# dtype: object
b_df.dtypes
# CODE object
# amount2 int64
ab_df.dtypes
# amount1 float64
# amount2 int64
# dtype: object
a_df.index.dtype
# object
b_df.index.dtype
# int64
ab_df.index.dtype
# object
|
While it looks the merge keys' data type are same, the type “object” is not explicit.
The class type of object depends on what and how you retrieved the data frame.
- a_df’s key: index (object, the name of index is ‘CODE’)
- b_df’s key: CODE (object)
In my case, assign the same data type for the merge keys works.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
a_df.index = a_Df.index.astype(str)
b_df[['CODE']] = b_df[['CODE']].astype(str)
ab_df = pd.merge(
a_df,
b_df,
how='right',
left_index=True,
right_on=['CODE']
).set_index(
'CODE'
)
# ab_df
# amount1 amount2
# CODE
# 20299 5.0 4
# 20262 3.0 5
# 20270 1.0 6
# 20280 4.0 4
# 20281 1.0 3
# 30001 NaN 3
|
2. Deal with NaN values
Now I want to sumup amount1 and amount2 in ab_df, then add the column (a pandas Series object) to ab_df.
1
2
3
4
5
6
7
8
9
10
11
|
ab_df['sum] = ab_df['amount1'] + ab_df['amount2']
# ab_df
# amount1 amount2 sum
# CODE
# 20299 5.0 4 9.0
# 20262 3.0 5 8.0
# 20270 1.0 6 7.0
# 20280 4.0 4 8.0
# 20281 1.0 3 4.0
# 30001 NaN 3 NaN # This row is wrong.
|
The last column, I expected 3 at column sum
. However, the equation includes NaN value. So, it’s necessary to fill NaN values before calculation.
1
2
3
4
5
6
7
8
9
10
11
12
|
ab_df[['amount1']]=ab_df[['amount1']].fillna(0.0)
ab_df['sum] = ab_df['amount1'] + ab_df['amount2']
# ab_df
# amount1 amount2 sum
# CODE
# 20299 5.0 4 9.0
# 20262 3.0 5 8.0
# 20270 1.0 6 7.0
# 20280 4.0 4 8.0
# 20281 1.0 3 4.0
# 30001 0.0 3 3.0
|
fillna(0)
does not work because the data type of ‘amount1’ is float64. Instead, fillna(0.0
)` worked.
After filling NaN data point, the calculation over the Pandas.DataFrame works fine.