Why the dataframes cannot be merged?

2022-11-05 575 words 3 minutes

Contents

1. Assign the same data type for the merge keys

Given two pandas.DataFrames:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


# a_df
#              amount
# CODE
# 20262           3.0
# 20270           1.0
# 20280           4.0
# 20281           1.0
# 20299           5.0

# b_df
#             CODE        limit
# 0           20299           4
# 1           20262           5
# 2           20270           6
# 3           20280           4
# 4           20281           3
# 5           30001           3

When I try to do outer join with pandas.merge function, it fails occasionaly.

1
2
3
4
5
6
7
8
9


ab_df = pd.merge(
    a_df,
    b_df,
    how='right', 
    left_index=True, 
    right_on=['CODE']
).set_index(
    'CODE'
)

ab_df, a merged dataframe

1
2
3
4
5
6
7
8


# ab_df
#           amount1     amount2
# CODE
# 20299         5.0           4
# 20262         3.0           5
# 20270         1.0           6
# 20280         4.0           4
# 20281         1.0           3

As this is right outer join, above ab_df is missing below expected record.

1
2
3
4


# ab_df
#           amount1     amount2
# CODE
# 30001         NaN           3

Why this happens? In this case, it’s because the type of column in Pandas DataFrame seems to be different each other.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


a_df.dtypes
# amount1        float64
# dtype: object

b_df.dtypes
# CODE           object
# amount2        int64

ab_df.dtypes
# amount1        float64
# amount2        int64
# dtype: object


a_df.index.dtype
# object

b_df.index.dtype
# int64

ab_df.index.dtype
# object

While it looks the merge keys' data type are same, the type “object” is not explicit.
The class type of object depends on what and how you retrieved the data frame.

a_df’s key: index (object, the name of index is ‘CODE’)
b_df’s key: CODE (object)

In my case, assign the same data type for the merge keys works.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


a_df.index = a_Df.index.astype(str)
b_df[['CODE']] = b_df[['CODE']].astype(str)

ab_df = pd.merge(
    a_df,
    b_df,
    how='right', 
    left_index=True, 
    right_on=['CODE']
).set_index(
    'CODE'
)

# ab_df
#           amount1     amount2
# CODE
# 20299         5.0           4
# 20262         3.0           5
# 20270         1.0           6
# 20280         4.0           4
# 20281         1.0           3
# 30001         NaN           3

2. Deal with NaN values

Now I want to sumup amount1 and amount2 in ab_df, then add the column (a pandas Series object) to ab_df.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


ab_df['sum] = ab_df['amount1'] + ab_df['amount2']

# ab_df
#           amount1     amount2         sum
# CODE
# 20299         5.0           4         9.0
# 20262         3.0           5         8.0
# 20270         1.0           6         7.0
# 20280         4.0           4         8.0
# 20281         1.0           3         4.0
# 30001         NaN           3         NaN # This row is wrong. 

The last column, I expected 3 at column sum. However, the equation includes NaN value. So, it’s necessary to fill NaN values before calculation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


ab_df[['amount1']]=ab_df[['amount1']].fillna(0.0)
ab_df['sum] = ab_df['amount1'] + ab_df['amount2']

# ab_df
#           amount1     amount2         sum
# CODE
# 20299         5.0           4         9.0
# 20262         3.0           5         8.0
# 20270         1.0           6         7.0
# 20280         4.0           4         8.0
# 20281         1.0           3         4.0
# 30001         0.0           3         3.0 

fillna(0) does not work because the data type of ‘amount1’ is float64. Instead, fillna(0.0)` worked. After filling NaN data point, the calculation over the Pandas.DataFrame works fine.