fbpx

[Detailed] concat, join, merge dataframes in pandas & python

How to use pandas and python to concatenate dataframes

Pandas can be used to concatenate dataframes and series (columns). Concatenate simply means joining 2 or more things together. Naturally, the concatenation process is a row-wise operation, but you can still specify which axis the operation should occur on.

Remember, axis = 0 means rows and axis = 1 means columns.

In this tutorial, I concatenated the df dataset with the df2 dataset.

Let’s start by creating a df dataframe and df2 dataframe

# how to concatenate data in pandas and python
# create dummy dataframe first. 

df = pd.DataFrame({
    "managers": ["amy", "john", "whois", "me"],
    "employees": ["ok", "now", "harry", "gift"],
    "tips": [65,47,89,56],
    "salary": [96000,50000,23000,80000]
})

df2 = pd.DataFrame({
    "food": ['apple', "banana", "kiwi", "mango"],
    "employees": ["ok", "now", "harry", "gift"],
    "tips": [65,47,89,56],
    "price": [6,9,4,7]
})

This is what the df and df2 datasets looks like as well as their shape.

# to get the results below, simply do

print(df.shape)
df.head()

# and in a new code block, do

print(df2.shape)
df2.head()

This is how to use pandas Concat function to join the df and df2 dataframes

# how to concatenate data in pandas and python
df_df2 = pd.concat([df, df2])

# to get the result below, simply do

print(df_df2.shape)
df_df2

After using pandas concatenation function, this is what the dataset looks like now and this is the current shape. As you can see, pandas concat did a combination of the 2 datasets.

In this resulting dataframe, it kept everything from df dataframe and added everything from df2 dataframe. Any missing values were replaced with NaN’s. And as you can see, the resulting shape of the output dataframe is an addition of the input dataframe shapes.

If you want to learn more about pandas merge, concat, and join, I recommend you click here and check out this book. “Python Data Science Handbook: Essential Tools for Working with Data”

How to append dataframes using pandas and python

Similar to pandas concatenation, you can also append two dataframes together. If we use the exact same dataset used above and append them together, the results will be the exact same.

This is how to append dataframes using pandas and python

# appending df and df2 dataframes together. 
# Similar to dataframe concatenation. 

df_df2_2 = df.append(df2)

# to get the results below, simply do

print(df_df2_2.shape)
df_df2_2

And this is what the results look like

As you can see, pandas concatenation and append function does the same job and gives you the same results.

How to use inner merge with pandas and python

When you use pandas merge function, it recognizes column names that are the same on the two dataframe inputs. It uses this common column as the key to merge the two dataframes together.

Let’s create a dummy dataframe to demonstrate pandas inner merge.

# dataframe to demonstrate pandas inner merge

ls = pd.DataFrame({
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok'],
    "Tips": [70, 90, 34, 65]
})

ot = pd.DataFrame({
    "Managers": ['ok', 'why', 'd', 'f', '8'],
    "Salary":[40000, 60000, 20000, 150000, 456856],
    "Driver":["noway", 'josh', 'Ada', 'Ok', 'dry']
})

This is what the 2 dataframes look like.

Using pandas merge function to merge ls dataframe and ot dataframe together without specifying which column the merge should occur on.

# merge ls and ot without specifying which column to use for the merge. 

ls_ot = pd.merge(ls, ot)
# print the shape of the dataframe
print(ls_ot.shape)
# show the dataframe
ls_ot

This is what the resulting dataframe looks like.

In this example, we do pandas merge function but without providing what the key should be. Since “Driver” column is present in the two dataframes we are merging, pandas automatically merged them based on “Driver” using inner merge. Inner merge is the default key type for pandas merge function.

You can specify which columns should be used for the merge. The columns to be used for the merge can be specified using the “on” keyword. The “on” keyword can take a column name or list of column names.

The resulting dataframe is a combination of ls and ot dataframes and pandas used “Driver” as the merging column.

How to do inner merge and use “on” keyword with pandas merge

# merge ls and ot without specifying which column to use for the merge. 

ls_ot_2 = pd.merge(ls, ot, how="inner", on="Driver")

print(ls_ot_2.shape)

ls_ot_2

This is the exact same merge done above, but using “on” keyword to specify the column to do the merge on. As you can see, the resulting dataframe is the same. This is what the result looks like.

How exactly does pandas inner merge work?

The result of inner merge is the intersection of the two dataframes. It will look at dataset A and see if there are matching column names in dataset B. If there are matching column names, it will use them to do the inner merge. If you provide a specific column name, it will use the provided column name assuming that column name is present in both datasets.

The resulting dataframe will find rows that match up between the two dataframes. Any rows that is not present in datasets A and B will not be present in the resulting dataframe.

In the example above, “Ada” and “Ok” are the only rows from the “Driver” column that matches between dataset A and B. “Josh” row was not included in the final result above because in ls dataframe, “Josh” is spelled with “J” but in ot dataframe, “josh” is spelled with “j”


If you want to learn more about pandas merge, concat, and join, I recommend you click here and check out this book. “Python Data Science Handbook: Essential Tools for Working with Data”

How to do outer merge with pandas and python

We will be using the same dataframe used above for inner merge to demonstrate outer merge

# this is the datarame we will be using for "outer merge" example

ls = pd.DataFrame({
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok'],
    "Tips": [70, 90, 34, 65]
})

ot = pd.DataFrame({
    "Managers": ['ok', 'why', 'd', 'f', '8'],
    "Salary":[40000, 60000, 20000, 150000, 456856],
    "Driver":["noway", 'josh', 'Ada', 'Ok', 'dry']
})

# outer merge without "on" keyword

outer = pd.merge(ls, ot, how="outer")

# print shape
print(outer.shape)

# show result
outer

This is what the resulting dataframe looks like.

Now, let’s do outer merge while specifying which column to use for the merge using the “On” keyword.

How to do pandas outer merge using the “On” Keyword

# outer merge with "on" keyword

outer2 = pd.merge(ls, ot, how="outer", on="Driver")

# print shape
print(outer2.shape)

# show result
outer2

This is what the resulting dataframe looks like.

So, How does Outer merge join work?

Outer join merge unites the two dataframes and wherever the dataframes don’t match, it will replace it with NaN. In this case, the left dataset (ls) has “Amy and Josh” but the right dataset doesn’t , so where Amy and Josh appeared, it put in NaN for them in the right dataset (ot).

How to merge 2 dataframes with duplicate column names

When you merge two dataframes with the duplicate column names, pandas automatically assigns x and y endings to the column names in order to create unique column names.

# how to merge dataframe with duplicated column names in pandas and python

# create dummy dataframe first. 

df3 = pd.DataFrame({
    "managers": ["amy", "john", "whois", "me"],
    "employees": ["ok", "now", "harry", "gift"],
    "tips": [65,47,89,56],
    "salary": [96000,50000,23000,80000]
})

df4 = pd.DataFrame({
    "managers": ['apple', "banana", "kiwi", "mango"],
    "employees": ["ok", "now", "harry", "gift"],
    "tips": [65,47,89,56],
    "price": [6,9,4,7]
})

This is the resulting dataframes

# outer merge with df3 and df4 to showcase overlapping names pandas merge. 

# when merging 2 dataframes with same column names, pandas automatically assigns x_y endings to the resulting dataframe. 

# this is done so that each column will have a unique column name. 

# pandas merge df3 and df4
df3_df4 = pd.merge(df3, df4, how="outer", on="employees")

# print the shape of the resulting dataframe
print(df3_df4.shape)

# show the resulting dataframe
df3_df4

This is the result of merging 2 dataframes with duplicate column names.

As you can see, the column names for tips, and managers has _x and _y at the end of the column names. This is because, pandas is trying to differentiate between the two columns. _x = first dataframe (“df3” left dataframe), _y = second dataframe (“df4” right dataframe)

You can customize and change the _x and _y ending by using specifying the pandas merge suffix to use.

# add a customized suffix to resulting dataframe

df3_df4_2 = pd.merge(df3, df4, how="outer", on="employees", suffixes=["_L", "_R"])

print(df3_df4_2.shape)

df3_df4_2

You can drop the duplicate columns by using pandas.drop and then specifying which columns to drop.

# drop duplicate columns from resulting dataframe

df3_df4_3 = pd.merge(df3, df4, how="outer", on="employees").drop(["managers_x", "tips_x"], axis=1)

print(df3_df4_3.shape)

df3_df4_3

How to merge multiple dataframes that has no matching column names

The goal of this super merge is to merge multiple datasets together that doesn’t have a column  in common.

Let’s start by creating some dataframes to work with.

# dummy dataframes for showing how to merge dataframes with no columns in common. Merging multiple dataframes together. 

rs = pd.DataFrame({
    "Managers": ["Good", "Friend", "Weird", "Freak"],
    "Salary": [40000, 60000, 20000, 150000],
    "movers": ['Amy', 'Josh', 'Ada', 'Ok']
})

ms = pd.DataFrame({
    "Managers": ["Good", "Friend", "Weird", "Freak"],
    "Money":[40000, 60000, 20000, 150000],
    "Employees": ['Amy', 'Josh', 'Ada', 'Ok']
})

ot = pd.DataFrame({
    "Talkers": ['ok', 'why', 'd', 'f', '8'],
    "Money":[40000, 60000, 20000, 150000, 456856],
    "Driver":["noway", 'josh', 'Ada', 'Ok', 'dry']
})

In this pandas merge, I am merging the rs dataset with the ms dataset and the ot dataset. This is what the datasets look like.

The primary objective is to merge the ot dataset with the rs dataset, but ot dataset and rs dataset does not have any columns in common. So, we have to use the ms dataset as an intermediary dataset to merge ot dataset with rs dataset.

This is how to merge multiple dataframes together even if they don’t have any columns in common

The goal is to find another dataframe that has a column that is common between all of them. For example if you are trying to merge Dataset A and dataset B, but there is no common column between them. If dataset C has a common column with dataset A and dataset B, you can use dataset C as an intermediary between dataset A and dataset B.

# merge rs with ms with ot

# pandas automatically finds a common column between the datasets and does the merge on it. 

super_merge = (rs.merge(ms).merge(ot))

print(super_merge.shape)

super_merge

The way this dataset merge is working is,

  • rs + ms has “Managers” in common
  • ms + ot has “Money” in common

So, pandas finds common column between all the datasets and use that to perform the merge function.

This is the result of the merge

How to use left join to merge dataframes with pandas and python

Let’s start by creating a dataframes to demonstrate left and right joins.

# create a dataframe to demonstrate left and right joins. 

left = pd.DataFrame({
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok'],
    "Tips": [65,47,89,56]
})

right = pd.DataFrame({
    "Managers": ["Good", "Friend", "Weird", "Freak"],
    "Salary":[40000, 60000, 20000, 150000],
    "Driver": ['Why', 'No way', 'Ada', 'Ok']
})

This is what the resulting dataframes look like.

# to get the results below. 

# show left dataframe
print(left.shape)
# show left table
left

# show right dataframe
print(right.shape)
# show right table
right

code showing how to do left join merge in pandas

# demonstrating left join

left_right_join = pd.merge(left, right, how='left', on='Driver')

# show the shape of the merged dataframe
print(left_right_join.shape)

# show the resulting dataframe
left_right_join

This is what the resulting dataframe looks like from doing left join pandas merge

This is the result of doing left join in pandas

How does join on left work?

This how=left merge function returns every column and rows on the dataset that is on the left side of the merge function. Then it joins it to the dataset on the right side of the merge function. Any rows on the right side of the merge function that doesn’t match the dataset on the left will be replaced by NaN.

Any column on the right dataset that doesn’t match a column on the left dataset will also be dropped from the final merged dataset.


If you want to learn more about pandas merge, concat, and join, I recommend you click here and check out this book. “Python Data Science Handbook: Essential Tools for Working with Data”

How to use right join to merge dataframes with pandas and python

Code showing how to do right join merge in pandas​

I will use the exact same dataset used above to demonstrate left join merge to demonstrate right join merge

# demonstrating right join

left_right_join_2 = pd.merge(left, right, how='right', on='Driver')

# show the shape of the merged dataframe
print(left_right_join_2.shape)

# show the resulting dataframe
left_right_join_2

This is the resulting dataframe from doing right join pandas merge

This is the result of doing right join in pandas

How does join on right work?

This how=right merge function returns every column and rows on the dataset that is on the right side of the merge function. Then it joins it to the dataset on the left side of the merge function. Any rows on the left side of the merge function that doesn’t match the dataset on the right will be replaced by NaN.

Any column on the left dataset that doesn’t match a column on the right dataset will also be dropped from the final merged dataset.

How to use left_on and right_on to merge dataframes with pandas and python

Let’s create a dataframe to do right_on and left_on pandas merge

# create a dataframe to demonstrate left_on and right_on joins/merge. 

left_2 = pd.DataFrame({
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok'],
    "Tips": [65,47,89,56]
})

right_2 = pd.DataFrame({
    "Managers": ["Good", "Friend", "Weird", "Freak"],
    "Salary":[40000, 60000, 20000, 150000],
    "Employees": ['Amy', 'Josh', 'Ada', 'Ok']
})

This is the resulting dataframes

How to do left_on and right_on join using pandas and python

# demonstrating "left_on" and "right_on" join

left_right_on = pd.merge(left_2, right_2, left_on="Driver", right_on="Employees")

# show the shape of the merged dataframe
print(left_right_on.shape)

# show the resulting dataframe
left_right_on

This is the result of doing left_on and right_on join using pandas merge and python

This is the result of doing left_on and right_on merge with pandas

How does left_on and right_on pandas merge keyword work?

The left_on and right_on keyword is most essential when you want to merge 2 dataframes that have the same data in a column, but different names for the column.

For example, in the dataframe merge above, “Driver” and “Employee” data has the exact same data, but the columns have different names. So, left_on and right_on is used to specify which columns to use as the merging key.

How to use left_on and right_on in merge while dropping duplicate data

As you might have noticed, when you do left_on and right_on pandas merge, it merges everything and you end up with duplicate data. In this case, you end up with a `Driver` and `Employees` columns and they both have the exact same data.

You never want duplicate data in your dataset.

So, this is how you do left_on` and `right_on merge while dropping duplicate data.

# use left_on and right_on in merge while dropping duplicate data

left_right_on_2 = pd.merge(left_2, right_2, left_on="Driver", right_on="Employees").drop("Employees", axis=1)

# show the shape of the merged dataframe
print(left_right_on_2.shape)

# show the resulting dataframe
left_right_on_2

This is the resulting dataframe after dropping duplicate datafrom left_on and right_on merge

The result of doing left_on and right_on merge while dropping duplicate data

How to use right_index and left_index to merge datasets in pandas and python

We will start by creating a dataframe for doing right_index and left_index pandas merge.

# using join on right_index and left_index

# creating a dataframe to do right_index and left_index and demonstrate join too. 

ls = pd.DataFrame({
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok'],
    "Tips": [70, 90, 34, 65]
})

rs = pd.DataFrame({
    "Managers": ["Good", "Friend", "Weird", "Freak"],
    "Salary": [40000, 60000, 20000, 150000],
    "Driver": ['Amy', 'Josh', 'Ada', 'Ok']
})

How to set dataframe index using pandas and python

In order to demonstrate merging using right_index and left_index, I will set one of the columns to become the index.

So, if you have a dataset that has one of the columns as the index, this will teach you how to merge them. But first, let’s set the columns to become the index.

# setting one column to become the index

lsb = ls.set_index("Driver")
rsb = rs.set_index("Driver")

This is what the dataset looks like after the index has been reset to use the “Driver” Column.

This is how to use dataframe index to merge datasets with pandas and python

# using left_index and right_index to merge dataframes with pandas and python

lsb_rsb = pd.merge(lsb, rsb, left_index=True, right_index=True)

# print the shape of the merged dataframe. 
print (lsb_rsb.shape)

# show resulting dataframe
lsb_rsb

This is the result of the left_index and right_index merge.

The result of merging using left_index and right_index

How does left_index and right_index work with pandas?

The left_index and right_index process is just merging by using the index as the key.

How to use “Join” to merge 2 dataframes in pandas and python

Alternative to using pandas merge on index, you can also use python join function. So, python join function gives you the same result as pandas merge on index function.

This is how to use python join function to join dataframes together

# using index to merge can also be accomplished by join

lsb_join = lsb.join(rsb)

# print the shape of the merged dataframe. 
print (lsb_join.shape)

# show resulting dataframe
lsb_join

This is the result of using python “join” function

How does “Join()” function work?

Dataframes implement a join() function that performs a merge that joins the dataframes based on the index.

How to combine multiple merge functions in one dataframe operation.

Now that you have learned different merge techniques, don’t feel like you have to use them individually. You can combine different joining and merging techniques to combine your dataframe using pandas and python.

Here is an example of how to combine different merge types and functions in one code to join dataframes togehter.

In this example I am combining `right_on` and `left_index` to merge various dataframes together.

Here is the ms dataframe used in this merge. The rsb dataframe used in the merge is the same dataframe above.

# ms dataframe used in combo merge below

ms = pd.DataFrame({
    "Workers": ["Good", "Friend", "Weird", "Freak"],
    "Money":[40000, 60000, 20000, 150000],
    "Employees": ['Amy', 'Josh', 'Ada', 'Ok']
})

# print the shape of ms
print(ms.shape)

# show ms dataframe
ms

This is what the ms dataframe looks like

This is how to combine different pandas merge funtions to get the result you want.

# combining different merge functions. 
# using left_index and right_on to merge ms and rsb together. 

Final_merge = pd.merge(rsb, ms, left_index=True, right_on="Employees")

# showing the shape of resulting dataframe
print(Final_merge.shape)

# show the resulting dataframe. 
Final_merge

This is the result of this merge function.

All these merge options work with multiple indices and columns. To learn more about merge in pandas, I recommend you check out the book below.


If you want to learn more about pandas merge, concat, and join, I recommend you click here and check out this book. “Python Data Science Handbook: Essential Tools for Working with Data”


Click here to download the python notebook used in this tutorial

Leave a Comment

Scroll to Top