Joining data in R using Dplyr

Working with data, Joining is the common operation.

Joining means combine i.e combine the data from two or more than two different sources on the basis of some conditions.

For performing such type of operation in R dplyr is the best option for doing so.

During this post, we will these key points.

  • Types of Joins in R
  • Syntax
  • Joining on DataFrame
  • Joining on tables

Types of Joins

There are six types of Joins in R :

  1. Inner Join (inner_join)
  2. Left Join (left_join)
  3. Right Join (right_join)
  4. Full Join (full_join)
  5. Semi Join (semi_join)
  6. Anti Join (anti_join)

Syntax




# Syntax of Joining in R

Join_type(x,y,by=condition)

/**
  * x: dataframe1/table1
  * y: dataframe2/table2
*/

Joins are basically applied on tables and in case of R files are to be considered as tables.

For a better understanding of joins, we are taking two files 

  1. Product_category_name.csv (product_category_name, product_category_name_english)
  2. product_dataset.csv(product_id,product_category_name,product_name_lenght, product_description_lenght, product_photos_qty, product_weight_g, product_length_cm, product_height_cm, product_width_cm)

You can download these files for your practice my GitHub account

As we know during applying joins on two different datasets or tables we need a common field on the basis of that we can be able to apply a join.

So, in this case, both files contain product_category_name so on the basis of that we can apply to join.

For using a CSV file as a table we are using command 




# read csv file

read.csv(file_name)

So for applying joins on these two files at first, we have to consider these two files as two tables using read.csv() command. As shown below.




# load data


table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

Inner Joins

Syntax :



# inner join syntax


inner_join(x,y,by='condition')

Examples :

Select all columns

library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

appliedInnerJoin <- inner_join(table1,table2,by='product_category_name')

print(head(appliedInnerJoin,n =20))

OutPut:

Select specified columns

If we don’t want to extract all columns then we can select specified columns using select command.

Suppose In this after applying inner join on table1 and table2 we don’t want to extract all columns of these tables but also we want to extract only
product_id.product_category_name_english.

Then we can use select command of dplyr package (for more detail click here) and extract specified columns that we want to extract

Example


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

appliedInnerJoin <- inner_join(table1,table2,by='product_category_name')
#select specified column

specified_columns <- select(appliedInnerJoin,product_id,product_category_name_english)
print(specified_columns)

OutPut:

left join

Syntax :


left_join(x,y,by='condition')
 

Example:


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 
# left join
appliedLeftJoin <- left_join(table1,table2,by='product_category_name')<img src="http://www.krdheeraj.info/wp-content/uploads/2020/01/leftJoin.png" alt="" width="1345" height="579" class="alignnone size-full wp-image-496" />

print(head(appliedLeftJoin,n=10))

OutPut:

right join

Syntax :


right_join(x,y,by='condition')
 

EXample


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 
# left join
appliedRightJoin <- right_join(table1,table2,by='product_category_name')

print(head(appliedRightJoin,n=10))

Output:

Dplyr grammar of Data Manipulation in R

dplyr package is used for Data Manipulation in R.

So it is the reason that’s why dplyr is called grammar of Data Manipulation.

With the help of dplyr package, we can be able to manipulate data and extract useful information easily and quickly.

Installing and loading dplyr package in R

Installing dplyr



/**
  * install dplyr package
*/

install.package("dplyr")

loading dplyr



/**
  * load dplyr package
*/

library(dplyr)

dplyr package contains 5 verbs for Data Manipulation. That we will discuss in this post.

dplyr 5 verbs for Data Manipulation

Dplyr Function Description
Select () Returns subset of the Columns.
Filter() Returns subset of Rows.
Arrange() Reorder rows according to single or multiple variables.
Mutate() Used for adding a new column from the existing column.
Summarize() Reduce each group to a single row by calculate aggregate measure.

For Exploring our knowledge in dplyr we have need a dataset.

So for that purpose, I am loading a variable.

For reading a CSV file we are using read.csv function.

Syntax :



/**
  * load dplyr package
*/


read.csv("file location")

Example:



/**
  * load dplyr package
*/

load_csv_data <- read.csv("/home/dheeraj/Downloads/brazilian-ecommerce/order_items_dataset.csv") ///

print(load_csv_data)

OutPut:

read csv file
Although read.csv() always return data.frame.

We can be able to store read.csv() return inside a variable as in previous example we are storing in load_csv_data. So if we are accessing variable it means we are accessing that data.

Now we are performing some operation with dplyr Package.
For that purpose, we taking all functions of dplyr package and performing some operations.

But Before using dplyr 5 verbs. We should observe our data on which we are going to implement dplyr verbs.
And dplyr package contains a separate function glimpse() for that purpose.


glimpse(): returns Observation of Data frame.
Syntax :


/**
  * load dplyr package
*/

glimpse(data_frame)

Example:



/**
  * load dplyr package
*/

glimpse(load_csv_data)

OutPut:
glimpse
So glimpse() function provides details about our dataframe.
Now we understood that our data frame contains 112,650 rows and 7 columns as shown below.



Observations: 112,650
Variables: 7
$ order_id            
$ order_item_id       
$ product_id          
$ seller_id          
$ shipping_limit_date 
$ price               
$ freight_value

Selecting Column using Select

Select returns subset of columns.
In other words, we can also say that remove columns from the dataset.
Syntax :



/**
  * select syntax
*/

Select(df, column1,column2)
Where df: data frame 
      Column1: name of column1
      Column2: name of column2

Example:

We have already loaded our data inside our load_csv_data variable. So Now perform these following actions

  • Extract order_id,product_id,price from load_csv_data.
  • Extract order_id,order_item_id, product_id,seller_id,shipping_limit_date,price from load_csv_data.
  1. Extract order_id,product_id,price from load_csv_data.


/**
  * select example for extracting columns 
*/

select_Columns <- select(load_csv_data,order_id,product_id,price)

print(select_Columns)

OutPut:

2. Extract order_id,order_item_id, product_id,seller_id,shipping_limit_date,price from load_csv_data.



/**
  * select example for extracting columns 
*/

select_Columns <- select(load_csv_data,order_id,order_item_id, product_id,seller_id,shipping_limit_date,price)

print(select_Columns)

OutPut:

If we have to access columns in a sequence i.e no column should be removed from the sequence.

As in the previous example, we want to access from order_id to price and these columns are in sequence and no column is missing inside sequence.

In this case, we can access our columns like 



/**
  * syntax for select verb in dplyr 
*/


select(data_frame,column1:column(N-i))

So we can also perform our previous operation in this way.



/**
  * select example for extracting columns 
*/


select_Columns <- select(load_csv_data,order_id:price)

print(select_Columns)

OutPut:

Select rows using filter

glimpse() is used for filtering rows on basis of condition.
Syntax:



/**
  * filter syntax
*/

filter(data_frame,condition1 ... condition(N-i))

Example:

Filter row on basis of single condition

Extract row from load_csv_data data frame of which order_id is 5.



/**
  * filter example
*/


selected_rows <- filter(load_csv_data,order_id==5)

print(selected_rows)

OutPut:

Filter row on the basis of multiple conditions.

Extract rows from load_csv_data from which order_item_id=1 and shipping_limit_date= 2017-11-27 19:09:02



/**
  * filter example
*/
 

selected_rows_multiple_conditions <- filter(load_csv_data,order_item_id==1,shipping_limit_date==	 '2017-11-27 19:09:02')

print(selected_rows_multiple_conditions)

OutPut: