Data visualization Examples in R

In this post, we will explore more examples of data visualization using R. For that purpose we are using the built-in mtcars dataset.
Here is a list of all the features of the observations in mtcars:

  • mpg — Miles/(US) gallon
  • cyl — Number of cylinders
  • disp — Displacement (cu.in.)
  • hp — Gross horsepower
  • drat — Rear axle ratio
  • wt — Weight (lb/1000)
  • qsec — 1/4 mile time
  • vs — Engine (0 = V-shaped, 1 = straight)
  • am — Transmission (0 = automatic, 1 = manual)
  • gear — Number of forward gears
  • carb — Number of carburetors

Example 1: Plot a graph on the X and Y axes


# include ggplot2 library
library(ggplot2)

# 1 - Map mpg to x and cyl to y
ggplot(mtcars, aes(x=mpg, y=cyl)) +
  geom_point()

# 2 - Reverse: Map cyl to x and mpg to y
ggplot(mtcars, aes(x=cyl, y=mpg)) +
  geom_point()

Output:

Example 2: Change the color, shape, and size of the points


# include ggplot2 library
library(ggplot2)

# change color, shape and size
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point(shape = 1, size = 4)

Output:

Example 3: Add alpha and fill


# include ggplot2 library
library(ggplot2)
# Draw points with alpha 0.5, mapping cyl to fill
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(alpha = 0.5)

Output:

Example 4: Change shape and color


library(ggplot2)
# Change shape and color
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 24, col = "yellow")

Output:

Example 5: Change shape and size


# include ggplot2 library
library(ggplot2)
# Define a hexadecimal color
change_color <- "#4ABEFF"
# Set the fill aesthetic; color, size and shape attributes
ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(size = 10, shape = 23, col = change_color)

Output:

Explore raw data using R

Understanding the structure of your Data

View dimensions

Syntax:


dim()

Example:


# dimensions of mtcars
dim(mtcars)

Output:

Looking at your data

head()

  • view the top of the dataset

Note: By default, it returns 6 rows, but we can also vary the number of rows.
Syntax:


head()

Example:


#head of mtcars
head(mtcars)
# we can vary number of rows
head(mtcars,n=8)

Output:

tail()

  • view the bottom of the dataset

Example:


#Syntax:tail()
#tail of mtcars
tail(mtcars)

# we can vary number of rows
tail(mtcars,n=8)

Output:
Visualizing your data

hist()

  • view a histogram of a single variable

Example:


# Syntax: hist()
# histogram of mpg
hist(mtcars$mpg)

Output:

plot()

  • view a scatter plot of two variables

Example:


# Syntax: plot()
# scatter plot of qsec against mpg
plot(mtcars$mpg, mtcars$qsec)

Output:

Gather

  • Gather columns into key-value pairs (a function from the tidyr package)

Syntax:




gather(data, key, value, ...)

# data: a data frame
# key: bare name of the new key column
# value: bare name of the new value column
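As a quick illustration, here is a hedged sketch of gather() on a small toy data frame (the scores data and column names are invented for this example):

```r
# gather() comes from the tidyr package
library(tidyr)

# toy wide-format data (invented for illustration)
scores <- data.frame(student = c("A", "B"),
                     math = c(90, 80),
                     art  = c(70, 60))

# collect the math and art columns into subject/score key-value pairs
long <- gather(scores, subject, score, math, art)
print(long)
```

Each subject column becomes a set of rows, so the two-row wide table turns into four long rows. (In recent tidyr versions gather() is superseded by pivot_longer(), but it still works.)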

Spread

  • Opposite of gather
  • Takes key-value pairs and spreads them into multiple columns

Syntax:


spread(data, key, value)

# data: a data frame
# key: bare name of the column containing keys
# value: bare name of the column containing values
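Continuing the toy scores example (invented data), spread() undoes a gather:

```r
library(tidyr)

# long-format toy data (invented for illustration)
long <- data.frame(student = c("A", "A", "B", "B"),
                   subject = c("math", "art", "math", "art"),
                   score   = c(90, 70, 80, 60))

# spread the subject/score key-value pairs back into one column per subject
wide <- spread(long, subject, score)
print(wide)
```

The four long rows collapse back to two rows, with one column per subject. (In recent tidyr versions spread() is superseded by pivot_wider().)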

Separating columns

  • The separate() function allows you to separate one column into multiple columns.
  • separate() also accepts a sep argument for specifying the separator.

Syntax:


separate(data, col, into = c("column1", "column2"), sep = "separator")
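For example, splitting a year-month column into two columns (toy data invented for this sketch, with sep supplied explicitly):

```r
library(tidyr)

# toy data: one "period" column holding year and month together
dates <- data.frame(period = c("2020-01", "2020-02"))

# split the period column into year and month at the "-"
parts <- separate(dates, period, into = c("year", "month"), sep = "-")
print(parts)
```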

Uniting columns

  • unite() is the opposite of separate()

Syntax:


unite(data, new_column, column1, column2)

Note: we can also specify a separator between these columns with the sep argument.
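A minimal sketch, reversing the separate() example above (toy data invented for illustration):

```r
library(tidyr)

# toy data: year and month in separate columns
df <- data.frame(year = c("2020", "2020"), month = c("01", "02"))

# combine year and month back into one column, using "-" as the separator
united <- unite(df, period, year, month, sep = "-")
print(united)
```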

Some important functions in R

There are some important functions that we use in day-to-day life. For ease of understanding, I have grouped these functions into the following categories.

  • Common functions
  • String functions
  • Looping functions

Common functions

First, let's look at some common functions in R that we use day to day during development.

abs() : Calculate absolute value.

 Example:


  
# abs function

amount <- -56.50
absolute_amount <- abs(amount)
print(absolute_amount)

sum(): Calculate the sum of all the values in a data structure.

 Example:


  
  # sum function
  
myList <- c(23,45,56,67)
sumMyList <- sum(myList)
print(sumMyList)

mean(): Calculate the arithmetic mean.

 Example:


  
  # mean function
  
myList <- c(23,45,56,67)
meanMyList <- mean(myList)
print(meanMyList)

round(): Round values to 0 decimal places by default.

 Example:


    
  ############## round function #####################
    
  amount <- 50.97
  print(round(amount));
 
seq(): Generate sequences by specifying the from, to, and by arguments.

 Example:

    
     # seq() function
     # seq(from, to, by)
    
  sequence_data <- seq(1,10, by=3)
  print(sequence_data)
  
rep(): Replicate elements of vectors and lists.

 Example:


    
    # rep example
    #rep(x, times)
    sequence_data <- seq(1,10, by=3)  
    repeated_data <- rep(sequence_data,3)
    print(repeated_data)

  
sort(): Sort a vector in ascending order; works on numeric and character vectors.

 Example:


    
    #sort function
    
  data_set <- c(5,3,11,7,8,1)
  sorted_data <- sort(data_set)
  print(sorted_data)
  
rev(): Reverse the elements in a data structure for which reversal is defined.

 Example:


    
   # reverse function 
  data_set <- c(5,3,11,7,8,1)
  sorted_data <- sort(data_set)
  reverse_data <- rev(sorted_data)
  print(reverse_data)
  
str(): Display the structure of any R Object.

 Example:


    
  # str function 
    
  myWeeklyActivity <- data.frame(
    activity=c("meditation","exercise","blogging","office"),
    hours=c(7,7,30,48)
  )
  str(myWeeklyActivity)
  
append() : Merge vectors or lists.

 Example:


    
   #append function 
    
  activity=c("meditation","exercise","blogging","office")
  hours=c(7,7,30,48)
  append_data <- append(activity,hours)
  print(append_data)
  
is.*(): Check the class of an R object.

 Example:


    
  #is.*() function
    
  list_data <- list(log=TRUE,
                    chStr="hello",
                    int_vec=sort(rep(seq(2,10,by=3),times=2)))
  print(is.list(list_data))
  
as.*(): Convert an R Object from one class to another.

 Example:


    
  #as.*() function
    
  list_data <- as.list(c(2,4,5))
  print(is.list(list_data))
  

String Functions

Now we discuss some string functions that play a vital role in data cleaning and data manipulation.

These are functions from the stringr package, so before using them you first have to install the stringr package.


    
  # import string library
    
  library(stringr)
 
str_trim(): Remove leading and trailing whitespace from a string.

 Example:


    
    ############### str_trim ####################
    
   trim_result <- str_trim(" this is my string test. ")
   print("Trim string")
   print(trim_result)
 
str_detect(): Search for a pattern in a vector of strings; returns a logical value for each element.

 Example:


    
  ############### str_detect ####################
    
  friends <- c("Alice","John","Doe")
  string_detect <- str_detect(friends,"John")
  print("String detect ...")
  print(string_detect)
 
str_replace() : replace a string in a vector.

 Example:


    
  ############## str_replace #####################
    
  friends <- c("Alice","John","Doe")
  friends <- str_replace(friends,"Doe","David")
  print("friends list after replacement ....")
  print(friends)
 
tolower(): Convert a string to all lowercase.

 Example:


    
  ############## tolower #####################
    
  my_upper_case_string <- "THIS IS MY UPPERCASE"
  print("lower case string ...")
  print(tolower(my_upper_case_string))
 
toupper(): Convert a string to all uppercase.

 Example:


    
  ############## toupper #####################
    
  my_string <- "My name is Dheeraj"
  print("Upper case string ...")
  print(toupper(my_string))

 

Looping functions

lapply(): Loop over a list and evaluate a function on each element.

 Some important points regarding lapply :

lapply takes three arguments:

  1. list X
  2. function (or name the function) FUN
  3. … other arguments passed to FUN

Note: lapply always returns a list, regardless of the class of the input.

Example:


    
  ############## lapply example #####################
    
  x <- list(a = 1:5, b = rnorm(10))
  lapply(x,mean)
 

Output:

Anonymous function

Anonymous functions are those functions that have no name.


    
  ############## lapply example #####################
  # Extract first column of matrix 
    
  
  x <- list(a=matrix(1:4,2,2),b=matrix(1:6,3,2))
  lapply(x,function(elt)elt[,1])
 

Output:


Using a named function with lapply


    
  ############## lapply example #####################
  # multiply each element of list with factor
    
  
  multiply <- function(x,factor){
    x * factor
  }

lapply(list(1,2,3),multiply,factor=3)
 

Output:

sapply(): Same as lapply, but tries to simplify the result.

 Example:


    
  ############## sapply example #####################
  # multiply each element of list with factor
    
  multiply <- function(x,factor){
    x * factor
  }
sapply(list(1,2,3),multiply,factor=3)
 

Output:

apply(): Apply a function over the margins of an array.

 Example:


    
  ############## apply function #####################
    
  mat1 <- matrix(1:30, nrow = 5, ncol = 6)
  apply(mat1,2, sum)
 

Output:

tapply(): Apply a function over subsets of a vector.

 Example:


    
  ############## tapply function #####################
    
  tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)
 

Output:

mapply(): Multivariate version of lapply; applies a function over the first elements of each argument, then the second elements, and so on.

 Example:
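A minimal sketch using the classic rep() example:

```r
# mapply applies rep() to the pairs (1, 3), (2, 2), (3, 1), i.e.
# rep(1, 3), rep(2, 2), rep(3, 1)
res <- mapply(rep, 1:3, 3:1)
print(res)
```

Because the results have different lengths, mapply cannot simplify them and returns a list of three vectors.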

Joins in SQL for Data Analysts

We live in the age of data, and data is now often assumed to be more precious than gold. In the data world, SQL is one of the best tools for data analysis.

Types of Joins

Basically, there are four types of JOINS in SQL.

      • CROSS JOIN
      • INNER JOIN
             – EQUI JOIN
             – NATURAL JOIN
      • OUTER JOIN
             – LEFT JOIN (LEFT OUTER JOIN)
             – RIGHT JOIN (RIGHT OUTER JOIN)
             – FULL JOIN (FULL OUTER JOIN)
      • SELF JOIN

For practical implementation of JOINS, we are taking two tables 




# Table Details


Products_dataset:-(product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm)

product_category_name_translation :- (product_category_name, product_category_name_english)

CROSS JOIN

CROSS JOIN returns the Cartesian product of two tables.

Cartesian Product

If table A contains m rows and table B contains n rows, then the Cartesian product of table A and table B will contain m*n rows.
Syntax :




# cross join syntax


SELECT * FROM tableA CROSS JOIN tableB

     or
     
SELECT * FROM tableA,tableB

Example :


SELECT * FROM products_dataset CROSS JOIN product_category_name_translation

Output:

INNER JOIN

returns all rows from tableA that have matching values in tableB, together with all columns from tableA and tableB. If there are multiple matches between tableA and tableB, all combinations of the matches are returned, on the basis of the specified condition.

The join condition is written after the ON keyword.
Syntax:



# Inner join syntax

SELECT * FROM tableA INNER JOIN tableB ON (tableA.x = tableB.y)


or 

SELECT * FROM tableA JOIN tableB ON (tableA.x = tableB.y)

Note: By default, JOIN is treated as an INNER JOIN.

Example :



# inner join example

SELECT * FROM products_dataset pds INNER JOIN product_category_name_translation pct
ON pds.product_category_name=pct.product_category_name

Output:

LEFT JOIN (LEFT OUTER JOIN) :

returns all rows from tableA, and all columns from tableA and tableB. Rows in tableA with no match in tableB will have NULL values in the new columns. If there are multiple matches between tableA and tableB, all combinations of the matches are returned.
Syntax :



# left join syntax

SELECT * FROM tableA LEFT JOIN tableB ON (tableA.x = tableB.y)

Example:



#  left join example

SELECT * FROM products_dataset pds LEFT JOIN product_category_name_translation pct

ON pds.product_category_name=pct.product_category_name

Output:

RIGHT JOIN (RIGHT OUTER JOIN)

returns all rows from tableB, and all columns from tableA and tableB. Rows in tableB with no match in tableA will have NULL values in the new columns. If there are multiple matches between tableA and tableB, all combinations of the matches are returned.
Syntax:



# right join syntax

SELECT * FROM tableA RIGHT JOIN tableB ON (tableA.x = tableB.y)

Example:



#  right join example

SELECT * FROM products_dataset pds RIGHT JOIN product_category_name_translation pct
ON pds.product_category_name=pct.product_category_name

Output:

Joining data in R using dplyr

When working with data, joining is a common operation.

Joining means combining data from two or more different sources on the basis of some condition.

For performing this type of operation in R, dplyr is one of the best options.

In this post, we will cover these key points:

  • Types of Joins in R
  • Syntax
  • Joining on DataFrame
  • Joining on tables

Types of Joins

There are six types of Joins in R :

  1. Inner Join (inner_join)
  2. Left Join (left_join)
  3. Right Join (right_join)
  4. Full Join (full_join)
  5. Semi Join (semi_join)
  6. Anti Join (anti_join)

Syntax




# Syntax of Joining in R

join_type(x, y, by = condition)

# x: dataframe1/table1
# y: dataframe2/table2

Joins are applied on tables, and in R the files we read in are treated as tables.

For a better understanding of joins, we are taking two files 

  1. Product_category_name.csv (product_category_name, product_category_name_english)
  2. product_dataset.csv(product_id,product_category_name,product_name_lenght, product_description_lenght, product_photos_qty, product_weight_g, product_length_cm, product_height_cm, product_width_cm)

You can download these files for practice from my GitHub account.

As we know, when applying joins on two different datasets or tables we need a common field on the basis of which the join can be applied.

In this case, both files contain product_category_name, so we can join on that column.

To use a CSV file as a table, we use the following command:




# read csv file

read.csv(file_name)

So, to apply joins on these two files, we first have to read the two files as two tables using the read.csv() command, as shown below.




# load data


table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

Inner Joins

Syntax :



# inner join syntax


inner_join(x,y,by='condition')

Examples :

Select all columns

library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

appliedInnerJoin <- inner_join(table1,table2,by='product_category_name')

print(head(appliedInnerJoin,n =20))

Output:

Select specified columns

If we don't want to extract all columns, we can select specific columns using the select command.

Suppose that after applying the inner join on table1 and table2 we don't want all columns of these tables, but only
product_id and product_category_name_english.

Then we can use the select command of the dplyr package (for more detail click here) to extract the specified columns that we want.

Example


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 

appliedInnerJoin <- inner_join(table1,table2,by='product_category_name')
#select specified column

specified_columns <- select(appliedInnerJoin,product_id,product_category_name_english)
print(specified_columns)

Output:

Left join

Syntax :


left_join(x,y,by='condition')
 

Example:


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 
# left join
appliedLeftJoin <- left_join(table1,table2,by='product_category_name')

print(head(appliedLeftJoin,n=10))

Output:

Right join

Syntax :


right_join(x,y,by='condition')
 

Example:


library(dplyr)
table1 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/product_category_name.csv") # file location of Product_category_name.csv
table2 <- read.csv("/home/dheeraj/Desktop/Blog_post/joins_in_r_using_dplyr/dataset/products_dataset.csv")   # file location of product_dataset.csv 
# right join
appliedRightJoin <- right_join(table1,table2,by='product_category_name')

print(head(appliedRightJoin,n=10))

Output:

Dplyr grammar of Data Manipulation in R

The dplyr package is used for data manipulation in R.

That is why dplyr is called the grammar of data manipulation.

With the help of the dplyr package, we can manipulate data and extract useful information easily and quickly.

Installing and loading dplyr package in R

Installing dplyr



# install dplyr package

install.packages("dplyr")

loading dplyr



# load dplyr package

library(dplyr)

The dplyr package contains 5 verbs for data manipulation, which we will discuss in this post.

dplyr 5 verbs for Data Manipulation

Function      Description
select()      Returns a subset of the columns.
filter()      Returns a subset of the rows.
arrange()     Reorders rows according to single or multiple variables.
mutate()      Adds a new column derived from existing columns.
summarize()   Reduces each group to a single row by calculating an aggregate measure.

To explore dplyr we need a dataset.

So for that purpose, I am loading one into a variable.

To read a CSV file we use the read.csv function.

Syntax :



# read.csv syntax

read.csv("file location")

Example:



# load data

load_csv_data <- read.csv("/home/dheeraj/Downloads/brazilian-ecommerce/order_items_dataset.csv")

print(load_csv_data)

Output:

read csv file
Note that read.csv() always returns a data.frame.

We can store the return value of read.csv() in a variable, as we stored it in load_csv_data in the previous example. Accessing the variable then means accessing that data.
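As a quick self-contained check of this (using read.csv()'s text argument with a tiny inline CSV, so the snippet does not depend on any file on disk):

```r
# read.csv can also parse inline text via its text argument
df <- read.csv(text = "a,b\n1,2\n3,4")

print(class(df))  # the result is always a data.frame
str(df)
```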

Now we will perform some operations with the dplyr package.
For that purpose, we take each function of the dplyr package in turn and perform some operations.

But before using the 5 dplyr verbs, we should observe the data on which we are going to apply them.
The dplyr package contains a separate function, glimpse(), for that purpose.


glimpse(): shows every column of a data frame, its type, and the first few observations.
Syntax :


# glimpse syntax

glimpse(data_frame)

Example:



# observe the data

glimpse(load_csv_data)

Output:
glimpse
So the glimpse() function provides details about our data frame. Now we can see that our data frame contains 112,650 rows and 7 columns, as shown below.



Observations: 112,650
Variables: 7
$ order_id            
$ order_item_id       
$ product_id          
$ seller_id          
$ shipping_limit_date 
$ price               
$ freight_value

Selecting Column using Select

select() returns a subset of columns.
In other words, we can also say that it removes the unselected columns from the dataset.
Syntax :



# select syntax

select(df, column1, column2)

# df: data frame
# column1: name of the first column
# column2: name of the second column

Example:

We have already loaded our data into the load_csv_data variable. So now we perform the following actions:

  • Extract order_id, product_id, price from load_csv_data.
  • Extract order_id, order_item_id, product_id, seller_id, shipping_limit_date, price from load_csv_data.

1. Extract order_id, product_id, price from load_csv_data.


/**
  * select example for extracting columns 
*/

select_Columns <- select(load_csv_data,order_id,product_id,price)

print(select_Columns)

Output:

2. Extract order_id,order_item_id, product_id,seller_id,shipping_limit_date,price from load_csv_data.



/**
  * select example for extracting columns 
*/

select_Columns <- select(load_csv_data,order_id,order_item_id, product_id,seller_id,shipping_limit_date,price)

print(select_Columns)

Output:

If we want to access columns that form a contiguous sequence, i.e. no column is missing from the sequence, there is a shorthand.

As in the previous example, we want to access the columns from order_id to price; these columns are in sequence and no column is missing.

In this case, we can access our columns like this:



/**
  * syntax for select verb in dplyr 
*/


select(data_frame, first_column:last_column)

So we can also perform our previous operation in this way.



/**
  * select example for extracting columns 
*/


select_Columns <- select(load_csv_data,order_id:price)

print(select_Columns)

Output:

Select rows using filter

filter() is used for selecting rows on the basis of a condition.
Syntax:



/**
  * filter syntax
*/

filter(data_frame, condition1, ..., conditionN)

Example:

Filter rows on the basis of a single condition

Extract the rows from the load_csv_data data frame for which order_id is 5.



/**
  * filter example
*/


selected_rows <- filter(load_csv_data,order_id==5)

print(selected_rows)

Output:

Filter rows on the basis of multiple conditions

Extract the rows from load_csv_data for which order_item_id == 1 and shipping_limit_date == '2017-11-27 19:09:02'.



/**
  * filter example
*/
 

selected_rows_multiple_conditions <- filter(load_csv_data, order_item_id==1, shipping_limit_date=='2017-11-27 19:09:02')

print(selected_rows_multiple_conditions)

Output:
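The remaining three verbs, arrange(), mutate(), and summarize(), work in the same spirit. Here is a hedged sketch using the built-in mtcars data frame, so it runs without the CSV file used above (the derived column hp_per_wt is invented for this example):

```r
library(dplyr)

# arrange: reorder rows, here heaviest cars first
heavy_first <- arrange(mtcars, desc(wt))

# mutate: add a new column derived from existing columns
with_ratio <- mutate(mtcars, hp_per_wt = hp / wt)

# summarize: reduce the whole data frame to a single aggregate row
avg_mpg <- summarize(mtcars, mean_mpg = mean(mpg))
print(avg_mpg)
```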

Import CSV file into MySQL database using SQL

 

During data analysis, it is a common situation that we have to import CSV files into a MySQL database.

In this post, we will discuss this scenario and a simple way of importing CSV files using SQL.

Before importing a CSV file, we need to create a table that will contain all the columns of the CSV file.

Suppose I have a CSV file order_items_dataset.csv that we want to import into a MySQL database.

import csv file

 

Then, before importing the CSV file, we have to create a table in the MySQL database.

So, first we create the table.

Create a table in the MySQL database



/**
  * create order_items_datasets table for importing csv file
*/

CREATE TABLE order_items_datasets (
order_id int,
order_item_id int,
product_id varchar(200),
seller_id varchar(200),
shipping_limit_date DATETIME,
price float,
freight_value float,
PRIMARY KEY (order_id, order_item_id) -- an order can contain several items
)

Now our table has been created successfully.

Now we need to import the CSV file into the order_items_datasets table.

For that, we have to follow these few steps.

STEP 1: Load data

For this purpose, we will use the LOAD DATA statement.

LOAD DATA statement: reads rows from a text file into a table at very high speed.


-- load data

LOAD DATA

Step 2: Include the CSV file

To include the CSV file from which we have to import data, we use the INFILE clause with the LOAD DATA statement.

INFILE: allows us to read CSV data from a text file.

NOTE: The LOAD DATA INFILE statement allows us to read CSV data from a text file and import it into a database table at very high speed.

So, to include the file, we write:



-- include file

LOAD DATA INFILE "/home/dheeraj/Downloads/brazilian-ecommerce/order_items_dataset.csv"  -- file location

Step 3: Import data into the table

To import data into the table, we add INTO TABLE followed by the table name to the LOAD DATA statement.

As shown below, we import data into the new order_items_datasets table that we created previously.



# import data inside table


LOAD DATA INFILE "/home/dheeraj/Downloads/brazilian-ecommerce/order_items_dataset.csv" -- file location
INTO TABLE order_items_datasets -- table name 

Step 4: Extract values from the CSV file

A CSV file contains comma-separated values, so we need to split these values into rows and columns.

For this, we add these few lines:



-- delimiters

FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'

Step 5: Ignore the header of the CSV file

While importing the CSV file, we need to ignore the header row. For this, we add this line:



# ignore header 

IGNORE 1 ROWS;

After following these steps, we are able to import CSV files into the MySQL database.

Complete Query:



/**
Create table
*/

CREATE TABLE order_items_datasets (
order_id int,
order_item_id int,
product_id varchar(200),
seller_id varchar(200),
shipping_limit_date DATETIME,
price float,
freight_value float,
PRIMARY KEY (order_id, order_item_id) -- an order can contain several items
);

-- load data, include the file, and import it into the table
LOAD DATA INFILE "/home/dheeraj/Downloads/brazilian-ecommerce/order_items_dataset.csv" -- file location
INTO TABLE order_items_datasets -- table name

-- extract data
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'

-- ignore header
IGNORE 1 ROWS;

Web Scraping with R

Nowadays, to run a business we need to understand business patterns, client behavior, culture, location, and environment. Without these, we cannot run a business successfully.

With the help of these factors, the probability of growing our business becomes high.

So, in simple terms, to understand and run a business successfully we need data, from which we can understand client behavior, business patterns, culture, location, and environment.

Today, one of the best sources for collecting data is the web, and there are various methods to collect data from it.

One of them is web scraping; in different languages we extract data from the web in different ways.

Here we will discuss some of the methods for extracting data from the web using the R language.

There are various resources on the web, and we have various techniques for extracting data from these different resources.

Some of these resources are :

    • Google sheets
    • Wikipedia
    • Extracting Data from web tables
    • Accessing Data from Web 

Read data in HTML tables using R

Generally, for storing large amounts of data on the web we use tables, and here we discuss a way of extracting data from HTML tables. So, without further delay, here are the steps for extracting data from HTML tables.

In this session, we have designed the steps so that anyone can access HTML tables by following them:

  1. Install library
  2. Load library
  3. Get Data from url
  4. Read HTML Table
  5. Print Result

Install library


# install library
install.packages('XML')
install.packages('RCurl')

When reading data from the web, we generally use these libraries:


# load library 
library(XML)
library(RCurl)

Get Data from url

In this session, we will extract the list of Nobel laureates from a Wikipedia page. For that, first copy the URL of the table; here is the URL:


https://en.wikipedia.org/wiki/List_of_Nobel_laureates#List_of_laureates

In R, we write these lines of code to get the data from the URL:


# Get Data from url
url <- "https://en.wikipedia.org/wiki/List_of_Nobel_laureates#List_of_laureates"
url_data <- getURL(url)

Read HTML Table

Now it's time to read the table and extract information from it, and for that we will use the readHTMLTable() function.


# Read HTML Table
data <- readHTMLTable(url_data,stringsAsFactors=FALSE)

Print Result

Finally, our data has been stored in the data variable, and now we can print it:


# print result
print(data) 

Here is the complete code for reading an HTML table from the web using R:


# install library
install.packages('XML')
install.packages('RCurl')
# load library 
library(XML)
library(RCurl)
# Get Data from url
url <- "https://en.wikipedia.org/wiki/List_of_Nobel_laureates#List_of_laureates"
url_data <- getURL(url)
# Read HTML Table
data <- readHTMLTable(url_data,stringsAsFactors=FALSE)
# print result
print(data)

rvest package for Scraping

rvest is one of the most important packages for scraping webpages. It is designed to work with magrittr to make it easy to scrape information from the web, and it is inspired by Beautiful Soup.

Why do we need another package when we already have packages like XML and RCurl?

When scraping with the XML and RCurl packages, we need the id, name, or class attribute of the particular element.
If our element doesn't have such an attribute, then we can't scrape information from the website.
Apart from that, the rvest package contains some essential functions that enhance its importance over other packages.
In this session too, we follow the same steps we designed for accessing HTML tables with the XML and RCurl packages. We repeat these steps:

    1. Install package
    2. Load package
    3. Get Data from url 
    4. Read HTML Table
    5. Print Result

We repeat the same example as we discussed before, but with the rvest package.

Install package


# install package
install.packages('rvest')

Load package


#load package
library('rvest')

Get Data from url and Read HTML Table


url <- 'https://en.wikipedia.org/wiki/List_of_Nobel_laureates'
# Get Data from url and Read HTML Table 
prize_data <- url %>% read_html() %>% html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table(fill = TRUE)

Here we have combined two steps into a single step, and that is the beauty of piping in R. Apart from that, inside the html_nodes() method we have used an XPath expression.
With the rvest package we use the XPath of the element that we want to extract.
The steps for copying the XPath are shown in the image below, in which we copy the XPath of the table.

print Data


#print Data
print(prize_data)

Here is complete code for reading HTML table from web using rvest in R


# install package

install.packages('rvest')

#load package
library('rvest')
url <- 'https://en.wikipedia.org/wiki/List_of_Nobel_laureates'
# Get Data from url and Read HTML Table
prize_data <- url %>% read_html() %>% html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table(fill = TRUE)
# read prize data
print(prize_data)

Both examples, i.e. reading the table with the XML and RCurl packages and reading the same table with the rvest package, produce output that looks like this:
Output:

web Scraping output in R
Note: we will go through more examples of rvest in the next post, but before that, here is a quick introduction to the googlesheets package in R.

Extracting Data from Google sheets :

Google Sheets has become one of the most important tools for storing data on the web. It is also useful for data analysis on the web.

In R we have a separate package for extracting data from Google Sheets: googlesheets.

How to use Google Sheets with R?

In this section, we will explain how to use googlesheets package for extracting information from google sheets.

We have divided the process of extracting data from Google Sheets into 6 steps:

  1. Installing googlesheets
  2. Loading googlesheets
  3. Authenticate google account
  4. Show list of worksheets
  5. Read a spreadsheet
  6. Modify the sheet

Install googlesheets package


install.packages("googlesheets")

Loading googlesheets


library("googlesheets")

Authenticate google account


gs_ls()

After that, an authentication page will open in the browser, as shown below.

Complete code


 # install package
 install.packages('googlesheets')
 # load library
 library('googlesheets')
 # authenticate; a browser page opens, then close it and return to R
 gs_ls()
 # register the worksheet by its title
 take_title <- gs_title("amazon cell phone items")
 # get the list of worksheets
 gs_ws_ls(take_title)
    

More Tutorials on R

Introduction to Text mining in R

Introduction to Text mining in R Part 2

Create Word Cloud in R

More interesting Visuals in Word cloud using R

In this post, we discuss more about word clouds and stopwords. On my previous post, some readers (Paula Darling, Farzad Qassemi, Georgi Petkov) commented on LinkedIn that I had missed using stopwords. I am thankful to them for pointing out my mistake and suggesting I correct it. So, in this post I first correct the mistake I made in my previous post, and then I introduce some more visuals that are used when generating word clouds.

So, we start by removing stopwords, and after that we will cover these topics:

  • Use color function
  • Use multiple colors for creating word cloud
  • Use prebuilt color palettes

Some more functions that are generally used for text mining

Function Description
dist() Compute the distances between the rows of a TDM.
hclust() Perform cluster analysis on the dissimilarities of the distance matrix.
as.dendrogram() Convert the clustering result into a dendrogram object.
plot() Visualize the word-frequency distances as a dendrogram.
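The functions above can be chained together. A minimal sketch, using a small hand-made term-frequency matrix (a stand-in for a real TDM; the terms and counts are invented for illustration):

```r
# Toy term-frequency matrix: terms as rows, documents as columns
freqs <- matrix(c(5, 1, 0,
                  4, 2, 1,
                  0, 1, 6),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("india", "nation", "cricket"),
                                c("doc1", "doc2", "doc3")))

d <- dist(freqs)           # distances between the term rows
hc <- hclust(d)            # hierarchical cluster analysis
dend <- as.dendrogram(hc)  # convert for plotting
plot(dend)                 # visualize the word-frequency distances
```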

Use prebuilt color palettes

viridisLite package

viridisLite color schemes are perceptually-uniform, both in regular form and also when converted to black-and-white, and can be perceived by readers with all forms of color blindness.

The color scales

The package contains four color scales: "viridis", the primary choice, and three alternatives with similar properties: "magma", "plasma", and "inferno".

Each scale comes with a convenience function of the same name; simply pass n to select the number of colors needed.
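A minimal sketch of the convenience functions, assuming viridisLite is installed:

```r
library(viridisLite)
# Each scale has a same-named function; n sets the number of
# colors, returned as hex color strings
viridis(5)
magma(5)
plasma(5)
inferno(5)
```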

 

We will not go deep into the viridisLite package here, because learning it in detail is not our priority at the moment; we only need to know how to use it to make our text-mining visuals more effective.

 

Create Word Cloud in R

Before going deep into word clouds, we first introduce some important technical terms that we will use during this word cloud session in text mining.

Basically, the foundation of bag-of-words text mining is either a TDM (Term Document Matrix) or a DTM (Document Term Matrix).

TDM (Term Document Matrix): The TDM is the matrix most often used for language analysis. This is because you likely have more terms than authors or documents, and life is generally easier when you have more rows than columns. So, in a TDM, each term is represented as a row and each document as a column.

 


 

How to create TDM from corpus?

Function Description
TermDocumentMatrix() Create a Term Document Matrix from a corpus.

DTM (Document Term Matrix): The transpose of the TDM, hence each document is a row and each term is a column.

 

How to create DTM from corpus?

Function Description
DocumentTermMatrix() Create a Document Term Matrix from a corpus.

Note: Whether we use a TDM or a DTM in our application depends on the scenario: if we want terms as rows and documents as columns, we pick the TDM (Term Document Matrix); if instead we want documents as rows and terms as columns, we pick the DTM (Document Term Matrix).
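The transpose relationship can be seen with a toy matrix in base R (a sketch with invented counts, not a real tm matrix):

```r
# Toy TDM: terms as rows, documents as columns
tdm <- matrix(c(2, 0,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(terms = c("india", "nation"),
                              docs  = c("doc1", "doc2")))
# The corresponding DTM is simply the transpose:
# documents as rows, terms as columns
dtm <- t(tdm)
print(dtm)
```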

What is Word Cloud?

A word cloud is a text mining technique that highlights the most frequent keywords in a body of text.

How to Create a Word Cloud?

There is a simple way to create a word cloud using packages like:

tm: for creating the corpus and cleaning the data.

wordcloud: for constructing the word cloud.

Steps for creating a word cloud:
    1. Choose a text file
    2. Install and import packages
    3. Read the file
    4. Convert the text file into a corpus
    5. Clean the data
    6. Create a Term Document Matrix
    7. Create your first word cloud

Now, without wasting time, we will follow the steps discussed above to construct our word cloud.

Choose Text file

First, we choose a text file. I have taken a file that contains speeches of Indian Prime Minister Narendra Modi; I will attach this file so that you can practice with it.

Install Packages and importing packages

Install these packages for constructing the word cloud. To install them you can use the install.packages() command like this:


# install packages
install.packages('tm')
install.packages('wordcloud')
install.packages('RColorBrewer')
# load libraries
library('tm')
library('wordcloud')
library('RColorBrewer')

Reading file

After installing and loading the libraries, we first need the text data to process. To load the text file and read its data, we use these commands:


# text file location
textFileLocation <- "F:\\Blog\\TextMining\\Word_word_text_mining\\text_speeches\\pm-modi-interacts-with-the-indian-community-in-kobe-japan.txt"
# open a read connection to the text file
textName <- file(textFileLocation, open = "r")
# read the text file line by line
text_data <- readLines(textName)
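The same read pattern can be checked with a temporary file, so the sketch does not depend on my local path (the two sample lines are invented for illustration):

```r
# Write a couple of lines to a temp file, then read them back
tmp <- tempfile(fileext = ".txt")
writeLines(c("My dear brothers and sisters,", "Namaste!"), tmp)
con <- file(tmp, open = "r")
text_data <- readLines(con)  # one element per line of the file
close(con)                   # always close the connection
print(text_data)
```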

Converting a text file into corpus

We cannot build a word cloud directly from a text file; we must first convert the text data into a corpus, because the word cloud is constructed from a corpus. So we follow these steps, as also discussed in our previous post.

Note: In this case, we are creating a VCorpus (vector corpus) for the word cloud.


# create corpus from text_data
textAsCorpus <- VCorpus(VectorSource(text_data)) 

Clean the data

Data cleaning is an important part of any kind of data processing, and this case is no exception. To clean our corpus, we execute the following commands:


# corpus cleaning: each step takes the result of the previous one
clean_text <- tm_map(textAsCorpus, stripWhitespace)
clean_text <- tm_map(clean_text, content_transformer(tolower))
clean_text <- tm_map(clean_text, removeNumbers)
clean_text <- tm_map(clean_text, removePunctuation)
# remove English stopwords (the correction discussed above)
clean_text <- tm_map(clean_text, removeWords, stopwords("english"))

Create a Term Document matrix

This step is not strictly required, because we pass the cleaned corpus directly when creating the word cloud, but here I am also showing you how to compute the Term Document Matrix.


#creating term Document matrix
tdm1 <- TermDocumentMatrix(clean_text)
# print term document message
print(as.matrix(tdm1))

OutPut:

Create your first Word cloud

This is our final step. We now use the wordcloud() function of the wordcloud package and pass our corpus to it.


# create word cloud
wordcloud(clean_text, max.words = 500, colors = "blue")

OutPut:


Complete code

# install packages
install.packages('tm')
install.packages('wordcloud')
install.packages('RColorBrewer')
# load libraries
library('tm')
library('wordcloud')
library('RColorBrewer')

# text file location
textFileLocation <- "F:\\Blog\\TextMining\\Word_word_text_mining\\text_speeches\\pm-modi-interacts-with-the-indian-community-in-kobe-japan.txt"
# load text file
textName <- file(textFileLocation,open="r")
# read text file
text_data <- readLines(textName)

# create corpus from text_data
textAsCorpus <- VCorpus(VectorSource(text_data)) 
# corpus cleaning: each step takes the result of the previous one
clean_text <- tm_map(textAsCorpus, stripWhitespace)
clean_text <- tm_map(clean_text, content_transformer(tolower))
clean_text <- tm_map(clean_text, removeNumbers)
clean_text <- tm_map(clean_text, removePunctuation)
# remove English stopwords (the correction discussed above)
clean_text <- tm_map(clean_text, removeWords, stopwords("english"))

#creating term Document matrix
tdm1 <- TermDocumentMatrix(clean_text)
# print term document message
print(as.matrix(tdm1)) # not required, so this line can be commented out

# create word cloud
wordcloud(clean_text, max.words = 500, colors = "blue")