Introduction to Text mining in R Part 2

In the previous post we covered the basics of text mining in R and briefly introduced the corpus and its two types:

  1. PCorpus
  2. VCorpus

For VCorpus we looked at its sources: VectorSource, DirSource and DataframeSource. VectorSource was already explained with an example, so this post continues with the two remaining VCorpus sources, DirSource and DataframeSource, each with an example.

DirSource: It is designed for directories on a file system; each file in the directory becomes one document in the corpus.

To demonstrate DirSource we need a folder that contains some text files. I have already created a folder with different speeches of Indian Prime Minister Modi, and I will also share a link to this folder so that you can download the files and practice on them.

Example:


# import the tm library
library('tm')
# path to the folder that holds the text files
text <- file.path("text_speeches")
# build a volatile corpus with one document per file
dirSourceCorpus <- VCorpus(DirSource(text))
print(dirSourceCorpus)
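
If you do not have the speeches folder at hand, the same idea can be tried on a throwaway folder. The sketch below is only an illustration: the folder name demo_speeches, the two file names and their contents are made up, and everything is written to R's temporary directory.

```r
library(tm)

# Hypothetical demo folder inside the session's temporary directory
demo_dir <- file.path(tempdir(), "demo_speeches")
dir.create(demo_dir, showWarnings = FALSE)

# Two small text files, one document each (contents are made up)
writeLines("This is the first demo speech.", file.path(demo_dir, "speech1.txt"))
writeLines("This is the second demo speech.", file.path(demo_dir, "speech2.txt"))

# Read only the .txt files of the folder into a volatile corpus
demo_corpus <- VCorpus(DirSource(demo_dir, pattern = "\\.txt$"))

print(demo_corpus)
# The text of the first document
print(content(demo_corpus[[1]]))
```

The pattern argument is optional; it is used here so that any other files in the folder are ignored.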

DataframeSource: Used for handling a data frame, i.e. a CSV-like structure. Here we create our own example data frame; the first two columns must be named doc_id and text.

Example :


 
# import the tm library
library('tm')
# Create an example data frame; DataframeSource expects the first two
# columns to be named doc_id and text, further columns are kept as metadata
example_txt <- data.frame(
  doc_id = c("1", "2", "3"),
  text = c("example_text", "Text analysis provides insights", "qdap and tm are used in text mining"),
  author = c("Author1", "Author2", "Author3"),
  date = c("1514953399", "1514953399", "1514780598"),
  stringsAsFactors = FALSE
)
# Create a DataframeSource from example_txt
df_source <- DataframeSource(example_txt)
# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)
# print the volatile corpus
print(df_corpus)

Output:

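Printing the corpus only shows a summary. As a small sketch of how the individual documents can be inspected, the block below rebuilds a two-row version of the data frame (the variable names docs and corpus are chosen for illustration) and reads back the text and the metadata. That the extra columns show up in meta() is an assumption that holds for recent tm versions.

```r
library(tm)

# A shortened version of the example data frame
docs <- data.frame(
  doc_id = c("1", "2"),
  text = c("Text analysis provides insights", "qdap and tm are used in text mining"),
  author = c("Author1", "Author2"),
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(docs))

# Text of the second document
print(content(corpus[[2]]))
# Document-level metadata built from the extra columns
print(meta(corpus))
```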
Some important functions of tm

Function              Description
tolower()             Converts all content to lower case (a base R function, commonly used together with tm)
removePunctuation()   Removes punctuation such as periods and exclamation marks
removeNumbers()       Removes numbers, useful when only pure text is needed
stripWhitespace()     Collapses tabs and extra spaces into a single space
removeWords()         Removes specific words chosen by the data scientist during the process

Now it’s time for an example, to understand the concept in a practical way.


# import the tm library
library('tm')
# Create a text object
text <- "Nginx  is a web server software which can also be used as a reverse proxy, 
load balancer, mail proxy and HTTP cache.NGINX is a              free, open-source, high-performance 
HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. 
NGINX is known for its high performance, stability, rich feature set, simple configuration, 
and low resource consumption."
# Each call below works on the original text and returns a new string
# convert to lower case
tolower(text)
# remove punctuation
removePunctuation(text)
# remove numbers
removeNumbers(text)
# collapse extra white space
stripWhitespace(text)

Output :

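In the example above each function is applied to the original text on its own, so the results are not combined. A minimal sketch of chaining the steps, assuming a helper called clean_text (a name made up for this illustration), could look like this; the last lines apply the same helper to a corpus through tm_map() and content_transformer().

```r
library(tm)

# Hypothetical helper that chains the cleaning steps in one call
clean_text <- function(x) {
  x <- tolower(x)            # lower case
  x <- removePunctuation(x)  # drop punctuation
  x <- removeNumbers(x)      # drop digits
  x <- stripWhitespace(x)    # collapse runs of white space
  trimws(x)                  # trim leading and trailing space (base R)
}

print(clean_text("Hello,   World 2024!"))

# The same helper applied to every document of a small corpus
corpus <- VCorpus(VectorSource(c("Nginx  is FAST!", "HTTP cache, 100% free.")))
corpus <- tm_map(corpus, content_transformer(clean_text))
print(content(corpus[[1]]))
```

The order of the steps is a design choice: lower-casing first keeps later word matching simple, and white space is collapsed last because removing punctuation and numbers can leave gaps.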
Like tm, qdap is another important package for text mining. It also contains functions that play a vital role during mining.

Some qdap functions are listed here:

Function                 Description
bracketX()               Removes all content within brackets
replace_number()         Replaces numbers with their equivalent words (e.g. “3” becomes “three”)
replace_abbreviation()   Replaces abbreviations with their full text equivalents (e.g. “Sr” becomes “Senior”)
replace_contraction()    Converts contractions back to their base words (e.g. “couldn’t” becomes “could not”)
replace_symbol()         Replaces common symbols with their word equivalents (e.g. “$” becomes “dollar”)

Note: Before running the qdap example, make sure Java is installed on your system.


# import the qdap library
library('qdap')
# Create a text object
text <- "Nginx  is a web server software which can also be used as a reverse proxy, load balancer, mail proxy and HTTP cache.NGINX is a              free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. 
NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption."

# Remove text within brackets
bracketX(text)

# Replace numbers with words
replace_number(text)

# Replace abbreviations
replace_abbreviation(text)

# Replace contractions
replace_contraction(text)

# Replace symbols with words
replace_symbol(text)

Output

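The Nginx text contains no brackets or contractions, so most of these calls leave it almost unchanged. To see the functions at work, here is a small sketch on a made-up sentence (sample_text is not part of the original example) that contains a bracketed phrase, a number, a contraction and two symbols.

```r
library(qdap)

# Made-up sentence with brackets, a number, a contraction and symbols
sample_text <- "The server (in the lab) handled 3 requests & couldn't go down, saving $ for us."

print(bracketX(sample_text))             # bracketed part removed
print(replace_number(sample_text))       # digits spelled out as words
print(replace_contraction(sample_text))  # contraction expanded
print(replace_symbol(sample_text))       # symbols spelled out as words
```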
Stop Words

Some words occur frequently in text but provide little information; such words are called stop words. In NLP and text mining, stop words get in the way of the analysis, so we usually remove them from the text.

Words like “the”, “a” and “and” are stop words. The tm package contains 174 English stop words. You can also add your own stop words to suit the text, which we will discuss during the practice session.


# import library
library('tm')
# call stopwords function 
stopwords("en")

Output:

We can also add new stop words to the list.

Add new stop words

To add new stop words to the list we use the c() function. For example, to add the two words “myword1” and “myword2”, we write:


# import library
library('tm')
# combine the new words with the English stop words
all_stops <- c("myword1","myword2",stopwords("en"))
# print all stop words
print(all_stops)

Output: the first two entries, “myword1” and “myword2”, are the new stop words.

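If a custom word is already in the English list, c() keeps both copies. A small variation, shown here only as a sketch with made-up words, uses base R's union() so that duplicates are dropped, and compares the list lengths before and after.

```r
library(tm)

# "the" is already an English stop word, the other two are new
custom_words <- c("myword1", "myword2", "the")

# union() combines both vectors and drops duplicates
all_stops <- union(custom_words, stopwords("en"))

print(length(stopwords("en")))
print(length(all_stops))
print("myword1" %in% all_stops)
```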
Now it is time to apply the stop words concept to a real example. We take the same text we have been using so far and remove “Nginx” from it, so “Nginx” is our additional stop word.


#import tm library
library('tm')
# Create a text object 
text <- "Nginx  is a web server software which can also be used as a reverse proxy, 
load balancer, mail proxy and HTTP cache.Nginx is a              free, open-source, high-performance 
HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. 
Nginx is known for its high performance, stability, rich feature set, simple configuration, 
and low resource consumption."
# first add "Nginx" to the stop words (matching is case-sensitive)
new_stop_words <- c("Nginx",stopwords("en")) 
# Remove stop words from text
final_text <- removeWords(text,new_stop_words)
# print final text
print(final_text)

Output:

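One detail worth keeping in mind is that removeWords() matches words case-sensitively, and the built-in English stop words are all lower case. The sketch below, on a made-up sentence, shows a capitalised “The” surviving the first call and being removed once the text is lower-cased first.

```r
library(tm)

sentence <- "The server is fast and The cache is small"

# Case-sensitive: "The" does not match the lower-case stop word "the"
print(removeWords(sentence, stopwords("en")))

# Lower-casing first lets every stop word match;
# stripWhitespace() tidies the gaps left by the removed words
cleaned <- stripWhitespace(removeWords(tolower(sentence), stopwords("en")))
print(cleaned)
```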
Word Stemming and Stem Completion

Stemming is a pre-processing step in text mining applications and a very common requirement of Natural Language Processing (NLP) functions.

  • Stemming is usually done by removing attached suffixes and prefixes (affixes) from index terms before the terms are assigned to the index.
  • Word stemming reduces related words to a common form so that they unify across documents. For example, “introducing”, “introduces” and “introduce” share the stem “introduc”.
  • A stem is not always a real word. Reconstructing a real word from such a stem is called stem completion.

Important functions used for stemming

Function           Description
stemDocument()     Reduces each word to its stem (root form).
stemCompletion()   Reconstructs a complete word from a stem, using a dictionary.

# import libraries (the package name is case-sensitive: SnowballC)
library('tm')
library('SnowballC')
text_check <- c("Introducing","introduction","introduce")
#perform stemming
stem_doc <- stemDocument(text_check)

# print document root word
print(stem_doc)

Output:

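tm's stemDocument() relies on the SnowballC package for the actual stemming. As a small sketch, the block below calls SnowballC's wordStem() directly on three related lower-case words (chosen for this illustration) and counts how many distinct stems come back.

```r
library(SnowballC)

words <- c("introducing", "introduces", "introduce")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")
print(stems)

# Number of distinct stems among the three words
print(length(unique(stems)))
```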
Reconstruct word in stemming


# import libraries
library('tm')
library('SnowballC')
# lower-case the words first, because stemCompletion() matching is case-sensitive
text_check <- tolower(c("Introducing","introduction","introduce"))
# perform stemming
stem_doc <- stemDocument(text_check)
# print the stems
print(stem_doc)
# dictionary of complete words used to complete the stems
comp_doc <- "introduce"
# reconstruct complete words from the stems
stem_doc_reconstruct <- stemCompletion(stem_doc,comp_doc)
# print the reconstructed words
print(stem_doc_reconstruct)

Output:

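Stem completion can only return words that are present in the dictionary, so the choice of dictionary matters. The following sketch uses a made-up word list and dictionary, and type = "first" to pick the first matching dictionary entry for each stem; treat it as an illustration rather than a recommended setting.

```r
library(tm)
library(SnowballC)

words <- c("servers", "introducing")

# Made-up dictionary of complete words
dictionary <- c("server", "introduce", "introduction")

# Stem the words, then complete each stem from the dictionary
stems <- wordStem(words, language = "english")
completed <- stemCompletion(stems, dictionary = dictionary, type = "first")

print(stems)
print(completed)
```

A stem is completed with a dictionary entry that starts with that stem, so a stem with no such entry is left without a completion.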