This document comprises the milestone report for the Coursera Data Science Capstone project. The project requires the creation of a text prediction algorithm using files from the ‘HC Corpora’ data set. The final project will be presented as a Shiny application that demonstrates the text prediction capabilities.

The milestone report describes the loading and exploratory data analysis of the data, and summarizes the plan for completing the project.

First, the data files are loaded and the raw data is explored to obtain a sense of the distributions of words per record, as well as basic statistics such as file sizes and word counts.

Next, the data is sampled to reduce the computational load and lessen the influence of outliers on the distributions. It is then used to form a corpus using the tm package, and processed to convert the text to lowercase and to remove whitespace, numeric characters, punctuation and English stopwords (including profanity). It is then stemmed in order to account for multiple variants of words, such as plurals of nouns or different tenses of verbs.

Then, to consider the question of word frequencies and the frequencies of 2-grams, 3-grams and 4-grams, the RWeka package is used to create the corresponding ‘N-gram tokenizers’ and then generate Document Term Matrices of each set of N-grams.

In order to visualize the frequency distributions of these n-grams, the Document Term Matrices are converted to data frames, sorted in descending order of frequency, and plotted.

Download file:

This exercise uses the files named LOCALE.blogs.txt, LOCALE.news.txt and LOCALE.twitter.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.
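
The archive can be downloaded and its contents listed along the following lines (a sketch; the download URL and local paths are assumptions rather than code taken from the original report):

zip_url  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "data/Coursera-Swiftkey.zip"
if (!file.exists(zip_file)) {
    download.file(zip_url, zip_file, mode = "wb")       # binary download of the zip archive
    unzip(zip_file, exdir = "data/Coursera-Swiftkey")   # extract for the analysis below
}
unzip(zip_file, list = TRUE)   # list the archive contents (shown below) without extracting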

##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
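
The remainder of the report relies on several R packages; the setup chunk is not shown here, but roughly the following is assumed to have been loaded:

library(stringi)   # stri_stats_general(), stri_count_words()
library(ggplot2)   # qplot(), ggplot()
library(tm)        # Corpus(), tm_map(), DocumentTermMatrix()
library(RWeka)     # NGramTokenizer(), Weka_control()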

Reading in data:

blogs <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.blogs.txt")
news <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.news.txt")
twitter <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.twitter.txt")
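
Depending on the platform, readLines() may warn about embedded nul characters or an incomplete final line in these files; a common workaround (an assumption about the local environment, not something reported above) is to read with an explicit encoding and skipNul = TRUE, for example:

news <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)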

Exploratory data analysis:

First obtain line and character counts for each file.

stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   208361438   171926076
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15683765    13117038
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162384825   134370864

Next, obtain word counts for each record and summary statistics based on those counts.

wordcount_blogs<- stri_count_words(blogs)
wordcount_news<- stri_count_words(news)
wordcount_twitter<- stri_count_words(twitter)

summary(wordcount_blogs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.43   61.00 6726.00
summary(wordcount_news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.87   46.00 1123.00
summary(wordcount_twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.8    18.0    60.0

View basic histograms of the distributions of word counts per record.

Word count histogram for US blogs:

qplot(wordcount_blogs,bins=80)

Word count histogram for US news:

qplot(wordcount_news,bins=80)

Word count histogram for US tweets:

qplot(wordcount_twitter,bins=80)

Sampling data

Since there are outliers in the blogs and news data sets, and the volume of data is large, perform sampling on each of the data sets, selecting 30,000 lines at random from each.
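
For reproducibility, a seed could be set before drawing the samples (an addition not in the original analysis; the seed value below is arbitrary):

set.seed(1234)   # arbitrary seed so repeated runs draw the same sample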

sampled_blogs <- blogs[sample(1:length(blogs),30000)]
sampled_news <- news[sample(1:length(news),30000)]
sampled_twitter <- twitter[sample(1:length(twitter),30000)]

Create wordcount objects from the sampled data sets and view summary statistics on the word count distributions:

wordcount_sampled_blogs<- stri_count_words(sampled_blogs)
wordcount_sampled_news<- stri_count_words(sampled_news)
wordcount_sampled_twitter<- stri_count_words(sampled_twitter)

summary(wordcount_sampled_blogs)
summary(wordcount_sampled_news)
summary(wordcount_sampled_twitter)

View basic histograms of the word count distributions for the sampled data sets.

Word count histogram for sampled US blogs:

qplot(wordcount_sampled_blogs,bins=80)

Word count histogram for sampled US news:

qplot(wordcount_sampled_news,bins=80)

Word count histogram for sampled US tweets:

qplot(wordcount_sampled_twitter,bins=60)

These distributions look considerably less skewed due to the removal of some outliers. However, the blogs data set still contains outliers.

Re-sample the blogs data set with 10,000 records instead of 30,000:

sampled_blogs <- blogs[sample(1:length(blogs),10000)]
wordcount_sampled_blogs<- stri_count_words(sampled_blogs)
qplot(wordcount_sampled_blogs,bins=80)

Now there are no outliers past 800 words in the sampled blogs data set.

Write the sampled data to text files in preparation for generating the corpus:

if (!dir.exists("data/Coursera-Swiftkey/final/sampled")) {
    dir.create("data/Coursera-Swiftkey/final/sampled")
}
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_blogs.txt")) {
    writeLines(sampled_blogs, "data/Coursera-Swiftkey/final/sampled/sampled_blogs.txt")
}
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_news.txt")) {
    writeLines(sampled_news, "data/Coursera-Swiftkey/final/sampled/sampled_news.txt")
}
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_twitter.txt")) {
    writeLines(sampled_twitter, "data/Coursera-Swiftkey/final/sampled/sampled_twitter.txt")
}

Create Corpus

directory<- file.path("data/Coursera-Swiftkey/final/", "sampled")
data_corpus<-Corpus(DirSource(directory))
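
As a quick sanity check (not part of the original report), the corpus can be summarized to confirm that the three sampled files were loaded as separate documents:

summary(data_corpus)   # should list sampled_blogs.txt, sampled_news.txt and sampled_twitter.txt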

Perform the following operations on the corpus to convert the text to lowercase, strip whitespace, remove punctuation, numbers and stopwords, and stem the documents.

Convert to lowercase:

lc_corpus <- tm_map(data_corpus, content_transformer(tolower))

Remove white space:

lc_corpus <- tm_map(lc_corpus, stripWhitespace)

Remove punctuation:

lc_corpus <- tm_map(lc_corpus, removePunctuation)

Remove numeric characters:

lc_corpus <- tm_map(lc_corpus, removeNumbers)

Remove English stopwords (profanity can be filtered in the same way, as shown below):

lc_corpus <- tm_map(lc_corpus, removeWords, stopwords("english"))
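
The introduction mentions filtering profanity as well; this can be done with removeWords and a custom word list in exactly the same way (a sketch; the profanity file path is a hypothetical placeholder):

profanity <- readLines("data/profanity_words.txt")       # hypothetical list of banned words, one per line
lc_corpus <- tm_map(lc_corpus, removeWords, profanity)   # drop those words from the corpus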

Stem document:

This handles multiple variants of the same word, such as plurals of nouns or different tenses of words.

lc_corpus <- tm_map(lc_corpus, stemDocument)
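
For example, the Porter stemmer underlying stemDocument maps related word forms onto a common stem (a small illustration, assuming the SnowballC package is available):

SnowballC::wordStem(c("cats", "running", "runs"))
## roughly: "cat" "run" "run"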

Create N-gram tokens of single words, 2-grams, 3-grams and 4-grams

First create tokenizers for 1-word, 2-word, 3-word and 4-word sequences:

one_grams<- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
two_grams<- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
three_grams<- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_grams<- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
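
As a quick illustration of what these tokenizers produce (the example string is chosen here, not taken from the report), applying the 2-gram tokenizer to a short sentence should return its overlapping word pairs:

two_grams("this is a short sentence")
## roughly: "this is"  "is a"  "a short"  "short sentence"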

Now generate Document Term Matrices using each of the tokenizers:

one_doc_matrix <- DocumentTermMatrix(lc_corpus, 
                              control = list(tokenize = one_grams))
two_doc_matrix <- DocumentTermMatrix(lc_corpus, 
                              control = list(tokenize = two_grams))
three_doc_matrix <- DocumentTermMatrix(lc_corpus, 
                              control = list(tokenize = three_grams))
four_doc_matrix <- DocumentTermMatrix(lc_corpus, 
                                        control = list(tokenize = four_grams))

Convert Document Term Matrices into data frames sorted by frequency:

one_frequency <- colSums(as.matrix(one_doc_matrix))
one_data_frame <- data.frame(word=names(one_frequency), freq=one_frequency)
one_data_frame<-one_data_frame[with(one_data_frame, order(-freq)), ]

two_frequency <- colSums(as.matrix(two_doc_matrix))
two_data_frame <- data.frame(word=names(two_frequency), freq=two_frequency)
two_data_frame<-two_data_frame[with(two_data_frame, order(-freq)), ]

three_frequency <- colSums(as.matrix(three_doc_matrix))
three_data_frame <- data.frame(word=names(three_frequency), freq=three_frequency)
three_data_frame<-three_data_frame[with(three_data_frame, order(-freq)), ]

four_frequency <- colSums(as.matrix(four_doc_matrix))
four_data_frame <- data.frame(word=names(four_frequency), freq=four_frequency)
four_data_frame<-four_data_frame[with(four_data_frame, order(-freq)), ]

Plot histogram of single words with frequency >2500:

ggplot(one_data_frame[one_data_frame$freq>2500, ], aes(reorder(word,-freq),freq)) +
     geom_bar(stat="identity")  +xlab("Single Words") +
        ylab("Frequency") + 
       ggtitle("Single words appearing over 2500 times")

Plot histogram of 2-grams with frequency >200:

ggplot(two_data_frame[two_data_frame$freq>200, ], aes(reorder(word,-freq),freq)) +
     geom_bar(stat="identity") +xlab("2-grams") +
     ylab("Frequency") +
     ggtitle("2-grams appearing over 200 times")+
  theme(axis.text.x=element_text(angle=45, hjust=1)) 

Plot histogram of 3-grams with frequency >15:

ggplot(three_data_frame[three_data_frame$freq>15, ], aes(reorder(word,-freq),freq)) +
     geom_bar(stat="identity") +xlab("3-grams") +
     ylab("Frequency") +
     ggtitle("3-grams appearing over 15 times") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) 

Plot histogram of 4-grams with frequency >5:

ggplot(four_data_frame[four_data_frame$freq>5, ], aes(reorder(word,-freq),freq)) +
     geom_bar(stat="identity") +xlab("4-grams") +
     ylab("Frequency") +
     ggtitle("4-grams appearing over 5 times") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) 

Plans for completing the project

The plan for completing the text mining project is to use the 2-grams, 3-grams and 4-grams to predict the next word. An attempt will first be made to match the preceding three words of text against the 4-grams in the combined data set and return the final word of the best match. If the preceding three words do not match any 4-gram, the preceding two words will be used to attempt a match among the 3-grams. Failing a match among the 3-grams, a match based on the single preceding word will be attempted in the 2-gram collection. If none of these provides a match, a word will be chosen at random from a list of the most commonly occurring English words.
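
A minimal sketch of that backoff lookup is shown below, assuming the n-gram frequency tables above have been reshaped into data frames with a prefix column (the preceding words), a prediction column (the final word of the n-gram) and a freq column; all object and column names here are hypothetical:

predict_next_word <- function(phrase, four_gram_table, three_gram_table,
                              two_gram_table, common_words) {
    # Keep at most the last three words of the input phrase
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
    tables <- list("3" = four_gram_table,    # 3-word prefix -> 4-gram lookup
                   "2" = three_gram_table,   # 2-word prefix -> 3-gram lookup
                   "1" = two_gram_table)     # 1-word prefix -> 2-gram lookup
    # Try the longest available prefix first, then back off to shorter ones
    for (n in rev(seq_len(min(3, length(words))))) {
        prefix <- paste(tail(words, n), collapse = " ")
        tbl <- tables[[as.character(n)]]
        hits <- tbl[tbl$prefix == prefix, ]
        if (nrow(hits) > 0) {
            # Return the most frequent continuation of the matching prefix
            return(hits$prediction[which.max(hits$freq)])
        }
    }
    # No n-gram matched: fall back to a random common English word
    sample(common_words, 1)
}

This mirrors the backoff order described above: the longest matching prefix wins, and a random common word is the last resort.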