This document is the milestone report for the Coursera Data Science Capstone project. The project requires the creation of a text prediction algorithm using files from the ‘HC Corpora’ data set. The final project will be presented as a Shiny application that demonstrates the text prediction capabilities.
This milestone report describes the loading and exploratory analysis of the data, and summarizes the plan for completing the project.
First, the data files are loaded and the raw data is explored to get a sense of the distribution of words per line, as well as basic statistics such as file sizes and word counts.
Next, the data is sampled to reduce the computational load and normalize the distributions. The sampled data is then used to build a corpus with the tm package, which is processed to convert the text to lowercase, strip whitespace, and remove numbers, punctuation and English stopwords (including profanity). The corpus is then stemmed to account for multiple variants of the same word, such as plurals of nouns or different tenses of verbs.
Then, to examine the frequencies of single words, 2-grams, 3-grams and 4-grams, the RWeka package is used to create the corresponding ‘N-gram tokenizers’ and generate a Document Term Matrix for each set of N-grams.
In order to visualize the frequency distributions of these n-grams, the Document Term Matrices are converted to data frames, sorted in descending order of frequency, and plotted.
This exercise uses the files named LOCALE.blogs.txt, LOCALE.news.txt and LOCALE.twitter.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
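The analysis below relies on several packages; a minimal setup sketch follows (the exact set of library calls is assumed, since the original report does not show them):
library(stringi)    # stri_stats_general, stri_count_words
library(ggplot2)    # qplot, ggplot
library(tm)         # Corpus, tm_map, DocumentTermMatrix
library(SnowballC)  # stemming backend used by tm's stemDocument
library(RWeka)      # NGramTokenizer, Weka_control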
blogs <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.blogs.txt")
news <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.news.txt")
twitter <- readLines("data/Coursera-Swiftkey/final/en_US/en_US.twitter.txt")
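On some platforms readLines stops early at an embedded control character in en_US.news.txt, which may explain the comparatively small line count reported for the news file below; a hedged alternative is to open the file in binary mode and skip embedded NULs:
con <- file("data/Coursera-Swiftkey/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)  # read in binary mode, dropping NULs
close(con)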
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 208361438 171926076
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15683765 13117038
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162384825 134370864
wordcount_blogs<- stri_count_words(blogs)
wordcount_news<- stri_count_words(news)
wordcount_twitter<- stri_count_words(twitter)
summary(wordcount_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.43 61.00 6726.00
summary(wordcount_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.87 46.00 1123.00
summary(wordcount_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 60.0
Histogram of words per line for US blogs:
qplot(wordcount_blogs,bins=80)
Histogram of words per line for US news:
qplot(wordcount_news,bins=80)
Histogram of words per line for US tweets:
qplot(wordcount_twitter,bins=80)
Since there are outliers in the blogs and news data sets, and the volume of data is large, perform sampling on each of the data sets, selecting 30,000 lines at random from each.
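Because the sampling is random, results will differ between runs; setting a seed beforehand makes them reproducible (the seed value below is arbitrary):
set.seed(12345)  # arbitrary seed for reproducible sampling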
sampled_blogs <- blogs[sample(1:length(blogs),30000)]
sampled_news <- news[sample(1:length(news),30000)]
sampled_twitter <- twitter[sample(1:length(twitter),30000)]
Create wordcount objects from the sampled data sets and view summary statistics on the word count distributions:
wordcount_sampled_blogs<- stri_count_words(sampled_blogs)
wordcount_sampled_news<- stri_count_words(sampled_news)
wordcount_sampled_twitter<- stri_count_words(sampled_twitter)
summary(wordcount_sampled_blogs)
summary(wordcount_sampled_news)
summary(wordcount_sampled_twitter)
Histogram of words per line for the sampled US blogs:
qplot(wordcount_sampled_blogs,bins=80)
Histogram of words per line for the sampled US news:
qplot(wordcount_sampled_news,bins=80)
Histogram of words per line for the sampled US tweets:
qplot(wordcount_sampled_twitter,bins=60)
These distributions look considerably less skewed due to the removal of some outliers. However, the blogs data set still contains outliers.
Re-sample blogs data set with 10000 samples instead of 30000:
sampled_blogs <- blogs[sample(1:length(blogs),10000)]
wordcount_sampled_blogs<- stri_count_words(sampled_blogs)
qplot(wordcount_sampled_blogs,bins=80)
Now there are no outliers past 800 words per line in the sampled blogs data set.
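The sampled data sets are then written out so they can be read into a corpus. Since writeLines assumes the target directory already exists, a minimal sketch creates the sampled/ directory first if necessary:
sampled_dir <- "data/Coursera-Swiftkey/final/sampled"
if (!dir.exists(sampled_dir)) {
  dir.create(sampled_dir, recursive = TRUE)  # create the output directory if missing
}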
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_blogs.txt")) {
  writeLines(sampled_blogs, "data/Coursera-Swiftkey/final/sampled/sampled_blogs.txt")
}
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_news.txt")) {
  writeLines(sampled_news, "data/Coursera-Swiftkey/final/sampled/sampled_news.txt")
}
if (!file.exists("data/Coursera-Swiftkey/final/sampled/sampled_twitter.txt")) {
  writeLines(sampled_twitter, "data/Coursera-Swiftkey/final/sampled/sampled_twitter.txt")
}
directory<- file.path("data/Coursera-Swiftkey/final/", "sampled")
data_corpus<-Corpus(DirSource(directory))
lc_corpus <- tm_map(data_corpus, content_transformer(tolower))
lc_corpus <- tm_map(lc_corpus, stripWhitespace)
lc_corpus <- tm_map(lc_corpus, removePunctuation)
lc_corpus <- tm_map(lc_corpus, removeNumbers)
lc_corpus <- tm_map(lc_corpus, removeWords, stopwords("english"))
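The introduction mentions removing profanity along with stopwords, which the code above does not show. A hedged sketch, assuming a plain-text profanity word list (the file name profanity.txt is hypothetical):
profanity <- readLines("data/profanity.txt")  # hypothetical word list, one term per line
lc_corpus <- tm_map(lc_corpus, removeWords, profanity)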
Finally, stem the corpus. Stemming handles multiple variants of the same word, such as plurals of nouns or different tenses of verbs:
lc_corpus <- tm_map(lc_corpus, stemDocument)
First create tokenizers for 1-word, 2-word, 3-word and 4-word sequences:
one_grams<- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
two_grams<- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
three_grams<- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_grams<- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
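As a quick check, applying the 2-gram tokenizer to a short string should return the overlapping word pairs:
two_grams("the quick brown fox")
# expected: "the quick" "quick brown" "brown fox"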
Now generate Document Term Matrices using each of the tokenizers:
one_doc_matrix <- DocumentTermMatrix(lc_corpus,
control = list(tokenize = one_grams))
two_doc_matrix <- DocumentTermMatrix(lc_corpus,
control = list(tokenize = two_grams))
three_doc_matrix <- DocumentTermMatrix(lc_corpus,
control = list(tokenize = three_grams))
four_doc_matrix <- DocumentTermMatrix(lc_corpus,
control = list(tokenize = four_grams))
Convert Document Term Matrices into data frames sorted by frequency:
one_frequency <- colSums(as.matrix(one_doc_matrix))
one_data_frame <- data.frame(word=names(one_frequency), freq=one_frequency)
one_data_frame<-one_data_frame[with(one_data_frame, order(-freq)), ]
two_frequency <- colSums(as.matrix(two_doc_matrix))
two_data_frame <- data.frame(word=names(two_frequency), freq=two_frequency)
two_data_frame<-two_data_frame[with(two_data_frame, order(-freq)), ]
three_frequency <- colSums(as.matrix(three_doc_matrix))
three_data_frame <- data.frame(word=names(three_frequency), freq=three_frequency)
three_data_frame<-three_data_frame[with(three_data_frame, order(-freq)), ]
four_frequency <- colSums(as.matrix(four_doc_matrix))
four_data_frame <- data.frame(word=names(four_frequency), freq=four_frequency)
four_data_frame<-four_data_frame[with(four_data_frame, order(-freq)), ]
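Converting a DocumentTermMatrix to a dense matrix with as.matrix can be memory-intensive for larger samples; an alternative sketch uses the slam package (on which tm's sparse matrices are built) to compute the column sums directly:
one_frequency <- slam::col_sums(one_doc_matrix)  # same totals, without densifying the matrix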
ggplot(one_data_frame[one_data_frame$freq>2500, ], aes(reorder(word,-freq),freq)) +
geom_bar(stat="identity") +xlab("Single Words") +
ylab("Frequency") +
ggtitle("Single words appearing over 2500 times")
ggplot(two_data_frame[two_data_frame$freq>200, ], aes(reorder(word,-freq),freq)) +
geom_bar(stat="identity") +xlab("2-grams") +
ylab("Frequency") +
ggtitle("2-grams appearing over 200 times")+
theme(axis.text.x=element_text(angle=45, hjust=1))
ggplot(three_data_frame[three_data_frame$freq>15, ], aes(reorder(word,-freq),freq)) +
geom_bar(stat="identity") +xlab("3-grams") +
ylab("Frequency") +
ggtitle("3-grams appearing over 15 times") +
theme(axis.text.x=element_text(angle=45, hjust=1))
ggplot(four_data_frame[four_data_frame$freq>5, ], aes(reorder(word,-freq),freq)) +
geom_bar(stat="identity") +xlab("4-grams") +
ylab("Frequency") +
ggtitle("4-grams appearing over 5 times") + theme(axis.text.x=element_text(angle=45, hjust=1))
The plan for completing the text mining project is to use the 2-grams, 3-grams and 4-grams to predict the next word. An attempt will first be made to match the preceding three words of the input text against the 4-grams in the combined data set, with the most frequent match supplying the predicted word. If no 4-gram matches, the preceding two words will be used to attempt a match among the 3-grams. Failing a match among the 3-grams, a match based on the single preceding word will be attempted in the 2-gram collection. If none of these provides a match, a word will be chosen at random from a list of the most commonly occurring English words.
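A minimal sketch of the planned back-off lookup, assuming the frequency-sorted n-gram data frames built above (two_data_frame, three_data_frame, four_data_frame); the function name and the default fallback word are illustrative only:
predict_next_word <- function(phrase, default = "the") {
  # Keep the last three lowercased words of the input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)
  tables <- list(four_data_frame, three_data_frame, two_data_frame)
  for (i in seq_along(tables)) {
    n_context <- 4 - i                        # use 3, then 2, then 1 context words
    if (length(words) < n_context) next
    context <- paste(tail(words, n_context), collapse = " ")
    # The data frames are sorted by descending frequency, so the first match is the most frequent n-gram
    hits <- grep(paste0("^", context, " "), tables[[i]]$word, value = TRUE)
    if (length(hits) > 0) {
      return(tail(strsplit(hits[1], " ")[[1]], 1))  # last token of the n-gram = predicted next word
    }
  }
  default  # fall back to a common English word
}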