Introduction

For this assignment, we were asked to complete two of six topics.

Load packages

suppressPackageStartupMessages(library(tidyverse))
library(knitr)
library(repurrrsive)
library(gapminder)
suppressPackageStartupMessages(library(leaflet))
suppressPackageStartupMessages(library(ggmap))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(stringi))

First Task: Character data

I will work on the exercises in the Strings chapter.

Exercises in 14.2.5

1.In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

The difference can be seen by the example below.

paste("Hello","World")

## [1] "Hello World"

paste0("Hello","World")

## [1] "HelloWorld"

paste(): separates the strings with a space by default (sep = " ");
paste0(): joins the strings with NO separator.

The equivalent stringr functions are shown below.

paste0("Hello","World")

## [1] "HelloWorld"

str_c("Hello","World")

## [1] "HelloWorld"

paste0(): equivalent to str_c().
paste(): equivalent to str_c() with sep = " ".

Now, let’s figure out how these functions differ in their handling of NA.

paste("Hello","World",NA)

## [1] "Hello World NA"

paste0("Hello","World",NA)

## [1] "HelloWorldNA"

str_c("Hello","World",NA)

## [1] NA

paste() and paste0(): treat NA as the string "NA" and return it joined with the other strings
str_c(): returns NA as soon as any input is NA, regardless of the other strings
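
If the paste0()-style behaviour is wanted from str_c(), one option is to convert NA to a string first. The sketch below assumes stringr is loaded and uses a made-up vector `x`; str_replace_na() is the stringr helper for this conversion.

```r
library(stringr)

x <- c("Hello", NA) # hypothetical input with a missing value

# str_c() propagates the missing value
str_c("Say: ", x)
# str_replace_na() turns NA into the string "NA" first, mimicking paste0()
str_c("Say: ", str_replace_na(x))
# with sep = " ", str_c() reproduces paste()
str_c("Hello", "World", sep = " ") == paste("Hello", "World")
```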

2.In your own words, describe the difference between the `sep` and `collapse` arguments to str_c().

str_c(c("2018","SEP","1"),"Sunny",sep="-")

## [1] "2018-Sunny" "SEP-Sunny"  "1-Sunny"

str_c(c("2018","SEP","1"),"Sunny",collapse="-")

## [1] "2018Sunny-SEPSunny-1Sunny"

sep: the string inserted between the arguments when they are joined element by element, so the output has the same length as the longest input;
collapse: the string used to join the elements of the resulting vector into one whole string, so there is only ONE output.
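
The two arguments can also be combined in one call. This small sketch reuses the values from the example above (the vector name `first` is only for illustration): sep is applied within each element, and collapse is applied afterwards between the elements.

```r
library(stringr)

first <- c("2018", "SEP", "1")

# sep joins the arguments element by element, then collapse joins the results
str_c(first, "Sunny", sep = "-", collapse = " | ")
# with a single vector, collapse alone flattens it into one string
str_c(first, collapse = "-")
```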

3.Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

For strings with odd number of characters: extract the middle character from the string.

x1 <- "abc" #odd number
n_x1 <- str_length(x1)
str_sub(x1,(n_x1+1)/2,(n_x1+1)/2)

## [1] "b"

For strings with even number of characters: extract the middle two characters from the string.

x2 <- "abcd" # even number
n_x2 <- str_length(x2) # calculate the length of the strings
str_sub(x2,(n_x2)/2,(n_x2)/2+1)

## [1] "bc"

Now, an obvious question is: can we extract the middle character(s) in the same way whether the string has an odd or an even number of characters?

The answer is YES! It can be done with ceiling() and floor().

x3 <- c("a", "abc", "abcd", "abcde")
n_x3 <- str_length(x3)
str_sub(x3, ceiling(n_x3/2), floor(n_x3/2)+1) # start and end coincide for odd lengths and span two characters for even lengths

## [1] "a"  "b"  "bc" "c"

5. What does str_trim() do? What’s the opposite of str_trim()?

str_trim() removes whitespace from the start and/or end of a string; whitespace inside the string is kept.

str_trim(" a b c ",side = "left") # delete from left

## [1] "a b c "

str_trim(" a b c ",side = "right") # delete from right

## [1] " a b c"

str_trim(" a b c ",side = "both")# delete from both

## [1] "a b c"

The opposite of str_trim() is str_pad(), which adds whitespace to a string until it reaches a given width.

str_pad("a b c",width=6, side = "left") # add left side,total width is 6

## [1] " a b c"

str_pad("a b c",width=7, side = "both") # add both side,total width is 7

## [1] " a b c "

6. Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string “a, b, and c”. Think carefully about what it should do if given a vector of length 0, 1, or 2.

First, I write a function as below to turn a vector into a string as required.

vector_to_string <- function(x){
   if(length(x) == 0){
    return("Please enter a vector") # for a vector of length 0, return a message instead of failing
  }
   if(length(x) == 1){
    return(x) # a vector of length 1 is returned unchanged
  }
   if(length(x) == 2){
      str0 <- str_c(x[1], ", and ", x[2])
      return(str0)
   }
   else{
      # x[-length(x)] drops the last element; x[1:length(x)-1] only worked because index 0 is ignored
      str1 <- str_c(x[-length(x)], collapse = ", ") # the first part of the string
      str2 <- str_c(str1, ", and ", x[length(x)])
      return(str2)
   }
}

Now, let’s test it with vectors of length 0,1,2 or more.

str_0 <- c()
vector_to_string(str_0)

## [1] "Please enter a vector"

str_1 <- c("A")
vector_to_string(str_1)

## [1] "A"

str_2 <- c("A","B")
vector_to_string(str_2)

## [1] "A, and B"

str_3 <- c("A","B","C")
vector_to_string(str_3)

## [1] "A, B, and C"

Exercises in 14.3.1.1

1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".

Regexps use the backslash, \, to escape special behaviour.

String	Meaning
"\"	escapes the next character in the R string
"\\"	resolves to \ in the regular expression, which escapes the next character in the regular expression
"\\\"	the first two backslashes resolve to a literal backslash in the regular expression; the third escapes the next character in the string

2. How would you match the sequence "'\?

x <- "\"\'\\"
str_view(x, "\\\"\\'\\\\")

3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

x<-"\\..\\..\\.."
writeLines(x)

## \..\..\..

str_c(writeLines(x))

## \..\..\..

## character(0)
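
To make the answer concrete: read as a regular expression, \..\..\.. is a literal dot followed by any character, repeated three times. The sketch below checks this on a few made-up test strings (the vector `tests` is an assumption for illustration).

```r
library(stringr)

pattern <- "\\..\\..\\.." # string form of the regular expression \..\..\..
writeLines(pattern)       # prints the regular expression itself

tests <- c(".a.b.c", "x.1.2.3y", "a.b.c") # hypothetical inputs
# only strings containing dot-char-dot-char-dot-char produce a match
str_extract(tests, pattern)
```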

Exercises in 14.3.2.1

1. How would you match the literal string “$^$”?

x <- "$^$"
str_view(x, "^\\$\\^\\$$")

2.Given the corpus of common words in stringr::words, create regular expressions that find all words that:

Start with “y”.

str_view(words, "^y", match =TRUE)

End with “x”

str_view(words, "x$", match =TRUE)

Are exactly three letters long. (Don’t cheat by using str_length()!)

str_view(words, "^...$", match = TRUE)

Have seven letters or more.

str_view(words, ".......", match = TRUE)

Exercises in 14.3.3.1

1. Create regular expressions to find all words that:

Start with a vowel.

str_view(words, "^[aeiou]", match = TRUE)

That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_view(words, "^[^aeiou]+$", match=TRUE)

End with ed, but not with eed.

str_view(words, "^ed$|[^e]ed$", match = TRUE)

End with ing or ise.

str_view(words, "i(ng|se)$", match = TRUE)

2. Empirically verify the rule “i before e except after c”.

str_view(words, "(cei|[^c]ie)", match = TRUE)

str_view(words, "(cie|[^c]ei)", match = TRUE)
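
The two patterns can be read as "supports the rule" and "breaks the rule". As a minimal sketch, here they are applied with str_detect() to four hand-picked words (not taken from stringr::words) so that each case is visible; on the full corpus, sum() of the same calls would give the counts for each side.

```r
library(stringr)

toy <- c("receive", "believe", "science", "weird") # hypothetical examples

supports <- str_detect(toy, "(cei|[^c]ie)") # "ei" after c, or "ie" after a non-c
breaks   <- str_detect(toy, "(cie|[^c]ei)") # "ie" after c, or "ei" after a non-c

toy[supports]
toy[breaks]
```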

3. Is “q” always followed by a “u”?

str_view(words, "q[^u]", match = TRUE)

We get no matches, so within this word list the answer is YES: every “q” is followed by a “u”.

Exercises in 14.3.4.1

1. Describe the equivalents of ?, +, * in {m,n} form.

expression	equivalents in {m,n} form
`?`	{0,1}
`+`	{1,}
`*`	{0,}
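
These equivalences can be checked directly. The sketch below compares each shorthand with its {m,n} form on a small made-up vector; each identical() call is expected to return TRUE.

```r
library(stringr)

x <- c("color", "colour", "colouur") # hypothetical test strings

# ? versus {0,1}
identical(str_detect(x, "^colou?r$"), str_detect(x, "^colou{0,1}r$"))
# + versus {1,}
identical(str_extract(x, "u+"), str_extract(x, "u{1,}"))
# * versus {0,}
identical(str_extract(x, "ou*"), str_extract(x, "ou{0,}"))
```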

2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$ matches any string.
"\\{.+\\}" matches any string that contains {xx}, where xx is one or more characters.
\d{4}-\d{2}-\d{2} matches xxxx-xx-xx, where x represents a digit.
"\\\\{4}" matches four backslashes.

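As a quick check of these descriptions, the sketch below runs three of the patterns against a made-up vector (the strings in `x` are assumptions chosen so that each pattern matches exactly one of them). The patterns are written as R strings, so every regex backslash is doubled.

```r
library(stringr)

# the last element contains four literal backslashes
x <- c("released 2018-11-09", "18-11-9", "{ok}", "{}", "a\\\\\\\\b")

str_detect(x, "\\d{4}-\\d{2}-\\d{2}") # only the full date matches
str_detect(x, "\\{.+\\}")             # braces must enclose at least one character
str_detect(x, "\\\\{4}")              # four backslashes in a row
```
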
3. Create regular expressions to find all words that:

Start with three consonants.

str_view(words, "^[^aeiou]{3}", match = TRUE)

Have three or more vowels in a row.

str_view(words, "[aeiou]{3,}", match = TRUE)

Have two or more vowel-consonant pairs in a row.

str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)

Exercises in 14.4.2

1.For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

A single regular expression:

str_subset(words,"(^x)|(x$)")

## [1] "box" "sex" "six" "tax"

A combination of multiple str_detect():

str_start <- str_detect(words, "^x")
str_end <- str_detect(words, "x$")
words[str_start|str_end]

## [1] "box" "sex" "six" "tax"

Find all words that start with a vowel and end with a consonant.

A single regular expression:

str_subset(words,"^[aeiou].*[^aeiou]$")%>%
   head() # just show the first six here to avoid large output

## [1] "about"   "accept"  "account" "across"  "act"     "actual"

A combination of multiple str_detect():

s1<-str_detect(words, "^[aeiou]")
s2<-str_detect(words,"[^aeiou]$")
head(words[s1&s2])

## [1] "about"   "accept"  "account" "across"  "act"     "actual"

Are there any words that contain at least one of each different vowel?

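has_all <- str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") &
   str_detect(words, "o") & str_detect(words, "u") # one test per vowel is simpler than a single regex
words[has_all]

## character(0)

No word in the list contains all five vowels.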

2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

Find the word with the highest number of vowels:

n_max <- max(str_count(words,"[aeiou]")) # highest number of vowels
words[str_count(words,"[aeiou]")==n_max]

## [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
## [6] "experience"  "individual"  "television"

Find the word with the highest proportion of vowels:

n_vowels <- str_count(words,"[aeiou]") # count the number of vowels of each word
n_total <- str_count(words,".") # count the number of character of each word
max(n_vowels/n_total) # highest proportion of vowels

## [1] 1

words[n_vowels/n_total==max(n_vowels/n_total)]

## [1] "a"

Exercises in 14.4.3.1

1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

I match the boundary between words with \b.

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
b_colour_match <- str_c("\\b(", colour_match, ")\\b") # group the alternatives so that \b applies to every colour
b_colour_match

## [1] "\\b(red|orange|yellow|green|blue|purple)\\b"

more <- sentences[str_count(sentences, b_colour_match) > 1]
str_view_all(more, b_colour_match)

2. From the Harvard sentences data, extract:

The first word from each sentence.

n1<-"^[^ ]+"
str_extract(sentences,n1) %>%
   head()

## [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"

All words ending in ing.

n2<-"([^ ]+)ing"
s<-str_subset(sentences,n2)
str_extract(s,n2) %>%
   head()

## [1] "stocking" "spring"   "evening"  "morning"  "winding"  "living"
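
Note that the pattern above does not require the word to stop at "ing": "stocking" is extracted from the word "stockings". A possible refinement, sketched here on two made-up phrases, is to add word boundaries so that only words actually ending in ing are returned.

```r
library(stringr)

x <- c("stockings for sale", "a winding road") # hypothetical phrases

# without boundaries, part of "stockings" is returned
str_extract(x, "([^ ]+)ing")
# with \b on both sides, only whole words ending in "ing" match
str_extract(x, "\\b[A-Za-z]+ing\\b")
```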

Exercises in 14.5.1

1.How would you find all strings containing \ with regex() vs. with fixed()?

Let’s first create the test strings test_s and then find all strings containing \.

test_s <- c("a\\b\\c","1\\2","hello")

Use regex():

str_view_all(test_s, regex(pattern = "\\\\"))

Use fixed():

str_detect(test_s, fixed(pattern = "\\"))

## [1]  TRUE  TRUE FALSE
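
The same distinction applies to any character that is special in a regular expression. As a small sketch with a made-up vector, a dot means "any character" under regex() but is taken literally under fixed(), which is why fixed() needs one level of escaping less.

```r
library(stringr)

x <- c("a.b", "axb") # hypothetical inputs

str_detect(x, ".")        # regex: "." matches any character
str_detect(x, "\\.")      # regex: escaped dot matches only a literal "."
str_detect(x, fixed(".")) # fixed: "." is literal without any escaping
```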

2.What are the five most common words in sentences?

After installing the tidytext package, we can find the most common words.

library(tidytext)
library(dplyr)
tibble(text=sentences) %>% # tibble() replaces the deprecated data_frame()
    unnest_tokens(word, text) %>%    # split words
    count(word, sort = TRUE) %>% # count occurrences
    head(5)

## # A tibble: 5 x 2
##   word      n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118

We can see that the five most common words are “the”, “a”, “of”, “to” and “and”, which are all stop words. Furthermore, if we want to take out such words, we can do it as below.

tibble(text=sentences) %>% # tibble() replaces the deprecated data_frame()
    unnest_tokens(word, text) %>%    # split words
    anti_join(stop_words) %>%    # take out "a", "an", "the", etc.
    count(word, sort = TRUE) %>% # count occurrences
    head(5)

## Joining, by = "word"

## # A tibble: 5 x 2
##   word       n
##   <chr>  <int>
## 1 red       11
## 2 fine      10
## 3 green     10
## 4 hot       10
## 5 strong    10

Exercises in 14.7.1

1. Find the stringi functions that:

Count the number of words.

stri_count_words("Today is the deadline of hw06")

## [1] 6

Find duplicated strings.

stri_duplicated(c("a","b","c",NA,"a","b","d","e",NA))

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE

stri_duplicated_any(c("a","b","c",NA,"a","b","d","e",NA))

## [1] 5

stri_duplicated() shows, for each string, whether it duplicates an earlier element.
stri_duplicated_any() returns the index of the first duplicated element (5 here), or 0 if there are no duplicates.
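
If the duplicated values themselves or their number are needed, the logical vector from stri_duplicated() can be used for subsetting or summed. A minimal sketch, reusing the vector from above under the name `x`:

```r
library(stringi)

x <- c("a", "b", "c", NA, "a", "b", "d", "e", NA)

x[stri_duplicated(x)]            # the values that repeat an earlier element
sum(stri_duplicated(x))          # how many elements are repeats
stri_duplicated_any(c("a", "b")) # 0 when every element is unique
```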

Generate random text.

Random generation can be done by stri_rand_strings() and stri_rand_shuffle().

Generate random strings of specific length:

stri_rand_strings(5, 10) # 5 strings of length 10

## [1] "KonKuDMG8s" "IOHfygXKAo" "xNYl1qKwsw" "UXE61EmnZ8" "tHpUOlzUbw"

Generate random strings of random length:

stri_rand_strings(5, sample(1:10, 5, replace=TRUE)) # 5 strings of random lengths

## [1] "hmLPlNJ"   "E"         "KJKj"      "BqM"       "oMnJqtqAW"

Generate strings that each contain at least one digit, one lowercase and one uppercase ASCII letter, which is quite useful for creating random passwords:

n <- 10 # number of strings
stri_rand_shuffle(stri_paste(
   stri_rand_strings(n, 1, '[0-9]'),
   stri_rand_strings(n, 1, '[a-z]'),
   stri_rand_strings(n, 1, '[A-Z]'),
   stri_rand_strings(n, sample(5:11, n, replace=TRUE), '[a-zA-Z0-9]') # one random length per string
))

##  [1] "bRx7hwo3"       "Lt4q7J8o9yBB"   "YsupFizoRAJR4"  "b6rTTfTDHa"    
##  [5] "6MCQ1TnMe4vGuI" "w2qvJ0ZH"       "S3UBr7bOMmAc"   "1PjnJa93FydG3" 
##  [9] "T7yOpz4nln"     "L5K8tZVzdxN0ii"

2.How do you control the language that stri_sort() uses for sorting?

In order to control the language, we can set a locale to use when sorting by stri_sort(..., locale = ...).

stri_sort(c("hladny", "chladny"), locale="pl_PL")

## [1] "chladny" "hladny"

stri_sort(c("hladny", "chladny"), locale="sk_SK")

## [1] "hladny"  "chladny"

Second Task: Write functions

In this part, the task requires us to write one (or more) functions that do something useful to pieces of the Gapminder or Singer data. I will begin with the linear regression function presented at http://stat545.com/block012_function-regress-lifeexp-on-year.html, and generalize it to do quadratic regression (include a squared term).

Generate a dataframe called `j_dat`

I extract the data for Zimbabwe from gapminder in order to get a reasonable dataframe. The new dataframe is called j_dat. I will try to do the quadratic regression for this country’s life expectancy over the years.

j_country <- "Zimbabwe" # an example
(j_dat <- gapminder %>% 
  filter(country == j_country))

## # A tibble: 12 x 6
##    country  continent  year lifeExp      pop gdpPercap
##    <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Zimbabwe Africa     1952    48.5  3080907      407.
##  2 Zimbabwe Africa     1957    50.5  3646340      519.
##  3 Zimbabwe Africa     1962    52.4  4277736      527.
##  4 Zimbabwe Africa     1967    54.0  4995432      570.
##  5 Zimbabwe Africa     1972    55.6  5861135      799.
##  6 Zimbabwe Africa     1977    57.7  6642107      686.
##  7 Zimbabwe Africa     1982    60.4  7636524      789.
##  8 Zimbabwe Africa     1987    62.4  9216418      706.
##  9 Zimbabwe Africa     1992    60.4 10704340      693.
## 10 Zimbabwe Africa     1997    46.8 11404948      792.
## 11 Zimbabwe Africa     2002    40.0 11926563      672.
## 12 Zimbabwe Africa     2007    43.5 12311143      470.

Result from linear regression

Now, let’s plot the data first.

j_dat %>% 
   ggplot(aes(x = year, y = lifeExp))+
   geom_point() + 
   geom_smooth(method = "lm", se = FALSE)+ 
   ggtitle("Linear regression of Zimbabwe's lifeExp over years")+
   theme( plot.title = element_text(hjust = 0.5))

We can obtain detailed information on linear regression through the summary() command as below.

attach(j_dat) #attach the entire dataset so that we can refer to all variables directly by name
names(j_dat)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

j_fit <-lm(lifeExp ~ year)
summary(j_fit)

## 
## Call:
## lm(formula = lifeExp ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.581  -4.870  -0.882   5.567  10.386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.79819  238.55797   0.993    0.344
## year         -0.09302    0.12051  -0.772    0.458
## 
## Residual standard error: 7.205 on 10 degrees of freedom
## Multiple R-squared:  0.05623,    Adjusted R-squared:  -0.03814 
## F-statistic: 0.5958 on 1 and 10 DF,  p-value: 0.458

From the summary() result, we can obtain some useful coefficients along with four goodness-of-fit measures for regression analysis.

Std. Error: the standard error of each coefficient estimate. For the slope in a simple regression, it is the Residual Standard Error (see below) divided by the square root of the sum of squared deviations of x from its mean.
t value: Estimate divided by Std. Error.
Pr(>|t|): the two-sided p-value of the t value under a t distribution with the residual degrees of freedom.
Residual Standard Error: essentially the standard deviation of the residuals (errors) of the regression model.
Multiple R-Squared: the proportion of the variance of Y explained by the model.
Adjusted R-Squared: same as Multiple R-Squared, but penalized for the number of variables relative to the number of observations.
F-Statistic: a global test of whether the model has at least one significant variable. It takes into account the number of variables and observations used.
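
These definitions can be verified by hand. The sketch below is only an illustration: it rebuilds a small data frame `zim` from the rounded lifeExp values printed in the j_dat tibble (so the numbers differ slightly from the summary above) and recomputes the slope's standard error, the t values, the p-values and R-squared from their definitions. Each all.equal() call is expected to return TRUE.

```r
# Rounded values copied from the printed tibble; an assumption made to keep
# the example self-contained
zim <- data.frame(
  year    = seq(1952, 2007, by = 5),
  lifeExp = c(48.5, 50.5, 52.4, 54.0, 55.6, 57.7, 60.4, 62.4, 60.4, 46.8, 40.0, 43.5)
)
fit <- lm(lifeExp ~ year, data = zim)
co  <- coef(summary(fit))

# slope standard error = residual standard error / sqrt(sum of squared deviations of x)
se_slope <- sigma(fit) / sqrt(sum((zim$year - mean(zim$year))^2))
# t value = estimate / standard error
t_manual <- co[, "Estimate"] / co[, "Std. Error"]
# two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
# R-squared = 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(fit)^2)
tss <- sum((zim$lifeExp - mean(zim$lifeExp))^2)

all.equal(se_slope, co["year", "Std. Error"])
all.equal(t_manual, co[, "t value"])
all.equal(p_manual, co[, "Pr(>|t|)"])
all.equal(1 - rss / tss, summary(fit)$r.squared)
```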

The estimated intercept, about 236.8, is clearly unreasonable as a life expectancy: it is the model's prediction for year 0.

It makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset.

What am I doing here: lm(lifeExp ~ I(year - 1952))? I want the intercept to correspond to 1952 and an easy way to accomplish that is to create a new predictor on the fly: year minus 1952. The way I achieve that in the model formula, I(year - 1952), uses the I() function which “inhibits interpretation/conversion of objects”. By protecting the expression year - 1952, I ensure it is interpreted in the obvious arithmetical way.

j_fit <- lm(lifeExp ~ I(year - 1952), j_dat)
summary(j_fit)

## 
## Call:
## lm(formula = lifeExp ~ I(year - 1952), data = j_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.581  -4.870  -0.882   5.567  10.386 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    55.22124    3.91270  14.113 6.27e-08 ***
## I(year - 1952) -0.09302    0.12051  -0.772    0.458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.205 on 10 degrees of freedom
## Multiple R-squared:  0.05623,    Adjusted R-squared:  -0.03814 
## F-statistic: 0.5958 on 1 and 10 DF,  p-value: 0.458

From the summary above, the intercept is now 55.22, which is much more reasonable as a life expectancy in 1952. Meanwhile, its standard error drops from 238.6 to 3.9.

Result from Quadratic regression

First, we have to create a new variable called year2 which is the square of the variable year.

year2 <- year^2

Now, let’s fit the quadratic regression to the dataframe.

j_fit <- lm(lifeExp ~ year + year2)
summary(j_fit)

## 
## Call:
## lm(formula = lifeExp ~ year + year2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2522 -2.7543 -0.6202  2.7916  5.9329 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -8.118e+04  1.803e+04  -4.502  0.00148 **
## year         8.217e+01  1.822e+01   4.510  0.00147 **
## year2       -2.078e-02  4.602e-03  -4.515  0.00146 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.6467 
## F-statistic: 11.07 on 2 and 9 DF,  p-value: 0.003753

Unfortunately, the result looks terrible! The intercept says that lifeExp equals about minus 81,180 years at year 0, and its Std. Error of 18,030 is totally unacceptable.

Let’s handle the offset in the way above.

j_fit <- lm(lifeExp ~ I(year-1952) + I(year2-1952^2))
summary(j_fit)

## 
## Call:
## lm(formula = lifeExp ~ I(year - 1952) + I(year2 - 1952^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2522 -2.7543 -0.6202  2.7916  5.9329 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       45.697407   3.107870  14.704 1.34e-07 ***
## I(year - 1952)    82.172151  18.220039   4.510  0.00147 ** 
## I(year2 - 1952^2) -0.020779   0.004602  -4.515  0.00146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.6467 
## F-statistic: 11.07 on 2 and 9 DF,  p-value: 0.003753

Great! The result looks better now. The intercept turns out to be 45.70, the fitted life expectancy in 1952, which seems more reasonable. Meanwhile, its standard error decreases to 3.1.
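
A more conventional way to center a quadratic term is to square the shifted year, I((year - 1952)^2), which also makes the linear coefficient interpretable as the slope in 1952. The sketch below is an alternative under stated assumptions, not the approach used in this report: it rebuilds a data frame `zim` from the rounded values in the printed tibble and shows that both parametrizations give the same fitted curve and the same intercept.

```r
# Rounded values copied from the printed tibble; an assumption made to keep
# the example self-contained
zim <- data.frame(
  year    = seq(1952, 2007, by = 5),
  lifeExp = c(48.5, 50.5, 52.4, 54.0, 55.6, 57.7, 60.4, 62.4, 60.4, 46.8, 40.0, 43.5)
)

# parametrization used in the text: the squared term is shifted by 1952^2
fit_text <- lm(lifeExp ~ I(year - 1952) + I(year^2 - 1952^2), data = zim)
# conventional centering: square the shifted year
fit_ctr  <- lm(lifeExp ~ I(year - 1952) + I((year - 1952)^2), data = zim)

all.equal(fitted(fit_text), fitted(fit_ctr))                   # same fitted curve
all.equal(unname(coef(fit_text)[1]), unname(coef(fit_ctr)[1])) # same intercept
coef(fit_ctr) # the linear term is now the slope at 1952
```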

Now, let’s plot the data.

j_dat %>% 
   ggplot(aes(x = year, y = lifeExp))+
   geom_point() + 
   geom_smooth(method = "lm", formula = y ~ x + I(x^2))+ 
   ggtitle("Quadratic regression of Zimbabwe's lifeExp over years")+
   theme( plot.title = element_text(hjust = 0.5))

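This shows that quadratic regression has clear advantages over linear regression in this case: Multiple R-squared rises from 0.056 to 0.711, and the residual standard error falls from 7.2 to 4.2.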

Create a function that does the quadratic regression with an automatic offset

From the results above, we know that it is more reasonable to do the quadratic regression with the year offset to 1952. The new function we will create below is called quad_fit_1952.

quad_fit_1952 <- function(dat, offset = 1952){
  # use the dat and offset arguments instead of the attached variables and a hard-coded 1952
  j_fit <- lm(lifeExp ~ I(year - offset) + I(year^2 - offset^2), data = dat)
  summary(j_fit)
}

The function has been created, so let’s try it with the j_dat dataframe.

quad_fit_1952(j_dat)

## 
## Call:
## lm(formula = lifeExp ~ I(year - offset) + I(year^2 - offset^2), 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2522 -2.7543 -0.6202  2.7916  5.9329 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          45.697407   3.107870  14.704 1.34e-07 ***
## I(year - offset)     82.172151  18.220039   4.510  0.00147 ** 
## I(year^2 - offset^2) -0.020779   0.004602  -4.515  0.00146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.6467 
## F-statistic: 11.07 on 2 and 9 DF,  p-value: 0.003753

The estimated intercept looks good above. The function works well with an automatic offset.
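
For reuse across many countries, it can be more convenient for such a function to return the coefficients rather than print a summary. The following is a minimal sketch under stated assumptions: the name `quad_coefs`, the conventional centering I((year - offset)^2) and the data frame `toy` are all hypothetical. The toy data follow an exact quadratic, so the recovered coefficients are known in advance (50, 0.5 and -0.01).

```r
# Hypothetical variant that returns a named coefficient vector
quad_coefs <- function(dat, offset = 1952) {
  fit <- lm(lifeExp ~ I(year - offset) + I((year - offset)^2), data = dat)
  setNames(coef(fit), c("intercept", "linear", "quadratic"))
}

# Toy data generated from lifeExp = 50 + 0.5 * t - 0.01 * t^2, with t = year - 1952
toy <- data.frame(year = seq(1952, 2007, by = 5))
toy$lifeExp <- 50 + 0.5 * (toy$year - 1952) - 0.01 * (toy$year - 1952)^2

quad_coefs(toy)
```

A named numeric vector like this is easy to collect into a data frame, one row per country.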

hw06- Data wrangling wrap up

Meiqi Yu