For this assignment, we have been tasked with two topics out of six.
suppressPackageStartupMessages(library(tidyverse))
library(knitr)
library(repurrrsive)
library(gapminder)
suppressPackageStartupMessages(library(leaflet))
suppressPackageStartupMessages(library(ggmap))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(stringi))
I will work on the exercises in the Strings chapter.
The difference between paste() and paste0() can be seen in the example below.
paste("Hello","World")
## [1] "Hello World"
paste0("Hello","World")
## [1] "HelloWorld"
paste(): there is a space between the strings.
paste0(): there is NO space between the strings.
The equivalent stringr function is shown below.
paste0("Hello","World")
## [1] "HelloWorld"
str_c("Hello","World")
## [1] "HelloWorld"
paste0(): equivalent to str_c().
paste(): equivalent to str_c() with sep = " ".
Now, let’s figure out how these functions differ in their handling of NA.
paste("Hello","World",NA)
## [1] "Hello World NA"
paste0("Hello","World",NA)
## [1] "HelloWorldNA"
str_c("Hello","World",NA)
## [1] NA
paste() and paste0(): treat NA as the string "NA" and return it together with the other strings.
str_c(): returns only NA, regardless of the other strings.
Next, the difference between the sep and collapse arguments to str_c().
str_c(c("2018","SEP","1"),"Sunny",sep="-")
## [1] "2018-Sunny" "SEP-Sunny" "1-Sunny"
str_c(c("2018","SEP","1"),"Sunny",collapse="-")
## [1] "2018Sunny-SEPSunny-1Sunny"
sep: the string inserted between the arguments, element by element; the output has as many elements as the longest input.
collapse: the string used to join the elements of the result into one whole string; there is only ONE output.
For strings with an odd number of characters: extract the middle character from the string.
x1 <- "abc" #odd number
n_x1 <- str_length(x1)
str_sub(x1,(n_x1+1)/2,(n_x1+1)/2)
## [1] "b"
For strings with an even number of characters: extract the middle two characters from the string.
x2 <- "abcd" # even number
n_x2 <- str_length(x2) # calculate the length of the strings
str_sub(x2,(n_x2)/2,(n_x2)/2+1)
## [1] "bc"
Now, an obvious question is: can we extract the middle character(s) in the same way whether the string has an odd or an even number of characters?
The answer is YES! It can be done with ceiling().
x3 <- c("a", "abc", "abcd", "abcde")
n_x3 <- str_length(x3)
str_sub(x3, ceiling(n_x3/2), floor(n_x3/2)+1) # start at the ceiling, end at the floor + 1, so both positions are whole numbers
## [1] "a" "b" "bc" "c"
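The same idea can be wrapped in a small helper. This is a minimal sketch: the name middle_chars is hypothetical (it is not part of stringr), and it uses floor() for the end position so that both positions are whole numbers.

```r
library(stringr)

# Hypothetical helper: return the middle character of each string,
# or the middle two characters when the length is even.
middle_chars <- function(x) {
  n <- str_length(x)
  # start at ceiling(n / 2), end at floor(n / 2) + 1
  str_sub(x, ceiling(n / 2), floor(n / 2) + 1)
}

middle_chars(c("a", "abc", "abcd", "abcde"))
# "a" "b" "bc" "c"
```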
str_trim() removes whitespace from the start and/or end of a string.
str_trim(" a b c ",side = "left") # delete from left
## [1] "a b c "
str_trim(" a b c ",side = "right") # delete from right
## [1] " a b c"
str_trim(" a b c ",side = "both")# delete from both
## [1] "a b c"
The opposite of str_trim() is str_pad(), which adds whitespace to a string.
str_pad("a b c",width=6, side = "left") # pad on the left, total width is 6
## [1] " a b c"
str_pad("a b c",width=7, side = "both") # add both side,total width is 7
## [1] " a b c "
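A quick way to see that the two functions undo each other is to pad a string and then trim it again. This is a minimal sketch with made-up inputs; the last line uses the pad argument of str_pad() to pad with a character other than a space.

```r
library(stringr)

# Pad "a b c" (5 characters) to width 9: two spaces are added on each side
padded <- str_pad("a b c", width = 9, side = "both")
str_length(padded)            # 9

# Trimming removes the padding again
str_trim(padded) == "a b c"   # TRUE

# str_pad() can also pad with another single character via `pad`
str_pad("7", width = 3, side = "left", pad = "0")   # "007"
```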
First, I write a function as below to turn a vector into a string as required.
vector_to_string <- function(x){
  n <- length(x)
  if(n == 0){
    return(str_c("Please enter a vector")) # with a vector of length 0, it's better to show an error message
  }
  if(n == 1){
    return(x) # with a vector of length 1, return itself
  }
  if(n == 2){
    str0 <- str_c(x[1],", and ",x[2])
    return(str0)
  } else {
    # x[-n] drops the last element; the original x[1:length(x)-1] only worked
    # because 1:length(x)-1 is (1:length(x))-1 and index 0 is ignored
    str1 <- str_c(x[-n],collapse=", ") # the first part of the string
    str2 <- str_c(str1,", and ", x[n])
    return(str2)
  }
}
Now, let’s test it with vectors of length 0,1,2 or more.
str_0 <- c()
vector_to_string(str_0)
## [1] "Please enter a vector"
str_1 <- c("A")
vector_to_string(str_1)
## [1] "A"
str_2 <- c("A","B")
vector_to_string(str_2)
## [1] "A, and B"
str_3 <- c("A","B","C")
vector_to_string(str_3)
## [1] "A, B, and C"
Regexps use the backslash, \, to escape special behaviour.
String | Meaning |
---|---|
"\" | escapes the next character in the R string |
"\\" | resolves to \ in the regular expression, where it escapes the next character |
"\\\" | the first two backslashes resolve to a literal backslash in the regular expression; the third escapes the next character |
x <- "\"\'\\"
str_view(x, "\\\"\\'\\\\")
x<-"\\..\\..\\.."
writeLines(x)
## \..\..\..
str_c(writeLines(x))
## \..\..\..
## character(0)
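To see what this pattern matches, it can be tried on a few strings. This is a minimal sketch; the test strings are made up for illustration. Each \. matches a literal dot and each unescaped . matches any single character, so the pattern needs a dot followed by any character, three times in a row.

```r
library(stringr)

pattern <- "\\..\\..\\.."   # the regular expression \..\..\.. written as an R string
tests <- c(".a.b.c", "abcdef", "x.1.2.3y", "...")

# Only strings containing dot + any character, three times in a row, match
str_detect(tests, pattern)
# TRUE FALSE TRUE FALSE
```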
x <- "$^$"
str_view(x, "^\\$\\^\\$$")
str_view(words, "^y", match =TRUE)
str_view(words, "x$", match =TRUE)
str_view(words, "^...$", match = TRUE)
str_view(words, ".......", match = TRUE)
str_view(words, "^[aeiou]", match = TRUE)
str_view(words, "^[^aeiou]+$", match=TRUE)
str_view(words, "^ed$|[^e]ed$", match = TRUE)
str_view(words, "i(ng|se)$", match = TRUE)
str_view(words, "(cei|[^c]ie)", match = TRUE)
str_view(words, "(cie|[^c]ei)", match = TRUE)
str_view(words, "q[^u]", match = TRUE)
No word matches, so in this word list every "q" is followed by a "u".
expression | equivalent in {m,n} form |
---|---|
? | {0,1} |
+ | {1,} |
* | {0,} |
1. ^.*$ matches any string.
2. "\\{.+\\}" matches any string that contains {xx}, where xx can be any character(s).
3. \d{4}-\d{2}-\d{2} matches xxxx-xx-xx, where x represents a digit.
4. "\\\\{4}" matches four backslashes.
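These descriptions can be checked with str_detect(). The sketch below uses made-up test strings; note that the third pattern is written above as a bare regular expression, so its backslashes are doubled here to turn it into an R string.

```r
library(stringr)

# ^.*$ matches any string, including the empty one
str_detect(c("", "anything"), "^.*$")                          # TRUE TRUE

# \{.+\} needs at least one character between the braces
str_detect(c("{ab}", "{}"), "\\{.+\\}")                        # TRUE FALSE

# \d{4}-\d{2}-\d{2} needs four digits, two digits, two digits
str_detect(c("2018-09-01", "18-9-1"), "\\d{4}-\\d{2}-\\d{2}")  # TRUE FALSE

# \\{4} needs four backslashes in a row; the second string has only two
str_detect(c("a\\\\\\\\b", "a\\\\b"), "\\\\{4}")               # TRUE FALSE
```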
str_view(words, "^[^aeiou]{3}")
str_view(words, "[aeiou]{3,}")
str_view(words, "([aeiou][^aeiou]){2,}")
A single regular expression:
str_subset(words,"(^x)|(x$)")
## [1] "box" "sex" "six" "tax"
A combination of multiple str_detect():
str_start <- str_detect(words, "^x")
str_end <- str_detect(words, "x$")
words[str_start|str_end]
## [1] "box" "sex" "six" "tax"
A single regular expression:
str_subset(words,"^[aeiou].*[^aeiou]$")%>%
head() # just show the first six here to avoid large output
## [1] "about" "accept" "account" "across" "act" "actual"
A combination of multiple str_detect():
s1<-str_detect(words, "^[aeiou]")
s2<-str_detect(words,"[^aeiou]$")
head(words[s1&s2])
## [1] "about" "accept" "account" "across" "act" "actual"
str_subset(words,"^[aeiou].*[aeiou]")%>%
head() # just show the first six here to avoid large output
## [1] "able" "about" "absolute" "accept" "account" "achieve"
Find the word with the highest number of vowels:
n_max <- max(str_count(words,"[aeiou]")) # highest number of vowels
words[str_count(words,"[aeiou]")==n_max]
## [1] "appropriate" "associate" "available" "colleague" "encourage"
## [6] "experience" "individual" "television"
Find the word with the highest proportion of vowels:
n_vowels <- str_count(words,"[aeiou]") # count the number of vowels of each word
n_total <- str_count(words,".") # count the number of character of each word
max(n_vowels/n_total) # highest proportion of vowels
## [1] 1
words[n_vowels/n_total==max(n_vowels/n_total)]
## [1] "a"
I match the boundary between words with \b.
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
b_colour_match <- str_c("\\b(", colour_match, ")\\b") # group the alternatives so that \b applies to every colour
b_colour_match
## [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
more <- sentences[str_count(sentences, b_colour_match) > 1]
str_view_all(more, b_colour_match)
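Alternation (|) has the lowest precedence in a regular expression, so whether \b applies to every colour depends on grouping. The sketch below compares the two forms on a few made-up test words; the names ungrouped, grouped and test_words are only for illustration.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
alternatives <- str_c(colours, collapse = "|")

# Without parentheses, the first \b belongs only to "red"
# and the last \b only to "purple"
ungrouped <- str_c("\\b", alternatives, "\\b")
# With parentheses, both \b apply to every colour
grouped <- str_c("\\b(", alternatives, ")\\b")

test_words <- c("green", "greenhouse", "reddish")
str_detect(test_words, ungrouped)   # TRUE TRUE TRUE
str_detect(test_words, grouped)     # TRUE FALSE FALSE
```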
n1<-"^[^ ]+"
str_extract(sentences,n1) %>%
head()
## [1] "The" "Glue" "It's" "These" "Rice" "The"
n2<-"([^ ]+)ing"
s<-str_subset(sentences,n2)
str_extract(s,n2) %>%
head()
## [1] "stocking" "spring" "evening" "morning" "winding" "living"
Let’s first create the test strings test_s, then find all strings containing a backslash (\).
test_s <- c("a\\b\\c","1\\2","hello")
Use regex():
str_view_all(test_s, regex(pattern = "\\\\"))
Use fixed():
str_detect(test_s, fixed(pattern = "\\"))
## [1] TRUE TRUE FALSE
After installing the tidytext package, we can find the most common words.
library(tidytext)
library(dplyr)
tibble(text=sentences) %>% # tibble() replaces the deprecated data_frame()
unnest_tokens(word, text) %>% # split words
count(word, sort = TRUE) %>% # count occurrences
head(5)
## # A tibble: 5 x 2
## word n
## <chr> <int>
## 1 the 751
## 2 a 202
## 3 of 132
## 4 to 123
## 5 and 118
We can see that the five most common words are function words such as “the”, “a” and “of”. Furthermore, if we want to take out such stop words, we can do it as below.
tibble(text=sentences) %>% # tibble() replaces the deprecated data_frame()
unnest_tokens(word, text) %>% # split words
anti_join(stop_words) %>% # take out "a", "an", "the", etc.
count(word, sort = TRUE) %>% # count occurrences
head(5)
## Joining, by = "word"
## # A tibble: 5 x 2
## word n
## <chr> <int>
## 1 red 11
## 2 fine 10
## 3 green 10
## 4 hot 10
## 5 strong 10
1. Count the number of words.
stri_count_words("Today is the deadline of hw06")
## [1] 6
stri_duplicated(c("a","b","c",NA,"a","b","d","e",NA))
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
stri_duplicated_any(c("a","b","c",NA,"a","b","d","e",NA))
## [1] 5
stri_duplicated() shows, for each string, whether it duplicates an earlier element.
stri_duplicated_any() returns the index of the first duplicated element (here 5), or 0 if there are no duplicates.
stri_rand_strings() and stri_rand_shuffle().
Generate random strings of a specific length:
stri_rand_strings(5, 10) # 5 strings of length 10
## [1] "KonKuDMG8s" "IOHfygXKAo" "xNYl1qKwsw" "UXE61EmnZ8" "tHpUOlzUbw"
Generate random strings of random length:
stri_rand_strings(5, sample(1:10, 5, replace=TRUE)) # 5 strings of random lengths
## [1] "hmLPlNJ" "E" "KJKj" "BqM" "oMnJqtqAW"
Generate strings that each contain at least one digit, one lowercase and one uppercase ASCII letter, which is quite useful for creating random passwords:
n <- 10 # number of strings
stri_rand_shuffle(stri_paste(
stri_rand_strings(n, 1, '[0-9]'),
stri_rand_strings(n, 1, '[a-z]'),
stri_rand_strings(n, 1, '[A-Z]'),
stri_rand_strings(n, sample(5:11, n, replace=TRUE), '[a-zA-Z0-9]') # one random length per string
))
## [1] "bRx7hwo3" "Lt4q7J8o9yBB" "YsupFizoRAJR4" "b6rTTfTDHa"
## [5] "6MCQ1TnMe4vGuI" "w2qvJ0ZH" "S3UBr7bOMmAc" "1PjnJa93FydG3"
## [9] "T7yOpz4nln" "L5K8tZVzdxN0ii"
To control the language-specific ordering, we can set the locale used for sorting with stri_sort(..., locale = ...).
stri_sort(c("hladny", "chladny"), locale="pl_PL")
## [1] "chladny" "hladny"
stri_sort(c("hladny", "chladny"), locale="sk_SK")
## [1] "hladny" "chladny"
In this part, the task requires us to write one (or more) functions that do something useful to pieces of the Gapminder or Singer data. I will begin with the linear regression function presented here[http://stat545.com/block012_function-regress-lifeexp-on-year.html], and generalize that to do quadratic regression (include a squared term).
j_dat
I extract the data for Zimbabwe from gapminder in order to get a reasonably small data frame. The new data frame is called j_dat. I will try to fit a quadratic regression to this country’s life expectancy over the years.
j_country <- "Zimbabwe" # an example
(j_dat <- gapminder %>%
filter(country == j_country))
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Zimbabwe Africa 1952 48.5 3080907 407.
## 2 Zimbabwe Africa 1957 50.5 3646340 519.
## 3 Zimbabwe Africa 1962 52.4 4277736 527.
## 4 Zimbabwe Africa 1967 54.0 4995432 570.
## 5 Zimbabwe Africa 1972 55.6 5861135 799.
## 6 Zimbabwe Africa 1977 57.7 6642107 686.
## 7 Zimbabwe Africa 1982 60.4 7636524 789.
## 8 Zimbabwe Africa 1987 62.4 9216418 706.
## 9 Zimbabwe Africa 1992 60.4 10704340 693.
## 10 Zimbabwe Africa 1997 46.8 11404948 792.
## 11 Zimbabwe Africa 2002 40.0 11926563 672.
## 12 Zimbabwe Africa 2007 43.5 12311143 470.
Now, let’s plot the data first.
j_dat %>%
ggplot(aes(x = year, y = lifeExp))+
geom_point() +
geom_smooth(method = "lm", se = FALSE)+
ggtitle("Linear regression of Zimbabwe's lifeExp over years")+
theme( plot.title = element_text(hjust = 0.5))
We can obtain detailed information on linear regression through the summary() command as below.
attach(j_dat) #attach the entire dataset so that we can refer to all variables directly by name
names(j_dat)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
j_fit <-lm(lifeExp ~ year)
summary(j_fit)
##
## Call:
## lm(formula = lifeExp ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.581 -4.870 -0.882 5.567 10.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.79819 238.55797 0.993 0.344
## year -0.09302 0.12051 -0.772 0.458
##
## Residual standard error: 7.205 on 10 degrees of freedom
## Multiple R-squared: 0.05623, Adjusted R-squared: -0.03814
## F-statistic: 0.5958 on 1 and 10 DF, p-value: 0.458
From the summary() result, we can obtain the coefficient estimates along with four goodness-of-fit measures for the regression analysis.
Std. Error: for the slope, the Residual Standard Error (see below) divided by the square root of the sum of squared deviations of that x variable from its mean.
t value: Estimate divided by Std. Error.
Pr(>|t|): the two-sided p-value of the t value, from a t distribution with the given degrees of freedom.
Residual Standard Error: essentially the standard deviation of the residuals (errors) of the regression model.
Multiple R-Squared: the proportion of the variance of Y that is explained by the model.
Adjusted R-Squared: same as Multiple R-Squared, but adjusted for the number of observations and variables used.
F-Statistic: global test to check whether the model has at least one significant variable. Takes into account the number of variables and observations used.
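These definitions can be checked by recomputing the quantities by hand. The sketch below is only an illustration: it uses R's built-in cars dataset (not the Gapminder data) so that it runs without extra packages, and all variable names are made up for this example.

```r
# Simple linear regression on the built-in `cars` data (illustration only)
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

n   <- nrow(cars)
p   <- length(coef(fit))        # number of estimated coefficients (2)
res <- residuals(fit)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(sum(res^2) / (n - p))
# Std. Error of the slope: RSE / sqrt(sum of squared deviations of x)
se_slope <- rse / sqrt(sum((cars$speed - mean(cars$speed))^2))
# t value and two-sided p-value
t_slope <- coef(fit)[["speed"]] / se_slope
p_slope <- 2 * pt(abs(t_slope), df = n - p, lower.tail = FALSE)
# Multiple and adjusted R-squared
r2     <- 1 - sum(res^2) / sum((cars$dist - mean(cars$dist))^2)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)

# Each comparison with the values reported by summary() should be TRUE
all.equal(rse, s$sigma)
all.equal(se_slope, s$coefficients["speed", "Std. Error"])
all.equal(t_slope, s$coefficients["speed", "t value"])
all.equal(p_slope, s$coefficients["speed", "Pr(>|t|)"])
all.equal(r2, s$r.squared)
all.equal(adj_r2, s$adj.r.squared)
```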
The estimated intercept here, about 236.8, is clearly unreasonable: it is the predicted life expectancy in year 0.
It makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset.
What am I doing here: lm(lifeExp ~ I(year - 1952))? I want the intercept to correspond to 1952 and an easy way to accomplish that is to create a new predictor on the fly: year minus 1952. The way I achieve that in the model formula, I(year - 1952), uses the I() function which “inhibits interpretation/conversion of objects”. By protecting the expression year - 1952, I ensure it is interpreted in the obvious arithmetical way.
j_fit <- lm(lifeExp ~ I(year - 1952), j_dat)
summary(j_fit)
##
## Call:
## lm(formula = lifeExp ~ I(year - 1952), data = j_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.581 -4.870 -0.882 5.567 10.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.22124 3.91270 14.113 6.27e-08 ***
## I(year - 1952) -0.09302 0.12051 -0.772 0.458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.205 on 10 degrees of freedom
## Multiple R-squared: 0.05623, Adjusted R-squared: -0.03814
## F-statistic: 0.5958 on 1 and 10 DF, p-value: 0.458
From the summary above, the intercept turns out to be 55.22, which seems more reasonable. Meanwhile, its standard error decreases from 238.6 to 3.9.
First, we have to create a new variable called year2, which is the square of the variable year.
year2 <- year^2
Now, let’s fit the quadratic regression to the dataframe.
j_fit <- lm(lifeExp ~ year + year2)
summary(j_fit)
##
## Call:
## lm(formula = lifeExp ~ year + year2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2522 -2.7543 -0.6202 2.7916 5.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.118e+04 1.803e+04 -4.502 0.00148 **
## year 8.217e+01 1.822e+01 4.510 0.00147 **
## year2 -2.078e-02 4.602e-03 -4.515 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.6467
## F-statistic: 11.07 on 2 and 9 DF, p-value: 0.003753
Unfortunately, the result looks terrible! The intercept implies a life expectancy of about minus 81,180 years in year 0, with a standard error of about 18,030, which is meaningless.
Let’s handle the offset in the way above.
j_fit <- lm(lifeExp ~ I(year-1952) + I(year2-1952^2))
summary(j_fit)
##
## Call:
## lm(formula = lifeExp ~ I(year - 1952) + I(year2 - 1952^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2522 -2.7543 -0.6202 2.7916 5.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.697407 3.107870 14.704 1.34e-07 ***
## I(year - 1952) 82.172151 18.220039 4.510 0.00147 **
## I(year2 - 1952^2) -0.020779 0.004602 -4.515 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.6467
## F-statistic: 11.07 on 2 and 9 DF, p-value: 0.003753
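Great! The result looks better now. The intercept turns out to be 45.70, the fitted life expectancy in 1952, which seems more reasonable. Meanwhile, its standard error decreases to 3.1. The other two coefficients are unchanged, because the offset only re-expresses the intercept.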
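A common alternative parametrization, shown here only as a sketch and not as the method used in this report, is to square the centred year, I((year - 1952)^2). It gives the same fitted curve and the same intercept, and the linear coefficient then becomes the slope in 1952 rather than in year 0. The data below are synthetic and made up for illustration (they are not the Gapminder values); the names toy, fit_shift and fit_centre are hypothetical.

```r
# Synthetic data for illustration only (not the Gapminder values)
set.seed(1)
year    <- seq(1952, 2007, by = 5)
lifeExp <- 50 + 0.8 * (year - 1952) - 0.015 * (year - 1952)^2 +
  rnorm(length(year))
toy <- data.frame(year = year, lifeExp = lifeExp)

# Parametrization used above: only the intercept is shifted to 1952
fit_shift <- lm(lifeExp ~ I(year - 1952) + I(year^2 - 1952^2), data = toy)

# Alternative: square the centred year
fit_centre <- lm(lifeExp ~ I(year - 1952) + I((year - 1952)^2), data = toy)

# Both describe the same curve, so the fitted values agree ...
all.equal(unname(fitted(fit_shift)), unname(fitted(fit_centre)),
          tolerance = 1e-6)
# ... and so does the intercept (the fitted value in 1952)
all.equal(unname(coef(fit_shift)[1]), unname(coef(fit_centre)[1]),
          tolerance = 1e-6)

# In the centred fit the linear coefficient is the slope in 1952
coef(fit_centre)
```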
Now, let’s plot the data.
j_dat %>%
ggplot(aes(x = year, y = lifeExp))+
geom_point() +
geom_smooth(method = "lm", formula = y ~ x + I(x^2))+
ggtitle("Quadratic regression of Zimbabwe's lifeExp over years")+
theme( plot.title = element_text(hjust = 0.5))
The plot shows that the quadratic fit follows the data more closely than the linear fit in this case; the adjusted R-squared rises from -0.04 to 0.65.
From the results above, we know that it is more reasonable to fit the quadratic regression with the year offset to 1952, so that the intercept corresponds to 1952. The new function we will create below is called quad_fit_1952.
quad_fit_1952 <- function(dat, offset = 1952){
  # use the columns of dat and the offset argument instead of attached globals
  j_fit <- lm(lifeExp ~ I(year - offset) + I(year^2 - offset^2), data = dat)
  summary(j_fit)
}
Now that the function has been created, let’s try it with the j_dat dataframe.
quad_fit_1952(j_dat)
##
## Call:
## lm(formula = lifeExp ~ I(year - offset) + I(year^2 - offset^2),
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2522 -2.7543 -0.6202 2.7916 5.9329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.697407 3.107870 14.704 1.34e-07 ***
## I(year - offset) 82.172151 18.220039 4.510 0.00147 **
## I(year^2 - offset^2) -0.020779 0.004602 -4.515 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.203 on 9 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.6467
## F-statistic: 11.07 on 2 and 9 DF, p-value: 0.003753
The estimated intercept looks good above. The function works well with an automatic offset.