R contains a set of functions in the base package that we can use to find pattern matches. Alternatively, the R package stringr also provides several functions for regex operations. This section covers the base R functions that provide pattern finding, pattern replacement, and string splitting capabilities.
There are five functions that provide pattern matching capabilities. I provide examples for the three most common ones. The other two, gregexpr() and regexec(), provide capabilities similar to regexpr() but return their output as a list.
To find a pattern in a character vector and return either the indices or the values of the matching elements, use grep():
# use the built-in data set `state.division`
head(as.character(state.division))
## [1] "East South Central" "Pacific" "Mountain"
## [4] "West South Central" "Pacific" "Mountain"
# find the elements which match the pattern
grep("North", state.division)
## [1] 13 14 15 16 22 23 25 27 34 35 41 49
# use 'value = TRUE' to show the element value
grep("North", state.division, value = TRUE)
## [1] "East North Central" "East North Central" "West North Central"
## [4] "West North Central" "East North Central" "West North Central"
## [7] "West North Central" "West North Central" "West North Central"
## [10] "East North Central" "West North Central" "East North Central"
# use the 'invert' argument to show the non-matching elements
# (the spaces around | are literal: this matches "North " or " South")
grep("North | South", state.division, invert = TRUE)
## [1] 2 3 5 6 7 8 9 10 11 12 19 20 21 26 28 29 30 31 32 33 37 38 39
## [24] 40 44 45 46 47 48 50
To find a pattern in a character vector and return logical (TRUE/FALSE) output, use grepl():
grepl("North | South", state.division)
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [23] TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [45] FALSE FALSE FALSE FALSE TRUE FALSE
# wrap in sum() to get the count of matches
sum(grepl("North | South", state.division))
## [1] 20
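One detail worth making explicit: whitespace inside a regular expression is literal, so the pattern "North | South" matches "North " (with a trailing space) or " South" (with a leading space). The following is a minimal sketch of how that differs from "North|South"; the vector `divisions` is a hypothetical example, not part of the data used above.

```r
# hypothetical vector for illustration
divisions <- c("East North Central", "South Atlantic",
               "West South Central", "Pacific")

# spaces are part of the pattern: "South Atlantic" has no leading space
spaced <- grepl("North | South", divisions)
spaced
## [1] TRUE FALSE TRUE FALSE

# without spaces, any element containing "North" or "South" matches
unspaced <- grepl("North|South", divisions)
unspaced
## [1] TRUE TRUE TRUE FALSE
```

Which form is appropriate depends on whether elements that begin with "South" should count as matches.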
To find exactly where the pattern exists in a string, use regexpr():
x <- c("v.111", "0v.11", "00v.1", "000v.", "00000")
regexpr("v.", x)
## [1] 1 2 3 4 -1
## attr(,"match.length")
## [1] 2 2 2 2 -1
## attr(,"useBytes")
## [1] TRUE
The output of regexpr() can be interpreted as follows. The printed vector gives the starting position of the match in each element; a value of -1 means there is no match. The "match.length" attribute gives the length of each match. The "useBytes" attribute has the value TRUE, meaning matching was done byte-by-byte rather than character-by-character.
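To pull out the matched text rather than its position, the result of regexpr() can be passed to base R's regmatches(). The sketch below also illustrates that "." is a regex metacharacter matching any single character, and uses gregexpr() to return every match as a list. The values `y` and "banana" are assumptions made purely for illustration.

```r
x <- c("v.111", "0v.11", "00v.1", "000v.", "00000")

# extract the matched substrings; non-matching elements are dropped
m <- regexpr("v.", x)
matched <- regmatches(x, m)
matched
## [1] "v." "v." "v." "v."

# "." matches any character, so "v." also matches "v1";
# fixed = TRUE treats the pattern as a literal string instead
y <- "v1"
as.integer(regexpr("v.", y))
## [1] 1
as.integer(regexpr("v.", y, fixed = TRUE))
## [1] -1

# gregexpr() finds all matches and returns a list, one entry per element
all_matches <- regmatches("banana", gregexpr("an", "banana"))
all_matches
## [[1]]
## [1] "an" "an"
```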
In addition to finding patterns in character vectors, it's also common to want to replace a pattern in a string with a new pattern. There are two options for this:
To replace the first matching occurrence of a pattern, use sub():
new <- c("New York", "new new York", "New New New York")
new
## [1] "New York" "new new York" "New New New York"
# Default is case sensitive
sub("New", replacement = "Old", new)
## [1] "Old York" "new new York" "Old New New York"
# use 'ignore.case = TRUE' to match regardless of case
sub("New", replacement = "Old", new, ignore.case = TRUE)
## [1] "Old York" "Old new York" "Old New New York"
To replace all matching occurrences of a pattern, use gsub():
# Default is case sensitive
gsub("New", replacement = "Old", new)
## [1] "Old York" "new new York" "Old Old Old York"
# use 'ignore.case = TRUE' to match regardless of case
gsub("New", replacement = "Old", new, ignore.case = TRUE)
## [1] "Old York" "Old Old York" "Old Old Old York"
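Because the first argument to sub() and gsub() is a regular expression, the same first-versus-all distinction applies to patterns that are more than a fixed word. A small sketch, assuming a hypothetical vector `messy` with irregular spacing:

```r
# hypothetical input with extra spaces
messy <- c("New   York", "  Los Angeles ")

# gsub() replaces every run of one or more spaces with a single space
collapsed <- gsub(" +", " ", messy)
collapsed
## [1] "New York" " Los Angeles "

# sub() with the anchor ^ removes only the leading run of spaces
trimmed <- sub("^ +", "", messy)
trimmed
## [1] "New   York" "Los Angeles "
```

Both functions return a modified copy; `messy` itself is unchanged unless the result is assigned back to it.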
To split the elements of a character vector, use strsplit():
x <- paste(state.name[1:10], collapse = " ")
# output will be a list
strsplit(x, " ")
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado" "Connecticut" "Delaware" "Florida" "Georgia"
# output as a vector rather than a list
unlist(strsplit(x, " "))
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado" "Connecticut" "Delaware" "Florida" "Georgia"
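The list output is useful when the input has more than one element, since each element gets its own vector of pieces. The sketch below assumes a hypothetical vector `dates`; it also shows that the split argument is treated as a regular expression unless fixed = TRUE is given.

```r
# hypothetical input: one string per date
dates <- c("2023-01-15", "2024-12-31")

# one list entry per input element
parts <- strsplit(dates, "-")
parts[[2]]
## [1] "2024" "12" "31"

# take the first piece of every element
years <- vapply(parts, `[`, character(1), 1)
years
## [1] "2023" "2024"

# "." is a regex metacharacter; fixed = TRUE splits on a literal dot
pieces <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
pieces
## [1] "a" "b" "c"
```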