15  Regular expressions

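The regular expression strings in this section are tricky to typeset, so errors are likely; you can read the English original directly: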
https://r4ds.hadley.nz/regexps.html

15.1 Introduction

In Chapter 14, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”1 or “regexp”.
Chapter 14 中,你学习了一系列处理字符串的有用函数。本章将重点介绍使用正则表达式的函数,这是一种用于描述字符串内模式的简洁而强大的语言。术语“regular expression”有点拗口,所以大多数人将其缩写为“regex”1 或“regexp”。

The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.
本章首先介绍正则表达式的基础知识以及用于数据分析的最有用的 stringr 函数。然后,我们将扩展你的模式知识,并涵盖七个重要的新主题(转义、锚定、字符类、简写类、量词、优先级和分组)。接下来,我们将讨论 stringr 函数可以处理的其他一些模式类型以及允许你调整正则表达式操作的各种“标志”。最后,我们将概述在 tidyverse 和基础 R 中可能使用正则表达式的其他地方。

15.1.1 Prerequisites

In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.
在本章中,我们将使用 tidyverse 的核心成员 stringr 和 tidyr 中的正则表达式函数,以及来自 babynames 包的数据。
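
A minimal setup sketch, assuming the tidyverse and babynames packages are already installed (the comments are ours):

```r
library(tidyverse)  # attaches stringr, tidyr, dplyr, ggplot2, and friends
library(babynames)  # provides the babynames data frame
```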

Throughout this chapter, we’ll use a mix of very simple inline examples (so you can get the basic idea), the baby names data, and three character vectors from stringr:
在本章中,我们将混合使用非常简单的内联示例,以便你了解基本概念、婴儿姓名数据以及来自 stringr 的三个字符向量:

  • fruit contains the names of 80 fruits.
    fruit 包含了 80 种水果的名称。

  • words contains 980 common English words.
    words 包含了 980 个常见的英文单词。

  • sentences contains 720 short sentences.
    sentences 包含了 720 个短句。

15.2 Pattern basics

We’ll use str_view() to learn how regex patterns work. We used str_view() in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, str_view() will show only the elements of the string vector that match, surrounding each match with <>, and, where possible, highlighting the match in blue.
我们将使用 str_view() 来学习正则表达式模式是如何工作的。在上一章中,我们使用 str_view() 来更好地理解字符串与其打印表示之间的区别,现在我们将其与第二个参数(一个正则表达式)一起使用。当提供此参数时,str_view() 将仅显示字符串向量中匹配的元素,用 <> 将每个匹配项括起来,并在可能的情况下用蓝色突出显示匹配项。

The simplest patterns consist of letters and numbers which match those characters exactly:
最简单的模式由字母和数字组成,它们精确匹配这些字符:

str_view(fruit, "berry")
#>  [6] │ bil<berry>
#>  [7] │ black<berry>
#> [10] │ blue<berry>
#> [11] │ boysen<berry>
#> [19] │ cloud<berry>
#> [21] │ cran<berry>
#> ... and 8 more

Letters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, [, ], and ?, have special meanings2 and are called metacharacters. For example, . will match any character3, so "a." will match any string that contains an “a” followed by another character:
字母和数字精确匹配,被称为字面字符 (literal characters)。大多数标点符号,如 .、+、*、[、] 和 ?,具有特殊含义2,被称为元字符 (metacharacters)。例如,. 将匹配任何字符3,所以 "a." 将匹配任何包含一个“a”后跟另一个字符的字符串:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
#> [2] │ <ab>
#> [3] │ <ae>
#> [6] │ e<ab>

Or we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:
或者我们可以找到所有包含一个“a”,后跟三个字母,再后跟一个“e”的水果:

str_view(fruit, "a...e")
#>  [1] │ <apple>
#>  [7] │ bl<ackbe>rry
#> [48] │ mand<arine>
#> [51] │ nect<arine>
#> [62] │ pine<apple>
#> [64] │ pomegr<anate>
#> ... and 2 more

Quantifiers control how many times a pattern can match:
量词 (Quantifiers) 控制一个模式可以匹配多少次:

  • ? makes a pattern optional (i.e. it matches 0 or 1 times)
    ? 使模式成为可选的(即它匹配 0 次或 1 次)

  • + lets a pattern repeat (i.e. it matches at least once).
    + 让模式重复(即它至少匹配一次)。

  • * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
    * 让模式成为可选的或重复的(即它匹配任意次数,包括 0 次)。

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
#> [1] │ <a>
#> [2] │ <ab>
#> [3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
#> [2] │ <ab>
#> [3] │ <abb>

# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
#> [1] │ <a>
#> [2] │ <ab>
#> [3] │ <abb>

Character classes are defined by [] and let you match a set of characters, e.g., [abcd] matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”. We can use this idea to find the words containing an “x” surrounded by vowels, or a “y” surrounded by consonants:
字符类 (Character classes) 由 [] 定义,让你匹配一组字符,例如 [abcd] 匹配 “a”、“b”、“c” 或 “d”。你也可以通过以 ^ 开头来反转匹配:[^abcd] 匹配除 “a”、“b”、“c” 或 “d” 之外的任何内容。我们可以用这个想法来查找包含被元音包围的 “x” 或被辅音包围的 “y” 的单词:

str_view(words, "[aeiou]x[aeiou]")
#> [284] │ <exa>ct
#> [285] │ <exa>mple
#> [288] │ <exe>rcise
#> [289] │ <exi>st
str_view(words, "[^aeiou]y[^aeiou]")
#> [836] │ <sys>tem
#> [901] │ <typ>e

You can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “melon”, or “nut”, or a repeated vowel.
你可以使用交替 (alternation),即 |,来在一个或多个备选模式之间进行选择。例如,以下模式查找包含“apple”、“melon”或“nut”的水果,或者包含重复元音的水果。

str_view(fruit, "apple|melon|nut")
#>  [1] │ <apple>
#> [13] │ canary <melon>
#> [20] │ coco<nut>
#> [52] │ <nut>
#> [62] │ pine<apple>
#> [72] │ rock <melon>
#> ... and 1 more
str_view(fruit, "aa|ee|ii|oo|uu")
#>  [9] │ bl<oo>d orange
#> [33] │ g<oo>seberry
#> [47] │ lych<ee>
#> [66] │ purple mangost<ee>n

Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.
正则表达式非常紧凑,使用了大量的标点符号,所以初看起来可能会让人觉得不知所措,难以阅读。别担心,通过练习你会越来越熟练,简单的模式很快就会成为你的第二天性。让我们通过练习一些有用的 stringr 函数来开始这个过程。

15.3 Key functions

Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.
既然你已经掌握了正则表达式的基础知识,让我们将它们与一些 stringr 和 tidyr 函数一起使用。在接下来的部分,你将学习如何检测匹配的存在与否,如何计算匹配的数量,如何用固定文本替换匹配,以及如何使用模式提取文本。

15.3.1 Detect matches

str_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise:
str_detect() 返回一个逻辑向量,如果模式匹配字符向量中的某个元素,则为 TRUE,否则为 FALSE

str_detect(c("a", "b", "c"), "[aeiou]")
#> [1]  TRUE FALSE FALSE

Since str_detect() returns a logical vector of the same length as the initial vector, it pairs well with filter(). For example, this code finds all the most popular names containing a lower-case “x”:
由于 str_detect() 返回一个与初始向量长度相同的逻辑向量,它与 filter() 配合得很好。例如,这段代码查找所有包含小写字母“x”的最受欢迎的名字:

babynames |> 
  filter(str_detect(name, "x")) |> 
  count(name, wt = n, sort = TRUE)
#> # A tibble: 974 × 2
#>   name           n
#>   <chr>      <int>
#> 1 Alexander 665492
#> 2 Alexis    399551
#> 3 Alex      278705
#> 4 Alexandra 232223
#> 5 Max       148787
#> 6 Alexa     123032
#> # ℹ 968 more rows

We can also use str_detect() with summarize() by pairing it with sum() or mean(): sum(str_detect(x, pattern)) tells you the number of observations that match and mean(str_detect(x, pattern)) tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names4 that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!
我们也可以将 str_detect() 与 summarize() 一起使用,方法是将其与 sum() 或 mean() 配对:sum(str_detect(x, pattern)) 告诉你匹配的观测数量,而 mean(str_detect(x, pattern)) 告诉你匹配的比例。例如,下面的代码片段计算并可视化了包含“x”的婴儿姓名4的比例,按年份细分。看起来它们最近的受欢迎程度急剧增加!

babynames |> 
  group_by(year) |> 
  summarize(prop_x = mean(str_detect(name, "x"))) |> 
  ggplot(aes(x = year, y = prop_x)) + 
  geom_line()

[Figure: A time series showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019.]

There are two functions that are closely related to str_detect(): str_subset() and str_which(). str_subset() returns a character vector containing only the strings that match. str_which() returns an integer vector giving the positions of the strings that match.
有两个与 str_detect() 密切相关的函数:str_subset() 和 str_which()。str_subset() 返回一个只包含匹配字符串的字符向量。str_which() 返回一个给出匹配字符串位置的整数向量。
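
As a quick sketch of how the three functions relate, the following uses a small made-up vector (the names `x` and `has_e` are ours, not part of stringr):

```r
library(stringr)

x <- c("apple", "banana", "pear", "kiwi")

# Logical vector: one TRUE/FALSE per element
has_e <- str_detect(x, "e")

# The matching strings themselves; equivalent to x[has_e]
str_subset(x, "e")
#> [1] "apple" "pear"

# The integer positions of the matches; equivalent to which(has_e)
str_which(x, "e")
#> [1] 1 3
```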

15.3.2 Count matches

The next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string.
比 str_detect() 更复杂一步的是 str_count():它不是返回真或假,而是告诉你每个字符串中有多少个匹配项。

x <- c("apple", "banana", "pear")
str_count(x, "p")
#> [1] 2 0 1

Note that each match starts at the end of the previous match, i.e. regex matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:
请注意,每个匹配都从前一个匹配的末尾开始,即正则表达式的匹配从不重叠。例如,在 "abababa" 中,模式 "aba" 会匹配多少次?正则表达式会说是两次,而不是三次:

str_count("abababa", "aba")
#> [1] 2
str_view("abababa", "aba")
#> [1] │ <aba>b<aba>

It’s natural to use str_count() with mutate(). The following example uses str_count() with character classes to count the number of vowels and consonants in each name.
很自然地,str_count() 可以与 mutate() 一起使用。下面的例子使用 str_count() 和字符类来计算每个名字中元音和辅音的数量。

babynames |> 
  count(name) |> 
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
#> # A tibble: 97,310 × 4
#>   name          n vowels consonants
#>   <chr>     <int>  <int>      <int>
#> 1 Aaban        10      2          3
#> 2 Aabha         5      2          3
#> 3 Aabid         2      2          3
#> 4 Aabir         1      2          3
#> 5 Aabriella     5      4          5
#> 6 Aada          1      2          2
#> # ℹ 97,304 more rows

If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:
如果你仔细看,你会发现我们的计算有些问题:“Aaban” 包含三个 “a”,但我们的摘要只报告了两个元音。这是因为正则表达式是区分大小写的。我们可以通过三种方式来修正这个问题:

  • Add the upper case vowels to the character class: str_count(name, "[aeiouAEIOU]").
    将大写元音添加到字符类中:str_count(name, "[aeiouAEIOU]")

  • Tell the regular expression to ignore case: str_count(name, regex("[aeiou]", ignore_case = TRUE)). We’ll talk about this more in Section 15.5.1.
    告诉正则表达式忽略大小写:str_count(name, regex("[aeiou]", ignore_case = TRUE))。我们将在 Section 15.5.1 中详细讨论。

  • Use str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), "[aeiou]").
    使用 str_to_lower() 将名称转换为小写:str_count(str_to_lower(name), "[aeiou]")。
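
All three fixes should give the same counts. Here is a small check on a hypothetical vector of names (`name`, `v1`, `v2`, and `v3` are illustrative, not taken from babynames):

```r
library(stringr)

name <- c("Aaban", "Olivia", "Eve")

# 1. Add the upper case vowels to the character class
v1 <- str_count(name, "[aeiouAEIOU]")

# 2. Ask the regex engine to ignore case
v2 <- str_count(name, regex("[aeiou]", ignore_case = TRUE))

# 3. Lower-case the input before matching
v3 <- str_count(str_to_lower(name), "[aeiou]")

v1
#> [1] 3 4 2
identical(v1, v2) && identical(v2, v3)
#> [1] TRUE
```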

This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
这种多样化的方法在处理字符串时非常典型——通常有多种方式可以达到你的目标,要么使你的模式更复杂,要么对你的字符串进行一些预处理。如果你在尝试一种方法时遇到困难,换个角度从不同的视角来解决问题通常会很有用。

In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:
在这种情况下,由于我们对名称应用了两个函数,我认为先转换它会更容易:

babynames |> 
  count(name) |> 
  mutate(
    name = str_to_lower(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
#> # A tibble: 97,310 × 4
#>   name          n vowels consonants
#>   <chr>     <int>  <int>      <int>
#> 1 aaban        10      3          2
#> 2 aabha         5      3          2
#> 3 aabid         2      3          2
#> 4 aabir         1      3          2
#> 5 aabriella     5      5          4
#> 6 aada          1      3          1
#> # ℹ 97,304 more rows

15.3.3 Replace values

As well as detecting and counting matches, we can also modify them with str_replace() and str_replace_all(). str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches.
除了检测和计数匹配项,我们还可以使用 str_replace() 和 str_replace_all() 来修改它们。str_replace() 替换第一个匹配项,顾名思义,str_replace_all() 替换所有匹配项。

x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

str_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""), respectively:
str_remove() 和 str_remove_all() 分别是 str_replace(x, pattern, "") 和 str_replace_all(x, pattern, "") 的便捷快捷方式:

x <- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]")
#> [1] "ppl" "pr"  "bnn"

These functions are naturally paired with mutate() when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.
在进行数据清洗时,这些函数很自然地与 mutate() 配对使用,你通常会重复应用它们来剥离不一致的格式层。
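
To illustrate the "peeling off layers" idea, here is a sketch on hypothetical data: a `price` column typed in inconsistent formats. The data and the cleaning steps are assumptions made for illustration, and real data will need its own rules.

```r
library(dplyr)
library(stringr)

# Hypothetical messy data: prices typed in inconsistent formats
raw <- tibble(price = c("$1,200", " 950 ", "USD 3,100.50"))

cleaned <- raw |>
  mutate(
    price = str_remove_all(price, "[$,]"),  # drop dollar signs and thousands separators
    price = str_remove(price, "^USD"),      # drop a leading currency code
    price = str_trim(price),                # drop surrounding whitespace
    price = as.numeric(price)               # finally, convert to a number
  )

cleaned$price
#> [1] 1200.0  950.0 3100.5
```

Each step removes one kind of inconsistency, which tends to be easier to debug than a single pattern that tries to handle everything at once.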

15.3.4 Extract variables

The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in Section 14.4.2. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.
我们要讨论的最后一个函数使用正则表达式将数据从一列提取到一个或多个新列中:separate_wider_regex()。它与你在 Section 14.4.2 中学到的 separate_wider_position() 和 separate_wider_delim() 函数是同类。这些函数位于 tidyr 包中,因为它们作用于数据框的(列),而不是单个向量。

Let’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format5:
让我们创建一个简单的数据集来展示它是如何工作的。这里我们有一些从 babynames 派生的数据,其中包含了一群人的姓名、性别和年龄,格式相当奇怪5

df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Brandon>-N_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)

To extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:
要使用 separate_wider_regex() 提取这些数据,我们只需要构建一系列匹配每个部分的正则表达式。如果我们希望该部分的内容出现在输出中,我们给它一个名字:

df |> 
  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  )
#> # A tibble: 7 × 3
#>   name    gender age  
#>   <chr>   <chr>  <chr>
#> 1 Sheryl  F      34   
#> 2 Kisha   F      45   
#> 3 Brandon N      33   
#> 4 Sharon  F      38   
#> 5 Penny   F      58   
#> 6 Justin  M      41   
#> # ℹ 1 more row

If the match fails, you can use too_few = "debug" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().
如果匹配失败,你可以使用 too_few = "debug" 来找出问题所在,就像 separate_wider_delim() 和 separate_wider_position() 一样。
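
A sketch of what that might look like, using a hypothetical `df_bad` in which the second row lacks the "_age" suffix. We do not reproduce the output here; tidyr's documentation describes the extra diagnostic columns that debug mode adds.

```r
library(tidyr)
library(tibble)

# Hypothetical data: the second row is missing the "_<age>" part
df_bad <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F",
  "<Justin>-M_41"
)

debugged <- df_bad |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    ),
    too_few = "debug"  # keep going and add diagnostic columns instead of erroring
  )

debugged
```

Rows that matched fully are parsed as usual, while the problem row gets NA for the pieces that could not be matched, which makes it easy to filter for the offending inputs.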

15.3.5 Exercises

  1. What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

  2. Replace all forward slashes in "a/b/c/d/e" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)

  3. Implement a simple version of str_to_lower() using str_replace_all().

  4. Create a regular expression that will match telephone numbers as commonly written in your country.

15.4 Pattern details

Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.
现在你已经了解了模式语言的基础知识以及如何将其与一些 stringr 和 tidyr 函数一起使用,是时候深入了解更多细节了。首先,我们将从转义 (escaping) 开始,它允许你匹配那些原本会被特殊处理的元字符。接下来,你将学习锚点 (anchors),它允许你匹配字符串的开头或结尾。然后,你将学习更多关于字符类 (character classes) 及其快捷方式的知识,这些快捷方式允许你匹配集合中的任何字符。接下来,你将学习量词 (quantifiers) 的最后细节,它控制模式可以匹配多少次。然后,我们必须涵盖操作符优先级 (operator precedence) 和括号这个重要(但复杂)的主题。最后,我们将以分组 (grouping) 模式组件的一些细节作为结尾。

The terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.
我们在这里使用的术语是每个组件的技术名称。它们并不总是最能说明其用途,但如果你以后想通过 Google 搜索更多细节,了解正确的术语会非常有帮助。

15.4.1 Escaping

In order to match a literal ., you need an escape which tells the regular expression to match metacharacters6 literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.", as the following example shows.
为了匹配字面意义上的 .,你需要一个转义 (escape),它告诉正则表达式按字面意义匹配元字符6。与字符串一样,正则表达式使用反斜杠进行转义。因此,要匹配 .,你需要正则表达式 \.。不幸的是,这会产生一个问题。我们使用字符串来表示正则表达式,而 \ 在字符串中也用作转义符号。所以要创建正则表达式 \.,我们需要字符串 "\\.",如下例所示。

# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)
#> [1] │ \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
#> [2] │ <a.c>

In this book, we’ll usually write regular expressions without quotes, like \.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like "\\.".
在本书中,我们通常会不带引号地编写正则表达式,例如 \.。如果我们需要强调你实际输入的内容,我们会用引号将其括起来并添加额外的转义,例如 "\\."

If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!
如果在正则表达式中 \ 被用作转义字符,那么如何匹配字面意义上的 \ 呢?嗯,你需要对它进行转义,从而创建正则表达式 \\。要创建该正则表达式,你需要使用一个字符串,而该字符串也需要对 \ 进行转义。这意味着要匹配一个字面意义上的 \,你需要写成 "\\\\" —— 你需要四个反斜杠来匹配一个!

x <- "a\\b"
str_view(x)
#> [1] │ a\b
str_view(x, "\\\\")
#> [1] │ a<\>b

Alternatively, you might find it easier to use the raw strings you learned about in Section 14.2.2. That lets you avoid one layer of escaping:
或者,你可能会发现使用在 Section 14.2.2 中学到的原始字符串 (raw strings) 更容易。这样可以避免一层转义:

str_view(x, r"{\\}")
#> [1] │ a<\>b

If you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], … all match the literal values.
如果你试图匹配字面上的 .、$、|、*、+、?、{、}、(、),除了使用反斜杠转义外,还有另一种选择:你可以使用字符类:[.]、[$]、[|] 等都匹配字面值。

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
#> [2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
#> [3] │ <a*c>

15.4.2 Anchors

By default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end:
默认情况下,正则表达式将匹配字符串的任何部分。如果你想在开头或结尾进行匹配,你需要锚定 (anchor) 正则表达式:使用 ^ 匹配开头,或使用 $ 匹配结尾:

str_view(fruit, "^a")
#> [1] │ <a>pple
#> [2] │ <a>pricot
#> [3] │ <a>vocado
str_view(fruit, "a$")
#>  [4] │ banan<a>
#> [15] │ cherimoy<a>
#> [30] │ feijo<a>
#> [36] │ guav<a>
#> [56] │ papay<a>
#> [74] │ satsum<a>

It’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but that’s not what regular expressions want.
我们很自然地会认为 $ 应该匹配字符串的开头,因为我们就是这样写美元金额的,但正则表达式并不是这样设计的。

To force a regular expression to match only the full string, anchor it with both ^ and $:
要强制一个正则表达式只匹配整个字符串,请同时使用 ^ 和 $ 将其锚定:

str_view(fruit, "apple")
#>  [1] │ <apple>
#> [62] │ pine<apple>
str_view(fruit, "^apple$")
#> [1] │ <apple>

You can also match the boundary between words (i.e. the start or end of a word) with \b. This can be particularly useful when using RStudio’s find and replace tool. For example, if you want to find all uses of sum(), you can search for \bsum\b to avoid matching summarize, summary, rowsum and so on:
你还可以使用 \b 来匹配单词之间的边界(即单词的开头或结尾)。这在使用 RStudio 的查找和替换工具时特别有用。例如,如果要查找 sum() 的所有用法,你可以搜索 \bsum\b 以避免匹配 summarize、summary、rowsum 等:

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
#> [1] │ <sum>mary(x)
#> [2] │ <sum>marize(df)
#> [3] │ row<sum>(x)
#> [4] │ <sum>(x)
str_view(x, "\\bsum\\b")
#> [4] │ <sum>(x)

When used alone, anchors will produce a zero-width match:
当单独使用时,锚点会产生一个零宽度匹配:

str_view("abc", c("$", "^", "\\b"))
#> [1] │ abc<>
#> [2] │ <>abc
#> [3] │ <>abc<>

This helps you understand what happens when you replace a standalone anchor:
这有助于你理解当你替换一个独立的锚点时会发生什么:

str_replace_all("abc", c("$", "^", "\\b"), "--")
#> [1] "abc--"   "--abc"   "--abc--"

15.4.3 Character classes

A character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches “a”, “b”, or “c” and [^abc] matches any character except “a”, “b”, or “c”. Apart from ^ there are two other characters that have special meaning inside of []:
字符类 (character class),或称字符集 (character set),允许你匹配一个集合中的任何字符。如上所述,你可以使用 [] 构建自己的集合,其中 [abc] 匹配 “a”、“b” 或 “c”,而 [^abc] 匹配除 “a”、“b” 或 “c” 之外的任何字符。除了 ^,在 [] 内部还有另外两个具有特殊含义的字符:

  • - defines a range, e.g., [a-z] matches any lower case letter and [0-9] matches any digit.
    - 定义一个范围,例如 [a-z] 匹配任何小写字母,[0-9] 匹配任何数字。

  • \ escapes special characters, so [\^\-\]] matches ^, -, or ].
    \ 转义特殊字符,所以 [\^\-\]] 匹配 ^、- 或 ]。

Here are a few examples:
这里有一些例子:

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
#> [1] │ <abc>d ABCD 12345 -!@#%.
str_view(x, "[a-z]+")
#> [1] │ <abcd> ABCD 12345 -!@#%.
str_view(x, "[^a-z0-9]+")
#> [1] │ abcd< ABCD >12345< -!@#%.>

# You need an escape to match characters that are otherwise
# special inside of []
str_view("a-b-c", "[a-c]")
#> [1] │ <a>-<b>-<c>
str_view("a-b-c", "[a\\-c]")
#> [1] │ <a><->b<-><c>

Some character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs7:
有些字符类非常常用,以至于它们有自己的快捷方式。你已经见过了 .,它匹配除换行符以外的任何字符。还有另外三对特别有用的快捷方式7

  • \d matches any digit;
    \D matches anything that isn’t a digit.
    \d 匹配任何数字;
    \D 匹配任何非数字的字符。

  • \s matches any whitespace (e.g., space, tab, newline);
    \S matches anything that isn’t whitespace.
    \s 匹配任何空白字符(例如,空格、制表符、换行符);
    \S 匹配任何非空白字符。

  • \w matches any “word” character, i.e. letters, digits, and the underscore;
    \W matches any “non-word” character.
    \w 匹配任何“单词”字符,即字母、数字和下划线;
    \W 匹配任何“非单词”字符。

The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.
下面的代码用一些字母、数字和标点符号演示了这六个快捷方式。

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
#> [1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
#> [1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>

15.4.4 Quantifiers

Quantifiers control how many times a pattern matches. In Section 15.2 you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \d+ will match one or more digits, and \s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {}:
量词 (Quantifiers) 控制一个模式匹配的次数。在 Section 15.2 中你学习了 ?(0 或 1 次匹配)、+(1 次或多次匹配)和 *(0 次或多次匹配)。例如,colou?r 将匹配美式或英式拼写,\d+ 将匹配一个或多个数字,\s? 将可选地匹配一个空白项。你还可以使用 {} 精确指定匹配次数:

  • {n} matches exactly n times.
    {n} 精确匹配 n 次。

  • {n,} matches at least n times.
    {n,} 至少匹配 n 次。

  • {n,m} matches between n and m times.
    {n,m} 匹配 n 到 m 次。
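
A short sketch of the three {} forms, using a made-up vector of digit strings. The patterns are anchored with ^ and $ so that the whole string has to satisfy the quantifier:

```r
library(stringr)

x <- c("1", "12", "123", "1234", "12345")

# {3}: exactly three digits
str_detect(x, "^\\d{3}$")
#> [1] FALSE FALSE  TRUE FALSE FALSE

# {3,}: three or more digits
str_detect(x, "^\\d{3,}$")
#> [1] FALSE FALSE  TRUE  TRUE  TRUE

# {2,4}: between two and four digits
str_detect(x, "^\\d{2,4}$")
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
```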

15.4.5 Operator precedence and parentheses

What does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?
ab+ 匹配什么?是匹配一个 “a” 后面跟着一个或多个 “b”,还是匹配 “ab” 重复任意次数?^a|b$ 匹配什么?是匹配完整的字符串 “a” 或完整的字符串 “b”,还是匹配以 “a” 开头的字符串或以 “b” 结尾的字符串?

The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that a + b * c is equivalent to a + (b * c) not (a + b) * c because * has higher precedence and + has lower precedence: you compute * before +.
这些问题的答案由运算符优先级决定,类似于你在学校可能学到的 PEMDAS 或 BEDMAS 规则。你知道 a + b * c 等同于 a + (b * c) 而不是 (a + b) * c,因为 * 的优先级高于 +:你先计算 * 再计算 +

Similarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
类似地,正则表达式也有自己的优先级规则:量词具有高优先级,而交替具有低优先级,这意味着 ab+ 等价于 a(b+),而 ^a|b$ 等价于 (^a)|(b$)。就像代数一样,你可以使用括号来覆盖通常的顺序。但与代数不同,你不太可能记住正则表达式的优先级规则,所以请随意大量使用括号。
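
The two precedence rules can be checked directly. In this sketch (the test strings are made up), each default pattern is compared with a parenthesized version that forces the other reading:

```r
library(stringr)

# Quantifiers bind tightly: ab+ is a(b+), not (ab)+
str_extract("ababab", "ab+")
#> [1] "ab"
str_extract("ababab", "(ab)+")
#> [1] "ababab"

# Alternation binds loosely: ^a|b$ is (^a)|(b$), not ^(a|b)$
x <- c("a", "b", "apple", "crab", "banana")
str_detect(x, "^a|b$")
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
str_detect(x, "^(a|b)$")
#> [1]  TRUE  TRUE FALSE FALSE FALSE
```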

15.4.6 Grouping and capturing

As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.
除了覆盖运算符优先级,括号还有另一个重要作用:它们创建了捕获组 (capturing groups),允许你使用匹配的子组件。

The first way to use a capturing group is to refer back to it within a match with a back reference: \1 refers to the match contained in the first parenthesis, \2 in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:
使用捕获组的第一种方法是在匹配中使用反向引用 (back reference) 来引用它:\1 引用第一个括号中包含的匹配项,\2 引用第二个括号中的匹配项,依此类推。例如,以下模式查找所有具有重复字母对的水果:

str_view(fruit, "(..)\\1")
#>  [4] │ b<anan>a
#> [20] │ <coco>nut
#> [22] │ <cucu>mber
#> [41] │ <juju>be
#> [56] │ <papa>ya
#> [73] │ s<alal> berry

And this one finds all words that start and end with the same pair of letters:
而这个则查找所有以相同字母对开头和结尾的单词:

str_view(words, "^(..).*\\1$")
#> [152] │ <church>
#> [217] │ <decide>
#> [617] │ <photograph>
#> [699] │ <require>
#> [739] │ <sense>

You can also use back references in str_replace(). For example, this code switches the order of the second and third words in sentences:
你也可以在 str_replace() 中使用反向引用。例如,这段代码交换了 sentences 中第二个和第三个单词的顺序:

sentences |> 
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |> 
  str_view()
#> [1] │ The canoe birch slid on the smooth planks.
#> [2] │ Glue sheet the to the dark blue background.
#> [3] │ It's to easy tell the depth of a well.
#> [4] │ These a days chicken leg is a rare dish.
#> [5] │ Rice often is served in round bowls.
#> [6] │ The of juice lemons makes fine punch.
#> ... and 714 more

If you want to extract the matches for each group you can use str_match(). But str_match() returns a matrix, so it’s not particularly easy to work with8:
如果你想提取每个组的匹配项,可以使用 str_match()。但是 str_match() 返回一个矩阵,所以处理起来不是特别容易8

sentences |> 
  str_match("the (\\w+) (\\w+)") |> 
  head()
#>      [,1]                [,2]     [,3]    
#> [1,] "the smooth planks" "smooth" "planks"
#> [2,] "the sheet to"      "sheet"  "to"    
#> [3,] "the depth of"      "depth"  "of"    
#> [4,] NA                  NA       NA      
#> [5,] NA                  NA       NA      
#> [6,] NA                  NA       NA

You could convert to a tibble and name the columns:
你可以将其转换为一个 tibble 并命名列:

sentences |> 
  str_match("the (\\w+) (\\w+)") |> 
  as_tibble(.name_repair = "minimal") |> 
  set_names("match", "word1", "word2")
#> # A tibble: 720 × 3
#>   match             word1  word2 
#>   <chr>             <chr>  <chr> 
#> 1 the smooth planks smooth planks
#> 2 the sheet to      sheet  to    
#> 3 the depth of      depth  of    
#> 4 <NA>              <NA>   <NA>  
#> 5 <NA>              <NA>   <NA>  
#> 6 <NA>              <NA>   <NA>  
#> # ℹ 714 more rows

But then you’ve basically recreated your own version of separate_wider_regex(). Indeed, behind the scenes, separate_wider_regex() converts your vector of patterns to a single regex that uses grouping to capture the named components.
但这样一来,你基本上就重新创建了自己版本的 separate_wider_regex()。实际上,在幕后,separate_wider_regex() 会将你的模式向量转换为一个单一的正则表达式,该表达式使用分组来捕获命名的组件。

Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with (?:).
有时,你会想使用括号而不创建匹配组。你可以使用 (?:) 创建一个非捕获组。

x <- c("a gray cat", "a grey dog")
str_match(x, "gr(e|a)y")
#>      [,1]   [,2]
#> [1,] "gray" "a" 
#> [2,] "grey" "e"
str_match(x, "gr(?:e|a)y")
#>      [,1]  
#> [1,] "gray"
#> [2,] "grey"

15.4.7 Exercises

  1. How would you match the literal string "'\? How about "$^$"?

  2. Explain why each of these patterns doesn’t match a \: "\", "\\", "\\\".

  3. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

    1. Start with “y”.
    2. Don’t start with “y”.
    3. End with “x”.
    4. Are exactly three letters long. (Don’t cheat by using str_length()!)
    5. Have seven letters or more.
    6. Contain a vowel-consonant pair.
    7. Contain at least two vowel-consonant pairs in a row.
    8. Only consist of repeated vowel-consonant pairs.

  4. Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!

  5. Switch the first and last letters in words. Which of those strings are still words?

  6. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)

    1. ^.*$
    2. "\\{.+\\}"
    3. \d{4}-\d{2}-\d{2}
    4. "\\\\{4}"
    5. \..\..\..
    6. (.)\1\1
    7. "(..)\\1"

  7. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

15.5 Pattern control

It’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so-called regex flags and match various types of fixed strings, as described below.
通过使用模式对象而不仅仅是字符串,可以对匹配的细节进行额外的控制。这允许你控制所谓的正则表达式标志,并匹配各种类型的固定字符串,如下所述。

15.5.1 Regex flags

There are a number of settings that can be used to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:
有许多设置可用于控制正则表达式的细节。这些设置在其他编程语言中通常被称为标志 (flags)。在 stringr 中,你可以通过将模式包装在对 regex() 的调用中使用它们。最有用的标志可能是 ignore_case = TRUE,因为它允许字符匹配其大写或小写形式:

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
#> [1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
#> [1] │ <banana>
#> [2] │ <Banana>
#> [3] │ <BANANA>

If you’re doing a lot of work with multiline strings (i.e. strings that contain \n), dotall and multiline may also be useful:
如果你正在处理大量多行字符串(即包含 \n 的字符串),dotall 和 multiline 也可能很有用:

  • dotall = TRUE lets . match everything, including \n:
    dotall = TRUE 让 . 匹配所有内容,包括 \n:

    x <- "Line 1\nLine 2\nLine 3"
    str_view(x, ".Line")
    str_view(x, regex(".Line", dotall = TRUE))
    #> [1] │ Line 1<
    #>     │ Line> 2<
    #>     │ Line> 3
  • multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:
    multiline = TRUE 使 ^ 和 $ 匹配每行的开头和结尾,而不是整个字符串的开头和结尾:

    x <- "Line 1\nLine 2\nLine 3"
    str_view(x, "^Line")
    #> [1] │ <Line> 1
    #>     │ Line 2
    #>     │ Line 3
    str_view(x, regex("^Line", multiline = TRUE))
    #> [1] │ <Line> 1
    #>     │ <Line> 2
    #>     │ <Line> 3
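
As a small sketch of how these two flags change results in practice, the following counts and extracts matches in the same three-line string. The variable names are our own for illustration; only str_count(), str_extract(), str_extract_all(), and regex() from stringr are used.

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# Without multiline, ^ only anchors to the start of the whole string
n_default <- str_count(x, "^Line")

# With multiline = TRUE, ^ anchors to the start of every line
n_multiline <- str_count(x, regex("^Line", multiline = TRUE))

# $ behaves the same way: grab the digit that ends each line
line_ends <- str_extract_all(x, regex("\\d$", multiline = TRUE))[[1]]

# By default . stops at \n, so this pattern cannot span lines
no_dotall <- str_extract(x, "1.*3")

# dotall = TRUE lets .* run across the line breaks
across_lines <- str_extract(x, regex("1.*3", dotall = TRUE))
```

Here n_default is 1 while n_multiline is 3, and no_dotall is NA because the match would have to cross a \n.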

Finally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try comments = TRUE. It tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable9, as in the following example:
最后,如果你正在编写一个复杂的正则表达式,并且担心将来可能无法理解它,你可以尝试使用 comments = TRUE。它会调整模式语言,使其忽略空格和换行符,以及 # 之后的所有内容。这允许你使用注释和空白来使复杂的正则表达式更易于理解9,如下例所示:

phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers
    [\ -]?  # optional space or dash
    (\d{4}) # four more numbers
  )", 
  comments = TRUE
)

str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
#> [1] "514-791-8141"   "(123) 456 7890" NA

If you’re using comments and want to match a space, newline, or #, you’ll need to escape it with \.
如果你正在使用注释并且想要匹配空格、换行符或 #,你需要使用 \ 对其进行转义。
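
To make the escaping rule concrete, here is a minimal sketch of what happens to spaces and # under comments = TRUE. The example strings and the issue pattern are made up for illustration.

```r
library(stringr)

# Unescaped whitespace is ignored, so this pattern is really "ab"
space_ignored <- str_detect("a b", regex("a b", comments = TRUE))

# Escaping the space with \ makes it literal again
space_escaped <- str_detect("a b", regex(r"(a\ b)", comments = TRUE))

# An escaped \# is a literal hash; an unescaped # starts a comment
issue <- regex(
  r"(
    \#     # a literal hash sign
    (\d+)  # the issue number
  )",
  comments = TRUE
)

# str_match() returns the full match and the capture group
m <- str_match("Fixed in #42 and #7", issue)
```

The first test is FALSE and the second is TRUE; m holds "#42" as the full match and "42" as the captured group.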

15.5.2 Fixed matches

You can opt-out of the regular expression rules by using fixed():
你可以通过使用 fixed() 来选择不使用正则表达式规则:

str_view(c("", "a", "."), fixed("."))
#> [3] │ <.>
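
The difference matters most when you replace or split on a character that is also a metacharacter. The following sketch, using made-up inputs, compares a plain pattern, fixed(), and manual escaping:

```r
library(stringr)

x <- "a.b.c"

# As a regex, . matches every character, so everything is replaced
as_regex <- str_replace_all(x, ".", "-")

# fixed() treats the pattern as a literal string
as_fixed <- str_replace_all(x, fixed("."), "-")

# Escaping the metacharacter gives the same result as fixed()
as_escaped <- str_replace_all(x, "\\.", "-")

# fixed() also works with other pattern-based functions such as str_split()
parts <- str_split("1+1=2", fixed("+"))[[1]]
```

Here as_regex is "-----", while as_fixed and as_escaped are both "a-b-c".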

fixed() also gives you the ability to ignore case:
fixed() 还让你能够忽略大小写:

str_view("x X", "X")
#> [1] │ x <X>
str_view("x X", fixed("X", ignore_case = TRUE))
#> [1] │ <x> <X>

If you’re working with non-English text, you will probably want coll() instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify. See Section 14.6 for more details on locales.
如果你正在处理非英语文本,你可能需要使用 coll() 而不是 fixed(),因为它实现了你指定的 locale 所使用的完整大写规则。有关区域设置的更多详细信息,请参阅 Section 14.6。

str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#> [1] │ i <İ> ı I
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
#> [1] │ <i> <İ> ı I

15.6 Practice

To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:
为了将这些想法付诸实践,我们接下来将解决一些半真实的问题。我们将讨论三种通用技术:

  1. checking your work by creating simple positive and negative controls
     通过创建简单的阳性和阴性对照来检查你的工作

  2. combining regular expressions with Boolean algebra
     将正则表达式与布尔代数相结合

  3. creating complex patterns using string manipulation
     使用字符串操作创建复杂模式

15.6.1 Check your work

First, let’s find all sentences that start with “The”. Using the ^ anchor alone is not enough:
首先,让我们找到所有以“The”开头的句子。仅使用 ^ 锚点是不够的:

str_view(sentences, "^The")
#>  [1] │ <The> birch canoe slid on the smooth planks.
#>  [4] │ <The>se days a chicken leg is a rare dish.
#>  [6] │ <The> juice of lemons makes fine punch.
#>  [7] │ <The> box was thrown beside the parked truck.
#>  [8] │ <The> hogs were fed chopped corn and garbage.
#> [11] │ <The> boy was there when the sun rose.
#> ... and 271 more

That’s because the pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:
这是因为该模式也匹配以 They 或 These 等词开头的句子。我们需要确保 “e” 是单词中的最后一个字母,我们可以通过添加一个词边界来实现这一点:

str_view(sentences, "^The\\b")
#>  [1] │ <The> birch canoe slid on the smooth planks.
#>  [6] │ <The> juice of lemons makes fine punch.
#>  [7] │ <The> box was thrown beside the parked truck.
#>  [8] │ <The> hogs were fed chopped corn and garbage.
#> [11] │ <The> boy was there when the sun rose.
#> [13] │ <The> source of the huge river is the clear spring.
#> ... and 250 more

What about finding all sentences that begin with a pronoun?
那么,如何找到所有以代词开头的句子呢?

str_view(sentences, "^She|He|It|They\\b")
#>  [3] │ <It>'s easy to tell the depth of a well.
#> [15] │ <He>lp the woman get back to her feet.
#> [27] │ <He>r purse was full of useless trash.
#> [29] │ <It> snowed, rained, and hailed the same morning.
#> [63] │ <He> ran half way to the hardware store.
#> [90] │ <He> lay prone and hardly moved a limb.
#> ... and 57 more

A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:
快速检查结果显示,我们得到了一些虚假的匹配。这是因为我们忘记了使用括号:

str_view(sentences, "^(She|He|It|They)\\b")
#>   [3] │ <It>'s easy to tell the depth of a well.
#>  [29] │ <It> snowed, rained, and hailed the same morning.
#>  [63] │ <He> ran half way to the hardware store.
#>  [90] │ <He> lay prone and hardly moved a limb.
#> [116] │ <He> ordered peach pie with ice cream.
#> [127] │ <It> caught its hind paw in a rusty trap.
#> ... and 51 more

You might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:
你可能会想,如果这种错误没有出现在前几个匹配项中,你该如何发现它。一个好的技巧是创建一些阳性和阴性匹配,并用它们来测试你的模式是否按预期工作:

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

pattern <- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
#> [1] TRUE TRUE
str_detect(neg, pattern)
#> [1] FALSE FALSE

It’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, they’re still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.
通常,想出好的正面例子比负面例子要容易得多,因为你需要一段时间才能熟练掌握正则表达式,从而预测你的弱点在哪里。尽管如此,它们仍然很有用:在解决问题的过程中,你可以慢慢积累你的错误集合,确保你不会犯同样的错误两次。
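
One way to keep such controls around is to wrap the check in a small function. The sketch below is only an illustration: check_controls() is a hypothetical helper of our own, not part of stringr, and it simply reports which controls a pattern gets wrong.

```r
library(stringr)

# Hypothetical helper (not part of stringr): lists the controls a pattern gets wrong
check_controls <- function(pattern, pos, neg) {
  list(
    missed = pos[!str_detect(pos, pattern)],   # positives that failed to match
    spurious = neg[str_detect(neg, pattern)]   # negatives that matched anyway
  )
}

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

# The pattern without parentheses matches both negative controls
buggy <- check_controls("^She|He|It|They\\b", pos, neg)

# The parenthesized pattern passes every control
corrected <- check_controls("^(She|He|It|They)\\b", pos, neg)
```

With these controls, buggy$spurious contains both negative examples, while both elements of corrected are empty.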

15.6.2 Boolean operations

Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):
想象一下,我们想找到只包含辅音的单词。一种方法是创建一个包含除元音外所有字母的字符类 ([^aeiou]),然后让它匹配任意数量的字母 ([^aeiou]+),最后通过锚定到开头和结尾 (^[^aeiou]+$) 来强制它匹配整个字符串:

str_view(words, "^[^aeiou]+$")
#> [123] │ <by>
#> [249] │ <dry>
#> [328] │ <fly>
#> [538] │ <mrs>
#> [895] │ <try>
#> [952] │ <why>

But you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:
但是你可以通过反向思考来让这个问题变得更简单。与其寻找只包含辅音的单词,我们可以寻找不包含任何元音的单词:

str_view(words[!str_detect(words, "[aeiou]")])
#> [1] │ by
#> [2] │ dry
#> [3] │ fly
#> [4] │ mrs
#> [5] │ try
#> [6] │ why

This is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine you want to find all words that contain “a” and “b”. There’s no “and” operator built into regular expressions, so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:
在处理逻辑组合时,这是一种有用的技术,特别是涉及“与”或“非”的组合。例如,假设你想查找所有同时包含“a”和“b”的单词。正则表达式中没有内置的“与”运算符,所以我们必须通过查找所有包含“a”后跟“b”的单词,或者“b”后跟“a”的单词来解决这个问题:

str_view(words, "a.*b|b.*a")
#>  [2] │ <ab>le
#>  [3] │ <ab>out
#>  [4] │ <ab>solute
#> [62] │ <availab>le
#> [66] │ <ba>by
#> [67] │ <ba>ck
#> ... and 24 more

It’s simpler to combine the results of two calls to str_detect():
将两次调用 str_detect() 的结果结合起来会更简单:

words[str_detect(words, "a") & str_detect(words, "b")]
#>  [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
#>  [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
#> [13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
#> [19] "black"     "board"     "boat"      "break"     "brilliant" "britain"  
#> [25] "debate"    "husband"   "labour"    "maybe"     "probable"  "table"
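
As a quick check that the two approaches agree, and to show how “or” and “not” fit the same mould, here is a short sketch. The variable names are ours; words is the stringr vector used throughout the chapter.

```r
library(stringr)

has_a <- str_detect(words, "a")
has_b <- str_detect(words, "b")

# "and": both logical vectors must be TRUE
both <- words[has_a & has_b]

# The single-regex version selects exactly the same words
same_as_regex <- identical(both, str_subset(words, "a.*b|b.*a"))

# "or": equivalent to the character class [ab]
either <- words[has_a | has_b]

# "not": words with an "a" but no "b", awkward to express as one regex
a_without_b <- words[has_a & !has_b]
```

Because each condition is an ordinary logical vector, you can combine as many of them as you like with &, |, and !.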

What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:
如果我们想看看是否有一个单词包含所有元音怎么办?如果用模式来做,我们需要生成 5! (120) 个不同的模式:

words[str_detect(words, "a.*e.*i.*o.*u")]
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]

It’s much simpler to combine five calls to str_detect():
结合五次调用 str_detect() 要简单得多:

words[
  str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") &
  str_detect(words, "u")
]
#> character(0)

In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
总的来说,如果你在尝试创建一个单一的正则表达式来解决问题时遇到困难,不妨退后一步,思考是否可以将问题分解成更小的部分,先解决每个挑战,然后再进入下一个。
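
If the list of conditions grows, you don’t have to type each str_detect() call by hand. The following sketch folds the calls together with base R’s Reduce(); contains_all() is a hypothetical helper of our own, and the three test words are made up.

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")

# Hypothetical helper: TRUE where a string contains every element of `needles`
contains_all <- function(x, needles) {
  # One logical vector per needle, combined pairwise with &
  hits <- lapply(needles, function(n) str_detect(x, fixed(n)))
  Reduce(`&`, hits)
}

demo <- contains_all(c("education", "sequoia", "banana"), vowels)

# Same question as above, asked of the stringr words vector
all_vowel_words <- words[contains_all(words, vowels)]
```

Here demo is TRUE TRUE FALSE, and all_vowel_words is empty, matching the result of the five explicit calls above.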

15.6.3 Creating a pattern with code

What if we wanted to find all sentences that mention a color? The basic idea is simple: we just combine alternation with word boundaries.
如果我们想找到所有提到颜色的 sentences 该怎么办?基本思想很简单:我们只需将交替与词边界结合起来。

str_view(sentences, "\\b(red|green|blue)\\b")
#>   [2] │ Glue the sheet to the dark <blue> background.
#>  [26] │ Two <blue> fish swam in the tank.
#>  [92] │ A wisp of cloud hung in the <blue> air.
#> [148] │ The spot on the blotter was made by <green> ink.
#> [160] │ The sofa cushion is <red> and of light weight.
#> [174] │ The sky that morning was clear and bright <blue>.
#> ... and 20 more

But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?
但是随着颜色数量的增加,手动构建这个模式很快就会变得乏味。如果能把颜色存储在一个向量里,岂不是很好?

rgb <- c("red", "green", "blue")

Well, we can! We’d just need to create the pattern from the vector using str_c() and str_flatten():
嗯,我们可以!我们只需要使用 str_c() 和 str_flatten() 从向量创建模式:

str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#> [1] "\\b(red|green|blue)\\b"

We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:
如果我们有一个好的颜色列表,我们可以使这个模式更全面。一个可以开始的地方是 R 中用于绘图的内置颜色列表:

str_view(colors())
#> [1] │ white
#> [2] │ aliceblue
#> [3] │ antiquewhite
#> [4] │ antiquewhite1
#> [5] │ antiquewhite2
#> [6] │ antiquewhite3
#> ... and 651 more

But let’s first eliminate the numbered variants:
但让我们首先消除带编号的变体:

cols <- colors()
cols <- cols[!str_detect(cols, "\\d")]
str_view(cols)
#> [1] │ white
#> [2] │ aliceblue
#> [3] │ antiquewhite
#> [4] │ aquamarine
#> [5] │ azure
#> [6] │ beige
#> ... and 137 more

Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:
然后我们可以把它变成一个巨大的模式。我们不会在这里显示这个模式,因为它太大了,但你可以看到它的工作效果:

pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
str_view(sentences, pattern)
#>   [2] │ Glue the sheet to the dark <blue> background.
#>  [12] │ A rod is used to catch <pink> <salmon>.
#>  [26] │ Two <blue> fish swam in the tank.
#>  [66] │ Cars and busses stalled in <snow> drifts.
#>  [92] │ A wisp of cloud hung in the <blue> air.
#> [112] │ Leaves turn <brown> and <yellow> in the fall.
#> ... and 57 more

In this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally.
在这个例子中,cols 只包含数字和字母,所以你不需要担心元字符。但总的来说,每当你从现有字符串创建模式时,最好将它们通过 str_escape() 处理,以确保它们是字面匹配。
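
To see why escaping matters, here is a minimal sketch with search terms that do contain metacharacters. The terms and the sentence are invented for illustration.

```r
library(stringr)

terms <- c("$5", "1+1", "a.b")
text <- "Pay $5 for 1+1, not axb"

# Without escaping, $, + and . keep their special meanings
unescaped <- str_c("(", str_flatten(terms, "|"), ")")
wrong <- str_extract_all(text, unescaped)[[1]]

# str_escape() backslash-escapes every metacharacter first
escaped <- str_c("(", str_flatten(str_escape(terms), "|"), ")")
right <- str_extract_all(text, escaped)[[1]]
```

The unescaped pattern only finds "axb", because a.b matches any character between “a” and “b”; the escaped pattern finds the intended "$5" and "1+1".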

15.6.4 Exercises

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

    1. Find all words that start or end with x.
    2. Find all words that start with a vowel and end with a consonant.
    3. Are there any words that contain at least one of each different vowel?
  2. Construct patterns to find evidence for and against the rule “i before e except after c”?

  3. colors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).

  4. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.

15.7 Regular expressions in other places

Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.
就像在 stringr 和 tidyr 函数中一样,R 中还有许多其他地方可以使用正则表达式。以下各节描述了更广泛的 tidyverse 和基础 R 中的一些其他有用函数。

15.7.1 tidyverse

There are three other particularly useful places where you might want to use regular expressions:
还有另外三个特别有用的地方,你可能想在其中使用正则表达式:

  • matches(pattern) will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use anywhere in any tidyverse function that selects variables (e.g., select(), rename_with() and across()).
    matches(pattern) 将选择所有名称与所提供模式匹配的变量。它是一个“tidyselect”函数,你可以在任何选择变量的 tidyverse 函数(例如 select()、rename_with() 和 across())中使用它。

  • pivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure.
    pivot_longer() 的 names_pattern 参数接受一个正则表达式向量,就像 separate_wider_regex() 一样。当从具有复杂结构的变量名中提取数据时,它很有用。

  • The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(", ?").
    separate_longer_delim() 和 separate_wider_delim() 中的 delim 参数通常匹配一个固定的字符串,但你可以使用 regex() 使其匹配一个模式。这很有用,例如,如果你想匹配一个逗号,后面可以跟一个空格,即 regex(", ?")。
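
As a rough sketch of two of these in action, the example below uses a small made-up tibble. The column names and values are assumptions for illustration only, and it needs dplyr and tidyr loaded alongside stringr.

```r
library(dplyr)
library(tidyr)
library(stringr)

# A made-up data frame for illustration
df <- tibble(
  id = 1:2,
  score_2023 = c(10, 20),
  score_2024 = c(11, 21),
  tags = c("a, b,c", "d")
)

# matches() selects the columns whose names match the regex
scores <- df |> select(matches("^score_\\d{4}$"))

# regex(", ?") splits on a comma that is optionally followed by a space
long <- df |>
  select(id, tags) |>
  separate_longer_delim(tags, delim = regex(", ?"))
```

Here scores keeps only the two score columns, and long has one row per tag ("a", "b", "c", "d"), despite the inconsistent spacing after the commas.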

15.7.2 Base R

apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:
apropos(pattern) 在全局环境中搜索所有与给定模式匹配的可用对象。如果你记不太清函数名,这个功能会很有用:

apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"      
#> [4] "setReplaceMethod" "str_replace"      "str_replace_all" 
#> [7] "str_replace_na"   "theme_replace"

list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:
list.files(path, pattern) 列出 path 中所有匹配正则表达式 pattern 的文件。例如,你可以用以下命令找到当前目录中所有的 R Markdown 文件:

head(list.files(pattern = "\\.Rmd$"))
#> character(0)
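
The result above is empty simply because there are no R Markdown files in that directory. To see the pattern at work, here is a self-contained sketch that creates a few empty files in a temporary directory; the file names are invented for illustration.

```r
# Create a scratch directory with three made-up files
dir <- file.path(tempdir(), "regex-demo")
dir.create(dir, showWarnings = FALSE)
file.create(file.path(dir, c("report.Rmd", "analysis.R", "Rmd-notes.txt")))

# Anchored and escaped: only names that end in ".Rmd"
rmd_files <- list.files(dir, pattern = "\\.Rmd$")

# Unanchored: any name containing "Rmd", which also picks up the notes file
loose <- list.files(dir, pattern = "Rmd")

# Clean up the scratch directory
unlink(dir, recursive = TRUE)
```

Only "report.Rmd" matches the anchored pattern, whereas the loose pattern returns two files. Note that pattern is a regular expression, not a shell glob, which is why the dot is escaped.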

It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the stringi package, which is in turn built on top of the ICU engine, whereas base R functions use either the TRE engine or the PCRE engine, depending on whether or not you’ve set perl = TRUE. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the (?…) syntax.
值得注意的是,基础 R 使用的模式语言与 stringr 使用的略有不同。这是因为 stringr 是建立在 stringi 包之上的,而 stringi 包又是建立在 ICU 引擎之上的,而基础 R 函数则使用 TRE 引擎或 PCRE 引擎,这取决于你是否设置了 perl = TRUE。幸运的是,正则表达式的基础知识已经非常成熟,因此在使用本书中学到的模式时,你几乎不会遇到什么变化。你只需要在开始依赖高级功能(如复杂的 Unicode 字符范围或使用 (?…) 语法的特殊功能)时意识到这种差异。
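
For the kinds of patterns covered in this chapter, the base R and stringr functions usually give the same answer. The following sketch compares a simple pattern in both, and then a lookahead, which is one of the (?…) features; the input strings are made up.

```r
library(stringr)

x <- c("apple pie", "banana split")

# A basic pattern with backreferences behaves the same in both.
# Note that base R takes the pattern first and the string last.
base_swap <- sub("(\\w+) (\\w+)", "\\2 \\1", x)
stringr_swap <- str_replace(x, "(\\w+) (\\w+)", "\\2 \\1")

# A lookahead uses the (?...) syntax: in base R we ask for the
# PCRE engine with perl = TRUE, while stringr (ICU) needs no flag
base_look <- grepl("a(?=p)", x, perl = TRUE)
stringr_look <- str_detect(x, "a(?=p)")
```

Both swaps give "pie apple" and "split banana", and both lookahead checks give TRUE FALSE.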

15.8 Summary

With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.
由于每个标点符号都可能被赋予多重含义,正则表达式是现存最紧凑的语言之一。它们起初确实令人困惑,但随着你训练眼睛去阅读它们、训练大脑去理解它们,你将解锁一项强大的技能,可以在 R 和许多其他地方使用。

In this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.
在本章中,你通过学习最实用的 stringr 函数和正则表达式语言最重要的组成部分,开启了成为正则表达式大师的旅程。并且有大量的资源可以让你学习更多。

A good place to start is vignette("regular-expressions", package = "stringr"): it documents the full set of syntax supported by stringr. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.
一个好的起点是 vignette("regular-expressions", package = "stringr"):它记录了 stringr 支持的完整语法集。另一个有用的参考是 https://www.regular-expressions.info/。它并非 R 专属,但你可以用它来学习正则表达式最高级的功能以及它们在底层是如何工作的。

It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.
了解 stringr 是由 Marek Gagolewski 在 stringi 包之上实现的也很有好处。如果你在 stringr 中找不到所需功能的函数,不要害怕去 stringi 中寻找。你会发现 stringi 非常容易上手,因为它遵循了许多与 stringr 相同的约定。

In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.
在下一章中,我们将讨论一种与字符串密切相关的数据结构:因子 (factors)。因子用于在 R 中表示分类数据,即具有一组固定的、已知的可能值的数据,这些值由一个字符串向量标识。


  1. You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).↩︎

  2. You’ll learn how to escape these special meanings in Section 15.4.1.↩︎

  3. Well, any character apart from \n.↩︎

  4. This gives us the proportion of names that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.↩︎

  5. We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!↩︎

  6. The complete set of metacharacters is .^$\|*+?{}[]()↩︎

  7. Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s".↩︎

  8. Mostly because we never discuss matrices in this book!↩︎

  9. comments = TRUE is particularly effective in combination with a raw string, as we use here.↩︎