library(nycflights13)
library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.4
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
3 Data transformation
3.1 Introduction
Visualization is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need to make the graph you want.
可视化是产生洞察力的重要工具,但你很少能以你需要的确切形式获得数据来制作你想要的图表。
Often you’ll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
通常,你需要创建一些新的变量或摘要来用你的数据回答你的问题,或者你可能只是想重命名变量或重新排序观测值,以使数据更容易处理。
You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed from New York City in 2013.
在本章中,你将学习如何做到所有这些 (以及更多!),本章将向你介绍如何使用 dplyr 包和一个关于 2013 年从纽约市起飞的航班的新数据集来进行数据转换。
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
本章的目标是让你对转换数据框 (data frame) 的所有关键工具有一个概览。
We’ll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs.
我们将从操作数据框的行和列的函数开始,然后回过头来更多地讨论管道 (pipe),这是一个用来组合动词 (verb) 的重要工具。
We will then introduce the ability to work with groups.
然后我们将介绍处理分组的能力。
We will end the chapter with a case study that showcases these functions in action.
我们将以一个案例研究来结束本章,展示这些函数的实际应用。
In later chapters, we’ll return to the functions in more detail as we start to dig into specific types of data (e.g., numbers, strings, dates).
在后面的章节中,我们将更详细地回顾这些函数,因为我们将开始深入研究特定类型的数据 (例如,数字、字符串、日期)。
3.1.1 Prerequisites
In this chapter, we’ll focus on the dplyr package, another core member of the tidyverse.
在本章中,我们将重点介绍 dplyr 包,它是 tidyverse 的另一个核心成员。
We’ll illustrate the key ideas using data from the nycflights13 package and use ggplot2 to help us understand the data.
我们将使用 nycflights13 包中的数据来说明关键思想,并使用 ggplot2 来帮助我们理解数据。
Take careful note of the conflicts message that’s printed when you load the tidyverse.
请仔细注意加载 tidyverse 时打印的冲突消息。
It tells you that dplyr overwrites some functions in base R.
它告诉你 dplyr 覆盖了基础 R 中的一些函数。
If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter()
and stats::lag()
.
如果在加载 dplyr 后想使用这些函数的基础版本,你需要使用它们的全名:stats::filter()
和 stats::lag()
。
So far, we’ve mostly ignored which package a function comes from because it doesn’t usually matter.
到目前为止,我们大多忽略了函数来自哪个包,因为它通常不重要。
However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname()
.
然而,知道包可以帮助你找到帮助和相关的函数,所以当我们需要精确地指出函数来自哪个包时,我们将使用与 R 相同的语法:packagename::functionname()
。
3.1.2 nycflights13
To explore the basic dplyr verbs, we will use nycflights13::flights
.
为了探索基本的 dplyr 动词,我们将使用 nycflights13::flights
。
This dataset contains all 336,776 flights that departed from New York City in 2013.
这个数据集包含了 2013 年从纽约市出发的所有 336,776 个航班。
The data comes from the US Bureau of Transportation Statistics and is documented in ?flights
.
数据来自美国交通统计局,并在 ?flights
中有文档记录。
flights
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
flights
is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas.flights
是一个 tibble,这是 tidyverse 使用的一种特殊类型的数据框,以避免一些常见的陷阱。
The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
tibble 和数据框之间最重要的区别在于 tibble 的打印方式;它们专为大型数据集设计,因此只显示前几行和能在一屏内显示的列。
There are a few options to see everything.
有几个选项可以查看所有内容。
If you’re using RStudio, the most convenient is probably View(flights)
, which opens an interactive, scrollable, and filterable view.
如果你正在使用 RStudio,最方便的可能是 View(flights)
,它会打开一个交互式的、可滚动的和可筛选的视图。
Otherwise you can use print(flights, width = Inf)
to show all columns, or use glimpse()
:
否则,你可以使用 print(flights, width = Inf)
来显示所有列,或者使用 glimpse()
:
glimpse(flights)
#> Rows: 336,776
#> Columns: 19
#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
#> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
#> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
#> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
#> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…
In both views, the variable names are followed by abbreviations that tell you the type of each variable: <int>
is short for integer, <dbl>
is short for double (aka real numbers), <chr>
for character (aka strings), and <dttm>
for date-time.
在这两种视图中,变量名后面都跟着缩写,告诉你每个变量的类型:<int>
是整数 (integer) 的缩写,<dbl>
是双精度浮点数 (double) (也称为实数) 的缩写,<chr>
是字符 (character) (也称为字符串) 的缩写,而 <dttm>
是日期时间 (date-time) 的缩写。
These are important because the operations you can perform on a column depend heavily on its “type.”
这些都很重要,因为你可以对一列执行的操作在很大程度上取决于它的“类型”。
3.1.3 dplyr basics
You’re about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges.
你即将学习主要的 dplyr 动词 (函数),它们将使你能够解决绝大多数的数据操作挑战。
But before we discuss their individual differences, it’s worth stating what they have in common:
但在我们讨论它们的个体差异之前,有必要说明它们的共同点:
The first argument is always a data frame.
第一个参数总是一个数据框。The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
后续的参数通常使用变量名 (不带引号) 来描述要操作的列。The output is always a new data frame.
输出总是一个新的数据框。
Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |>
.
因为每个动词都能很好地完成一件事,所以解决复杂问题通常需要组合多个动词,我们将使用管道 |>
来实现这一点。
We’ll discuss the pipe more in Section 3.4, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y)
is equivalent to f(x, y)
, and x |> f(y) |> g(z)
is equivalent to g(f(x, y), z)
.
我们将在 Section 3.4 中更详细地讨论管道,但简而言之,管道将其左边的东西传递给其右边的函数,因此 x |> f(y)
等同于 f(x, y)
,而 x |> f(y) |> g(z)
等同于 g(f(x, y), z)
。
The easiest way to pronounce the pipe is “then”.
管道最简单的发音是“然后” (then)。
That makes it possible to get a sense of the following code even though you haven’t yet learned the details:
这使得即使你还没有学习细节,也能理解以下代码的含义:
dplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables.
dplyr 的动词根据它们操作的对象分为四组:行、列、组 或 表。
In the following sections, you’ll learn the most important verbs for rows, columns, and groups.
在接下来的部分中,你将学习针对行、列和组的最重要的动词。
Then, we’ll return to the join verbs that work on tables in Chapter 19.
然后,我们将在 Chapter 19 中回到处理表的连接动词。
Let’s dive in!
让我们开始吧!
3.2 Rows
The most important verbs that operate on rows of a dataset are filter()
, which changes which rows are present without changing their order, and arrange()
, which changes the order of the rows without changing which are present.
操作数据集行的最重要的动词是 filter()
,它在不改变行顺序的情况下改变存在的行;以及 arrange()
,它在不改变存在的行的情况下改变行的顺序。
Both functions only affect the rows, and the columns are left unchanged.
这两个函数都只影响行,列保持不变。
We’ll also discuss distinct()
which finds rows with unique values.
我们还将讨论 distinct()
,它能找到具有唯一值的行。
Unlike arrange()
and filter()
it can also optionally modify the columns.
与 arrange()
和 filter()
不同,它还可以选择性地修改列。
3.2.1 filter()
filter()
allows you to keep rows based on the values of the columns1.filter()
允许你根据列的值保留行1。
The first argument is the data frame.
第一个参数是数据框。
The second and subsequent arguments are the conditions that must be true to keep the row.
第二个及后续参数是必须为真才能保留该行的条件。
For example, we could find all flights that departed more than 120 minutes (two hours) late:
例如,我们可以找到所有晚点超过 120 分钟 (两小时) 的航班:
flights |>
filter(dep_delay > 120)
#> # A tibble: 9,723 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 848 1835 853 1001 1950
#> 2 2013 1 1 957 733 144 1056 853
#> 3 2013 1 1 1114 900 134 1447 1222
#> 4 2013 1 1 1540 1338 122 2020 1825
#> 5 2013 1 1 1815 1325 290 2120 1542
#> 6 2013 1 1 1842 1422 260 1958 1535
#> # ℹ 9,717 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
As well as >
(greater than), you can use >=
(greater than or equal to), <
(less than), <=
(less than or equal to), ==
(equal to), and !=
(not equal to).
除了 >
(大于),你还可以使用 >=
(大于或等于)、<
(小于)、<=
(小于或等于)、==
(等于) 和 !=
(不等于)。
You can also combine conditions with &
or ,
to indicate “and” (check for both conditions) or with |
to indicate “or” (check for either condition):
你还可以使用 &
或 ,
来组合条件,表示“与” (检查两个条件),或使用 |
表示“或” (检查任一条件):
# Flights that departed on January 1
flights |>
filter(month == 1 & day == 1)
#> # A tibble: 842 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 836 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
# Flights that departed in January or February
flights |>
filter(month == 1 | month == 2)
#> # A tibble: 51,955 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 51,949 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
There’s a useful shortcut when you’re combining |
and ==
: %in%
.
当你组合 |
和 ==
时,有一个有用的快捷方式:%in%
。
It keeps rows where the variable equals one of the values on the right:
它会保留变量等于右侧值之一的行:
# A shorter way to select flights that departed in January or February
flights |>
filter(month %in% c(1, 2))
#> # A tibble: 51,955 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 51,949 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
We’ll come back to these comparisons and logical operators in more detail in Chapter 12.
我们将在 Chapter 12 中更详细地回顾这些比较和逻辑运算符。
When you run filter()
dplyr executes the filtering operation, creating a new data frame, and then prints it.
当你运行 filter()
时,dplyr 会执行筛选操作,创建一个新的数据框,然后打印它。
It doesn’t modify the existing flights
dataset because dplyr functions never modify their inputs.
它不会修改现有的 flights
数据集,因为 dplyr 函数从不修改它们的输入。
To save the result, you need to use the assignment operator, <-
:
要保存结果,你需要使用赋值运算符 <-
:
jan1 <- flights |>
filter(month == 1 & day == 1)
3.2.2 Common mistakes
When you’re starting out with R, the easiest mistake to make is to use =
instead of ==
when testing for equality.
当你刚开始使用 R 时,最容易犯的错误是在测试相等性时使用 =
而不是 ==
。
filter()
will let you know when this happens:filter()
会在这种情况下通知你:
flights |>
filter(month = 1)
#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `month == 1`?
Another mistakes is you write “or” statements like you would in English:
另一个错误是你像用英语一样写“或”语句:
flights |>
filter(month == 1 | 2)
This “works”, in the sense that it doesn’t throw an error, but it doesn’t do what you want because |
first checks the condition month == 1
and then checks the condition 2
, which is not a sensible condition to check.
这“行得通”,因为它不会抛出错误,但它没有做你想要的,因为 |
首先检查条件 month == 1
,然后检查条件 2
,这不是一个合理的检查条件。
We’ll learn more about what’s happening here and why in Section 12.3.2.
我们将在 Section 12.3.2 中更多地了解这里发生了什么以及为什么会这样。
3.2.3 arrange()
arrange()
changes the order of the rows based on the value of the columns.arrange()
根据列的值改变行的顺序。
It takes a data frame and a set of column names (or more complicated expressions) to order by.
它接受一个数据框和一组列名 (或更复杂的表达式) 作为排序依据。
If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns.
如果你提供多个列名,每个额外的列将用于打破前面列值中的平局。
For example, the following code sorts by the departure time, which is spread over four columns.
例如,以下代码按出发时间排序,该时间分布在四列中。
We get the earliest years first, then within a year, the earliest months, etc.
我们首先得到最早的年份,然后在一年内,得到最早的月份,依此类推。
flights |>
arrange(year, month, day, dep_time)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
You can use desc()
on a column inside of arrange()
to re-order the data frame based on that column in descending (big-to-small) order.
你可以在 arrange()
内部对列使用 desc()
,以按该列的降序 (从大到小) 重新排序数据框。
For example, this code orders flights from most to least delayed:
例如,此代码将航班从最晚点到最少晚点排序:
flights |>
arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 9 641 900 1301 1242 1530
#> 2 2013 6 15 1432 1935 1137 1607 2120
#> 3 2013 1 10 1121 1635 1126 1239 1810
#> 4 2013 9 20 1139 1845 1014 1457 2210
#> 5 2013 7 22 845 1600 1005 1044 1815
#> 6 2013 4 10 1100 1900 960 1342 2211
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
Note that the number of rows has not changed – we’re only arranging the data, we’re not filtering it.
请注意,行数没有改变——我们只是在排列数据,而不是筛选数据。
3.2.4 distinct()
distinct()
finds all the unique rows in a dataset, so technically, it primarily operates on the rows.distinct()
在数据集中查找所有唯一的行,因此从技术上讲,它主要作用于行。
Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:
然而,大多数时候,你会想要一些变量的独特组合,所以你也可以选择性地提供列名:
# Remove duplicate rows, if any
flights |>
distinct()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
# Find all unique origin and destination pairs
flights |>
distinct(origin, dest)
#> # A tibble: 224 × 2
#> origin dest
#> <chr> <chr>
#> 1 EWR IAH
#> 2 LGA IAH
#> 3 JFK MIA
#> 4 JFK BQN
#> 5 LGA ATL
#> 6 EWR ORD
#> # ℹ 218 more rows
Alternatively, if you want to keep the other columns when filtering for unique rows, you can use the .keep_all = TRUE
option.
或者,如果你想在筛选唯一行时保留其他列,可以使用 .keep_all = TRUE
选项。
flights |>
distinct(origin, dest, .keep_all = TRUE)
#> # A tibble: 224 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 218 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
It’s not a coincidence that all of these distinct flights are on January 1: distinct()
will find the first occurrence of a unique row in the dataset and discard the rest.
所有这些不同的航班都在 1 月 1 日并非巧合:distinct()
会找到数据集中唯一行的第一次出现,并丢弃其余的。
If you want to find the number of occurrences instead, you’re better off swapping distinct()
for count()
.
如果你想查找出现次数,最好将 distinct()
换成 count()
。
With the sort = TRUE
argument, you can arrange them in descending order of the number of occurrences.
使用 sort = TRUE
参数,你可以按出现次数的降序排列它们。
You’ll learn more about count in Section 13.3.
你将在 Section 13.3 中学到更多关于 count()
的知识。
flights |>
count(origin, dest, sort = TRUE)
#> # A tibble: 224 × 3
#> origin dest n
#> <chr> <chr> <int>
#> 1 JFK LAX 11262
#> 2 LGA ATL 10263
#> 3 LGA ORD 8857
#> 4 JFK SFO 8204
#> 5 LGA CLT 6168
#> 6 EWR ORD 6100
#> # ℹ 218 more rows
3.2.5 Exercises
-
In a single pipeline for each condition, find all flights that meet the condition:
- Had an arrival delay of two or more hours
- Flew to Houston (
IAH
orHOU
) - Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
Sort
flights
to find the flights with the longest departure delays. Find the flights that left earliest in the morning.Sort
flights
to find the fastest flights. (Hint: Try including a math calculation inside of your function.)Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
Does it matter what order you used
filter()
andarrange()
if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
3.3 Columns
There are four important verbs that affect the columns without changing the rows: mutate()
creates new columns that are derived from the existing columns, select()
changes which columns are present, rename()
changes the names of the columns, and relocate()
changes the positions of the columns.
有四个重要的动词会影响列而不改变行:mutate()
从现有列派生出新列,select()
改变存在的列,rename()
改变列的名称,relocate()
改变列的位置。
3.3.1 mutate()
The job of mutate()
is to add new columns that are calculated from the existing columns.mutate()
的工作是添加从现有列计算出的新列。
In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables.
在转换章节中,你将学习大量可用于操作不同类型变量的函数。
For now, we’ll stick with basic algebra, which allows us to compute the gain
, how much time a delayed flight made up in the air, and the speed
in miles per hour:
现在,我们将坚持使用基本代数,它允许我们计算 gain
(延误的航班在空中弥补了多少时间) 和 speed
(以英里/小时为单位):
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
By default, mutate()
adds new columns on the right-hand side of your dataset, which makes it difficult to see what’s happening here.
默认情况下,mutate()
会在数据集的右侧添加新列,这使得很难看清这里发生了什么。
We can use the .before
argument to instead add the variables to the left-hand side2:
我们可以使用 .before
参数来将变量添加到左侧2:
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.before = 1
)
#> # A tibble: 336,776 × 21
#> gain speed year month day dep_time sched_dep_time dep_delay arr_time
#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 -9 370. 2013 1 1 517 515 2 830
#> 2 -16 374. 2013 1 1 533 529 4 850
#> 3 -31 408. 2013 1 1 542 540 2 923
#> 4 17 517. 2013 1 1 544 545 -1 1004
#> 5 19 394. 2013 1 1 554 600 -6 812
#> 6 -16 288. 2013 1 1 554 558 -4 740
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …
The .
indicates that .before
is an argument to the function, not the name of a third new variable we are creating..
表示 .before
是函数的一个参数,而不是我们正在创建的第三个新变量的名称。
You can also use .after
to add after a variable, and in both .before
and .after
you can use the variable name instead of a position.
你也可以使用 .after
在变量之后添加,在 .before
和 .after
中,你都可以使用变量名而不是位置。
For example, we could add the new variables after day
:
例如,我们可以在 day
之后添加新变量:
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.after = day
)
Alternatively, you can control which variables are kept with the .keep
argument.
或者,你可以使用 .keep
参数来控制保留哪些变量。
A particularly useful argument is "used"
which specifies that we only keep the columns that were involved or created in the mutate()
step.
一个特别有用的参数是 "used"
,它指定我们只保留在 mutate()
步骤中涉及或创建的列。
For example, the following output will contain only the variables dep_delay
, arr_delay
, air_time
, gain
, hours
, and gain_per_hour
.
例如,以下输出将只包含 dep_delay
、arr_delay
、air_time
、gain
、hours
和 gain_per_hour
变量。
flights |>
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = "used"
)
Note that since we haven’t assigned the result of the above computation back to flights
, the new variables gain
, hours
, and gain_per_hour
will only be printed but will not be stored in a data frame.
请注意,由于我们没有将上述计算的结果分配回 flights
,新变量 gain
、hours
和 gain_per_hour
将只被打印而不会存储在数据框中。
And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to flights
, overwriting the original data frame with many more variables, or to a new object.
如果我们希望它们在数据框中可供将来使用,我们应该仔细考虑是否要将结果分配回 flights
(用更多的变量覆盖原始数据框),还是分配给一个新对象。
Often, the right answer is a new object that is named informatively to indicate its contents, e.g., delay_gain
, but you might also have good reasons for overwriting flights
.
通常,正确的答案是一个新对象,其名称具有信息性,以指示其内容,例如 delay_gain
,但你也可能有充分的理由覆盖 flights
。
3.3.2 select()
It’s not uncommon to get datasets with hundreds or even thousands of variables.
获得包含数百甚至数千个变量的数据集并不少见。
In this situation, the first challenge is often just focusing on the variables you’re interested in.
在这种情况下,第一个挑战通常只是关注你感兴趣的变量。
select()
allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:select()
允许你使用基于变量名称的操作快速放大到一个有用的子集:
-
Select columns by name:
按名称选择列:{r} #| results: false flights |> select(year, month, day)
-
Select all columns between year and day (inclusive):
选择从 year 到 day (含) 的所有列:{r} #| results: false flights |> select(year:day)
-
Select all columns except those from year to day (inclusive):
选择除了从 year 到 day (含) 之外的所有列:{r} #| results: false flights |> select(!year:day)
Historically this operation was done with
-
instead of!
, so you’re likely to see that in the wild.
历史上,这个操作是使用-
而不是!
完成的,所以你很可能在野外看到它。These two operators serve the same purpose but with subtle differences in behavior.
这两个运算符的目的相同,但在行为上有细微的差别。We recommend using
!
because it reads as “not” and combines well with&
and|
.
我们建议使用!
,因为它读作“非” (not),并且能很好地与&
和|
结合。 -
Select all columns that are characters:
选择所有字符类型的列:{r} #| results: false flights |> select(where(is.character))
There are a number of helper functions you can use within select()
:
在 select()
中你可以使用许多辅助函数:
starts_with("abc")
: matches names that begin with “abc”.
匹配以 “abc” 开头的名称。ends_with("xyz")
: matches names that end with “xyz”.
匹配以 “xyz” 结尾的名称。contains("ijk")
: matches names that contain “ijk”.
匹配包含 “ijk” 的名称。num_range("x", 1:3)
: matchesx1
,x2
andx3
.
匹配x1
,x2
和x3
。
See ?select
for more details.
查看 ?select
获取更多细节。
Once you know regular expressions (the topic of Chapter 15) you’ll also be able to use matches()
to select variables that match a pattern.
一旦你了解了正则表达式 (regular expressions) (Chapter 15 的主题),你还可以使用 matches()
来选择匹配模式的变量。
You can rename variables as you select()
them by using =
.
你可以在 select()
时使用 =
来重命名变量。
The new name appears on the left-hand side of the =
, and the old variable appears on the right-hand side:
新名称出现在 =
的左侧,旧变量出现在右侧:
flights |>
select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#> tail_num
#> <chr>
#> 1 N14228
#> 2 N24211
#> 3 N619AA
#> 4 N804JB
#> 5 N668DN
#> 6 N39463
#> # ℹ 336,770 more rows
3.3.3 rename()
If you want to keep all the existing variables and just want to rename a few, you can use rename()
instead of select()
:
如果你想保留所有现有变量,而只想重命名少数几个,你可以使用 rename()
代替 select()
:
flights |>
rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names()
which provides some useful automated cleaning.
如果你有一堆命名不一致的列,并且手动修复它们会很痛苦,可以看看 janitor::clean_names()
,它提供了一些有用的自动清理功能。
3.3.4 relocate()
Use relocate()
to move variables around.
使用 relocate()
来移动变量。
You might want to collect related variables together or move important variables to the front.
你可能想将相关的变量收集在一起,或者将重要的变量移到前面。
By default relocate()
moves variables to the front:
默认情况下,relocate()
将变量移动到最前面:
flights |>
relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#> time_hour air_time year month day dep_time sched_dep_time
#> <dttm> <dbl> <int> <int> <int> <int> <int>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …
You can also specify where to put them using the .before
and .after
arguments, just like in mutate()
:
你也可以像在 mutate()
中一样,使用 .before
和 .after
参数来指定将它们放在哪里:
flights |>
relocate(year:dep_time, .after = time_hour)
flights |>
relocate(starts_with("arr"), .before = dep_time)
3.3.5 Exercises
Compare
dep_time
,sched_dep_time
, anddep_delay
. How would you expect those three numbers to be related?Brainstorm as many ways as possible to select
dep_time
,dep_delay
,arr_time
, andarr_delay
fromflights
.What happens if you specify the name of the same variable multiple times in a
select()
call?-
What does the
any_of()
function do? Why might it be helpful in conjunction with this vector?variables <- c("year", "month", "day", "dep_delay", "arr_delay")
-
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
Rename
air_time
toair_time_min
to indicate units of measurement and move it to the beginning of the data frame.-
Why doesn’t the following work, and what does the error mean?
3.4 The pipe
We’ve shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs.
我们在上面向你展示了管道的简单示例,但它真正的威力在于当你开始组合多个动词时。
For example, imagine that you wanted to find the fastest flights to Houston’s IAH airport: you need to combine filter()
, mutate()
, select()
, and arrange()
:
例如,假设你想找到飞往休斯顿 IAH 机场的最快航班:你需要组合 filter()
、mutate()
、select()
和 arrange()
:
flights |>
filter(dest == "IAH") |>
mutate(speed = distance / air_time * 60) |>
select(year:day, dep_time, carrier, flight, speed) |>
arrange(desc(speed))
#> # A tibble: 7,198 × 7
#> year month day dep_time carrier flight speed
#> <int> <int> <int> <int> <chr> <int> <dbl>
#> 1 2013 7 9 707 UA 226 522.
#> 2 2013 8 27 1850 UA 1128 521.
#> 3 2013 8 28 902 UA 1711 519.
#> 4 2013 8 28 2122 UA 1022 519.
#> 5 2013 6 11 1628 UA 1178 515.
#> 6 2013 8 27 1017 UA 333 515.
#> # ℹ 7,192 more rows
Even though this pipeline has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights
data, then filter, then mutate, then select, then arrange.
尽管这个管道有四个步骤,但它很容易浏览,因为动词都出现在每行的开头:从 flights
数据开始,然后筛选,然后转换,然后选择,然后排列。
What would happen if we didn’t have the pipe?
如果我们没有管道会怎么样?
We could nest each function call inside the previous call:
我们可以将每个函数调用嵌套在前一个调用中:
Or we could use a bunch of intermediate objects:
或者我们可以使用一堆中间对象:
While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.
虽然这两种形式都有其适用的时机和场合,但管道通常能生成更易于编写和阅读的数据分析代码。
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
要将管道添加到你的代码中,我们建议使用内置的键盘快捷键 Ctrl/Cmd + Shift + M。
You’ll need to make one change to your RStudio options to use |>
instead of %>%
as shown in Figure 3.1; more on %>%
shortly.
你需要对 RStudio 选项进行一项更改,以使用 |>
而不是 %>%
,如 Figure 3.1 所示;稍后将详细介绍 %>%
。

|>
, make sure the “Use native pipe operator” option is checked. 要插入
|>
,请确保选中“使用原生管道运算符” (Use native pipe operator) 选项。
If you’ve been using the tidyverse for a while, you might be familiar with the %>%
pipe provided by the magrittr package.
如果你已经使用 tidyverse 一段时间了,你可能熟悉 magrittr 包提供的 %>%
管道。
The magrittr package is included in the core tidyverse, so you can use %>%
whenever you load the tidyverse:
magrittr 包包含在 tidyverse 核心中,因此每当你加载 tidyverse 时都可以使用 %>%
:
For simple cases, |>
and %>%
behave identically.
在简单的情况下,|>
和 %>%
的行为完全相同。
So why do we recommend the base pipe?
那么我们为什么推荐基础管道呢?
Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse.
首先,因为它是基础 R 的一部分,所以即使你不使用 tidyverse,它也始终可用。
Secondly, |>
is quite a bit simpler than %>%
: in the time between the invention of %>%
in 2014 and the inclusion of |>
in R 4.1.0 in 2021, we gained a better understanding of the pipe.
其次,|>
比 %>%
简单得多:在 2014 年 %>%
发明和 2021 年 R 4.1.0 包含 |>
之间的时间里,我们对管道有了更好的理解。
This allowed the base implementation to jettison infrequently used and less important features.
这使得基础实现可以摒弃不常用和不太重要的功能。
3.5 Groups
So far you’ve learned about functions that work with rows and columns.
到目前为止,你已经学习了处理行和列的函数。
dplyr gets even more powerful when you add in the ability to work with groups.
当你加入处理分组的能力时,dplyr 会变得更加强大。
In this section, we’ll focus on the most important functions: group_by()
, summarize()
, and the slice family of functions.
在本节中,我们将重点介绍最重要的函数:group_by()
、summarize()
和 slice 系列函数。
3.5.1 group_by()
Use group_by()
to divide your dataset into groups meaningful for your analysis:
使用 group_by()
将你的数据集划分为对你的分析有意义的组:
flights |>
group_by(month)
#> # A tibble: 336,776 × 19
#> # Groups: month [12]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
group_by()
doesn’t change the data but, if you look closely at the output, you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]
).group_by()
不会改变数据,但是,如果你仔细观察输出,你会注意到输出表明它已按月份“分组” (Groups: month [12]
)。
This means subsequent operations will now work “by month”.
这意味着后续操作现在将“按月”工作。
group_by()
adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.group_by()
将此分组特征 (称为类) 添加到数据框中,这会改变应用于数据的后续动词的行为。
3.5.2 summarize()
The most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group.
最重要的分组操作是摘要,如果用于计算单个摘要统计量,它会将数据框减少为每个组只有一行。
In dplyr, this operation is performed by summarize()
3, as shown by the following example, which computes the average departure delay by month:
在 dplyr 中,此操作由 summarize()
3 执行,如下例所示,该示例计算按月平均出发延误:
Uh-oh!
噢!
Something has gone wrong, and all of our results are NA
s (pronounced “N-A”), R’s symbol for missing value.
出问题了,我们所有的结果都是 NA
(读作“N-A”),这是 R 中表示缺失值的符号。
This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an NA
result.
发生这种情况是因为一些观察到的航班在延误列中有缺失数据,因此当我们计算包括这些值的平均值时,我们得到了一个 NA
结果。
We’ll come back to discuss missing values in detail in Chapter 18, but for now, we’ll tell the mean()
function to ignore all missing values by setting the argument na.rm
to TRUE
:
我们将在 Chapter 18 中详细讨论缺失值,但现在,我们将通过将参数 na.rm
设置为 TRUE
来告诉 mean()
函数忽略所有缺失值:
You can create any number of summaries in a single call to summarize()
.
你可以在一次 summarize()
调用中创建任意数量的摘要。
You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n()
, which returns the number of rows in each group:
你将在接下来的章节中学到各种有用的摘要,但一个非常有用的摘要是 n()
,它返回每个组中的行数:
Means and counts can get you a surprisingly long way in data science!
均值和计数可以让你在数据科学领域走得很远!
3.5.3 The slice_
functions
There are five handy functions that allow you to extract specific rows within each group:
有五个方便的函数,可以让你提取每个组中的特定行:
df |> slice_head(n = 1)
takes the first row from each group.
从每个组中取第一行。df |> slice_tail(n = 1)
takes the last row in each group.
从每个组中取最后一行。df |> slice_min(x, n = 1)
takes the row with the smallest value of columnx
.
取x
列值最小的行。df |> slice_max(x, n = 1)
takes the row with the largest value of columnx
.
取x
列值最大的行。df |> slice_sample(n = 1)
takes one random row.
取一个随机行。
You can vary n
to select more than one row, or instead of n =
, you can use prop = 0.1
to select (e.g.) 10% of the rows in each group.
你可以改变 n
来选择多于一行,或者用 prop = 0.1
来代替 n =
,以选择 (例如) 每个组中 10% 的行。
For example, the following code finds the flights that are most delayed upon arrival at each destination:
例如,以下代码查找在每个目的地到达时延误最严重的航班:
flights |>
group_by(dest) |>
slice_max(arr_delay, n = 1) |>
relocate(dest)
#> # A tibble: 108 × 19
#> # Groups: dest [105]
#> dest year month day dep_time sched_dep_time dep_delay arr_time
#> <chr> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 ABQ 2013 7 22 2145 2007 98 132
#> 2 ACK 2013 7 23 1139 800 219 1250
#> 3 ALB 2013 1 25 123 2000 323 229
#> 4 ANC 2013 8 17 1740 1625 75 2042
#> 5 ATL 2013 7 22 2257 759 898 121
#> 6 AUS 2013 7 10 2056 1505 351 2347
#> # ℹ 102 more rows
#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …
Note that there are 105 destinations but we get 108 rows here.
请注意,这里有 105 个目的地,但我们得到了 108 行。
What’s up?
怎么回事?
slice_min()
and slice_max()
keep tied values so n = 1
means give us all rows with the highest value.slice_min()
和 slice_max()
会保留并列值,所以 n = 1
意味着给我们所有具有最高值的行。
If you want exactly one row per group you can set with_ties = FALSE
.
如果你想每个组只保留一行,你可以设置 with_ties = FALSE
。
This is similar to computing the max delay with summarize()
, but you get the whole corresponding row (or rows if there’s a tie) instead of the single summary statistic.
这类似于用 summarize()
计算最大延迟,但你会得到整个对应的行 (如果有并列,则为多行),而不是单个摘要统计量。
3.5.4 Grouping by multiple variables
You can create groups using more than one variable.
你可以使用多个变量创建分组。
For example, we could make a group for each date.
例如,我们可以为每个日期创建一个分组。
daily <- flights |>
group_by(year, month, day)
daily
#> # A tibble: 336,776 × 19
#> # Groups: year, month, day [365]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
When you summarize a tibble grouped by more than one variable, each summary peels off the last group.
当你对按多个变量分组的 tibble 进行汇总时,每个汇总都会剥离最后一个分组。
In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code.
事后看来,这不是一个让这个函数工作的好方法,但很难在不破坏现有代码的情况下进行更改。
To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:
为了清楚地说明发生了什么,dplyr 会显示一条消息,告诉你如何改变这种行为:
If you’re happy with this behavior, you can explicitly request it in order to suppress the message:
如果你对这种行为感到满意,可以明确请求它以抑制消息:
Alternatively, change the default behavior by setting a different value, e.g., "drop"
to drop all grouping or "keep"
to preserve the same groups.
或者,通过设置不同的值来更改默认行为,例如,"drop"
用于删除所有分组,"keep"
用于保留相同的分组。
3.5.5 Ungrouping
You might also want to remove grouping from a data frame without using summarize()
.
你可能还想在不使用 summarize()
的情况下从数据框中删除分组。
You can do this with ungroup()
.
你可以使用 ungroup()
来做到这一点。
daily |>
ungroup()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …
Now let’s see what happens when you summarize an ungrouped data frame.
现在让我们看看当你汇总一个未分组的数据框时会发生什么。
You get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.
你只会得到一行,因为 dplyr 将未分组数据框中的所有行都视为属于一个组。
3.5.6 .by
dplyr 1.1.0 includes a new, experimental, syntax for per-operation grouping, the .by
argument.
dplyr 1.1.0 包含了一个新的、实验性的、用于按操作分组的语法,即 .by
参数。
group_by()
and ungroup()
aren’t going away, but you can now also use the .by
argument to group within a single operation:group_by()
和 ungroup()
不会消失,但你现在也可以使用 .by
参数在单个操作内进行分组:
Or if you want to group by multiple variables:
或者如果你想按多个变量分组:
.by
works with all verbs and has the advantage that you don’t need to use the .groups
argument to suppress the grouping message or ungroup()
when you’re done..by
适用于所有动词,其优点是当你完成操作后,无需使用 .groups
参数来抑制分组消息或使用 ungroup()
。
We didn’t focus on this syntax in this chapter because it was very new when we wrote the book.
我们在本章中没有重点介绍这种语法,因为在我们写书时它还是非常新的。
We did want to mention it because we think it has a lot of promise and it’s likely to be quite popular.
我们确实想提一下它,因为我们认为它有很大的潜力,而且很可能会非常流行。
You can learn more about it in the dplyr 1.1.0 blog post.
你可以在 dplyr 1.1.0 博客文章 中了解更多相关信息。
3.5.7 Exercises
Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights |> group_by(carrier, dest) |> summarize(n())
)Find the flights that are most delayed upon departure from each destination.
How do delays vary over the course of the day? Illustrate your answer with a plot.
What happens if you supply a negative
n
toslice_min()
and friends?Explain what
count()
does in terms of the dplyr verbs you just learned. What does thesort
argument tocount()
do?-
Suppose we have the following tiny data frame:
-
Write down what you think the output will look like, then check if you were correct, and describe what
group_by()
does.df |> group_by(y)
-
Write down what you think the output will look like, then check if you were correct, and describe what
arrange()
does. Also, comment on how it’s different from thegroup_by()
in part (a).df |> arrange(y)
-
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
-
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
-
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?
-
Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
-
3.6 Case study: aggregates and sample size
Whenever you do any aggregation, it’s always a good idea to include a count (n()
).
每当你进行任何聚合时,包含一个计数 (n()
) 总是好主意。
That way, you can ensure that you’re not drawing conclusions based on very small amounts of data.
这样,你可以确保你不是基于非常少量的数据得出结论。
We’ll demonstrate this with some baseball data from the Lahman package.
我们将使用 Lahman 包中的一些棒球数据来演示这一点。
Specifically, we will compare what proportion of times a player gets a hit (H
) vs. the number of times they try to put the ball in play (AB
):
具体来说,我们将比较球员击中球的次数 (H
) 与他们尝试将球击入场内的次数 (AB
) 的比例:
batters <- Lahman::Batting |>
group_by(playerID) |>
summarize(
performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
batters
#> # A tibble: 20,730 × 3
#> playerID performance n
#> <chr> <dbl> <int>
#> 1 aardsda01 0 4
#> 2 aaronha01 0.305 12364
#> 3 aaronto01 0.229 944
#> 4 aasedo01 0 5
#> 5 abadan01 0.0952 21
#> 6 abadfe01 0.111 9
#> # ℹ 20,724 more rows
When we plot the skill of the batter (measured by the batting average, performance
) against the number of opportunities to hit the ball (measured by times at bat, n
), you see two patterns:
当我们将击球手的技术 (用击球率 performance
衡量) 与击球机会数 (用击球次数 n
衡量) 绘制成图时,你会看到两种模式:
-
The variation in
performance
is larger among players with fewer at-bats.
击球次数较少的球员的performance
变化更大。The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you’ll see that the variation decreases as the sample size increases4.
这张图的形状非常有特点:每当你绘制均值 (或其他摘要统计量) 与组大小的关系图时,你都会看到随着样本量的增加,变异会减小4。 There’s a positive correlation between skill (
performance
) and opportunities to hit the ball (n
) because teams want to give their best batters the most opportunities to hit the ball.
技术 (performance
) 与击球机会 (n
) 之间存在正相关关系,因为球队希望给他们最好的击球手最多的击球机会。
batters |>
filter(n > 100) |>
ggplot(aes(x = n, y = performance)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)
Note the handy pattern for combining ggplot2 and dplyr.
注意结合 ggplot2 和 dplyr 的便捷模式。
You just have to remember to switch from |>
, for dataset processing, to +
for adding layers to your plot.
你只需要记住从用于数据集处理的 |>
切换到用于向绘图添加图层的 +
。
This also has important implications for ranking.
这对排名也有重要影响。
If you naively sort on desc(performance)
, the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit, they’re not necessarily the most skilled players:
如果你天真地按 desc(performance)
排序,那么击球率最高的人显然是那些尝试将球击入场内次数很少但碰巧击中了的人,他们不一定是最有技术的球员:
You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.
你可以在 http://varianceexplained.org/r/empirical_bayes_baseball/ 和 https://www.evanmiller.org/how-not-to-sort-by-average-rating.html 找到对这个问题以及如何克服它的很好解释。
3.7 Summary
In this chapter, you’ve learned the tools that dplyr provides for working with data frames.
在本章中,你学习了 dplyr 为处理数据框提供的工具。
The tools are roughly grouped into three categories: those that manipulate the rows (like filter()
and arrange()
), those that manipulate the columns (like select()
and mutate()
) and those that manipulate groups (like group_by()
and summarize()
).
这些工具大致分为三类:操作行的工具 (如 filter()
和 arrange()
)、操作列的工具 (如 select()
和 mutate()
) 以及操作组的工具 (如 group_by()
和 summarize()
)。
In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable.
在本章中,我们重点介绍了这些“整个数据框”的工具,但你还没有学到太多关于如何处理单个变量的知识。
We’ll return to that in the Transform part of the book, where each chapter provides tools for a specific type of variable.
我们将在本书的“转换”部分回到这个问题,其中每一章都为特定类型的变量提供了工具。
In the next chapter, we’ll pivot back to workflow to discuss the importance of code style and keeping your code well organized to make it easy for you and others to read and understand.
在下一章中,我们将转回工作流程,讨论代码风格的重要性以及保持代码井井有条,以便你和其他人易于阅读和理解。