21  Databases

21.1 Introduction

A huge amount of data lives in databases, so it’s essential that you know how to access it.
大量的数据都存储在数据库中,因此了解如何访问这些数据至关重要。

Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human.
有时,你可以请别人为你下载一个快照到 .csv 文件中,但这很快就会变得痛苦:每次你需要做更改时,都必须与另一个人沟通。

You want to be able to reach into the database directly to get the data you need, when you need it.
你希望能够直接访问数据库,在需要的时候获取所需的数据。

In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL1 query.
在本章中,你将首先学习 DBI 包的基础知识:如何使用它连接到数据库,然后通过 SQL1 查询来检索数据。

SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn.
SQL结构化查询语言structured query language)的缩写,是数据库的通用语言,也是所有数据科学家需要学习的重要语言。

That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to the SQL.
尽管如此,我们不会从 SQL 开始,而是会教你 dbplyr,它可以将你的 dplyr 代码翻译成 SQL。

We’ll use that as a way to teach you some of the most important features of SQL.
我们将以此为途径,教你一些 SQL 最重要的特性。

You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
在本章结束时,你不会成为 SQL 大师,但你将能够识别出最重要的组成部分并理解它们的作用。

21.1.1 Prerequisites

In this chapter, we’ll introduce DBI and dbplyr.
在本章中,我们将介绍 DBI 和 dbplyr。

DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.
DBI 是一个连接数据库并执行 SQL 的底层接口;dbplyr 是一个高层接口,它将你的 dplyr 代码翻译成 SQL 查询,然后通过 DBI 执行它们。

21.2 Database basics

At the simplest level, you can think about a database as a collection of data frames, called tables in database terminology.
在最简单的层面上,你可以把数据库看作是数据框的集合,在数据库术语中称为 (tables)。

Like a data frame, a database table is a collection of named columns, where every value in the column is the same type.
与数据框一样,数据库表是命名列的集合,其中每列中的所有值都具有相同的类型。

There are three high level differences between data frames and database tables:
数据框和数据库表之间有三个高级别的区别:

  • Database tables are stored on disk and can be arbitrarily large.
    Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).
    数据库表存储在磁盘上,可以任意大。
    数据框存储在内存中,并且有根本的大小限制(尽管这个限制对于许多问题来说仍然足够大)。

  • Database tables almost always have indexes.
    Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row.
    Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.
    数据库表几乎总是有索引。
    就像书的索引一样,数据库索引可以快速找到感兴趣的行,而无需查看每一行。
    数据框和 tibbles 没有索引,但 data.tables 有,这也是它们速度如此之快的原因之一。

  • Most classical databases are optimized for rapidly collecting data, not analyzing existing data.
    These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R.
    More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.
    大多数传统数据库都为快速收集数据而优化,而不是为分析现有数据而优化。
    这些数据库被称为行式存储 (row-oriented),因为数据是按行存储的,而不是像 R 那样按列存储。
    近年来,列式存储 (column-oriented) 数据库得到了长足发展,它们使分析现有数据变得快得多。

Databases are run by database management systems (DBMS’s for short), which come in three basic forms:
数据库由数据库管理系统(简称 DBMS)运行,主要有三种基本形式:

  • Client-server DBMS’s run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.
    客户端-服务器 (Client-server) 模式的 DBMS 运行在一台强大的中央服务器上,你从你的计算机(客户端)连接到它。它们非常适合在组织内与多人共享数据。流行的客户端-服务器 DBMS 包括 PostgreSQL、MariaDB、SQL Server 和 Oracle。

  • Cloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
    (Cloud) DBMS,如 Snowflake、Amazon 的 RedShift 和 Google 的 BigQuery,与客户端-服务器 DBMS 类似,但它们运行在云端。这意味着它们可以轻松处理极大的数据集,并可以根据需要自动提供更多的计算资源。

  • In-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.
    进程内 (In-process) DBMS,如 SQLite 或 duckdb,完全在你的计算机上运行。当你作为主要用户处理大型数据集时,它们是很好的选择。

21.3 Connecting to a database

To connect to the database from R, you’ll use a pair of packages:
要从 R 连接到数据库,你需要使用一对包:

  • You’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.
    你将总是使用 DBI(database interface,数据库接口),因为它提供了一组通用函数,用于连接数据库、上传数据、运行 SQL 查询等。

  • You’ll also use a package tailored for the DBMS you’re connecting to.
    This package translates the generic DBI commands into the specifics needed for a given DBMS.
    There’s usually one package for each DBMS, e.g.
    RPostgres for PostgreSQL and RMariaDB for MySQL.
    你还需要使用一个为你所连接的 DBMS 定制的包。
    这个包会将通用的 DBI 命令翻译成特定 DBMS 所需的具体指令。
    通常每个 DBMS 都有一个对应的包,例如:
    用于 PostgreSQL 的 RPostgres 和用于 MySQL 的 RMariaDB。

If you can’t find a specific package for your DBMS, you can usually use the odbc package instead.
如果你找不到适用于你的 DBMS 的特定包,你通常可以使用 odbc 包作为替代。

This uses the ODBC protocol supported by many DBMS.
它使用了许多 DBMS 支持的 ODBC 协议。

odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.
odbc 需要多一些设置,因为你还需要安装一个 ODBC 驱动程序,并告诉 odbc 包在哪里可以找到它。

Concretely, you create a database connection using DBI::dbConnect().
具体来说,你使用 DBI::dbConnect() 创建一个数据库连接。

The first argument selects the DBMS2, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
第一个参数选择 DBMS2,然后第二个及后续参数描述如何连接到它(即,它在哪里以及访问它所需的凭据)。

The following code shows a couple of typical examples:
以下代码展示了几个典型的例子:

con <- DBI::dbConnect(
  RMariaDB::MariaDB(), 
  username = "foo"
)
con <- DBI::dbConnect(
  RPostgres::Postgres(), 
  hostname = "databases.mycompany.com", 
  port = 1234
)

The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here.
连接的具体细节因 DBMS 而异,所以很遗憾我们无法在这里涵盖所有细节。

This means you’ll need to do a little research on your own.
这意味着你需要自己做一些研究。

Typically you can ask the other data scientists in your team or talk to your DBA (database administrator).
通常你可以询问团队中的其他数据科学家或与你的 DBA(数据库管理员)交流。

The initial setup will often take a little fiddling (and maybe some googling) to get it right, but you’ll generally only need to do it once.
初始设置通常需要一些摸索(可能还需要谷歌搜索)才能搞定,但你通常只需要做一次。

21.3.1 In this book

Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb.
为本书设置一个客户端-服务器或云 DBMS 会很麻烦,所以我们将使用一个完全存在于 R 包中的进程内 DBMS:duckdb。

Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database.
得益于 DBI 的魔力,使用 duckdb 和任何其他 DBMS 之间的唯一区别就是你如何连接到数据库。

This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.
这使得它非常适合教学,因为你可以轻松地运行这段代码,也可以轻松地将学到的知识应用到其他地方。

Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R.
连接到 duckdb 特别简单,因为默认设置会创建一个临时数据库,在你退出 R 时会被删除。

That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:
这对于学习来说非常棒,因为它保证了你每次重启 R 时都能从一个干净的状态开始:

con <- DBI::dbConnect(duckdb::duckdb())

duckdb is a high-performance database that’s designed very much for the needs of a data scientist.
duckdb 是一个高性能数据库,其设计非常贴合数据科学家的需求。

We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed.
我们在这里使用它,因为它非常容易上手,而且它还能以极快的速度处理数 GB 的数据。

If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb where to save it.
如果你想在一个真实的数据分析项目中使用 duckdb,你还需要提供 dbdir 参数来创建一个持久化数据库,并告诉 duckdb 将其保存在哪里。

Assuming you’re using a project (Chapter 6), it’s reasonable to store it in the duckdb directory of the current project:
假设你正在使用一个项目 (Chapter 6),将其存储在当前项目的 duckdb 目录中是合理的做法:

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")

21.3.2 Load some data

Since this is a new database, we need to start by adding some data.
由于这是一个新数据库,我们需要先添加一些数据。

Here we’ll add mpg and diamonds datasets from ggplot2 using DBI::dbWriteTable().
这里我们将使用 DBI::dbWriteTable() 添加来自 ggplot2 的 mpgdiamonds 数据集。

The simplest usage of dbWriteTable() needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.
dbWriteTable() 的最简单用法需要三个参数:一个数据库连接、要在数据库中创建的表的名称,以及一个数据框。

dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)

If you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv() and duckdb_register_arrow().
如果你在实际项目中使用 duckdb,我们强烈建议你学习 duckdb_read_csv()duckdb_register_arrow()

These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
它们为你提供了强大而高效的方法,可以快速将数据直接加载到 duckdb 中,而无需先将其加载到 R 中。

We’ll also show off a useful technique for loading multiple files into a database in Section 26.4.1.
我们还将在 Section 26.4.1 中展示一个将多个文件加载到数据库中的实用技巧。

21.3.3 DBI basics

You can check that the data is loaded correctly by using a couple of other DBI functions: dbListTables() lists all tables in the database3 and dbReadTable() retrieves the contents of a table.
你可以通过使用其他几个 DBI 函数来检查数据是否已正确加载:dbListTables() 列出数据库中的所有表3dbReadTable() 检索表的内容。

dbListTables(con)
#> [1] "diamonds" "mpg"

con |> 
  dbReadTable("diamonds") |> 
  as_tibble()
#> # A tibble: 53,940 × 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> # ℹ 53,934 more rows

dbReadTable() returns a data.frame so we use as_tibble() to convert it into a tibble so that it prints nicely.
dbReadTable() 返回一个 data.frame,所以我们使用 as_tibble() 将其转换为 tibble,以便更好地打印输出。

If you already know SQL, you can use dbGetQuery() to get the results of running a query on the database:
如果你已经了解 SQL,你可以使用 dbGetQuery() 来获取在数据库上运行查询的结果:

sql <- "
  SELECT carat, cut, clarity, color, price 
  FROM diamonds 
  WHERE price > 15000
"
as_tibble(dbGetQuery(con, sql))
#> # A tibble: 1,655 × 5
#>   carat cut       clarity color price
#>   <dbl> <fct>     <fct>   <fct> <int>
#> 1  1.54 Premium   VS2     E     15002
#> 2  1.19 Ideal     VVS1    F     15005
#> 3  2.1  Premium   SI1     I     15007
#> 4  1.69 Ideal     SI1     D     15011
#> 5  1.5  Very Good VVS2    G     15013
#> 6  1.73 Very Good VS1     G     15014
#> # ℹ 1,649 more rows

If you’ve never seen SQL before, don’t worry!
如果你以前没见过 SQL,别担心!

You’ll learn more about it shortly.
你很快就会学到更多关于它的知识。

But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000.
但是如果你仔细阅读它,你可能会猜到它从 diamonds 数据集中选择了五列,以及所有 price 大于 15,000 的行。

21.4 dbplyr basics

Now that we’ve connected to a database and loaded up some data, we can start to learn about dbplyr.
现在我们已经连接到数据库并加载了一些数据,我们可以开始学习 dbplyr 了。

dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently.
dbplyr 是一个 dplyr 后端 (backend),这意味着你继续编写 dplyr 代码,但后端会以不同的方式执行它。

In this, dbplyr translates to SQL; other backends include dtplyr which translates to data.table, and multidplyr which executes your code on multiple cores.
在这里,dbplyr 将代码翻译成 SQL;其他后端包括将代码翻译成 data.tabledtplyr,以及在多个核心上执行代码的 multidplyr

To use dbplyr, you must first use tbl() to create an object that represents a database table:
要使用 dbplyr,你必须首先使用 tbl() 创建一个代表数据库表的对象:

diamonds_db <- tbl(con, "diamonds")
diamonds_db
#> # Source:   table<diamonds> [?? x 10]
#> # Database: DuckDB v1.3.1 [14913@Windows 10 x64:R 4.5.1/:memory:]
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> # ℹ more rows

There are two other common ways to interact with a database.
与数据库交互还有另外两种常见方式。

First, many corporate databases are very large so you need some hierarchy to keep all the tables organized.
首先,许多企业数据库非常庞大,因此你需要某种层级结构来保持所有表的有序性。

In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:
在这种情况下,你可能需要提供一个模式 (schema),或者一个目录 (catalog) 和一个模式,以便选择你感兴趣的表:

diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))

Other times you might want to use your own SQL query as a starting point:
其他时候,你可能想用自己的 SQL 查询作为起点:

diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))

This object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
这个对象是惰性 (lazy) 的;当你对它使用 dplyr 函数时,dplyr 并不执行任何工作:它只是记录下你想要执行的操作序列,并且只在需要时才执行它们。

For example, take the following pipeline:
例如,看下面这个管道:

big_diamonds_db <- diamonds_db |> 
  filter(price > 15000) |> 
  select(carat:clarity, price)

big_diamonds_db
#> # Source:   SQL [?? x 5]
#> # Database: DuckDB v1.3.1 [14913@Windows 10 x64:R 4.5.1/:memory:]
#>   carat cut       color clarity price
#>   <dbl> <fct>     <fct> <fct>   <int>
#> 1  1.54 Premium   E     VS2     15002
#> 2  1.19 Ideal     F     VVS1    15005
#> 3  2.1  Premium   I     SI1     15007
#> 4  1.69 Ideal     D     SI1     15011
#> 5  1.5  Very Good G     VVS2    15013
#> 6  1.73 Very Good G     VS1     15014
#> # ℹ more rows

You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows.
你可以看出这个对象代表一个数据库查询,因为它在顶部打印了 DBMS 的名称,而且虽然它告诉了你列数,但通常不知道行数。

This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.
这是因为查找总行数通常需要执行完整的查询,而这正是我们试图避免的。

You can see the SQL code generated by the dplyr function show_query().
你可以看到由 dplyr 函数 show_query() 生成的 SQL 代码。

If you know dplyr, this is a great way to learn SQL!
如果你了解 dplyr,这是学习 SQL 的一个好方法!

Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.
编写一些 dplyr 代码,让 dbplyr 将其翻译成 SQL,然后试着弄清楚这两种语言是如何对应的。

big_diamonds_db |>
  show_query()
#> <SQL>
#> SELECT carat, cut, color, clarity, price
#> FROM diamonds
#> WHERE (price > 15000.0)

To get all the data back into R, you call collect().
要将所有数据取回 R 中,你可以调用 collect()

Behind the scenes, this generates the SQL, calls dbGetQuery() to get the data, then turns the result into a tibble:
在幕后,这会生成 SQL,调用 dbGetQuery() 获取数据,然后将结果转换为一个 tibble:

big_diamonds <- big_diamonds_db |> 
  collect()
big_diamonds
#> # A tibble: 1,655 × 5
#>   carat cut       color clarity price
#>   <dbl> <fct>     <fct> <fct>   <int>
#> 1  1.54 Premium   E     VS2     15002
#> 2  1.19 Ideal     F     VVS1    15005
#> 3  2.1  Premium   I     SI1     15007
#> 4  1.69 Ideal     D     SI1     15011
#> 5  1.5  Very Good G     VVS2    15013
#> 6  1.73 Very Good G     VS1     15014
#> # ℹ 1,649 more rows

Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below.
通常,你会使用 dbplyr 从数据库中选择你想要的数据,使用下面描述的转换方法执行基本的筛选和聚合。

Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect() the data to get an in-memory tibble, and continue your work with pure R code.
然后,当你准备好使用 R 特有的函数分析数据时,你会 collect() 数据以获得一个内存中的 tibble,并继续用纯 R 代码进行你的工作。

21.5 SQL

The rest of the chapter will teach you a little SQL through the lens of dbplyr.
本章的其余部分将通过 dbplyr 的视角教你一些 SQL。

It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics.
这是一个相当非传统的 SQL 入门,但我们希望它能让你快速掌握基础知识。

Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.
幸运的是,如果你理解 dplyr,那么你很快就能学会 SQL,因为很多概念是相同的。

We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes.
我们将使用 nycflights13 包中的两个老朋友:flightsplanes 来探索 dplyr 和 SQL 之间的关系。

These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:
这些数据集很容易导入到我们的学习数据库中,因为 dbplyr 提供了一个函数,可以将 nycflights13 中的表复制到我们的数据库中:

dbplyr::copy_nycflights13(con)
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
#> Creating table: planes
#> Creating table: weather
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")

21.5.1 SQL basics

The top-level components of SQL are called statements.
SQL 的顶层组件被称为语句 (statements)。

Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data.
常见的语句包括用于定义新表的 CREATE、用于添加数据的 INSERT 以及用于检索数据的 SELECT

We will focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.
我们将专注于 SELECT 语句,也称为查询 (queries),因为它们几乎是你作为数据科学家唯一会用到的。

A query is made up of clauses.
一个查询由子句 (clauses) 组成。

There are five important clauses: SELECT, FROM, WHERE, ORDER BY, and GROUP BY. Every query must have the SELECT4 and FROM5 clauses and the simplest query is SELECT * FROM table, which selects all columns from the specified table . This is what dbplyr generates for an unadulterated table :
有五个重要的子句:SELECTFROMWHEREORDER BYGROUP BY。每个查询都必须有 SELECT4FROM5 子句,最简单的查询是 SELECT * FROM table,它从指定的表中选择所有列。这是 dbplyr 为一个未经处理的表生成的代码:

flights |> show_query()
#> <SQL>
#> SELECT *
#> FROM flights
planes |> show_query()
#> <SQL>
#> SELECT *
#> FROM planes

WHERE and ORDER BY control which rows are included and how they are ordered:
WHEREORDER BY 控制包含哪些行以及它们的排序方式:

flights |> 
  filter(dest == "IAH") |> 
  arrange(dep_delay) |>
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> WHERE (dest = 'IAH')
#> ORDER BY dep_delay

GROUP BY converts the query to a summary, causing aggregation to happen:
GROUP BY 将查询转换为摘要,从而进行聚合操作:

flights |> 
  group_by(dest) |> 
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |> 
  show_query()
#> <SQL>
#> SELECT dest, AVG(dep_delay) AS dep_delay
#> FROM flights
#> GROUP BY dest

There are two important differences between dplyr verbs and SELECT clauses:
dplyr 函数和 SELECT 子句之间有两个重要区别:

  • In SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names.
    在 SQL 中,大小写不重要:你可以写 selectSELECT,甚至 SeLeCt。在本书中,我们将遵循通常的惯例,用大写字母书写 SQL 关键字,以区别于表或变量名。

  • In SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses are actually evaluated which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.
    在 SQL 中,顺序很重要:你必须始终按 SELECTFROMWHEREGROUP BYORDER BY 的顺序编写子句。令人困惑的是,这个顺序与子句的实际求值顺序不匹配,实际顺序是先 FROM,然后是 WHEREGROUP BYSELECTORDER BY

The following sections explore each clause in more detail.
以下各节将更详细地探讨每个子句。

Note that while SQL is a standard, it is extremely complex and no database follows it exactly.
请注意,虽然 SQL 是一个标准,但它极其复杂,没有哪个数据库能完全遵循它。

While the main components that we’ll focus on in this book are very similar between DBMS’s, there are many minor variations.
虽然本书中我们关注的主要组成部分在不同 DBMS 之间非常相似,但存在许多细微的差异。

Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases.
幸运的是,dbplyr 旨在处理这个问题,并为不同的数据库生成不同的翻译。

It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.
它并不完美,但它在不断改进,如果你遇到问题,可以在 GitHub 上提交一个 issue,帮助我们做得更好。

21.5.2 SELECT

The SELECT clause is the workhorse of queries and performs the same job as select(), mutate(), rename(), relocate(), and, as you’ll learn in the next section, summarize().
SELECT 子句是查询的主力,它执行与 select()mutate()rename()relocate() 相同的工作,并且,正如你将在下一节中学到的,还包括 summarize()

select(), rename(), and relocate() have very direct translations to SELECT as they just affect where a column appears (if at all) along with its name:
select()rename()relocate()SELECT 有非常直接的转换关系,因为它们只影响列出现的位置(如果出现的话)及其名称:

planes |> 
  select(tailnum, type, manufacturer, model, year) |> 
  show_query()
#> <SQL>
#> SELECT tailnum, "type", manufacturer, model, "year"
#> FROM planes

planes |> 
  select(tailnum, type, manufacturer, model, year) |> 
  rename(year_built = year) |> 
  show_query()
#> <SQL>
#> SELECT tailnum, "type", manufacturer, model, "year" AS year_built
#> FROM planes

planes |> 
  select(tailnum, type, manufacturer, model, year) |> 
  relocate(manufacturer, model, .before = type) |> 
  show_query()
#> <SQL>
#> SELECT tailnum, manufacturer, model, "type", "year"
#> FROM planes

This example also shows you how SQL does renaming.
这个例子也向你展示了 SQL 如何进行重命名。

In SQL terminology renaming is called aliasing and is done with AS.
在 SQL 术语中,重命名被称为别名 (aliasing),并使用 AS 来完成。

Note that unlike mutate(), the old name is on the left and the new name is on the right.
请注意,与 mutate() 不同,旧名称在左边,新名称在右边。

In the examples above note that "year" and "type" are wrapped in double quotes.
在上面的例子中,请注意 "year""type" 被双引号包裹着。

That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
这是因为它们在 duckdb 中是保留字 (reserved words),所以 dbplyr 给它们加上引号,以避免列/表名与 SQL 操作符之间可能产生的混淆。

When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.
在使用其他数据库时,你很可能会看到每个变量名都被引起来,因为只有少数客户端包(如 duckdb)知道所有的保留字是什么,所以它们为了安全起见会引用所有内容。

SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"

Some other database systems use backticks instead of quotes:
其他一些数据库系统使用反引号而不是引号:

SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`

The translations for mutate() are similarly straightforward: each variable becomes a new expression in SELECT:
mutate() 的转换同样直接:每个变量都成为 SELECT 中的一个新表达式:

flights |> 
  mutate(
    speed = distance / (air_time / 60)
  ) |> 
  show_query()
#> <SQL>
#> SELECT flights.*, distance / (air_time / 60.0) AS speed
#> FROM flights

We’ll come back to the translation of individual components (like /) in Section 21.6.
我们将在 Section 21.6 中回过头来讨论单个组件(如 /)的翻译。

21.5.3 FROM

The FROM clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.
FROM 子句定义了数据源。在一段时间内,它会显得相当无趣,因为我们只使用单个表。一旦我们接触到连接函数,你将会看到更复杂的例子。

21.5.4 GROUP BY

group_by() is translated to the GROUP BY6 clause and summarize() is translated to the SELECT clause:
group_by() 被翻译为 GROUP BY6 子句,而 summarize() 被翻译为 SELECT 子句:

diamonds_db |> 
  group_by(cut) |> 
  summarize(
    n = n(),
    avg_price = mean(price, na.rm = TRUE)
  ) |> 
  show_query()
#> <SQL>
#> SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
#> FROM diamonds
#> GROUP BY cut

We’ll come back to what’s happening with the translation of n() and mean() in Section 21.6.
我们将在 Section 21.6 回过头来讨论 n()mean() 的翻译发生了什么。

21.5.5 WHERE

filter() is translated to the WHERE clause:
filter() 被翻译为 WHERE 子句:

flights |> 
  filter(dest == "IAH" | dest == "HOU") |> 
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> WHERE (dest = 'IAH' OR dest = 'HOU')

flights |> 
  filter(arr_delay > 0 & arr_delay < 20) |> 
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> WHERE (arr_delay > 0.0 AND arr_delay < 20.0)

There are a few important details to note here:
这里有几个重要的细节需要注意:

  • | becomes OR and & becomes AND.
    | 变成 OR& 变成 AND

  • SQL uses = for comparison, not ==. SQL doesn’t have assignment, so there’s no potential for confusion there.
    SQL 使用 = 进行比较,而不是 ==。SQL 没有赋值操作,所以不存在混淆的可能。

  • SQL uses only '' for strings, not "". In SQL, "" is used to identify variables, like R’s ``.
    SQL 只用 '' 表示字符串,而不用 ""。在 SQL 中,"" 用于标识变量,就像 R 中的 `` 一样。

Another useful SQL operator is IN, which is very close to R’s %in%:
另一个有用的 SQL 运算符是 IN,它与 R 的 %in% 非常接近:

flights |> 
  filter(dest %in% c("IAH", "HOU")) |> 
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> WHERE (dest IN ('IAH', 'HOU'))

SQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:
SQL 使用 NULL 而不是 NANULL 的行为与 NA 类似。主要区别在于,虽然它们在比较和算术运算中具有“传染性”,但在汇总时会被静默地丢弃。dbplyr 在你第一次遇到这种情况时会提醒你这个行为:

flights |> 
  group_by(dest) |> 
  summarize(delay = mean(arr_delay))
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.3.1 [14913@Windows 10 x64:R 4.5.1/:memory:]
#>   dest   delay
#>   <chr>  <dbl>
#> 1 CLT    7.36 
#> 2 MDW   12.4  
#> 3 SDF   12.7  
#> 4 LAS    0.258
#> 5 IAH    4.24 
#> 6 CAK   19.7  
#> # ℹ more rows

If you want to learn more about how NULLs work, you might enjoy “The Three-Valued Logic of SQL” by Markus Winand.
如果你想更多地了解 NULL 的工作原理,你可能会喜欢 Markus Winand 的《SQL 的三值逻辑》。

In general, you can work with NULLs using the functions you’d use for NAs in R:
通常,你可以使用在 R 中处理 NA 的函数来处理 NULL

flights |> 
  filter(!is.na(dep_delay)) |> 
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> WHERE (NOT((dep_delay IS NULL)))

This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:
这个 SQL 查询揭示了 dbplyr 的一个缺点:虽然 SQL 是正确的,但它不像你手写的那样简洁。在这种情况下,你可以去掉括号,使用一个更易读的特殊运算符:

WHERE "dep_delay" IS NOT NULL

Note that if you filter() a variable that you created using a summarize, dbplyr will generate a HAVING clause, rather than a WHERE clause. This is a one of the idiosyncrasies of SQL: WHERE is evaluated before SELECT and GROUP BY, so SQL needs another clause that’s evaluated afterwards.
请注意,如果你对使用 summarize 创建的变量进行 filter(),dbplyr 将生成 HAVING 子句,而不是 WHERE 子句。这是 SQL 的一个特性:WHERESELECTGROUP BY 之前被评估,所以 SQL 需要另一个在之后评估的子句。

diamonds_db |> 
  group_by(cut) |> 
  summarize(n = n()) |> 
  filter(n > 100) |> 
  show_query()
#> <SQL>
#> SELECT cut, COUNT(*) AS n
#> FROM diamonds
#> GROUP BY cut
#> HAVING (COUNT(*) > 100.0)

21.5.6 ORDER BY

Ordering rows involves a straightforward translation from arrange() to the ORDER BY clause:
对行进行排序,涉及从 arrange()ORDER BY 子句的直接翻译:

flights |> 
  arrange(year, month, day, desc(dep_delay)) |> 
  show_query()
#> <SQL>
#> SELECT flights.*
#> FROM flights
#> ORDER BY "year", "month", "day", dep_delay DESC

Notice how desc() is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.
注意 desc() 是如何被翻译成 DESC 的:这是众多直接受 SQL 启发的 dplyr 函数之一。

21.5.7 Subqueries

Sometimes it’s not possible to translate a dplyr pipeline into a single SELECT statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM clause, instead of the usual table.
有时,无法将一个 dplyr 管道翻译成单个 SELECT 语句,这时你需要使用子查询。子查询 (subquery) 就是一个在 FROM 子句中用作数据源的查询,而不是通常的表。

dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes year1 and then the second (outer) query can compute year2.
dbplyr 通常使用子查询来规避 SQL 的限制。例如,SELECT 子句中的表达式不能引用刚刚创建的列。这意味着下面这个(有点傻的)dplyr 管道需要分两步进行:第一步(内部)查询计算出 year1,然后第二步(外部)查询才能计算出 year2

flights |> 
  mutate(
    year1 = year + 1,
    year2 = year1 + 1
  ) |> 
  show_query()
#> <SQL>
#> SELECT q01.*, year1 + 1.0 AS year2
#> FROM (
#>   SELECT flights.*, "year" + 1.0 AS year1
#>   FROM flights
#> ) q01

You’ll also see this if you attempted to filter() a variable that you just created. Remember, even though WHERE is written after SELECT, it’s evaluated before it, so we need a subquery in this (silly) example:
如果你试图对一个刚刚创建的变量进行 filter(),你也会看到这种情况。记住,尽管 WHERE 写在 SELECT 之后,但它是在 SELECT 之前被评估的,所以在这个(有点傻的)例子中我们需要一个子查询:

flights |> 
  mutate(year1 = year + 1) |> 
  filter(year1 == 2014) |> 
  show_query()
#> <SQL>
#> SELECT q01.*
#> FROM (
#>   SELECT flights.*, "year" + 1.0 AS year1
#>   FROM flights
#> ) q01
#> WHERE (year1 = 2014.0)

Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.
有时 dbplyr 会在不需要的情况下创建一个子查询,因为它还不知道如何优化该翻译。随着 dbplyr 的不断改进,这些情况会越来越少,但可能永远不会完全消失。

21.5.8 Joins

If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:
如果你熟悉 dplyr 的连接 (joins),SQL 的连接非常相似。下面是一个简单的例子:

flights |> 
  left_join(planes |> rename(year_built = year), join_by(tailnum)) |> 
  show_query()
#> <SQL>
#> SELECT
#>   flights.*,
#>   planes."year" AS year_built,
#>   "type",
#>   manufacturer,
#>   model,
#>   engines,
#>   seats,
#>   speed,
#>   engine
#> FROM flights
#> LEFT JOIN planes
#>   ON (flights.tailnum = planes.tailnum)

The main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM clause to bring in additional tables, using ON to define how the tables are related.
这里主要要注意的是语法:SQL 连接使用 FROM 子句的子句来引入额外的表,并使用 ON 来定义表之间的关系。

dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join(), right_join(), and full_join():
dplyr 中这些函数的名称与 SQL 紧密相关,因此你可以轻松猜出 inner_join()right_join()full_join() 的等效 SQL:

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)

You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the dm package, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.
当处理数据库中的数据时,你很可能需要进行多次连接。这是因为数据库表通常以高度规范化的形式存储,每个“事实”都存储在单一位置,为了进行分析而得到一个完整的数据集,你需要在一个由主键和外键连接的复杂表网络中穿梭。如果你遇到这种情况,由 Tobias Schieferdecker、Kirill Müller 和 Darko Bergant 开发的 dm 包 将是你的救星。它可以利用数据库管理员(DBA)通常提供的约束自动确定表之间的连接,将连接可视化以便你了解情况,并生成连接一个表到另一个表所需的连接操作。

21.5.9 Other verbs

dbplyr also translates other verbs like distinct(), slice_*(), and intersect(), and a growing selection of tidyr functions like pivot_longer() and pivot_wider(). The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.
dbplyr 还可以翻译其他动词,如 distinct()slice_*()intersect(),以及越来越多的 tidyr 函数,如 pivot_longer()pivot_wider()。要查看当前所有可用功能的完整列表,最简单的方法是访问 dbplyr 网站:https://dbplyr.tidyverse.org/reference/

21.5.10 Exercises

  1. What is distinct() translated to? How about head()?

  2. Explain what each of the following SQL queries do and try recreate them using dbplyr.

    SELECT \* 
    FROM flights
    WHERE dep_delay \< arr_delay
    
    SELECT \*, distance / (air_time / 60) AS speed
    FROM flights

21.6 Function translations

So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g., what happens when you use mean(x) in a summarize()?
到目前为止,我们主要关注了 dplyr 动词如何被翻译成查询子句的宏观层面。现在,我们将稍微深入一些,讨论处理单个列的 R 函数的翻译,例如,当你在 summarize() 中使用 mean(x) 时会发生什么?

To help see what’s going on, we’ll use a couple of little helper functions that run a summarize() or mutate() and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.
为了帮助理解发生了什么,我们将使用几个小辅助函数,它们会运行一个 summarize()mutate() 并显示生成的 SQL。这将使我们更容易探索一些变化,并观察汇总和转换有何不同。

summarize_query <- function(df, ...) {
  df |> 
    summarize(...) |> 
    show_query()
}
mutate_query <- function(df, ...) {
  df |> 
    mutate(..., .keep = "none") |> 
    show_query()
}

Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like mean(), have a relatively simple translation while others, like median(), are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.
让我们从一些汇总操作开始吧!看下面的代码,你会注意到一些汇总函数,比如 mean(),其翻译相对简单,而另一些,比如 median(),则要复杂得多。对于在统计学中常见但在数据库中不太常见的操作,其复杂性通常更高。

flights |> 
  group_by(year, month, day) |>  
  summarize_query(
    mean = mean(arr_delay, na.rm = TRUE),
    median = median(arr_delay, na.rm = TRUE)
  )
#> `summarise()` has grouped output by "year" and "month". You can override
#> using the `.groups` argument.
#> <SQL>
#> SELECT
#>   "year",
#>   "month",
#>   "day",
#>   AVG(arr_delay) AS mean,
#>   MEDIAN(arr_delay) AS median
#> FROM flights
#> GROUP BY "year", "month", "day"

The translation of summary functions becomes more complicated when you use them inside a mutate() because they have to turn into so-called window functions. In SQL, you turn an ordinary aggregation function into a window function by adding OVER after it:
当你在 mutate() 中使用汇总函数时,它们的翻译会变得更加复杂,因为它们必须转换成所谓的窗口 (window) 函数。在 SQL 中,你通过在普通聚合函数后添加 OVER 来将其转换为窗口函数:

flights |> 
  group_by(year, month, day) |>  
  mutate_query(
    mean = mean(arr_delay, na.rm = TRUE),
  )
#> <SQL>
#> SELECT
#>   "year",
#>   "month",
#>   "day",
#>   AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
#> FROM flights

In SQL, the GROUP BY clause is used exclusively for summaries so here you can see that the grouping has moved from the GROUP BY clause to OVER.
在 SQL 中,GROUP BY 子句专用于汇总,所以在这里你可以看到分组已经从 GROUP BY 子句移到了 OVER 中。

Window functions include all functions that look forward or backwards, like lead() and lag() which look at the “previous” or “next” value respectively:
窗口函数包括所有向前或向后看的函数,例如 lead()lag(),它们分别查看“前一个”或“后一个”值:

flights |> 
  group_by(dest) |>  
  arrange(time_hour) |> 
  mutate_query(
    lead = lead(arr_delay),
    lag = lag(arr_delay)
  )
#> <SQL>
#> SELECT
#>   dest,
#>   LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
#>   LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
#> FROM flights
#> ORDER BY time_hour

Here it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.
在这里,对数据进行 arrange() 很重要,因为 SQL 表没有固有的顺序。事实上,如果你不使用 arrange(),每次返回的行顺序可能都不同!注意,对于窗口函数,排序信息是重复的:主查询的 ORDER BY 子句不会自动应用于窗口函数。

Another important SQL function is CASE WHEN. It’s used as the translation of if_else() and case_when(), the dplyr function that it directly inspired. Here are a couple of simple examples:
另一个重要的 SQL 函数是 CASE WHEN。它被用作 if_else()case_when() 的翻译,而后者正是直接受其启发的 dplyr 函数。这里有几个简单的例子:

flights |> 
  mutate_query(
    description = if_else(arr_delay > 0, "delayed", "on-time")
  )
#> <SQL>
#> SELECT CASE WHEN (arr_delay > 0.0) THEN 'delayed' WHEN NOT (arr_delay > 0.0) THEN 'on-time' END AS description
#> FROM flights
flights |> 
  mutate_query(
    description = 
      case_when(
        arr_delay < -5 ~ "early", 
        arr_delay < 5 ~ "on-time",
        arr_delay >= 5 ~ "late"
      )
  )
#> <SQL>
#> SELECT CASE
#> WHEN (arr_delay < -5.0) THEN 'early'
#> WHEN (arr_delay < 5.0) THEN 'on-time'
#> WHEN (arr_delay >= 5.0) THEN 'late'
#> END AS description
#> FROM flights

CASE WHEN is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is cut():
CASE WHEN 也用于一些其他没有从 R 到 SQL 直接翻译的函数。一个很好的例子是 cut()

flights |> 
  mutate_query(
    description =  cut(
      arr_delay, 
      breaks = c(-Inf, -5, 5, Inf), 
      labels = c("early", "on-time", "late")
    )
  )
#> <SQL>
#> SELECT CASE
#> WHEN (arr_delay <= -5.0) THEN 'early'
#> WHEN (arr_delay <= 5.0) THEN 'on-time'
#> WHEN (arr_delay > 5.0) THEN 'late'
#> END AS description
#> FROM flights

dbplyr also translates common string and date-time manipulation functions, which you can learn about in vignette("translation-function", package = "dbplyr"). dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.
dbplyr 还可以翻译常见的字符串和日期时间操作函数,你可以在 vignette("translation-function", package = "dbplyr") 中了解这些内容。dbplyr 的翻译当然不是完美的,还有很多 R 函数尚未被翻译,但 dbplyr 在覆盖你大部分时间会用到的函数方面做得相当不错。

21.7 Summary

In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s the most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R.
在本章中,你学习了如何从数据库访问数据。我们重点介绍了 dbplyr,这是一个 dplyr 的“后端”,它允许你编写你所熟悉的 dplyr 代码,并将其自动翻译成 SQL。我们利用这种翻译教了你一点 SQL;学习一些 SQL 很重要,因为它是常用的数据处理语言,了解一些 SQL 将使你更容易与不使用 R 的其他数据从业者交流。

If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:
如果你已经完成了本章并想学习更多关于 SQL 的知识,我们有两个建议:

  • SQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organizations.
    Renée M. P. Teate 的 SQL for Data Scientists 是一本专为数据科学家的需求而设计的 SQL 入门书籍,其中包含了你在真实组织中可能遇到的那种高度互联数据的示例。

  • Practical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
    Anthony DeBarros 的 Practical SQL 是从数据记者(一位专门讲述引人入胜故事的数据科学家)的视角撰写的,它更详细地介绍了如何将数据导入数据库以及如何运行你自己的数据库管理系统 (DBMS)。

In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.
在下一章中,我们将学习另一个用于处理大数据的 dplyr 后端:arrow。Arrow 专为处理磁盘上的大文件而设计,是数据库的天然补充。


  1. SQL is either pronounced “s”-“q”-“l” or “sequel”.↩︎

  2. Typically, this is the only function you’ll use from the client package, so we recommend using :: to pull out that one function, rather than loading the complete package with library().↩︎

  3. At least, all the tables that you have permission to see.↩︎

  4. Confusingly, depending on the context, SELECT is either a statement or a clause.↩︎

  5. Ok, technically, only the SELECT is required, since you can write queries like SELECT 1+1 to perform basic calculations.↩︎

  6. This is no coincidence: the dplyr function name was inspired by the SQL clause.↩︎