7  Data import

7.1 Introduction

Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point.

In this chapter, you’ll learn the basics of reading data files into R.

Specifically, this chapter will focus on reading plain-text rectangular files.

We’ll start with practical advice for handling features like column names, types, and missing data.

You will then learn about reading data from multiple files at once and writing data from R to a file.

Finally, you’ll learn how to handcraft data frames in R.

7.1.1 Prerequisites

In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.
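
If the tidyverse isn’t already attached in your session, loading it also makes readr available; a minimal sketch:

library(tidyverse)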

7.2 Reading data from a file

To begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values.

Here is what a simple CSV file looks like.

The first row, commonly called the header row, gives the column names, and the following six rows provide the data.

The columns are separated, aka delimited, by commas.

Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

Table 7.1 shows a representation of the same data as a table.

Table 7.1: Data from the students.csv file as a table.
Student ID  Full Name         favourite.food      mealPlan             AGE
1           Sunil Huffmann    Strawberry yoghurt  Lunch only           4
2           Barclay Lynn      French fries        Lunch only           5
3           Jayendra Lyne     N/A                 Breakfast and lunch  7
4           Leon Rossini      Anchovies           Lunch only           NA
5           Chidiegwu Dunkel  Pizza               Breakfast and lunch  five
6           Güvenç Attila     Ice cream           Lunch only           6

We can read this file into R using read_csv().

The first argument is the most important: the path to the file.

You can think about the path as the address of the file: the file is called students.csv and it lives in the data folder.

students <- read_csv("data/students.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Full Name, favourite.food, mealPlan, AGE
#> dbl (1): Student ID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The code above will work if you have the students.csv file in a data folder in your project.

You can download the students.csv file from https://pos.it/r4ds-students-csv or you can read it directly from that URL with:

students <- read_csv("https://pos.it/r4ds-students-csv")

When you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).

It also prints out some information about retrieving the full column specification and how to quiet this message.

This message is an integral part of readr, and we’ll return to it in Section 7.3.
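
For example, once the data have been read in you can retrieve the stored column specification with spec(), or silence the message up front with show_col_types; a quick sketch:

spec(students)
read_csv("data/students.csv", show_col_types = FALSE)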

7.2.1 Practical advice

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.

Let’s take another look at the students data with that in mind.

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

In the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”.

This is something we can address using the na argument.

By default, read_csv() only recognizes empty strings ("") in this dataset as NAs, and we want it to also recognize the character string "N/A".

students <- read_csv("data/students.csv", na = c("N/A", ""))

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

You might also notice that the Student ID and Full Name columns are surrounded by backticks.

That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names.

To refer to these variables, you need to surround them with backticks, `:

students |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
#> # A tibble: 6 × 5
#>   student_id full_name        favourite.food     mealPlan            AGE  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

An alternative approach is to use janitor::clean_names(), which uses some heuristics to turn them all into snake case at once¹.

students |> janitor::clean_names()
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Another common task after reading in data is to consider variable types.

For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:

students |>
  janitor::clean_names() |>
  mutate(meal_plan = factor(meal_plan))
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Note that the values in the meal_plan variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>).

You’ll learn more about factors in Chapter 16.

Before you analyze these data, you’ll probably want to fix the age column.

Currently, age is a character variable because one of the observations is typed out as five instead of a numeric 5.

We discuss the details of fixing this issue in Chapter 20.

students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

A new function here is if_else(), which has three arguments.

The first argument, condition, should be a logical vector.

The result will contain the value of the second argument, true, when condition is TRUE, and the value of the third argument, false, when it is FALSE.

Here we’re saying if age is the character string "five", make it "5", and if not leave it as age.
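
As a standalone illustration of the same pattern (the input vector here is made up), a minimal sketch:

age_chr <- c("4", "five", "6")
if_else(age_chr == "five", "5", age_chr)
#> [1] "4" "5" "6"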

You will learn more about if_else() and logical vectors in Chapter 12.

7.2.2 Other arguments

There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read text strings that you’ve created and formatted like a CSV file:

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)
#> # A tibble: 2 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

Usually, read_csv() uses the first line of the data for the column names, which is a very common convention.

But it’s not uncommon for a few lines of metadata to be included at the top of the file.

You can use skip = n to skip the first n lines or use comment = "#" to drop all lines that start with (e.g.) #:

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
#> # A tibble: 1 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

In other cases, the data might not have column names.

You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

Alternatively, you can pass col_names a character vector which will be used as the column names:

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice.

(For the rest, you’ll need to carefully inspect your .csv file and read the documentation for read_csv()’s many other arguments.)

7.2.3 Other file types

Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for (a couple of quick sketches follow the list):

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.

  • read_tsv() reads tab-delimited files.

  • read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.

  • read_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().

  • read_table() reads a common variation of fixed-width files where columns are separated by white space.

  • read_log() reads Apache-style log files.
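
For instance, here are quick sketches of read_delim() and read_csv2() on inline text (the strings are made up for illustration):

read_delim("x|y\n1|2\n3|4", delim = "|")
read_csv2("x;y\n1,5;2,3")

The second call should parse 1,5 as 1.5, since read_csv2() also treats , as the decimal mark.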

7.2.4 Exercises

  1. What function would you use to read a file where fields were separated with “|”?

  2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

  3. What are the most important arguments to read_fwf()?

  4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". To read the following text into a data frame, what argument to read_csv() do you need to specify?

    "x,y\n1,'a,b'"
  5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
  6. Practice referring to non-syntactic names in the following data frame by:

    1. Extracting the variable called 1.
    2. Plotting a scatterplot of 1 vs. 2.
    3. Creating a new column called 3, which is 2 divided by 1.
    4. Renaming the columns to one, two, and three.
    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )

7.3 Controlling column types

A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type.

This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself.

Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.

7.3.1 Guessing types

readr uses a heuristic to figure out the column types.

For each column, it pulls the values of 1,000² rows spaced evenly from the first row to the last, ignoring missing values.

It then works through the following questions:

  • Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.

  • Does it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.

  • Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Section 17.2).

  • Otherwise, it must be a string.

You can see that behavior in action in this simple example:

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
#> # A tibble: 3 × 4
#>   logical numeric date       string
#>   <lgl>     <dbl> <date>     <chr> 
#> 1 TRUE        1   2021-01-15 abc   
#> 2 FALSE       4.5 2021-02-15 def   
#> 3 TRUE      Inf   2021-02-16 ghi

This heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.

7.3.2 Missing values, column types, and problems

The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type.

One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.

Take this simple one-column CSV file as an example:

simple_csv <- "
  x
  10
  .
  20
  30"

If we read it without any additional arguments, x becomes a character column:

read_csv(simple_csv)
#> # A tibble: 4 × 1
#>   x    
#>   <chr>
#> 1 10   
#> 2 .    
#> 3 20   
#> 4 30

In this very small case, you can easily see the missing value ..

But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them?

One approach is to tell readr that x is a numeric column, and then see where it fails.

You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:

df <- read_csv(
  simple_csv, 
  col_types = list(x = col_double())
)
#> Warning: One or more parsing issues, call `problems()` on your data frame for
#> details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

Now read_csv() reports that there was a problem, and tells us we can find out more with problems():

problems(df)
#> # A tibble: 1 × 5
#>     row   col expected actual file                                           
#>   <int> <int> <chr>    <chr>  <chr>                                          
#> 1     3     1 a double .      C:/Users/14913/AppData/Local/Temp/RtmpOgex7E/f…

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a ..

That suggests this dataset uses . for missing values.

So then we set na = "." and the automatic guessing succeeds, giving us the numeric column that we want:

read_csv(simple_csv, na = ".")
#> # A tibble: 4 × 1
#>       x
#>   <dbl>
#> 1    10
#> 2    NA
#> 3    20
#> 4    30

7.3.3 Column types

readr provides a total of nine column types for you to use (a combined example follows the list):

  • col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.

  • col_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.

  • col_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.

  • col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Chapter 16 and Chapter 17.

  • col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Chapter 13.

  • col_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
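
Putting a few of these together, here’s a hedged sketch of an explicit specification for the students file (the particular types chosen here are just one reasonable option):

read_csv(
  "data/students.csv",
  na = c("N/A", ""),
  col_types = list(
    `Student ID` = col_integer(),
    `Full Name` = col_character(),
    favourite.food = col_character(),
    mealPlan = col_factor(),
    AGE = col_character()
  )
)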

It’s also possible to override the default column type by switching from list() to cols() and specifying .default:

another_csv <- "
x,y,z
1,2,3"

read_csv(
  another_csv, 
  col_types = cols(.default = col_character())
)
#> # A tibble: 1 × 3
#>   x     y     z    
#>   <chr> <chr> <chr>
#> 1 1     2     3

Another useful helper is cols_only() which will read in only the columns you specify:

read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)
#> # A tibble: 1 × 1
#>   x    
#>   <chr>
#> 1 1

7.4 Reading data from multiple files

Sometimes your data is split across multiple files instead of being contained in a single file.

For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March.

With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.

sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#> # A tibble: 19 × 6
#>   file              month    year brand  item     n
#>   <chr>             <chr>   <dbl> <dbl> <dbl> <dbl>
#> 1 data/01-sales.csv January  2019     1  1234     3
#> 2 data/01-sales.csv January  2019     1  8721     9
#> 3 data/01-sales.csv January  2019     1  1822     2
#> 4 data/01-sales.csv January  2019     2  3333     1
#> 5 data/01-sales.csv January  2019     2  2156     9
#> 6 data/01-sales.csv January  2019     2  3987     6
#> # ℹ 13 more rows

Once again, the code above will work if you have the CSV files in a data folder in your project.

You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales or you can read them directly with:

sales_files <- c(
  "https://pos.it/r4ds-01-sales",
  "https://pos.it/r4ds-02-sales",
  "https://pos.it/r4ds-03-sales"
)
read_csv(sales_files, id = "file")

The id argument adds a new column called file to the resulting data frame that identifies the file the data come from.

This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.
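
If you only care about the file name rather than the full path, you could strip the directory afterwards; a small sketch (reusing sales_files from above and base R’s basename()):

read_csv(sales_files, id = "file") |>
  mutate(file = basename(file))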

If you have many files you want to read in, it can get cumbersome to write out their names as a list.

Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names.

You’ll learn more about these patterns in Chapter 15.

sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
#> [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"

7.5 Writing to a file

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv().

The most important arguments to these functions are x (the data frame to save) and file (the location to save it).

You can also specify how missing values are written with na, and if you want to append to an existing file.

write_csv(students, "students.csv")
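
If you wanted different text for missing values, or to add rows to the bottom of an existing file, here is a hedged sketch of those two arguments (students-log.csv is a made-up file name):

write_csv(students, "students-log.csv", na = "")
write_csv(students, "students-log.csv", append = TRUE)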

Now let’s read students.csv back in.

Note that the variable type information that you just set up is lost when you save to CSV because you’re starting over with reading from a plain text file again:

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <chr>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in.
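
For instance, to get meal_plan back as a factor you’d have to respecify it on every read; a minimal sketch:

read_csv("students-2.csv", col_types = cols(meal_plan = col_factor()))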

There are two main alternatives:

  1. write_rds() and read_rds() are uniform wrappers around the base functions saveRDS() and readRDS(). These store data in R’s custom binary format called RDS. This means that when you reload the object, you are loading the exact same R object that you stored.

    write_rds(students, "students.rds")
    read_rds("students.rds")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6
  2. The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in Chapter 22.

    library(arrow)
    write_parquet(students, "students.parquet")
    read_parquet("students.parquet")
    #> # A tibble: 6 × 5
    #>   student_id full_name          favourite_food     meal_plan               age
    #>        <dbl> <chr>              <chr>              <fct>                 <dbl>
    #> 1          1 Sunil Huffmann     Strawberry yoghurt Lunch only                4
    #> 2          2 Barclay Lynn       French fries       Lunch only                5
    #> 3          3 Jayendra Lyne      NA                 Breakfast and lunch       7
    #> 4          4 Leon Rossini       Anchovies          Lunch only               NA
    #> 5          5 Chidiegwu Dunkel   Pizza              Breakfast and lunch       5
    #> 6          6 Güvenç Attila      Ice cream          Lunch only                6

Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.

7.6 Data entry

Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script.

There are two useful functions to help you do this which differ in whether you lay out the tibble by columns or by rows.

tibble() works by column:

tibble(
  x = c(1, 2, 5), 
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
#> # A tibble: 3 × 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 h      0.08
#> 2     2 m      0.83
#> 3     5 g      0.6

Laying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row.

tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas.

This makes it possible to lay out small amounts of data in an easy to read form:

tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)
#> # A tibble: 3 × 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 h      0.08
#> 2     2 m      0.83
#> 3     5 g      0.6

7.7 Summary

In this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble().

You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them.

We’ll return to data import a few times in this book: Chapter 20 will show you how to load data from Excel and Google Sheets, Chapter 21 from databases, Chapter 22 from parquet files, Chapter 23 from JSON, and Chapter 24 from websites.

We’re just about at the end of this section of the book, but there’s one important last topic to cover: how to get help.

So in the next chapter, you’ll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.


  1. The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.

  2. You can override the default of 1,000 with the guess_max argument.