4 Workflow: code style

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer, it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else. This chapter will introduce the most important points of the tidyverse style guide, which is used throughout this book.
良好的编码风格就像正确的标点符号：没有它也能行，但它确实能让事情更容易阅读。即使你是一个编程新手，养成良好的代码风格也是一个好主意。使用一致的风格可以让他人（包括未来的你！）更容易阅读你的作品，在你需要他人帮助时尤其重要。本章将介绍 tidyverse 风格指南中最重要的几点，本书通篇都使用了该指南。

Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you’ve installed it with install.packages("styler"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command and many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts offered by styler. Figure 4.1 shows the results.
一开始，你可能会觉得代码风格化有点乏味，但如果你勤加练习，它很快就会成为你的第二天性。此外，还有一些很棒的工具可以快速重塑现有代码的风格，比如 Lorenz Walthert 开发的 styler 包。通过 install.packages("styler") 安装后，一个简便的使用方法是通过 RStudio 的 命令面板 (command palette)。命令面板让你能使用任何 RStudio 内置命令以及包提供的许多插件。按 Cmd/Ctrl + Shift + P 打开面板，然后输入 “styler” 就可以看到 styler 提供的所有快捷方式。Figure 4.1 展示了结果。

A screenshot showing the command palette after typing "styler", showing the four styling tool provided by the package. — Figure 4.1: RStudio’s command palette makes it easy to access every RStudio command using only the keyboard.

We’ll use the tidyverse and nycflights13 packages for code examples in this chapter.
在本章的代码示例中，我们将使用 tidyverse 和 nycflights13 包。

library(tidyverse)
library(nycflights13)

4.1 Names

We talked briefly about names in Section 2.3. Remember that variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.
我们在 Section 2.3 中简要讨论了命名。请记住，变量名（由 <- 创建的和由 mutate() 创建的）应仅使用小写字母、数字和 _。使用 _ 来分隔名称中的单词。

# Strive for:
short_flights <- flights |> filter(air_time < 60)

# Avoid:
SHORTFLIGHTS <- flights |> filter(air_time < 60)

As a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.
根据经验，最好选择易于理解的长描述性名称，而不是输入快的简洁名称。短名称在编写代码时节省的时间相对较少（特别是因为自动补全会帮你完成输入），但当你回头看旧代码并被迫琢磨一个神秘的缩写时，可能会非常耗时。

If you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.
如果你有一堆相关事物的名称，请尽量保持一致。当你忘记之前的约定而导致不一致时，这是很容易发生的，所以如果你需要回去重命名，也不必感到难过。总的来说，如果你有一组主题相近的变量，最好给它们一个共同的前缀而不是后缀，因为自动补全在变量开头效果最好。

4.2 Spaces

Put spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).
在数学运算符（除了 ^，即 +、-、==、< 等）的两侧以及赋值运算符 (<-) 的周围加上空格。

# Strive for
z <- (a + b)^2 / d

# Avoid
z<-( a + b ) ^ 2/d

Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.
在常规函数调用的括号内外不要加空格。在逗号后面一定要加一个空格，就像标准英语一样。

# Strive for
mean(x, na.rm = TRUE)

# Avoid
mean (x ,na.rm=TRUE)

It’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up.¹ This makes it easier to skim the code.
如果能改善对齐，可以添加额外的空格。例如，当你在 mutate() 中创建多个变量时，你可能想添加空格以使所有的 = 对齐。¹ 这能让代码更容易浏览。

flights |> 
  mutate(
    speed      = distance / air_time,
    dep_hour   = dep_time %/% 100,
    dep_minute = dep_time %%  100
  )

4.3 Pipes

|> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.
|> 前面应该总有一个空格，并且通常应该是一行的最后一个元素。这使得添加新步骤、重新排列现有步骤、修改步骤中的元素以及通过浏览左侧的动词来获得宏观视角变得更加容易。

# Strive for 
flights |>  
  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
  count(dest)

# Avoid
flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)

If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.
如果你正在管道连接的函数有命名参数（如 mutate() 或 summarize()），请将每个参数放在新的一行。如果函数没有命名参数（如 select() 或 filter()），则将所有内容保持在一行，除非一行写不下，这种情况下你应该将每个参数放在单独的一行。

# Strive for
flights |>  
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid
flights |>
  group_by(
    tailnum
  ) |> 
  summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())

After the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |> . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.
在管道的第一步之后，每行缩进两个空格。在 |> 后面换行时，RStudio 会自动为你添加空格。如果你将每个参数放在单独的一行，则额外缩进两个空格。确保 ) 单独占一行，并且不缩进，以与函数名的水平位置对齐。

# Strive for 
flights |>  
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid
flights|>
  group_by(tailnum) |> 
  summarize(
             delay = mean(arr_delay, na.rm = TRUE), 
             n = n()
           )

# Avoid
flights|>
  group_by(tailnum) |> 
  summarize(
  delay = mean(arr_delay, na.rm = TRUE), 
  n = n()
  )

It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.
如果你的管道可以轻松地放在一行上，可以不遵守这些规则。但根据我们的集体经验，短代码片段通常会变得越来越长，所以从一开始就使用所需的垂直空间，从长远来看通常会节省时间。

# This fits compactly on one line
df |> mutate(y = x + 1)

# While this takes up 4x as many lines, it's easily extended to 
# more variables and more steps in the future
df |> 
  mutate(
    y = x + 1
  )

Finally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and makes it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should give it an informative name, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing. Don’t expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.
最后，要警惕编写过长的管道，比如超过 10-15 行。试着将它们分解成更小的子任务，并给每个任务一个信息丰富的名称。这些名称将有助于提示读者正在发生什么，并使其更容易检查中间结果是否符合预期。只要你能给某个东西一个信息丰富的名称，你就应该这样做，例如，当你从根本上改变数据结构时（例如，在数据透视或汇总之后）。不要期望第一次就做对！这意味着如果存在可以获得好名称的中间状态，就应该拆分长管道。

4.4 ggplot2

The same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>.
适用于管道的基本规则同样适用于 ggplot2；只需将 + 视为 |> 即可。

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE)
  ) |> 
  ggplot(aes(x = month, y = delay)) +
  geom_point() + 
  geom_line()

Again, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line:
同样，如果一个函数的所有参数无法放在一行内，就将每个参数单独放在一行：

flights |> 
  group_by(dest) |> 
  summarize(
    distance = mean(distance),
    speed = mean(distance / air_time, na.rm = TRUE)
  ) |> 
  ggplot(aes(x = distance, y = speed)) +
  geom_smooth(
    method = "loess",
    span = 0.5,
    se = FALSE, 
    color = "white", 
    linewidth = 4
  ) +
  geom_point()

Watch for the transition from |> to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was written before the pipe was discovered.
注意从 |> 到 + 的转换。我们希望这种转换是不必要的，但遗憾的是，ggplot2 是在管道被发现之前编写的。

4.5 Sectioning comments

As your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:
当你的脚本越来越长时，你可以使用分节注释将文件分解成易于管理的部分：

# Load data --------------------------------------

# Plot data --------------------------------------

RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure 4.2.
RStudio 提供了创建这些标题的键盘快捷键 (Cmd/Ctrl + Shift + R)，并会在编辑器左下角的代码导航下拉菜单中显示它们，如 Figure 4.2 所示。

Figure 4.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.

4.6 Exercises

Restyle the following pipelines following the guidelines above.

flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)

flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)

4.7 Summary

In this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.
在本章中，你学习了代码风格最重要的原则。一开始，这些可能感觉像是一套武断的规则（因为它们确实是！），但随着时间的推移，当你编写更多代码，并与更多人共享代码时，你会看到一致的风格是多么重要。别忘了 styler 包：它是一个快速提高差风格代码质量的好方法。

In the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.
在下一章中，我们将回到数据科学工具，学习整洁数据 (tidy data)。整洁数据是组织数据框的一种一致方式，在整个 tidyverse 中都在使用。这种一致性使你的工作更轻松，因为一旦你拥有了整洁数据，它就能与绝大多数 tidyverse 函数协同工作。当然，生活从不轻松，你在野外遇到的大多数数据集都不会是整洁的。因此，我们还将教你如何使用 tidyr 包来整理你的非整洁数据。

Since dep_time is in HMM or HHMM format, we use integer division (%/%) to get hour and remainder (also known as modulo, %%) to get minute.↩︎