A beginner’s guide to data wrangling in R
Noah Weidig
February 15, 2025
If you’ve ever worked with messy data in R, you know how painful base R subsetting can be. The dplyr package makes data wrangling intuitive and readable. In this tutorial, we’ll walk through the six core dplyr verbs using the Palmer Penguins dataset.
Let’s load the packages we need.
# Uncomment and run these once
#install.packages("tidyverse")
#install.packages("palmerpenguins")
library(tidyverse)
library(palmerpenguins)
Let’s take a look at the data.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We have 344 penguins with 8 variables: species, island, bill measurements, flipper length, body mass, sex, and year.
Before we dive into the verbs, let’s talk about the pipe: |> (the native pipe, available since R 4.1) or %>% (from the magrittr package, available once the tidyverse is loaded). The pipe takes the output of one function and passes it as the first argument to the next. This lets us chain operations together in a readable way.
[1] 344
[1] 344
You’ll see the pipe used throughout this tutorial. Think of it as saying “and then…”
filter() keeps rows that match a condition. Let’s grab only the Adelie penguins.
# A tibble: 152 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can combine multiple conditions. Let’s find Adelie penguins that weigh more than 4000 grams.
# A tibble: 35 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.2 19.6 195 4675
2 Adelie Torgersen 42 20.2 190 4250
3 Adelie Torgersen 34.6 21.1 198 4400
4 Adelie Torgersen 42.5 20.7 197 4500
5 Adelie Torgersen 46 21.5 194 4200
6 Adelie Dream 39.2 21.1 196 4150
7 Adelie Dream 39.8 19.1 184 4650
8 Adelie Dream 44.1 19.7 196 4400
9 Adelie Dream 39.6 18.8 190 4600
10 Adelie Dream 42.3 21.2 191 4150
# ℹ 25 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Use | (or) for either/or conditions.
# A tibble: 292 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 37.8 18.3 174 3400
2 Adelie Biscoe 37.7 18.7 180 3600
3 Adelie Biscoe 35.9 19.2 189 3800
4 Adelie Biscoe 38.2 18.1 185 3950
5 Adelie Biscoe 38.8 17.2 180 3800
6 Adelie Biscoe 35.3 18.9 187 3800
7 Adelie Biscoe 40.6 18.6 183 3550
8 Adelie Biscoe 40.5 17.9 187 3200
9 Adelie Biscoe 37.9 18.6 172 3150
10 Adelie Biscoe 40.5 18.9 180 3950
# ℹ 282 more rows
# ℹ 2 more variables: sex <fct>, year <int>
A cleaner way to filter for multiple values is %in%.
# A tibble: 292 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 37.8 18.3 174 3400
2 Adelie Biscoe 37.7 18.7 180 3600
3 Adelie Biscoe 35.9 19.2 189 3800
4 Adelie Biscoe 38.2 18.1 185 3950
5 Adelie Biscoe 38.8 17.2 180 3800
6 Adelie Biscoe 35.3 18.9 187 3800
7 Adelie Biscoe 40.6 18.6 183 3550
8 Adelie Biscoe 40.5 17.9 187 3200
9 Adelie Biscoe 37.9 18.6 172 3150
10 Adelie Biscoe 40.5 18.9 180 3950
# ℹ 282 more rows
# ℹ 2 more variables: sex <fct>, year <int>
select() picks specific columns. This is useful when you only need a few variables from a wide dataset.
# A tibble: 344 × 3
species island body_mass_g
<fct> <fct> <int>
1 Adelie Torgersen 3750
2 Adelie Torgersen 3800
3 Adelie Torgersen 3250
4 Adelie Torgersen NA
5 Adelie Torgersen 3450
6 Adelie Torgersen 3650
7 Adelie Torgersen 3625
8 Adelie Torgersen 4675
9 Adelie Torgersen 3475
10 Adelie Torgersen 4250
# ℹ 334 more rows
You can also remove columns with a minus sign.
# A tibble: 344 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 1 more variable: sex <fct>
There are handy helper functions too. For example, starts_with() grabs every column whose name starts with a given prefix.
# A tibble: 344 × 3
species bill_length_mm bill_depth_mm
<fct> <dbl> <dbl>
1 Adelie 39.1 18.7
2 Adelie 39.5 17.4
3 Adelie 40.3 18
4 Adelie NA NA
5 Adelie 36.7 19.3
6 Adelie 39.3 20.6
7 Adelie 38.9 17.8
8 Adelie 39.2 19.6
9 Adelie 34.1 18.1
10 Adelie 42 20.2
# ℹ 334 more rows
mutate() creates new columns or modifies existing ones. Let’s convert body mass from grams to kilograms.
# A tibble: 344 × 3
species body_mass_g body_mass_kg
<fct> <int> <dbl>
1 Adelie 3750 3.75
2 Adelie 3800 3.8
3 Adelie 3250 3.25
4 Adelie NA NA
5 Adelie 3450 3.45
6 Adelie 3650 3.65
7 Adelie 3625 3.62
8 Adelie 4675 4.68
9 Adelie 3475 3.48
10 Adelie 4250 4.25
# ℹ 334 more rows
You can create multiple columns at once.
# A tibble: 344 × 3
species body_mass_kg bill_ratio
<fct> <dbl> <dbl>
1 Adelie 3.75 2.09
2 Adelie 3.8 2.27
3 Adelie 3.25 2.24
4 Adelie NA NA
5 Adelie 3.45 1.90
6 Adelie 3.65 1.91
7 Adelie 3.62 2.19
8 Adelie 4.68 2
9 Adelie 3.48 1.88
10 Adelie 4.25 2.08
# ℹ 334 more rows
arrange() sorts rows. By default, it sorts in ascending order.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 46.9 16.6 192 2700
2 Adelie Biscoe 36.5 16.6 181 2850
3 Adelie Biscoe 36.4 17.1 184 2850
4 Adelie Biscoe 34.5 18.1 187 2900
5 Adelie Dream 33.1 16.1 178 2900
6 Adelie Torgers… 38.6 17 188 2900
7 Chinstrap Dream 43.2 16.6 187 2900
8 Adelie Biscoe 37.9 18.6 193 2925
9 Adelie Dream 37.5 18.9 179 2975
10 Adelie Dream 37 16.9 185 3000
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Use desc() for descending order.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 49.2 15.2 221 6300
2 Gentoo Biscoe 59.6 17 230 6050
3 Gentoo Biscoe 51.1 16.3 220 6000
4 Gentoo Biscoe 48.8 16.2 222 6000
5 Gentoo Biscoe 45.2 16.4 223 5950
6 Gentoo Biscoe 49.8 15.9 229 5950
7 Gentoo Biscoe 48.4 14.6 213 5850
8 Gentoo Biscoe 49.3 15.7 217 5850
9 Gentoo Biscoe 55.1 16 230 5850
10 Gentoo Biscoe 49.5 16.2 229 5800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can sort by multiple columns. This sorts by species first, then from heaviest to lightest within each species.
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 43.2 19 197 4775
2 Adelie Biscoe 41 20 203 4725
3 Adelie Torgersen 42.9 17.6 196 4700
4 Adelie Torgersen 39.2 19.6 195 4675
5 Adelie Dream 39.8 19.1 184 4650
6 Adelie Dream 39.6 18.8 190 4600
7 Adelie Biscoe 45.6 20.3 191 4600
8 Adelie Torgersen 42.5 20.7 197 4500
9 Adelie Dream 37.5 18.5 199 4475
10 Adelie Torgersen 41.8 19.4 198 4450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
summarize() (or summarise()) collapses the data into a summary. Let’s find the average body mass.
# A tibble: 1 × 1
mean_mass
<dbl>
1 4202.
The na.rm = TRUE argument tells R to ignore missing values. You can compute multiple summaries at once.
group_by() is where dplyr really shines. It splits the data into groups so that subsequent operations are performed per group. Pair it with summarize() for powerful grouped summaries.
# A tibble: 3 × 4
species mean_mass sd_mass n
<fct> <dbl> <dbl> <int>
1 Adelie 3701. 459. 152
2 Chinstrap 3733. 384. 68
3 Gentoo 5076. 504. 124
You can group by multiple variables.
# A tibble: 5 × 4
species island mean_mass n
<fct> <fct> <dbl> <int>
1 Adelie Biscoe 3710. 44
2 Adelie Dream 3688. 56
3 Adelie Torgersen 3706. 52
4 Chinstrap Dream 3733. 68
5 Gentoo Biscoe 5076. 124
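By default, summarize() drops only the last grouping variable, so without .groups = "drop" the result here would still be grouped by species. Setting .groups = "drop" returns an ungrouped tibble, which is good practice because later steps will not silently run per group.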
The real power of dplyr is chaining verbs together. Let’s find the average flipper length for each species, but only for female penguins weighing over 3500 grams, sorted from longest to shortest.
# A tibble: 3 × 3
species mean_flipper n
<fct> <dbl> <int>
1 Gentoo 213. 58
2 Chinstrap 192. 19
3 Adelie 190. 22
Each step is readable on its own, and the pipe makes the full pipeline easy to follow.
| Verb | What it does |
|---|---|
| filter() | Keep rows that match a condition |
| select() | Pick or remove columns |
| mutate() | Create or modify columns |
| arrange() | Sort rows |
| summarize() | Collapse data into summaries |
| group_by() | Group data for per-group operations |
And that’s it! These six verbs will cover the vast majority of your data wrangling needs. Thanks for reading — keep an eye out for more R tutorials!
---
title: "Basics of dplyr"
subtitle: "A beginner's guide to data wrangling in R"
execute:
warning: false
author: "Noah Weidig"
date: "2025-02-15"
categories: [code, dplyr, tidyverse]
image: "images/dplyr_wrangling.png"
description: "Learn the core dplyr verbs — filter, select, mutate, arrange, summarize, and group_by — using the Palmer Penguins dataset."
toc: true
toc-depth: 2
code-fold: show
---
If you've ever worked with messy data in R, you know how painful base R subsetting can be. The `dplyr` package makes data wrangling intuitive and readable. In this tutorial, we'll walk through the six core dplyr verbs using the Palmer Penguins dataset.
# Setup
Let's load the packages we need.
```{r}
# Uncomment and run these once
#install.packages("tidyverse")
#install.packages("palmerpenguins")
library(tidyverse)
library(palmerpenguins)
```
Let's take a look at the data.
```{r}
penguins
```
We have 344 penguins with 8 variables: species, island, bill measurements, flipper length, body mass, sex, and year.
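The default print only shows the columns that fit on screen, so `sex` and `year` are pushed into the footer. As a small optional sketch, `dim()` confirms the row and column counts, and `glimpse()` from dplyr lists every column with its type.

```{r}
library(dplyr)
library(palmerpenguins)

# Number of rows and columns
dim(penguins)

# One line per column, including the ones hidden in the default print
glimpse(penguins)
```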
# The Pipe Operator
Before we dive into the verbs, let's talk about the pipe: `|>` (the native pipe, available since R 4.1) or `%>%` (from the magrittr package, available once the tidyverse is loaded). The pipe takes the output of one function and passes it as the first argument to the next. This lets us chain operations together in a readable way.
```{r}
# Without the pipe
nrow(penguins)
# With the pipe — same result
penguins |> nrow()
```
You'll see the pipe used throughout this tutorial. Think of it as saying "and then..."
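The benefit is easier to see with more than one step. Here is a minimal sketch (the variable names `nested` and `piped` are just for illustration) that computes the rounded average body mass twice: once with nested calls, which read inside-out, and once with the pipe, which reads left to right.

```{r}
library(palmerpenguins)

# Nested calls: start reading from the innermost function
nested <- round(mean(penguins$body_mass_g, na.rm = TRUE))

# Piped: take the column, and then average it, and then round it
piped <- penguins$body_mass_g |>
  mean(na.rm = TRUE) |>
  round()

# Both approaches give the same number
identical(nested, piped)
```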
# filter()
`filter()` keeps rows that match a condition. Let's grab only the Adelie penguins.
```{r}
penguins |>
filter(species == "Adelie")
```
You can combine multiple conditions. Let's find Adelie penguins that weigh more than 4000 grams.
```{r}
penguins |>
filter(species == "Adelie", body_mass_g > 4000)
```
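Separating conditions with a comma means every condition must hold, which is the same as joining them with `&`. A quick sketch to check this (the object names `with_comma` and `with_and` are only for illustration):

```{r}
library(dplyr)
library(palmerpenguins)

with_comma <- penguins |>
  filter(species == "Adelie", body_mass_g > 4000)

with_and <- penguins |>
  filter(species == "Adelie" & body_mass_g > 4000)

# Both spellings keep the same 35 rows
nrow(with_comma)
identical(with_comma, with_and)
```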
Use `|` (or) for either/or conditions.
```{r}
penguins |>
filter(island == "Biscoe" | island == "Dream")
```
A cleaner way to filter for multiple values is `%in%`.
```{r}
penguins |>
filter(island %in% c("Biscoe", "Dream"))
```
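The `|` and `%in%` versions should return the same rows. The sketch below checks that, and also shows why `==` with a vector is not a substitute for `%in%`: it compares element by element and recycles the vector, so it quietly keeps fewer rows. The object names are only for illustration.

```{r}
library(dplyr)
library(palmerpenguins)

with_or <- penguins |>
  filter(island == "Biscoe" | island == "Dream")

with_in <- penguins |>
  filter(island %in% c("Biscoe", "Dream"))

# Same rows either way
identical(with_or, with_in)

# Not equivalent: == pairs row 1 with "Biscoe", row 2 with "Dream", and so on
recycled <- penguins |>
  filter(island == c("Biscoe", "Dream"))

nrow(recycled) < nrow(with_in)
```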
# select()
`select()` picks specific columns. This is useful when you only need a few variables from a wide dataset.
```{r}
penguins |>
select(species, island, body_mass_g)
```
You can also remove columns with a minus sign.
```{r}
penguins |>
select(-year)
```
There are handy helper functions too. For example, `starts_with()` grabs every column whose name starts with a given prefix.
```{r}
penguins |>
select(species, starts_with("bill"))
```
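Two more helpers work the same way. As a short sketch, `ends_with()` matches on a suffix and `contains()` matches anywhere in the column name.

```{r}
library(dplyr)
library(palmerpenguins)

# Columns whose names end in "_mm": both bill measurements and flipper length
penguins |>
  select(species, ends_with("_mm"))

# Columns whose names contain "length": bill length and flipper length
penguins |>
  select(species, contains("length"))
```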
# mutate()
`mutate()` creates new columns or modifies existing ones. Let's convert body mass from grams to kilograms.
```{r}
penguins |>
mutate(body_mass_kg = body_mass_g / 1000) |>
select(species, body_mass_g, body_mass_kg)
```
You can create multiple columns at once.
```{r}
penguins |>
mutate(
body_mass_kg = body_mass_g / 1000,
bill_ratio = bill_length_mm / bill_depth_mm
) |>
select(species, body_mass_kg, bill_ratio)
```
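The examples above only add columns. Here is a small sketch of the other two things `mutate()` can do: overwrite an existing column by reusing its name, and use a column created earlier in the same call. The `body_mass_lb` column and the conversion to pounds are only for illustration.

```{r}
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    # Reusing an existing name overwrites that column (factor to character here)
    species = as.character(species),
    body_mass_kg = body_mass_g / 1000,
    # body_mass_kg was created one line up and is already available
    body_mass_lb = body_mass_kg * 2.20462
  ) |>
  select(species, body_mass_kg, body_mass_lb)
```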
# arrange()
`arrange()` sorts rows. By default, it sorts in ascending order.
```{r}
penguins |>
arrange(body_mass_g)
```
Use `desc()` for descending order.
```{r}
penguins |>
arrange(desc(body_mass_g))
```
You can sort by multiple columns. This sorts by species first, then from heaviest to lightest within each species.
```{r}
penguins |>
arrange(species, desc(body_mass_g))
```
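One detail worth knowing: `arrange()` always puts missing values at the end, whether you sort ascending or descending. A quick sketch that looks at the last rows of a descending sort:

```{r}
library(dplyr)
library(palmerpenguins)

# Even with desc(), penguins with no recorded body mass land in the last rows
penguins |>
  arrange(desc(body_mass_g)) |>
  select(species, island, body_mass_g) |>
  tail(3)
```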
# summarize()
`summarize()` (or `summarise()`) collapses the data into a summary. Let's find the average body mass.
```{r}
penguins |>
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
```
The `na.rm = TRUE` argument tells R to ignore missing values. You can compute multiple summaries at once.
```{r}
penguins |>
summarize(
mean_mass = mean(body_mass_g, na.rm = TRUE),
sd_mass = sd(body_mass_g, na.rm = TRUE),
n = n()
)
```
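To see why `na.rm = TRUE` matters, here is a small sketch that computes the mean with and without it. The column names are only for illustration. Without `na.rm = TRUE`, a single missing value is enough to make the whole result `NA`.

```{r}
library(dplyr)
library(palmerpenguins)

penguins |>
  summarize(
    mean_without_na_rm = mean(body_mass_g),            # NA: missing values propagate
    mean_with_na_rm = mean(body_mass_g, na.rm = TRUE), # uses recorded values only
    n_missing = sum(is.na(body_mass_g))                # how many values were skipped
  )
```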
# group_by()
`group_by()` is where dplyr really shines. It splits the data into groups so that subsequent operations are performed per group. Pair it with `summarize()` for powerful grouped summaries.
```{r}
penguins |>
group_by(species) |>
summarize(
mean_mass = mean(body_mass_g, na.rm = TRUE),
sd_mass = sd(body_mass_g, na.rm = TRUE),
n = n()
)
```
You can group by multiple variables.
```{r}
penguins |>
group_by(species, island) |>
summarize(
mean_mass = mean(body_mass_g, na.rm = TRUE),
n = n(),
.groups = "drop"
)
```
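By default, `summarize()` drops only the last grouping variable, so without `.groups = "drop"` the result here would still be grouped by `species`. Setting `.groups = "drop"` returns an ungrouped tibble, which is good practice because later steps will not silently run per group.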
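To see what `.groups = "drop"` changes, this small sketch compares the grouping left behind with and without it, using `group_vars()` from dplyr. The object names are only for illustration. Without the argument, `summarize()` also prints an informational message about the grouping it kept.

```{r}
library(dplyr)
library(palmerpenguins)

# Default behaviour: only the last grouping variable (island) is dropped
still_grouped <- penguins |>
  group_by(species, island) |>
  summarize(n = n())

group_vars(still_grouped)

# With .groups = "drop" the result has no grouping left
ungrouped <- penguins |>
  group_by(species, island) |>
  summarize(n = n(), .groups = "drop")

group_vars(ungrouped)
```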
# Putting It All Together
The real power of dplyr is chaining verbs together. Let's find the average flipper length for each species, but only for female penguins weighing over 3500 grams, sorted from longest to shortest.
```{r}
penguins |>
filter(sex == "female", body_mass_g > 3500) |>
group_by(species) |>
summarize(
mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
n = n()
) |>
arrange(desc(mean_flipper))
```
Each step is readable on its own, and the pipe makes the full pipeline easy to follow.
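For comparison, here is a sketch of the same analysis without the pipe, saving each step to an intermediate object (the names `heavy_females` and `by_species` are only for illustration). It gives the same result, but you have to name and keep track of every intermediate table.

```{r}
library(dplyr)
library(palmerpenguins)

# Step 1: keep females heavier than 3500 g (rows with missing sex or mass are dropped)
heavy_females <- filter(penguins, sex == "female", body_mass_g > 3500)

# Step 2: one summary row per species
by_species <- summarize(
  group_by(heavy_females, species),
  mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
  n = n()
)

# Step 3: longest flippers first
arrange(by_species, desc(mean_flipper))
```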
# Quick Reference
| Verb | What it does |
|------|-------------|
| `filter()` | Keep rows that match a condition |
| `select()` | Pick or remove columns |
| `mutate()` | Create or modify columns |
| `arrange()` | Sort rows |
| `summarize()` | Collapse data into summaries |
| `group_by()` | Group data for per-group operations |
And that's it! These six verbs will cover the vast majority of your data wrangling needs. Thanks for reading — keep an eye out for more R tutorials!