A beginner's guide to data wrangling in R

Learn the core dplyr verbs — filter, select, mutate, arrange, summarize, and group_by — using the Palmer Penguins dataset.
code
dplyr
tidyverse
Author

Noah Weidig

Published

February 15, 2025

Artwork by @allison_horst


If you’ve ever worked with messy data in R, you know how painful base R subsetting can be. The dplyr package makes data wrangling intuitive and readable. In this tutorial, we’ll walk through the six core dplyr verbs using the Palmer Penguins dataset.

Setup

Let’s load the packages we need.

Code
# Uncomment and run these once
#install.packages("tidyverse")
#install.packages("palmerpenguins")

library(tidyverse)
library(palmerpenguins)

Let’s take a look at the data.

Code
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We have 344 rows, one per penguin, and 8 variables: species, island, bill length and depth, flipper length, body mass, sex, and year. Some measurements are missing (NA), as in row 4.

The Pipe Operator

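Before we dive into the verbs, let’s talk about the pipe: |> (or %>%). The native pipe |> is built into R 4.1.0 and later; %>% comes from the magrittr package, which the tidyverse loads. Either one takes the value on its left and passes it as the first argument to the function on its right. This lets us chain operations together in the order they happen.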

Code
# Without the pipe
nrow(penguins)
[1] 344
Code
# With the pipe — same result
penguins |> nrow()
[1] 344

You’ll see the pipe used throughout this tutorial. Think of it as saying “and then…”
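
The “and then” reading is easier to see with more than one step. Here is a small sketch comparing nested calls with the piped equivalent; the names nested and piped are only for illustration.

```r
library(dplyr)           # also loaded by library(tidyverse)
library(palmerpenguins)

# Nested calls read inside-out: filter() runs first, then nrow()
nested <- nrow(filter(penguins, species == "Adelie"))

# Piped calls read top to bottom: take penguins, and then filter, and then count
piped <- penguins |>
  filter(species == "Adelie") |>
  nrow()

# Both approaches give the same row count
identical(nested, piped)
```

The two versions do the same work; the piped one simply lists the steps in the order they run.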

filter()

filter() keeps rows that match a condition. Let’s grab only the Adelie penguins.

Code
penguins |>
  filter(species == "Adelie")
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

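You can combine multiple conditions by separating them with commas. A row is kept only if it meets all of them, the same as joining the conditions with &. Let’s find Adelie penguins that weigh more than 4000 grams.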

Code
penguins |>
  filter(species == "Adelie", body_mass_g > 4000)
# A tibble: 35 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.2          19.6               195        4675
 2 Adelie  Torgersen           42            20.2               190        4250
 3 Adelie  Torgersen           34.6          21.1               198        4400
 4 Adelie  Torgersen           42.5          20.7               197        4500
 5 Adelie  Torgersen           46            21.5               194        4200
 6 Adelie  Dream               39.2          21.1               196        4150
 7 Adelie  Dream               39.8          19.1               184        4650
 8 Adelie  Dream               44.1          19.7               196        4400
 9 Adelie  Dream               39.6          18.8               190        4600
10 Adelie  Dream               42.3          21.2               191        4150
# ℹ 25 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Use | (or) for either/or conditions.

Code
penguins |>
  filter(island == "Biscoe" | island == "Dream")
# A tibble: 292 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           37.7          18.7               180        3600
 3 Adelie  Biscoe           35.9          19.2               189        3800
 4 Adelie  Biscoe           38.2          18.1               185        3950
 5 Adelie  Biscoe           38.8          17.2               180        3800
 6 Adelie  Biscoe           35.3          18.9               187        3800
 7 Adelie  Biscoe           40.6          18.6               183        3550
 8 Adelie  Biscoe           40.5          17.9               187        3200
 9 Adelie  Biscoe           37.9          18.6               172        3150
10 Adelie  Biscoe           40.5          18.9               180        3950
# ℹ 282 more rows
# ℹ 2 more variables: sex <fct>, year <int>

A cleaner way to filter for multiple values is %in%.

Code
penguins |>
  filter(island %in% c("Biscoe", "Dream"))
# A tibble: 292 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           37.8          18.3               174        3400
 2 Adelie  Biscoe           37.7          18.7               180        3600
 3 Adelie  Biscoe           35.9          19.2               189        3800
 4 Adelie  Biscoe           38.2          18.1               185        3950
 5 Adelie  Biscoe           38.8          17.2               180        3800
 6 Adelie  Biscoe           35.3          18.9               187        3800
 7 Adelie  Biscoe           40.6          18.6               183        3550
 8 Adelie  Biscoe           40.5          17.9               187        3200
 9 Adelie  Biscoe           37.9          18.6               172        3150
10 Adelie  Biscoe           40.5          18.9               180        3950
# ℹ 282 more rows
# ℹ 2 more variables: sex <fct>, year <int>
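
One thing to watch for: filter() only keeps rows where the condition is TRUE, so rows where the condition evaluates to NA are silently dropped. The sketch below illustrates this with body mass; the 4000 gram cutoff and the object names are arbitrary choices for the example.

```r
library(dplyr)
library(palmerpenguins)

# Split on an arbitrary 4000 g cutoff
heavy <- penguins |> filter(body_mass_g > 4000)
light <- penguins |> filter(body_mass_g <= 4000)

# Rows with a missing body mass land in neither group,
# so this total is smaller than nrow(penguins)
nrow(heavy) + nrow(light)

# is.na() picks those rows up explicitly
missing_mass <- penguins |> filter(is.na(body_mass_g))

# Now every row is accounted for
nrow(heavy) + nrow(light) + nrow(missing_mass) == nrow(penguins)
```

If you want to keep the missing rows alongside a comparison, one option is to add the is.na() check with |, for example filter(body_mass_g > 4000 | is.na(body_mass_g)).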

select()

select() picks specific columns. This is useful when you only need a few variables from a wide dataset.

Code
penguins |>
  select(species, island, body_mass_g)
# A tibble: 344 × 3
   species island    body_mass_g
   <fct>   <fct>           <int>
 1 Adelie  Torgersen        3750
 2 Adelie  Torgersen        3800
 3 Adelie  Torgersen        3250
 4 Adelie  Torgersen          NA
 5 Adelie  Torgersen        3450
 6 Adelie  Torgersen        3650
 7 Adelie  Torgersen        3625
 8 Adelie  Torgersen        4675
 9 Adelie  Torgersen        3475
10 Adelie  Torgersen        4250
# ℹ 334 more rows

You can also remove columns with a minus sign.

Code
penguins |>
  select(-year)
# A tibble: 344 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 1 more variable: sex <fct>

There are handy helper functions too. starts_with() grabs columns that start with a string.

Code
penguins |>
  select(species, starts_with("bill"))
# A tibble: 344 × 3
   species bill_length_mm bill_depth_mm
   <fct>            <dbl>         <dbl>
 1 Adelie            39.1          18.7
 2 Adelie            39.5          17.4
 3 Adelie            40.3          18  
 4 Adelie            NA            NA  
 5 Adelie            36.7          19.3
 6 Adelie            39.3          20.6
 7 Adelie            38.9          17.8
 8 Adelie            39.2          19.6
 9 Adelie            34.1          18.1
10 Adelie            42            20.2
# ℹ 334 more rows
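
A few other selection helpers work the same way. The sketch below tries ends_with(), contains(), and where(), which selects columns by type instead of by name. The object names are only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Columns whose names end in "_mm"
mm_cols <- penguins |> select(ends_with("_mm"))

# Columns whose names contain "length"
length_cols <- penguins |> select(contains("length"))

# Columns chosen by type: integer and double columns both count as numeric
numeric_cols <- penguins |> select(where(is.numeric))

names(mm_cols)       # bill_length_mm, bill_depth_mm, flipper_length_mm
names(length_cols)   # bill_length_mm, flipper_length_mm
names(numeric_cols)  # the three _mm columns, body_mass_g, and year
```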

mutate()

mutate() creates new columns or modifies existing ones. Let’s convert body mass from grams to kilograms.

Code
penguins |>
  mutate(body_mass_kg = body_mass_g / 1000) |>
  select(species, body_mass_g, body_mass_kg)
# A tibble: 344 × 3
   species body_mass_g body_mass_kg
   <fct>         <int>        <dbl>
 1 Adelie         3750         3.75
 2 Adelie         3800         3.8 
 3 Adelie         3250         3.25
 4 Adelie           NA        NA   
 5 Adelie         3450         3.45
 6 Adelie         3650         3.65
 7 Adelie         3625         3.62
 8 Adelie         4675         4.68
 9 Adelie         3475         3.48
10 Adelie         4250         4.25
# ℹ 334 more rows

You can create multiple columns at once.

Code
penguins |>
  mutate(
    body_mass_kg = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm
  ) |>
  select(species, body_mass_kg, bill_ratio)
# A tibble: 344 × 3
   species body_mass_kg bill_ratio
   <fct>          <dbl>      <dbl>
 1 Adelie          3.75       2.09
 2 Adelie          3.8        2.27
 3 Adelie          3.25       2.24
 4 Adelie         NA         NA   
 5 Adelie          3.45       1.90
 6 Adelie          3.65       1.91
 7 Adelie          3.62       2.19
 8 Adelie          4.68       2   
 9 Adelie          3.48       1.88
10 Adelie          4.25       2.08
# ℹ 334 more rows
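
Columns created in a mutate() call can be used by later expressions in the same call, and new columns do not have to be numeric. As a rough sketch, the example below labels each penguin with if_else(); the 4 kg cutoff and the size_class column are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

sized <- penguins |>
  mutate(
    body_mass_kg = body_mass_g / 1000,
    # body_mass_kg was defined one line up and is already usable here.
    # The 4 kg cutoff is arbitrary, chosen only for this example.
    size_class = if_else(body_mass_kg > 4, "large", "small")
  ) |>
  select(species, body_mass_kg, size_class)

sized
```

Penguins with a missing body mass get NA for size_class too, since if_else() returns NA when its condition is NA.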

arrange()

arrange() sorts rows. By default, it sorts in ascending order.

Code
penguins |>
  arrange(body_mass_g)
# A tibble: 344 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream              46.9          16.6               192        2700
 2 Adelie    Biscoe             36.5          16.6               181        2850
 3 Adelie    Biscoe             36.4          17.1               184        2850
 4 Adelie    Biscoe             34.5          18.1               187        2900
 5 Adelie    Dream              33.1          16.1               178        2900
 6 Adelie    Torgers…           38.6          17                 188        2900
 7 Chinstrap Dream              43.2          16.6               187        2900
 8 Adelie    Biscoe             37.9          18.6               193        2925
 9 Adelie    Dream              37.5          18.9               179        2975
10 Adelie    Dream              37            16.9               185        3000
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Use desc() for descending order.

Code
penguins |>
  arrange(desc(body_mass_g))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           49.2          15.2               221        6300
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           51.1          16.3               220        6000
 4 Gentoo  Biscoe           48.8          16.2               222        6000
 5 Gentoo  Biscoe           45.2          16.4               223        5950
 6 Gentoo  Biscoe           49.8          15.9               229        5950
 7 Gentoo  Biscoe           48.4          14.6               213        5850
 8 Gentoo  Biscoe           49.3          15.7               217        5850
 9 Gentoo  Biscoe           55.1          16                 230        5850
10 Gentoo  Biscoe           49.5          16.2               229        5800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can sort by multiple columns. This sorts by species first, then by body mass from heaviest to lightest within each species.

Code
penguins |>
  arrange(species, desc(body_mass_g))
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe              43.2          19                 197        4775
 2 Adelie  Biscoe              41            20                 203        4725
 3 Adelie  Torgersen           42.9          17.6               196        4700
 4 Adelie  Torgersen           39.2          19.6               195        4675
 5 Adelie  Dream               39.8          19.1               184        4650
 6 Adelie  Dream               39.6          18.8               190        4600
 7 Adelie  Biscoe              45.6          20.3               191        4600
 8 Adelie  Torgersen           42.5          20.7               197        4500
 9 Adelie  Dream               37.5          18.5               199        4475
10 Adelie  Torgersen           41.8          19.4               198        4450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
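
Missing values are worth a quick check when sorting. arrange() places NA at the end whether the sort is ascending or descending, which the following sketch confirms by looking at the last value of each sorted result.

```r
library(dplyr)
library(palmerpenguins)

sorted_asc <- penguins |> arrange(body_mass_g)
sorted_desc <- penguins |> arrange(desc(body_mass_g))

# The last value is NA in both cases
tail(sorted_asc$body_mass_g, 1)
tail(sorted_desc$body_mass_g, 1)
```

This is why the heaviest penguins, not the missing ones, appear at the top of the desc() output.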

summarize()

summarize() (or summarise()) collapses the data into a summary. Let’s find the average body mass.

Code
penguins |>
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 1 × 1
  mean_mass
      <dbl>
1     4202.

The na.rm = TRUE argument tells R to ignore missing values. Without it, mean() and sd() return NA whenever the column contains a missing value. You can compute multiple summaries at once.

Code
penguins |>
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    n = n()
  )
# A tibble: 1 × 3
  mean_mass sd_mass     n
      <dbl>   <dbl> <int>
1     4202.    802.   344
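
Note that n() counts rows, not non-missing values, so the 344 above includes penguins with no recorded body mass. A minimal sketch of how you might count the measured values separately follows; n_measured and mass_summary are names chosen for this example.

```r
library(dplyr)
library(palmerpenguins)

mass_summary <- penguins |>
  summarize(
    n_rows = n(),                           # every row, missing or not
    n_measured = sum(!is.na(body_mass_g)),  # rows with a recorded body mass
    median_mass = median(body_mass_g, na.rm = TRUE)
  )

mass_summary
```

Reporting both counts makes it clear how many observations each statistic is actually based on.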

group_by()

group_by() is where dplyr really shines. It splits the data into groups so that subsequent operations are performed per group. Pair it with summarize() for powerful grouped summaries.

Code
penguins |>
  group_by(species) |>
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    n = n()
  )
# A tibble: 3 × 4
  species   mean_mass sd_mass     n
  <fct>         <dbl>   <dbl> <int>
1 Adelie        3701.    459.   152
2 Chinstrap     3733.    384.    68
3 Gentoo        5076.    504.   124
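
Grouping affects other verbs as well, not just summarize(). As a sketch, a grouped mutate() computes its values per group while keeping every row; here each penguin’s body mass is compared with its own species’ average. The column name mass_vs_species_mean is made up for this example.

```r
library(dplyr)
library(palmerpenguins)

centered <- penguins |>
  group_by(species) |>
  # mean() is evaluated separately within each species
  mutate(mass_vs_species_mean = body_mass_g - mean(body_mass_g, na.rm = TRUE)) |>
  ungroup() |>
  select(species, body_mass_g, mass_vs_species_mean)

centered
```

The ungroup() call at the end returns a plain tibble so that later steps are not accidentally computed per species.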

You can group by multiple variables.

Code
penguins |>
  group_by(species, island) |>
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
# A tibble: 5 × 4
  species   island    mean_mass     n
  <fct>     <fct>         <dbl> <int>
1 Adelie    Biscoe        3710.    44
2 Adelie    Dream         3688.    56
3 Adelie    Torgersen     3706.    52
4 Chinstrap Dream         3733.    68
5 Gentoo    Biscoe        5076.   124

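The .groups = "drop" argument returns an ungrouped tibble. By default, summarize() drops only the last grouping variable (island here), so the result would still be grouped by species and later verbs would keep operating per species. Dropping the groups explicitly is good practice.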
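
To see what grouping is left after summarize(), you can inspect the result with group_vars(). The sketch below compares the default behavior with .groups = "drop"; the object names are only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Default: only the last grouping variable (island) is dropped,
# and dplyr prints a message about the remaining grouping
still_grouped <- penguins |>
  group_by(species, island) |>
  summarize(n = n())

# Explicitly drop all grouping
dropped <- penguins |>
  group_by(species, island) |>
  summarize(n = n(), .groups = "drop")

group_vars(still_grouped)  # "species"
group_vars(dropped)        # character(0)
```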

Putting It All Together

The real power of dplyr is chaining verbs together. Let’s find the average flipper length for each species, but only for female penguins weighing over 3500 grams, sorted from longest to shortest.

Code
penguins |>
  filter(sex == "female", body_mass_g > 3500) |>
  group_by(species) |>
  summarize(
    mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(mean_flipper))
# A tibble: 3 × 3
  species   mean_flipper     n
  <fct>            <dbl> <int>
1 Gentoo            213.    58
2 Chinstrap         192.    19
3 Adelie            190.    22

Each step is readable on its own, and the pipe makes the full pipeline easy to follow.

Quick Reference

Verb          What it does
filter()      Keep rows that match a condition
select()      Pick or remove columns
mutate()      Create or modify columns
arrange()     Sort rows
summarize()   Collapse data into summaries
group_by()    Group data for per-group operations

And that’s it! These six verbs will cover the vast majority of your data wrangling needs. Thanks for reading — keep an eye out for more R tutorials!