Mastering Joins in dplyr

Understanding left_join, inner_join, full_join, and common pitfalls

Learn how to combine datasets with dplyr joins — left_join, inner_join, full_join — and how to diagnose duplicate rows, missing keys, and NA surprises.
code
dplyr
tidyverse
Author

Noah Weidig

Published

February 25, 2025

Combining two datasets sounds simple — until you get back twice as many rows as you expected, or half your data vanishes. Joins are one of the most useful operations in data analysis, but they’re also one of the easiest places to introduce silent bugs. This tutorial will walk you through the main dplyr join functions, show you exactly what each one does, and teach you how to spot problems before they ruin your analysis.

Introduction

A join merges two tables together based on one or more shared columns (called keys). If you’ve ever used VLOOKUP in Excel or merge() in base R, you’ve already done something like a join.

Why do joins trip people up?

  • You assume every key in one table has a match in the other — often it doesn’t.
  • You don’t realize a key appears more than once, causing row multiplication.
  • Missing matches silently introduce NA values that propagate through your analysis.

Joins are everywhere in real work. You might link survey responses to participant demographics, attach weather data to field observations, or connect sales records to product details. Getting them right is essential.

Setup

Code
# Uncomment and run once if needed
# install.packages("tidyverse")

library(tidyverse)

Example Data

We’ll use two small tibbles throughout this tutorial. Think of them as a field study: one table records which bird species were observed at each site, and the other records habitat information for each site.

Code
# Bird observations at field sites
observations <- tibble(
  site_id = c("A", "B", "C", "D"),
  species = c("Warbler", "Sparrow", "Hawk", "Finch"),
  count   = c(12, 7, 3, 15)
)

observations
# A tibble: 4 x 3
  site_id species count
  <chr>   <chr>   <dbl>
1 A       Warbler    12
2 B       Sparrow     7
3 C       Hawk        3
4 D       Finch      15
Code
# Habitat data for field sites
habitat <- tibble(
  site_id     = c("A", "B", "C", "E"),
  habitat_type = c("Forest", "Grassland", "Wetland", "Desert"),
  elevation_m  = c(450, 120, 30, 900)
)

habitat
# A tibble: 4 x 3
  site_id habitat_type elevation_m
  <chr>   <chr>              <dbl>
1 A       Forest               450
2 B       Grassland            120
3 C       Wetland               30
4 E       Desert               900

Notice the mismatch: observations has site D but not E, while habitat has site E but not D. This is realistic — data sources rarely line up perfectly.

left_join()

left_join() keeps every row from the left table and attaches matching columns from the right table. If there’s no match, the right-side columns are filled with NA.

Think of it this way: you start with your primary dataset and add information from a second source.

Code
# Keep all observations, attach habitat info where available
observations |>
  left_join(habitat, by = "site_id")
# A tibble: 4 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30
4 D       Finch      15 <NA>                  NA

What happened:

  • Sites A, B, and C matched, so their habitat data was attached.
  • Site D had no match in habitat, so habitat_type and elevation_m are NA.
  • Site E (only in habitat) does not appear — left_join() only preserves left-table rows.

This is the most common join in day-to-day analysis. Use it when your left table is the “main” dataset and you want to enrich it with extra columns.

inner_join()

inner_join() keeps only rows that have a match in both tables. No match, no row.

Code
# Keep only sites present in both tables
observations |>
  inner_join(habitat, by = "site_id")
# A tibble: 3 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30

What happened:

  • Only sites A, B, and C appear — they exist in both tables.
  • Site D (only in observations) and site E (only in habitat) are both dropped.
  • No NA values are introduced from the join itself.

Use inner_join() when you need complete data from both sources and are okay discarding non-matching rows. Be aware that you might silently lose data if your keys don’t overlap as much as you expect.

full_join()

full_join() keeps every row from both tables. Where there’s no match, missing values are filled with NA.

Code
# Keep everything from both tables
observations |>
  full_join(habitat, by = "site_id")
# A tibble: 5 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30
4 D       Finch      15 <NA>                  NA
5 E       <NA>       NA Desert               900

What happened:

  • Sites A, B, and C matched fully.
  • Site D appears with NA for habitat columns.
  • Site E appears with NA for observation columns.
  • Nothing is lost — every row from both tables is preserved.

Use full_join() when you can’t afford to drop any data and want to see the complete picture of both datasets, gaps included.

Duplicate Row Explosions

This is the join pitfall that catches most people off guard. When a key appears multiple times in one or both tables, the join produces every combination of matching rows. This is called a many-to-many join, and it can silently multiply your row count.

Code
# Multiple observations per site
obs_multi <- tibble(
  site_id = c("A", "A", "B", "B", "B"),
  species = c("Warbler", "Robin", "Sparrow", "Jay", "Wren"),
  count   = c(12, 5, 7, 3, 9)
)

obs_multi
# A tibble: 5 x 3
  site_id species count
  <chr>   <chr>   <dbl>
1 A       Warbler    12
2 A       Robin       5
3 B       Sparrow     7
4 B       Jay         3
5 B       Wren        9
Code
# Multiple habitat records per site (e.g., sub-habitats)
hab_multi <- tibble(
  site_id     = c("A", "A", "B"),
  habitat_type = c("Forest-canopy", "Forest-understory", "Grassland"),
  elevation_m  = c(450, 445, 120)
)

hab_multi
# A tibble: 3 x 3
  site_id habitat_type      elevation_m
  <chr>   <chr>                   <dbl>
1 A       Forest-canopy             450
2 A       Forest-understory         445
3 B       Grassland                 120
Code
# Many-to-many join: row explosion
obs_multi |>
  left_join(hab_multi, by = "site_id", relationship = "many-to-many")
# A tibble: 7 x 5
  site_id species count habitat_type      elevation_m
  <chr>   <chr>   <dbl> <chr>                   <dbl>
1 A       Warbler    12 Forest-canopy             450
2 A       Warbler    12 Forest-understory         445
3 A       Robin       5 Forest-canopy             450
4 A       Robin       5 Forest-understory         445
5 B       Sparrow     7 Grassland                 120
6 B       Jay         3 Grassland                 120
7 B       Wren        9 Grassland                 120

Site A had 2 observation rows and 2 habitat rows, producing 2 x 2 = 4 rows. Site B had 3 observation rows and 1 habitat row, producing 3 x 1 = 3 rows. Our 5-row table became 7 rows.

In a real dataset with thousands of rows, this can balloon your data without any warning if you aren’t checking row counts.

How to detect duplicates before joining:

Code
# Count how many times each key appears
obs_multi |>
  count(site_id) |>
  filter(n > 1)
# A tibble: 2 x 2
  site_id     n
  <chr>   <int>
1 A           2
2 B           3
Code
hab_multi |>
  count(site_id) |>
  filter(n > 1)
# A tibble: 1 x 2
  site_id     n
  <chr>   <int>
1 A           2

If both tables show duplicates for the same key, you have a many-to-many situation. Decide whether to aggregate first (e.g., summarize to one row per key) or whether the duplication is intentional.

Diagnosing Join Problems

anti_join()

anti_join() returns rows from the left table that have no match in the right table. It’s the best way to find what gets lost in a join.

Code
# Which observations have no habitat data?
observations |>
  anti_join(habitat, by = "site_id")
# A tibble: 1 x 3
  site_id species count
  <chr>   <chr>   <dbl>
1 D       Finch      15
Code
# Which habitats have no observations?
habitat |>
  anti_join(observations, by = "site_id")
# A tibble: 1 x 3
  site_id habitat_type elevation_m
  <chr>   <chr>              <dbl>
1 E       Desert               900

Run anti_join() in both directions before doing your actual join. This tells you exactly what data is missing.

Counting keys

Always check whether your join key is unique before joining. If it isn’t, you may get unexpected row multiplication.

Code
# Check if site_id is unique in each table
observations |>
  count(site_id) |>
  filter(n > 1)
# A tibble: 0 x 2
# i 2 variables: site_id <chr>, n <int>

An empty result means every key is unique — safe for a one-to-one or one-to-many join.

Checking row counts before and after

A simple but effective diagnostic: compare nrow() before and after your join.

Code
# Before
nrow(observations)
[1] 4
Code
# After left_join
result <- observations |>
  left_join(habitat, by = "site_id")

nrow(result)
[1] 4

For a left_join() with unique keys on the right, the row count should stay the same. If it goes up, you likely have duplicate keys in the right table. If it goes down (shouldn’t happen with left_join()), something unusual is going on.

Handling Missing Values

Joins frequently introduce NA values — any time a key doesn’t match, the unmatched columns get NA. It’s important to handle them explicitly rather than letting them propagate through calculations.

Code
# left_join introduces NAs for site D
joined <- observations |>
  left_join(habitat, by = "site_id")

joined
# A tibble: 4 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30
4 D       Finch      15 <NA>                  NA
Code
# Replace NA habitat_type with "Unknown"
joined |>
  mutate(habitat_type = replace_na(habitat_type, "Unknown"))
# A tibble: 4 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30
4 D       Finch      15 Unknown               NA
Code
# Replace multiple columns at once using across()
joined |>
  mutate(
    habitat_type = replace_na(habitat_type, "Unknown"),
    elevation_m  = replace_na(elevation_m, 0)
  )
# A tibble: 4 x 5
  site_id species count habitat_type elevation_m
  <chr>   <chr>   <dbl> <chr>              <dbl>
1 A       Warbler    12 Forest               450
2 B       Sparrow     7 Grassland            120
3 C       Hawk        3 Wetland               30
4 D       Finch      15 Unknown                0

Choosing what to replace NA with depends on your analysis. Sometimes "Unknown" makes sense; sometimes 0 is appropriate; sometimes you should filter out the NA rows entirely. The key is making a deliberate choice rather than ignoring the NA values.

Best Practices Checklist

Before and after every join, run through this list:

  • Inspect your keys. Use count() to check for duplicates in your join columns. Unique keys prevent row explosions.
  • Run anti_join() both ways. Know what doesn’t match before you join so you aren’t surprised by missing data.
  • Compare row counts. Check nrow() before and after. If the count changes unexpectedly, investigate.
  • Use explicit by arguments. Always specify by = "key_column" instead of relying on dplyr to guess. This makes your code readable and avoids surprises when column names change.
  • Handle NA values deliberately. After a left_join() or full_join(), decide what to do with introduced NA values — replace, filter, or flag them.
  • Choose the right join type. Use left_join() to enrich a primary dataset, inner_join() when you need complete cases from both sides, and full_join() when you can’t afford to lose any rows.
  • Aggregate before joining when possible. If you only need one summary row per key, use summarize() before the join to avoid many-to-many blowups.
  • Document your joins. A short comment explaining why you chose a particular join type helps future you (and your collaborators) understand the logic.

Conclusion

Joins are a fundamental skill for working with real data, and dplyr makes them straightforward — once you know the rules. Here’s what to remember:

  • left_join() keeps all left-side rows and fills gaps with NA.
  • inner_join() keeps only rows that match in both tables.
  • full_join() keeps everything from both sides.
  • Duplicate keys cause row multiplication — always check with count().
  • anti_join() is your best diagnostic tool for finding mismatches.
  • Handle NA values explicitly after every join.

The difference between a reliable analysis and a broken one often comes down to whether you checked your join. Build the habit of inspecting keys, comparing row counts, and running anti_join() — it will save you from subtle, hard-to-find bugs.