Joins

Database-Style Joins

We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

julia> using DataFrames

julia> names = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
2×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 20    │ John Doe │
│ 2   │ 40    │ Jane Doe │

julia> jobs = DataFrame(ID = [20, 40], Job = ["Lawyer", "Doctor"])
2×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 20    │ Lawyer │
│ 2   │ 40    │ Doctor │

We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the join function:

julia> join(names, jobs, on = :ID)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 20    │ John Doe │ Lawyer │
│ 2   │ 40    │ Jane Doe │ Doctor │

In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.

There are seven kinds of joins supported by the DataFrames package:

See the Wikipedia page on SQL joins for more information.

You can control the kind of join that join performs using the kind keyword argument:

julia> jobs = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
2×2 DataFrame
│ Row │ ID    │ Job       │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 20    │ Lawyer    │
│ 2   │ 60    │ Astronaut │

julia> join(names, jobs, on = :ID, kind = :inner)
1×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 20    │ John Doe │ Lawyer │

julia> join(names, jobs, on = :ID, kind = :left)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job     │
│     │ Int64 │ String   │ String⍰ │
├─────┼───────┼──────────┼─────────┤
│ 1   │ 20    │ John Doe │ Lawyer  │
│ 2   │ 40    │ Jane Doe │ missing │

julia> join(names, jobs, on = :ID, kind = :right)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job       │
│     │ Int64 │ String⍰  │ String    │
├─────┼───────┼──────────┼───────────┤
│ 1   │ 20    │ John Doe │ Lawyer    │
│ 2   │ 60    │ missing  │ Astronaut │

julia> join(names, jobs, on = :ID, kind = :outer)
3×3 DataFrame
│ Row │ ID    │ Name     │ Job       │
│     │ Int64 │ String⍰  │ String⍰   │
├─────┼───────┼──────────┼───────────┤
│ 1   │ 20    │ John Doe │ Lawyer    │
│ 2   │ 40    │ Jane Doe │ missing   │
│ 3   │ 60    │ missing  │ Astronaut │

julia> join(names, jobs, on = :ID, kind = :semi)
1×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 20    │ John Doe │

julia> join(names, jobs, on = :ID, kind = :anti)
1×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 40    │ Jane Doe │

Cross joins are the only kind of join that does not use a key:

julia> join(names, jobs, kind = :cross, makeunique = true)
4×4 DataFrame
│ Row │ ID    │ Name     │ ID_1  │ Job       │
│     │ Int64 │ String   │ Int64 │ String    │
├─────┼───────┼──────────┼───────┼───────────┤
│ 1   │ 20    │ John Doe │ 20    │ Lawyer    │
│ 2   │ 20    │ John Doe │ 60    │ Astronaut │
│ 3   │ 40    │ Jane Doe │ 20    │ Lawyer    │
│ 4   │ 40    │ Jane Doe │ 60    │ Astronaut │

In order to join data tables on keys which have different names, you must first rename them so that they match. This can be done using rename!:

julia> a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
2×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 20    │ John Doe │
│ 2   │ 40    │ Jane Doe │

julia> b = DataFrame(IDNew = [20, 40], Job = ["Lawyer", "Doctor"])
2×2 DataFrame
│ Row │ IDNew │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 20    │ Lawyer │
│ 2   │ 40    │ Doctor │

julia> rename!(b, :IDNew => :ID)
2×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 20    │ Lawyer │
│ 2   │ 40    │ Doctor │

julia> join(a, b, on = :ID, kind = :inner)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 20    │ John Doe │ Lawyer │
│ 2   │ 40    │ Jane Doe │ Doctor │

Or renaming multiple columns at a time:

julia> a = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
                     Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Category = [1, 2, 3, 4, 5])
5×3 DataFrame
│ Row │ City      │ Job    │ Category │
│     │ String    │ String │ Int64    │
├─────┼───────────┼────────┼──────────┤
│ 1   │ Amsterdam │ Lawyer │ 1        │
│ 2   │ London    │ Lawyer │ 2        │
│ 3   │ London    │ Lawyer │ 3        │
│ 4   │ New York  │ Doctor │ 4        │
│ 5   │ New York  │ Doctor │ 5        │

julia> b = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
                     Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Name = ["a", "b", "c", "d", "e"])
5×3 DataFrame
│ Row │ Location  │ Work   │ Name   │
│     │ String    │ String │ String │
├─────┼───────────┼────────┼────────┤
│ 1   │ Amsterdam │ Lawyer │ a      │
│ 2   │ London    │ Lawyer │ b      │
│ 3   │ London    │ Lawyer │ c      │
│ 4   │ New York  │ Doctor │ d      │
│ 5   │ New York  │ Doctor │ e      │

julia> rename!(b, :Location => :City, :Work => :Job)
5×3 DataFrame
│ Row │ City      │ Job    │ Name   │
│     │ String    │ String │ String │
├─────┼───────────┼────────┼────────┤
│ 1   │ Amsterdam │ Lawyer │ a      │
│ 2   │ London    │ Lawyer │ b      │
│ 3   │ London    │ Lawyer │ c      │
│ 4   │ New York  │ Doctor │ d      │
│ 5   │ New York  │ Doctor │ e      │

julia> join(a, b, on = [:City, :Job])
9×4 DataFrame
│ Row │ City      │ Job    │ Category │ Name   │
│     │ String    │ String │ Int64    │ String │
├─────┼───────────┼────────┼──────────┼────────┤
│ 1   │ Amsterdam │ Lawyer │ 1        │ a      │
│ 2   │ London    │ Lawyer │ 2        │ b      │
│ 3   │ London    │ Lawyer │ 2        │ c      │
│ 4   │ London    │ Lawyer │ 3        │ b      │
│ 5   │ London    │ Lawyer │ 3        │ c      │
│ 6   │ New York  │ Doctor │ 4        │ d      │
│ 7   │ New York  │ Doctor │ 4        │ e      │
│ 8   │ New York  │ Doctor │ 5        │ d      │
│ 9   │ New York  │ Doctor │ 5        │ e      │