Database-Style Joins
We often need to combine two or more data sets to get a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:
julia> using DataFrames
julia> people = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 20 │ John Doe │
│ 2 │ 40 │ Jane Doe │
julia> jobs = DataFrame(ID = [20, 40], Job = ["Lawyer", "Doctor"])
2×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 20 │ Lawyer │
│ 2 │ 40 │ Doctor │
We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the join function:
julia> join(people, jobs, on = :ID)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
│ 2 │ 40 │ Jane Doe │ Doctor │
In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.
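To make the role of the key concrete, here is a minimal sketch in plain Julia of what an inner join on a single key does. It is not how DataFrames implements join; the helper inner_join_sketch and the representation of rows as NamedTuples are assumptions made only for illustration.

```julia
# Rows are represented as NamedTuples; this mirrors the two tables above.
people = [(ID = 20, Name = "John Doe"), (ID = 40, Name = "Jane Doe")]
jobs   = [(ID = 20, Job = "Lawyer"), (ID = 40, Job = "Doctor")]

# Hypothetical helper (not part of DataFrames): inner join on one key column.
function inner_join_sketch(left, right, key::Symbol)
    # Index the right table by key value so each lookup avoids a full scan.
    index = Dict{Any, Vector{Any}}()
    for r in right
        push!(get!(index, getproperty(r, key), Any[]), r)
    end
    out = []
    for l in left
        # Only left rows whose key value also occurs on the right produce output.
        for r in get(index, getproperty(l, key), Any[])
            # merge combines the columns of both rows; the shared key appears once.
            push!(out, merge(l, r))
        end
    end
    return out
end

joined = inner_join_sketch(people, jobs, :ID)
```

Each element of joined carries the ID, Name and Job fields of one matched pair of rows, which is the same information as one row of the join output shown above.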
There are seven kinds of joins supported by the DataFrames package:
- Inner: The output contains rows for values of the key that exist in both the first (left) and second (right) arguments to join.
- Left: The output contains rows for values of the key that exist in the first (left) argument to join, whether or not that value exists in the second (right) argument.
- Right: The output contains rows for values of the key that exist in the second (right) argument to join, whether or not that value exists in the first (left) argument.
- Outer: The output contains rows for values of the key that exist in the first (left) or second (right) argument to join.
- Semi: Like an inner join, but output is restricted to columns from the first (left) argument to join.
- Anti: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument.
- Cross: The output is the cartesian product of rows from the first (left) and second (right) arguments to join.
See the Wikipedia page on SQL joins for more information.
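One way to read the key-based definitions above is in terms of set operations on the key values. The sketch below uses only Base Julia; the two key vectors are illustrative values, and it assumes each key value occurs at most once per table. Cross joins are left out because they do not use a key.

```julia
left_keys  = [20, 40]   # key values in the first (left) table (illustrative)
right_keys = [20, 60]   # key values in the second (right) table (illustrative)

# Which key values appear in the output of each kind of join,
# assuming no key value is duplicated within a table.
keys_by_kind = Dict(
    :inner => intersect(left_keys, right_keys),  # in both tables
    :left  => left_keys,                         # everything on the left
    :right => right_keys,                        # everything on the right
    :outer => union(left_keys, right_keys),      # in either table
    :semi  => intersect(left_keys, right_keys),  # same keys as inner, left columns only
    :anti  => setdiff(left_keys, right_keys),    # on the left but not on the right
)
```

With these values, the inner and semi joins keep only key 20, the anti join keeps only key 40, and the outer join keeps 20, 40 and 60.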
You can control the kind of join that join performs using the kind keyword argument:
julia> jobs = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
2×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 20 │ Lawyer │
│ 2 │ 60 │ Astronaut │
julia> join(people, jobs, on = :ID, kind = :inner)
1×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
julia> join(people, jobs, on = :ID, kind = :left)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String⍰ │
├─────┼───────┼──────────┼─────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
│ 2 │ 40 │ Jane Doe │ missing │
julia> join(people, jobs, on = :ID, kind = :right)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String⍰ │ String │
├─────┼───────┼──────────┼───────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
│ 2 │ 60 │ missing │ Astronaut │
julia> join(people, jobs, on = :ID, kind = :outer)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String⍰ │ String⍰ │
├─────┼───────┼──────────┼───────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
│ 2 │ 40 │ Jane Doe │ missing │
│ 3 │ 60 │ missing │ Astronaut │
julia> join(people, jobs, on = :ID, kind = :semi)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 20 │ John Doe │
julia> join(people, jobs, on = :ID, kind = :anti)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 40 │ Jane Doe │
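The left, right and outer joins above fill cells that have no match with missing. As a small sketch of how such a column might be handled afterwards, the following uses only Base Julia functions (ismissing, coalesce, skipmissing); the vector job is a stand-in for the Job column of the left join output, and the replacement label "unknown" is an arbitrary choice for illustration.

```julia
# Stand-in for the Job column produced by the left join above.
job = ["Lawyer", missing]

# Count the rows that found no match in the right table.
n_unmatched = count(ismissing, job)

# Replace missing entries with a placeholder label (arbitrary choice).
filled = coalesce.(job, "unknown")

# Or drop the missing entries and keep only the matched values.
present = collect(skipmissing(job))
```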
Cross joins are the only kind of join that does not use a key:
julia> join(people, jobs, kind = :cross, makeunique = true)
4×4 DataFrame
│ Row │ ID │ Name │ ID_1 │ Job │
│ │ Int64 │ String │ Int64 │ String │
├─────┼───────┼──────────┼───────┼───────────┤
│ 1 │ 20 │ John Doe │ 20 │ Lawyer │
│ 2 │ 20 │ John Doe │ 60 │ Astronaut │
│ 3 │ 40 │ Jane Doe │ 20 │ Lawyer │
│ 4 │ 40 │ Jane Doe │ 60 │ Astronaut │
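The cartesian product can be sketched in plain Julia as a nested comprehension that pairs every left row with every right row. This is only an illustration of the semantics, not the DataFrames implementation; the rows are written as NamedTuples, and the right-hand ID is stored under the name ID_1 by hand to mirror what makeunique = true does in the output above.

```julia
people = [(ID = 20, Name = "John Doe"), (ID = 40, Name = "Jane Doe")]
jobs   = [(ID = 20, Job = "Lawyer"), (ID = 60, Job = "Astronaut")]

# Every left row is combined with every right row; the left row varies slowest.
# Both tables have an ID column, so the right-hand one is renamed to ID_1 here.
crossed = [(ID = l.ID, Name = l.Name, ID_1 = r.ID, Job = r.Job)
           for l in people for r in jobs]

# The number of output rows is the product of the two input row counts.
n_rows = length(crossed)
```

Because no key is involved, the output size grows multiplicatively with the inputs, which is worth keeping in mind before cross joining large tables.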
In order to join data tables on keys which have different names, you must first rename them so that they match. This can be done using rename!:
julia> a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 20 │ John Doe │
│ 2 │ 40 │ Jane Doe │
julia> b = DataFrame(IDNew = [20, 40], Job = ["Lawyer", "Doctor"])
2×2 DataFrame
│ Row │ IDNew │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 20 │ Lawyer │
│ 2 │ 40 │ Doctor │
julia> rename!(b, :IDNew => :ID)
2×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 20 │ Lawyer │
│ 2 │ 40 │ Doctor │
julia> join(a, b, on = :ID, kind = :inner)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 20 │ John Doe │ Lawyer │
│ 2 │ 40 │ Jane Doe │ Doctor │
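Note that rename! modifies b in place. If the original column names should be preserved, the non-mutating rename can be used instead to obtain a renamed copy. The sketch below assumes the DataFrames package is installed; the variable name b_renamed is chosen only for illustration.

```julia
using DataFrames

b = DataFrame(IDNew = [20, 40], Job = ["Lawyer", "Doctor"])

# rename returns a new data frame and leaves b untouched,
# unlike rename!, which changes the column names of b itself.
b_renamed = rename(b, :IDNew => :ID)

# b_renamed now has an ID column and can be passed to join
# in place of b, exactly as in the example above.
```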
You can also rename multiple columns at once and then join on several keys:
julia> a = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
                     Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Category = [1, 2, 3, 4, 5])
5×3 DataFrame
│ Row │ City │ Job │ Category │
│ │ String │ String │ Int64 │
├─────┼───────────┼────────┼──────────┤
│ 1 │ Amsterdam │ Lawyer │ 1 │
│ 2 │ London │ Lawyer │ 2 │
│ 3 │ London │ Lawyer │ 3 │
│ 4 │ New York │ Doctor │ 4 │
│ 5 │ New York │ Doctor │ 5 │
julia> b = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
                     Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Name = ["a", "b", "c", "d", "e"])
5×3 DataFrame
│ Row │ Location │ Work │ Name │
│ │ String │ String │ String │
├─────┼───────────┼────────┼────────┤
│ 1 │ Amsterdam │ Lawyer │ a │
│ 2 │ London │ Lawyer │ b │
│ 3 │ London │ Lawyer │ c │
│ 4 │ New York │ Doctor │ d │
│ 5 │ New York │ Doctor │ e │
julia> rename!(b, :Location => :City, :Work => :Job)
5×3 DataFrame
│ Row │ City │ Job │ Name │
│ │ String │ String │ String │
├─────┼───────────┼────────┼────────┤
│ 1 │ Amsterdam │ Lawyer │ a │
│ 2 │ London │ Lawyer │ b │
│ 3 │ London │ Lawyer │ c │
│ 4 │ New York │ Doctor │ d │
│ 5 │ New York │ Doctor │ e │
julia> join(a, b, on = [:City, :Job])
9×4 DataFrame
│ Row │ City │ Job │ Category │ Name │
│ │ String │ String │ Int64 │ String │
├─────┼───────────┼────────┼──────────┼────────┤
│ 1 │ Amsterdam │ Lawyer │ 1 │ a │
│ 2 │ London │ Lawyer │ 2 │ b │
│ 3 │ London │ Lawyer │ 2 │ c │
│ 4 │ London │ Lawyer │ 3 │ b │
│ 5 │ London │ Lawyer │ 3 │ c │
│ 6 │ New York │ Doctor │ 4 │ d │
│ 7 │ New York │ Doctor │ 4 │ e │
│ 8 │ New York │ Doctor │ 5 │ d │
│ 9 │ New York │ Doctor │ 5 │ e │
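The result has 9 rows rather than 5 because some (City, Job) combinations occur twice in each table, and every matching pair of rows is returned. The following plain-Julia sketch reproduces that count; the helper count_by_key is hypothetical and written only for this illustration.

```julia
# The (City, Job) key pairs of table a; table b has the same pairs after renaming.
left_keys = [("Amsterdam", "Lawyer"), ("London", "Lawyer"), ("London", "Lawyer"),
             ("New York", "Doctor"), ("New York", "Doctor")]
right_keys = copy(left_keys)

# Hypothetical helper: how often does each composite key occur?
function count_by_key(ks)
    counts = Dict{Tuple{String, String}, Int}()
    for k in ks
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

lc = count_by_key(left_keys)
rc = count_by_key(right_keys)

# An inner join yields (left count) * (right count) rows per key:
# 1*1 for Amsterdam, 2*2 for London, 2*2 for New York.
expected_rows = sum(lc[k] * get(rc, k, 0) for k in keys(lc))
```

Here expected_rows evaluates to 9, matching the size of the joined data frame above.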