Data Manipulation
Joins
#Base.join
— Function.
Join two DataFrames
join(df1::AbstractDataFrame,df2::AbstractDataFrame;on::Union{Symbol,Vector{Symbol}}=Symbol[],kind::Symbol=:inner)
Arguments
df1
,df2
: the two AbstractDataFrames to be joined
Keyword Arguments
on
: a Symbol or Vector{Symbol}, the column(s) used as keys when joining; required argument except forkind = :cross
kind
: the type of join, options include::inner
: only include rows with keys that match in bothdf1
anddf2
, the default:outer
: include all rows fromdf1
anddf2
:left
: include all rows fromdf1
:right
: include all rows fromdf2
:semi
: return rows ofdf1
that match with the keys indf2
:anti
: return rows ofdf1
that do not match with the keys indf2
:cross
: a full Cartesian product of the key combinations; every row ofdf1
is matched with every row ofdf2
NA
s are filled in where needed to complete joins.
Result
::DataFrame
: the joined DataFrame
Examples
name=DataFrame(ID=[1,2,3],Name=["John Doe","Jane Doe","Joe Blogs"])job=DataFrame(ID=[1,2,4],Job=["Lawyer","Doctor","Farmer"])join(name,job,on=:ID)join(name,job,on=:ID,kind=:outer)join(name,job,on=:ID,kind=:left)join(name,job,on=:ID,kind=:right)join(name,job,on=:ID,kind=:semi)join(name,job,on=:ID,kind=:anti)join(name,job,kind=:cross)
Reshaping
#DataFrames.melt
— Function.
Stacks a DataFrame; convert from a wide to long format; see stack
.
#DataFrames.stack
— Function.
Stacks a DataFrame; convert from a wide to long format
stack(df::AbstractDataFrame,measure_vars,id_vars)stack(df::AbstractDataFrame,measure_vars)stack(df::AbstractDataFrame)melt(df::AbstractDataFrame,id_vars,measure_vars)melt(df::AbstractDataFrame,id_vars)
Arguments
df
: the AbstractDataFrame to be stackedmeasure_vars
: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt
, defaults to all variables that are notid_vars
id_vars
: the identifier columns that are repeated during stacking, a normal column indexing type; forstack
defaults to all variables that are notmeasure_vars
If neither measure_vars
or id_vars
are given, measure_vars
defaults to all floating point columns.
Result
::DataFrame
: the long-format dataframe with column:value
holding the values of the stacked columns (measure_vars
), with column:variable
a Vector of Symbols with themeasure_vars
name, and with columns for each of theid_vars
.
See also stackdf
and meltdf
for stacking methods that return a view into the original DataFrame. See unstack
for converting from long to wide format.
Examples
d1=DataFrame(a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12),e=map(string,'a':'l'))d1s=stack(d1,[:c,:d])d1s2=stack(d1,[:c,:d],[:a])d1m=melt(d1,[:a,:b,:e])
#DataFrames.unstack
— Function.
Unstacks a DataFrame; convert from a long to wide format
unstack(df::AbstractDataFrame,rowkey,colkey,value)unstack(df::AbstractDataFrame,colkey,value)unstack(df::AbstractDataFrame)
Arguments
df
: the AbstractDataFrame to be unstackedrowkey
: the column with a unique key for each row, if not given, find a key by grouping on anything not acolkey
orvalue
colkey
: the column holding the column names in wide format, defaults to:variable
value
: the value column, defaults to:value
Result
::DataFrame
: the wide-format dataframe
Examples
wide=DataFrame(id=1:12,a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12))long=stack(wide)wide0=unstack(long)wide1=unstack(long,:variable,:value)wide2=unstack(long,:id,:variable,:value)
Note that there are some differences between the widened results above.
#DataFrames.stackdf
— Function.
A stacked view of a DataFrame (long format)
Like stack
and melt
, but a view is returned rather than data copies.
stackdf(df::AbstractDataFrame,measure_vars,id_vars)stackdf(df::AbstractDataFrame,measure_vars)meltdf(df::AbstractDataFrame,id_vars,measure_vars)meltdf(df::AbstractDataFrame,id_vars)
Arguments
df
: the wide AbstractDataFramemeasure_vars
: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt
, defaults to all variables that are notid_vars
id_vars
: the identifier columns that are repeated during stacking, a normal column indexing type; forstack
defaults to all variables that are notmeasure_vars
Result
::DataFrame
: the long-format dataframe with column:value
holding the values of the stacked columns (measure_vars
), with column:variable
a Vector of Symbols with themeasure_vars
name, and with columns for each of theid_vars
.
The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
Examples
d1=DataFrame(a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12),e=map(string,'a':'l'))d1s=stackdf(d1,[:c,:d])d1s2=stackdf(d1,[:c,:d],[:a])d1m=meltdf(d1,[:a,:b,:e])
#DataFrames.meltdf
— Function.
A stacked view of a DataFrame (long format); see stackdf