Data Manipulation
Joins
#Base.join — Function.
Join two DataFrames
join(df1::AbstractDataFrame,df2::AbstractDataFrame;on::Union{Symbol,Vector{Symbol}}=Symbol[],kind::Symbol=:inner)
Arguments
df1,df2: the two AbstractDataFrames to be joined
Keyword Arguments
on: a Symbol or Vector{Symbol}, the column(s) used as keys when joining; required argument except forkind = :crosskind: the type of join, options include::inner: only include rows with keys that match in bothdf1anddf2, the default:outer: include all rows fromdf1anddf2:left: include all rows fromdf1:right: include all rows fromdf2:semi: return rows ofdf1that match with the keys indf2:anti: return rows ofdf1that do not match with the keys indf2:cross: a full Cartesian product of the key combinations; every row ofdf1is matched with every row ofdf2
NAs are filled in where needed to complete joins.
Result
::DataFrame: the joined DataFrame
Examples
name=DataFrame(ID=[1,2,3],Name=["John Doe","Jane Doe","Joe Blogs"])job=DataFrame(ID=[1,2,4],Job=["Lawyer","Doctor","Farmer"])join(name,job,on=:ID)join(name,job,on=:ID,kind=:outer)join(name,job,on=:ID,kind=:left)join(name,job,on=:ID,kind=:right)join(name,job,on=:ID,kind=:semi)join(name,job,on=:ID,kind=:anti)join(name,job,kind=:cross)
Reshaping
#DataFrames.melt — Function.
Stacks a DataFrame; convert from a wide to long format; see stack.
#DataFrames.stack — Function.
Stacks a DataFrame; convert from a wide to long format
stack(df::AbstractDataFrame,measure_vars,id_vars)stack(df::AbstractDataFrame,measure_vars)stack(df::AbstractDataFrame)melt(df::AbstractDataFrame,id_vars,measure_vars)melt(df::AbstractDataFrame,id_vars)
Arguments
df: the AbstractDataFrame to be stackedmeasure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt, defaults to all variables that are notid_varsid_vars: the identifier columns that are repeated during stacking, a normal column indexing type; forstackdefaults to all variables that are notmeasure_vars
If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.
Result
::DataFrame: the long-format dataframe with column:valueholding the values of the stacked columns (measure_vars), with column:variablea Vector of Symbols with themeasure_varsname, and with columns for each of theid_vars.
See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.
Examples
d1=DataFrame(a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12),e=map(string,'a':'l'))d1s=stack(d1,[:c,:d])d1s2=stack(d1,[:c,:d],[:a])d1m=melt(d1,[:a,:b,:e])
#DataFrames.unstack — Function.
Unstacks a DataFrame; convert from a long to wide format
unstack(df::AbstractDataFrame,rowkey,colkey,value)unstack(df::AbstractDataFrame,colkey,value)unstack(df::AbstractDataFrame)
Arguments
df: the AbstractDataFrame to be unstackedrowkey: the column with a unique key for each row, if not given, find a key by grouping on anything not acolkeyorvaluecolkey: the column holding the column names in wide format, defaults to:variablevalue: the value column, defaults to:value
Result
::DataFrame: the wide-format dataframe
Examples
wide=DataFrame(id=1:12,a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12))long=stack(wide)wide0=unstack(long)wide1=unstack(long,:variable,:value)wide2=unstack(long,:id,:variable,:value)
Note that there are some differences between the widened results above.
#DataFrames.stackdf — Function.
A stacked view of a DataFrame (long format)
Like stack and melt, but a view is returned rather than data copies.
stackdf(df::AbstractDataFrame,measure_vars,id_vars)stackdf(df::AbstractDataFrame,measure_vars)meltdf(df::AbstractDataFrame,id_vars,measure_vars)meltdf(df::AbstractDataFrame,id_vars)
Arguments
df: the wide AbstractDataFramemeasure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt, defaults to all variables that are notid_varsid_vars: the identifier columns that are repeated during stacking, a normal column indexing type; forstackdefaults to all variables that are notmeasure_vars
Result
::DataFrame: the long-format dataframe with column:valueholding the values of the stacked columns (measure_vars), with column:variablea Vector of Symbols with themeasure_varsname, and with columns for each of theid_vars.
The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
Examples
d1=DataFrame(a=repeat([1:3;],inner=[4]),b=repeat([1:4;],inner=[3]),c=randn(12),d=randn(12),e=map(string,'a':'l'))d1s=stackdf(d1,[:c,:d])d1s2=stackdf(d1,[:c,:d],[:a])d1m=meltdf(d1,[:a,:b,:e])
#DataFrames.meltdf — Function.
A stacked view of a DataFrame (long format); see stackdf