Functions
Base.filter
Base.filter!
Base.join
Base.similar
Base.sort
Base.sort!
Base.unique!
DataFrames.aggregate
DataFrames.allowmissing!
DataFrames.by
DataFrames.colwise
DataFrames.combine
DataFrames.completecases
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.eachrow
DataFrames.eltypes
DataFrames.groupby
DataFrames.head
DataFrames.melt
DataFrames.meltdf
DataFrames.names!
DataFrames.nonunique
DataFrames.permutecols!
DataFrames.rename
DataFrames.rename!
DataFrames.stack
DataFrames.stackdf
DataFrames.tail
DataFrames.unstack
StatsBase.describe
Grouping, Joining, and Split-Apply-Combine
DataFrames.aggregate
— Function.Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame
aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)
Arguments
d
: an AbstractDataFramegd
: a GroupedDataFramecols
: a column indicator (Symbol, Int, Vector{Symbol}, etc.)fs
: a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
Each fs
should return a value or vector. All returns must be the same length.
Returns
::DataFrame
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
aggregate(df, :a, sum)
aggregate(df, :a, [sum, x->mean(skipmissing(x))])
aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
DataFrames.by
— Function.Split-apply-combine in one step; apply f
to each grouping in d
based on columns col
by(d::AbstractDataFrame, cols, f::Function; sort::Bool = false)
by(f::Function, d::AbstractDataFrame, cols; sort::Bool = false)
Arguments
d
: an AbstractDataFramecols
: a column indicator (Symbol, Int, Vector{Symbol}, etc.)f
: a function to be applied to groups; expects each argument to be an AbstractDataFramesort
: sort row groups (no sorting by default)
f
can return a value, a vector, or a DataFrame. For a value or vector, these are merged into a column along with the cols
keys. For a DataFrame, cols
are combined along columns with the resulting DataFrame. Returning a DataFrame is the clearest because it allows column labeling.
A method is defined with f
as the first argument, so do-block notation can be used.
by(d, cols, f)
is equivalent to combine(map(f, groupby(d, cols)))
.
Returns
::DataFrame
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
by(df, :a, d -> sum(d[:c]))
by(df, :a, d -> 2 * skipmissing(d[:c]))
by(df, :a, d -> DataFrame(c_sum = sum(d[:c]), c_mean = mean(skipmissing(d[:c]))))
by(df, :a, d -> DataFrame(c = d[:c], c_mean = mean(skipmissing(d[:c]))))
by(df, [:a, :b]) do d
DataFrame(m = mean(skipmissing(d[:c])), v = var(skipmissing(d[:c])))
end
DataFrames.colwise
— Function.Apply a function to each column in an AbstractDataFrame or GroupedDataFrame
colwise(f::Function, d)
colwise(d)
Arguments
f
: a function or vector of functionsd
: an AbstractDataFrame of GroupedDataFrame
If d
is not provided, a curried version of groupby is given.
Returns
- various, depending on the call
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
colwise(sum, df)
colwise([sum, length], df)
colwise((minimum, maximum), df)
colwise(sum, groupby(df, :a))
DataFrames.groupby
— Function.A view of an AbstractDataFrame split into row groups
groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)
Arguments
d
: an AbstractDataFrame to split (optional, see Returns)cols
: data table columns to group bysort
: whether to sort rows according to the values of the grouping columnscols
skipmissing
: whether to skip rows withmissing
values in one of the grouping columnscols
Returns
::GroupedDataFrame
: a grouped view intod
::Function
: a functionx -> groupby(x, cols)
(ifd
is not specified)
Details
An iterator over a GroupedDataFrame
returns a SubDataFrame
view for each grouping into d
. A GroupedDataFrame
also supports indexing by groups and map
.
See the following for additional split-apply-combine operations:
by
: split-apply-combine using functionsaggregate
: split-apply-combine; applies functions in the form of a cross productcombine
: combine (obviously)colwise
: apply a function to each column in an AbstractDataFrame or GroupedDataFrame
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
gd = groupby(df, :a)
gd[1]
last(gd)
vcat([g[:b] for g in gd]...)
for g in gd
println(g)
end
map(d -> mean(skipmissing(d[:c])), gd) # returns a GroupApplied object
combine(map(d -> mean(skipmissing(d[:c])), gd))
Base.join
— Function.join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
indicator = nothing, validate = (false, false))
Join two DataFrame
objects
Arguments
df1
,df2
: the two AbstractDataFrames to be joined
Keyword Arguments
on
: A column, or vector of columns to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be(left, right)
tuples orleft => right
pairs, or a vector of such tuples or pairs.on
is a required argument for all joins except forkind = :cross
kind
: the type of join, options include::inner
: only include rows with keys that match in bothdf1
anddf2
, the default:outer
: include all rows fromdf1
anddf2
:left
: include all rows fromdf1
:right
: include all rows fromdf2
:semi
: return rows ofdf1
that match with the keys indf2
:anti
: return rows ofdf1
that do not match with the keys indf2
:cross
: a full Cartesian product of the key combinations; every row ofdf1
is matched with every row ofdf2
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).indicator
: Default:nothing
. If aSymbol
, adds categorical indicator column namedSymbol
for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). IfSymbol
is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.
For the three join operations that may introduce missing values (:outer
, :left
, and :right
), all columns of the returned data table will support missing values.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame
takes precedence over the ordering of the right DataFrame
Result
::DataFrame
: the joined DataFrame
Examples
name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)
job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)
DataFrames.melt
— Function.Stacks a DataFrame; convert from a wide to long format; see stack
.
DataFrames.stack
— Function.Stacks a DataFrame; convert from a wide to long format
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
df
: the AbstractDataFrame to be stackedmeasure_vars
: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt
, defaults to all variables that are notid_vars
. If neithermeasure_vars
orid_vars
are given,measure_vars
defaults to all floating point columns.id_vars
: the identifier columns that are repeated during stacking, a normal column indexing type; forstack
defaults to all variables that are notmeasure_vars
variable_name
: the name of the new stacked column that shall hold the names of each ofmeasure_vars
value_name
: the name of the new stacked column containing the values from each ofmeasure_vars
Result
::DataFrame
: the long-format DataFrame with column:value
holding the values of the stacked columns (measure_vars
), with column:variable
a Vector of Symbols with themeasure_vars
name, and with columns for each of theid_vars
.
See also stackdf
and meltdf
for stacking methods that return a view into the original DataFrame. See unstack
for converting from long to wide format.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)
DataFrames.unstack
— Function.Unstacks a DataFrame; convert from a long to wide format
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)
Arguments
df
: the AbstractDataFrame to be unstackedrowkeys
: the column(s) with a unique key for each row, if not given, find a key by grouping on anything not acolkey
orvalue
colkey
: the column holding the column names in wide format, defaults to:variable
value
: the value column, defaults to:value
Result
::DataFrame
: the wide-format DataFrame
If colkey
contains missing
values then they will be skipped and a warning will be printed.
If combination of rowkeys
and colkey
contains duplicate entries then last value
will be retained and a warning will be printed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
Note that there are some differences between the widened results above.
DataFrames.stackdf
— Function.A stacked view of a DataFrame (long format)
Like stack
and melt
, but a view is returned rather than data copies.
stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
df
: the wide AbstractDataFramemeasure_vars
: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; formelt
, defaults to all variables that are notid_vars
id_vars
: the identifier columns that are repeated during stacking, a normal column indexing type; forstack
defaults to all variables that are notmeasure_vars
Result
::DataFrame
: the long-format DataFrame with column:value
holding the values of the stacked columns (measure_vars
), with column:variable
a Vector of Symbols with themeasure_vars
name, and with columns for each of theid_vars
.
The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])
DataFrames.meltdf
— Function.A stacked view of a DataFrame (long format); see stackdf
Basics
allowmissing!
categorical!
combine
completecases
deleterows!
describe
disallowmissing!
dropmissing
dropmissing!
eachcol
eachrow
eltypes
filter
filter!
head
names
names!
nonunique
order
rename!
rename
show
showcols
similar
size
sort
sort!
tail
unique!
permutecols!