Functions

Functions

Grouping, Joining, and Split-Apply-Combine

DataFrames.aggregateFunction.

Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame

aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)

Arguments

  • d : an AbstractDataFrame

  • gd : a GroupedDataFrame

  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.)

  • fs : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector

Each fs should return a value or vector. All returns must be the same length.

Returns

  • ::DataFrame

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
aggregate(df, :a, sum)
aggregate(df, :a, [sum, x->mean(skipmissing(x))])
aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
source
DataFrames.byFunction.

Split-apply-combine in one step; apply f to each grouping in d based on columns col

by(d::AbstractDataFrame, cols, f::Function; sort::Bool = false)
by(f::Function, d::AbstractDataFrame, cols; sort::Bool = false)

Arguments

  • d : an AbstractDataFrame

  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.)

  • f : a function to be applied to groups; expects each argument to be an AbstractDataFrame

  • sort: sort row groups (no sorting by default)

f can return a value, a vector, or a DataFrame. For a value or vector, these are merged into a column along with the cols keys. For a DataFrame, cols are combined along columns with the resulting DataFrame. Returning a DataFrame is the clearest because it allows column labeling.

A method is defined with f as the first argument, so do-block notation can be used.

by(d, cols, f) is equivalent to combine(map(f, groupby(d, cols))).

Returns

  • ::DataFrame

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
by(df, :a, d -> sum(d[:c]))
by(df, :a, d -> 2 * skipmissing(d[:c]))
by(df, :a, d -> DataFrame(c_sum = sum(d[:c]), c_mean = mean(skipmissing(d[:c]))))
by(df, :a, d -> DataFrame(c = d[:c], c_mean = mean(skipmissing(d[:c]))))
by(df, [:a, :b]) do d
    DataFrame(m = mean(skipmissing(d[:c])), v = var(skipmissing(d[:c])))
end
source
DataFrames.colwiseFunction.

Apply a function to each column in an AbstractDataFrame or GroupedDataFrame

colwise(f::Function, d)
colwise(d)

Arguments

  • f : a function or vector of functions

  • d : an AbstractDataFrame of GroupedDataFrame

If d is not provided, a curried version of groupby is given.

Returns

  • various, depending on the call

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
colwise(sum, df)
colwise([sum, length], df)
colwise((minimum, maximum), df)
colwise(sum, groupby(df, :a))
source
DataFrames.groupbyFunction.

A view of an AbstractDataFrame split into row groups

groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)

Arguments

  • d : an AbstractDataFrame to split (optional, see Returns)

  • cols : data table columns to group by

  • sort: whether to sort rows according to the values of the grouping columns cols

  • skipmissing: whether to skip rows with missing values in one of the grouping columns cols

Returns

  • ::GroupedDataFrame : a grouped view into d

  • ::Function: a function x -> groupby(x, cols) (if d is not specified)

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into d. A GroupedDataFrame also supports indexing by groups and map.

See the following for additional split-apply-combine operations:

  • by : split-apply-combine using functions

  • aggregate : split-apply-combine; applies functions in the form of a cross product

  • combine : combine (obviously)

  • colwise : apply a function to each column in an AbstractDataFrame or GroupedDataFrame

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
gd = groupby(df, :a)
gd[1]
last(gd)
vcat([g[:b] for g in gd]...)
for g in gd
    println(g)
end
map(d -> mean(skipmissing(d[:c])), gd)   # returns a GroupApplied object
combine(map(d -> mean(skipmissing(d[:c])), gd))
source
Base.joinFunction.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
     indicator = nothing, validate = (false, false))

Join two DataFrame objects

Arguments

  • df1, df2 : the two AbstractDataFrames to be joined

Keyword Arguments

  • on : A column, or vector of columns to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be (left, right) tuples or left => right pairs, or a vector of such tuples or pairs. on is a required argument for all joins except for kind = :cross

  • kind : the type of join, options include:

    • :inner : only include rows with keys that match in both df1 and df2, the default

    • :outer : include all rows from df1 and df2

    • :left : include all rows from df1

    • :right : include all rows from df2

    • :semi : return rows of df1 that match with the keys in df2

    • :anti : return rows of df1 that do not match with the keys in df2

    • :cross : a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2

  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.

  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.

For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data table will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame

Result

  • ::DataFrame : the joined DataFrame

Examples

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)
source
DataFrames.meltFunction.

Stacks a DataFrame; convert from a wide to long format; see stack.

source
DataFrames.stackFunction.

Stacks a DataFrame; convert from a wide to long format

stack(df::AbstractDataFrame, [measure_vars], [id_vars];
      variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
     variable_name::Symbol=:variable, value_name::Symbol=:value)

Arguments

  • df : the AbstractDataFrame to be stacked

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars. If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

  • variable_name : the name of the new stacked column that shall hold the names of each of measure_vars

  • value_name : the name of the new stacked column containing the values from each of measure_vars

Result

  • ::DataFrame : the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)
source
DataFrames.unstackFunction.

Unstacks a DataFrame; convert from a long to wide format

unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
        colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
        colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
        value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame to be unstacked

  • rowkeys : the column(s) with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value

  • colkey : the column holding the column names in wide format, defaults to :variable

  • value : the value column, defaults to :value

Result

  • ::DataFrame : the wide-format DataFrame

If colkey contains missing values then they will be skipped and a warning will be printed.

If combination of rowkeys and colkey contains duplicate entries then last value will be retained and a warning will be printed.

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)

Note that there are some differences between the widened results above.

source
DataFrames.stackdfFunction.

A stacked view of a DataFrame (long format)

Like stack and melt, but a view is returned rather than data copies.

stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
        variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
       variable_name::Symbol=:variable, value_name::Symbol=:value)

Arguments

  • df : the wide AbstractDataFrame

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

Result

  • ::DataFrame : the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])
source
DataFrames.meltdfFunction.

A stacked view of a DataFrame (long format); see stackdf

source

Basics

allowmissing!
categorical!
combine
completecases
deleterows!
describe
disallowmissing!
dropmissing
dropmissing!
eachcol
eachrow
eltypes
filter
filter!
head
names
names!
nonunique
order
rename!
rename
show
showcols
similar
size
sort
sort!
tail
unique!
permutecols!