Functions
Base.append!
Base.copy
Base.filter
Base.filter!
Base.get
Base.hcat
Base.join
Base.keys
Base.map
Base.names
Base.push!
Base.repeat
Base.show
Base.sort
Base.sort!
Base.unique!
Base.vcat
CategoricalArrays.categorical
DataAPI.describe
DataFrames.DataFrame!
DataFrames.aggregate
DataFrames.allowmissing!
DataFrames.by
DataFrames.categorical!
DataFrames.combine
DataFrames.completecases
DataFrames.deleterows!
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.eachcol
DataFrames.eachrow
DataFrames.flatten
DataFrames.groupby
DataFrames.groupindices
DataFrames.groupvars
DataFrames.insertcols!
DataFrames.mapcols
DataFrames.ncol
DataFrames.nonunique
DataFrames.nrow
DataFrames.rename
DataFrames.rename!
DataFrames.select
DataFrames.select!
DataFrames.stack
DataFrames.unstack
Missings.allowmissing
Missings.disallowmissing
Grouping, Joining, and Split-Apply-Combine
DataFrames.aggregate
— Function.
aggregate(df::AbstractDataFrame, fs)
aggregate(df::AbstractDataFrame, cols, fs; sort=false, skipmissing=false)
aggregate(gd::GroupedDataFrame, fs; sort=false)
Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame. Return an aggregated data frame.
Arguments
- df: an AbstractDataFrame
- gd: a GroupedDataFrame
- cols: a column indicator (Symbol, Int, Vector{Symbol}, etc.)
- fs: a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
- sort: whether to sort rows according to the values of the grouping columns
- skipmissing: whether to skip rows with missing values in one of the grouping columns cols
Each fs should return a value or vector. All returns must be the same length.
Examples
julia> using Statistics
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> aggregate(df, :a, sum)
4×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 4 │ 6 │
│ 2 │ 2 │ 2 │ 8 │
│ 3 │ 3 │ 4 │ 10 │
│ 4 │ 4 │ 2 │ 12 │
julia> aggregate(df, :a, [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a │ b_sum │ c_sum │ b_function │ c_function │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1 │ 1 │ 4 │ 6 │ 2.0 │ 3.0 │
│ 2 │ 2 │ 2 │ 8 │ 1.0 │ 4.0 │
│ 3 │ 3 │ 4 │ 10 │ 2.0 │ 5.0 │
│ 4 │ 4 │ 2 │ 12 │ 1.0 │ 6.0 │
julia> aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a │ b_sum │ c_sum │ b_function │ c_function │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1 │ 1 │ 4 │ 6 │ 2.0 │ 3.0 │
│ 2 │ 2 │ 2 │ 8 │ 1.0 │ 4.0 │
│ 3 │ 3 │ 4 │ 10 │ 2.0 │ 5.0 │
│ 4 │ 4 │ 2 │ 12 │ 1.0 │ 6.0 │
DataFrames.by
— Function.
by(df::AbstractDataFrame, keys, cols=>f...;
   sort::Bool=false, skipmissing::Bool=false)
by(df::AbstractDataFrame, keys; (colname = cols => f)...,
   sort::Bool=false, skipmissing::Bool=false)
by(df::AbstractDataFrame, keys, f;
   sort::Bool=false, skipmissing::Bool=false)
by(f, df::AbstractDataFrame, keys;
   sort::Bool=false, skipmissing::Bool=false)
Split-apply-combine in one step: apply f to each grouping in df based on grouping columns keys, and return a DataFrame.
keys can be either a single column index, or a vector thereof.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, and each will produce a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted on keys if sort=true. Otherwise, ordering of rows is undefined. If skipmissing=true then the resulting data frame will not contain groups with missing values in one of the keys columns.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
by(d, cols, f) is equivalent to combine(f, groupby(d, cols)) and to the less efficient combine(map(f, groupby(d, cols))).
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a │ c_function │
│ │ Int64 │ Int64 │
├─────┼───────┼────────────┤
│ 1 │ 1 │ 2 │
│ 2 │ 1 │ 10 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 12 │
│ 5 │ 3 │ 6 │
│ 6 │ 3 │ 14 │
│ 7 │ 4 │ 8 │
│ 8 │ 4 │ 16 │
julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a │ c_sum │ c_sum2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼────────┤
│ 1 │ 1 │ 6 │ 26 │
│ 2 │ 2 │ 8 │ 40 │
│ 3 │ 3 │ 10 │ 58 │
│ 4 │ 4 │ 12 │ 80 │
julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a │ minb │ sumc │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 6 │
│ 2 │ 2 │ 1 │ 8 │
│ 3 │ 3 │ 2 │ 10 │
│ 4 │ 4 │ 1 │ 12 │
DataFrames.combine
— Function.
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)
Transform a GroupedDataFrame into a DataFrame.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, and each will produce a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted if sort=true was passed to the groupby call from which gd was constructed. Otherwise, ordering of rows is undefined.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
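The difference between the two pair forms can be sketched as follows. This is a minimal illustration, assuming a DataFrames.jl version that provides the combine methods documented here; the data frame and the names fast and accurate are made up for the example.

```julia
using DataFrames

# Hypothetical data: one grouping column and one floating point column.
df = DataFrame(g = repeat([1, 2], outer=[3]),
               x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
gd = groupby(df, :g)

# Passing `sum` itself in the pair makes the call eligible for the optimized method.
fast = combine(gd, :x => sum)

# Wrapping `sum` in an anonymous function opts out of the optimized method,
# so the standard `sum` is called on each group's view of `x`.
accurate = combine(gd, :x => x -> sum(x))

# On small inputs like this one the two agree up to floating point rounding.
fast[!, 2] ≈ accurate[!, 2]
```

The accuracy difference only becomes relevant for long floating point columns, so the optimized form is usually the better default.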
See also:
- by(f, df, cols) is a shorthand for combine(f, groupby(df, cols)).
- map: combine(f, groupby(df, cols)) is a more efficient equivalent of combine(map(f, groupby(df, cols))).
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
See by
for more examples.
DataFrames.groupby
— Function.
groupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false)
Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.
Arguments
- df: an AbstractDataFrame to split
- cols: data frame columns to group by
- sort: whether to sort rows according to the values of the grouping columns cols
- skipmissing: whether to skip rows with missing values in one of the grouping columns cols
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.
cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.
A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).
See the following for additional split-apply-combine operations:
- by: split-apply-combine using functions
- aggregate: split-apply-combine; applies functions in the form of a cross product
- map: apply a function to each group of a GroupedDataFrame (without combining)
- combine: combine a GroupedDataFrame, optionally applying a function to each group
GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices, but this will be slower than using the equivalent GroupKey.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
⋮
Last Group (2 rows): a = 4
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[1]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> last(gd)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[(a=3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> gd[(3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> k = first(keys(gd))
GroupKey: (a = 1)
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> for g in gd
println(g)
end
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
DataFrames.groupindices
— Function.
groupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd).
Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
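A short sketch of the skipmissing case described above, assuming the groupby and groupindices signatures documented on this page; the data frame is made up for illustration.

```julia
using DataFrames

# Hypothetical data: the second row has a missing grouping value.
df = DataFrame(a = [1, missing, 2, 1], b = 1:4)

# With skipmissing=true the row with a missing key belongs to no group.
gd = groupby(df, :a, skipmissing=true)

# One entry per row of the parent data frame: rows 1 and 4 share a group
# index, row 3 gets a different one, and row 2 is attributed `missing`.
idx = groupindices(gd)
```

Because the result is aligned with the rows of parent(gd), it can be stored as an extra column when a per-row group label is needed.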
DataFrames.groupvars
— Function.
groupvars(gd::GroupedDataFrame)
Return a vector of column names in parent(gd) used for grouping.
Base.keys
— Function.
keys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple and NamedTuple, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
b = repeat([2, 1], outer=[6]),
c = 1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
⋮
Last Group (2 rows): a = :baz, b = 1
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 1 │ 6 │
│ 2 │ baz │ 1 │ 12 │
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
GroupKey objects behave similarly to NamedTuples:
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
(:a, :b)
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
julia> gd[keys(gd)[1]] == gd[1]
true
Base.get
— Function.
get(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns. If no group matches key, default is returned.
key may be a NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby).
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
b = repeat([2, 1], outer=[3]),
c = 1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 1 │ 4 │
⋮
Last Group (2 rows): a = :baz
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ bar │ 1 │ 2 │
│ 2 │ bar │ 2 │ 5 │
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (:qux,), nothing)
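The default argument makes it possible to handle absent groups without a try/catch. The following is a minimal sketch, assuming the get method documented above; group_nrow is a hypothetical helper introduced only for this example.

```julia
using DataFrames

df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]), c = 1:6)
gd = groupby(df, :a)

# Hypothetical helper: number of rows in the group for `key`, or 0 if absent.
function group_nrow(gd, key)
    g = get(gd, key, nothing)
    return g === nothing ? 0 : nrow(g)
end

n_bar = group_nrow(gd, (a = :bar,))  # key given as a NamedTuple, group exists
n_qux = group_nrow(gd, (:qux,))      # key given as a Tuple, no such group
```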
Base.join
— Function.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
     indicator = nothing, validate = (false, false))
join(df1, df2, dfs...; on = Symbol[], kind = :inner, makeunique = false,
     validate = (false, false))
Join two or more DataFrame objects and return a DataFrame containing the result.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument for all joins except for kind = :cross.
- kind: the type of join, options include:
  - :inner: only include rows with keys that match in both df1 and df2, the default
  - :outer: include all rows from df1 and df2
  - :left: include all rows from df1
  - :right: include all rows from df2
  - :semi: return rows of df1 that match with the keys in df2
  - :anti: return rows of df1 that do not match with the keys in df2
  - :cross: a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2
  When joining more than two data frames only :inner, :outer and :cross joins are allowed.
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element for df2. By default no check is performed.
For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported.
Examples
name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)
job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = :ID => :identifier)
join(name, job2, on = [:ID => :identifier])
Base.map
— Function.
map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)
Apply a function to each group of rows and return a GroupedDataFrame.
If the first argument is a cols => f pair, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.
If the first argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, and each will produce a separate column.
In all cases, the resulting GroupedDataFrame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
See also combine(f, gd) that returns a DataFrame rather than a GroupedDataFrame.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
See by
for more examples.
DataFrames.stack
— Function.
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
      variable_name::Symbol=:variable, value_name::Symbol=:value, view::Bool=false)
Stack a data frame df, i.e. convert it from wide to long format.
Return the long-format DataFrame with column value_name (:value by default) holding the values of the stacked columns (measure_vars), with column variable_name (:variable by default) a vector of Symbols holding the name of the corresponding measure_vars variable, and with columns for each of the id_vars.
If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.
Arguments
- df: the AbstractDataFrame to be stacked
- measure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; if neither measure_vars nor id_vars are given, measure_vars defaults to all floating point columns.
- id_vars: the identifier columns that are repeated during stacking, a normal column indexing type; defaults to all variables that are not measure_vars
- variable_name: the name of the new stacked column that shall hold the names of each of measure_vars
- value_name: the name of the new stacked column containing the values from each of measure_vars
- view: whether the stacked data frame should be a view rather than contain freshly allocated vectors.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
DataFrames.unstack
— Function.
unstack(df::AbstractDataFrame, rowkeys::Union{Integer, Symbol},
        colkey::Union{Integer, Symbol}, value::Union{Integer, Symbol};
        renamecols::Function=identity)
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Integer, Symbol}},
        colkey::Union{Integer, Symbol}, value::Union{Integer, Symbol};
        renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey::Union{Integer, Symbol},
        value::Union{Integer, Symbol}; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)
Unstack data frame df, i.e. convert it from long to wide format.
If colkey contains missing values then they will be skipped and a warning will be printed.
If the combination of rowkeys and colkey contains duplicate entries then the last value will be retained and a warning will be printed.
Arguments
- df: the AbstractDataFrame to be unstacked
- rowkeys: the column(s) with a unique key for each row; if not given, find a key by grouping on anything not a colkey or value
- colkey: the column holding the column names in wide format, defaults to :variable
- value: the value column, defaults to :value
- renamecols: a function called on each unique value in colkey which must return the name of the column to be created (typically as a string or a Symbol). Duplicate names are not allowed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
Note that there are some differences between the widened results above.
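A deterministic round trip makes the relationship between stack and unstack easier to check than the random data above. This is a minimal sketch, assuming the positional signatures documented on this page; the data frame and the names long and back are made up for illustration.

```julia
using DataFrames

# Hypothetical wide data: one identifier column and two measurement columns.
wide = DataFrame(id = 1:3, c = [1.0, 2.0, 3.0], d = [4.0, 5.0, 6.0])

# Stack the measurement columns :c and :d, repeating :id as the identifier.
# The result has one row per (id, measurement) combination.
long = stack(wide, [:c, :d], [:id])

# Unstack with an explicit row key to get back one row per id,
# with one column per distinct value of :variable.
back = unstack(long, :id, :variable, :value)
```

Passing the row key explicitly avoids relying on the default, which finds a key by grouping on every column that is not colkey or value.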
Basics
Missings.allowmissing
— Function.
allowmissing(df::AbstractDataFrame,
             cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:)
Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.
If cols is omitted all columns in the data frame are converted.
Examples
julia> df = DataFrame(a=[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> allowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
DataFrames.allowmissing!
— Function.
allowmissing!(df::DataFrame, cols::Colon=:)
allowmissing!(df::DataFrame, cols::Union{Integer, Symbol})
allowmissing!(df::DataFrame, cols::Union{AbstractVector, Regex, Not, Between, All})
Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.
If cols is omitted all columns in the data frame are converted.
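A typical reason to call allowmissing! is to make a column accept missing before assigning into it. The following is a minimal sketch, assuming the single-column method documented above; the data frame is made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Column :a starts with element type Int, which cannot hold `missing`.
# Convert only that column in place; :b is left unchanged.
allowmissing!(df, :a)

# The column now has element type Union{Missing, Int} and accepts `missing`.
df[1, :a] = missing
```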
CategoricalArrays.categorical
— Function.
categorical(df::AbstractDataFrame, cols::Type=Union{AbstractString, Missing};
            compress::Bool=false)
categorical(df::AbstractDataFrame,
            cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon};
            compress::Bool=false)
Return a copy of data frame df with columns cols converted to CategoricalVector. If categorical is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.
If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.
All created CategoricalVectors are unordered.
Examples
julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Categorical… │
├─────┼───────┼──────────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df, :)
2×2 DataFrame
│ Row │ a │ b │
│ │ Categorical… │ Categorical… │
├─────┼──────────────┼──────────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
DataFrames.categorical!
— Function.
categorical!(df::DataFrame, cols::Type=Union{AbstractString, Missing};
             compress::Bool=false)
categorical!(df::DataFrame, cname::Union{Integer, Symbol};
             compress::Bool=false)
categorical!(df::DataFrame, cnames::Vector{<:Union{Integer, Symbol}};
             compress::Bool=false)
categorical!(df::DataFrame, cnames::Union{Regex, Not, Between, All};
             compress::Bool=false)
Change columns selected by cname or cnames in data frame df to CategoricalVector.
If categorical! is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.
If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.
All created CategoricalVectors are unordered.
Examples
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ Categorical… │ Int64 │ Categorical… │
├─────┼──────────────┼───────┼──────────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
CategoricalString{UInt32}
Int64
CategoricalString{UInt32}
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Categorical… │ String │
├─────┼────────┼──────────────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
String
CategoricalValue{Int64,UInt8}
String
DataFrames.completecases
— Function.completecases(df::AbstractDataFrame, cols::Colon=:)
completecases(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All})
completecases(df::AbstractDataFrame, cols::Union{Integer, Symbol})
Return a Boolean vector with true
entries indicating rows without missing values (complete cases) in data frame df
. If cols
is provided, only missing values in the corresponding columns are considered.
See also: dropmissing
and dropmissing!
. Use findall(completecases(df))
to get the indices of the rows.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> completecases(df)
5-element BitArray{1}:
false
false
false
true
true
julia> completecases(df, :x)
5-element BitArray{1}:
false
true
false
true
true
julia> completecases(df, [:x, :y])
5-element BitArray{1}:
false
false
false
true
true
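As a further sketch (an illustration added here, reusing the data frame from the examples above), the Boolean vector can be used directly for row indexing or turned into row indices with findall:

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

mask = completecases(df)   # one Boolean entry per row
rows = findall(mask)       # indices of the complete rows: [4, 5]

# Keep only the complete rows; column element types are left as they are.
complete = df[mask, :]
```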
Base.copy
— Function.copy(df::DataFrame; copycols::Bool=true)
Copy data frame df
. If copycols=true
(the default), return a new DataFrame
holding copies of column vectors in df
. If copycols=false
, return a new DataFrame
sharing column vectors with df
.
copy(dfr::DataFrameRow)
Convert a DataFrameRow
to a NamedTuple
.
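The difference between the two copycols settings, and the DataFrameRow method, can be sketched like this. The variable names deep and shared are chosen only for this illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

deep = copy(df)                       # column vectors are copied
shared = copy(df, copycols = false)   # column vectors are shared with df

deep.a === df.a     # false
shared.a === df.a   # true

# Writing through the shared copy is visible in the original.
shared.a[1] = 100
df.a[1]             # 100
deep.a[1]           # 1, the deep copy is unaffected

# Copying a single row gives a NamedTuple.
row = copy(df[1, :])   # (a = 100,)
```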
DataFrames.DataFrame!
— Function.DataFrame!(args...; kwargs...)
Equivalent to DataFrame(args...; copycols=false, kwargs...)
.
If kwargs
contains the copycols
keyword argument an error is thrown.
Examples
julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df2 = DataFrame!(df1)
julia> df1.a === df2.a
true
DataFrames.deleterows!
— Function.deleterows!(df::DataFrame, inds)
Delete rows specified by inds
from a DataFrame
df
in place and return it.
Internally deleteat!
is called for all columns so inds
must be: a vector of sorted and unique integers, a boolean vector or an integer.
Examples
julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> deleterows!(d, 2)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 6 │
DataAPI.describe
— Function.describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{<:Symbol}}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame
where each row represents a variable and each column a summary statistic.
Arguments
df: the AbstractDataFrame
stats::Union{Symbol, Pair{<:Symbol}}...: the summary statistics to report. Arguments can be:
- A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nunique, :nmissing, and :eltype.
- :all as the only Symbol argument to return all statistics.
- A name => function pair where name is a Symbol. This will create a column of summary statistics with the provided name.
cols: a keyword argument allowing to select only a subset of columns from df to describe; all standard column selection methods are allowed.
Details
For Real
columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real
, describe
will attempt to calculate all statistics, using nothing
as a fall-back in the case of an error.
When stats
contains :nunique
, describe
will report the number of unique values in a column. If a column's base type derives from Real
, :nunique
will return nothing
s.
Missing values are filtered in the calculation of all statistics; however, the column :nmissing
will report the number of missing values of that variable. If the column does not allow missing values, nothing
is returned. Consequently, nmissing = 0
indicates that the column allows missing values, but does not currently contain any.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing
: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')
10×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Float64 │ Char │
├─────┼───────┼─────────┼──────┤
│ 1 │ 1 │ 0.1 │ 'a' │
│ 2 │ 2 │ 0.2 │ 'b' │
│ 3 │ 3 │ 0.3 │ 'c' │
│ 4 │ 4 │ 0.4 │ 'd' │
│ 5 │ 5 │ 0.5 │ 'e' │
│ 6 │ 6 │ 0.6 │ 'f' │
│ 7 │ 7 │ 0.7 │ 'g' │
│ 8 │ 8 │ 0.8 │ 'h' │
│ 9 │ 9 │ 0.9 │ 'i' │
│ 10 │ 10 │ 1.0 │ 'j' │
julia> describe(df)
3×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼────────┼─────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1 │ i │ 5.5 │ 1 │ 5.5 │ 10 │ │ │ Int64 │
│ 2 │ x │ 0.55 │ 0.1 │ 0.55 │ 1.0 │ │ │ Float64 │
│ 3 │ y │ │ 'a' │ │ 'j' │ 10 │ │ Char │
julia> describe(df, :min, :max)
3×3 DataFrame
│ Row │ variable │ min │ max │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 10 │
│ 2 │ x │ 0.1 │ 1.0 │
│ 3 │ y │ 'a' │ 'j' │
julia> describe(df, :min, :sum => sum)
3×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 55 │
│ 2 │ x │ 0.1 │ 5.5 │
│ 3 │ y │ 'a' │ │
julia> describe(df, :min, :sum => sum, cols=:x)
1×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┤
│ 1 │ x │ 0.1 │ 5.5 │
Missings.disallowmissing
— Function.disallowmissing(df::AbstractDataFrame,
cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:;
error::Bool=true)
Return a copy of data frame df
with columns cols
converted from element type Union{T, Missing}
to T
to drop support for missing values.
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
Examples
julia> df = DataFrame(a=Union{Int,Missing}[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> disallowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> df = DataFrame(a=[1,missing], b=Union{Int,Missing}[1,2])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64⍰ │ Int64⍰ │
├─────┼─────────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
julia> disallowmissing(df, error=false)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64⍰ │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
DataFrames.disallowmissing!
— Function.disallowmissing!(df::DataFrame, cols::Colon=:; error::Bool=true)
disallowmissing!(df::DataFrame, cols::Union{Integer, Symbol}; error::Bool=true)
disallowmissing!(df::DataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
error::Bool=true)
Convert columns cols
of data frame df
from element type Union{T, Missing}
to T
to drop support for missing values.
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
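A minimal sketch of the error=false behaviour, with a made-up data frame in which one column really contains a missing value:

```julia
using DataFrames

df = DataFrame(a = Union{Int, Missing}[1, 2],
               b = [1, missing])

# :b contains a missing value, so with error=false it is skipped,
# while :a is narrowed to Int.
disallowmissing!(df, error = false)

eltype(df.a)   # Int64
eltype(df.b)   # Union{Missing, Int64}
```

With the default error=true the same call would throw because of the missing value in :b.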
DataFrames.dropmissing
— Function.dropmissing(df::AbstractDataFrame, cols::Colon=:; disallowmissing::Bool=true)
dropmissing(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
disallowmissing::Bool=true)
dropmissing(df::AbstractDataFrame, cols::Union{Integer, Symbol};
disallowmissing::Bool=true)
Return a copy of data frame df
excluding rows with missing values. If cols
is provided, only missing values in the corresponding columns are considered.
If disallowmissing
is true
(the default) then columns specified in cols
will be converted so as not to allow for missing values using disallowmissing!
.
See also: completecases
and dropmissing!
.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing(df)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String⍰ │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.dropmissing!
— Function.dropmissing!(df::AbstractDataFrame, cols::Colon=:; disallowmissing::Bool=true)
dropmissing!(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
disallowmissing::Bool=true)
dropmissing!(df::AbstractDataFrame, cols::Union{Integer, Symbol};
disallowmissing::Bool=true)
Remove rows with missing values from data frame df
and return it. If cols
is provided, only missing values in the corresponding columns are considered.
If disallowmissing
is true
(the default) then the cols
columns will get converted using disallowmissing!
.
See also: dropmissing
and completecases
.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df))
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String⍰ │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.eachrow
— Function.eachrow(df::AbstractDataFrame)
Return a DataFrameRows
that iterates a data frame row by row, with each row represented as a DataFrameRow
.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> eachrow(df)
4-element DataFrameRows:
DataFrameRow (row 1)
x 1
y 11
DataFrameRow (row 2)
x 2
y 12
DataFrameRow (row 3)
x 3
y 13
DataFrameRow (row 4)
x 4
y 14
julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
DataFrameRow (row 4)
y 14
x 4
DataFrameRow (row 3)
y 13
x 3
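Row-wise iteration is typically used in a loop or comprehension. The following sketch (added for illustration) accesses columns of each DataFrameRow by property name:

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# Each `row` is a DataFrameRow; columns are available as properties.
sums = [row.x + row.y for row in eachrow(df)]   # [12, 14, 16, 18]
```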
DataFrames.eachcol
— Function.eachcol(df::AbstractDataFrame, names::Bool=false)
Return a DataFrameColumns
that iterates an AbstractDataFrame
column by column. If names
is equal to false
(the default) iteration returns column vectors. If names
is equal to true
pairs consisting of column name and column vector are yielded.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Array{Int64,1}:
3
3
julia> sum.(eachcol(df))
2-element Array{Int64,1}:
10
50
julia> collect(eachcol(df, true))
2-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
:x => [1, 2, 3, 4]
:y => [11, 12, 13, 14]
Base.filter
— Function.filter(function, df::AbstractDataFrame)
Return a copy of data frame df
containing only rows for which function
returns true
. The function is passed a DataFrameRow
as its only argument.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter(row -> row[:x] > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
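Since the predicate receives a whole DataFrameRow, it can combine conditions on several columns. A small sketch, reusing the data frame above:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

# Keep rows where x equals 1 and y equals "b".
kept = filter(row -> row.x == 1 && row.y == "b", df)

nrow(kept)   # 1
nrow(df)     # 4, the original data frame is not modified
```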
Base.filter!
— Function.filter!(function, df::AbstractDataFrame)
Remove rows from data frame df
for which function
returns false
. The function is passed a DataFrameRow
as its only argument.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter!(row -> row[:x] > 1, df);
julia> df
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
DataFrames.flatten
— Function.flatten(df::AbstractDataFrame, col::Union{Integer, Symbol})
When column col
of data frame df
has iterable elements that define length
(for example a Vector
of Vector
s), return a DataFrame
where each element of col
is flattened, meaning the column corresponding to col
becomes a longer Vector
where the original entries are concatenated. Elements of row i
of df
in columns other than col
will be repeated according to the length of df[i, col]
. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame
will affect df
.
Examples
julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Array… │
├─────┼───────┼────────┤
│ 1 │ 1 │ [1, 2] │
│ 2 │ 2 │ [3, 4] │
julia> flatten(df1, :b)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 1 │ 2 │
│ 3 │ 2 │ 3 │
│ 4 │ 2 │ 4 │
julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────┤
│ 1 │ 1 │ ("p", "q") │
│ 2 │ 2 │ ("r", "s") │
julia> flatten(df2, :b)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ p │
│ 2 │ 1 │ q │
│ 3 │ 2 │ r │
│ 4 │ 2 │ s │
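The elements of col do not need to have equal lengths. The sketch below (with made-up column names id and tags) shows how the other columns are repeated accordingly:

```julia
using DataFrames

df = DataFrame(id = [1, 2], tags = [["a"], ["b", "c", "d"]])

long = flatten(df, :tags)

long.id     # [1, 2, 2, 2]
long.tags   # ["a", "b", "c", "d"]
```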
Base.hcat
— Function.hcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate AbstractDataFrames
and optionally AbstractVector
s.
If AbstractVector
is passed then a column name for it is automatically generated as :x1
by default.
If makeunique=false
(the default) column names of passed objects must be unique. If makeunique=true
then duplicate column names will be suffixed with _i
(i
starting at 1 for the first duplicate).
If copycols=true
(the default) then the DataFrame
returned by hcat
will contain copied columns from the source data frames. If copycols=false
then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame
might lead to the corruption of the other object.
Example
julia> [DataFrame(A=1:3) DataFrame(B=1:3)]
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
│ Row │ A │ B │ A_1 │ B_1 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 4 │
│ 2 │ 2 │ 2 │ 5 │ 5 │
│ 3 │ 3 │ 3 │ 6 │ 6 │
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
DataFrames.insertcols!
— Function.insertcols!(df::DataFrame, ind::Int; name=col,
makeunique::Bool=false)
insertcols!(df::DataFrame, ind::Int, (:name => col)::Pair{Symbol,<:AbstractVector};
makeunique::Bool=false)
Insert a column into a data frame in place. Return the updated DataFrame
.
Arguments
df: the DataFrame to which we want to add a column
ind: a position at which we want to insert a column
name: the name of the new column
col: an AbstractVector giving the contents of the new column
makeunique: defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
Examples
julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> insertcols!(d, 1, b=['a', 'b', 'c'])
3×2 DataFrame
│ Row │ b │ a │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 1 │
│ 2 │ 'b' │ 2 │
│ 3 │ 'c' │ 3 │
julia> insertcols!(d, 1, :c => [2, 3, 4])
3×3 DataFrame
│ Row │ c │ b │ a │
│ │ Int64 │ Char │ Int64 │
├─────┼───────┼──────┼───────┤
│ 1 │ 2 │ 'a' │ 1 │
│ 2 │ 3 │ 'b' │ 2 │
│ 3 │ 4 │ 'c' │ 3 │
DataFrames.mapcols
— Function.mapcols(f::Union{Function,Type}, df::AbstractDataFrame)
Return a DataFrame
where each column of df
is transformed using function f
. f
must return AbstractVector
objects all with the same length or scalars.
Note that mapcols
guarantees not to reuse the columns from df
in the returned DataFrame
. If f
returns its argument then it gets copied before being stored.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
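Two points from the description can be sketched in code: f may return scalars, and columns are never reused. This is an illustrative example; the names totals and same are chosen here.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# f returns a scalar for every column, so the result has a single row.
totals = mapcols(sum, df)
totals.x   # [10]
totals.y   # [50]

# Even when f returns its argument, the stored column is a copy.
same = mapcols(identity, df)
same.x === df.x   # false
```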
Base.names
— Function.names(df::AbstractDataFrame)
Return a `Vector{Symbol}` of names of columns contained in `df`.
DataFrames.nonunique
— Function.nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)
Return a Vector{Bool}
in which true
entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal
).
Arguments
df: the AbstractDataFrame
cols: a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
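With deterministic data the result is easier to follow. The sketch below uses a small made-up data frame and shows one way the returned vector might be used to drop duplicate rows:

```julia
using DataFrames

df = DataFrame(i = [1, 2, 1], x = ["a", "b", "a"])

dup = nonunique(df)       # [false, false, true]: row 3 repeats row 1
by_x = nonunique(df, :x)  # compare on column :x only

# Keep the first occurrence of every row.
deduped = df[.!dup, :]
```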
DataFrames.nrow
— Function.nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame
df
.
See also size
.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
DataFrames.ncol
— Function.nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame
df
.
See also size
.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
DataFrames.rename!
— Function.rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol}; makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString}; makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractArray{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df
in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
df: the AbstractDataFrame
d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
f: a function which for each column takes the old name (a Symbol) and returns the new name that gets converted to a Symbol
vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename!
(as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings in to
and from
is not allowed.
See also: rename
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
│ Row │ a │ b │ a_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df) do x
uppercase(string(x))
end
1×3 DataFrame
│ Row │ A │ B │ A_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
DataFrames.rename
— Function.rename(df::AbstractDataFrame, vals::AbstractVector{Symbol}; makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString}; makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractArray{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df
with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
df: the AbstractDataFrame
d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
f: a function which for each column takes the old name (a Symbol) and returns the new name that gets converted to a Symbol
vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename
(as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings in to
and from
is not allowed.
See also: rename!
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
│ Row │ i │ y │ x │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df) do x
    uppercase(string(x))
end
1×3 DataFrame
│ Row │ I │ X │ Y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
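Unlike rename!, rename leaves its argument untouched. A short sketch of that difference follows. Names are converted with string.() so that the comparison does not depend on whether names returns symbols or strings, which is an assumption about differing package versions.

```julia
using DataFrames

df = DataFrame(i = 1, x = 2, y = 3)

renamed = rename(df, :i => :A)

string.(names(renamed))   # ["A", "x", "y"]
string.(names(df))        # ["i", "x", "y"], the original is unchanged
```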
Base.repeat
— Function.repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df
. inner
specifies how many times each row is repeated, and outer
specifies how many times the full set of rows is repeated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating the full set of rows in df
the number of times specified by count
(the same row order as outer = count).
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
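The row order produced by the different forms can be compared side by side. This sketch only inspects column a of each result:

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

by_count = repeat(df, 2).a           # [1, 2, 1, 2]
by_outer = repeat(df, outer = 2).a   # [1, 2, 1, 2]
by_inner = repeat(df, inner = 2).a   # [1, 1, 2, 2]
```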
DataFrames.select
— Function.select(df::AbstractDataFrame, inds...; copycols::Bool=true)
Create a new data frame that contains columns from df
specified by inds
and return it.
Arguments passed as inds...
can be any index that is allowed for column indexing provided that the columns requested in each of them are unique and present in df
. In particular, regular expressions, All
, Between
, and Not
selectors are supported.
If more than one argument is passed then they are joined as All(inds...)
. Note that All
selects the union of columns passed to it, so columns selected in different inds...
do not have to be unique. For example a call select(df, :col, All())
is valid and creates a new data frame with column :col
moved to be the first, provided it is present in df
.
If df
is a DataFrame
return a new DataFrame
that contains columns from df
specified by inds
. If copycols=true
(the default), then returned DataFrame
holds copies of column vectors in df
. If copycols=false
, then returned DataFrame
shares column vectors with df
.
If df
is a SubDataFrame
then a SubDataFrame
is returned if copycols=false
and a DataFrame
with freshly allocated columns otherwise.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select(df, :b)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
julia> select(df, Not(:b)) # drop column :b from df
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
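The effect of copycols and of a regular expression selector can be sketched as follows. This is an illustration under the assumption that the installed version behaves as described above; the variable names are made up.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

copied = select(df, :b)                     # default copycols=true
shared = select(df, :b, copycols = false)   # shares the column with df

copied.b === df.b   # false
shared.b === df.b   # true

# A regular expression selects every column whose name matches.
only_a = select(df, r"^a")
ncol(only_a)        # 1
```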
DataFrames.select!
— Function.select!(df::DataFrame, inds...)
Mutate df
in place to retain only columns specified by inds...
and return it.
Arguments passed as inds...
can be any index that is allowed for column indexing provided that the columns requested in each of them are unique and present in df
. In particular, regular expressions, All
, Between
, and Not
selectors are supported.
If more than one argument is passed then they are joined as All(inds...)
. Note that All
selects the union of columns passed to it, so columns selected in different inds...
do not have to be unique. For example a call select!(df, :col, All())
is valid and moves column :col
in the data frame to be the first, provided it is present in df
.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select!(df, 2)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
Base.show
— Function.show([io::IO,] df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
splitcols::Bool = get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io
is omitted, the result is printed to stdout
, and allrows
, allcols
and allgroups
default to false
while splitcols
defaults to true
.
Arguments
io::IO: The I/O stream to which df will be printed.
df::AbstractDataFrame: The data frame to print.
allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
summary::Bool = true: Whether to print a brief string summary of the data frame.
Examples
julia> using DataFrames
julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);
julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ x │
│ 2 │ 2 │ y │
│ 3 │ 3 │ z │
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime
.
Arguments
io::IO: The I/O stream to which df will be printed.
mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values"
df::AbstractDataFrame: The data frame to print.
Additionally selected MIME types support passing the following keyword arguments:
- MIME type "text/plain" accepts all listed keyword arguments, and their behavior is identical to that for show(::IO, ::AbstractDataFrame)
- MIME type "text/html" accepts the summary keyword argument, which controls whether to print a brief string summary of the data frame.
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
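When the rendered text is needed as a string rather than on stdout, one option is to go through sprint from Base, which supplies the I/O stream itself. This is a sketch only; the variable names csv_text and tsv_text are illustrative.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# sprint calls show(io, mime, df) on an internal buffer and returns the text.
csv_text = sprint(show, MIME("text/csv"), df)
tsv_text = sprint(show, MIME("text/tab-separated-values"), df)

print(csv_text)
```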
Base.sort
— Function.sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a copy of data frame df sorted by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, (:x, :y))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
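The examples above pass the columns as a tuple. As a sketch of the vector form that the description also allows, the snippet below sorts with a per-column order and then checks that the original data frame is left alone; the name sorted is illustrative.

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

# Ascending on :x, descending on :y within ties of :x.
sorted = sort(df, [:x, order(:y, rev=true)])

# sort returns a copy, so df keeps its original row order.
println(sorted.y)
println(df.x)
```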
Base.sort!
— Function.sort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Sort data frame df by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, (:x, :y))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
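Unlike sort, sort! works on the data frame it is given. A small sketch of that difference, with result as an illustrative name:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

# sort! reorders the rows of df itself and returns that same object.
result = sort!(df, :x, rev=true)

println(result === df)
println(df.x)
```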
Base.unique!
— Function.unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the returned data frame contains complete rows, retaining in each case the first instance for which df[cols] is unique.
When unique is called a new data frame is returned; unique! updates df in-place.
See also nonunique.
Arguments
- df: the AbstractDataFrame
- cols: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
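The example above uses random data, so its result varies between runs. The following sketch uses fixed values instead, so the effect of each call can be followed by hand; the names dup_mask and by_i are illustrative.

```julia
using DataFrames

df = DataFrame(i = [1, 2, 1, 2], x = ["a", "b", "a", "c"])

# Flags rows that duplicate an earlier row: only row 3 repeats row 1.
dup_mask = nonunique(df)

# Comparing on :i alone keeps the first row for each distinct value of i.
by_i = unique(df, :i)

# unique! removes the duplicate row from df itself.
unique!(df)
println(nrow(df))
```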
Base.vcat
— Function.vcat(dfs::AbstractDataFrame...; cols::Union{Symbol, AbstractVector{Symbol}}=:setequal)
Vertically concatenate AbstractDataFrames.
The cols keyword argument determines the columns of the returned data frame:
- :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
- :orderequal: require all data frames to have the same column names and in the same order.
- :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
- :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
- A vector of Symbols: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type, as with vcat for AbstractVectors.
vcat ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.
Example
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = DataFrame(A=7:9, C=7:9);
julia> d4 = DataFrame();
julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64⍰ │ Int64⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │ missing │
│ 3 │ 3 │ 3 │ missing │
│ 4 │ 7 │ missing │ 7 │
│ 5 │ 8 │ missing │ 8 │
│ 6 │ 9 │ missing │ 9 │
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 7 │
│ 5 │ 8 │
│ 6 │ 9 │
julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
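Two points from the description, accumulating onto an empty data frame in a loop and element type promotion, can be sketched as follows. The helper collect_squares and the column names id and sq are hypothetical and exist only for this illustration.

```julia
using DataFrames

# Hypothetical helper: start from an empty data frame and grow it in a loop.
function collect_squares(n)
    acc = DataFrame()
    for i in 1:n
        acc = vcat(acc, DataFrame(id = [i], sq = [i^2]))
    end
    return acc
end

squares = collect_squares(3)

# Element types are promoted: Int64 and Float64 columns combine to Float64.
mixed = vcat(DataFrame(A = [1, 2]), DataFrame(A = [1.5]))
println(eltype(mixed.A))
```

Calling vcat once per iteration copies the accumulated rows each time, so for long loops it may be worth collecting the pieces first and concatenating them in one call.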
Base.append!
— Function.append!(df1::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal)
append!(df::DataFrame, table; cols::Symbol=:setequal)
Add the rows of df2 to the end of df1. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.
Column names of df1 and df2 must be equal. If cols is :setequal (the default) then column names may have different orders and append! is performed by matching column names. If cols is :orderequal then the order of columns in df1 and df2 or table must be the same. In particular, if table is a Dict an error is thrown as it is an unordered collection.
The above rule has the following exceptions:
- If df1 has no columns then copies of columns from df2 are added to it.
- If df2 has no columns then calling append! leaves df1 unchanged.
Values corresponding to new rows are appended in-place to the column vectors of df1. Column types are therefore preserved, and new values are converted if necessary. An error is thrown if conversion fails: this is the case in particular if a column in df2 contains missing values but the corresponding column in df1 does not accept them.
Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Use vcat instead of append! when more flexibility is needed. Since vcat does not operate in place, it is able to use promotion to find an appropriate element type to hold values from both data frames. It also accepts columns in different orders between df1 and df2.
Use push! to add individual rows to a data frame.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4.0:6.0, B=4:6);
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
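The contrast between append! and vcat described above can be sketched with a column that contains missing. This is an illustration under the default keyword arguments; append_failed and combined are illustrative names, and the failing call is made on a copy so that df1 is not touched.

```julia
using DataFrames

df1 = DataFrame(A = 1:3, B = 1:3)
df2 = DataFrame(A = [4, missing], B = [4, 5])

# Column A of df1 holds Int64 only, so appending a missing value fails.
append_failed = try
    append!(copy(df1), df2)
    false
catch
    true
end

# vcat builds new columns and promotes A to Union{Missing, Int64}.
combined = vcat(df1, df2)
println(eltype(combined.A))
```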
Base.push!
— Function.push!(df::DataFrame, row::Union{Tuple, AbstractArray})
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
cols::Symbol=:setequal)
Add in-place one row at the end of df taking the values from row.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
- If cols=:setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
- If cols=:orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches names(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
- If cols=:intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
- If cols=:subset then push! behaves as for :intersect, but if some column is missing in row then a missing value is pushed to df.
As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.
Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Examples
julia> df = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
julia> push!(df, (true, false))
4×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
julia> push!(df, df[1, :])
5×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
│ 6 │ 1 │ 0 │
julia> push!(df, Dict(:A=>1.0, :B=>2.0))
7×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
│ 6 │ 1 │ 0 │
│ 7 │ 1 │ 2 │
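The examples above do not exercise cols=:subset. A minimal sketch follows, assuming the target column is declared to accept missing values (an assumption made here so that the pushed missing can be stored); the field C is an illustrative extra that :intersect is expected to ignore.

```julia
using DataFrames

# B allows missing values so that cols=:subset can fill it in.
df = DataFrame(A = [1, 2], B = Union{Int, Missing}[1, 2])

# :subset tolerates a row lacking B; a missing value is stored instead.
push!(df, (A = 3,), cols=:subset)

# :intersect ignores the extra field C and uses only A and B.
push!(df, (A = 4, B = 5, C = "extra"), cols=:intersect)

println(nrow(df))
```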