Functions
Base.append!
Base.filter
Base.filter!
Base.join
Base.map
Base.repeat
Base.show
Base.sort
Base.sort!
Base.unique!
Base.vcat
DataFrames.aggregate
DataFrames.allowmissing!
DataFrames.by
DataFrames.colwise
DataFrames.combine
DataFrames.completecases
DataFrames.deletecols!
DataFrames.deleterows!
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.eachcol
DataFrames.eachrow
DataFrames.eltypes
DataFrames.groupby
DataFrames.groupindices
DataFrames.groupvars
DataFrames.insertcols!
DataFrames.mapcols
DataFrames.melt
DataFrames.meltdf
DataFrames.names!
DataFrames.nonunique
DataFrames.permutecols!
DataFrames.rename
DataFrames.rename!
DataFrames.stack
DataFrames.stackdf
DataFrames.unstack
StatsBase.describe
Grouping, Joining, and Split-Apply-Combine
DataFrames.aggregate
— Function.
Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame
aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)
Arguments
- d: an AbstractDataFrame
- gd: a GroupedDataFrame
- cols: a column indicator (Symbol, Int, Vector{Symbol}, etc.)
- fs: a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
Each fs should return a value or vector. All returns must be the same length.
Returns
- ::DataFrame
Examples
using Statistics
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
aggregate(df, :a, sum)
aggregate(df, :a, [sum, x->mean(skipmissing(x))])
aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
DataFrames.by
— Function.
by(d::AbstractDataFrame, keys, cols => f...; sort::Bool = false)
by(d::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false)
by(d::AbstractDataFrame, keys, f; sort::Bool = false)
by(f, d::AbstractDataFrame, keys; sort::Bool = false)
Split-apply-combine in one step: apply f to each grouping in d based on grouping columns keys, and return a DataFrame.
keys can be either a single column index, or a vector thereof.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted on keys if sort=true. Otherwise, ordering of rows is undefined.
Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
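The accuracy caveat can be illustrated without DataFrames at all. The sketch below compares Base's sum (pairwise summation) with a plain left-to-right accumulation on Float32 data. The naive_sum helper is hypothetical and is only a stand-in for a simple running-total reduction; it is not taken from the DataFrames internals, and the input data is made up for illustration.

```julia
# Hypothetical helper: plain left-to-right accumulation.
# This is an assumed stand-in for a simple reduction, not DataFrames code.
function naive_sum(xs)
    acc = zero(eltype(xs))
    for x in xs
        acc += x
    end
    return acc
end

xs = fill(0.1f0, 10^6)      # one million Float32 values (toy data)
nominal = 1.0e5             # nominal total, used as the reference value

err_naive = abs(Float64(naive_sum(xs)) - nominal)
err_pairwise = abs(Float64(sum(xs)) - nominal)   # Base.sum uses pairwise summation

println("naive error:    ", err_naive)
println("pairwise error: ", err_pairwise)
```

On data like this the pairwise result stays much closer to the nominal total, which is the kind of difference the col => x -> sum(x) workaround is meant to avoid.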
by(d, cols, f) is equivalent to combine(f, groupby(d, cols)) and to the less efficient combine(map(f, groupby(d, cols))).
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a │ c_function │
│ │ Int64 │ Int64 │
├─────┼───────┼────────────┤
│ 1 │ 1 │ 2 │
│ 2 │ 1 │ 10 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 12 │
│ 5 │ 3 │ 6 │
│ 6 │ 3 │ 14 │
│ 7 │ 4 │ 8 │
│ 8 │ 4 │ 16 │
julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a │ c_sum │ c_sum2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼────────┤
│ 1 │ 1 │ 6 │ 26 │
│ 2 │ 2 │ 8 │ 40 │
│ 3 │ 3 │ 10 │ 58 │
│ 4 │ 4 │ 12 │ 80 │
julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a │ minb │ sumc │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 6 │
│ 2 │ 2 │ 1 │ 8 │
│ 3 │ 3 │ 2 │ 10 │
│ 4 │ 4 │ 1 │ 12 │
DataFrames.colwise
— Function.
Apply a function to each column in an AbstractDataFrame or GroupedDataFrame
colwise(f, d)
Arguments
- f: a function or vector of functions
- d: an AbstractDataFrame or GroupedDataFrame
Returns
- various, depending on the call
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
colwise(sum, df)
colwise([sum, length], df)
colwise((minimum, maximum), df)
colwise(sum, groupby(df, :a))
DataFrames.combine
— Function.
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)
Transform a GroupedDataFrame into a DataFrame.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted if sort=true was passed to the groupby call from which gd was constructed. Otherwise, ordering of rows is undefined.
Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
See by for more examples.
See also
- by(f, df, cols) is a shorthand for combine(f, groupby(df, cols)).
- map: combine(f, groupby(df, cols)) is a more efficient equivalent of combine(map(f, groupby(df, cols))).
DataFrames.groupby
— Function.
A view of an AbstractDataFrame split into row groups
groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)
Arguments
- d: an AbstractDataFrame to split (optional, see Returns)
- cols: data table columns to group by
- sort: whether to sort rows according to the values of the grouping columns cols
- skipmissing: whether to skip rows with missing values in one of the grouping columns cols
Returns
- A GroupedDataFrame: a grouped view into d
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into d. A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).
See the following for additional split-apply-combine operations:
- by: split-apply-combine using functions
- aggregate: split-apply-combine; applies functions in the form of a cross product
- colwise: apply a function to each column in an AbstractDataFrame or GroupedDataFrame
- map: apply a function to each group of a GroupedDataFrame (without combining)
- combine: combine a GroupedDataFrame, optionally applying a function to each group
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
gd = groupby(df, :a)
gd[1]
last(gd)
vcat([g[:b] for g in gd]...)
for g in gd
println(g)
end
DataFrames.groupindices
— Function.
groupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd).
Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
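As a small illustration of the index assignment described above, the sketch below groups a toy data frame and inspects the per-row group indices. The data and variable names are made up for illustration, sort = true is passed only to make the group order predictable, and the code assumes a DataFrames version that exports groupindices as documented here.

```julia
using DataFrames

# Toy data (hypothetical): three distinct keys in column :a
df = DataFrame(a = [1, 2, 1, 2, 3], b = 1:5)
gd = groupby(df, :a, sort = true)   # sort = true makes group order follow the keys
idx = groupindices(gd)              # one entry per row of parent(gd)

# With skipmissing = true, rows whose key is missing belong to no group
dfm = DataFrame(a = [1, missing, 1], b = 1:3)
gdm = groupby(dfm, :a, sort = true, skipmissing = true)
idxm = groupindices(gdm)            # the entry for row 2 is missing
```

Rows 1 and 3 of df share key 1 and therefore share a group index, while row 2 of dfm has no group and is reported as missing.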
DataFrames.groupvars
— Function.
groupvars(gd::GroupedDataFrame)
Return a vector of column names in parent(gd) used for grouping.
Base.join
— Function.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
indicator = nothing, validate = (false, false))
Join two DataFrame objects
Arguments
- df1, df2: the two AbstractDataFrames to be joined
Keyword Arguments
- on: A column, or vector of columns to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be (left, right) tuples or left => right pairs, or a vector of such tuples or pairs. on is a required argument for all joins except for kind = :cross
- kind: the type of join, options include:
  - :inner: only include rows with keys that match in both df1 and df2, the default
  - :outer: include all rows from df1 and df2
  - :left: include all rows from df1
  - :right: include all rows from df2
  - :semi: return rows of df1 that match with the keys in df2
  - :anti: return rows of df1 that do not match with the keys in df2
  - :cross: a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data table will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame.
Result
- ::DataFrame: the joined DataFrame
Examples
name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)
job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)
Base.map
— Function.
map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)
Apply a function to each group of rows and return a GroupedDataFrame.
If the first argument is a cols => f pair, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.
If the first argument is a callable, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting GroupedDataFrame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on.
Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
See by for more examples.
See also
- combine(f, gd) returns a DataFrame rather than a GroupedDataFrame
DataFrames.melt
— Function.
Stacks a DataFrame; convert from a wide to long format; see stack.
DataFrames.stack
— Function.
Stacks a DataFrame; convert from a wide to long format
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
- df: the AbstractDataFrame to be stacked
- measure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars. If neither measure_vars nor id_vars are given, measure_vars defaults to all floating point columns.
- id_vars: the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars
- variable_name: the name of the new stacked column that shall hold the names of each of measure_vars
- value_name: the name of the new stacked column containing the values from each of measure_vars
Result
- ::DataFrame: the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.
See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)
DataFrames.unstack
— Function.
Unstacks a DataFrame; convert from a long to wide format
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)
Arguments
- df: the AbstractDataFrame to be unstacked
- rowkeys: the column(s) with a unique key for each row; if not given, find a key by grouping on anything not a colkey or value
- colkey: the column holding the column names in wide format, defaults to :variable
- value: the value column, defaults to :value
Result
- ::DataFrame: the wide-format DataFrame
If colkey contains missing values then they will be skipped and a warning will be printed.
If a combination of rowkeys and colkey contains duplicate entries then the last value will be retained and a warning will be printed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
Note that there are some differences between the widened results above.
DataFrames.stackdf
— Function.
A stacked view of a DataFrame (long format)
Like stack and melt, but a view is returned rather than data copies.
stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
- df: the wide AbstractDataFrame
- measure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars
- id_vars: the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars
Result
- ::DataFrame: the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.
The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])
DataFrames.meltdf
— Function.
A stacked view of a DataFrame (long format); see stackdf
Basics
DataFrames.allowmissing!
— Function.
allowmissing!(df::DataFrame)
Convert all columns of df from element type T to Union{T, Missing} to support missing values.
allowmissing!(df::DataFrame, col::Union{Integer, Symbol})
Convert a single column of df from element type T to Union{T, Missing} to support missing values.
allowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})
Convert multiple columns of df from element type T to Union{T, Missing} to support missing values.
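A minimal sketch of the single-column method, assuming a toy data frame whose column names and contents are made up for illustration. It widens one column and then stores a missing value in it, while the other column keeps its original element type.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])  # toy data, no missings allowed yet

allowmissing!(df, :x)          # only column :x is widened
df[2, :x] = missing            # now permitted for :x

x_type = eltype(df.x)          # Union{Missing, Int}
y_type = eltype(df.y)          # still String
```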
DataFrames.completecases
— Function.
completecases(df::AbstractDataFrame)
completecases(df::AbstractDataFrame, cols::AbstractVector)
completecases(df::AbstractDataFrame, cols::Union{Integer, Symbol})
Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df. If cols is provided, only missing values in the corresponding columns are considered.
See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> completecases(df)
5-element BitArray{1}:
false
false
false
true
true
julia> completecases(df, :x)
5-element BitArray{1}:
false
true
false
true
true
julia> completecases(df, [:x, :y])
5-element BitArray{1}:
false
false
false
true
true
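The findall(completecases(df)) hint can be sketched as follows, reusing the data from the example above. The variable names mask, rows and kept are illustrative only.

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

mask = completecases(df)    # Boolean vector, one entry per row
rows = findall(mask)        # integer indices of the complete rows
kept = df[mask, :]          # row subset selected with the Boolean mask
```

Here kept contains the same rows that dropmissing(df) would return, which makes the mask useful when the row positions themselves are needed.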
DataFrames.deletecols!
— Function.
deletecols!(df::DataFrame, ind)
Delete columns specified by ind from a DataFrame df in place and return it.
Argument ind can be any index that is allowed for column indexing of a DataFrame provided that the columns requested to be removed are unique.
Examples
julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> deletecols!(d, 1)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
DataFrames.deleterows!
— Function.
deleterows!(df::DataFrame, ind)
Delete rows specified by ind from a DataFrame df in place and return it.
Internally deleteat! is called for all columns so ind must be: a vector of sorted and unique integers, a boolean vector or an integer.
Examples
julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> deleterows!(d, 2)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 6 │
StatsBase.describe
— Function.
Report descriptive statistics for a data frame
describe(df::AbstractDataFrame; stats = [:mean, :min, :median, :max, :nmissing, :nunique, :eltype])
Arguments
- df: the AbstractDataFrame
- stats::Union{Symbol,AbstractVector{Symbol}}: the summary statistics to report. If a vector, allowed fields are :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. If set to :all, all summary statistics are reported.
Result
- A DataFrame where each row represents a variable and each column a summary statistic.
Details
For Real columns, compute the mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.
When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique is reported as nothing.
Missing values are filtered in the calculation of all statistics; however, the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
describe(df)
describe(df, stats = :all)
describe(df, stats = [:min, :max])
DataFrames.disallowmissing!
— Function.
disallowmissing!(df::DataFrame)
Convert all columns of df from element type Union{T, Missing} to T to drop support for missing values.
disallowmissing!(df::DataFrame, col::Union{Integer, Symbol})
Convert a single column of df from element type Union{T, Missing} to T to drop support for missing values.
disallowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})
Convert multiple columns of df from element type Union{T, Missing} to T to drop support for missing values.
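A short sketch of the single-column method on made-up data. The first part narrows a column that allows missing values but holds none. The second part rests on an assumption not stated above: that the conversion raises an error when the column actually contains a missing value, so the call is wrapped in try/catch.

```julia
using DataFrames

# Column declared as Union{Int, Missing} but holding no missing values (toy data)
df = DataFrame(x = Union{Int, Missing}[1, 2, 3])
disallowmissing!(df, :x)
x_type = eltype(df.x)          # Int

# Assumption: the conversion fails when a missing value is actually present
df2 = DataFrame(x = [1, missing, 3])
failed = try
    disallowmissing!(df2, :x)
    false
catch
    true
end
```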
DataFrames.dropmissing
— Function.
dropmissing(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)
Return a copy of data frame df excluding rows with missing values. If cols is provided, only missing values in the corresponding columns are considered.
In the future disallowmissing will be true by default.
See also: completecases and dropmissing!.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing(df)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, disallowmissing=true)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.dropmissing!
— Function.
dropmissing!(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)
Remove rows with missing values from data frame df and return it. If cols is provided, only missing values in the corresponding columns are considered.
In the future disallowmissing will be true by default.
See also: dropmissing and completecases.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> df1 = copy(df);
julia> dropmissing!(df1);
julia> df1
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(df1, disallowmissing=true);
julia> df1
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> df2 = copy(df);
julia> dropmissing!(df2, :x);
julia> df2
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> df3 = copy(df);
julia> dropmissing!(df3, [:x, :y]);
julia> df3
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.eachrow
— Function.
eachrow(df::AbstractDataFrame)
Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> eachrow(df)
4-element DataFrameRows:
DataFrameRow (row 1)
x 1
y 11
DataFrameRow (row 2)
x 2
y 12
DataFrameRow (row 3)
x 3
y 13
DataFrameRow (row 4)
x 4
y 14
julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
DataFrameRow (row 4)
y 14
x 4
DataFrameRow (row 3)
y 13
x 3
DataFrames.eachcol
— Function.eachcol(df::AbstractDataFrame, names::Bool=true)
Return a DataFrameColumns that iterates an AbstractDataFrame column by column. If names is equal to true (currently the default, in the future the default will be set to false) iteration returns a pair consisting of column name and column vector. If names is equal to false then column vectors are yielded.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> collect(eachcol(df, true))
2-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
:x => [1, 2, 3, 4]
:y => [11, 12, 13, 14]
julia> collect(eachcol(df, false))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> sum.(eachcol(df, false))
2-element Array{Int64,1}:
10
50
julia> map(eachcol(df, false)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
3
3
DataFrames.eltypes
— Function.Return element types of columns
eltypes(df::AbstractDataFrame)
Arguments
df : the AbstractDataFrame
Result
::Vector{Type} : the element type of each column
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
eltypes(df)
Base.filter
— Function.filter(function, df::AbstractDataFrame)
Return a copy of data frame df containing only rows for which function returns true. The function is passed a DataFrameRow as its only argument.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter(row -> row[:x] > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
Base.filter!
— Function.filter!(function, df::AbstractDataFrame)
Remove rows from data frame df for which function returns false. The function is passed a DataFrameRow as its only argument.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter!(row -> row[:x] > 1, df);
julia> df
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
DataFrames.insertcols!
— Function.Insert a column into a data frame in place.
insertcols!(df::DataFrame, ind::Int; name=col,
makeunique::Bool=false)
insertcols!(df::DataFrame, ind::Int, (:name => col)::Pair{Symbol,<:AbstractVector};
makeunique::Bool=false)
Arguments
df : the DataFrame to which we want to add a column
ind : a position at which we want to insert a column
name : the name of the new column
col : an AbstractVector giving the contents of the new column
makeunique : Defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
Result
::DataFrame : a DataFrame with added column.
Examples
julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> insertcols!(d, 1, b=['a', 'b', 'c'])
3×2 DataFrame
│ Row │ b │ a │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 1 │
│ 2 │ 'b' │ 2 │
│ 3 │ 'c' │ 3 │
julia> insertcols!(d, 1, :c => [2, 3, 4])
3×3 DataFrame
│ Row │ c │ b │ a │
│ │ Int64 │ Char │ Int64 │
├─────┼───────┼──────┼───────┤
│ 1 │ 2 │ 'a' │ 1 │
│ 2 │ 3 │ 'b' │ 2 │
│ 3 │ 4 │ 'c' │ 3 │
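The effect of makeunique can be sketched as below. This is a hedged illustration, assuming the DataFrames package is installed and using the Pair form of insertcols! shown above; the names threw and new_names are hypothetical, and the exact generated suffix is an assumption based on the _1 convention described for names!.

```julia
using DataFrames

d = DataFrame(a = 1:3)

# Without makeunique, reusing an existing column name is an error
threw = try
    insertcols!(d, 1, :a => [10, 20, 30])
    false
catch
    true
end

# With makeunique=true the inserted column receives a generated name
insertcols!(d, 1, :a => [10, 20, 30], makeunique = true)

# string.() keeps the check independent of how names are represented
new_names = string.(names(d))
```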
DataFrames.mapcols
— Function.mapcols(f::Union{Function,Type}, df::AbstractDataFrame)
Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
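The example above returns a vector per column. The scalar case mentioned in the description can be sketched like this, assuming the DataFrames package is installed (totals is a name chosen for illustration):

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# sum returns a scalar for every column, so the result has a single row
totals = mapcols(sum, df)
```

Here totals is expected to be a one-row data frame holding the column sums.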
DataFrames.names!
— Function.Set column names
names!(df::AbstractDataFrame, vals; makeunique::Bool=false)
Arguments
df : the AbstractDataFrame
vals : column names, normally a Vector{Symbol} the same length as the number of columns in df
makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
Result
::AbstractDataFrame : the updated result
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
names!(df, [:a, :b, :c])
names!(df, [:a, :b, :a]) # throws ArgumentError
names!(df, [:a, :b, :a], makeunique=true) # renames second :a to :a_1
DataFrames.nonunique
— Function.Indexes of duplicate rows (a row that is a duplicate of a prior row)
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)
Arguments
df : the AbstractDataFrame
cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare
Result
::Vector{Bool} : indicates whether the row is a duplicate of some prior row
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
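Because the example above uses random data, its output varies from run to run. A small deterministic sketch, assuming the DataFrames package is installed (dup_all and dup_a are illustrative names), shows how the cols argument changes which rows count as duplicates:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 1, 3, 2],
               b = ["x", "y", "x", "z", "w"])

# compare whole rows: only row 3 repeats an earlier row
dup_all = nonunique(df)

# compare column :a only: rows 3 and 5 repeat earlier values of a
dup_a = nonunique(df, :a)
```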
DataFrames.rename!
— Function.Rename columns
rename!(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename!(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename!(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename!(f::Function, df::AbstractDataFrame)
rename(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename(f::Function, df::AbstractDataFrame)
Arguments
df : the AbstractDataFrame
d : an AbstractDict or an AbstractArray of pairs that maps the original names to new names
f : a function which for each column takes the old name (a Symbol) and returns the new name (a Symbol)
Result
::AbstractDataFrame : the updated result
New names are processed sequentially. A new name must not already exist in the DataFrame at the moment an attempt to rename a column is performed.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
rename(df, :i => :A, :x => :X)
rename(df, [:i => :A, :x => :X])
rename(df, Dict(:i => :A, :x => :X))
rename(x -> Symbol(uppercase(string(x))), df)
rename(df) do x
    Symbol(uppercase(string(x)))
end
rename!(df, Dict(:i => :A, :x => :X))
DataFrames.rename
— Function.Rename columns
rename!(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename!(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename!(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename!(f::Function, df::AbstractDataFrame)
rename(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename(f::Function, df::AbstractDataFrame)
Arguments
df : the AbstractDataFrame
d : an AbstractDict or an AbstractArray of pairs that maps the original names to new names
f : a function which for each column takes the old name (a Symbol) and returns the new name (a Symbol)
Result
::AbstractDataFrame : the updated result
New names are processed sequentially. A new name must not already exist in the DataFrame at the moment an attempt to rename a column is performed.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
rename(df, :i => :A, :x => :X)
rename(df, [:i => :A, :x => :X])
rename(df, Dict(:i => :A, :x => :X))
rename(x -> Symbol(uppercase(string(x))), df)
rename(df) do x
    Symbol(uppercase(string(x)))
end
rename!(df, Dict(:i => :A, :x => :X))
Base.repeat
— Function.repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating the full set of rows in df the number of times specified by count.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
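The row order in both examples follows the same pattern as Base.repeat on a plain vector. A minimal Base-only sketch (no DataFrames needed; v stands in for the row numbers of df, and the names both and twice are illustrative):

```julia
v = [1, 2]

# inner repeats each element in place, outer repeats the whole sequence
both = repeat(v, inner = 2, outer = 3)

# a positional count repeats the whole sequence, like outer
twice = repeat(v, 2)
```

Reading both as row numbers gives the order of the 12-row result above, and twice gives the order of the 4-row result.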
Base.show
— Function.show([io::IO,] df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
splitcols::Bool = get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false while splitcols defaults to true.
Arguments
io::IO : The I/O stream to which df will be printed.
df::AbstractDataFrame : The data frame to print.
allrows::Bool : Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
allcols::Bool : Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
allgroups::Bool : Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
splitcols::Bool : Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
rowlabel::Symbol = :Row : The label to use for the column containing row numbers.
summary::Bool = true : Whether to print a brief string summary of the data frame.
Examples
julia> using DataFrames
julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);
julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ x │
│ 2 │ 2 │ y │
│ 3 │ 3 │ z │
Base.sort
— Function.sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a copy of data frame df sorted by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, (:x, :y))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
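The multi-column orderings shown above can be reproduced on plain vectors, which may help when reasoning about what order(:y, rev=true) does. This is a Base-only sketch, not how DataFrames implements sorting; the comparison function is a hand-written stand-in and the names rows, perm_xy and perm_mixed are illustrative.

```julia
x = [3, 1, 2, 1]
y = ["b", "c", "a", "b"]
rows = collect(zip(x, y))

# ascending on both keys: tuples compare lexicographically
perm_xy = sortperm(rows)

# ascending on x, descending on y within equal x
perm_mixed = sortperm(rows, lt = (r, s) -> r[1] < s[1] || (r[1] == s[1] && r[2] > s[2]))
```

Indexing y with perm_xy and perm_mixed gives the y columns of the sort(df, (:x, :y)) and sort(df, (:x, order(:y, rev=true))) results respectively.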
Base.sort!
— Function.sort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Sort data frame df by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, (:x, :y))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
Base.unique!
— Function.Delete duplicate rows
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Arguments
df : the AbstractDataFrame
cols : column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare.
Result
::AbstractDataFrame : the updated version of df with unique rows.
When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique.
See also nonunique.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
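The behaviour with cols can be sketched deterministically as follows, assuming the DataFrames package is installed (firsts is an illustrative name). Only column a is compared, but whole rows are kept:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2], b = ["x", "y", "z"])

# rows 1 and 2 share a == 1; the first is kept together with its b value
firsts = unique(df, :a)

# unique! applies the same rule in place
unique!(df, :a)
```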
DataFrames.permutecols!
— Function.permutecols!(df::DataFrame, p::AbstractVector)
Permute the columns of df in-place, according to permutation p. Elements of p may be either column indices (Int) or names (Symbol), but cannot be a combination of both. All columns must be listed.
Examples
julia> df = DataFrame(a=1:5, b=2:6, c=3:7)
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
│ 2 │ 2 │ 3 │ 4 │
│ 3 │ 3 │ 4 │ 5 │
│ 4 │ 4 │ 5 │ 6 │
│ 5 │ 5 │ 6 │ 7 │
julia> permutecols!(df, [2, 1, 3]);
julia> df
5×3 DataFrame
│ Row │ b │ a │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 3 │
│ 2 │ 3 │ 2 │ 4 │
│ 3 │ 4 │ 3 │ 5 │
│ 4 │ 5 │ 4 │ 6 │
│ 5 │ 6 │ 5 │ 7 │
julia> permutecols!(df, [:c, :a, :b]);
julia> df
5×3 DataFrame
│ Row │ c │ a │ b │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 1 │ 2 │
│ 2 │ 4 │ 2 │ 3 │
│ 3 │ 5 │ 3 │ 4 │
│ 4 │ 6 │ 4 │ 5 │
│ 5 │ 7 │ 5 │ 6 │
Base.vcat
— Function.vcat(dfs::AbstractDataFrame...)
Vertically concatenate AbstractDataFrames.
Column names in all passed data frames must be the same, but they can have different order. In such cases the order of names in the first passed DataFrame is used.
Example
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
Base.append!
— Function.append!(df1::DataFrame, df2::AbstractDataFrame)
Add the rows of df2 to the end of df1.
Column names must be equal (including order). Values corresponding to new rows are appended in-place to the column vectors of df1. Column types are therefore preserved, and new values are converted if necessary. An error is thrown if conversion fails: this is the case in particular if a column in df2 contains missing values but the corresponding column in df1 does not accept them.
Use vcat instead of append! when more flexibility is needed. Since vcat does not operate in place, it is able to use promotion to find an appropriate element type to hold values from both data frames. It also accepts columns in different orders between df1 and df2.
Use push! to add individual rows to a data frame.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4.0:6.0, B=4:6);
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
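The difference between converting in place and promoting can be seen on plain column vectors. The following is a Base-only analogy rather than DataFrames code; ints, failed and promoted are illustrative names.

```julia
ints = [1, 2, 3]

# append! keeps the element type and converts new values when it can
append!(ints, [4.0, 5.0])

# conversion fails for a value that has no exact Int representation
failed = try
    append!([1, 2, 3], [4.5])
    false
catch err
    err isa InexactError
end

# vcat allocates a new vector and promotes the element type instead
promoted = vcat([1, 2, 3], [4.5])
```

This mirrors the example above, where the Float64 column A of df2 is converted to Int64 on append.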