Functions
Base.append!
Base.copy
Base.delete!
Base.filter
Base.filter!
Base.first
Base.get
Base.hcat
Base.issorted
Base.keys
Base.last
Base.length
Base.names
Base.ndims
Base.pairs
Base.parent
Base.propertynames
Base.push!
Base.repeat
Base.show
Base.similar
Base.size
Base.sort
Base.sort!
Base.sortperm
Base.unique
Base.unique!
Base.vcat
CategoricalArrays.categorical
Compat.eachcol
Compat.eachrow
DataAPI.describe
DataFrames.DataFrame!
DataFrames.allowmissing!
DataFrames.antijoin
DataFrames.categorical!
DataFrames.combine
DataFrames.completecases
DataFrames.crossjoin
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.flatten
DataFrames.groupby
DataFrames.groupcols
DataFrames.groupindices
DataFrames.innerjoin
DataFrames.insertcols!
DataFrames.leftjoin
DataFrames.mapcols
DataFrames.mapcols!
DataFrames.ncol
DataFrames.nonunique
DataFrames.nrow
DataFrames.order
DataFrames.outerjoin
DataFrames.rename
DataFrames.rename!
DataFrames.repeat!
DataFrames.rightjoin
DataFrames.select
DataFrames.select!
DataFrames.semijoin
DataFrames.stack
DataFrames.transform
DataFrames.transform!
DataFrames.unstack
DataFrames.valuecols
Missings.allowmissing
Missings.disallowmissing
Joining, Grouping, and Split-Apply-Combine
DataFrames.innerjoin - Function
innerjoin(df1, df2; on, makeunique = false,
          validate = (false, false))
innerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))
Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate keyword argument is applied recursively with left associativity.
See also: leftjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> innerjoin(name, job2, on = :ID => :identifier)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
julia> innerjoin(name, job2, on = [:ID => :identifier])
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
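The validate and makeunique keywords described above are not exercised by the examples. The following is a minimal sketch of how they might be used; the tables jobs_dup and alias are hypothetical, and the sketch only assumes that a failed validation raises an error (the exact error type is not stated in this reference).

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
# Hypothetical table in which ID 1 appears twice, so ID is not a unique key.
jobs_dup = DataFrame(ID = [1, 1, 2], Job = ["Lawyer", "Judge", "Doctor"])

# Without validation, a duplicated key yields one output row per match.
joined = innerjoin(name, jobs_dup, on = :ID)

# validate = (false, true) checks key uniqueness only in the second data frame.
validation_failed = try
    innerjoin(name, jobs_dup, on = :ID, validate = (false, true))
    false
catch
    true
end

# Both inputs have a non-key column called Name; makeunique = true
# suffixes the second one instead of raising an error.
alias = DataFrame(ID = [1, 2], Name = ["JD", "Jane"])
renamed = innerjoin(name, alias, on = :ID, makeunique = true)
```

Turning validation on for only one side is a cheap way to assert a many-to-one relationship before joining.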
DataFrames.leftjoin - Function
leftjoin(df1, df2; on, makeunique = false,
         indicator = nothing, validate = (false, false))
Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name, for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> leftjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
julia> leftjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
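The description of on says that names and pairs can be mixed when joining on several columns. A minimal sketch of that case follows; the people and jobs tables and their column names are hypothetical. The result is sorted explicitly because this sketch does not rely on any particular row order in the join output.

```julia
using DataFrames

people = DataFrame(ID = [1, 1, 2], Year = [2019, 2020, 2020],
                   Name = ["John Doe", "John Doe", "Jane Doe"])
jobs = DataFrame(ID = [1, 2], yr = [2020, 2020], Job = ["Lawyer", "Doctor"])

# Join on ID (same name in both inputs) and on Year matched against yr.
result = leftjoin(people, jobs, on = [:ID, :Year => :yr])

# Sort so that the rows can be inspected in a predictable order.
result = sort(result, [:ID, :Year])

# The 2019 row has no match in jobs, so its Job entry is missing.
unmatched_job = result.Job[1]
```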
DataFrames.rightjoin - Function
rightjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))
Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, leftjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> rightjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
julia> rightjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
DataFrames.outerjoin - Function
outerjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))
outerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))
Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported and the validate keyword argument is applied recursively with left associativity.
See also: innerjoin, leftjoin, rightjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> outerjoin(name, job2, on = :ID => :identifier)
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
julia> outerjoin(name, job2, on = [:ID => :identifier])
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
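The statement that joins of more than two data frames are performed recursively with left associativity can be checked with a small sketch. The city table is hypothetical; results are sorted by ID before comparison because the sketch does not assume any particular row order.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
# Hypothetical third table sharing the ID key.
city = DataFrame(ID = [2, 5], City = ["Oslo", "Lima"])

# One call with three data frames ...
all_rows = outerjoin(name, job, city, on = :ID)

# ... compared with the explicitly nested, left-associative form.
nested = outerjoin(outerjoin(name, job, on = :ID), city, on = :ID)

# isequal treats missing values as equal to each other, unlike ==.
same_result = isequal(sort(all_rows, :ID), sort(nested, :ID))
```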
DataFrames.antijoin - Function
antijoin(df1, df2; on, makeunique = false, validate = (false, false))
Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> antijoin(name, job, on = :ID)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
DataFrames.semijoin - Function
semijoin(df1, df2; on, makeunique = false, validate = (false, false))
Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> semijoin(name, job, on = :ID)
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
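Since a semi join keeps the rows of df1 whose keys occur in df2 and an anti join keeps the rows whose keys do not, the two results together should cover every row of df1 exactly once. A minimal sketch of that relationship, reusing the tables from the examples:

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

matched = semijoin(name, job, on = :ID)    # rows of name with a key present in job
unmatched = antijoin(name, job, on = :ID)  # rows of name with no such key

# Neither result carries columns from job; both have the columns of name only.
recombined_ids = sort(vcat(matched.ID, unmatched.ID))
```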
DataFrames.crossjoin - Function
crossjoin(df1, df2, dfs...; makeunique = false)
Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If more than two data frames are passed, the join is performed recursively with left associativity.
See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin.
Examples
julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
│ Row │ X │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
│ Row │ Y │
│ │ String │
├─────┼────────┤
│ 1 │ a │
│ 2 │ b │
julia> crossjoin(df1, df2)
6×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 2 │ b │
│ 5 │ 3 │ a │
│ 6 │ 3 │ b │
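Two consequences of the description above can be sketched briefly: the row count of a cross join is the product of the input row counts, and makeunique = true is needed when the inputs share a column name. The df3 table is hypothetical.

```julia
using DataFrames

df1 = DataFrame(X = 1:3)
df2 = DataFrame(Y = ["a", "b"])
df3 = DataFrame(Z = [true, false])  # hypothetical third input

# Cartesian product of all three inputs: 3 * 2 * 2 rows.
product3 = crossjoin(df1, df2, df3)

# Crossing a data frame with itself duplicates the column name X,
# so makeunique = true is required; the second column becomes X_1.
pairs_of_x = crossjoin(df1, df1, makeunique = true)
```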
DataFrames.combine - Function
combine(df::AbstractDataFrame, args...)
Create a new data frame that contains columns from df specified by args and return it. The number of rows in the result is determined by the values returned by the passed transformations.
See select for detailed rules regarding accepted values for args.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> combine(df, :a => sum, nrow)
1×2 DataFrame
│ Row │ a_sum │ nrow │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 3 │
combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, gd::GroupedDataFrame;
keepkeys::Bool=true, ungroup::Bool=true)
combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, df::AbstractDataFrame, ungroup::Bool=true)
combine(pair::Pair, df::AbstractDataFrame, ungroup::Bool=true)
Apply operations to each group in a GroupedDataFrame and return the combined result as a DataFrame if ungroup=true or GroupedDataFrame if ungroup=false.
If an AbstractDataFrame is passed, apply operations to the data frame as a whole; in this case a DataFrame is always returned.
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer, :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Transformations allowed using Pairs follow the rules specified for select and have the form source_cols => fun, source_cols => fun => target_col, or source_col => target_col. Function fun is passed SubArray views as positional arguments for each column specified to be selected, or a NamedTuple containing these SubArrays if source_cols is an AsTable selector. It can return a vector or a single value (defined precisely below).
As a special case nrow or nrow => target_col can be passed without specifying input columns to efficiently calculate number of rows in each group. If nrow is passed the resulting column name is :nrow.
If multiple args are passed then return values of different funs are allowed to mix single values and vectors. In this case single values will be broadcasted to match the length of columns specified by returned vectors. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
If the first or last argument is pair then it must be a Pair following the rules for pairs described above, except that in this case function defined by fun can return any return value defined below.
If the first or last argument is a function fun, it is passed a SubDataFrame view for each group and can return any return value defined below. Note that this form is slower than pair or args due to type instability.
If gd
has zero groups then no transformations are applied.
fun can return a single value, a row, a vector, or multiple rows. The type of the returned value determines the shape of the resulting DataFrame. There are four kinds of return values allowed:
- A single value gives a DataFrame with a single additional column and one row per group.
- A named tuple of single values or a DataFrameRow gives a DataFrame with one additional column for each field and one row per group (returning a named tuple will be faster). It is not allowed to mix single values and vectors if a named tuple is returned.
- A vector gives a DataFrame with a single additional column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a DataFrame with the same additional columns and as many rows for each group as the rows returned for that group (returning a named tuple is the fastest option). Returning a table with zero columns is allowed, whatever the number of columns returned for other groups.
fun must always return the same kind of object (out of the four kinds defined above) for all groups, and with the same column names.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the Pair syntax (e.g. :col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
Column names are automatically generated when necessary using the rules defined in select if the Pair syntax is used and fun returns a single value or a vector (e.g. for :col => sum the column name is col_sum); otherwise (if fun is a function or a return value is an AbstractMatrix) columns are named x1, x2 and so on.
If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal. If keepkeys=false and some generated columns have the same name as grouping columns, they are kept and are not required to be equal to grouping columns.
If ungroup=true (the default) a DataFrame is returned. If ungroup=false a GroupedDataFrame grouped using groupcols(gd) is returned.
Ordering of rows follows the order of groups in gd.
See also: groupby, select, select!, transform, transform!
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 6 │ 2 │
│ 2 │ 2 │ 8 │ 2 │
│ 3 │ 3 │ 10 │ 2 │
│ 4 │ 4 │ 12 │ 2 │
julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 6 │ 2 │
⋮
Last Group (1 row): a = 4
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 12 │ 2 │
julia> combine(sdf -> sum(sdf.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(gd) do d # do syntax for the slower variant
           sum(d.c)
       end
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 1.60944 │
│ 2 │ 2 │ 2.48491 │
│ 3 │ 3 │ 3.04452 │
│ 4 │ 4 │ 3.46574 │
julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 4 │ 6 │
│ 2 │ 2 │ 2 │ 8 │
│ 3 │ 3 │ 4 │ 10 │
│ 4 │ 4 │ 2 │ 12 │
julia> combine(gd) do sdf # dropping group when DataFrame() is returned
sdf.c[1] != 1 ? sdf : DataFrame()
end
6×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
│ 3 │ 3 │ 2 │ 3 │
│ 4 │ 3 │ 2 │ 7 │
│ 5 │ 4 │ 1 │ 4 │
│ 6 │ 4 │ 1 │ 8 │
julia> combine(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 3 │
│ 2 │ 2 │ 5 │ 7 │
│ 3 │ 1 │ 2 │ 3 │
│ 4 │ 1 │ 6 │ 7 │
│ 5 │ 2 │ 3 │ 5 │
│ 6 │ 2 │ 7 │ 9 │
│ 7 │ 1 │ 4 │ 5 │
│ 8 │ 1 │ 8 │ 9 │
julia> combine(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a │ b │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 6 │
│ 2 │ 1 │ 2 │ 6 │
│ 3 │ 2 │ 1 │ 8 │
│ 4 │ 2 │ 1 │ 8 │
│ 5 │ 3 │ 2 │ 10 │
│ 6 │ 3 │ 2 │ 10 │
│ 7 │ 4 │ 1 │ 12 │
│ 8 │ 4 │ 1 │ 12 │
julia> combine(gd, [:b, :c] .=> Ref)
4×3 DataFrame
│ Row │ a │ b_Ref │ c_Ref │
│ │ Int64 │ SubArra… │ SubArra… │
├─────┼───────┼──────────┼──────────┤
│ 1 │ 1 │ [2, 2] │ [1, 5] │
│ 2 │ 2 │ [1, 1] │ [2, 6] │
│ 3 │ 3 │ [2, 2] │ [3, 7] │
│ 4 │ 4 │ [1, 1] │ [4, 8] │
julia> combine(gd, AsTable(:) => Ref)
4×2 DataFrame
│ Row │ a │ a_b_c_Ref │
│ │ Int64 │ NamedTuple… │
├─────┼───────┼──────────────────────────────────────┤
│ 1 │ 1 │ (a = [1, 1], b = [2, 2], c = [1, 5]) │
│ 2 │ 2 │ (a = [2, 2], b = [1, 1], c = [2, 6]) │
│ 3 │ 3 │ (a = [3, 3], b = [2, 2], c = [3, 7]) │
│ 4 │ 4 │ (a = [4, 4], b = [1, 1], c = [4, 8]) │
julia> combine(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a │ b │ c │ b_c_sum │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 2 │ 5 │ 7 │
│ 3 │ 2 │ 1 │ 2 │ 3 │
│ 4 │ 2 │ 1 │ 6 │ 7 │
│ 5 │ 3 │ 2 │ 3 │ 5 │
│ 6 │ 3 │ 2 │ 7 │ 9 │
│ 7 │ 4 │ 1 │ 4 │ 5 │
│ 8 │ 4 │ 1 │ 8 │ 9 │
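Two of the rules above lend themselves to a short sketch: a single value returned by one transformation is broadcast against a vector returned by another, and a function returning a named tuple of single values produces one column per field. The column names total, share, cmin and cmax are chosen here for illustration only.

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = [2]),
               b = repeat([2, 1], outer = [4]),
               c = 1:8)
gd = groupby(df, :a)

# :c yields a vector per group, :c => sum yields a single value per group;
# the single value is broadcast to the length of the vector.
shares = combine(gd, :c, :c => sum => :total)
shares.share = shares.c ./ shares.total

# Function form: a named tuple of single values gives one row per group
# and one column per field (here cmin and cmax).
stats = combine(sdf -> (cmin = minimum(sdf.c), cmax = maximum(sdf.c)), gd)
```

The shares within each group add up to one, which is a quick sanity check that the group total was broadcast to every row of its group.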
DataFrames.groupby - Function
groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)
Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.
Arguments
- df: an AbstractDataFrame to split
- cols: data frame columns to group by. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- sort: whether to sort groups according to the values of the grouping columns cols; if all cols are CategoricalVectors then groups are always sorted irrespective of the value of sort
- skipmissing: whether to skip groups with missing values in one of the grouping columns cols
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.
cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.
A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).
GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices, but this will be slower than using the equivalent GroupKey.
See also: combine, select, select!, transform, transform!
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
⋮
Last Group (2 rows): a = 4
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[1]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> last(gd)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[(a=3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> gd[(3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> k = first(keys(gd))
GroupKey: (a = 1,)
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> for g in gd
println(g)
end
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
DataFrames.groupindices
— Functiongroupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd)
.
Rows appearing in group gd[i]
are attributed index i
. Rows not present in any group are attributed missing
(this can happen if skipmissing=true
was passed when creating gd
, or if gd
is a subset from a larger GroupedDataFrame
).
DataFrames.groupcols
— Functiongroupcols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
used for grouping.
DataFrames.valuecols
— Functionvaluecols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
not used for grouping.
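The three accessors above (groupindices, groupcols and valuecols) have no examples of their own, so here is a minimal sketch. The data frame and the variable names are made up for illustration; only the function calls come from the descriptions above.

```julia
using DataFrames

# Hypothetical data: the second row has a missing grouping value.
df = DataFrame(a = [1, missing, 2, 1], b = 1:4)

# skipmissing=true drops the group whose key is missing,
# so the second row ends up in no group at all.
gd = groupby(df, :a, skipmissing = true)

idx = groupindices(gd)   # one entry per row of parent(gd)
gcols = groupcols(gd)    # columns used for grouping
vcols = valuecols(gd)    # all remaining columns

println(idx)     # rows 1 and 4 share a group; row 2 maps to missing
println(gcols)
println(vcols)
```

The missing entry in idx is the case mentioned in the groupindices description: the row exists in parent(gd) but belongs to no group.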
Base.keys
— Functionkeys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame
gd
as a GroupKeys
object. Each key is a GroupKey
, which behaves like a NamedTuple
holding the values of the grouping columns for a given group. Unlike the equivalent Tuple
and NamedTuple
, these keys can be used to index into gd
efficiently. The ordering of the keys is identical to the ordering of the groups of gd
under iteration and integer indexing.
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
b = repeat([2, 1], outer=[6]),
c = 1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
⋮
Last Group (2 rows): a = :baz, b = 1
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 1 │ 6 │
│ 2 │ baz │ 1 │ 12 │
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
GroupKey
objects behave similarly to NamedTuple
s:
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
(:a, :b)
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame
:
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
julia> gd[keys(gd)[1]] == gd[1]
true
keys(dfc::DataFrameColumns)
Get a vector of column names of dfc
as Symbol
s.
Base.get
— Functionget(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns, returning default if no such group exists.
key
may be a NamedTuple
or Tuple
of grouping column values (in the same order as the cols
argument to groupby
).
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
b = repeat([2, 1], outer=[3]),
c = 1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 1 │ 4 │
⋮
Last Group (2 rows): a = :baz
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ bar │ 1 │ 2 │
│ 2 │ bar │ 2 │ 5 │
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (:qux,), nothing)
DataFrames.stack
— Functionstack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=CategoricalValue{String})
Stack a data frame df
, i.e. convert it from wide to long format.
Return the long-format DataFrame with: columns for each of the id_vars, column value_name (:value by default) holding the values of the stacked columns (measure_vars), and column variable_name (:variable by default) a vector holding the name of the corresponding measure_vars variable.
If view=true
then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors
that return views into the original data frame.
Arguments
- df: the AbstractDataFrame to be stacked
- measure_vars: the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars are given, measure_vars defaults to all floating point columns.
- id_vars: the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars
- variable_name: the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
- value_name: the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
- view: whether the stacked data frame should be a view rather than contain freshly allocated vectors.
- variable_eltype: determines the element type of column variable_name. By default a categorical vector of strings is created. If variable_eltype=Symbol it is a vector of Symbol, and if variable_eltype=String a vector of String is produced.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
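The examples above use random data and show no output, so the following is a small deterministic sketch of the same call pattern. The data frame and the column names measure and reading are hypothetical. The code inspects columns by name rather than by position, since the position of the identifier columns in the result may depend on the package version.

```julia
using DataFrames

# Hypothetical wide table: one identifier column and two measurements.
wide = DataFrame(id = 1:2, x = [1.0, 2.0], y = [3.0, 4.0])

# Stack x and y, repeating id for every stacked value.
long = stack(wide, [:x, :y], :id)

println(size(long))              # one row per (id, measurement) pair
println(long.id)
println(string.(long.variable))  # names of the stacked columns
println(long.value)

# The same call with custom names for the two new columns.
renamed = stack(wide, [:x, :y], :id,
                variable_name = :measure, value_name = :reading)
println(propertynames(renamed))
```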
DataFrames.unstack
— Functionunstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)
Unstack data frame df
, i.e. convert it from long to wide format.
If colkey
contains missing
values then they will be skipped and a warning will be printed.
If the combination of rowkeys and colkey contains duplicate entries then the last value will be retained and a warning will be printed.
Arguments
- df: the AbstractDataFrame to be unstacked
- rowkeys: the columns with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- colkey: the column (Symbol, string or integer) holding the column names in wide format, defaults to :variable
- value: the value column (Symbol, string or integer), defaults to :value
- renamecols: a function called on each unique value in colkey which must return the name of the column to be created (typically as a string or a Symbol). Duplicate names are not allowed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
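Note that the widened results above differ in which identifier columns they keep: wide0 and wide1 infer the row keys from every column that is not colkey or value (here id, a and b), wide2 keeps only id, wide3 keeps id and a, and wide4 additionally prefixes the names of the newly created columns with an underscore.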
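As a deterministic counterpart to the random-data examples, here is a minimal sketch of unstacking a hand-written long table. The table contents are hypothetical; the calls follow the signatures listed above.

```julia
using DataFrames

# Hypothetical long table: two measurements (x, y) for each id.
long = DataFrame(id = [1, 1, 2, 2],
                 variable = ["x", "y", "x", "y"],
                 value = [1.0, 3.0, 2.0, 4.0])

# One row per id, one new column per distinct entry of :variable.
wide = unstack(long, :id, :variable, :value)
println(propertynames(wide))
println(wide.x)
println(wide.y)

# renamecols maps each distinct colkey entry to the new column name.
prefixed = unstack(long, :id, :variable, :value,
                   renamecols = x -> Symbol(:_, x))
println(propertynames(prefixed))
```

The new columns allow missing values, because a combination of row key and column key may be absent from the long table.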
Basics
Missings.allowmissing
— Functionallowmissing(df::AbstractDataFrame, cols=:)
Return a copy of data frame df
with columns cols
converted to element type Union{T, Missing}
from T
to allow support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
Examples
julia> df = DataFrame(a=[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> allowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
DataFrames.allowmissing!
— Functionallowmissing!(df::DataFrame, cols=:)
Convert columns cols
of data frame df
from element type T
to Union{T, Missing}
to support missing values.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
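No example accompanies allowmissing!, so the sketch below shows the in-place variant on a made-up data frame. Only column a is converted, which is why a missing value can be stored there afterwards.

```julia
using DataFrames

# Hypothetical data frame with two columns that do not allow missing.
df = DataFrame(a = [1, 2], b = ["x", "y"])

# Convert only column a, in place; column b is left as it is.
allowmissing!(df, :a)

println(eltype(df.a))   # now a Union with Missing
println(eltype(df.b))   # unchanged

# Storing a missing value in column a is now possible.
df[1, :a] = missing
println(df.a)
```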
Base.append!
— Functionappend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2
to the end of df
. If the second argument table
is not an AbstractDataFrame
then it is converted using DataFrame(table, copycols=false)
before being appended.
The exact behavior of append!
depends on the cols
argument:
- If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
- If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
- If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
- If cols == :subset then append! behaves like for :intersect but if some column is missing in df2 then a missing value is pushed to df.
- If cols == :union then append! adds columns missing in df that are present in df2, for columns present in df but missing in df2 a missing value is pushed.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
The above rule has the following exceptions:
- If df has no columns then copies of columns from df2 are added to it.
- If df2 has no columns then calling append! leaves df unchanged.
Please note that append!
must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
See also
Use push!
to add individual rows to a data frame and vcat
to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4.0:6.0, B=4:6);
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
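The example above covers only the default cols=:setequal. The following sketch, on made-up data, illustrates two of the other behaviors described earlier: cols=:union, which adds a column that exists only in the appended table, and promote=true, which widens a column's element type instead of throwing an error.

```julia
using DataFrames

# cols=:union: column B exists only in the appended table, so it is
# added to base and filled with missing for the pre-existing rows.
base = DataFrame(A = [1, 2])
extra = DataFrame(A = [3, 4], B = ["x", "y"])
append!(base, extra, cols = :union)
println(base)

# promote=true: 2.5 does not fit into an Int column, so column A is
# replaced by a freshly allocated Float64 column.
ints = DataFrame(A = [1, 2])
append!(ints, DataFrame(A = [2.5]), promote = true)
println(eltype(ints.A))
println(ints.A)
```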
CategoricalArrays.categorical
— Functioncategorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Return a copy of data frame df
with columns cols
converted to CategoricalVector
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers) or a Type
.
If categorical
is called with the cols
argument being a Type
, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}
) will be converted to categorical.
If the compress
keyword argument is set to true
then the created CategoricalVector
s will be compressed.
All created CategoricalVector
s are unordered.
Examples
julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Cat… │
├─────┼───────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df, :)
2×2 DataFrame
│ Row │ a │ b │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
DataFrames.categorical!
— Functioncategorical!(df::DataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Change columns selected by cols
in data frame df
to CategoricalVector
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers) or a Type
.
If categorical!
is called with the cols
argument being a Type
, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}
) will be converted to categorical.
If the compress
keyword argument is set to true
then the created CategoricalVector
s will be compressed.
All created CategoricalVector
s are unordered.
Examples
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ Cat… │ Int64 │ Cat… │
├─────┼──────┼───────┼──────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
CategoricalValue{String,UInt32}
Int64
CategoricalValue{String,UInt32}
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Cat… │ String │
├─────┼────────┼──────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
String
CategoricalValue{Int64,UInt8}
String
DataFrames.completecases
— Functioncompletecases(df::AbstractDataFrame, cols=:)
Return a Boolean vector with true
entries indicating rows without missing values (complete cases) in data frame df
.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
See also: dropmissing
and dropmissing!
. Use findall(completecases(df))
to get the indices of the rows.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> completecases(df)
5-element BitArray{1}:
false
false
false
true
true
julia> completecases(df, :x)
5-element BitArray{1}:
false
true
false
true
true
julia> completecases(df, [:x, :y])
5-element BitArray{1}:
false
false
false
true
true
Base.copy
— Functioncopy(df::DataFrame; copycols::Bool=true)
Copy data frame df
. If copycols=true
(the default), return a new DataFrame
holding copies of column vectors in df
. If copycols=false
, return a new DataFrame
sharing column vectors with df
.
copy(dfr::DataFrameRow)
Construct a NamedTuple
with the same contents as the DataFrameRow
. This method returns a NamedTuple
so that the returned object is not affected by changes to the parent data frame of which dfr
is a view.
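The difference between the two copycols settings, and the snapshot behavior of copying a DataFrameRow, can be sketched as follows. The data frame is hypothetical.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

deep = copy(df)                        # columns are copied
shallow = copy(df, copycols = false)   # columns are shared with df

# Mutating the original column is visible only through the shared copy.
df.a[1] = 100
println(deep.a[1])
println(shallow.a[1])

# copy on a DataFrameRow gives a NamedTuple that later changes
# to the parent data frame do not affect.
row = copy(df[2, :])
df[2, :a] = 200
println(row.a)
```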
DataFrames.DataFrame!
— FunctionDataFrame!(args...; kwargs...)
Equivalent to DataFrame(args...; copycols=false, kwargs...)
.
If kwargs
contains the copycols
keyword argument an error is thrown.
Examples
julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df2 = DataFrame!(df1);
julia> df1.a === df2.a
true
Base.delete!
— Functiondelete!(df::DataFrame, inds)
Delete rows specified by inds
from a DataFrame
df
in place and return it.
Internally deleteat!
is called for all columns so inds
must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not
.
Examples
julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> delete!(d, 2)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 6 │
DataAPI.describe
— Functiondescribe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame
where each row represents a variable and each column a summary statistic.
Arguments
- df: the AbstractDataFrame
- stats::Union{Symbol, Pair}...: the summary statistics to report. Arguments can be:
  - A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nunique, :nmissing, and :eltype.
  - :all as the only Symbol argument to return all statistics.
  - A name => function pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
- cols: a keyword argument allowing to select only a subset of columns from df to describe. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
Details
For Real
columns, compute the mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. If a column does not derive from Real
, describe
will attempt to calculate all statistics, using nothing
as a fall-back in the case of an error.
When stats
contains :nunique
, describe
will report the number of unique values in a column. If a column's base type derives from Real
, :nunique
will return nothing
s.
Missing values are filtered in the calculation of all statistics, however the column :nmissing
will report the number of missing values of that variable. If the column does not allow missing values, nothing
is returned. Consequently, nmissing = 0
indicates that the column allows missing values, but does not currently contain any.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing
: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')
10×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Float64 │ Char │
├─────┼───────┼─────────┼──────┤
│ 1 │ 1 │ 0.1 │ 'a' │
│ 2 │ 2 │ 0.2 │ 'b' │
│ 3 │ 3 │ 0.3 │ 'c' │
│ 4 │ 4 │ 0.4 │ 'd' │
│ 5 │ 5 │ 0.5 │ 'e' │
│ 6 │ 6 │ 0.6 │ 'f' │
│ 7 │ 7 │ 0.7 │ 'g' │
│ 8 │ 8 │ 0.8 │ 'h' │
│ 9 │ 9 │ 0.9 │ 'i' │
│ 10 │ 10 │ 1.0 │ 'j' │
julia> describe(df)
3×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼────────┼─────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1 │ i │ 5.5 │ 1 │ 5.5 │ 10 │ │ │ Int64 │
│ 2 │ x │ 0.55 │ 0.1 │ 0.55 │ 1.0 │ │ │ Float64 │
│ 3 │ y │ │ 'a' │ │ 'j' │ 10 │ │ Char │
julia> describe(df, :min, :max)
3×3 DataFrame
│ Row │ variable │ min │ max │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 10 │
│ 2 │ x │ 0.1 │ 1.0 │
│ 3 │ y │ 'a' │ 'j' │
julia> describe(df, :min, :sum => sum)
3×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 55 │
│ 2 │ x │ 0.1 │ 5.5 │
│ 3 │ y │ 'a' │ │
julia> describe(df, :min, :sum => sum, cols=:x)
1×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┤
│ 1 │ x │ 0.1 │ 5.5 │
Missings.disallowmissing
— Functiondisallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)
Return a copy of data frame df
with columns cols
converted from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
Examples
julia> df = DataFrame(a=Union{Int,Missing}[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> disallowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> df = DataFrame(a=[1,missing], b=Union{Int,Missing}[1,2])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
julia> disallowmissing(df, error=false)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
DataFrames.disallowmissing!
— Functiondisallowmissing!(df::DataFrame, cols=:; error::Bool=true)
Convert columns cols
of data frame df
from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
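A minimal sketch of the in-place variant with error=false, using a made-up data frame in which only one column actually contains a missing value:

```julia
using DataFrames

# Both columns allow missing, but only b actually holds one.
df = DataFrame(a = Union{Int, Missing}[1, 2], b = [1, missing])

# With error=false the column holding a missing value is skipped
# instead of raising an error; column a is still converted.
disallowmissing!(df, error = false)

println(eltype(df.a))   # no longer allows missing
println(eltype(df.b))   # left unchanged
```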
DataFrames.dropmissing
— Functiondropmissing(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Return a copy of data frame df
excluding rows with missing values.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If disallowmissing
is true
(the default) then columns specified in cols
will be converted so as not to allow for missing values using disallowmissing!
.
See also: completecases
and dropmissing!
.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing(df)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.dropmissing!
— Functiondropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Remove rows with missing values from data frame df
and return it.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If disallowmissing
is true
(the default) then the cols
columns will get converted using disallowmissing!
.
See also: dropmissing
and completecases
.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df))
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
Compat.eachcol
— Functioneachcol(df::AbstractDataFrame)
Return a DataFrameColumns
that is an AbstractVector
that allows iterating an AbstractDataFrame
column by column. Additionally it is allowed to index DataFrameColumns
using column names.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Array{Int64,1}:
3
3
julia> sum.(eachcol(df))
2-element Array{Int64,1}:
10
50
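The description above also mentions indexing a DataFrameColumns object by column name, which the examples do not show. A short sketch of that, reusing the same data frame (the variable name cols is just for illustration):

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)
cols = eachcol(df)

# Index by position or by column name; both return the stored column
# itself, not a copy.
println(cols[1] === df.x)
println(cols[:y] === df.y)

# keys gives the column names as Symbols.
println(keys(cols))
```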
Compat.eachrow
— Functioneachrow(df::AbstractDataFrame)
Return a DataFrameRows
that iterates a data frame row by row, with each row represented as a DataFrameRow
.
Because DataFrameRow
s have an eltype
of Any
, use copy(dfr::DataFrameRow)
to obtain a named tuple, which supports iteration and property access like a DataFrameRow
, but also passes information on the eltypes
of the columns of df
.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> eachrow(df)
4-element DataFrameRows:
DataFrameRow (row 1)
x 1
y 11
DataFrameRow (row 2)
x 2
y 12
DataFrameRow (row 3)
x 3
y 13
DataFrameRow (row 4)
x 4
y 14
julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
DataFrameRow (row 4)
y 14
x 4
DataFrameRow (row 3)
y 13
x 3
Base.filter
— Functionfilter(function, df::AbstractDataFrame)
filter(cols => function, df::AbstractDataFrame)
Return a copy of data frame df
containing only rows for which function
returns true
.
If cols
is not specified then the function is passed DataFrameRow
s.
If cols
is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
Passing cols
leads to a more efficient execution of the operation for large data frames.
See also: filter!
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter(row -> row.x > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
Base.filter!
— Functionfilter!(function, df::AbstractDataFrame)
filter!(cols => function, df::AbstractDataFrame)
Remove rows from data frame df
for which function
returns false
.
If cols
is not specified then the function is passed DataFrameRow
s. If cols
is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
Passing cols
leads to a more efficient execution of the operation for large data frames.
See also: filter
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter!(row -> row.x > 1, df);
julia> df
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter!(:x => x -> x == 3, df);
julia> df
1×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df);
julia> df
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
DataFrames.flatten
— Function
flatten(df::AbstractDataFrame, cols)
When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
Examples
julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])
2×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ [1, 2] │ [5, 6] │
│ 2 │ 2 │ [3, 4] │ [7, 8] │
julia> flatten(df1, :b)
4×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Array… │
├─────┼───────┼───────┼────────┤
│ 1 │ 1 │ 1 │ [5, 6] │
│ 2 │ 1 │ 2 │ [5, 6] │
│ 3 │ 2 │ 3 │ [7, 8] │
│ 4 │ 2 │ 4 │ [7, 8] │
julia> flatten(df1, [:b, :c])
4×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 5 │
│ 2 │ 1 │ 2 │ 6 │
│ 3 │ 2 │ 3 │ 7 │
│ 4 │ 2 │ 4 │ 8 │
julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────┤
│ 1 │ 1 │ ("p", "q") │
│ 2 │ 2 │ ("r", "s") │
julia> flatten(df2, :b)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ p │
│ 2 │ 1 │ q │
│ 3 │ 2 │ r │
│ 4 │ 2 │ s │
julia> df3 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7]])
2×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ [1, 2] │ [5, 6] │
│ 2 │ 2 │ [3, 4] │ [7] │
julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c
are not the same in row 2
Base.hcat
— Function
hcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate AbstractDataFrames and optionally AbstractVectors.
If an AbstractVector is passed then a column name for it is automatically generated as :x1 by default.
If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).
If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.
Example
julia> [DataFrame(A=1:3) DataFrame(B=1:3)]
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
│ Row │ A │ B │ A_1 │ B_1 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 4 │
│ 2 │ 2 │ 2 │ 5 │ 5 │
│ 3 │ 3 │ 3 │ 6 │ 6 │
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
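The caution about copycols=false can be made concrete with a short sketch. The variable names shared and copied are illustrative only, and the example assumes a DataFrames.jl version in which hcat accepts the copycols keyword as documented above.

```julia
using DataFrames

df1 = DataFrame(A = [1, 2, 3])
df2 = DataFrame(B = [4, 5, 6])

# copycols=false: the result stores the very same vectors as df1 and df2
shared = hcat(df1, df2, copycols=false)
shared.A[1] = 100            # this also changes df1.A
df1.A[1] == 100              # true, both data frames see the change

# copycols=true (the default): the result owns fresh copies
copied = hcat(df1, df2)
copied.A[1] = -1             # df1 is left untouched
df1.A[1] == 100              # still true
```

Unless the extra allocation is a measurable problem, keeping the default copycols=true avoids this kind of action at a distance.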
DataFrames.insertcols!
— Function
insertcols!(df::DataFrame, [ind::Int], (name=>col)::Pair...;
makeunique::Bool=false, copycols::Bool=true)
Insert a column into a data frame in place. Return the updated DataFrame. If ind is omitted it is set to ncol(df)+1 (the column is inserted as the last column).
Arguments
- df: the DataFrame to which we want to add columns
- ind: a position at which we want to insert a column
- name: the name of the new column
- col: an AbstractVector giving the contents of the new column, or a value of any type other than AbstractArray, which will be repeated to fill a new vector; as a particular rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way
- makeunique: defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
- copycols: whether vectors passed as columns should be copied
If col is an AbstractRange then the result of collect(col) is inserted.
Examples
julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> insertcols!(d, 1, :b => 'a':'c')
3×2 DataFrame
│ Row │ b │ a │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 1 │
│ 2 │ 'b' │ 2 │
│ 3 │ 'c' │ 3 │
julia> insertcols!(d, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
│ Row │ b │ c │ c_1 │ a │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼──────┼───────┼───────┼───────┤
│ 1 │ 'a' │ 2 │ 3 │ 1 │
│ 2 │ 'b' │ 3 │ 4 │ 2 │
│ 3 │ 'c' │ 4 │ 5 │ 3 │
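The rules for the col argument (scalars are repeated, Ref wrappers are removed, ranges are collected) are easier to see side by side. The following is a minimal sketch; the column names flag, pair and id are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = 1:3)

# A scalar is repeated to fill the new column
insertcols!(df, :flag => true)

# A value wrapped in Ref is unwrapped first, so every row holds the vector [1, 2]
insertcols!(df, :pair => Ref([1, 2]))

# A range is collected into a Vector before being stored; ind=1 puts it first
insertcols!(df, 1, :id => 10:10:30)

names(df)                 # ["id", "a", "flag", "pair"]
df.id isa Vector{Int}     # true, the range was collected
```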
Base.length
— Function
length(dfr::DataFrameRow)
Return the number of elements of dfr.
See also: size
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];
julia> length(dfr)
2
DataFrames.mapcols
— Function
mapcols(f::Union{Function,Type}, df::AbstractDataFrame)
Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
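Two points from the description are not covered by the example above: f may return scalars, and columns returned unchanged are copied. A small sketch of both, using the illustrative names totals and same:

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# f returns a scalar for every column, so the result has a single row
totals = mapcols(sum, df)
totals.x[1]        # 10
totals.y[1]        # 50

# f returns its argument unchanged; mapcols still stores a copy
same = mapcols(identity, df)
same.x == df.x     # true, equal contents
same.x === df.x    # false, different vector objects
```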
DataFrames.mapcols!
— Function
mapcols!(f::Union{Function,Type}, df::DataFrame)
Update a DataFrame in-place where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols! reuses the columns from df if they are returned by f.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols!(x -> x.^2, df);
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
Base.names
— Function
names(df::AbstractDataFrame)
names(df::AbstractDataFrame, cols)
Return a freshly allocated Vector{String} of names of columns contained in df.
If cols is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Not, and Between). cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
See also propertynames which returns a Vector{Symbol}.
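Examples
A brief sketch of how the selector argument narrows the result; the data frame and its column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x1 = 1:2, x2 = 3:4, y = 5:6)

names(df)             # ["x1", "x2", "y"]
names(df, r"^x")      # ["x1", "x2"], names matching the regular expression
names(df, Not(:y))    # ["x1", "x2"], everything except :y
propertynames(df)     # [:x1, :x2, :y], the same names as Symbols
```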
DataFrames.ncol
— Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame df.
See also size.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
Base.ndims
— Function
ndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})
Return the number of dimensions of a data frame, which is always 2.
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})
Return the number of dimensions of a data frame row, which is always 1.
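Examples
A minimal sketch covering the four methods listed above; the variable name row is illustrative.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)
row = df[1, :]            # a DataFrameRow

ndims(df)                 # 2
ndims(typeof(df))         # 2, the type-level method gives the same answer
ndims(row)                # 1
ndims(typeof(row))        # 1
```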
DataFrames.nonunique
— Function
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)
Return a Vector{Bool} in which true entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal).
Arguments
- df: AbstractDataFrame
- cols: a selector specifying the column(s) to compare. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
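Because the example above uses random data, its output cannot be shown. The following sketch uses fixed values (chosen only for illustration) so that the result is predictable.

```julia
using DataFrames

df = DataFrame(i = [1, 2, 1, 2], x = ["a", "b", "a", "c"])

# Row 3 repeats row 1 in every column; row 4 differs from row 2 in :x
nonunique(df)          # [false, false, true, false]

# Comparing on :i alone, rows 3 and 4 both repeat an earlier value
nonunique(df, :i)      # [false, false, true, true]

# Negating the mask keeps the first occurrence of each row
deduplicated = df[.!nonunique(df), :]
nrow(deduplicated)     # 3
```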
DataFrames.nrow
— Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame df.
See also size.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
DataFrames.order
— Function
order(col::ColumnIndex; kwargs...)
Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.
Examples
julia> df = DataFrame(x = [-3, -1, 0, 2, 4], y = 1:5)
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ -3 │ 1 │
│ 2 │ -1 │ 2 │
│ 3 │ 0 │ 3 │
│ 4 │ 2 │ 4 │
│ 5 │ 4 │ 5 │
julia> sort(df, order(:x, rev=true))
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 5 │
│ 2 │ 2 │ 4 │
│ 3 │ 0 │ 3 │
│ 4 │ -1 │ 2 │
│ 5 │ -3 │ 1 │
julia> sort(df, order(:x, by=abs))
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 0 │ 3 │
│ 2 │ -1 │ 2 │
│ 3 │ 2 │ 4 │
│ 4 │ -3 │ 1 │
│ 5 │ 4 │ 5 │
Base.push!
— Function
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add in-place one row at the end of df taking the values from row.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
- If cols == :setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
- If cols == :orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
- If cols == :intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
- If cols == :subset then push! behaves like for :intersect but if some column is missing in row then a missing value is pushed to df.
- If cols == :union then columns missing in df that are present in row are added to df (using missing for existing rows) and a missing value is pushed to columns missing in row that are present in df.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.
Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
Examples
julia> df = DataFrame(A=1:3, B=1:3);
julia> push!(df, (true, false))
4×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
julia> push!(df, df[1, :])
5×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
│ 6 │ 1 │ 0 │
julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
│ Row │ A │ B │ C │
│ │ Float64 │ Int64? │ Float64? │
├─────┼─────────┼─────────┼──────────┤
│ 1 │ 1.0 │ 1 │ missing │
│ 2 │ 2.0 │ 2 │ missing │
│ 3 │ 3.0 │ 3 │ missing │
│ 4 │ 1.0 │ 0 │ missing │
│ 5 │ 1.0 │ 1 │ missing │
│ 6 │ 1.0 │ 0 │ missing │
│ 7 │ 1.0 │ missing │ 1.0 │
julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
│ Row │ A │ B │ C │
│ │ Float64? │ Int64? │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1 │ 1.0 │ 1 │ missing │
│ 2 │ 2.0 │ 2 │ missing │
│ 3 │ 3.0 │ 3 │ missing │
│ 4 │ 1.0 │ 0 │ missing │
│ 5 │ 1.0 │ 1 │ missing │
│ 6 │ 1.0 │ 0 │ missing │
│ 7 │ 1.0 │ missing │ 1.0 │
│ 8 │ missing │ missing │ missing │
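The examples above rely on promote being switched on implicitly by cols=:union and cols=:subset. The sketch below passes promote=true explicitly with the default cols=:setequal; the data are made up for illustration.

```julia
using DataFrames

df = DataFrame(A = [1, 2])

# 1.5 cannot be stored in an Int column; with promote=true the column is
# replaced by a freshly allocated Float64 column
push!(df, (A = 1.5,), promote=true)
eltype(df.A)          # Float64

# missing is not allowed by Float64 either, so the column is promoted again
push!(df, (A = missing,), promote=true)
eltype(df.A)          # Union{Missing, Float64}
```

With promote=false (the default for cols=:setequal) both calls would throw an error instead of changing the column type.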
DataFrames.rename
— Function
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
- df: the AbstractDataFrame
- d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
- f: a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
- vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
- makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings within the to values or within the from values is not allowed.
See also: rename!
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
│ Row │ i │ y │ x │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(uppercase, df)
1×3 DataFrame
│ Row │ I │ X │ Y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
DataFrames.rename!
— Function
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
- df: the AbstractDataFrame
- d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
- f: a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
- vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
- makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings within the to values or within the from values is not allowed.
See also: rename
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make
them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
│ Row │ a │ b │ a_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(uppercase, df)
1×3 DataFrame
│ Row │ A │ B │ A_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
Base.repeat
— Function
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating the full set of rows in df the number of times specified by count.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
DataFrames.repeat!
— Function
repeat!(df::DataFrame; inner::Integer = 1, outer::Integer = 1)
Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat!(df, inner = 2, outer = 3);
julia> df
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat!(df::DataFrame, count::Integer)
Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat!(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
DataFrames.select
— Function
select(df::AbstractDataFrame, args...; copycols::Bool=true)
Create a new data frame that contains columns from df specified by args and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).
If df is a DataFrame or copycols=true then column renaming and transformations are supported.
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.
If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.
Column transformation can also be specified using the short old_column => fun form. In this case, new_column_name is automatically generated as $(old_column)_$(fun). Up to three column names are used for multiple input columns and they are joined using _; if more than three columns are passed then the name consists of the first two names followed by the etc suffix, e.g. [:a,:b,:c,:d] => fun produces the new column name :a_b_etc_fun.
Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).
As a special rule passing nrow without specifying old_column creates a column named :nrow containing the number of rows in the source data frame, and passing nrow => new_column_name stores the number of rows in the source data frame in the new_column_name column.
If a collection of column names is passed to select! or select then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
If df is a DataFrame a new DataFrame is returned. If copycols=false, then the returned DataFrame shares column vectors with df where possible. If copycols=true (the default), then the returned DataFrame will not share columns with df. The only exception for this rule is the old_column => fun => new_column transformation when fun returns a vector that is not allocated by fun but is neither a SubArray nor one of the input vectors. In such a case a new DataFrame might contain aliases. Such a situation can only happen with transformations which return vectors other than their inputs, e.g. with select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2) when c is a vector object or with select(df, :a => (x -> df.c) => :c2).
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.
Note that including the same column several times in the data frame via renaming or transformations that return the same object when copycols=false will create column aliases. An example of such a situation is select(df, :a, :a => :b, :a => identity => :c, copycols=false).
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select(df, :b)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
julia> select(df, Not(:b)) # drop column :b from df
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> select(df, :a => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.841471 │ 4 │
│ 2 │ 0.909297 │ 5 │
│ 3 │ 0.14112 │ 6 │
julia> select(df, :, [:a, :b] => (a,b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ a │ b │ a_b_function │
│ │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼──────────────┤
│ 1 │ 1 │ 4 │ 0.0 │
│ 2 │ 2 │ 5 │ 2.0 │
│ 3 │ 3 │ 6 │ 4.0 │
julia> select(df, names(df) .=> sum)
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> select(df, names(df) .=> sum .=> [:A, :B])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> select(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│ │ Float64 │
├─────┼──────────┤
│ 1 │ 2.5 │
│ 2 │ 3.5 │
│ 3 │ 4.5 │
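The broadcasting, ByRow and copycols rules described above can be checked without printing whole tables. This is a minimal sketch; the output column names b_total and a2 and the variable names are chosen only for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# sum returns a scalar, which is broadcast to all three rows
totals = select(df, :a, :b => sum => :b_total)
totals.b_total            # [15, 15, 15]

# ByRow applies the function to each element instead of the whole column
doubled = select(df, :a => ByRow(x -> 2x) => :a2)
doubled.a2                # [2, 4, 6]

# copycols=true (the default) copies plain selections, copycols=false shares them
copied = select(df, :a)
shared = select(df, :a, copycols=false)
copied.a === df.a         # false
shared.a === df.a         # true
```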
select(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)
Apply args to gd following the rules described in combine.
If ungroup=true (the default) the result is a DataFrame. If ungroup=false the result is a GroupedDataFrame grouped using the grouping columns of gd (in this case the returned value retains the order of groups of gd).
The parent of the returned value has as many rows as parent(gd) and in the same order, except when the returned value has no columns (in which case it has zero rows). If an operation in args returns a single value it is always broadcasted to have this number of rows.
If copycols=false then do not perform copying of columns that are not transformed.
If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal. If keepkeys=false and some generated columns have the same name as grouping columns, they are kept and are not required to be equal to grouping columns.
If gd has zero groups then no transformations are applied.
See also: groupby, combine, select!, transform, transform!
Examples
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8)
8×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 1 │ 2 │
│ 3 │ 1 │ 2 │ 3 │
│ 4 │ 2 │ 1 │ 4 │
│ 5 │ 2 │ 2 │ 5 │
│ 6 │ 1 │ 1 │ 6 │
│ 7 │ 1 │ 2 │ 7 │
│ 8 │ 2 │ 1 │ 8 │
julia> gd = groupby(df, :a);
julia> select(gd, :c => sum, nrow)
8×3 DataFrame
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 19 │ 5 │
│ 2 │ 1 │ 19 │ 5 │
│ 3 │ 1 │ 19 │ 5 │
│ 4 │ 2 │ 17 │ 3 │
│ 5 │ 2 │ 17 │ 3 │
│ 6 │ 1 │ 19 │ 5 │
│ 7 │ 1 │ 19 │ 5 │
│ 8 │ 2 │ 17 │ 3 │
julia> select(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 19 │ 5 │
│ 2 │ 1 │ 19 │ 5 │
│ 3 │ 1 │ 19 │ 5 │
│ 4 │ 1 │ 19 │ 5 │
│ 5 │ 1 │ 19 │ 5 │
⋮
Last Group (3 rows): a = 2
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 17 │ 3 │
│ 2 │ 2 │ 17 │ 3 │
│ 3 │ 2 │ 17 │ 3 │
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
8×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 5.52943 │
│ 2 │ 1 │ 5.52943 │
│ 3 │ 1 │ 5.52943 │
│ 4 │ 2 │ 5.07517 │
│ 5 │ 2 │ 5.07517 │
│ 6 │ 1 │ 5.52943 │
│ 7 │ 1 │ 5.52943 │
│ 8 │ 2 │ 5.07517 │
julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 19 │
│ 2 │ 1 │ 8 │ 19 │
│ 3 │ 1 │ 8 │ 19 │
│ 4 │ 2 │ 4 │ 17 │
│ 5 │ 2 │ 4 │ 17 │
│ 6 │ 1 │ 8 │ 19 │
│ 7 │ 1 │ 8 │ 19 │
│ 8 │ 2 │ 4 │ 17 │
julia> select(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 2 │ 3 │
│ 3 │ 2 │ 3 │ 5 │
│ 4 │ 1 │ 4 │ 5 │
│ 5 │ 2 │ 5 │ 7 │
│ 6 │ 1 │ 6 │ 7 │
│ 7 │ 2 │ 7 │ 9 │
│ 8 │ 1 │ 8 │ 9 │
julia> select(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a │ b │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 19 │
│ 2 │ 1 │ 1 │ 19 │
│ 3 │ 1 │ 2 │ 19 │
│ 4 │ 2 │ 1 │ 17 │
│ 5 │ 2 │ 2 │ 17 │
│ 6 │ 1 │ 1 │ 19 │
│ 7 │ 1 │ 2 │ 19 │
│ 8 │ 2 │ 1 │ 17 │
julia> select(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a │ b │ c │ b_c_sum │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 1 │ 2 │ 3 │
│ 3 │ 1 │ 2 │ 3 │ 5 │
│ 4 │ 2 │ 1 │ 4 │ 5 │
│ 5 │ 2 │ 2 │ 5 │ 7 │
│ 6 │ 1 │ 1 │ 6 │ 7 │
│ 7 │ 1 │ 2 │ 7 │ 9 │
│ 8 │ 2 │ 1 │ 8 │ 9 │
DataFrames.select!
— Function
select!(df::DataFrame, args...)
Mutate df in place to retain only columns specified by args... and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.
If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.
Column transformation can also be specified using the short old_column => fun
form. In this case, new_column_name
is automatically generated as $(old_column)_$(fun)
. Up to three column names are used for multiple input columns and they are joined using _
; if more than three columns are passed then the name consists of the first two names followed by an etc
suffix, e.g. [:a,:b,:c,:d] => fun
produces the new column name :a_b_etc_fun
.
Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).
As a special rule passing nrow
without specifying old_column
creates a column named :nrow
containing the number of rows in the source data frame, and passing nrow => new_column_name
stores that number in the new_column_name
column.
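The naming rules for the short form and the nrow special case described above can be checked with a small sketch. It uses the non-mutating select so that df stays intact, and total is a hypothetical helper (not part of DataFrames) defined only so that the generated name has a readable function part.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9, d=10:12)

# Hypothetical helper (an assumption for this sketch): adds up every value of every input column.
total(cols...) = sum(sum, cols)

# Short form with one input column: the name is built from the column and the function.
short = select(df, :a => sum)
short_names = names(short)

# More than three input columns: first two names, then "etc", then the function name.
many = select(df, [:a, :b, :c, :d] => total)
many_names = names(many)

# nrow => new_column_name stores the row count, broadcast to every row.
counted = select(df, :a, nrow => :n)
```

Here short_names and many_names hold the generated column names, and counted.n repeats the number of rows of df in each row.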
If a collection of column names is passed to select!
or select
then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
Note that including the same column several times in the data frame via renaming or transformations that return the same object without copying will create column aliases. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select!(df, 2)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> select!(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.841471 │ 4 │
│ 2 │ 0.909297 │ 5 │
│ 3 │ 0.14112 │ 6 │
julia> select!(df, :, [:c, :b] => (c,b) -> c .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ c │ b │ c_b_function │
│ │ Float64 │ Int64 │ Float64 │
├─────┼──────────┼───────┼──────────────┤
│ 1 │ 0.841471 │ 4 │ -0.158529 │
│ 2 │ 0.909297 │ 5 │ 0.909297 │
│ 3 │ 0.14112 │ 6 │ 1.14112 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> select!(df, names(df) .=> sum);
julia> df
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> using Statistics
julia> select!(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│ │ Float64 │
├─────┼──────────┤
│ 1 │ 2.5 │
│ 2 │ 3.5 │
│ 3 │ 4.5 │
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)
An equivalent of select(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup)
but updates parent(gd)
in place.
Base.show
— Function
show([io::IO,] df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
splitcols::Bool = get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true,
eltypes::Bool = true)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io
is omitted, the result is printed to stdout
, and allrows
, allcols
and allgroups
default to false
while splitcols
defaults to true
.
Arguments
- io::IO: The I/O stream to which df will be printed.
- df::AbstractDataFrame: The data frame to print.
- allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
- allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
- allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
- splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
- rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
- summary::Bool = true: Whether to print a brief string summary of the data frame.
- eltypes::Bool = true: Whether to print the column types under column names.
Examples
julia> using DataFrames
julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);
julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ x │
│ 2 │ 2 │ y │
│ 3 │ 3 │ z │
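The keyword arguments can also be used when rendering into a string rather than to stdout. The following is a minimal sketch that assumes only the summary and eltypes keywords listed above; the exact table layout depends on the installed DataFrames version, so no rendered output is shown.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# Render into a String instead of printing to stdout.
plain = sprint(io -> show(io, df))

# Same rendering, but without the summary line and without the element-type row.
bare = sprint(io -> show(io, df, summary=false, eltypes=false))
```

Comparing plain and bare shows which parts of the output the two keywords control.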
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime
.
Arguments
- io::IO: The I/O stream to which df will be printed.
- mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values" (the last two MIME types do not support showing #undef values)
- df::AbstractDataFrame: The data frame to print.
Additionally, selected MIME types support passing the following keyword arguments:
- MIME type "text/plain" accepts all listed keyword arguments, and their behavior is identical to that of show(::IO, ::AbstractDataFrame).
- MIME type "text/html" accepts the summary keyword argument, which controls whether a brief string summary of the data frame is printed.
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
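When the rendered text is needed as a value, for example to write it to a file later, the MIME method can be combined with sprint from Base. This is a sketch of one possible usage, not the only way to export a data frame.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# sprint passes an in-memory IO as the first argument of show.
csv_text = sprint(show, MIME("text/csv"), df)
tsv_text = sprint(show, MIME("text/tab-separated-values"), df)
```

csv_text then holds the same comma-separated lines as the example above, and tsv_text uses tab characters as separators.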
Base.size
— Function
size(df::AbstractDataFrame, [dim])
Return a tuple containing the number of rows and columns of df
. Optionally a dimension dim
can be specified, where 1
corresponds to rows and 2
corresponds to columns.
Examples
julia> df = DataFrame(a=1:3, b='a':'c');
julia> size(df)
(3, 2)
julia> size(df, 1)
3
size(dfr::DataFrameRow, [dim])
Return a 1-tuple containing the number of elements of dfr
. If an optional dimension dim
is specified, it must be 1
, and the number of elements is returned directly as a number.
See also: length
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];
julia> size(dfr)
(2,)
julia> size(dfr, 1)
2
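As a brief sketch of how the two methods relate to nrow, ncol and length, the tuple returned for a data frame can be destructured, and the size of a DataFrameRow matches its length:

```julia
using DataFrames

df = DataFrame(a=1:3, b='a':'c')

rows, cols = size(df)                     # destructure the (rows, columns) tuple
same = size(df) == (nrow(df), ncol(df))   # size agrees with nrow and ncol

dfr = df[1, :]                            # a DataFrameRow
row_len = size(dfr, 1)                    # number of elements in the row
```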
Base.sort
— Function
sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a copy of data frame df
sorted by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See sort!
for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, [:x, :y])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
Base.sort!
— Function
sort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Sort data frame df
by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, [:x, :y])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
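The difference between sort and sort! is only whether df itself is reordered. A minimal sketch of that contrast, reusing the data from the examples above:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

sorted = sort(df, [:x, order(:y, rev=true)])   # returns a new, sorted data frame
untouched = df.x == [3, 1, 2, 1]               # df still has its original row order

sort!(df, [:x, order(:y, rev=true)])           # reorders the rows of df itself
```

After the sort! call, df and sorted contain the same rows in the same order.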
DataFrames.transform
— Function
transform(df::AbstractDataFrame, args...; copycols::Bool=true)
Create a new data frame that contains the columns from df
together with the columns specified by args
, and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select(df, :, args..., copycols=copycols)
.
See select
for detailed rules regarding accepted values for args
.
transform(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)
An equivalent of select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup)
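The equivalences stated for both methods can be illustrated with a short sketch. The column names g, v and v2 are made up for this example.

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1, 2], v = [10, 20, 30, 40])

# Data frame method: all original columns are kept and the new one is appended.
t1 = transform(df, :v => ByRow(x -> 2x) => :v2)
s1 = select(df, :, :v => ByRow(x -> 2x) => :v2)

# Grouped method: sum is applied within each group and broadcast over its rows.
gd = groupby(df, :g)
t2 = transform(gd, :v => sum)
```

t1 and s1 hold the same data, and t2 gains a v_sum column with the per-group total repeated on every row of the group.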
DataFrames.transform!
— Function
transform!(df::DataFrame, args...)
Mutate df
in place to add columns specified by args...
and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select!(df, :, args...)
.
See select!
for detailed rules regarding accepted values for args
.
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)
An equivalent of transform(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup)
but updates parent(gd)
in place.
Base.unique!
— Function
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Delete duplicate rows of data frame df
, keeping only the first occurrence of unique rows. When cols
is specified, the returned DataFrame
contains complete rows, retaining in each case the first instance for which df[:, cols]
is unique. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
When unique
is called a new data frame is returned; unique!
updates df
in-place.
See also nonunique
.
Arguments
- df: the AbstractDataFrame
- cols: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
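Because the example above uses random data, its results vary between runs. A deterministic sketch with made-up values shows which rows are kept in each case:

```julia
using DataFrames

df = DataFrame(i = [1, 2, 1, 2], x = ["a", "b", "a", "c"])

all_cols = unique(df)    # rows 1, 2 and 4 remain; row 3 repeats row 1
by_i = unique(df, :i)    # rows 1 and 2 remain; complete rows are returned
flags = nonunique(df)    # true for rows that repeat an earlier row

unique!(df)              # df itself now holds the deduplicated rows
```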
Base.vcat
— Function
vcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal)
Vertically concatenate AbstractDataFrame
s.
The cols
keyword argument determines the columns of the returned data frame:
- :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
- :orderequal: require all data frames to have the same column names and in the same order.
- :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
- :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
- A vector of Symbols or strings: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type
, as with vcat
for AbstractVector
s.
vcat
ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat
onto it.
Example
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = DataFrame(A=7:9, C=7:9);
julia> d4 = DataFrame();
julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │ missing │
│ 3 │ 3 │ 3 │ missing │
│ 4 │ 7 │ missing │ 7 │
│ 5 │ 8 │ missing │ 8 │
│ 6 │ 9 │ missing │ 9 │
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 7 │
│ 5 │ 8 │
│ 6 │ 9 │
julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
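Three points from the description are not covered by the examples above: accumulating onto an initially empty data frame, element type promotion, and passing a vector of column names as cols. The following sketch illustrates them with made-up data; foldl stands in for the loop mentioned in the text.

```julia
using DataFrames

# Accumulate chunks onto an initially empty data frame.
chunks = [DataFrame(A = [k], B = [10k]) for k in 1:3]
acc = foldl(vcat, chunks; init = DataFrame())

# Element types are promoted across the inputs (Int64 and Float64 here).
mixed = vcat(DataFrame(A = [1]), DataFrame(A = [2.5]))

# Only the listed columns are kept; missing fills the gaps.
picked = vcat(DataFrame(A = [1], B = [2]), DataFrame(A = [3]), cols = [:A, :B])
```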
Unsorted
Base.first
— Function
first(df::AbstractDataFrame)
Get the first row of df
as a DataFrameRow
.
first(df::AbstractDataFrame, n::Integer)
Get a data frame with the n
first rows of df
.
Base.last
— Function
last(df::AbstractDataFrame)
Get the last row of df
as a DataFrameRow
.
last(df::AbstractDataFrame, n::Integer)
Get a data frame with the n
last rows of df
.
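A short sketch of the difference between the one-argument and two-argument methods of first and last, using made-up data:

```julia
using DataFrames

df = DataFrame(a=1:5, b='a':'e')

row = first(df)        # a DataFrameRow for row 1
head = first(df, 2)    # a data frame with rows 1 and 2
tail = last(df, 2)     # a data frame with rows 4 and 5
```

The one-argument form gives access to single values such as row.a, while the two-argument form always returns a data frame, even for n = 1.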
Base.unique
— Function
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Delete duplicate rows of data frame df
, keeping only the first occurrence of unique rows. When cols
is specified, the returned DataFrame
contains complete rows, retaining in each case the first instance for which df[:, cols]
is unique. cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
When unique
is called a new data frame is returned; unique!
updates df
in-place.
See also nonunique
.
Arguments
- df: the AbstractDataFrame
- cols: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
Base.propertynames
— Function
propertynames(df::AbstractDataFrame)
Return a freshly allocated Vector{Symbol}
of names of columns contained in df
.
Base.similar
— Function
similar(df::AbstractDataFrame, rows::Integer=nrow(df))
Create a new DataFrame
with the same column names and column element types as df
. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df
.
Base.sortperm
— Function
sortperm(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a permutation vector of row indices of data frame df
that puts them in sorted order according to column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sortperm(df, :x)
4-element Array{Int64,1}:
2
4
3
1
julia> sortperm(df, [:x, :y])
4-element Array{Int64,1}:
4
2
3
1
julia> sortperm(df, [:x, :y], rev=true)
4-element Array{Int64,1}:
1
3
2
4
julia> sortperm(df, [:x, order(:y, rev=true)])
4-element Array{Int64,1}:
2
4
3
1
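The permutation returned by sortperm can be used to index the rows of df, which gives the same result as calling sort with the same columns. A minimal sketch with the data from the examples above:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

p = sortperm(df, [:x, :y])   # row indices in sorted order
reordered = df[p, :]         # apply the permutation to the rows
```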
Base.pairs
— Function
pairs(dfc::DataFrameColumns)
Return an iterator of pairs associating the name of each column of dfc
with the corresponding column vector, i.e. name => col
where name
is the column name of the column col
.
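A DataFrameColumns object is what eachcol returns, so the iterator is typically used as in the sketch below. The totals dictionary is only an illustration of consuming the name => col pairs.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# Collect the sum of every column, keyed by the column name as a string.
totals = Dict{String,Int}()
for (name, col) in pairs(eachcol(df))
    totals[string(name)] = sum(col)
end
```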
Base.parent
— Function
parent(gd::GroupedDataFrame)
Return the parent data frame of gd
.
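As a small sketch with made-up data, the value returned by parent is the very data frame that was passed to groupby, not a copy:

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1], v = [10, 20, 30])
gd = groupby(df, :g)

same_object = parent(gd) === df   # identity check, not just equality
```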
Base.issorted
— Function
issorted(df::AbstractDataFrame, cols;
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)
Test whether data frame df
is sorted by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.