Functions
Base.append!, Base.copy, Base.delete!, Base.filter, Base.filter!, Base.first, Base.get, Base.hcat, Base.issorted, Base.keys, Base.last, Base.length, Base.names, Base.ndims, Base.pairs, Base.parent, Base.propertynames, Base.push!, Base.repeat, Base.show, Base.similar, Base.size, Base.sort, Base.sort!, Base.sortperm, Base.unique, Base.unique!, Base.vcat, CategoricalArrays.categorical, Compat.eachcol, Compat.eachrow, DataAPI.describe, DataFrames.DataFrame!, DataFrames.allowmissing!, DataFrames.antijoin, DataFrames.categorical!, DataFrames.combine, DataFrames.completecases, DataFrames.crossjoin, DataFrames.disallowmissing!, DataFrames.dropmissing, DataFrames.dropmissing!, DataFrames.flatten, DataFrames.groupby, DataFrames.groupcols, DataFrames.groupindices, DataFrames.innerjoin, DataFrames.insertcols!, DataFrames.leftjoin, DataFrames.mapcols, DataFrames.mapcols!, DataFrames.ncol, DataFrames.nonunique, DataFrames.nrow, DataFrames.order, DataFrames.outerjoin, DataFrames.rename, DataFrames.rename!, DataFrames.repeat!, DataFrames.rightjoin, DataFrames.select, DataFrames.select!, DataFrames.semijoin, DataFrames.stack, DataFrames.transform, DataFrames.transform!, DataFrames.unstack, DataFrames.valuecols, Missings.allowmissing, Missings.disallowmissing
Joining, Grouping, and Split-Apply-Combine
DataFrames.innerjoin — Function
innerjoin(df1, df2; on, makeunique = false,
          validate = (false, false))
innerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))
Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
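The makeunique and validate keywords described above can be sketched as follows. The data frames left, right and dup are hypothetical inputs made up for this illustration, and the failure is caught generically rather than by a specific exception type.

```julia
using DataFrames

# Hypothetical inputs: both carry a non-key column named Value.
left  = DataFrame(ID = [1, 2], Value = [10, 20])
right = DataFrame(ID = [1, 2], Value = [30, 40])

# makeunique = true lets the join proceed; the second Value column gets a suffix.
joined = innerjoin(left, right, on = :ID, makeunique = true)

# dup repeats the key 1, so requesting a uniqueness check on the
# right-hand keys makes the join fail instead of duplicating rows.
dup = DataFrame(ID = [1, 1], Job = ["Lawyer", "Judge"])
validation_failed = try
    innerjoin(left, dup, on = :ID, validate = (false, true))
    false
catch
    true
end
```

Here propertynames(joined) lists ID, Value and Value_1, and validation_failed ends up true.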
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate keyword argument is applied recursively with left associativity.
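As a small sketch of the left-associative behaviour, joining three data frames in one call should give the same result as nesting two-frame joins. The city data frame is a hypothetical third input added only for this illustration.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
city = DataFrame(ID = [2, 3, 4], City = ["Oslo", "Lima", "Pune"])  # hypothetical

# One call with three inputs ...
all_at_once = innerjoin(name, job, city, on = :ID)

# ... versus the explicit left-associative nesting.
step_by_step = innerjoin(innerjoin(name, job, on = :ID), city, on = :ID)
```

Only ID 2 appears in all three inputs, so both results contain a single row.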
See also: leftjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> innerjoin(name, job2, on = :ID => :identifier)
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
julia> innerjoin(name, job2, on = [:ID => :identifier])
2×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
DataFrames.leftjoin — Function
leftjoin(df1, df2; on, makeunique = false,
         indicator = nothing, validate = (false, false))
Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name, for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
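A minimal sketch of working with the missing values that a left join introduces for unmatched rows. The placeholder string "unknown" is an arbitrary choice for this illustration, and the result is sorted by ID so that the sketch does not depend on the row order produced by the join.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

res = sort(leftjoin(name, job, on = :ID), :ID)

# IDs from name that found no partner in job.
unmatched = res[ismissing.(res.Job), :ID]

# Replace missing jobs with a placeholder without modifying res.
jobs_filled = coalesce.(res.Job, "unknown")
```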
See also: innerjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> leftjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
julia> leftjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
DataFrames.rightjoin — Function
rightjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))
Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
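The following sketch joins on two columns at once and mixes a left=>right pair with a plain column name in the on vector, as the description of on permits. The sales and budget data frames and their column names are hypothetical.

```julia
using DataFrames

# Hypothetical inputs: the first key column has a different name on each side.
sales  = DataFrame(ID = [1, 1, 2], Year = [2019, 2020, 2019], X = [1.0, 2.0, 3.0])
budget = DataFrame(identifier = [1, 2, 2], Year = [2020, 2019, 2020], Y = ["a", "b", "c"])

# Rows match when ID == identifier and the Year values agree.
res = rightjoin(sales, budget, on = [:ID => :identifier, :Year])

# The (2, 2020) row of budget has no partner in sales, so X is missing there.
n_unmatched = count(ismissing, res.X)
```

The key columns keep the names used in the left data frame, so the result has columns ID, Year, X and Y.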
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, leftjoin, outerjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> rightjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
julia> rightjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String │
├─────┼───────┼──────────┼────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 4 │ missing │ Farmer │
DataFrames.outerjoin — Function
outerjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))
outerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))
Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
All columns of the returned data frame will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported and validate keyword argument is applied recursively with left associativity.
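A short sketch of an outer join over three data frames. The city data frame is a hypothetical third input; the result should contain every ID that occurs in at least one input, with missing filling the gaps.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
city = DataFrame(ID = [2, 3, 4], City = ["Oslo", "Lima", "Pune"])  # hypothetical

res = outerjoin(name, job, city, on = :ID)

# Count the gaps per non-key column: each input lacks exactly one of the IDs 1:4.
gaps = (Name = count(ismissing, res.Name),
        Job  = count(ismissing, res.Job),
        City = count(ismissing, res.City))
```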
See also: innerjoin, leftjoin, rightjoin, semijoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> outerjoin(name, job2, on = :ID => :identifier)
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
julia> outerjoin(name, job2, on = [:ID => :identifier])
4×3 DataFrame
│ Row │ ID │ Name │ Job │
│ │ Int64 │ String? │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1 │ 1 │ John Doe │ Lawyer │
│ 2 │ 2 │ Jane Doe │ Doctor │
│ 3 │ 3 │ Joe Blogs │ missing │
│ 4 │ 4 │ missing │ Farmer │
DataFrames.antijoin — Function
antijoin(df1, df2; on, makeunique = false, validate = (false, false))
Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> antijoin(name, job, on = :ID)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 3 │ Joe Blogs │
DataFrames.semijoin — Function
semijoin(df1, df2; on, makeunique = false, validate = (false, false))
Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.
Arguments
- df1, df2: the AbstractDataFrames to be joined
Keyword Arguments
- on: A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
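Semi joins and anti joins are complementary: between them they split the rows of df1 into those with and those without a matching key in df2. A minimal sketch of that relationship, reusing the name and job data from the examples:

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

kept    = semijoin(name, job, on = :ID)   # rows of name with a match in job
dropped = antijoin(name, job, on = :ID)   # rows of name without a match

# Stacking both parts and sorting by ID recovers the original data frame.
recombined = sort(vcat(kept, dropped), :ID)
```

Neither result contains columns from job; only rows of name are filtered.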
See also: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, crossjoin.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼───────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
│ 3 │ 3 │ Joe Blogs │
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID │ Job │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> semijoin(name, job, on = :ID)
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job │
│ │ Int64 │ String │
├─────┼────────────┼────────┤
│ 1 │ 1 │ Lawyer │
│ 2 │ 2 │ Doctor │
│ 3 │ 4 │ Farmer │
julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼──────────┤
│ 1 │ 1 │ John Doe │
│ 2 │ 2 │ Jane Doe │
DataFrames.crossjoin — Function
crossjoin(df1, df2, dfs...; makeunique = false)
Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames.
Arguments
- df1, df2, dfs...: the AbstractDataFrames to be joined
Keyword Arguments
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If more than two data frames are passed, the join is performed recursively with left associativity.
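As a quick check of the cartesian-product behaviour with more than two inputs, the sketch below adds a hypothetical third data frame df3 and compares the one-call form with the nested form.

```julia
using DataFrames

df1 = DataFrame(X = 1:3)
df2 = DataFrame(Y = ["a", "b"])
df3 = DataFrame(Z = [true, false])  # hypothetical third input

res    = crossjoin(df1, df2, df3)
nested = crossjoin(crossjoin(df1, df2), df3)

# The row count is the product of the input row counts: 3 * 2 * 2.
expected_rows = nrow(df1) * nrow(df2) * nrow(df3)
```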
See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin.
Examples
julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
│ Row │ X │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
│ Row │ Y │
│ │ String │
├─────┼────────┤
│ 1 │ a │
│ 2 │ b │
julia> crossjoin(df1, df2)
6×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 2 │ b │
│ 5 │ 3 │ a │
│ 6 │ 3 │ b │
DataFrames.combine — Function
combine(df::AbstractDataFrame, args...)
Create a new data frame that contains columns from df specified by args and return it. The number of rows in the result is determined by the values returned by the passed transformations.
See select for detailed rules regarding accepted values for args.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> combine(df, :a => sum, nrow)
1×2 DataFrame
│ Row │ a_sum │ nrow │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 3 │
combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, gd::GroupedDataFrame;
        keepkeys::Bool=true, ungroup::Bool=true)
combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, df::AbstractDataFrame, ungroup::Bool=true)
combine(pair::Pair, df::AbstractDataFrame, ungroup::Bool=true)
Apply operations to each group in a GroupedDataFrame and return the combined result as a DataFrame if ungroup=true or GroupedDataFrame if ungroup=false.
If an AbstractDataFrame is passed, the operations are applied to the data frame as a whole and a DataFrame is always returned.
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer, :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Transformations allowed using Pairs follow the rules specified for select and have the form source_cols => fun, source_cols => fun => target_col, or source_col => target_col. Function fun is passed SubArray views as positional arguments for each column specified to be selected, or a NamedTuple containing these SubArrays if source_cols is an AsTable selector. It can return a vector or a single value (defined precisely below).
As a special case nrow or nrow => target_col can be passed without specifying input columns to efficiently calculate number of rows in each group. If nrow is passed the resulting column name is :nrow.
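A minimal sketch of the nrow special case, with and without a target column name. The target name :n is an arbitrary choice for this illustration.

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = 2),
               b = repeat([2, 1], outer = 4),
               c = 1:8)
gd = groupby(df, :a)

counts  = combine(gd, nrow)        # count column is named nrow
renamed = combine(gd, nrow => :n)  # count column is named n
```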
If multiple args are passed then return values of different funs are allowed to mix single values and vectors. In this case single values will be broadcasted to match the length of columns specified by returned vectors. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
If the first or last argument is pair then it must be a Pair following the rules for pairs described above, except that in this case function defined by fun can return any return value defined below.
If the first or last argument is a function fun, it is passed a SubDataFrame view for each group and can return any return value defined below. Note that this form is slower than pair or args due to type instability.
If gd has zero groups then no transformations are applied.
fun can return a single value, a row, a vector, or multiple rows. The type of the returned value determines the shape of the resulting DataFrame. There are four kinds of return values allowed:
- A single value gives a DataFrame with a single additional column and one row per group.
- A named tuple of single values or a DataFrameRow gives a DataFrame with one additional column for each field and one row per group (returning a named tuple will be faster). It is not allowed to mix single values and vectors if a named tuple is returned.
- A vector gives a DataFrame with a single additional column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a DataFrame with the same additional columns and as many rows for each group as the rows returned for that group (returning a named tuple is the fastest option). Returning a table with zero columns is allowed, whatever the number of columns returned for other groups.
fun must always return the same kind of object (out of four kinds defined above) for all groups, and with the same column names.
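The four kinds of return values can be compared side by side with the function form of combine. This is only a sketch; the field names total, n and doubled are made up for the illustration.

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = 2),
               b = repeat([2, 1], outer = 4),
               c = 1:8)
gd = groupby(df, :a)

# 1. Single value: one row per group, auto-named column x1.
single = combine(sdf -> sum(sdf.c), gd)

# 2. Named tuple of single values: one row per group, one column per field.
row = combine(sdf -> (total = sum(sdf.c), n = nrow(sdf)), gd)

# 3. Vector: as many rows per group as the vector has elements.
stacked = combine(sdf -> 2 .* sdf.c, gd)

# 4. Named tuple of vectors: several columns and several rows per group.
table = combine(sdf -> (c = sdf.c, doubled = 2 .* sdf.c), gd)
```

With four groups of two rows each, the single-value and named-tuple-of-values forms yield four rows, while the vector and table forms yield eight.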
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the Pair syntax (e.g. :col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
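To compare the two code paths mentioned above, one can request both in a single call and give each an explicit target name. The data and the target names :fast and :pairwise are hypothetical, and whether the two results differ at all depends on the data and the package version.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 2], x = [0.1, 0.2, 0.3, 0.4])
gd = groupby(df, :g)

res = combine(gd,
              :x => sum => :fast,                # eligible for the optimized method
              :x => (x -> sum(x)) => :pairwise)  # anonymous function: generic path
```

Wrapping sum in an anonymous function is what opts out of the optimized method; the explicit target names keep the two columns distinct.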
Column names are automatically generated when necessary using the rules defined in select if the Pair syntax is used and fun returns a single value or a vector (e.g. for :col => sum the column name is col_sum); otherwise (if fun is a function or a return value is an AbstractMatrix) columns are named x1, x2 and so on.
If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal. If keepkeys=false and some generated columns have the same name as grouping columns, they are kept and are not required to be equal to grouping columns.
If ungroup=true (the default) a DataFrame is returned. If ungroup=false a GroupedDataFrame grouped using groupcols(gd) is returned.
Ordering of rows follows the order of groups in gd.
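The effect of the keepkeys and ungroup keywords can be sketched on the same example data:

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = 2),
               b = repeat([2, 1], outer = 4),
               c = 1:8)
gd = groupby(df, :a)

# keepkeys = false drops the grouping column from the result.
no_keys = combine(gd, :c => sum, keepkeys = false)

# ungroup = false returns a GroupedDataFrame grouped by the same columns.
still_grouped = combine(gd, :c => sum, ungroup = false)
```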
See also
groupby, select, select!, transform, transform!
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 6 │ 2 │
│ 2 │ 2 │ 8 │ 2 │
│ 3 │ 3 │ 10 │ 2 │
│ 4 │ 4 │ 12 │ 2 │
julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 6 │ 2 │
⋮
Last Group (1 row): a = 4
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 12 │ 2 │
julia> combine(sdf -> sum(sdf.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(gd) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 1.60944 │
│ 2 │ 2 │ 2.48491 │
│ 3 │ 3 │ 3.04452 │
│ 4 │ 4 │ 3.46574 │
julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 4 │ 6 │
│ 2 │ 2 │ 2 │ 8 │
│ 3 │ 3 │ 4 │ 10 │
│ 4 │ 4 │ 2 │ 12 │
julia> combine(gd) do sdf # dropping group when DataFrame() is returned
sdf.c[1] != 1 ? sdf : DataFrame()
end
6×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
│ 3 │ 3 │ 2 │ 3 │
│ 4 │ 3 │ 2 │ 7 │
│ 5 │ 4 │ 1 │ 4 │
│ 6 │ 4 │ 1 │ 8 │
julia> combine(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 3 │
│ 2 │ 2 │ 5 │ 7 │
│ 3 │ 1 │ 2 │ 3 │
│ 4 │ 1 │ 6 │ 7 │
│ 5 │ 2 │ 3 │ 5 │
│ 6 │ 2 │ 7 │ 9 │
│ 7 │ 1 │ 4 │ 5 │
│ 8 │ 1 │ 8 │ 9 │
julia> combine(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a │ b │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 6 │
│ 2 │ 1 │ 2 │ 6 │
│ 3 │ 2 │ 1 │ 8 │
│ 4 │ 2 │ 1 │ 8 │
│ 5 │ 3 │ 2 │ 10 │
│ 6 │ 3 │ 2 │ 10 │
│ 7 │ 4 │ 1 │ 12 │
│ 8 │ 4 │ 1 │ 12 │
julia> combine(gd, [:b, :c] .=> Ref)
4×3 DataFrame
│ Row │ a │ b_Ref │ c_Ref │
│ │ Int64 │ SubArra… │ SubArra… │
├─────┼───────┼──────────┼──────────┤
│ 1 │ 1 │ [2, 2] │ [1, 5] │
│ 2 │ 2 │ [1, 1] │ [2, 6] │
│ 3 │ 3 │ [2, 2] │ [3, 7] │
│ 4 │ 4 │ [1, 1] │ [4, 8] │
julia> combine(gd, AsTable(:) => Ref)
4×2 DataFrame
│ Row │ a │ a_b_c_Ref │
│ │ Int64 │ NamedTuple… │
├─────┼───────┼──────────────────────────────────────┤
│ 1 │ 1 │ (a = [1, 1], b = [2, 2], c = [1, 5]) │
│ 2 │ 2 │ (a = [2, 2], b = [1, 1], c = [2, 6]) │
│ 3 │ 3 │ (a = [3, 3], b = [2, 2], c = [3, 7]) │
│ 4 │ 4 │ (a = [4, 4], b = [1, 1], c = [4, 8]) │
julia> combine(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a │ b │ c │ b_c_sum │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 2 │ 5 │ 7 │
│ 3 │ 2 │ 1 │ 2 │ 3 │
│ 4 │ 2 │ 1 │ 6 │ 7 │
│ 5 │ 3 │ 2 │ 3 │ 5 │
│ 6 │ 3 │ 2 │ 7 │ 9 │
│ 7 │ 4 │ 1 │ 4 │ 5 │
│ 8 │ 4 │ 1 │ 8 │ 9 │
DataFrames.groupby — Function
groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)
Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.
Arguments
- df: an AbstractDataFrame to split
- cols: data frame columns to group by. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- sort: whether to sort groups according to the values of the grouping columns cols; if all cols are CategoricalVectors then groups are always sorted irrespective of the value of sort
- skipmissing: whether to skip groups with missing values in one of the grouping columns cols
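A small sketch of the sort and skipmissing keywords, using a hypothetical data frame whose grouping column g contains a missing value:

```julia
using DataFrames

df = DataFrame(g = ["b", "a", missing, "b"], v = 1:4)  # hypothetical data

all_groups = groupby(df, :g)                       # missing forms its own group
no_missing = groupby(df, :g, skipmissing = true)   # the missing group is dropped

# With sort = true the groups are ordered by the values of g.
sorted = groupby(df, :g, sort = true, skipmissing = true)
sorted_keys = [k.g for k in keys(sorted)]
```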
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.
cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.
A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).
GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices, but this will be slower than using the equivalent GroupKey.
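The dictionary-style access described above can be sketched as follows. The lookup value 99 is an arbitrary key that does not occur in the data, used to show get falling back to its default.

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = 2),
               b = repeat([2, 1], outer = 4),
               c = 1:8)
gd = groupby(df, :a)

k = first(keys(gd))        # GroupKey of the first group
by_key   = gd[k]           # lookup via GroupKey
by_named = gd[(a = 3,)]    # lookup via NamedTuple of grouping values
by_tuple = gd[(3,)]        # lookup via Tuple of grouping values

# get returns the default instead of throwing when the key is absent.
absent = get(gd, (a = 99,), nothing)
```

The grouping values of a key are available as properties, so k.a is the value of a for the first group.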
See also
combine, select, select!, transform, transform!
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
⋮
Last Group (2 rows): a = 4
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[1]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> last(gd)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[(a=3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> gd[(3,)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
julia> k = first(keys(gd))
GroupKey: (a = 1,)
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> for g in gd
println(g)
end
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
DataFrames.groupindices — Function
groupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd).
Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
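The rule above can be illustrated with a minimal sketch. It assumes DataFrames.jl is loaded; the data frame and its column names are made up for illustration. The row whose key is missing is dropped by skipmissing=true, so it belongs to no group:

```julia
using DataFrames

# Hypothetical data: the second row has a missing grouping key.
df = DataFrame(a = [1, missing, 2, 1], v = 1:4)

# skipmissing=true drops the group whose key is missing.
gd = groupby(df, :a, skipmissing=true)

# One entry per row of parent(gd); the dropped row maps to missing.
idx = groupindices(gd)

# Rows of gd[1] are exactly the rows whose group index equals 1.
rows_of_first = findall(isequal(1), idx)
```

Here idx holds an integer for every row that belongs to a group and missing for the skipped row, which makes it usable as a per-row group label.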
DataFrames.groupcols — Function
groupcols(gd::GroupedDataFrame)
Return a vector of Symbol column names in parent(gd) used for grouping.
DataFrames.valuecols — Function
valuecols(gd::GroupedDataFrame)
Return a vector of Symbol column names in parent(gd) not used for grouping.
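As a small sketch of how groupcols and valuecols partition the columns of the parent data frame (the data frame here is hypothetical, not taken from the examples in this page):

```julia
using DataFrames

# Hypothetical three-column data frame grouped by :a only.
df = DataFrame(a = [1, 1, 2], b = [2, 1, 2], c = 1:3)
gd = groupby(df, :a)

grouping = groupcols(gd)   # columns used for grouping
rest = valuecols(gd)       # all remaining columns of parent(gd)
```

Together the two vectors cover every column of parent(gd) exactly once.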
Base.keys — Function
keys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple and NamedTuple, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
b = repeat([2, 1], outer=[6]),
c = 1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
⋮
Last Group (2 rows): a = :baz, b = 1
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 1 │ 6 │
│ 2 │ baz │ 1 │ 12 │
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
GroupKey objects behave similarly to NamedTuples:
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
(:a, :b)
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:
julia> gd[k]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 2 │ 7 │
julia> gd[keys(gd)[1]] == gd[1]
true
keys(dfc::DataFrameColumns)
Get a vector of column names of dfc as Symbols.
Base.get — Function
get(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns.
key may be a NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby).
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
b = repeat([2, 1], outer=[3]),
c = 1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ foo │ 2 │ 1 │
│ 2 │ foo │ 1 │ 4 │
⋮
Last Group (2 rows): a = :baz
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ bar │ 1 │ 2 │
│ 2 │ bar │ 2 │ 5 │
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1 │ baz │ 2 │ 3 │
│ 2 │ baz │ 1 │ 6 │
julia> get(gd, (:qux,), nothing)
DataFrames.stack — Function
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=CategoricalValue{String})
Stack a data frame df, i.e. convert it from wide to long format.
Return the long-format DataFrame with: columns for each of the id_vars, column value_name (:value by default) holding the values of the stacked columns (measure_vars), and column variable_name (:variable by default) holding the name of the corresponding measure_vars variable.
If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.
Arguments
- df: the AbstractDataFrame to be stacked
- measure_vars: the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars is given, measure_vars defaults to all floating point columns.
- id_vars: the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars
- variable_name: the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
- value_name: the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
- view: whether the stacked data frame should be a view rather than contain freshly allocated vectors.
- variable_eltype: determines the element type of column variable_name. By default a categorical vector of strings is created. If variable_eltype=Symbol it is a vector of Symbol, and if variable_eltype=String a vector of String is produced.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
DataFrames.unstack — Function
unstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)
Unstack data frame df, i.e. convert it from long to wide format.
If colkey contains missing values then they will be skipped and a warning will be printed.
If a combination of rowkeys and colkey contains duplicate entries then the last value will be retained and a warning will be printed.
Arguments
- df: the AbstractDataFrame to be unstacked
- rowkeys: the columns with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- colkey: the column (Symbol, string or integer) holding the column names in wide format, defaults to :variable
- value: the value column (Symbol, string or integer), defaults to :value
- renamecols: a function called on each unique value in colkey which must return the name of the column to be created (typically as a string or a Symbol). Duplicate names are not allowed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
Note that there are some differences between the widened results above.
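One such difference is which identifier columns survive the round trip. The following is a minimal sketch on a smaller, hypothetical data frame; it compares only the sets of column names and makes no claim about column order:

```julia
using DataFrames

# Hypothetical wide table: id and a are identifiers, c and d are measurements.
wide = DataFrame(id = 1:4, a = [1, 1, 2, 2], c = randn(4), d = randn(4))

# With no column arguments, the floating point columns (c and d) are stacked.
long = stack(wide)

# Row keys default to every column that is neither colkey nor value.
wide1 = unstack(long, :variable, :value)

# Row keys restricted to id: the other identifier column a is dropped.
wide2 = unstack(long, :id, :variable, :value)

cols1 = Set(propertynames(wide1))
cols2 = Set(propertynames(wide2))
```

Passing rowkeys explicitly therefore also acts as a column selection: identifier columns that are not listed do not appear in the result.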
Basics
Missings.allowmissing — Function
allowmissing(df::AbstractDataFrame, cols=:)
Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
Examples
julia> df = DataFrame(a=[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> allowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
DataFrames.allowmissing! — Function
allowmissing!(df::DataFrame, cols=:)
Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
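A minimal sketch of the in-place variant, using a hypothetical two-column data frame and converting only one of its columns:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Convert only column :a in place; :b keeps its element type.
allowmissing!(df, :a)

# Storing missing in :a is now possible.
df[1, :a] = missing
```

Without the conversion, the assignment of missing to a column with element type Int would fail.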
Base.append! — Function
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2 to the end of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.
The exact behavior of append! depends on the cols argument:
- If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
- If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
- If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
- If cols == :subset then append! behaves like for :intersect but if some column is missing in df2 then a missing value is pushed to df.
- If cols == :union then append! adds columns missing in df that are present in df2; for columns present in df but missing in df2 a missing value is pushed.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
The above rule has the following exceptions:
- If df has no columns then copies of columns from df2 are added to it.
- If df2 has no columns then calling append! leaves df unchanged.
Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
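The cols=:union rule described above can be sketched as follows. The data frames are hypothetical, and the sketch relies on the documented default promote=true for :union:

```julia
using DataFrames

df = DataFrame(A = 1:2)
extra = DataFrame(A = 3:4, B = 5:6)   # has a column :B that df lacks

# cols=:union adds :B to df; the rows that were already in df
# receive missing in the new column.
append!(df, extra, cols = :union)
```

With the default cols=:setequal the same call would throw an error, because the two data frames do not have the same set of columns.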
See also
Use push! to add individual rows to a data frame and vcat to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4.0:6.0, B=4:6);
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
CategoricalArrays.categorical — Function
categorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Return a copy of data frame df with columns cols converted to CategoricalVector.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) or a Type.
If categorical is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.
If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.
All created CategoricalVectors are unordered.
Examples
julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Cat… │
├─────┼───────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df, :)
2×2 DataFrame
│ Row │ a │ b │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
DataFrames.categorical! — Function
categorical!(df::DataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Change columns selected by cols in data frame df to CategoricalVector.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) or a Type.
If categorical! is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.
If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.
All created CategoricalVectors are unordered.
Examples
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ Cat… │ Int64 │ Cat… │
├─────┼──────┼───────┼──────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
CategoricalValue{String,UInt32}
Int64
CategoricalValue{String,UInt32}
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Cat… │ String │
├─────┼────────┼──────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
String
CategoricalValue{Int64,UInt8}
String
DataFrames.completecases — Function
completecases(df::AbstractDataFrame, cols=:)
Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> completecases(df)
5-element BitArray{1}:
false
false
false
true
true
julia> completecases(df, :x)
5-element BitArray{1}:
false
true
false
true
true
julia> completecases(df, [:x, :y])
5-element BitArray{1}:
false
false
false
true
true
Base.copy — Function
copy(df::DataFrame; copycols::Bool=true)
Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.
copy(dfr::DataFrameRow)
Construct a NamedTuple with the same contents as the DataFrameRow. This method returns a NamedTuple so that the returned object is not affected by changes to the parent data frame of which dfr is a view.
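Both copy methods can be contrasted in a short sketch. The data frame is hypothetical, and the variable names are chosen only for illustration:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

shared = copy(df, copycols = false)   # new DataFrame, same column vectors
independent = copy(df)                # new DataFrame, copied column vectors

row = df[1, :]         # DataFrameRow: a view into df
snapshot = copy(row)   # NamedTuple: detached from df

# Mutating the parent is visible through the view and the shared
# column, but not through the copies.
df[1, :a] = 100
```

After the assignment, row.a and shared.a[1] reflect the new value, while snapshot.a and independent.a[1] keep the old one.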
DataFrames.DataFrame! — Function
DataFrame!(args...; kwargs...)
Equivalent to DataFrame(args...; copycols=false, kwargs...).
If kwargs contains the copycols keyword argument an error is thrown.
Examples
julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df2 = DataFrame!(df1);
julia> df1.a === df2.a
true
Base.delete! — Function
delete!(df::DataFrame, inds)
Delete rows specified by inds from a DataFrame df in place and return it.
Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not.
Examples
julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> delete!(d, 2)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 3 │ 6 │
DataAPI.describe — Function
describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.
Arguments
- df: the AbstractDataFrame
- stats::Union{Symbol, Pair}...: the summary statistics to report. Arguments can be:
  - A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nunique, :nmissing, and :eltype.
  - :all as the only Symbol argument to return all statistics.
  - A name => function pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
- cols: a keyword argument allowing to select only a subset of columns from df to describe. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
Details
For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.
When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings.
Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')
10×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Float64 │ Char │
├─────┼───────┼─────────┼──────┤
│ 1 │ 1 │ 0.1 │ 'a' │
│ 2 │ 2 │ 0.2 │ 'b' │
│ 3 │ 3 │ 0.3 │ 'c' │
│ 4 │ 4 │ 0.4 │ 'd' │
│ 5 │ 5 │ 0.5 │ 'e' │
│ 6 │ 6 │ 0.6 │ 'f' │
│ 7 │ 7 │ 0.7 │ 'g' │
│ 8 │ 8 │ 0.8 │ 'h' │
│ 9 │ 9 │ 0.9 │ 'i' │
│ 10 │ 10 │ 1.0 │ 'j' │
julia> describe(df)
3×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼────────┼─────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1 │ i │ 5.5 │ 1 │ 5.5 │ 10 │ │ │ Int64 │
│ 2 │ x │ 0.55 │ 0.1 │ 0.55 │ 1.0 │ │ │ Float64 │
│ 3 │ y │ │ 'a' │ │ 'j' │ 10 │ │ Char │
julia> describe(df, :min, :max)
3×3 DataFrame
│ Row │ variable │ min │ max │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 10 │
│ 2 │ x │ 0.1 │ 1.0 │
│ 3 │ y │ 'a' │ 'j' │
julia> describe(df, :min, :sum => sum)
3×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1 │ i │ 1 │ 55 │
│ 2 │ x │ 0.1 │ 5.5 │
│ 3 │ y │ 'a' │ │
julia> describe(df, :min, :sum => sum, cols=:x)
1×3 DataFrame
│ Row │ variable │ min │ sum │
│ │ Symbol │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┤
│ 1 │ x │ 0.1 │ 5.5 │
Missings.disallowmissing — Function
disallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)
Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
If error=false then columns containing a missing value will be skipped instead of throwing an error.
Examples
julia> df = DataFrame(a=Union{Int,Missing}[1,2])
2×1 DataFrame
│ Row │ a │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> disallowmissing(df)
2×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> df = DataFrame(a=[1,missing], b=Union{Int,Missing}[1,2])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
julia> disallowmissing(df, error=false)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
DataFrames.disallowmissing! — Function
disallowmissing!(df::DataFrame, cols=:; error::Bool=true)
Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If cols is omitted all columns in the data frame are converted.
If error=false then columns containing a missing value will be skipped instead of throwing an error.
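A minimal sketch of the in-place variant with error=false, assuming a hypothetical data frame in which both columns allow missing but only one of them contains it:

```julia
using DataFrames

# Both columns allow missing, but only :a actually contains one.
df = DataFrame(a = [1, missing], b = Union{Int, Missing}[1, 2])

# With error=false the column that still holds a missing value is skipped
# and the other column is converted.
disallowmissing!(df, error = false)
```

With the default error=true the same call would throw, because column :a cannot be converted while it contains a missing value.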
DataFrames.dropmissing — Function
dropmissing(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Return a copy of data frame df excluding rows with missing values.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If disallowmissing is true (the default) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.
See also: completecases and dropmissing!.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing(df)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
DataFrames.dropmissing! — Function
dropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Remove rows with missing values from data frame df and return it.
If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.
See also: dropmissing and completecases.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ missing │ missing │
│ 2 │ 2 │ 4 │ missing │
│ 3 │ 3 │ missing │ c │
│ 4 │ 4 │ 2 │ d │
│ 5 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df))
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1 │ 2 │ 4 │ missing │
│ 2 │ 4 │ 2 │ d │
│ 3 │ 5 │ 1 │ e │
julia> dropmissing!(copy(df), [:x, :y])
2×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1 │ 4 │ 2 │ d │
│ 2 │ 5 │ 1 │ e │
Compat.eachcol — Function
eachcol(df::AbstractDataFrame)
Return a DataFrameColumns that is an AbstractVector that allows iterating an AbstractDataFrame column by column. Additionally it is allowed to index DataFrameColumns using column names.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Array{Int64,1}:
3
3
julia> sum.(eachcol(df))
2-element Array{Int64,1}:
10
50
Compat.eachrow — Function
eachrow(df::AbstractDataFrame)
Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.
Because DataFrameRows have an eltype of Any, use copy(dfr::DataFrameRow) to obtain a named tuple, which supports iteration and property access like a DataFrameRow, but also passes information on the eltypes of the columns of df.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> eachrow(df)
4-element DataFrameRows:
DataFrameRow (row 1)
x 1
y 11
DataFrameRow (row 2)
x 2
y 12
DataFrameRow (row 3)
x 3
y 13
DataFrameRow (row 4)
x 4
y 14
julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
DataFrameRow (row 4)
y 14
x 4
DataFrameRow (row 3)
y 13
x 3
Base.filter — Function
filter(function, df::AbstractDataFrame)
filter(cols => function, df::AbstractDataFrame)
Return a copy of data frame df containing only rows for which function returns true.
If cols is not specified then the function is passed DataFrameRows.
If cols is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.
Passing cols leads to a more efficient execution of the operation for large data frames.
See also: filter!
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter(row -> row.x > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
Base.filter! — Function
filter!(function, df::AbstractDataFrame)
filter!(cols => function, df::AbstractDataFrame)
Remove rows from data frame df for which function returns false.
If cols is not specified then the function is passed DataFrameRows. If cols is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.
Passing cols leads to a more efficient execution of the operation for large data frames.
See also: filter
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> filter!(row -> row.x > 1, df);
julia> df
2×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
julia> filter!(:x => x -> x == 3, df);
julia> df
1×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df);
julia> df
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
DataFrames.flatten — Function
flatten(df::AbstractDataFrame, cols)
When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
Examples
julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])
2×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ [1, 2] │ [5, 6] │
│ 2 │ 2 │ [3, 4] │ [7, 8] │
julia> flatten(df1, :b)
4×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Array… │
├─────┼───────┼───────┼────────┤
│ 1 │ 1 │ 1 │ [5, 6] │
│ 2 │ 1 │ 2 │ [5, 6] │
│ 3 │ 2 │ 3 │ [7, 8] │
│ 4 │ 2 │ 4 │ [7, 8] │
julia> flatten(df1, [:b, :c])
4×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 5 │
│ 2 │ 1 │ 2 │ 6 │
│ 3 │ 2 │ 3 │ 7 │
│ 4 │ 2 │ 4 │ 8 │
julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────┤
│ 1 │ 1 │ ("p", "q") │
│ 2 │ 2 │ ("r", "s") │
julia> flatten(df2, :b)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ p │
│ 2 │ 1 │ q │
│ 3 │ 2 │ r │
│ 4 │ 2 │ s │
julia> df3 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7]])
2×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ [1, 2] │ [5, 6] │
│ 2 │ 2 │ [3, 4] │ [7] │
julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c
are not the same in row 2
Base.hcat — Function
hcat(df::AbstractDataFrame...;
     makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
     makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
     makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate AbstractDataFrames and optionally AbstractVectors.
If an AbstractVector is passed then a column name for it is automatically generated, :x1 by default.
If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).
If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.
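The caution about copycols=false can be illustrated with a short sketch. It assumes DataFrames.jl is loaded; the names left, right, copied and shared are illustrative only.

```julia
using DataFrames

left = DataFrame(A = [1, 2, 3])
right = DataFrame(B = [4, 5, 6])

# Default (copycols=true): the result owns fresh copies of the columns.
copied = hcat(left, right)
copied.A[1] = -1
left.A[1]            # still 1

# copycols=false: the result shares its column vectors with the sources,
# so a write through one object is visible through the other.
shared = hcat(left, right, copycols=false)
shared.A[1] = 100
left.A[1]            # now 100
```

Sharing avoids allocation, but any operation that resizes a shared column in only one of the objects can leave the other in an inconsistent state.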
Example
julia> [DataFrame(A=1:3) DataFrame(B=1:3)]
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
│ Row │ A │ B │ A_1 │ B_1 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 4 │
│ 2 │ 2 │ 2 │ 5 │ 5 │
│ 3 │ 3 │ 3 │ 6 │ 6 │
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
DataFrames.insertcols! — Function
insertcols!(df::DataFrame, [ind::Int], (name=>col)::Pair...;
            makeunique::Bool=false, copycols::Bool=true)
Insert a column into a data frame in place. Return the updated DataFrame. If ind is omitted it is set to ncol(df)+1 (the column is inserted as the last column).
Arguments
- df: the DataFrame to which we want to add columns
- ind: a position at which we want to insert a column
- name: the name of the new column
- col: an AbstractVector giving the contents of the new column or a value of any type other than AbstractArray which will be repeated to fill a new vector; as a particular rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way
- makeunique: defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
- copycols: whether vectors passed as columns should be copied
If col is an AbstractRange then the result of collect(col) is inserted.
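The rules for scalars, ranges and Ref-wrapped values can be sketched as follows, assuming DataFrames.jl is loaded. The column names flag, r and pair are illustrative.

```julia
using DataFrames

d = DataFrame(a = 1:3)

# A scalar is repeated to fill the new column.
insertcols!(d, :flag => true)

# A range is collected into a Vector before being stored.
insertcols!(d, :r => 10:12)

# Wrapping a vector in Ref makes it a single value repeated in every row,
# instead of being interpreted as the column itself.
insertcols!(d, :pair => Ref([1, 2]))

d.flag                # [true, true, true]
d.r isa Vector{Int}   # true
d.pair[3]             # [1, 2]
```

Without the Ref wrapper the two-element vector would be treated as a column and rejected, because its length does not match the three rows of d.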
Examples
julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> insertcols!(d, 1, :b => 'a':'c')
3×2 DataFrame
│ Row │ b │ a │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 1 │
│ 2 │ 'b' │ 2 │
│ 3 │ 'c' │ 3 │
julia> insertcols!(d, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
│ Row │ b │ c │ c_1 │ a │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼──────┼───────┼───────┼───────┤
│ 1 │ 'a' │ 2 │ 3 │ 1 │
│ 2 │ 'b' │ 3 │ 4 │ 2 │
│ 3 │ 'c' │ 4 │ 5 │ 3 │
Base.length — Function
length(dfr::DataFrameRow)
Return the number of elements of dfr.
See also: size
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];
julia> length(dfr)
2
DataFrames.mapcols — Function
mapcols(f::Union{Function,Type}, df::AbstractDataFrame)
Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.
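A short sketch of the two points above, assuming DataFrames.jl is loaded: a function returning scalars yields a one-row result, and a function returning its argument does not lead to shared columns. The names totals and same are illustrative.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# f returns a scalar for every column, so the result has a single row.
totals = mapcols(sum, df)
totals.x     # [10]
totals.y     # [50]

# f returns its argument, which mapcols copies before storing.
same = mapcols(identity, df)
same.x == df.x    # true: equal contents
same.x === df.x   # false: a different vector object
```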
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
DataFrames.mapcols! — Function
mapcols!(f::Union{Function,Type}, df::DataFrame)
Update a DataFrame in-place where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).
Note that mapcols! reuses the columns from df if they are returned by f.
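The reuse rule can be checked with object identity. This is a sketch assuming DataFrames.jl is loaded; original_x is an illustrative name.

```julia
using DataFrames

df = DataFrame(x = [1, 2], y = [3, 4])
original_x = df.x          # df.x returns the stored column without copying

# identity returns the column itself, so mapcols! keeps the same vector.
mapcols!(identity, df)
df.x === original_x        # true

# A function that allocates a new vector replaces the stored column.
mapcols!(col -> col .* 2, df)
df.x === original_x        # false
df.x                       # [2, 4]
original_x                 # [1, 2], unchanged
```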
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │
julia> mapcols!(x -> x.^2, df);
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 121 │
│ 2 │ 4 │ 144 │
│ 3 │ 9 │ 169 │
│ 4 │ 16 │ 196 │
Base.names — Function
names(df::AbstractDataFrame)
names(df::AbstractDataFrame, cols)
Return a freshly allocated Vector{String} of names of columns contained in df.
If cols is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Not, and Between). cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
See also propertynames which returns a Vector{Symbol}.
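For instance, selectors can narrow the returned names. This is a minimal sketch assuming DataFrames.jl is loaded, with made-up column names.

```julia
using DataFrames

df = DataFrame(id = 1, x1 = 2, x2 = 3, y = 4)

names(df)            # ["id", "x1", "x2", "y"]
names(df, r"^x")     # ["x1", "x2"]
names(df, Not(:id))  # ["x1", "x2", "y"]
propertynames(df)    # [:id, :x1, :x2, :y]
```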
DataFrames.ncol — Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame df.
See also size.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
Base.ndims — Function
ndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})
Return the number of dimensions of a data frame, which is always 2.
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})
Return the number of dimensions of a data frame row, which is always 1.
DataFrames.nonunique — Function
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)
Return a Vector{Bool} in which true entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal).
Arguments
- df: AbstractDataFrame
- cols: a selector specifying the column(s) to compare. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
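A deterministic sketch of the duplicate rule, assuming DataFrames.jl is loaded and using made-up data:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 1, 2], b = ["x", "y", "x", "z"])

# Comparing all columns: only row 3 repeats an earlier row (row 1).
all_cols = nonunique(df)        # [false, false, true, false]

# Comparing column :a only: rows 3 and 4 repeat the values of rows 1 and 2.
only_a = nonunique(df, :a)      # [false, false, true, true]
```

The first occurrence of a row is never flagged; only later repetitions are.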
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
DataFrames.nrow — Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame df.
See also size.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
DataFrames.order — Function
order(col::ColumnIndex; kwargs...)
Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.
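Since order applies to a single column, it is mainly useful when columns need different orderings. The sketch below assumes DataFrames.jl is loaded and that sort accepts a vector mixing plain column names with order(...) entries; the data is made up.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a", "b"], v = [2, 1, 4, 3])

# Sort by g in descending order, then by v in ascending order within each g.
sorted = sort(df, [order(:g, rev=true), :v])

sorted.g   # ["b", "b", "a", "a"]
sorted.v   # [1, 3, 2, 4]
```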
Examples
julia> df = DataFrame(x = [-3, -1, 0, 2, 4], y = 1:5)
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ -3 │ 1 │
│ 2 │ -1 │ 2 │
│ 3 │ 0 │ 3 │
│ 4 │ 2 │ 4 │
│ 5 │ 4 │ 5 │
julia> sort(df, order(:x, rev=true))
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 5 │
│ 2 │ 2 │ 4 │
│ 3 │ 0 │ 3 │
│ 4 │ -1 │ 2 │
│ 5 │ -3 │ 1 │
julia> sort(df, order(:x, by=abs))
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 0 │ 3 │
│ 2 │ -1 │ 2 │
│ 3 │ 2 │ 4 │
│ 4 │ -3 │ 1 │
│ 5 │ 4 │ 5 │
Base.push! — Function
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
      cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add in-place one row at the end of df taking the values from row.
Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.
If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:
- If cols == :setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
- If cols == :orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
- If cols == :intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
- If cols == :subset then push! behaves like for :intersect but if some column is missing in row then a missing value is pushed to df.
- If cols == :union then columns missing in df that are present in row are added to df (using missing for existing rows) and a missing value is pushed to columns missing in row that are present in df.
If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.
As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.
Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
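The effect of promote on a positional row can be sketched as follows, assuming DataFrames.jl is loaded. The variable failed is illustrative and only records whether the first push! threw.

```julia
using DataFrames

df = DataFrame(A = [1, 2], B = ["x", "y"])

# 1.5 cannot be converted to Int, so with the default promote=false
# the push fails and df keeps its two rows.
failed = try
    push!(df, (1.5, "z"))
    false
catch
    true
end
nrow(df)       # 2

# With promote=true column A is reallocated with a wider element type.
push!(df, (1.5, "z"), promote=true)
eltype(df.A)   # Float64
nrow(df)       # 3
```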
Examples
julia> df = DataFrame(A=1:3, B=1:3);
julia> push!(df, (true, false))
4×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
julia> push!(df, df[1, :])
5×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 1 │ 0 │
│ 5 │ 1 │ 1 │
│ 6 │ 1 │ 0 │
julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
│ Row │ A │ B │ C │
│ │ Float64 │ Int64? │ Float64? │
├─────┼─────────┼─────────┼──────────┤
│ 1 │ 1.0 │ 1 │ missing │
│ 2 │ 2.0 │ 2 │ missing │
│ 3 │ 3.0 │ 3 │ missing │
│ 4 │ 1.0 │ 0 │ missing │
│ 5 │ 1.0 │ 1 │ missing │
│ 6 │ 1.0 │ 0 │ missing │
│ 7 │ 1.0 │ missing │ 1.0 │
julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
│ Row │ A │ B │ C │
│ │ Float64? │ Int64? │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1 │ 1.0 │ 1 │ missing │
│ 2 │ 2.0 │ 2 │ missing │
│ 3 │ 3.0 │ 3 │ missing │
│ 4 │ 1.0 │ 0 │ missing │
│ 5 │ 1.0 │ 1 │ missing │
│ 6 │ 1.0 │ 0 │ missing │
│ 7 │ 1.0 │ missing │ 1.0 │
│ 8 │ missing │ missing │ missing │
DataFrames.rename — Function
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
- df: the AbstractDataFrame
- d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
- f: a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
- vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
- makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings in to and from is not allowed.
See also: rename!
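As a small sketch of the function form, assuming DataFrames.jl is loaded, an anonymous function can derive every new name from the old one. The "col_" prefix is an arbitrary choice for illustration.

```julia
using DataFrames

df = DataFrame(i = 1, x = 2, y = 3)

# The function receives each old name as a String and returns the new name.
prefixed = rename(name -> "col_" * name, df)

names(prefixed)   # ["col_i", "col_x", "col_y"]
names(df)         # ["i", "x", "y"]: the original is left untouched
```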
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
│ Row │ i │ y │ x │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename(uppercase, df)
1×3 DataFrame
│ Row │ I │ X │ Y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
DataFrames.rename! — Function
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
- df: the AbstractDataFrame
- d: an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
- f: a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
- vals: new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
- makeunique: if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:
- from value can be a Symbol, an AbstractString or an Integer;
- to value can be a Symbol or an AbstractString.
Mixing symbols and strings in to and from is not allowed.
See also: rename
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i │ x │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
│ Row │ A │ X │ y │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make
them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
│ Row │ a │ b │ a_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
julia> rename!(uppercase, df)
1×3 DataFrame
│ Row │ A │ B │ A_1 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 3 │
Base.repeat — Function
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.
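The difference between the two keywords is easiest to see when each is used alone. This is a sketch assuming DataFrames.jl is loaded; by_inner and by_outer are illustrative names.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

# inner repeats each row in place before moving to the next one.
by_inner = repeat(df, inner = 2)
by_inner.a   # [1, 1, 2, 2]

# outer repeats the whole block of rows.
by_outer = repeat(df, outer = 2)
by_outer.a   # [1, 2, 1, 2]
```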
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating the full set of rows in df the number of times specified by count.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
DataFrames.repeat! — Function
repeat!(df::DataFrame; inner::Integer = 1, outer::Integer = 1)
Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat!(df, inner = 2, outer = 3);
julia> df
12×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 4 │
│ 5 │ 1 │ 3 │
│ 6 │ 1 │ 3 │
│ 7 │ 2 │ 4 │
│ 8 │ 2 │ 4 │
│ 9 │ 1 │ 3 │
│ 10 │ 1 │ 3 │
│ 11 │ 2 │ 4 │
│ 12 │ 2 │ 4 │
repeat!(df::DataFrame, count::Integer)
Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
julia> repeat!(df, 2)
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 1 │ 3 │
│ 4 │ 2 │ 4 │
DataFrames.select — Function
select(df::AbstractDataFrame, args...; copycols::Bool=true)
Create a new data frame that contains columns from df specified by args and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).
If df is a DataFrame or copycols=true then column renaming and transformations are supported.
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.
If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.
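The contrast between whole-column and row-wise application can be sketched like this, assuming DataFrames.jl is loaded. The output column name prod is illustrative.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# Without ByRow the function receives the whole column vectors at once,
# so it has to broadcast explicitly.
whole = select(df, [:a, :b] => ((a, b) -> a .* b) => :prod)

# With ByRow the function receives one element per selected column.
rowwise = select(df, [:a, :b] => ByRow((a, b) -> a * b) => :prod)

whole.prod     # [4, 10, 18]
rowwise.prod   # [4, 10, 18]
```

ByRow is convenient for functions that are only defined on scalars.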
Column transformation can also be specified using the short old_column => fun form. In this case, new_column_name is automatically generated as $(old_column)_$(fun). Up to three column names are used for multiple input columns and they are joined using _; if more than three columns are passed then the name consists of the first two names followed by the etc suffix, e.g. [:a,:b,:c,:d] => fun produces the new column name :a_b_etc_fun.
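The naming rule can be observed directly. This is a sketch assuming DataFrames.jl is loaded; total is a hypothetical helper defined only for this illustration.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# Hypothetical helper: adds any number of column vectors elementwise.
total(cols...) = reduce(+, cols)

names(select(df, :a => sum))                    # ["a_sum"]
names(select(df, [:a, :b, :c] => total))        # ["a_b_c_total"]
names(select(df, [:a, :b, :c, :d] => total))    # ["a_b_etc_total"]
```

Passing an explicit target name, as in old_column => fun => new_column_name, avoids relying on generated names.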
Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).
As a special rule, passing nrow without specifying old_column creates a column named :nrow containing the number of rows in the source data frame, and passing nrow => new_column_name stores the number of rows in the source data frame in the new_column_name column.
If a collection of column names is passed to select! or select then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, the syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
If df is a DataFrame a new DataFrame is returned. If copycols=false, then the returned DataFrame shares column vectors with df where possible. If copycols=true (the default), then the returned DataFrame will not share columns with df. The only exception to this rule is the old_column => fun => new_column transformation when fun returns a vector that is not allocated by fun but is neither a SubArray nor one of the input vectors. In such a case a new DataFrame might contain aliases. Such a situation can only happen with transformations which return vectors other than their inputs, e.g. with select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2) when c is a vector object or with select(df, :a => (x -> df.c) => :c2).
If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.
Note that including the same column several times in the data frame via renaming or transformations that return the same object when copycols=false will create column aliases. An example of such a situation is select(df, :a, :a => :b, :a => identity => :c, copycols=false).
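The copying rules for a DataFrame input can be checked with object identity. This is a sketch assuming DataFrames.jl is loaded; copied and shared are illustrative names.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

copied = select(df, :a)                    # copycols=true by default
shared = select(df, :a, copycols=false)

copied.a === df.a   # false: an independent copy
shared.a === df.a   # true: the same vector object

# Writing through the shared result is visible in df.
shared.a[1] = 100
df.a[1]       # 100
copied.a[1]   # 1
```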
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select(df, :b)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
julia> select(df, Not(:b)) # drop column :b from df
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> select(df, :a => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.841471 │ 4 │
│ 2 │ 0.909297 │ 5 │
│ 3 │ 0.14112 │ 6 │
julia> select(df, :, [:a, :b] => (a,b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ a │ b │ a_b_function │
│ │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼──────────────┤
│ 1 │ 1 │ 4 │ 0.0 │
│ 2 │ 2 │ 5 │ 2.0 │
│ 3 │ 3 │ 6 │ 4.0 │
julia> select(df, names(df) .=> sum)
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> select(df, names(df) .=> sum .=> [:A, :B])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> select(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│ │ Float64 │
├─────┼──────────┤
│ 1 │ 2.5 │
│ 2 │ 3.5 │
│ 3 │ 4.5 │
select(gd::GroupedDataFrame, args...;
       copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)
Apply args to gd following the rules described in combine.
If ungroup=true the result is a DataFrame. If ungroup=false the result is a GroupedDataFrame (in this case the returned value retains the order of groups of gd).
The parent of the returned value has as many rows as parent(gd) and in the same order, except when the returned value has no columns (in which case it has zero rows). If an operation in args returns a single value it is always broadcasted to have this number of rows.
If copycols=false then do not perform copying of columns that are not transformed.
If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal. If keepkeys=false and some generated columns have the same name as grouping columns, they are kept and are not required to be equal to grouping columns.
If ungroup=true (the default) a DataFrame is returned. If ungroup=false a GroupedDataFrame grouped by the same key columns as gd is returned.
If gd has zero groups then no transformations are applied.
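A typical use is a group-wise transformation that keeps one output row per input row. The sketch below assumes DataFrames.jl and the Statistics standard library are loaded; the data and the column name c_centered are made up.

```julia
using DataFrames
using Statistics

df = DataFrame(a = [1, 2, 1, 2], c = [1.0, 10.0, 3.0, 20.0])
gd = groupby(df, :a)

# Subtract the group mean from each value; the function returns a vector
# with one entry per row of the group.
centered = select(gd, :c => (x -> x .- mean(x)) => :c_centered)

centered.a            # [1, 2, 1, 2]
centered.c_centered   # [-1.0, -5.0, 1.0, 5.0]
```

The rows come back in the order of parent(gd), not grouped together, which makes the result easy to align with the original data frame.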
See also: groupby, combine, select!, transform, transform!
Examples
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8)
8×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 1 │ 2 │
│ 3 │ 1 │ 2 │ 3 │
│ 4 │ 2 │ 1 │ 4 │
│ 5 │ 2 │ 2 │ 5 │
│ 6 │ 1 │ 1 │ 6 │
│ 7 │ 1 │ 2 │ 7 │
│ 8 │ 2 │ 1 │ 8 │
julia> gd = groupby(df, :a);
julia> select(gd, :c => sum, nrow)
8×3 DataFrame
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 19 │ 5 │
│ 2 │ 1 │ 19 │ 5 │
│ 3 │ 1 │ 19 │ 5 │
│ 4 │ 2 │ 17 │ 3 │
│ 5 │ 2 │ 17 │ 3 │
│ 6 │ 1 │ 19 │ 5 │
│ 7 │ 1 │ 19 │ 5 │
│ 8 │ 2 │ 17 │ 3 │
julia> select(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 19 │ 5 │
│ 2 │ 1 │ 19 │ 5 │
│ 3 │ 1 │ 19 │ 5 │
│ 4 │ 1 │ 19 │ 5 │
│ 5 │ 1 │ 19 │ 5 │
⋮
Last Group (3 rows): a = 2
│ Row │ a │ c_sum │ nrow │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 17 │ 3 │
│ 2 │ 2 │ 17 │ 3 │
│ 3 │ 2 │ 17 │ 3 │
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
8×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 5.52943 │
│ 2 │ 1 │ 5.52943 │
│ 3 │ 1 │ 5.52943 │
│ 4 │ 2 │ 5.07517 │
│ 5 │ 2 │ 5.07517 │
│ 6 │ 1 │ 5.52943 │
│ 7 │ 1 │ 5.52943 │
│ 8 │ 2 │ 5.07517 │
julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 19 │
│ 2 │ 1 │ 8 │ 19 │
│ 3 │ 1 │ 8 │ 19 │
│ 4 │ 2 │ 4 │ 17 │
│ 5 │ 2 │ 4 │ 17 │
│ 6 │ 1 │ 8 │ 19 │
│ 7 │ 1 │ 8 │ 19 │
│ 8 │ 2 │ 4 │ 17 │
julia> select(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 2 │ 3 │
│ 3 │ 2 │ 3 │ 5 │
│ 4 │ 1 │ 4 │ 5 │
│ 5 │ 2 │ 5 │ 7 │
│ 6 │ 1 │ 6 │ 7 │
│ 7 │ 2 │ 7 │ 9 │
│ 8 │ 1 │ 8 │ 9 │
julia> select(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a │ b │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 19 │
│ 2 │ 1 │ 1 │ 19 │
│ 3 │ 1 │ 2 │ 19 │
│ 4 │ 2 │ 1 │ 17 │
│ 5 │ 2 │ 2 │ 17 │
│ 6 │ 1 │ 1 │ 19 │
│ 7 │ 1 │ 2 │ 19 │
│ 8 │ 2 │ 1 │ 17 │
julia> select(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a │ b │ c │ b_c_sum │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 2 │ 1 │ 3 │
│ 2 │ 1 │ 1 │ 2 │ 3 │
│ 3 │ 1 │ 2 │ 3 │ 5 │
│ 4 │ 2 │ 1 │ 4 │ 5 │
│ 5 │ 2 │ 2 │ 5 │ 7 │
│ 6 │ 1 │ 1 │ 6 │ 7 │
│ 7 │ 1 │ 2 │ 7 │ 9 │
│ 8 │ 2 │ 1 │ 8 │ 9 │
DataFrames.select! — Function
select!(df::DataFrame, args...)
Mutate df in place to retain only columns specified by args... and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).
Arguments passed as args... can be:
- Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
- Column transformation operations using the Pair notation that is described below and vectors of such pairs.
Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.
If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.
Column transformation can also be specified using the short old_column => fun form. In this case, new_column_name is automatically generated as $(old_column)_$(fun). Up to three column names are used for multiple input columns and they are joined using _; if more than three columns are passed then the name consists of the first two names followed by the etc suffix, e.g. [:a,:b,:c,:d] => fun produces the new column name :a_b_etc_fun.
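The naming rule for more than three input columns can be sketched as follows. The data frame and its values are made up for illustration, and ByRow(max) is chosen only because max has a short, readable name.

```julia
using DataFrames

# Hypothetical four-column data frame, used only to illustrate name generation.
df = DataFrame(a=1:3, b=[5, 0, 2], c=[2, 9, 1], d=[0, 3, 7])

# Short `old_column => fun` form with four input columns: the generated
# name keeps the first two column names, then `etc`, then the function name.
select!(df, [:a, :b, :c, :d] => ByRow(max))

propertynames(df)   # expected to hold the single name :a_b_etc_max
```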
Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).
As a special rule, passing nrow without specifying old_column creates a column named :nrow containing the number of rows in the source data frame, and passing nrow => new_column_name stores the number of rows in the source data frame in the new_column_name column.
If a collection of column names is passed to select! or select, then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, the syntax to move column :col to the first position in the data frame is select!(df, :col, :). By contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
Note that including the same column several times in the data frame via renaming or transformations that return the same object without copying will create column aliases. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> select!(df, 2)
3×1 DataFrame
│ Row │ b │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 5 │
│ 3 │ 6 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> select!(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.841471 │ 4 │
│ 2 │ 0.909297 │ 5 │
│ 3 │ 0.14112 │ 6 │
julia> select!(df, :, [:c, :b] => (c,b) -> c .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ c │ b │ c_b_function │
│ │ Float64 │ Int64 │ Float64 │
├─────┼──────────┼───────┼──────────────┤
│ 1 │ 0.841471 │ 4 │ -0.158529 │
│ 2 │ 0.909297 │ 5 │ 0.909297 │
│ 3 │ 0.14112 │ 6 │ 1.14112 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> select!(df, names(df) .=> sum);
julia> df
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 6 │ 15 │
│ 2 │ 6 │ 15 │
│ 3 │ 6 │ 15 │
julia> df = DataFrame(a=1:3, b=4:6);
julia> using Statistics
julia> select!(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│ │ Float64 │
├─────┼──────────┤
│ 1 │ 2.5 │
│ 2 │ 3.5 │
│ 3 │ 4.5 │
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)
An equivalent of select(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup) but updates parent(gd) in place.
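A minimal sketch of the grouped method, assuming a small made-up data frame with a grouping column g and a value column x (both names are hypothetical):

```julia
using DataFrames

df = DataFrame(g=[1, 2, 1, 2], x=[1, 2, 3, 4])
gd = groupby(df, :g)

# Replace the columns of the parent data frame with the grouping key
# and the per-group sum of `x`, repeated for every row of each group.
select!(gd, :x => sum => :xsum)

df   # `parent(gd)`, i.e. `df` itself, now holds the columns `g` and `xsum`
```

Because the parent is updated in place, any other reference to df sees the new columns as well.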
Base.show — Function
show([io::IO,] df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
splitcols::Bool = get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true,
eltypes::Bool = true)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false while splitcols defaults to true.
Arguments
- io::IO: The I/O stream to which df will be printed.
- df::AbstractDataFrame: The data frame to print.
- allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
- allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
- allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
- splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
- rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
- summary::Bool = true: Whether to print a brief string summary of the data frame.
- eltypes::Bool = true: Whether to print the column types under column names.
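The keyword arguments can also be used when rendering into a string instead of stdout. The following is a sketch only: the label :Idx and the variable name text are arbitrary choices, and the exact layout of the output depends on the installed version.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# `sprint` supplies the `io` argument; the keyword arguments are passed on to `show`.
# `summary=false` drops the size line, `rowlabel=:Idx` renames the row-number column.
text = sprint((io, d) -> show(io, d; summary=false, rowlabel=:Idx), df)

print(text)
```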
Examples
julia> using DataFrames
julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);
julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ x │
│ 2 │ 2 │ y │
│ 3 │ 3 │ z │
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime.
Arguments
- io::IO: The I/O stream to which df will be printed.
- mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values" (the last two MIME types do not support showing #undef values)
- df::AbstractDataFrame: The data frame to print.
Additionally selected MIME types support passing the following keyword arguments:
- MIME type "text/plain" accepts all listed keyword arguments and their behavior is identical to that of show(::IO, ::AbstractDataFrame)
- MIME type "text/html" accepts the summary keyword argument, which controls whether a brief string summary of the data frame is printed.
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
Base.size — Function
size(df::AbstractDataFrame, [dim])
Return a tuple containing the number of rows and columns of df. Optionally a dimension dim can be specified, where 1 corresponds to rows and 2 corresponds to columns.
Examples
julia> df = DataFrame(a=1:3, b='a':'c');
julia> size(df)
(3, 2)
julia> size(df, 1)
3
size(dfr::DataFrameRow, [dim])
Return a 1-tuple containing the number of elements of dfr. If an optional dimension dim is specified, it must be 1, and the number of elements is returned directly as a number.
See also: length
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];
julia> size(dfr)
(2,)
julia> size(dfr, 1)
2
Base.sort — Function
sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a copy of data frame df sorted by column(s) cols.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, [:x, :y])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
Base.sort! — Function
sort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Sort data frame df by column(s) cols.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, [:x, :y])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 2 │ a │
│ 3 │ 1 │ c │
│ 4 │ 1 │ b │
julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ c │
│ 2 │ 1 │ b │
│ 3 │ 2 │ a │
│ 4 │ 3 │ b │
DataFrames.transform — Function
transform(df::AbstractDataFrame, args...; copycols::Bool=true)
Create a new data frame that contains the columns of df together with the columns specified by args, and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select(df, :, args..., copycols=copycols).
See select for detailed rules regarding accepted values for args.
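The equivalence with select can be checked on a small made-up data frame. This is a sketch; the column name c and the doubling function are arbitrary.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# Keep every column of `df` and append a new column `c` computed row by row.
out = transform(df, :a => ByRow(x -> 2x) => :c)

# The same request spelled with `select` and an explicit `:` selector.
same = select(df, :, :a => ByRow(x -> 2x) => :c)

out == same   # the two spellings should agree; `df` itself is left unchanged
```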
transform(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)
An equivalent of select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup).
DataFrames.transform! — Function
transform!(df::DataFrame, args...)
Mutate df in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...).
See select! for detailed rules regarding accepted values for args.
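A short sketch of the in-place variant, again on a made-up data frame; the column name total is an arbitrary choice.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# Add the column `total` to `df` itself; the returned object is `df`.
result = transform!(df, [:a, :b] => ((a, b) -> a .+ b) => :total)

df.total   # element-wise sums of `a` and `b`
```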
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)
An equivalent of transform(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup) but updates parent(gd) in place.
Base.unique! — Function
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
When unique is called a new data frame is returned; unique! updates df in-place.
See also nonunique.
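The difference between the copying and in-place forms can be seen on a small deterministic data frame. This is a sketch; the columns k and v are made up.

```julia
using DataFrames

df = DataFrame(k=[1, 1, 2, 2], v=["a", "a", "b", "c"])

u_all = unique(df)       # drops the repeated (1, "a") row; `df` is unchanged
u_k   = unique(df, :k)   # keeps the first row for each distinct value of `k`

unique!(df)              # same rows as `u_all`, but `df` itself is modified
```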
Arguments
- df: the AbstractDataFrame
- cols: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
Base.vcat — Function
vcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal)
Vertically concatenate AbstractDataFrames.
The cols keyword argument determines the columns of the returned data frame:
- :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
- :orderequal: require all data frames to have the same column names and in the same order.
- :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
- :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
- A vector of Symbols or strings: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type, as with vcat for AbstractVectors.
vcat ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.
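The loop pattern mentioned above might look like the following sketch. The helper collect_chunks and the column names chunk and value are hypothetical and exist only for this illustration.

```julia
using DataFrames

# Hypothetical helper: start from an empty data frame and grow it chunk by chunk.
function collect_chunks(n)
    acc = DataFrame()
    for chunk in 1:n
        # Each made-up chunk contributes two rows.
        part = DataFrame(chunk = fill(chunk, 2), value = [10chunk, 10chunk + 1])
        acc = vcat(acc, part)   # the empty `acc` is ignored on the first pass
    end
    return acc
end

acc = collect_chunks(3)
size(acc)
```

For many chunks, collecting the pieces in a vector and concatenating once at the end is likely to allocate less than repeated vcat calls.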
Example
julia> df1 = DataFrame(A=1:3, B=1:3);
julia> df2 = DataFrame(A=4:6, B=4:6);
julia> df3 = DataFrame(A=7:9, C=7:9);
julia> d4 = DataFrame();
julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │ missing │
│ 3 │ 3 │ 3 │ missing │
│ 4 │ 7 │ missing │ 7 │
│ 5 │ 8 │ missing │ 8 │
│ 6 │ 9 │ missing │ 9 │
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 7 │
│ 5 │ 8 │
│ 6 │ 9 │
julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
Unsorted
Base.first — Function
first(df::AbstractDataFrame)
Get the first row of df as a DataFrameRow.
first(df::AbstractDataFrame, n::Integer)
Get a data frame with the first n rows of df.
Base.last — Function
last(df::AbstractDataFrame)
Get the last row of df as a DataFrameRow.
last(df::AbstractDataFrame, n::Integer)
Get a data frame with the last n rows of df.
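A brief sketch contrasting the single-row and n-row methods, using a made-up five-row data frame:

```julia
using DataFrames

df = DataFrame(a=1:5, b='a':'e')

r     = first(df)      # a DataFrameRow for row 1
top2  = first(df, 2)   # a data frame with rows 1 and 2
tail2 = last(df, 2)    # a data frame with rows 4 and 5

(r.a, top2.a, tail2.a)
```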
Base.unique — Function
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
When unique is called a new data frame is returned; unique! updates df in-place.
See also nonunique.
Arguments
- df: the AbstractDataFrame
- cols: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.
Examples
df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df) # doesn't modify df
unique(df, 1)
unique!(df) # modifies df
Base.propertynames — Function
propertynames(df::AbstractDataFrame)
Return a freshly allocated Vector{Symbol} of names of columns contained in df.
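Because the returned vector is freshly allocated, changing it should not affect the data frame. A minimal sketch with made-up column names:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

cols = propertynames(df)   # the column names as Symbols

# `cols` is a separate vector, so growing it leaves `df` untouched.
push!(cols, :c)

propertynames(df)          # still only the names of the two original columns
```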
Base.similar — Function
similar(df::AbstractDataFrame, rows::Integer=nrow(df))
Create a new DataFrame with the same column names and column element types as df. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df.
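A small sketch of requesting a different number of rows. The data frame is made up, and the example inspects only the shape and element types, on the assumption that the cell contents of the new data frame should not be relied on, as with similar for vectors.

```julia
using DataFrames

df = DataFrame(a=1:3, b=["x", "y", "z"])

# Same column names and element types as `df`, but only two rows.
s = similar(df, 2)

(size(s), eltype(s.a), eltype(s.b))
```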
Base.sortperm — Function
sortperm(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a permutation vector of row indices of data frame df that puts them in sorted order according to column(s) cols.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 2 │ a │
│ 4 │ 1 │ b │
julia> sortperm(df, :x)
4-element Array{Int64,1}:
2
4
3
1
julia> sortperm(df, (:x, :y))
4-element Array{Int64,1}:
4
2
3
1
julia> sortperm(df, (:x, :y), rev=true)
4-element Array{Int64,1}:
1
3
2
4
julia> sortperm(df, (:x, order(:y, rev=true)))
4-element Array{Int64,1}:
2
4
3
1
Base.pairs — Function
pairs(dfc::DataFrameColumns)
Return an iterator of pairs associating the name of each column of dfc with the corresponding column vector, i.e. name => col where name is the column name of the column col.
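A minimal sketch of iterating over name => col pairs. It assumes the DataFrameColumns object is obtained with eachcol(df); the dictionary col_sums is a made-up name.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# Build a lookup from column name to column sum by walking the pairs.
col_sums = Dict(name => sum(col) for (name, col) in pairs(eachcol(df)))
```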
Base.parent — Function
parent(gd::GroupedDataFrame)
Return the parent data frame of gd.
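As a quick sketch with a made-up data frame, the parent of a grouped data frame created by groupby is the data frame that was grouped:

```julia
using DataFrames

df = DataFrame(g=[1, 2, 1], x=[10, 20, 30])
gd = groupby(df, :g)

parent(gd) === df   # the grouped data frame refers back to the same object
```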
Base.issorted — Function
issorted(df::AbstractDataFrame, cols;
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)
Test whether data frame df is sorted by column(s) cols.
cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.