Functions
Base.append!, Base.copy, Base.filter, Base.filter!, Base.hcat, Base.join, Base.map, Base.push!, Base.repeat, Base.show, Base.sort, Base.sort!, Base.unique!, Base.vcat, DataFrames.DataFrame!, DataFrames.aggregate, DataFrames.allowmissing!, DataFrames.by, DataFrames.categorical!, DataFrames.combine, DataFrames.completecases, DataFrames.deletecols, DataFrames.deletecols!, DataFrames.deleterows!, DataFrames.disallowmissing!, DataFrames.dropmissing, DataFrames.dropmissing!, DataFrames.eachcol, DataFrames.eachrow, DataFrames.eltypes, DataFrames.groupby, DataFrames.groupindices, DataFrames.groupvars, DataFrames.insertcols!, DataFrames.mapcols, DataFrames.melt, DataFrames.meltdf, DataFrames.names!, DataFrames.ncol, DataFrames.nonunique, DataFrames.nrow, DataFrames.permutecols!, DataFrames.rename, DataFrames.rename!, DataFrames.select, DataFrames.select!, DataFrames.stack, DataFrames.stackdf, DataFrames.unstack, StatsBase.describe
Grouping, Joining, and Split-Apply-Combine
DataFrames.aggregate — Function.
Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame.
aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)
Arguments
- d: an AbstractDataFrame
- gd: a GroupedDataFrame
- cols: a column indicator (Symbol, Int, Vector{Symbol}, etc.)
- fs: a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
Each fs should return a value or vector. All returns must be the same length.
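As a rough illustration of how fs is applied to every non-grouping column within every group, here is a plain-Julia sketch that uses only Base. It is not the DataFrames implementation: the `columns`, `fs` and `result` names are made up for this example, and the `col_fname` naming simply mirrors the column names seen in the outputs of the examples in this section.

```julia
# Plain-Julia sketch of the function-by-column cross product; illustrative only.
a = repeat([1, 2, 3, 4], outer=2)   # grouping column
b = repeat([2, 1], outer=4)
c = collect(1:8)

columns = [:b => b, :c => c]        # non-grouping columns (hypothetical container)
fs = [sum, maximum]                 # functions to apply within each group

result = Dict{Tuple{Int,Symbol},Int}()
for key in unique(a)
    rows = findall(==(key), a)                  # split: row indices of this group
    for f in fs, (name, col) in columns         # every function with every column
        outname = Symbol(name, :_, nameof(f))   # e.g. :b_sum
        result[(key, outname)] = f(col[rows])   # apply to the group's slice
    end
end

println(result[(1, :b_sum)], " ", result[(1, :c_sum)])
```

Each (group, output column) pair ends up with exactly one value, which is why all functions must return results of the same length.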
Returns
::DataFrame
Examples
julia> using Statistics
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> aggregate(df, :a, sum)
4×3 DataFrame
│ Row │ a │ b_sum │ c_sum │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 4 │ 6 │
│ 2 │ 2 │ 2 │ 8 │
│ 3 │ 3 │ 4 │ 10 │
│ 4 │ 4 │ 2 │ 12 │
julia> aggregate(df, :a, [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a │ b_sum │ c_sum │ b_function │ c_function │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1 │ 1 │ 4 │ 6 │ 2.0 │ 3.0 │
│ 2 │ 2 │ 2 │ 8 │ 1.0 │ 4.0 │
│ 3 │ 3 │ 4 │ 10 │ 2.0 │ 5.0 │
│ 4 │ 4 │ 2 │ 12 │ 1.0 │ 6.0 │
julia> aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a │ b_sum │ c_sum │ b_function │ c_function │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1 │ 1 │ 4 │ 6 │ 2.0 │ 3.0 │
│ 2 │ 2 │ 2 │ 8 │ 1.0 │ 4.0 │
│ 3 │ 3 │ 4 │ 10 │ 2.0 │ 5.0 │
│ 4 │ 4 │ 2 │ 12 │ 1.0 │ 6.0 │
DataFrames.by — Function.
by(d::AbstractDataFrame, keys, cols => f...; sort::Bool = false)
by(d::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false)
by(d::AbstractDataFrame, keys, f; sort::Bool = false)
by(f, d::AbstractDataFrame, keys; sort::Bool = false)
Split-apply-combine in one step: apply f to each grouping in d based on grouping columns keys, and return a DataFrame.
keys can be either a single column index, or a vector thereof.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted on keys if sort=true. Otherwise, ordering of rows is undefined.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
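To give a feel for the accuracy difference mentioned above, the following sketch compares Base's sum with a one-value-at-a-time accumulation in Float32. The `running_sum` helper is hypothetical and written only for this comparison; the text does not say how the optimized methods accumulate, so treat this as an illustration of summation error in general, not as a description of the DataFrames internals.

```julia
# Hypothetical helper: accumulate strictly left to right in the element type.
function running_sum(xs)
    acc = zero(eltype(xs))
    for x in xs
        acc += x
    end
    return acc
end

xs = fill(0.1f0, 10^7)            # ten million Float32 values
reference = sum(Float64, xs)      # accumulate in Float64 as a reference

err_running = abs(running_sum(xs) - reference)
err_base    = abs(sum(xs) - reference)   # Base sum on the Float32 array

println("running accumulation error: ", err_running)
println("Base sum error:             ", err_base)
```

On long floating point columns the gap can be large, which is the case where wrapping the call as col => x -> sum(x) is worth the extra time.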
by(d, cols, f) is equivalent to combine(f, groupby(d, cols)) and to the less efficient combine(map(f, groupby(d, cols))).
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a │ c_function │
│ │ Int64 │ Int64 │
├─────┼───────┼────────────┤
│ 1 │ 1 │ 2 │
│ 2 │ 1 │ 10 │
│ 3 │ 2 │ 4 │
│ 4 │ 2 │ 12 │
│ 5 │ 3 │ 6 │
│ 6 │ 3 │ 14 │
│ 7 │ 4 │ 8 │
│ 8 │ 4 │ 16 │
julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a │ c_sum │ c_sum2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼────────┤
│ 1 │ 1 │ 6 │ 26 │
│ 2 │ 2 │ 8 │ 40 │
│ 3 │ 3 │ 10 │ 58 │
│ 4 │ 4 │ 12 │ 80 │
julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a │ minb │ sumc │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 6 │
│ 2 │ 2 │ 1 │ 8 │
│ 3 │ 3 │ 2 │ 10 │
│ 4 │ 4 │ 1 │ 12 │
DataFrames.combine — Function.
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)
Transform a GroupedDataFrame into a DataFrame.
If the last argument(s) consist(s) of one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted if sort=true was passed to the groupby call from which gd was constructed. Otherwise, ordering of rows is undefined.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
│ 2 │ 2 │ 8 │
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │
See by for more examples.
See also
by(f, df, cols) is a shorthand for combine(f, groupby(df, cols)).
map: combine(f, groupby(df, cols)) is a more efficient equivalent of combine(map(f, groupby(df, cols))).
DataFrames.groupby — Function.
A view of an AbstractDataFrame split into row groups
groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)
Arguments
- d: an AbstractDataFrame to split (optional, see Returns)
- cols: data table columns to group by
- sort: whether to sort rows according to the values of the grouping columns cols
- skipmissing: whether to skip rows with missing values in one of the grouping columns cols
Returns
A GroupedDataFrame: a grouped view into d
Details
An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into d. A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).
See the following for additional split-apply-combine operations:
- by: split-apply-combine using functions
- aggregate: split-apply-combine; applies functions in the form of a cross product
- map: apply a function to each group of a GroupedDataFrame (without combining)
- combine: combine a GroupedDataFrame, optionally applying a function to each group
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
⋮
Last Group (2 rows): a = 4
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> gd[1]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
julia> last(gd)
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
julia> for g in gd
println(g)
end
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 1 │
│ 2 │ 1 │ 2 │ 5 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 2 │
│ 2 │ 2 │ 1 │ 6 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 3 │
│ 2 │ 3 │ 2 │ 7 │
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 4 │ 1 │ 4 │
│ 2 │ 4 │ 1 │ 8 │
DataFrames.groupindices — Function.
groupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd).
Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
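The shape of such a result can be sketched in plain Julia. The snippet below assumes a single grouping column with rows containing missing skipped, and numbers groups in order of first appearance. That numbering is an assumption made for the example (the text states that row ordering is undefined unless sorting is requested), and `keys_col`, `seen` and `idx` are illustrative names, not part of the package.

```julia
# Sketch: one group index per parent row, with missing for rows in no group.
keys_col = [1, missing, 2, 1]      # grouping column of the parent table

seen = Dict{Int,Int}()             # key value => group number
idx = Union{Int,Missing}[
    ismissing(k) ? missing : get!(seen, k, length(seen) + 1)
    for k in keys_col
]

println(idx)
```

Rows 1 and 4 share a key and therefore a group index, while row 2 belongs to no group.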
DataFrames.groupvars — Function.
groupvars(gd::GroupedDataFrame)
Return a vector of column names in parent(gd) used for grouping.
Base.join — Function.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
     indicator = nothing, validate = (false, false))
Join two DataFrame objects
Arguments
- df1, df2: the two AbstractDataFrames to be joined
Keyword Arguments
- on: A column, or vector of columns to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be (left, right) tuples or left => right pairs, or a vector of such tuples or pairs. on is a required argument for all joins except for kind = :cross
- kind: the type of join, options include:
  - :inner: only include rows with keys that match in both df1 and df2, the default
  - :outer: include all rows from df1 and df2
  - :left: include all rows from df1
  - :right: include all rows from df2
  - :semi: return rows of df1 that match with the keys in df2
  - :anti: return rows of df1 that do not match with the keys in df2
  - :cross: a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2
- makeunique: if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- indicator: Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.
- validate: whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data table will support missing values.
When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame.
Result
::DataFrame: the joined DataFrame
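One way to read the kind options is in terms of which key values survive the join. The plain-Julia sketch below works on the ID values used in the examples that follow and assumes keys are unique within each table; it does not call join itself, and `kept` and `ncross` are names invented for the illustration.

```julia
# Sketch: key values retained by each join kind, assuming unique keys per table.
left_ids  = [1, 2, 3]   # ID column of the left table
right_ids = [1, 2, 4]   # ID column of the right table

kept = Dict(
    :inner => intersect(left_ids, right_ids),  # keys present in both
    :outer => union(left_ids, right_ids),      # keys present in either
    :left  => left_ids,                        # all keys of the left table
    :right => right_ids,                       # all keys of the right table
    :semi  => intersect(left_ids, right_ids),  # left rows that have a match
    :anti  => setdiff(left_ids, right_ids),    # left rows without a match
)

# A cross join ignores keys: every left row pairs with every right row.
ncross = length(left_ids) * length(right_ids)

println(kept[:anti], " ", ncross)
```

The :inner and :semi kinds keep the same keys; they differ in that :semi returns only rows of df1, as the descriptions above state.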
Examples
name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)
job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)
Base.map — Function.
map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)
Apply a function to each group of rows and return a GroupedDataFrame.
If the first argument is a cols => f pair, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.
If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.
If the first argument is a callable, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability.
f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:
- A single value gives a data frame with a single column and one row per group.
- A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
- A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
- A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.
f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.
As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.
In all cases, the resulting GroupedDataFrame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on.
Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ c_sum │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 6 │
⋮
Last Group: 1 row
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 4 │ 12 │
See by for more examples.
See also
combine(f, gd) returns a DataFrame rather than a GroupedDataFrame
DataFrames.melt — Function.
Stacks a DataFrame; convert from a wide to long format; see stack.
DataFrames.stack — Function.
Stacks a DataFrame; convert from a wide to long format
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
- df: the AbstractDataFrame to be stacked
- measure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars. If neither measure_vars nor id_vars is given, measure_vars defaults to all floating point columns.
- id_vars: the identifier columns that are repeated during stacking, a normal column indexing type; for stack, defaults to all variables that are not measure_vars
- variable_name: the name of the new stacked column that shall hold the names of each of measure_vars
- value_name: the name of the new stacked column containing the values from each of measure_vars
Result
- ::DataFrame: the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.
See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.
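The wide-to-long reshaping described above, and its inverse, can be sketched without the package. The snippet below is a plain-Julia illustration on a small named tuple of columns; `wide`, `long` and `back` are made-up names, and the row order (one measured column after another) is an assumption of this sketch, not a documented guarantee.

```julia
# Sketch: stack two measure columns into (id, variable, value) rows.
wide = (id = [1, 2], c = [0.5, 1.5], d = [10.0, 20.0])   # columns as a NamedTuple
measure_vars = [:c, :d]

long = [
    (id = wide.id[i], variable = v, value = getproperty(wide, v)[i])
    for v in measure_vars for i in eachindex(wide.id)
]

# Sketch of the reverse direction: one entry per id, keyed by variable name.
back = Dict(
    id => Dict(r.variable => r.value for r in long if r.id == id)
    for id in unique(wide.id)
)

println(length(long), " ", back[2][:d])
```

Each identifier value is repeated once per measured column in the long form, which is what the description of id_vars refers to.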
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)
DataFrames.unstack — Function.
Unstacks a DataFrame; convert from a long to wide format
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)
Arguments
- df: the AbstractDataFrame to be unstacked
- rowkeys: the column(s) with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value
- colkey: the column holding the column names in wide format, defaults to :variable
- value: the value column, defaults to :value
Result
::DataFrame: the wide-format DataFrame
If colkey contains missing values, they will be skipped and a warning will be printed.
If the combination of rowkeys and colkey contains duplicate entries, the last value will be retained and a warning will be printed.
Examples
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
Note that there are some differences between the widened results above.
DataFrames.stackdf — Function.
A stacked view of a DataFrame (long format)
Like stack and melt, but a view is returned rather than data copies.
stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
Arguments
- df: the wide AbstractDataFrame
- measure_vars: the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars
- id_vars: the identifier columns that are repeated during stacking, a normal column indexing type; for stack, defaults to all variables that are not measure_vars
Result
- ::DataFrame: the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.
The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
Examples
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])
DataFrames.meltdf — Function.
A stacked view of a DataFrame (long format); see stackdf
Basics
allowmissing!
categorical!
completecases
copy
DataFrame
DataFrame!
deletecols!
deletecols
deleterows!
describe
disallowmissing!
dropmissing
dropmissing!
eachrow
eachcol
eltypes
filter
filter!
hcat
insertcols!
mapcols
names!
nonunique
nrow
ncol
rename!
rename
repeat
select
select!
show
sort
sort!
unique!
permutecols!
vcat
append!
push!