Functions

Functions

Grouping, Joining, and Split-Apply-Combine

DataFrames.aggregateFunction.
aggregate(df::AbstractDataFrame, fs)
aggregate(df::AbstractDataFrame, cols, fs; sort=false, skipmissing=false)
aggregate(gd::GroupedDataFrame, fs; sort=false)

Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame. Return an aggregated data frame.

Arguments

  • df : an AbstractDataFrame
  • gd : a GroupedDataFrame
  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.)
  • fs : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector
  • sort : whether to sort rows according to the values of the grouping columns
  • skipmissing : whether to skip rows with missing values in one of the grouping columns cols

Each fs should return a value or vector. All returns must be the same length.

Examples

julia> using Statistics

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> aggregate(df, :a, sum)
4×3 DataFrame
│ Row │ a     │ b_sum │ c_sum │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 6     │
│ 2   │ 2     │ 2     │ 8     │
│ 3   │ 3     │ 4     │ 10    │
│ 4   │ 4     │ 2     │ 12    │

julia> aggregate(df, :a, [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a     │ b_sum │ c_sum │ b_function │ c_function │
│     │ Int64 │ Int64 │ Int64 │ Float64    │ Float64    │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1   │ 1     │ 4     │ 6     │ 2.0        │ 3.0        │
│ 2   │ 2     │ 2     │ 8     │ 1.0        │ 4.0        │
│ 3   │ 3     │ 4     │ 10    │ 2.0        │ 5.0        │
│ 4   │ 4     │ 2     │ 12    │ 1.0        │ 6.0        │

julia> aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
4×5 DataFrame
│ Row │ a     │ b_sum │ c_sum │ b_function │ c_function │
│     │ Int64 │ Int64 │ Int64 │ Float64    │ Float64    │
├─────┼───────┼───────┼───────┼────────────┼────────────┤
│ 1   │ 1     │ 4     │ 6     │ 2.0        │ 3.0        │
│ 2   │ 2     │ 2     │ 8     │ 1.0        │ 4.0        │
│ 3   │ 3     │ 4     │ 10    │ 2.0        │ 5.0        │
│ 4   │ 4     │ 2     │ 12    │ 1.0        │ 6.0        │
source
DataFrames.byFunction.
by(df::AbstractDataFrame, keys, cols=>f...;
   sort::Bool=false, skipmissing::Bool=false)
by(df::AbstractDataFrame, keys; (colname = cols => f)...,
   sort::Bool=false, skipmissing::Bool=false)
by(df::AbstractDataFrame, keys, f;
   sort::Bool=false, skipmissing::Bool=false)
by(f, df::AbstractDataFrame, keys;
   sort::Bool=false, skipmissing::Bool=false)

Split-apply-combine in one step: apply f to each grouping in df based on grouping columns keys, and return a DataFrame.

keys can be either a single column index, or a vector thereof.

If the last argument(s) consist(s) in one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input colummn name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted on keys if sort=true. Otherwise, ordering of rows is undefined. If skipmissing=true then the resulting data frame will not contain groups with missing values in one of the keys columns.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g.col => sum). When computing thesumormeanover floating point columns, results will be less accurate than the standard [sum](@ref) function (which uses pairwise summation). Usecol => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

by(d, cols, f) is equivalent to combine(f, groupby(d, cols)) and to the less efficient combine(map(f, groupby(d, cols))).

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a) do d # do syntax for the slower variant
           sum(d.c)
       end
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a     │ c_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 2          │
│ 2   │ 1     │ 10         │
│ 3   │ 2     │ 4          │
│ 4   │ 2     │ 12         │
│ 5   │ 3     │ 6          │
│ 6   │ 3     │ 14         │
│ 7   │ 4     │ 8          │
│ 8   │ 4     │ 16         │

julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a     │ c_sum │ c_sum2 │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 6     │ 26     │
│ 2   │ 2     │ 8     │ 40     │
│ 3   │ 3     │ 10    │ 58     │
│ 4   │ 4     │ 12    │ 80     │

julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a     │ minb  │ sumc  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 6     │
│ 2   │ 2     │ 1     │ 8     │
│ 3   │ 3     │ 2     │ 10    │
│ 4   │ 4     │ 1     │ 12    │
source
DataFrames.combineFunction.
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)

Transform a GroupedDataFrame into a DataFrame.

If the last argument(s) consist(s) in one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted if sort=true was passed to the groupby call from which gd was constructed. Otherwise, ordering of rows is undefined.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.

See also:

  • by(f, df, cols) is a shorthand for combine(f, groupby(df, cols)).
  • map: combine(f, groupby(df, cols)) is a more efficient equivalent

of combine(map(f, groupby(df, cols))).

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

See by for more examples.

source
DataFrames.groupbyFunction.
groupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false)

Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.

Arguments

  • df : an AbstractDataFrame to split
  • cols : data frame columns to group by
  • sort : whether to sort rows according to the values of the grouping columns cols
  • skipmissing : whether to skip rows with missing values in one of the grouping columns cols

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.

cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.

A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).

See the following for additional split-apply-combine operations:

  • by : split-apply-combine using functions
  • aggregate : split-apply-combine; applies functions in the form of a cross product
  • map : apply a function to each group of a GroupedDataFrame (without combining)
  • combine : combine a GroupedDataFrame, optionally applying a function to each group

GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices, but this will be slower than using the equivalent GroupKey.

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │
⋮
Last Group (2 rows): a = 4
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │

julia> gd[1]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │

julia> last(gd)
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │

julia> gd[(a=3,)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │

julia> gd[(3,)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │

julia> k = first(keys(gd))
GroupKey: (a = 3)

julia> gd[k]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │

julia> for g in gd
           println(g)
       end
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 2     │
│ 2   │ 2     │ 1     │ 6     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │
source
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of parent(gd).

Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).

source
DataFrames.groupvarsFunction.
groupvars(gd::GroupedDataFrame)

Return a vector of column names in parent(gd) used for grouping.

source
Base.keysFunction.
keys(gd::GroupedDataFrame)

Get the set of keys for each group of the GroupedDataFramegd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple and NamedTuple, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
                      b = repeat([2, 1], outer=[6]),
                      c = 1:12);

julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 2     │ 7     │
⋮
Last Group (2 rows): a = :baz, b = 1
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 1     │ 6     │
│ 2   │ baz    │ 1     │ 12    │

julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (a = :foo, b = 2)
 GroupKey: (a = :bar, b = 1)
 GroupKey: (a = :baz, b = 2)
 GroupKey: (a = :foo, b = 1)
 GroupKey: (a = :bar, b = 2)
 GroupKey: (a = :baz, b = 1)

GroupKey objects behave similarly to NamedTuples:

julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)

julia> keys(k)
(:a, :b)

julia> values(k)  # Same as Tuple(k)
(:foo, 2)

julia> NamedTuple(k)
(a = :foo, b = 2)

julia> k.a
:foo

julia> k[:a]
:foo

julia> k[1]
:foo

Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:

julia> gd[k]
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 2     │ 7     │

julia> gd[keys(gd)[1]] == gd[1]
true
source
Base.getFunction.
get(gd::GroupedDataFrame, key, default)

Get a group based on the values of the grouping columns.

key may be a NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby).

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
                      b = repeat([2, 1], outer=[3]),
                      c = 1:6);

julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 1     │ 4     │
⋮
Last Group (2 rows): a = :baz
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 2     │ 3     │
│ 2   │ baz    │ 1     │ 6     │

julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ bar    │ 1     │ 2     │
│ 2   │ bar    │ 2     │ 5     │

julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 2     │ 3     │
│ 2   │ baz    │ 1     │ 6     │

julia> get(gd, (:qux,), nothing)
source
Base.joinFunction.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
     indicator = nothing, validate = (false, false))
join(df1, df2, dfs...; on = Symbol[], kind = :inner, makeunique = false,
     validate = (false, false))

Join two or more DataFrame objects and return a DataFrame containing the result.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument for all joins except for kind = :cross.

  • kind : the type of join, options include:

    • :inner : only include rows with keys that match in both df1 and df2, the default
    • :outer : include all rows from df1 and df2
    • :left : include all rows from df1
    • :right : include all rows from df2
    • :semi : return rows of df1 that match with the keys in df2
    • :anti : return rows of df1 that do not match with the keys in df2
    • :cross : a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2

    When joining more than two data frames only :inner, :outer and :cross joins are allowed.

  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.

  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.

For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data table will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported.

Examples

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = :ID => :identifier)
join(name, job2, on = [:ID => :identifier])
source
Base.mapFunction.
map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)

Apply a function to each group of rows and return a GroupedDataFrame.

If the first argument is a cols => f pair, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.

If the first argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. If the returned data frame contains columns with the same names as the grouping columns, they are required to be equal. Note that this second form is much slower than the first one due to type instability.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

f must always return the same kind of object (as defined in the above list) for all groups, and if a named tuple or data frame, with the same fields or columns. Named tuples cannot mix single values and vectors. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting GroupedDataFrame contains all the grouping columns in addition to those generated by the application of f. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g. col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.

See also combine(f, gd) that returns a DataFrame rather than a GroupedDataFrame.

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │

julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │

See by for more examples.

source
DataFrames.stackFunction.
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
      variable_name::Symbol=:variable, value_name::Symbol=:value, view::Bool=false)

Stack a data frame df, i.e. convert it from wide to long format.

Return the long-format DataFrame with column variable_name (:value by default) holding the values of the stacked columns (measure_vars), with column variable_name (:variable by default) a vector of Symbols holding the name of the corresponding measure_vars variable, and with columns for each of the id_vars.

If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.

Arguments

  • df : the AbstractDataFrame to be stacked
  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.
  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; defaults to all variables that are not measure_vars
  • variable_name : the name of the new stacked column that shall hold the names of each of measure_vars
  • value_name : the name of the new stacked column containing the values from each of measure_vars
  • view : whether the stacked data frame should be a view rather than contain freshly allocated vectors.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
source
DataFrames.unstackFunction.
unstack(df::AbstractDataFrame, rowkeys::Union{Integer, Symbol},
        colkey::Union{Integer, Symbol}, value::Union{Integer, Symbol};
        renamecols::Function=identity)
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Integer, Symbol}},
        colkey::Union{Integer, Symbol}, value::Union{Integer, Symbol};
        renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey::Union{Integer, Symbol},
        value::Union{Integer, Symbol}; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)

Unstack data frame df, i.e. convert it from long to wide format.

If colkey contains missing values then they will be skipped and a warning will be printed.

If combination of rowkeys and colkey contains duplicate entries then last value will be retained and a warning will be printed.

Arguments

  • df : the AbstractDataFrame to be unstacked
  • rowkeys : the column(s) with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value
  • colkey : the column holding the column names in wide format, defaults to :variable
  • value : the value column, defaults to :value
  • renamecols : a function called on each unique value in colkey which must return the name of the column to be created (typically as a string or a Symbol). Duplicate names are not allowed.

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))

Note that there are some differences between the widened results above.

source

Basics

Missings.allowmissingFunction.
allowmissing(df::AbstractDataFrame,
             cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:)

Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.

If cols is omitted all columns in the data frame are converted.

Examples

julia> df = DataFrame(a=[1,2])
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> allowmissing(df)
2×1 DataFrame
│ Row │ a      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 2      │
source
allowmissing!(df::DataFrame, cols::Colon=:)
allowmissing!(df::DataFrame, cols::Union{Integer, Symbol})
allowmissing!(df::DataFrame, cols::Union{AbstractVector, Regex, Not, Between, All})

Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.

If cols is omitted all columns in the data frame are converted.

source
categorical(df::AbstractDataFrame, cols::Type=Union{AbstractString, Missing};
            compress::Bool=false)
categorical(df::AbstractDataFrame,
            cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon};
            compress::Bool=false)

Return a copy of data frame df with columns cols converted to CategoricalVector. If categorical is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.

If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.

All created CategoricalVectors are unordered.

Examples

julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │

julia> categorical(df)
2×2 DataFrame
│ Row │ a     │ b            │
│     │ Int64 │ Categorical… │
├─────┼───────┼──────────────┤
│ 1   │ 1     │ a            │
│ 2   │ 2     │ b            │

julia> categorical(df, :)
2×2 DataFrame
│ Row │ a            │ b            │
│     │ Categorical… │ Categorical… │
├─────┼──────────────┼──────────────┤
│ 1   │ 1            │ a            │
│ 2   │ 2            │ b            │
source
categorical!(df::DataFrame, cols::Type=Union{AbstractString, Missing};
             compress::Bool=false)
categorical!(df::DataFrame, cname::Union{Integer, Symbol};
             compress::Bool=false)
categorical!(df::DataFrame, cnames::Vector{<:Union{Integer, Symbol}};
             compress::Bool=false)
categorical!(df::DataFrame, cnames::Union{Regex, Not, Between, All};
             compress::Bool=false)

Change columns selected by cname or cnames in data frame df to CategoricalVector.

If categorical! is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.

If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.

All created CategoricalVectors are unordered.

Examples

julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X      │ Y     │ Z      │
│     │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1   │ a      │ 1     │ p      │
│ 2   │ b      │ 2     │ q      │

julia> categorical!(df)
2×3 DataFrame
│ Row │ X            │ Y     │ Z            │
│     │ Categorical… │ Int64 │ Categorical… │
├─────┼──────────────┼───────┼──────────────┤
│ 1   │ a            │ 1     │ p            │
│ 2   │ b            │ 2     │ q            │

julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
 CategoricalString{UInt32}
 Int64
 CategoricalString{UInt32}

julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X      │ Y     │ Z      │
│     │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1   │ a      │ 1     │ p      │
│ 2   │ b      │ 2     │ q      │

julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X      │ Y            │ Z      │
│     │ String │ Categorical… │ String │
├─────┼────────┼──────────────┼────────┤
│ 1   │ a      │ 1            │ p      │
│ 2   │ b      │ 2            │ q      │

julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
 String
 CategoricalValue{Int64,UInt8}
 String
source
completecases(df::AbstractDataFrame, cols::Colon=:)
completecases(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All})
completecases(df::AbstractDataFrame, cols::Union{Integer, Symbol})

Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df. If cols is provided, only missing values in the corresponding columns are considered.

See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> completecases(df)
5-element BitArray{1}:
 false
 false
 false
  true
  true

julia> completecases(df, :x)
5-element BitArray{1}:
 false
  true
 false
  true
  true

julia> completecases(df, [:x, :y])
5-element BitArray{1}:
 false
 false
 false
  true
  true
source
Base.copyFunction.
copy(df::DataFrame; copycols::Bool=true)

Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.

source
copy(dfr::DataFrameRow)

Convert a DataFrameRow to a NamedTuple.

source
DataFrames.DataFrame!Function.
DataFrame!(args...; kwargs...)

Equivalent to DataFrame(args...; copycols=false, kwargs...).

If kwargs contains the copycols keyword argument an error is thrown.

Examples

julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame!(df1)

julia> df1.a === df2.a
true
source
deleterows!(df::DataFrame, inds)

Delete rows specified by inds from a DataFramedf in place and return it.

Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector or an integer.

Examples

julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> deleterows!(d, 2)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 3     │ 6     │
source
DataAPI.describeFunction.
describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair{<:Symbol}}...; cols=:)

Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.

Arguments

  • df : the AbstractDataFrame
  • stats::Union{Symbol, Pair{<:Symbol}}... : the summary statistics to report. Arguments can be:
    • A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nunique, :nmissing, and :eltype.
    • :all as the only Symbol argument to return all statistics.
    • A name => function pair where name is a Symbol. This will create a column of summary statistics with the provided name.
  • cols : a keyword argument allowing to select only a subset of columns from df to describe; all standard column selection methods are allowed.

Details

For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.

When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings.

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.

If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.

Examples

julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')
10×3 DataFrame
│ Row │ i     │ x       │ y    │
│     │ Int64 │ Float64 │ Char │
├─────┼───────┼─────────┼──────┤
│ 1   │ 1     │ 0.1     │ 'a'  │
│ 2   │ 2     │ 0.2     │ 'b'  │
│ 3   │ 3     │ 0.3     │ 'c'  │
│ 4   │ 4     │ 0.4     │ 'd'  │
│ 5   │ 5     │ 0.5     │ 'e'  │
│ 6   │ 6     │ 0.6     │ 'f'  │
│ 7   │ 7     │ 0.7     │ 'g'  │
│ 8   │ 8     │ 0.8     │ 'h'  │
│ 9   │ 9     │ 0.9     │ 'i'  │
│ 10  │ 10    │ 1.0     │ 'j'  │

julia> describe(df)
3×8 DataFrame
│ Row │ variable │ mean   │ min │ median │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any │ Union… │ Any │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼─────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1   │ i        │ 5.5    │ 1   │ 5.5    │ 10  │         │          │ Int64    │
│ 2   │ x        │ 0.55   │ 0.1 │ 0.55   │ 1.0 │         │          │ Float64  │
│ 3   │ y        │        │ 'a' │        │ 'j' │ 10      │          │ Char     │

julia> describe(df, :min, :max)
3×3 DataFrame
│ Row │ variable │ min │ max │
│     │ Symbol   │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1   │ i        │ 1   │ 10  │
│ 2   │ x        │ 0.1 │ 1.0 │
│ 3   │ y        │ 'a' │ 'j' │

julia> describe(df, :min, :sum => sum)
3×3 DataFrame
│ Row │ variable │ min │ sum │
│     │ Symbol   │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1   │ i        │ 1   │ 55  │
│ 2   │ x        │ 0.1 │ 5.5 │
│ 3   │ y        │ 'a' │     │

julia> describe(df, :min, :sum => sum, cols=:x)
1×3 DataFrame
│ Row │ variable │ min     │ sum     │
│     │ Symbol   │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┤
│ 1   │ x        │ 0.1     │ 5.5     │
source
disallowmissing(df::AbstractDataFrame,
                cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:;
                error::Bool=true)

Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

Examples

julia> df = DataFrame(a=Union{Int,Missing}[1,2])
2×1 DataFrame
│ Row │ a      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 2      │

julia> disallowmissing(df)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df = DataFrame(a=[1,missing]) 2×2 DataFrame │ Row │ a │ b │ │ │ Int64⍰ │ Int64⍰ │ ├─────┼─────────┼────────┤ │ 1 │ 1 │ 1 │ │ 2 │ missing │ 2 │

julia> disallowmissing(df, error=false) 2×2 DataFrame │ Row │ a │ b │ │ │ Int64⍰ │ Int64 │ ├─────┼─────────┼───────┤ │ 1 │ 1 │ 1 │ │ 2 │ missing │ 2 │

source
disallowmissing!(df::DataFrame, cols::Colon=:; error::Bool=true)
disallowmissing!(df::DataFrame, cols::Union{Integer, Symbol}; error::Bool=true)
disallowmissing!(df::DataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
                 error::Bool=true)

Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

source
dropmissing(df::AbstractDataFrame, cols::Colon=:; disallowmissing::Bool=true)
dropmissing(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
            disallowmissing::Bool=true)
dropmissing(df::AbstractDataFrame, cols::Union{Integer, Symbol};
            disallowmissing::Bool=true)

Return a copy of data frame df excluding rows with missing values. If cols is provided, only missing values in the corresponding columns are considered.

If disallowmissing is true (the default) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.

See also: completecases and dropmissing!.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing(df)
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i     │ x     │ y       │
│     │ Int64 │ Int64 │ String⍰ │
├─────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 4     │ missing │
│ 2   │ 4     │ 2     │ d       │
│ 3   │ 5     │ 1     │ e       │

julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │
source
dropmissing!(df::AbstractDataFrame, cols::Colon=:; disallowmissing::Bool=true)
dropmissing!(df::AbstractDataFrame, cols::Union{AbstractVector, Regex, Not, Between, All};
             disallowmissing::Bool=true)
dropmissing!(df::AbstractDataFrame, cols::Union{Integer, Symbol};
             disallowmissing::Bool=true)

Remove rows with missing values from data frame df and return it. If cols is provided, only missing values in the corresponding columns are considered.

If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.

See also: dropmissing and completecases.

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing!(copy(df))
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing!(copy(df), :x)
3×3 DataFrame
│ Row │ i     │ x     │ y       │
│     │ Int64 │ Int64 │ String⍰ │
├─────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 4     │ missing │
│ 2   │ 4     │ 2     │ d       │
│ 3   │ 5     │ 1     │ e       │

julia> dropmissing!(df3, [:x, :y])
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │
source
DataFrames.eachrowFunction.
eachrow(df::AbstractDataFrame)

Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> eachrow(df)
4-element DataFrameRows:
 DataFrameRow (row 1)
x  1
y  11
 DataFrameRow (row 2)
x  2
y  12
 DataFrameRow (row 3)
x  3
y  13
 DataFrameRow (row 4)
x  4
y  14

julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
 (x = 1, y = 11)
 (x = 2, y = 12)
 (x = 3, y = 13)
 (x = 4, y = 14)

julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
 DataFrameRow (row 4)
y  14
x  4
 DataFrameRow (row 3)
y  13
x  3
source
DataFrames.eachcolFunction.
eachcol(df::AbstractDataFrame, names::Bool=false)

Return a DataFrameColumns that iterates an AbstractDataFrame column by column. If names is equal to false (the default) iteration returns column vectors. If names is equal to true pairs consisting of column name and column vector are yielded.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
 [1, 2, 3, 4]
 [11, 12, 13, 14]

julia> map(eachcol(df)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
 3
 3

julia> sum.(eachcol(df))
2-element Array{Int64,1}:
 10
 50

julia> collect(eachcol(df, true))
2-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
 :x => [1, 2, 3, 4]
 :y => [11, 12, 13, 14]
source
Base.filterFunction.
filter(function, df::AbstractDataFrame)

Return a copy of data frame df containing only rows for which function returns true. The function is passed a DataFrameRow as its only argument.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter(row -> row[:x] > 1, df)
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
source
Base.filter!Function.
filter!(function, df::AbstractDataFrame)

Remove rows from data frame df for which function returns false. The function is passed a DataFrameRow as its only argument.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter!(row -> row[:x] > 1, df);

julia> df
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
source
DataFrames.flattenFunction.
flatten(df::AbstractDataFrame, col::Union{Integer, Symbol})

When column col of data frame df has iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of col is flattened, meaning the column corresponding to col becomes a longer Vector where the original entries are concatenated. Elements of row i of df in columns other than col will be repeated according to the length of df[i, col]. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.

Examples

julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Array… │
├─────┼───────┼────────┤
│ 1   │ 1     │ [1, 2] │
│ 2   │ 2     │ [3, 4] │

julia> flatten(df1, :b)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 2     │
│ 3   │ 2     │ 3     │
│ 4   │ 2     │ 4     │

julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
│ Row │ a     │ b          │
│     │ Int64 │ Tuple…     │
├─────┼───────┼────────────┤
│ 1   │ 1     │ ("p", "q") │
│ 2   │ 2     │ ("r", "s") │

julia> flatten(df2, :b)
4×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ p      │
│ 2   │ 1     │ q      │
│ 3   │ 2     │ r      │
│ 4   │ 2     │ s      │
source
Base.hcatFunction.
hcat(df::AbstractDataFrame...;
     makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
     makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
     makeunique::Bool=false, copycols::Bool=true)

Horizontally concatenate AbstractDataFrames and optionally AbstractVectors.

If AbstractVector is passed then a column name for it is automatically generated as :x1 by default.

If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).

If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.

Example

julia [DataFrame(A=1:3) DataFrame(B=1:3)]
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
│ Row │ A     │ B     │ A_1   │ B_1   │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 4     │
│ 2   │ 2     │ 2     │ 5     │ 5     │
│ 3   │ 3     │ 3     │ 6     │ 6     │

julia> df3.A === df1.A
false

julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);

julia> df3.A === df1.A
true
source
insertcols!(df::DataFrame, ind::Int; name=col,
            makeunique::Bool=false)
insertcols!(df::DataFrame, ind::Int, (:name => col)::Pair{Symbol,<:AbstractVector};
            makeunique::Bool=false)

Insert a column into a data frame in place. Return the updated DataFrame.

Arguments

  • df : the DataFrame to which we want to add a column
  • ind : a position at which we want to insert a column
  • name : the name of the new column
  • col : an AbstractVector giving the contents of the new column
  • makeunique : Defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix

Examples

julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> insertcols!(d, 1, b=['a', 'b', 'c'])
3×2 DataFrame
│ Row │ b    │ a     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 1     │
│ 2   │ 'b'  │ 2     │
│ 3   │ 'c'  │ 3     │

julia> insertcols!(d, 1, :c => [2, 3, 4])
3×3 DataFrame
│ Row │ c     │ b    │ a     │
│     │ Int64 │ Char │ Int64 │
├─────┼───────┼──────┼───────┤
│ 1   │ 2     │ 'a'  │ 1     │
│ 2   │ 3     │ 'b'  │ 2     │
│ 3   │ 4     │ 'c'  │ 3     │
source
DataFrames.mapcolsFunction.
mapcols(f::Union{Function,Type}, df::AbstractDataFrame)

Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars.

Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 121   │
│ 2   │ 4     │ 144   │
│ 3   │ 9     │ 169   │
│ 4   │ 16    │ 196   │
source
Base.namesFunction.
names(df::AbstractDataFrame)

Return a `Vector{Symbol}` of names of columns contained in `df`.
source
DataFrames.nonuniqueFunction.
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)

Return a Vector{Bool} in which true entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal).

See also unique and unique!.

Arguments

  • df : the AbstractDataFrame
  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
source
DataFrames.nrowFunction.
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFramedf.

See also size.

Examples

julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));

julia> size(df)
(10, 3)

julia> nrow(df)
10

julia> ncol(df)
3
source
DataFrames.ncolFunction.
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFramedf.

See also size.

Examples

julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));

julia> size(df)
(10, 3)

julia> nrow(df)
10

julia> ncol(df)
3
source
DataFrames.rename!Function.
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol}; makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString}; makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractArray{<:Pair})
rename!(f::Function, df::AbstractDataFrame)

Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name (a Symbol) and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings in to and from is not allowed.

See also: rename

Examples

julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i     │ x     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
│ Row │ A     │ X     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
│ Row │ a     │ b     │ a_1   │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df) do x
           uppercase(string(x))
       end
1×3 DataFrame
│ Row │ A     │ B     │ A_1   │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
source
DataFrames.renameFunction.
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol}; makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString}; makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractArray{<:Pair})
rename(f::Function, df::AbstractDataFrame)

Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name (a Symbol) and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings in to and from is not allowed.

See also: rename!

Examples

```julia julia> df = DataFrame(i = 1, x = 2, y = 3) 1×3 DataFrame │ Row │ i │ x │ y │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │

julia> rename(df, :i => :A, :x => :X) 1×3 DataFrame │ Row │ A │ X │ y │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │

julia> rename(df, :x => :y, :y => :x) 1×3 DataFrame │ Row │ i │ y │ x │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │

julia> rename(df, [1 => :A, 2 => :X]) 1×3 DataFrame │ Row │ A │ X │ y │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │

julia> rename(df, Dict("i" => "A", "x" => "X")) 1×3 DataFrame │ Row │ A │ X │ y │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │

julia> rename(df) do x uppercase(string(x)) end 1×3 DataFrame │ Row │ I │ X │ Y │ │ │ Int64 │ Int64 │ Int64 │ ├─────┼───────┼───────┼───────┤ │ 1 │ 1 │ 2 │ 3 │```

source
Base.repeatFunction.
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)

Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 4     │
│ 4   │ 2     │ 4     │
│ 5   │ 1     │ 3     │
│ 6   │ 1     │ 3     │
│ 7   │ 2     │ 4     │
│ 8   │ 2     │ 4     │
│ 9   │ 1     │ 3     │
│ 10  │ 1     │ 3     │
│ 11  │ 2     │ 4     │
│ 12  │ 2     │ 4     │
source
repeat(df::AbstractDataFrame, count::Integer)

Construct a data frame by repeating each row in df the number of times specified by count.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
│ 3   │ 1     │ 3     │
│ 4   │ 2     │ 4     │
source
DataFrames.selectFunction.
select(df::AbstractDataFrame, inds...; copycols::Bool=true)

Create a new data frame that contains columns from df specified by inds and return it.

Arguments passed as inds... can be any index that is allowed for column indexing provided that the columns requested in each of them are unique and present in df. In particular, regular expressions, All, Between, and Not selectors are supported.

If more than one argument is passed then they are joined as All(inds...). Note that All selects the union of columns passed to it, so columns selected in different inds... do not have to be unique. For example a call select(df, :col, All()) is valid and creates a new data frame with column :col moved to be the first, provided it is present in df.

If df is a DataFrame return a new DataFrame that contains columns from df specified by inds. If copycols=true (the default), then returned DataFrame holds copies of column vectors in df. If copycols=false, then returned DataFrame shares column vectors with df.

If df is a SubDataFrame then a SubDataFrame is returned if copycols=false and a DataFrame with freshly allocated columns otherwise.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> select(df, :b)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> select(df, Not(:b)) # drop column :b from df
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
source
DataFrames.select!Function.
select!(df::DataFrame, inds...)

Mutate df in place to retain only columns specified by inds... and return it.

Arguments passed as inds... can be any index that is allowed for column indexing provided that the columns requested in each of them are unique and present in df. In particular, regular expressions, All, Between, and Not selectors are supported.

If more than one argument is passed then they are joined as All(inds...). Note that All selects the union of columns passed to it, so columns selected in different inds... do not have to be unique. For example a call select!(df, :col, All()) is valid and moves column :col in the data frame to be the first, provided it is present in df.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> select!(df, 2)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │
source
Base.showFunction.
show([io::IO,] df::AbstractDataFrame;
     allrows::Bool = !get(io, :limit, false),
     allcols::Bool = !get(io, :limit, false),
     allgroups::Bool = !get(io, :limit, false),
     splitcols::Bool = get(io, :limit, false),
     rowlabel::Symbol = :Row,
     summary::Bool = true)

Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.

If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false while splitcols defaults to true.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • df::AbstractDataFrame: The data frame to print.
  • allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
  • allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
  • allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
  • splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
  • rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
  • summary::Bool = true: Whether to print a brief string summary of the data frame.

Examples

julia> using DataFrames

julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);

julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │
│ 3   │ 3     │ z      │
source
show(io::IO, mime::MIME, df::AbstractDataFrame)

Render a data frame to an I/O stream in MIME type mime.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values"
  • df::AbstractDataFrame: The data frame to print.

Additionally selected MIME types support passing the following keyword arguments:

  • MIME type "text/plain" accepts all listed keyword arguments and therir behavior is identical as for show(::IO, ::AbstractDataFrame)
  • MIME type "text/html" accepts summary keyword argument which allows to choose whether to print a brief string summary of the data frame.

Examples

julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
        & A & B\\
        \hline
        & Int64 & String\\
        \hline
        1 & 1 & x \\
        2 & 2 & y \\
        3 & 3 & z \\
\end{tabular}
14

julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
source
Base.sortFunction.
sort(df::AbstractDataFrame, cols;
     alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
     rev::Bool=false, order::Ordering=Forward)

Return a copy of data frame df sorted by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
Base.sort!Function.
sort!(df::AbstractDataFrame, cols;
      alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
      rev::Bool=false, order::Ordering=Forward)

Sort data frame df by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
Base.unique!Function.
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the return DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique.

When unique is called a new data frame is returned; unique! updates df in-place.

See also nonunique.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.)

specifying the column(s) to compare.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df)   # doesn't modify df
unique(df, 1)
unique!(df)  # modifies df
source
Base.vcatFunction.
vcat(dfs::AbstractDataFrame...; cols::Union{Symbol, AbstractVector{Symbol}}=:setequal)

Vertically concatenate AbstractDataFrames.

The cols keyword argument determines the columns of the returned data frame:

  • :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
  • :orderequal: require all data frames to have the same column names and in the same order.
  • :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
  • :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
  • A vector of Symbols: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.

The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.

The element types of columns are determined using promote_type, as with vcat for AbstractVectors.

vcat ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.

Example

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> df3 = DataFrame(A=7:9, C=7:9);

julia> d4 = DataFrame();

julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │

julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A     │ B       │ C       │
│     │ Int64 │ Int64⍰  │ Int64⍰  │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ 1       │ missing │
│ 2   │ 2     │ 2       │ missing │
│ 3   │ 3     │ 3       │ missing │
│ 4   │ 7     │ missing │ 7       │
│ 5   │ 8     │ missing │ 8       │
│ 6   │ 9     │ missing │ 9       │

julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
│ 4   │ 7     │
│ 5   │ 8     │
│ 6   │ 9     │

julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
source
Base.append!Function.
append!(df1::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal)
append!(df::DataFrame, table; cols::Symbol=:setequal)

Add the rows of df2 to the end of df1. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.

Column names of df1 and df2 must be equal. If cols is :setequal (the default) then column names may have different orders and append! is performed by matching column names. If cols is :orderequal then the order of columns in df1 and df2 or table must be the same. In particular, if table is a Dict an error is thrown as it is an unordered collection.

The above rule has the following exceptions:

  • If df1 has no columns then copies of columns from df2 are added to it.
  • If df2 has no columns then calling append! leaves df1 unchanged.

Values corresponding to new rows are appended in-place to the column vectors of df1. Column types are therefore preserved, and new values are converted if necessary. An error is thrown if conversion fails: this is the case in particular if a column in df2 contains missing values but the corresponding column in df1 does not accept them.

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Note

Use vcat instead of append! when more flexibility is needed. Since vcat does not operate in place, it is able to use promotion to find an appropriate element type to hold values from both data frames. It also accepts columns in different orders between df1 and df2.

Use push! to add individual rows to a data frame.

Examples

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4.0:6.0, B=4:6);

julia> append!(df1, df2);

julia> df1
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
source
Base.push!Function.
push!(df::DataFrame, row::Union{Tuple, AbstractArray})
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
      cols::Symbol=:setequal)

Add in-place one row at the end of df taking the values from row.

Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.

If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.

If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:

  • If cols=:setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
  • If cols=:orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches names(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
  • If cols=:intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
  • If cols=:subset then push! behaves like for :intersect but if some column is missing in row then a missing value is pushed to df.

As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.

Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Examples

julia> df = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

julia> push!(df, (true, false))
4×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │

julia> push!(df, df[1, :])
5×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │
│ 5   │ 1     │ 1     │

julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
4×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │
│ 5   │ 1     │ 1     │
│ 6   │ 1     │ 0     │

julia> push!(df, Dict(:A=>1.0, :B=>2.0))
5×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │
│ 5   │ 1     │ 1     │
│ 6   │ 1     │ 0     │
│ 7   │ 1     │ 2     │
source