Functions

Constructing data frames

Base.copy - Function
copy(df::DataFrame; copycols::Bool=true)

Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.
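The difference between the two settings can be seen by mutating the source after copying. This is a minimal sketch; the variable names `deep` and `shallow` are illustrative only.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

deep = copy(df)                       # default copycols=true: new column vectors
shallow = copy(df; copycols = false)  # reuses the column vectors of df

# Mutating a column of df is visible only through the copy that shares it.
df.a[1] = 100

deep.a[1]           # still 1
shallow.a[1]        # now 100
shallow.a === df.a  # true: the very same vector
```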

copy(dfr::DataFrameRow)

Construct a NamedTuple with the same contents as the DataFrameRow. This method returns a NamedTuple so that the returned object is not affected by changes to the parent data frame of which dfr is a view.
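As a small illustration of why a NamedTuple is returned, the sketch below changes the parent after taking both a row view and a copy of it. The data and variable names are made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])
dfr = df[1, :]   # DataFrameRow: a view into df
nt = copy(dfr)   # NamedTuple: an independent snapshot of the row

df.a[1] = 100    # mutate the parent data frame

dfr.a   # 100, the view follows the parent
nt.a    # 1, the copy does not
```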

copy(key::GroupKey)

Construct a NamedTuple with the same contents as the GroupKey.

Base.similar - Function
similar(df::AbstractDataFrame, rows::Integer=nrow(df))

Create a new DataFrame with the same column names and column element types as df. An optional second argument can be provided to request a number of rows that is different from the number of rows present in df.
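A short sketch of the schema-preserving behavior follows. Only the names, element types and row count are inspected here, on the assumption that the cell contents of the new data frame are uninitialized, as with `similar` on plain arrays.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

s = similar(df, 5)   # same schema as df, but 5 rows

names(s) == names(df)                        # true
eltype.(eachcol(s)) == eltype.(eachcol(df))  # true
nrow(s)                                      # 5
```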


Summary information

DataAPI.describe - Function
describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)

Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.

Arguments

  • df : the AbstractDataFrame
  • stats::Union{Symbol, Pair}... : the summary statistics to report. Arguments can be:
    • A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nmissing, and :eltype.
    • :all as the only Symbol argument to return all statistics.
    • A function => name pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
  • cols : a keyword argument that selects a subset of the columns of df to describe. Can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Details

For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.

When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings.

Missing values are filtered out when calculating all statistics; the :nmissing column, however, reports the number of missing values of each variable.
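The rules above can be checked on a small data frame with one Real-based column containing a missing value and one string column. This is a sketch with made-up data.

```julia
using DataFrames

df = DataFrame(x = [1, 1, missing], s = ["a", "a", "b"])

d = describe(df, :mean, :nunique, :nmissing)

d.mean      # 1.0 for x (the missing entry is skipped), nothing for s
d.nunique   # nothing for x (Real-based column), 2 for s
d.nmissing  # 1 for x, 0 for s
```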

If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
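In practice this means a custom statistic should be written against a generic iterable rather than a vector. The helper `value_range` below is hypothetical and only serves to illustrate the point: it uses `maximum` and `minimum`, which accept both plain vectors and skipmissing wrappers.

```julia
using DataFrames

df = DataFrame(x = [1, missing, 3], y = [1.0, 2.0, 3.0])

# Hypothetical helper: works on any iterable, so it accepts both a plain
# vector (column y) and a skipmissing wrapper (column x).
value_range(itr) = maximum(itr) - minimum(itr)

d = describe(df, :nmissing, value_range => :range)

d.range   # 2 for x (missing skipped), 2.0 for y
```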

Examples

julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ i         5.5     1    5.5     10          0  Int64
   2 │ x         0.55    0.1  0.55    1.0         0  Float64
   3 │ y                 a            j           0  Char

julia> describe(df, :min, :max)
3×3 DataFrame
 Row │ variable  min  max
     │ Symbol    Any  Any
─────┼────────────────────
   1 │ i         1    10
   2 │ x         0.1  1.0
   3 │ y         a    j

julia> describe(df, :min, sum => :sum)
3×3 DataFrame
 Row │ variable  min  sum
     │ Symbol    Any  Any
─────┼────────────────────
   1 │ i         1    55
   2 │ x         0.1  5.5
   3 │ y         a

julia> describe(df, :min, sum => :sum, cols=:x)
1×3 DataFrame
 Row │ variable  min      sum
     │ Symbol    Float64  Float64
─────┼────────────────────────────
   1 │ x             0.1      5.5

Base.length - Function
length(dfr::DataFrameRow)

Return the number of elements of dfr.

See also: size

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
 Row │ a      b
     │ Int64  Char
─────┼─────────────
   1 │     1  a

julia> length(dfr)
2

DataFrames.ncol - Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFrame df.

See also size.

Examples

julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));

julia> size(df)
(10, 3)

julia> nrow(df)
10

julia> ncol(df)
3

Base.ndims - Function
ndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})

Return the number of dimensions of a data frame, which is always 2.

ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})

Return the number of dimensions of a data frame row, which is always 1.

DataFrames.nrow - Function
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFrame df. This entry shares its docstring and examples with DataFrames.ncol above.

See also size.

DataFrames.rownumber - Function
rownumber(dfr::DataFrameRow)

Return the row number of dfr in the AbstractDataFrame that it was created from.

Note that this differs from the first element in the tuple returned by parentindices. The latter gives the row number in the parent(dfr), which is the source DataFrame where data that dfr gives access to is stored.

Examples

julia> df = DataFrame(reshape(1:12, 3, 4))
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      7     10
   2 │     2      5      8     11
   3 │     3      6      9     12

julia> dfr = df[2, :]
DataFrameRow
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   2 │     2      5      8     11

julia> rownumber(dfr)
2

julia> parentindices(dfr)
(2, Base.OneTo(4))

julia> parent(dfr)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      7     10
   2 │     2      5      8     11
   3 │     3      6      9     12

julia> dfv = @view df[2:3, 1:3]
2×3 SubDataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      5      8
   2 │     3      6      9

julia> dfrv = dfv[2, :]
DataFrameRow
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   3 │     3      6      9

julia> rownumber(dfrv)
2

julia> parentindices(dfrv)
(3, 1:3)

julia> parent(dfrv)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      7     10
   2 │     2      5      8     11
   3 │     3      6      9     12

Base.show - Function
show([io::IO, ]df::AbstractDataFrame;
     allrows::Bool = !get(io, :limit, false),
     allcols::Bool = !get(io, :limit, false),
     allgroups::Bool = !get(io, :limit, false),
     rowlabel::Symbol = :Row,
     summary::Bool = true,
     eltypes::Bool = true,
     truncate::Int = 32,
     kwargs...)

Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.

If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • df::AbstractDataFrame: The data frame to print.
  • allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
  • allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
  • allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
  • rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
  • summary::Bool = true: Whether to print a brief string summary of the data frame.
  • eltypes::Bool = true: Whether to print the column types under column names.
  • truncate::Int = 32: the maximal display width the output can use before being truncated (in the textwidth sense, excluding …). If truncate is 0 or less, no truncation is applied.
  • kwargs...: Any keyword argument supported by the function pretty_table of PrettyTables.jl can be passed here to customize the output.
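The interaction between the limit IOContext property and the allrows default can be sketched with sprint, which renders into an in-memory buffer. The exact number of rows printed in the limited case depends on the display size assumed for the stream, so the sketch only compares output lengths.

```julia
using DataFrames

df = DataFrame(a = 1:100)

# No :limit property on the stream: allrows and allcols default to true.
full = sprint(show, df)

# :limit => true: allrows defaults to false, so only a subset of rows that
# fits the display height is printed.
limited = sprint(show, df; context = :limit => true)

# An explicit keyword overrides the default derived from the stream.
forced = sprint(io -> show(io, df; allrows = true); context = :limit => true)

count(==('\n'), full) > count(==('\n'), limited)     # true
count(==('\n'), forced) > count(==('\n'), limited)   # true
```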

Examples

julia> using DataFrames

julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);

julia> show(df, show_row_number=false)
3×2 DataFrame
 A     │ B
 Int64 │ String
───────┼────────
     1 │ x
     2 │ y
     3 │ z

show(io::IO, mime::MIME, df::AbstractDataFrame)

Render a data frame to an I/O stream in MIME type mime.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values" (the last two MIME types do not support showing #undef values)
  • df::AbstractDataFrame: The data frame to print.

Additionally selected MIME types support passing the following keyword arguments:

  • MIME type "text/plain" accepts all listed keyword arguments, and their behavior is identical to that for show(::IO, ::AbstractDataFrame)
  • MIME type "text/html" accepts the summary keyword argument, which controls whether a brief string summary of the data frame is printed.

Examples

julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
        & A & B\\
        \hline
        & Int64 & String\\
        \hline
        1 & 1 & x \\
        2 & 2 & y \\
        3 & 3 & z \\
\end{tabular}
14

julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"

Base.size - Function
size(df::AbstractDataFrame[, dim])

Return a tuple containing the number of rows and columns of df. Optionally a dimension dim can be specified, where 1 corresponds to rows and 2 corresponds to columns.

See also: nrow, ncol

Examples

julia> df = DataFrame(a=1:3, b='a':'c');

julia> size(df)
(3, 2)

julia> size(df, 1)
3

size(dfr::DataFrameRow[, dim])

Return a 1-tuple containing the number of elements of dfr. If an optional dimension dim is specified, it must be 1, and the number of elements is returned directly as a number.

See also: length

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
 Row │ a      b
     │ Int64  Char
─────┼─────────────
   1 │     1  a

julia> size(dfr)
(2,)

julia> size(dfr, 1)
2

Working with column names

Base.names - Function
names(df::AbstractDataFrame)
names(df::AbstractDataFrame, cols)

Return a freshly allocated Vector{String} of names of columns contained in df.

If cols is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Cols, Not, and Between). cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) or a Type, in which case columns whose eltype is a subtype of cols are returned.

See also propertynames which returns a Vector{Symbol}.
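A few of the selector forms described above, applied to a one-row data frame with made-up column names:

```julia
using DataFrames

df = DataFrame(x1 = 1, x2 = 2.0, y = "a")

names(df)                    # ["x1", "x2", "y"]
names(df, r"^x")             # ["x1", "x2"]
names(df, Not(:y))           # ["x1", "x2"]
names(df, Between(:x2, :y))  # ["x2", "y"]
names(df, Real)              # ["x1", "x2"], selected by element type
propertynames(df)            # [:x1, :x2, :y]
```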

Base.propertynames - Function
propertynames(df::AbstractDataFrame)

Return a freshly allocated Vector{Symbol} of names of columns contained in df.

DataFrames.rename - Function
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)

Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame; if it is a SubDataFrame then renaming is only allowed if it was created using : as a column selector.
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings in to and from is not allowed.

See also: rename!

Examples

julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
 Row │ A      X      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
 Row │ i      y      x
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
 Row │ A      X      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
 Row │ A      X      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename(uppercase, df)
1×3 DataFrame
 Row │ I      X      Y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

DataFrames.rename! - Function
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)

Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings in to and from is not allowed.

See also: rename

Examples

julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
 Row │ A      X      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
 Row │ a      b      a_1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> rename!(uppercase, df)
1×3 DataFrame
 Row │ A      B      A_1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

Mutating and transforming data frames and grouped data frames

Base.append! - Function
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
        promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
        promote::Bool=(cols in [:union, :subset]))

Add the rows of df2 to the end of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.

The exact behavior of append! depends on the cols argument:

  • If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
  • If cols == :subset then append! behaves as for :intersect, but if some column is missing in df2 then a missing value is pushed to df.
  • If cols == :union then append! adds columns missing in df that are present in df2; for columns present in df but missing in df2 a missing value is pushed.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

The above rule has the following exceptions:

  • If df has no columns then copies of columns from df2 are added to it.
  • If df2 has no columns then calling append! leaves df unchanged.

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
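The cols and promote options described above can be sketched as follows. The data frames are made up for the example, and each case starts from a fresh data frame so that the effects do not interact.

```julia
using DataFrames

# cols = :union adds column b to df and fills the gaps with missing.
df = DataFrame(a = [1, 2])
append!(df, DataFrame(a = [3], b = ["x"]); cols = :union)
df.a   # [1, 2, 3]
df.b   # [missing, missing, "x"]

# cols = :subset tolerates a column that is absent from the appended table.
df2 = DataFrame(a = [1], b = [2.0])
append!(df2, DataFrame(a = [2]); cols = :subset)
df2.b   # [2.0, missing]

# promote = true lets a Float64 value widen an Int column.
df3 = DataFrame(a = [1])
append!(df3, DataFrame(a = [1.5]); promote = true)
eltype(df3.a)   # Float64
```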

See also

Use push! to add individual rows to a data frame and vcat to vertically concatenate data frames.

Examples

julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3

julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
 Row │ A        B
     │ Float64  Int64
─────┼────────────────
   1 │     4.0      4
   2 │     5.0      5
   3 │     6.0      6

julia> append!(df1, df2);

julia> df1
6×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     5      5
   6 │     6      6

DataFrames.combine - Function
combine(df::AbstractDataFrame, args...; renamecols::Bool=true)
combine(f::Callable, df::AbstractDataFrame; renamecols::Bool=true)
combine(gd::GroupedDataFrame, args...;
        keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
combine(f::Base.Callable, gd::GroupedDataFrame;
        keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)

Create a new data frame that contains columns from df or gd specified by args and return it. The result can have any number of rows that is determined by the values returned by passed transformations.

The common rules for all transformation functions supported by DataFrames.jl are explained and compared in detail below.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform.
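The difference between the three functions shows up in the shape of the result. A minimal sketch on a made-up grouped data frame:

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], v = [10, 20, 30])
gd = groupby(df, :g)

c = combine(gd, :v => sum)     # one row per group
s = select(gd, :v => sum)      # one row per source row, only keys and new columns
t = transform(gd, :v => sum)   # one row per source row, all source columns kept

nrow(c), names(c)   # (2, ["g", "v_sum"])
nrow(s), names(s)   # (3, ["g", "v_sum"])
nrow(t), names(t)   # (3, ["g", "v", "v_sum"])
```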

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
  4. a col => target_cols pair, which renames the column col to target_cols, which must be a single name (as a Symbol or a string).
  5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be a single name (as a Symbol or a string).
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a DataFrame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
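For instance, combining a scalar-returning transformation with a plain column selection repeats the scalar on every row. A short sketch with made-up data:

```julia
using DataFrames

df = DataFrame(a = 1:3)

# sum returns a single value; :a returns a vector of length 3,
# so the single value is repeated three times.
res = combine(df, :a => sum => :total, :a)

res.total   # [6, 6, 6]
res.a       # [1, 2, 3]
```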

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.

If a collection of column names is passed then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). By contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
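Both sides of this rule can be sketched on a small data frame. The `threw` flag is only an illustrative way to record that the last call raises an error without depending on its exact message.

```julia
using DataFrames

df = DataFrame(a = 1, b = 2, c = 3)

# Move column :c to the front; the repeated :c coming from `:` is ignored.
select!(df, :c, :)
names(df)   # ["c", "a", "b"]

# Duplicates requested through collections of names are accepted.
dup = select(df, [:a], :, r"a")
names(dup)  # ["a", "c", "b"]

# A renaming that would produce a second column named :a is rejected.
threw = try
    select(df, :a, :a => :a)
    false
catch
    true
end
```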

As a general rule if copycols=true columns are copied and when copycols=false columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.
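Since copycols is a keyword of select and transform rather than of combine, the sketch below uses select on a view to illustrate the rule. The data are made up.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
sdf = view(df, 1:2, :)   # SubDataFrame over the first two rows

copied = select(sdf, :a)                    # default copycols=true
shared = select(sdf, :a; copycols = false)  # stays a view

copied isa DataFrame      # true
shared isa SubDataFrame   # true

# Writing through the view changes the parent; the copy is unaffected.
shared.a[1] = 100
df.a[1]       # 100
copied.a[1]   # 1
```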

Keyword arguments

  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • keepkeys::Bool=true : whether grouping columns of gd should be kept in the returned data frame.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> combine(df, :a => sum, nrow, renamecols=false)
1×2 DataFrame
 Row │ a      nrow
     │ Int64  Int64
─────┼──────────────
   1 │     6      3

julia> combine(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
 Row │ c         b
     │ Float64   Int64
─────┼─────────────────
   1 │ 0.841471      4
   2 │ 0.909297      5
   3 │ 0.14112       6

julia> combine(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
 Row │ a      b      a_b_function
     │ Int64  Int64  Float64
─────┼────────────────────────────
   1 │     1      4           0.0
   2 │     2      5           2.0
   3 │     3      6           4.0

julia> combine(df, names(df) .=> [minimum maximum])
1×4 DataFrame
 Row │ a_minimum  b_minimum  a_maximum  b_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1          4          3          6

julia> using Statistics

julia> combine(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
 Row │ a_b
     │ Float64
─────┼─────────
   1 │     2.5
   2 │     3.5
   3 │     4.5

julia> combine(first, df)
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> combine(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
               AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
 Row │ stats                    mean     std
     │ NamedTup…                Float64  Float64
─────┼───────────────────────────────────────────
   1 │ (mean = 4.0, std = 3.0)      4.0      3.0
   2 │ (mean = 5.0, std = 3.0)      5.0      3.0
   3 │ (mean = 6.0, std = 3.0)      6.0      3.0

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
 Row │ a      c_sum  nrow
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      6      2
   2 │     2      8      2
   3 │     3     10      2
   4 │     4     12      2

julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
 Row │ a      c_sum  nrow
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      6      2
⋮
Last Group (1 row): a = 4
 Row │ a      c_sum  nrow
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     4     12      2

julia> combine(gd) do d # do syntax for the slower variant
           sum(d.c)
       end
4×2 DataFrame
 Row │ a      x1
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     2      8
   3 │     3     10
   4 │     4     12

julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
 Row │ a      sum_log_c
     │ Int64  Float64
─────┼──────────────────
   1 │     1    1.60944
   2 │     2    2.48491
   3 │     3    3.04452
   4 │     4    3.46574

julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
 Row │ a      b_sum  c_sum
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      6
   2 │     2      2      8
   3 │     3      4     10
   4 │     4      2     12

julia> combine(gd) do sdf # dropping group when DataFrame() is returned
          sdf.c[1] != 1 ? sdf : DataFrame()
       end
6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      1      2
   2 │     2      1      6
   3 │     3      2      3
   4 │     3      2      7
   5 │     4      1      4
   6 │     4      1      8

# auto-splatting, renaming and keepkeys
julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
 Row │ b1     c1     b_c_+
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      1      3
   2 │     2      5      7
   3 │     1      2      3
   4 │     1      6      7
   5 │     2      3      5
   6 │     2      7      9
   7 │     1      4      5
   8 │     1      8      9

# broadcasting and column expansion
julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
 Row │ a      b      min    max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      1      2
   2 │     1      2      2      5
   3 │     2      1      1      2
   4 │     2      1      1      6
   5 │     3      2      2      3
   6 │     3      2      2      7
   7 │     4      1      1      4
   8 │     4      1      1      8

julia> combine(gd, [:b, :c] .=> Ref) # preventing vector from being spread across multiple rows
4×3 DataFrame
 Row │ a      b_Ref      c_Ref
     │ Int64  SubArray…  SubArray…
─────┼─────────────────────────────
   1 │     1  [2, 2]     [1, 5]
   2 │     2  [1, 1]     [2, 6]
   3 │     3  [2, 2]     [3, 7]
   4 │     4  [1, 1]     [4, 8]

julia> combine(gd, AsTable(Not(:a)) => Ref) # protecting result
4×2 DataFrame
 Row │ a      b_c_Ref
     │ Int64  NamedTup…
─────┼─────────────────────────────────
   1 │     1  (b = [2, 2], c = [1, 5])
   2 │     2  (b = [1, 1], c = [2, 6])
   3 │     3  (b = [2, 2], c = [3, 7])
   4 │     4  (b = [1, 1], c = [4, 8])

julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
 Row │ a      b      c      b_c
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      1      3
   2 │     1      2      5      7
   3 │     2      1      2      3
   4 │     2      1      6      7
   5 │     3      2      3      5
   6 │     3      2      7      9
   7 │     4      1      4      5
   8 │     4      1      8      9

DataFrames.flatten - Function
flatten(df::AbstractDataFrame, cols)

When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Examples

julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])
2×3 DataFrame
 Row │ a      b       c
     │ Int64  Array…  Array…
─────┼───────────────────────
   1 │     1  [1, 2]  [5, 6]
   2 │     2  [3, 4]  [7, 8]

julia> flatten(df1, :b)
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Array…
─────┼──────────────────────
   1 │     1      1  [5, 6]
   2 │     1      2  [5, 6]
   3 │     2      3  [7, 8]
   4 │     2      4  [7, 8]

julia> flatten(df1, [:b, :c])
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      5
   2 │     1      2      6
   3 │     2      3      7
   4 │     2      4      8

julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
 Row │ a      b
     │ Int64  Tuple…
─────┼───────────────────
   1 │     1  ("p", "q")
   2 │     2  ("r", "s")

julia> flatten(df2, :b)
4×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  p
   2 │     1  q
   3 │     2  r
   4 │     2  s

julia> df3 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7]])
2×3 DataFrame
 Row │ a      b       c
     │ Int64  Array…  Array…
─────┼───────────────────────
   1 │     1  [1, 2]  [5, 6]
   2 │     2  [3, 4]  [7]

julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c are not the same in row 2
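
The note that flattened elements are not copied can be illustrated with a small sketch. The names inner1, inner2 and flat are hypothetical and chosen only for this illustration; the column v holds vectors whose elements are themselves mutable vectors.

```julia
using DataFrames

inner1 = [10]
inner2 = [20]
# Each entry of column v is a vector of (mutable) vectors.
df = DataFrame(id = [1, 2], v = [[inner1, inner2], [[30]]])

# One output row per element of each entry of v (3 rows in total).
flat = flatten(df, :v)

# The flattened column stores the same inner vectors, not copies of them.
is_same_object = flat.v[1] === inner1

# Mutating an element through the flattened data frame
# is therefore visible in the original data frame.
push!(flat.v[1], 11)
seen_in_df = df.v[1][1]
```

If independent data is needed, the flattened column can be deep-copied afterwards, for example with deepcopy(flat).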

Base.hcat - Function
hcat(df::AbstractDataFrame...;
     makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
     makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
     makeunique::Bool=false, copycols::Bool=true)

Horizontally concatenate AbstractDataFrames and optionally AbstractVectors.

If AbstractVector is passed then a column name for it is automatically generated as :x1 by default.

If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).

If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.

Example

julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3

julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     5      5
   3 │     6      6

julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
 Row │ A      B      A_1    B_1
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      1      4      4
   2 │     2      2      5      5
   3 │     3      3      6      6

julia> df3.A === df1.A
false

julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);

julia> df3.A === df1.A
true
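
The caution about copycols=false can be made concrete with a minimal sketch. The names left, right, shared and copied are hypothetical; the point is only that with shared columns a write through one object is seen by the other.

```julia
using DataFrames

left = DataFrame(A = [1, 2, 3])
right = DataFrame(B = [4, 5, 6])

# Columns are shared with the sources when copycols=false.
shared = hcat(left, right, copycols=false)
shared.A[1] = 100
left_after_shared_write = left.A[1]    # the source sees the change

# With the default copycols=true the sources are isolated.
copied = hcat(left, right)
copied.B[1] = -1
right_after_copied_write = right.B[1]  # the source is unchanged
```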

DataFrames.insertcols! - Function
insertcols!(df::DataFrame[, col], (name=>val)::Pair...;
            makeunique::Bool=false, copycols::Bool=true)

Insert a column into a data frame in place. Return the updated DataFrame. If col is omitted it is set to ncol(df)+1 (the column is inserted as the last column).

Arguments

  • df : the DataFrame to which we want to add columns
  • col : a position at which we want to insert a column, passed as an integer or a column name (a string or a Symbol); the column selected with col and columns following it are shifted to the right in df after the operation
  • name : the name of the new column
  • val : an AbstractVector giving the contents of the new column, or a value of any type other than AbstractArray, which will be repeated to fill a new vector; as a particular rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way.
  • makeunique : Defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
  • copycols : whether vectors passed as columns should be copied

If val is an AbstractRange then the result of collect(val) is inserted.

Examples

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> insertcols!(df, 1, :b => 'a':'c')
3×2 DataFrame
 Row │ b     a
     │ Char  Int64
─────┼─────────────
   1 │ a         1
   2 │ b         2
   3 │ c         3

julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
 Row │ b     c      c_1    a
     │ Char  Int64  Int64  Int64
─────┼───────────────────────────
   1 │ a         2      3      1
   2 │ b         3      4      2
   3 │ c         4      5      3
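
As a further sketch of the rules for val and col described above (the column names flag, r and w are hypothetical), the following shows a scalar being repeated, a position given by column name, a range being collected, and a value protected by Ref:

```julia
using DataFrames

df = DataFrame(a = 1:3)

# A scalar is repeated to fill the new column; col is omitted,
# so the column is appended as the last one.
insertcols!(df, :flag => true)

# col given as a column name: the new column goes before :a.
# The range 1:3 is collected into a Vector before being stored.
insertcols!(df, :a, :r => 1:3)

# A vector wrapped in Ref is unwrapped and stored in every row
# instead of being treated as the column itself.
insertcols!(df, :w => Ref([1, 2]))

column_order = names(df)
```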

DataFrames.mapcols - Function
mapcols(f::Union{Function, Type}, df::AbstractDataFrame)

Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> mapcols(x -> x.^2, df)
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1    121
   2 │     4    144
   3 │     9    169
   4 │    16    196
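
Two points from the description above, scalar return values and the guarantee that columns are not reused, can be sketched as follows (totals and same are hypothetical names):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3, 4], y = [11, 12, 13, 14])

# f returns a scalar for every column, so the result has one row.
totals = mapcols(sum, df)

# f returns its argument; mapcols copies it before storing it.
same = mapcols(identity, df)
equal_contents = same.x == df.x
shares_memory = same.x === df.x
```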

DataFrames.mapcols! - Function
mapcols!(f::Union{Function, Type}, df::DataFrame)

Update a DataFrame in-place where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

Note that mapcols! reuses the columns from df if they are returned by f.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> mapcols!(x -> x.^2, df);

julia> df
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1    121
   2 │     4    144
   3 │     9    169
   4 │    16    196
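
The column reuse mentioned above can be observed with a small sketch (xcol is a hypothetical name holding a reference to the original column):

```julia
using DataFrames

df = DataFrame(x = [1, 2], y = [3, 4])
xcol = df.x                   # reference to the stored column, no copy

# f returns its argument, so the very same vector stays in df.
mapcols!(identity, df)
reused = df.x === xcol

# f allocates a new vector, so df now holds a different object
# and the old vector is left untouched.
mapcols!(v -> v .* 2, df)
replaced = df.x !== xcol
```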

Base.push! - Function
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
      cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))

Add in-place one row at the end of df taking the values from row.

Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.

If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.

If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:

  • If cols == :setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
  • If cols == :subset then push! behaves like for :intersect but if some column is missing in row then a missing value is pushed to df.
  • If cols == :union then columns missing in df that are present in row are added to df (using missing for existing rows) and a missing value is pushed to columns missing in row that are present in df.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.

Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Examples

julia> df = DataFrame(A=1:3, B=1:3);

julia> push!(df, (true, false))
4×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      0

julia> push!(df, df[1, :])
5×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      0
   5 │     1      1

julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      0
   5 │     1      1
   6 │     1      0

julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
 Row │ A        B        C
     │ Float64  Int64?   Float64?
─────┼─────────────────────────────
   1 │     1.0        1  missing
   2 │     2.0        2  missing
   3 │     3.0        3  missing
   4 │     1.0        0  missing
   5 │     1.0        1  missing
   6 │     1.0        0  missing
   7 │     1.0  missing        1.0

julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
 Row │ A          B        C
     │ Float64?   Int64?   Float64?
─────┼───────────────────────────────
   1 │       1.0        1  missing
   2 │       2.0        2  missing
   3 │       3.0        3  missing
   4 │       1.0        0  missing
   5 │       1.0        1  missing
   6 │       1.0        0  missing
   7 │       1.0  missing        1.0
   8 │ missing    missing  missing
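
The promote keyword and the special case of a data frame without columns are not covered by the examples above. A minimal sketch, using hypothetical names failed and empty_df, and assuming that a failed push! leaves df unchanged:

```julia
using DataFrames

df = DataFrame(A = [1, 2], B = ["x", "y"])

# 1.5 cannot be converted to Int and promote=false by default,
# so an error is thrown.
failed = try
    push!(df, (1.5, "z"))
    false
catch
    true
end
rows_after_failure = nrow(df)

# With promote=true column A is reallocated with a wider element type.
push!(df, (1.5, "z"), promote=true)

# A data frame with no columns takes names and order from the row.
empty_df = DataFrame()
push!(empty_df, (a = 1, b = "x"))
```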

Base.repeat - Function
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)

Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     1      3
   3 │     2      4
   4 │     2      4
   5 │     1      3
   6 │     1      3
   7 │     2      4
   8 │     2      4
   9 │     1      3
  10 │     1      3
  11 │     2      4
  12 │     2      4
repeat(df::AbstractDataFrame, count::Integer)

Construct a data frame by repeating the full set of rows in df the number of times specified by count.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> repeat(df, 2)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     1      3
   4 │     2      4

DataFrames.repeat! - Function
repeat!(df::DataFrame; inner::Integer = 1, outer::Integer = 1)

Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> repeat!(df, inner = 2, outer = 3);

julia> df
12×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     1      3
   3 │     2      4
   4 │     2      4
   5 │     1      3
   6 │     1      3
   7 │     2      4
   8 │     2      4
   9 │     1      3
  10 │     1      3
  11 │     2      4
  12 │     2      4
repeat!(df::DataFrame, count::Integer)

Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> repeat!(df, 2)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     1      3
   4 │     2      4

DataFrames.select - Function
select(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true)
select(args::Callable, df::DataFrame; renamecols::Bool=true)
select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true,
       ungroup::Bool=true, renamecols::Bool=true)
select(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
       keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)

Create a new data frame that contains columns from df or gd specified by args and return it. The result is guaranteed to have the same number of rows as df, except when no columns are selected (in which case the result has zero rows).

The common rules for all transformation functions supported by DataFrames.jl are explained and compared in detail below.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform.
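
The difference between the three functions can be sketched on a tiny grouped data frame (the column names g, x and total are hypothetical):

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1], x = [10, 20, 30])
gd = groupby(df, :g)

# combine: one row per group, in the order of groups.
c = combine(gd, :x => sum => :total)

# select: same number and order of rows as df,
# grouping column plus the newly computed column only.
s = select(gd, :x => sum => :total)

# transform: same rows as df, all source columns plus the new column.
t = transform(gd, :x => sum => :total)
```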

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
  4. a col => target_cols pair, which renames the column col to target_cols, which must be a single name (as a Symbol or a string).
  5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be a single name (as a Symbol or a string).
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
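
Several of the specification forms listed above can be sketched side by side; the data and the target names total, y and n are hypothetical.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gd = groupby(df, :g)

# Form 2: cols => function, the target name is generated (x_sum).
auto_named = combine(gd, :x => sum)

# Form 3: cols => function => target_cols with an explicit name.
explicit = combine(gd, :x => sum => :total)

# Form 4: col => target_cols renames a column.
renamed = select(df, :x => :y)

# Form 5: nrow => target_cols counts the rows in each group.
counts = combine(gd, nrow => :n)

# Form 6: a vector of pairs, here built with broadcasting.
several = combine(gd, [:x] .=> [minimum, maximum])
```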

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function.
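
A short sketch of the AsTable rule, compared with the default positional form (the target name s is hypothetical):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# With AsTable the function receives one NamedTuple of column vectors.
via_table = select(df, AsTable([:a, :b]) => (nt -> nt.a .+ nt.b) => :s)

# Without AsTable the columns are passed as positional arguments.
via_args = select(df, [:a, :b] => ((a, b) -> a .+ b) => :s)
```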

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.
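
The effect of target_cols on a function that returns a NamedTuple can be sketched as follows. The helper stats and the names lo, hi, smallest and largest are hypothetical and serve only as an illustration of points 1 and 3.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gd = groupby(df, :g)

# Hypothetical helper returning a NamedTuple of two scalars.
stats(x) = (lo = minimum(x), hi = maximum(x))

# Point 1: only a function is passed, so the NamedTuple
# is expanded into the columns lo and hi.
r1 = combine(sdf -> stats(sdf.x), gd)

# Point 3 with AsTable: multiple columns, names taken from the NamedTuple.
r2 = combine(gd, :x => stats => AsTable)

# Point 3 with a vector of names: the generated names are replaced.
r3 = combine(gd, :x => stats => [:smallest, :largest])
```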

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a DataFrame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

Single values and vectors can be mixed when multiple transformations are requested. In this case a single value is repeated to match the length of the columns specified by the returned vectors.
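
A minimal sketch of mixing a scalar result with a vector result (the target names total and double are hypothetical):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# sum returns a single value, the second transformation returns a vector;
# the single value is repeated to match the vector's length.
mixed = combine(df, :x => sum => :total, :x => (v -> v .* 2) => :double)
```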

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
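
The difference between whole-column and row-wise application can be sketched as follows (the target names prod, sum and centered are hypothetical):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# ByRow: the function is called once per row with one scalar per column.
by_args = select(df, [:a, :b] => ByRow((a, b) -> a * b) => :prod)

# ByRow with AsTable: the function is called once per row
# with a NamedTuple holding that row's values.
by_table = select(df, AsTable(:) => ByRow(r -> r.a + r.b) => :sum)

# Without ByRow the function receives the whole column at once,
# which is needed when the result depends on all rows.
whole = select(df, :a => (v -> v .- sum(v) / length(v)) => :centered)
```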

If a collection of column names is passed then duplicate column names in the target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, the syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
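
These rules can be sketched with the non-mutating select, which follows the same naming rules (the data frame and the name threw are hypothetical):

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4, c = 5:6)

# Move column :c to the front; the duplicate produced by : is ignored.
moved = names(select(df, :c, :))

# Duplicates coming from multi-column selectors are accepted.
deduplicated = names(select(df, [:a], :, r"a"))

# A duplicate produced by a rename is rejected.
threw = try
    select(df, :a, :a => :a)
    false
catch
    true
end
```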

As a general rule if copycols=true columns are copied and when copycols=false columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).
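
For plain column selection, the effect of copycols can be checked with === as in this small sketch (copied and shared are hypothetical names):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Default: the selected column is copied.
copied = select(df, :a)
copied_is_alias = copied.a === df.a

# copycols=false: the selected column is reused.
shared = select(df, :a, copycols=false)
shared_is_alias = shared.a === df.a
```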

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.
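
A minimal sketch of the SubDataFrame case, restricted to plain column selection (the names sdf, materialized and still_view are hypothetical):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
sdf = view(df, 1:2, :)

# copycols=true (the default): a new DataFrame with copied columns.
materialized = select(sdf, :a)

# copycols=false: the result is still a view into df.
still_view = select(sdf, :a, copycols=false)

# Writing through the view therefore changes the parent data frame.
still_view.a[1] = 100
parent_value = df.a[1]
```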

Keyword arguments

  • copycols::Bool=true : whether columns of the source data frame should be copied if no transformation is applied to them.
  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • keepkeys::Bool=true : whether grouping columns of gd should be kept in the returned data frame.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> select(df, 2)
3×1 DataFrame
 Row │ b
     │ Int64
─────┼───────
   1 │     4
   2 │     5
   3 │     6

julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
 Row │ c         b
     │ Float64   Int64
─────┼─────────────────
   1 │ 0.841471      4
   2 │ 0.909297      5
   3 │ 0.14112       6

julia> select(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
 Row │ a      b      a_b_function
     │ Int64  Int64  Float64
─────┼────────────────────────────
   1 │     1      4           0.0
   2 │     2      5           2.0
   3 │     3      6           4.0

julia> select(df, names(df) .=> [minimum maximum])
3×4 DataFrame
 Row │ a_minimum  b_minimum  a_maximum  b_maximum
     │ Int64      Int64      Int64      Int64
─────┼────────────────────────────────────────────
   1 │         1          4          3          6
   2 │         1          4          3          6
   3 │         1          4          3          6

julia> using Statistics

julia> select(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
 Row │ a_b
     │ Float64
─────┼─────────
   1 │     2.5
   2 │     3.5
   3 │     4.5

julia> select(first, df)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     1      4
   3 │     1      4

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> select(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
              AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
 Row │ stats                    mean     std
     │ NamedTup…                Float64  Float64
─────┼───────────────────────────────────────────
   1 │ (mean = 4.0, std = 3.0)      4.0      3.0
   2 │ (mean = 5.0, std = 3.0)      5.0      3.0
   3 │ (mean = 6.0, std = 3.0)      6.0      3.0

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8)
8×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      1      2
   3 │     1      2      3
   4 │     2      1      4
   5 │     2      2      5
   6 │     1      1      6
   7 │     1      2      7
   8 │     2      1      8

julia> gd = groupby(df, :a);

# specifying a name for target column
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
8×2 DataFrame
 Row │ a      sum_log_c
     │ Int64  Float64
─────┼──────────────────
   1 │     1    5.52943
   2 │     1    5.52943
   3 │     1    5.52943
   4 │     2    5.07517
   5 │     2    5.07517
   6 │     1    5.52943
   7 │     1    5.52943
   8 │     2    5.07517

julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
 Row │ a      b_sum  c_sum
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      8     19
   2 │     1      8     19
   3 │     1      8     19
   4 │     2      4     17
   5 │     2      4     17
   6 │     1      8     19
   7 │     1      8     19
   8 │     2      4     17

# multiple arguments, renaming and keepkeys
julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
 Row │ b1     c1     b_c_+
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      1      3
   2 │     1      2      3
   3 │     2      3      5
   4 │     1      4      5
   5 │     2      5      7
   6 │     1      6      7
   7 │     2      7      9
   8 │     1      8      9

# broadcasting and column expansion
julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
 Row │ a      b      min    max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      1      2
   2 │     1      1      1      2
   3 │     1      2      2      3
   4 │     2      1      1      4
   5 │     2      2      2      5
   6 │     1      1      1      6
   7 │     1      2      2      7
   8 │     2      1      1      8

julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
 Row │ a      b      c      b_c
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      1      3
   2 │     1      1      2      3
   3 │     1      2      3      5
   4 │     2      1      4      5
   5 │     2      2      5      7
   6 │     1      1      6      7
   7 │     1      2      7      9
   8 │     2      1      8      9

DataFrames.select! - Function
select!(df::DataFrame, args...; renamecols::Bool=true)
select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true)
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)

Mutate df or gd in place to retain only columns or transformations specified by args... and return it. The result is guaranteed to have the same number of rows as df or parent of gd, except when no columns are selected (in which case the result has zero rows).

If gd is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame objects constructed from the same parent data frame, they might become corrupted.
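
A minimal sketch of the in-place behaviour for a plain DataFrame (the target name a2 and the variable result are hypothetical):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Keep :b and a transformed version of :a; :a itself is dropped.
result = select!(df, :b, :a => ByRow(x -> 2x) => :a2)

# The same object is returned, now holding only the requested columns.
returned_same_object = result === df
remaining_columns = names(df)
```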

The common rules for all transformation functions supported by DataFrames.jl are explained and compared in detail below.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform.
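The difference between the three functions can be seen by applying the same transformation with each of them. This is a small sketch assuming DataFrames.jl 1.x; the columns g and v and the target name v_sum are illustrative only.

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1, 2], v = [10, 20, 30, 40])
gd = groupby(df, :g)

# combine: one row per group (here the per-group sum).
by_group = combine(gd, :v => sum => :v_sum)

# select: one row per source row, in source order;
# only the grouping column and the new column are kept.
selected = select(gd, :v => sum => :v_sum)

# transform: like select, but all source columns are kept as well.
transformed = transform(gd, :v => sum => :v_sum)
```

In select and transform the per-group sum is repeated for every row belonging to that group.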

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
  4. a col => target_cols pair, which renames the column col to target_cols, which must be a single name (as a Symbol or a string).
  5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be a single name (as a Symbol or a string).
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
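Several of the forms listed above can be sketched as follows, assuming DataFrames.jl 1.x. The data and the target names y_max, n and x_renamed are hypothetical.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a"], x = [1, 2, 3], y = [4.0, 5.0, 6.0])
gd = groupby(df, :g)

stats = combine(gd,
    :x => sum,                # form 2: target name is generated (x_sum)
    :y => maximum => :y_max,  # form 3: explicit target name
    nrow => :n)               # form 5: number of rows per group

# form 4: plain renaming of a column
renamed = select(df, :x => :x_renamed)

# form 6: a vector of pairs, built here by broadcasting =>
sums = combine(gd, [:x, :y] .=> sum)
```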

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function.
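A short sketch of the AsTable source wrapper, assuming DataFrames.jl 1.x; the column names and the target name total are illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [10, 20])

# The function receives a single NamedTuple of column vectors,
# here (a = [1, 2], b = [10, 20]), instead of two positional arguments.
res = select(df, AsTable([:a, :b]) => (nt -> nt.a .+ nt.b) => :total)
```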

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix then the rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a DataFrame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value will be repeated to match the length of columns specified by returned vectors.

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols, or a NamedTuple of them if the specified columns are wrapped in AsTable. If ByRow is used, cols is allowed to select an empty set of columns, in which case function is called for each row without any arguments, and an empty NamedTuple is passed if the empty set of columns is wrapped in AsTable.

If a collection of column names is passed then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
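The column-moving idiom and the uniqueness rule can be sketched like this, assuming DataFrames.jl 1.x. The data frame is made up, and the exact exception type raised for the duplicate name is not relied upon.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4, col = 5:6)

# :col is requested twice (explicitly and via :), only the first
# occurrence is used, so :col ends up in the first position.
select!(df, :col, :)

# Renaming :a to :a after already selecting :a produces a duplicate
# output name, which is rejected.
duplicate_rejected = try
    select(df, :a, :a => :a)
    false
catch
    true
end
```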

As a general rule if copycols=true columns are copied and when copycols=false columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.

Keyword arguments

  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.
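The effect of both keyword arguments can be sketched as follows, assuming DataFrames.jl 1.x. All data and column names are illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# Default: the generated name combines the column and function names.
with_suffix = select!(copy(df), :a => sum)

# renamecols=false: the generated name is just the source column name.
without_suffix = select!(copy(df), :a => sum; renamecols=false)

# ungroup=false: the (updated) GroupedDataFrame is returned
# instead of its parent data frame.
parent_df = DataFrame(g = [1, 1, 2], v = [1, 2, 3])
gd = groupby(parent_df, :g)
res = select!(gd, :v => sum => :v_sum; ungroup=false)
```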

See select for examples.

source
DataFrames.transformFunction
transform(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true)
transform(f::Callable, df::DataFrame; renamecols::Bool=true)
transform(gd::GroupedDataFrame, args...; copycols::Bool=true,
          keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
          keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)

Create a new data frame that contains columns from df or gd plus columns specified by args and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select(df, :, args...) or select(gd, :, args...).
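The stated equivalence with select can be checked directly. This is a minimal sketch assuming DataFrames.jl 1.x, with a hypothetical target column a_sq.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
square(x) = x .^ 2

t = transform(df, :a => square => :a_sq)
s = select(df, :, :a => square => :a_sq)

# Both calls keep :a and :b and append :a_sq; df itself is untouched.
println(t == s)
```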

The common rules for all transformation functions supported by DataFrames.jl are explained and compared in detail below.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform.

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
  4. a col => target_cols pair, which renames the column col to target_cols, which must be a single name (as a Symbol or a string).
  5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be a single name (as a Symbol or a string).
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix then the rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.
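The multi-column case in point 3 can be sketched with a function returning a NamedTuple, assuming DataFrames.jl 1.x. The helper name extremes and all column names are hypothetical.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3, 4])

# Hypothetical helper returning two named values for a column.
extremes(v) = (lo = minimum(v), hi = maximum(v))

# AsTable target: column names are taken from the NamedTuple keys.
kept_names = combine(df, :x => extremes => AsTable)

# Vector target: the names from the NamedTuple are replaced.
new_names = combine(df, :x => extremes => [:smallest, :largest])
```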

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
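The Ref rule is commonly used to store a whole vector in a single cell. A small sketch assuming DataFrames.jl 1.x, with made-up data and the hypothetical target name v_all:

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], v = [10, 20, 30])
gd = groupby(df, :g)

# Without Ref, returning the vector would produce one row per element.
# Wrapping it in Ref makes each group contribute exactly one row
# whose cell holds the whole vector.
collected = combine(gd, :v => Ref => :v_all)
```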

select/select! and transform/transform! always return a DataFrame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value will be repeated to match the length of columns specified by returned vectors.

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols, or a NamedTuple of them if the specified columns are wrapped in AsTable. If ByRow is used, cols is allowed to select an empty set of columns, in which case function is called for each row without any arguments, and an empty NamedTuple is passed if the empty set of columns is wrapped in AsTable.

If a collection of column names is passed then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

As a general rule if copycols=true columns are copied and when copycols=false columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).
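The basic copying rule for a plain column selection can be verified with object identity. This is a minimal sketch assuming DataFrames.jl 1.x; it does not cover the aliasing corner case mentioned above.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

copied = select(df, :a)                   # copycols=true is the default
shared = select(df, :a; copycols=false)   # reuses the column vector of df

# Writing through the shared column is visible in df,
# while the copied column is unaffected.
shared.a[1] = 100
```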

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.
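For a SubDataFrame the return type depends on copycols, as sketched below under the assumption of DataFrames.jl 1.x and a plain column selector:

```julia
using DataFrames

df = DataFrame(a = 1:4, b = 5:8)
sdf = view(df, 1:2, :)

materialized = select(sdf, :a)                  # a new DataFrame with copied data
still_a_view = select(sdf, :a; copycols=false)  # a SubDataFrame, nothing is copied
```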

Keyword arguments

  • copycols::Bool=true : whether columns of the source data frame should be copied if no transformation is applied to them.
  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • keepkeys::Bool=true : whether grouping columns of gd should be kept in the returned data frame.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.

Note that when the first argument is a GroupedDataFrame, keepkeys=false is needed to be able to return a different value for the grouping column:

julia> gdf = groupby(DataFrame(x=1:2), :x)
GroupedDataFrame with 2 groups based on key: x
First Group (1 row): x = 1
 Row │ x
     │ Int64
─────┼───────
   1 │     1
⋮
Last Group (1 row): x = 2
 Row │ x
     │ Int64
─────┼───────
   1 │     2

julia> transform(gdf, x -> (x=10,), keepkeys=false)
2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │    10
   2 │    10

julia> transform(gdf, x -> (x=10,), keepkeys=true)
ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x

See select for more examples.

source
DataFrames.transform!Function
transform!(df::DataFrame, args...; renamecols::Bool=true)
transform!(args::Callable, df::DataFrame; renamecols::Bool=true)
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)

Mutate df or gd in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...) or select!(gd, :, args...).
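A minimal sketch of both the data frame and the grouped form, assuming DataFrames.jl 1.x. The data and the names v_shifted and v_max are illustrative only.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], v = [1.0, 3.0, 10.0])

# Add a column to df itself; existing columns are kept.
transform!(df, :v => (x -> x .- minimum(x)) => :v_shifted)

# Grouped form: the parent data frame of gd is mutated, and the
# per-group maximum is repeated for each row of the group.
gd = groupby(df, :g)
transform!(gd, :v => maximum => :v_max)
```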

The common rules for all transformation functions supported by DataFrames.jl are explained and compared in detail below.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform.

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns.
  4. a col => target_cols pair, which renames the column col to target_cols, which must be a single name (as a Symbol or a string).
  5. a nrow or nrow => target_cols form which efficiently computes the number of rows in a group; without target_cols the new column is called :nrow, otherwise it must be a single name (as a Symbol or a string).
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix then the rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a DataFrame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value will be repeated to match the length of columns specified by returned vectors.
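The repetition of single values can be seen in a short sketch, assuming DataFrames.jl 1.x; the target names total and doubled are hypothetical.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# The first transformation returns a single value, the second a vector
# of length 3, so the single value is repeated three times.
res = combine(df, :x => sum => :total, :x => (v -> 2 .* v) => :doubled)
```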

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols, or a NamedTuple of them if the specified columns are wrapped in AsTable. If ByRow is used, cols is allowed to select an empty set of columns, in which case function is called for each row without any arguments, and an empty NamedTuple is passed if the empty set of columns is wrapped in AsTable.
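Both ByRow calling conventions can be sketched as follows, assuming DataFrames.jl 1.x. The target names product and total are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

transform!(df,
    # One positional argument per selected column, called once per row.
    [:a, :b] => ByRow((a, b) -> a * b) => :product,
    # With AsTable, each row is passed as a NamedTuple such as (a = 1, b = 4).
    AsTable([:a, :b]) => ByRow(row -> row.a + row.b) => :total)
```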

If a collection of column names is passed then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

As a general rule if copycols=true columns are copied and when copycols=false columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.

Keyword arguments

  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.

See select for examples.

source
Base.vcatFunction
vcat(dfs::AbstractDataFrame...;
     cols::Union{Symbol, AbstractVector{Symbol},
                 AbstractVector{<:AbstractString}}=:setequal)

Vertically concatenate AbstractDataFrames.

The cols keyword argument determines the columns of the returned data frame:

  • :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
  • :orderequal: require all data frames to have the same column names and in the same order.
  • :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
  • :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
  • A vector of Symbols or strings: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.
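Three of the cols options not covered by the example further down can be sketched like this, assuming DataFrames.jl 1.x. The small data frames are made up, and the exact exception type thrown by :orderequal is not relied upon.

```julia
using DataFrames

df1 = DataFrame(A = 1:3, B = 1:3)
df3 = DataFrame(A = 7:9, C = 7:9)

# Explicit list: only :A and :C are kept; :C is missing for rows from df1.
picked = vcat(df1, df3; cols = [:A, :C])

left = DataFrame(A = [1], B = [2])
right = DataFrame(B = [3], A = [4])

# Default :setequal: same names in a different order are accepted,
# and the column order of the first data frame is used.
reordered = vcat(left, right)

# :orderequal additionally requires the same column order.
order_rejected = try
    vcat(left, right; cols = :orderequal)
    false
catch
    true
end
```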

The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.

The element types of columns are determined using promote_type, as with vcat for AbstractVectors.
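Element type promotion can be illustrated with an integer and a floating-point column. A minimal sketch assuming DataFrames.jl 1.x:

```julia
using DataFrames

ints = DataFrame(x = [1, 2])
floats = DataFrame(x = [1.5])

# promote_type(Int, Float64) is Float64, so the result column is Float64.
combined = vcat(ints, floats)
println(eltype(combined.x))
```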

vcat ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.
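The loop pattern mentioned above can be sketched as follows, assuming DataFrames.jl 1.x. The function name collect_squares and its columns are hypothetical.

```julia
using DataFrames

# Hypothetical helper: build a table of squares row by row.
function collect_squares(n)
    acc = DataFrame()   # starts with no rows and no columns
    for i in 1:n
        # On the first iteration acc is empty and is simply ignored by vcat.
        acc = vcat(acc, DataFrame(i = i, sq = i^2))
    end
    return acc
end

squares = collect_squares(3)
```

Each iteration allocates a new data frame, so this pattern is convenient rather than fast for large loops.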

Example

julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3

julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     5      5
   3 │     6      6

julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
 Row │ A      C
     │ Int64  Int64
─────┼──────────────
   1 │     7      7
   2 │     8      8
   3 │     9      9

julia> d4 = DataFrame()
0×0 DataFrame

julia> vcat(df1, df2)
6×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     5      5
   6 │     6      6

julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
 Row │ A      B        C
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1  missing
   2 │     2        2  missing
   3 │     3        3  missing
   4 │     7  missing        7
   5 │     8  missing        8
   6 │     9  missing        9

julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
 Row │ A
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     7
   5 │     8
   6 │     9

julia> vcat(d4, df1)
3×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
source

Reshaping data frames between tall and wide formats

DataFrames.stackFunction
stack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
      variable_name=:variable, value_name=:value,
      view::Bool=false, variable_eltype::Type=String)

Stack a data frame df, i.e. convert it from wide to long format.

Return the long-format DataFrame with: columns for each of the id_vars, column value_name (:value by default) holding the values of the stacked columns (measure_vars), and column variable_name (:variable by default) holding the name of the corresponding measure_vars variable.

If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.
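A stacked view holds the same values as the eagerly stacked data frame, as this minimal sketch shows. It assumes DataFrames.jl 1.x, and the data is made up.

```julia
using DataFrames

df = DataFrame(id = [1, 2], c = [0.5, 1.5], d = [2.5, 3.5])

eager = stack(df, [:c, :d])             # freshly allocated columns
lazy = stack(df, [:c, :d]; view=true)   # columns are lazy wrappers over df

println(size(lazy))
```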

Arguments

  • df : the AbstractDataFrame to be stacked
  • measure_vars : the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars is given, measure_vars defaults to all floating point columns.
  • id_vars : the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars.
  • variable_name : the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
  • value_name : the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
  • view : whether the stacked data frame should be a view rather than contain freshly allocated vectors.
  • variable_eltype : determines the element type of column variable_name. By default a PooledArray{String} is created. If variable_eltype=Symbol a PooledVector{Symbol} is created, and if variable_eltype=CategoricalValue{String} a CategoricalArray{String} is produced (call using CategoricalArrays first if needed). Passing any other type T will produce a PooledVector{T} column as long as it supports conversion from String. When view=true, a RepeatedVector{T} is produced.
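The naming and element type arguments can be sketched as follows, assuming DataFrames.jl 1.x. The data and the names measure and reading are hypothetical.

```julia
using DataFrames

df = DataFrame(id = [1, 2], height = [1.5, 1.6], weight = [60.0, 70.0])

long = stack(df, [:height, :weight], :id;
             variable_name = :measure,   # column holding the source column names
             value_name = :reading,      # column holding the stacked values
             variable_eltype = Symbol)   # store the names as Symbols

# With no measure_vars or id_vars, the floating point columns
# (:height and :weight) are stacked and :id is kept as identifier.
defaults = stack(df)
```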

Examples

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(a = repeat([1:3;], inner = [2]),
                      b = repeat([1:2;], inner = [3]),
                      c = randn(6),
                      d = randn(),
                      e = map(string, 'a':'f'))
6×5 DataFrame
 Row │ a      b      c          d         e
     │ Int64  Int64  Float64    Float64   String
─────┼───────────────────────────────────────────
   1 │     1      1   0.867347  0.532813  a
   2 │     1      1  -0.901744  0.532813  b
   3 │     2      1  -0.494479  0.532813  c
   4 │     2      2  -0.902914  0.532813  d
   5 │     3      2   0.864401  0.532813  e
   6 │     3      2   2.21188   0.532813  f

julia> stack(df, [:c, :d])
12×5 DataFrame
 Row │ a      b      e       variable  value
     │ Int64  Int64  String  String    Float64
─────┼───────────────────────────────────────────
   1 │     1      1  a       c          0.867347
   2 │     1      1  b       c         -0.901744
   3 │     2      1  c       c         -0.494479
   4 │     2      2  d       c         -0.902914
  ⋮  │   ⋮      ⋮      ⋮        ⋮          ⋮
  10 │     2      2  d       d          0.532813
  11 │     3      2  e       d          0.532813
  12 │     3      2  f       d          0.532813
                                   5 rows omitted

julia> stack(df, [:c, :d], [:a])
12×3 DataFrame
 Row │ a      variable  value
     │ Int64  String    Float64
─────┼────────────────────────────
   1 │     1  c          0.867347
   2 │     1  c         -0.901744
   3 │     2  c         -0.494479
   4 │     2  c         -0.902914
  ⋮  │   ⋮       ⋮          ⋮
  10 │     2  d          0.532813
  11 │     3  d          0.532813
  12 │     3  d          0.532813
                    5 rows omitted

julia> stack(df, Not([:a, :b, :e]))
12×5 DataFrame
 Row │ a      b      e       variable  value
     │ Int64  Int64  String  String    Float64
─────┼───────────────────────────────────────────
   1 │     1      1  a       c          0.867347
   2 │     1      1  b       c         -0.901744
   3 │     2      1  c       c         -0.494479
   4 │     2      2  d       c         -0.902914
  ⋮  │   ⋮      ⋮      ⋮        ⋮          ⋮
  10 │     2      2  d       d          0.532813
  11 │     3      2  e       d          0.532813
  12 │     3      2  f       d          0.532813
                                   5 rows omitted

julia> stack(df, Not([:a, :b, :e]), variable_name=:somemeasure)
12×5 DataFrame
 Row │ a      b      e       somemeasure  value
     │ Int64  Int64  String  String       Float64
─────┼──────────────────────────────────────────────
   1 │     1      1  a       c             0.867347
   2 │     1      1  b       c            -0.901744
   3 │     2      1  c       c            -0.494479
   4 │     2      2  d       c            -0.902914
  ⋮  │   ⋮      ⋮      ⋮          ⋮           ⋮
  10 │     2      2  d       d             0.532813
  11 │     3      2  e       d             0.532813
  12 │     3      2  f       d             0.532813
                                      5 rows omitted
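
The view and variable_eltype keyword arguments described in the argument list are not exercised by the examples above. The following is a minimal sketch of both, assuming a small hypothetical data frame (the column names and values are illustrative only, not taken from the examples above):

```julia
using DataFrames

# Hypothetical data: one identifier column and two measurement columns.
df = DataFrame(id = 1:3, c = [0.1, 0.2, 0.3], d = [1.0, 2.0, 3.0])

# Store the variable names as Symbols instead of the default Strings.
long_sym = stack(df, [:c, :d], variable_eltype = Symbol)
eltype(long_sym.variable)   # Symbol

# Request a stacked view instead of freshly allocated vectors.
long_view = stack(df, [:c, :d], view = true)
size(long_view)             # (6, 3)
```

A view avoids copying the stacked columns, which may matter for large data frames, but the result then depends on df staying unchanged.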
source
DataFrames.unstackFunction
unstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity,
        allowmissing::Bool=false, allowduplicates::Bool=false)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity,
        allowmissing::Bool=false, allowduplicates::Bool=false)
unstack(df::AbstractDataFrame; renamecols::Function=identity,
        allowmissing::Bool=false, allowduplicates::Bool=false)

Unstack data frame df, i.e. convert it from long to wide format.

Row and column keys will be ordered in the order of their first appearance.

Positional arguments

  • df : the AbstractDataFrame to be unstacked
  • rowkeys : the columns with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value. Can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • colkey : the column (Symbol, string or integer) holding the column names in wide format, defaults to :variable
  • value : the value column (Symbol, string or integer), defaults to :value

Keyword arguments

  • renamecols: a function called on each unique value in colkey; it must return the name of the column to be created (typically as a string or a Symbol). Duplicates in resulting names when converted to Symbol are not allowed. By default no transformation is performed.
  • allowmissing: if false (the default) then an error will be thrown if colkey contains missing values; if true then a column referring to missing value will be created.
  • allowduplicates: if false (the default) then an error will be thrown if the combination of rowkeys and colkey contains duplicate entries; if true then the last encountered value will be retained.

Examples

julia> using Random

julia> Random.seed!(1234);

julia> wide = DataFrame(id = 1:6,
                        a  = repeat([1:3;], inner = [2]),
                        b  = repeat([1:2;], inner = [3]),
                        c  = randn(6),
                        d  = randn(6))
6×5 DataFrame
 Row │ id     a      b      c          d
     │ Int64  Int64  Int64  Float64    Float64
─────┼────────────────────────────────────────────
   1 │     1      1      1   0.867347   0.532813
   2 │     2      1      1  -0.901744  -0.271735
   3 │     3      2      1  -0.494479   0.502334
   4 │     4      2      2  -0.902914  -0.516984
   5 │     5      3      2   0.864401  -0.560501
   6 │     6      3      2   2.21188   -0.0192918

julia> long = stack(wide)
12×5 DataFrame
 Row │ id     a      b      variable  value
     │ Int64  Int64  Int64  String    Float64
─────┼───────────────────────────────────────────
   1 │     1      1      1  c          0.867347
   2 │     2      1      1  c         -0.901744
   3 │     3      2      1  c         -0.494479
   4 │     4      2      2  c         -0.902914
  ⋮  │   ⋮      ⋮      ⋮       ⋮          ⋮
  10 │     4      2      2  d         -0.516984
  11 │     5      3      2  d         -0.560501
  12 │     6      3      2  d         -0.0192918
                                   5 rows omitted

julia> unstack(long)
6×5 DataFrame
 Row │ id     a      b      c          d
     │ Int64  Int64  Int64  Float64?   Float64?
─────┼────────────────────────────────────────────
   1 │     1      1      1   0.867347   0.532813
   2 │     2      1      1  -0.901744  -0.271735
   3 │     3      2      1  -0.494479   0.502334
   4 │     4      2      2  -0.902914  -0.516984
   5 │     5      3      2   0.864401  -0.560501
   6 │     6      3      2   2.21188   -0.0192918

julia> unstack(long, :variable, :value)
6×5 DataFrame
 Row │ id     a      b      c          d
     │ Int64  Int64  Int64  Float64?   Float64?
─────┼────────────────────────────────────────────
   1 │     1      1      1   0.867347   0.532813
   2 │     2      1      1  -0.901744  -0.271735
   3 │     3      2      1  -0.494479   0.502334
   4 │     4      2      2  -0.902914  -0.516984
   5 │     5      3      2   0.864401  -0.560501
   6 │     6      3      2   2.21188   -0.0192918

julia> unstack(long, :id, :variable, :value)
6×3 DataFrame
 Row │ id     c          d
     │ Int64  Float64?   Float64?
─────┼──────────────────────────────
   1 │     1   0.867347   0.532813
   2 │     2  -0.901744  -0.271735
   3 │     3  -0.494479   0.502334
   4 │     4  -0.902914  -0.516984
   5 │     5   0.864401  -0.560501
   6 │     6   2.21188   -0.0192918

julia> unstack(long, [:id, :a], :variable, :value)
6×4 DataFrame
 Row │ id     a      c          d
     │ Int64  Int64  Float64?   Float64?
─────┼─────────────────────────────────────
   1 │     1      1   0.867347   0.532813
   2 │     2      1  -0.901744  -0.271735
   3 │     3      2  -0.494479   0.502334
   4 │     4      2  -0.902914  -0.516984
   5 │     5      3   0.864401  -0.560501
   6 │     6      3   2.21188   -0.0192918

julia> unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
6×3 DataFrame
 Row │ id     _c         _d
     │ Int64  Float64?   Float64?
─────┼──────────────────────────────
   1 │     1   0.867347   0.532813
   2 │     2  -0.901744  -0.271735
   3 │     3  -0.494479   0.502334
   4 │     4  -0.902914  -0.516984
   5 │     5   0.864401  -0.560501
   6 │     6   2.21188   -0.0192918

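Note that the widened results above differ in which columns are kept as row keys, and that the unstacked value columns have element type Float64?, that is, they allow missing values.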
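
The allowmissing keyword argument is not covered by the examples above. A minimal sketch of its effect, assuming a small hypothetical long-format data frame in which one entry of the colkey column is missing:

```julia
using DataFrames

# Hypothetical long-format data; the second row has a missing column key.
long = DataFrame(id = [1, 1, 2],
                 variable = ["a", missing, "a"],
                 value = [10, 20, 30])

# With the default allowmissing=false this call is rejected.
rejected = try
    unstack(long, :id, :variable, :value)
    false
catch
    true
end

# With allowmissing=true a column referring to the missing key is created.
wide = unstack(long, :id, :variable, :value, allowmissing = true)
names(wide)
```

Row keys and new columns appear in the order of their first appearance, so id 1 precedes id 2 and column a precedes the column created for missing.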

source
Base.permutedimsFunction
permutedims(df::AbstractDataFrame, src_namescol::Union{Int, Symbol, AbstractString},
            [dest_namescol::Union{Symbol, AbstractString}];
            makeunique::Bool=false)

Turn df on its side such that rows become columns and values in the column indexed by src_namescol become the names of new columns. In the resulting DataFrame, column names of df will become the first column with name specified by dest_namescol.

Arguments

  • df : the AbstractDataFrame
  • src_namescol : the column that will become the new header. This column's element type must be AbstractString or Symbol.
  • dest_namescol : the name of the first column in the returned DataFrame. Defaults to the same name as src_namescol.
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

Note: The element types of columns in resulting DataFrame (other than the first column, which always has element type String) will depend on the element types of all input columns based on the result of promote_type. That is, if the source data frame contains Int and Float64 columns, resulting columns will have element type Float64. If the source has Int and String columns, resulting columns will have element type Any.

Examples

julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b        c      d
     │ String  Float64  Int64  Bool
─────┼───────────────────────────────
   1 │ x           1.0      3   true
   2 │ y           2.0      4  false

julia> permutedims(df1, 1) # note the column types
3×3 DataFrame
 Row │ a       x        y
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ b           1.0      2.0
   2 │ c           3.0      4.0
   3 │ d           1.0      0.0

julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b    c      d
     │ String  Any  Int64  Bool
─────┼───────────────────────────
   1 │ x       1        3   true
   2 │ y       two      4  false

julia> permutedims(df2, 1, "different_name")
3×3 DataFrame
 Row │ different_name  x     y
     │ String          Any   Any
─────┼─────────────────────────────
   1 │ b               1     two
   2 │ c               3     4
   3 │ d               true  false
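
The promotion rule from the note above can be checked directly with promote_type from Base. This is a small sketch on a hypothetical data frame, not part of the original examples:

```julia
using DataFrames

# Element types of the permuted columns follow promote_type.
promote_type(Float64, Int, Bool)   # Float64
promote_type(Int, String)          # Any

# Hypothetical input mixing Float64 and Int columns.
df = DataFrame(a = ["x", "y"], b = [1.0, 2.0], c = [3, 4])
wide = permutedims(df, 1)

eltype(wide.a)   # String, the former column names
eltype(wide.x)   # Float64, promoted from Float64 and Int
```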
source

Sorting

Base.issortedFunction
issorted(df::AbstractDataFrame, cols;
         lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)

Test whether data frame df is sorted by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.
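
A minimal sketch of these checks, assuming a small hypothetical data frame whose rows are already ordered by x and then by y:

```julia
using DataFrames

df = DataFrame(x = [1, 1, 2, 3], y = ["b", "c", "a", "b"])

issorted(df, :x)                           # true
issorted(df, [:x, :y])                     # true
issorted(df, :y)                           # false
issorted(df, :x, rev = true)               # false

# Reverse order for :y only: within x == 1, "c" would have to precede "b".
issorted(df, [:x, order(:y, rev = true)])  # false
```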

source
DataFrames.orderFunction
order(col::ColumnIndex; kwargs...)

Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.

See also: sort!, sort

Examples

julia> df = DataFrame(x = [-3, -1, 0, 2, 4], y = 1:5)
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │    -3      1
   2 │    -1      2
   3 │     0      3
   4 │     2      4
   5 │     4      5

julia> sort(df, order(:x, rev=true))
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     4      5
   2 │     2      4
   3 │     0      3
   4 │    -1      2
   5 │    -3      1

julia> sort(df, order(:x, by=abs))
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     0      3
   2 │    -1      2
   3 │     2      4
   4 │    -3      1
   5 │     4      5
source
Base.sortFunction
sort(df::AbstractDataFrame, cols;
     alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
     rev::Bool=false, order::Ordering=Forward, view::Bool=false)

Return a data frame containing the rows in df sorted by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below).

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.

See sort! for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     2  a
   4 │     1  b

julia> sort(df, :x)
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  c
   2 │     1  b
   3 │     2  a
   4 │     3  b

julia> sort(df, [:x, :y])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  b
   2 │     1  c
   3 │     2  a
   4 │     3  b

julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  a
   3 │     1  c
   4 │     1  b

julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  c
   2 │     1  b
   3 │     2  a
   4 │     3  b
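
The difference between view=false and view=true is not shown in the examples above. A minimal sketch, reusing the data frame from those examples:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

sorted_copy = sort(df, :x)               # freshly allocated DataFrame
sorted_view = sort(df, :x, view = true)  # SubDataFrame view into df

sorted_copy isa DataFrame      # true
sorted_view isa SubDataFrame   # true
parent(sorted_view) === df     # true, the view refers back to df
df.x                           # unchanged: [3, 1, 2, 1]
```

Because the view shares data with df, later changes to df are visible through sorted_view, whereas sorted_copy is independent.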
source
Base.sort!Function
sort!(df::AbstractDataFrame, cols;
      alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
      rev::Bool=false, order::Ordering=Forward)

Sort data frame df by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     2  a
   4 │     1  b

julia> sort!(df, :x)
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  c
   2 │     1  b
   3 │     2  a
   4 │     3  b

julia> sort!(df, [:x, :y])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  b
   2 │     1  c
   3 │     2  a
   4 │     3  b

julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  a
   3 │     1  c
   4 │     1  b

julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  c
   2 │     1  b
   3 │     2  a
   4 │     3  b
source
Base.sortpermFunction
sortperm(df::AbstractDataFrame, cols;
         alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
         rev::Bool=false, order::Ordering=Forward)

Return a permutation vector of row indices of data frame df that puts them in sorted order according to column(s) cols.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     2  a
   4 │     1  b

julia> sortperm(df, :x)
4-element Array{Int64,1}:
 2
 4
 3
 1

julia> sortperm(df, (:x, :y))
4-element Array{Int64,1}:
 4
 2
 3
 1

julia> sortperm(df, (:x, :y), rev=true)
4-element Array{Int64,1}:
 1
 3
 2
 4

julia> sortperm(df, (:x, order(:y, rev=true)))
4-element Array{Int64,1}:
 2
 4
 3
 1
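
Indexing the rows of df with the returned permutation should reproduce the result of sort with the same arguments. A short sketch of that relationship, using the data frame from the examples above and a vector of column names as the selector:

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

p = sortperm(df, [:x, :y])        # permutation of row indices
df[p, :] == sort(df, [:x, :y])    # true
```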
source

Joining

DataFrames.antijoinFunction
antijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)

Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> antijoin(name, job, on = :ID)
1×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     3  Joe Blogs

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     3  Joe Blogs

julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     3  Joe Blogs
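
An anti join is the complement of a semi join with respect to df1: every row of df1 ends up in exactly one of the two results. A minimal sketch of this relationship, reusing the name and job data frames from the examples above:

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

matched = semijoin(name, job, on = :ID)    # rows of name with a key in job
unmatched = antijoin(name, job, on = :ID)  # rows of name without such a key

nrow(matched) + nrow(unmatched) == nrow(name)   # true
```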
source
DataFrames.crossjoinFunction
crossjoin(df1, df2, dfs...; makeunique = false)

Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames, where the first passed data frame is assigned to the dimension that changes the slowest and the last data frame is assigned to the dimension that changes the fastest.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If more than two data frames are passed, the join is performed recursively with left associativity.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin.

Examples

julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
 Row │ X
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
 Row │ Y
     │ String
─────┼────────
   1 │ a
   2 │ b

julia> crossjoin(df1, df2)
6×2 DataFrame
 Row │ X      Y
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  b
   3 │     2  a
   4 │     2  b
   5 │     3  a
   6 │     3  b
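
With more than two data frames, the same rule applies: the first data frame varies slowest and the last one fastest. A minimal sketch with a third, hypothetical data frame df3 added to the ones from the example above:

```julia
using DataFrames

df1 = DataFrame(X = 1:3)
df2 = DataFrame(Y = ["a", "b"])
df3 = DataFrame(Z = [true, false])   # hypothetical third input

out = crossjoin(df1, df2, df3)

nrow(out)   # 12, the product of the row counts 3 * 2 * 2
out.X       # changes slowest: each value repeated 4 times in a row
out.Z       # changes fastest: alternates on every row
```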
source
DataFrames.innerjoinFunction
innerjoin(df1, df2; on, makeunique=false, validate=(false, false),
          renamecols=(identity => identity), matchmissing=:error)
innerjoin(df1, df2, dfs...; on, makeunique=false,
          validate=(false, false), matchmissing=:error)

Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2, dfs...: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate keyword argument is applied recursively with left associativity.

See also: leftjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String    String
─────┼─────────────────────────
   1 │     1  John Doe  Lawyer
   2 │     2  Jane Doe  Doctor

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> innerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
2×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String     String
─────┼─────────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor

julia> innerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
2×3 DataFrame
 Row │ ID     NAME      job
     │ Int64  String    String
─────┼─────────────────────────
   1 │     1  John Doe  Lawyer
   2 │     2  Jane Doe  Doctor
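
The validate keyword argument is not used in the examples above. The sketch below assumes a hypothetical variant of name in which the key 2 is duplicated; it relies on the uniqueness check failing with an exception, which is caught generically here:

```julia
using DataFrames

# Hypothetical data: the key 2 appears twice in the left data frame.
name = DataFrame(ID = [1, 2, 2], Name = ["John Doe", "Jane Doe", "Jane Roe"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

# Without validation the duplicated key yields one result row per match.
nrow(innerjoin(name, job, on = :ID))   # 3

# Checking only df2 passes, because the keys of job are unique.
nrow(innerjoin(name, job, on = :ID, validate = (false, true)))   # 3

# Checking df1 fails, because the keys of name are not unique.
rejected = try
    innerjoin(name, job, on = :ID, validate = (true, false))
    false
catch
    true
end
```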
source
DataFrames.leftjoinFunction
leftjoin(df1, df2; on, makeunique=false, indicator=nothing, validate=(false, false),
         renamecols=(identity => identity), matchmissing=:error)

Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol or string, adds categorical indicator column with the given name, for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

All columns of the returned data frame will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
 Row │ ID     Name       Job
     │ Int64  String     String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> leftjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String     String?
─────┼─────────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing

julia> leftjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
 Row │ ID     NAME       job
     │ Int64  String     String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing
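
The matchmissing keyword argument is not used in the examples above. A minimal sketch of both settings, assuming two small hypothetical data frames whose key columns contain missing:

```julia
using DataFrames

# Hypothetical inputs with a missing key on each side.
left = DataFrame(ID = [1, missing], V = ["a", "b"])
right = DataFrame(ID = [missing, 2], W = ["x", "y"])

# The default matchmissing=:error rejects missing values in on columns.
rejected = try
    leftjoin(left, right, on = :ID)
    false
catch
    true
end

# With matchmissing=:equal the two missing keys are matched to each other.
res = leftjoin(left, right, on = :ID, matchmissing = :equal)
nrow(res)   # 2, one row per row of left
```

Since the row order of the result is undefined, rows are best located by value (for example through the V column) rather than by position.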
source
DataFrames.outerjoinFunction
outerjoin(df1, df2; on, makeunique=false, indicator=nothing, validate=(false, false),
          renamecols=(identity => identity), matchmissing=:error)
outerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false), matchmissing=:error)

Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively it can be a function, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

All columns of the returned data frame that are not joined on will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in the real or imaginary part of a number. If you need to perform a join on such values, use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported and validate keyword argument is applied recursively with left associativity.

See also: innerjoin, leftjoin, rightjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
 Row │ ID     Name       Job
     │ Int64  String?    String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing
   4 │     4  missing    Farmer

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> outerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
4×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String?    String?
─────┼─────────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing
   4 │     4  missing    Farmer

julia> outerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
4×3 DataFrame
 Row │ ID     NAME       job
     │ Int64  String?    String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing
   4 │     4  missing    Farmer
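
The indicator keyword and the multi-frame form described above can be sketched as follows. This is a minimal illustration, not part of the original docstring: the city table and the column name :source are hypothetical, and the results are sorted explicitly because the row order of a join is undefined.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
# `city` is a hypothetical third table used only for illustration
city = DataFrame(ID = [2, 5], City = ["Oslo", "Lima"])

# indicator adds a column recording which input each row came from
flagged = outerjoin(name, job, on = :ID, indicator = :source)
sort!(flagged, :ID)  # join row order is undefined, so sort before inspecting

# With more than two data frames, `on` must be a column name (or a vector of names)
all3 = outerjoin(name, job, city, on = :ID)
sort!(all3, :ID)
```

After sorting, flagged.source should read "both", "both", "left_only", "right_only", and all3 should contain one row for each of the five distinct IDs.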
source
DataFrames.rightjoinFunction
rightjoin(df1, df2; on, makeunique=false, indicator = nothing,
          validate=(false, false), renamecols=(identity => identity),
          matchmissing=:error)

Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively it can be a function, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

Columns of the returned data frame that come from df1 and are not joined on will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in the real or imaginary part of a number. If you need to perform a join on such values, use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String?   String
─────┼─────────────────────────
   1 │     1  John Doe  Lawyer
   2 │     2  Jane Doe  Doctor
   3 │     4  missing   Farmer

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String?    String
─────┼─────────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     4  missing    Farmer

julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
 Row │ ID     NAME      job
     │ Int64  String?   String
─────┼─────────────────────────
   1 │     1  John Doe  Lawyer
   2 │     2  Jane Doe  Doctor
   3 │     4  missing   Farmer
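
The validate keyword described above can be illustrated with a small sketch. The people and jobs tables are hypothetical, and the error is caught generically because the exact exception type is not stated in this section.

```julia
using DataFrames

# Hypothetical data: ID 2 is duplicated in the left table
people = DataFrame(ID = [1, 2, 2], Name = ["Ann", "Bob", "Bo"])
jobs = DataFrame(ID = [1, 2], Job = ["Lawyer", "Doctor"])

# Without validation, the duplicated key yields two result rows for ID 2
unchecked = rightjoin(people, jobs, on = :ID)

# validate = (true, false) checks key uniqueness in df1 only; it is expected to throw here
validation_failed = try
    rightjoin(people, jobs, on = :ID, validate = (true, false))
    false
catch
    true
end
```

Turning validation on for the side that is supposed to hold unique keys is a cheap way to catch accidental row duplication early.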
source
DataFrames.semijoinFunction
semijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)

Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.

The order of rows in the result is undefined and may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched (isequal is used for comparisons of rows for equality)

It is not allowed to join on columns that contain NaN or -0.0 in the real or imaginary part of a number. If you need to perform a join on such values, use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> semijoin(name, job, on = :ID)
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe

julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
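
As a complementary sketch, semijoin and antijoin (listed under "See also") split the rows of df1 into those with and without a match in df2. The variable names matched and unmatched are illustrative only.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

matched = semijoin(name, job, on = :ID)    # rows of `name` whose ID occurs in `job`
unmatched = antijoin(name, job, on = :ID)  # rows of `name` whose ID does not occur in `job`

# Only columns of df1 are returned; no column from `job` is added
result_columns = names(matched)
```

Together the two results account for every row of name exactly once.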
source

Grouping

Base.getFunction
get(gd::GroupedDataFrame, key, default)

Get a group based on the values of the grouping columns.

key may be a GroupKey, NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby). It may also be an AbstractDict, in which case the order of the keys does not matter.

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
                      b = repeat([2, 1], outer=[3]),
                      c = 1:6);

julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ foo         2      1
   2 │ foo         1      4
⋮
Last Group (2 rows): a = :baz
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ baz         2      3
   2 │ baz         1      6

julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ bar         1      2
   2 │ bar         2      5

julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ baz         2      3
   2 │ baz         1      6

julia> get(gd, (:qux,), nothing)
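
The last call above returns nothing because no group has a = :qux. A minimal sketch of using that default, together with an AbstractDict key, could look like this (the variable names are illustrative):

```julia
using DataFrames

df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
               b = repeat([2, 1], outer=[3]),
               c = 1:6)
gd = groupby(df, :a)

# An AbstractDict keyed by grouping column name
bar = get(gd, Dict(:a => :bar), nothing)

# A missing group yields the default instead of throwing
qux = get(gd, (a = :qux,), nothing)
qux_total = qux === nothing ? 0 : sum(qux.c)
```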
source
DataFrames.groupbyFunction
groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)

Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.

Arguments

  • df : an AbstractDataFrame to split
  • cols : data frame columns to group by. Can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • sort : whether to sort groups according to the values of the grouping columns cols; if sort=false then the order of groups in the result is undefined and may change in future releases. In the current implementation groups are ordered following the order of appearance of values in the grouping columns, except when all grouping columns provide non-nothing DataAPI.refpool in which case the order of groups follows the order of values returned by DataAPI.refpool. As a particular application of this rule if all cols are CategoricalVectors then groups are always sorted irrespective of the value of sort.
  • skipmissing : whether to skip groups with missing values in one of the grouping columns cols

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.

cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.

A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).

GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices. Finally, an AbstractDict can be used to index into a grouped data frame where the keys are column names of the data frame. The order of the keys does not matter in this case.

See also

combine, select, select!, transform, transform!

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      2      5
⋮
Last Group (2 rows): a = 4
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     4      1      4
   2 │     4      1      8

julia> gd[1]
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      2      5

julia> last(gd)
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     4      1      4
   2 │     4      1      8

julia> gd[(a=3,)]
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      2      3
   2 │     3      2      7

julia> gd[Dict("a" => 3)]
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      2      3
   2 │     3      2      7

julia> gd[(3,)]
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      2      3
   2 │     3      2      7

julia> k = first(keys(gd))
GroupKey: (a = 1,)

julia> gd[k]
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      2      5

julia> for g in gd
           println(g)
       end
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     1      2      5
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      1      2
   2 │     2      1      6
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      2      3
   2 │     3      2      7
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     4      1      4
   2 │     4      1      8
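
The sort and skipmissing keywords are not exercised in the examples above. A small sketch, using a hypothetical data frame with a missing grouping value:

```julia
using DataFrames

# Hypothetical data with a missing value in the grouping column
df = DataFrame(k = ["b", "a", missing, "b"], v = 1:4)

# sort=true orders groups by key; skipmissing=true drops the group with a missing key
gd = groupby(df, :k, sort = true, skipmissing = true)
group_keys = [key.k for key in keys(gd)]

# combine applies a function per group and collects the results into a data frame
totals = combine(gd, :v => sum => :total)
```

Here group_keys is expected to be ["a", "b"], and totals holds one row per remaining group.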
source
DataFrames.groupcolsFunction
groupcols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) used for grouping.

source
DataFrames.groupindicesFunction
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of parent(gd).

Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
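
A minimal sketch of the missing entries described above, assuming a hypothetical data frame grouped with skipmissing=true (sort=true is used only to make the group numbering predictable):

```julia
using DataFrames

df = DataFrame(k = ["b", "a", missing, "b"], v = 1:4)
gd = groupby(df, :k, sort = true, skipmissing = true)

# One entry per row of parent(gd); the row with a missing key belongs to no group
idx = groupindices(gd)
```

With groups ordered as "a" then "b", idx should be [2, 1, missing, 2].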

source
Base.keysFunction
keys(gd::GroupedDataFrame)

Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple, NamedTuple, and AbstractDict, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
                      b = repeat([2, 1], outer=[6]),
                      c = 1:12);

julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ foo         2      1
   2 │ foo         2      7
⋮
Last Group (2 rows): a = :baz, b = 1
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ baz         1      6
   2 │ baz         1     12

julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (a = :foo, b = 2)
 GroupKey: (a = :bar, b = 1)
 GroupKey: (a = :baz, b = 2)
 GroupKey: (a = :foo, b = 1)
 GroupKey: (a = :bar, b = 2)
 GroupKey: (a = :baz, b = 1)

julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)

julia> keys(k)
2-element Array{Symbol,1}:
 :a
 :b

julia> values(k)  # Same as Tuple(k)
(:foo, 2)

julia> NamedTuple(k)
(a = :foo, b = 2)

julia> k.a
:foo

julia> k[:a]
:foo

julia> k[1]
:foo

Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:

julia> gd[k]
2×3 SubDataFrame
 Row │ a       b      c
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ foo         2      1
   2 │ foo         2      7

julia> gd[keys(gd)[1]] == gd[1]
true
source
keys(dfc::DataFrameColumns)

Get a vector of column names of dfc as Symbols.

source
Base.parentFunction
parent(gd::GroupedDataFrame)

Return the parent data frame of gd.

source
DataFrames.valuecolsFunction
valuecols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) not used for grouping.
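
The accessors groupcols, valuecols and parent can be sketched together on a small, hypothetical data frame:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [3, 4, 5], c = ["x", "y", "z"])
gd = groupby(df, :a)

gcols = groupcols(gd)   # columns used for grouping
vcols = valuecols(gd)   # all remaining columns
same_parent = parent(gd) === df  # the grouped data frame wraps the original object
```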

source

Filtering rows

Base.delete!Function
delete!(df::DataFrame, inds)

Delete rows specified by inds from a DataFrame df in place and return it.

Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> delete!(df, 2)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     3      6
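
The other accepted forms of inds listed above can be sketched as follows; the data frame is hypothetical, and each call mutates df in place.

```julia
using DataFrames

df = DataFrame(a = 1:5, b = 11:15)

delete!(df, [1, 3])     # sorted, unique integer indices
after_ints = copy(df.a)

delete!(df, df.a .> 4)  # Boolean vector with one entry per remaining row
after_mask = copy(df.a)

delete!(df, Not(1))     # Not(1) deletes every row except the first
after_not = copy(df.a)
```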
source
Base.emptyFunction
empty(df::AbstractDataFrame)

Create a new DataFrame with the same column names and column element types as df but with zero rows.

source
Base.empty!Function
empty!(df::DataFrame)

Remove all rows from df, making each of its columns empty.
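
The difference between empty and empty! is sketched below on a hypothetical data frame: the former returns a new object, the latter mutates its argument.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

e = empty(df)            # new data frame: same names and element types, zero rows
rows_before = nrow(df)   # df itself is unchanged at this point

empty!(df)               # removes all rows from df in place
rows_after = nrow(df)
```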

source
Base.filterFunction
filter(fun, df::AbstractDataFrame; view::Bool=false)
filter(cols => fun, df::AbstractDataFrame; view::Bool=false)

Return a data frame containing only rows from df for which fun returns true.

If cols is not specified then the predicate fun is passed DataFrameRows.

If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.

Passing cols leads to a more efficient execution of the operation for large data frames.

See also: filter!

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     2  a
   4 │     1  b

julia> filter(row -> row.x > 1, df)
2×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  a

julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  a

julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     1  b

julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     1  b
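
The view keyword is not shown in the examples above. A minimal sketch, assuming the same df: with view=true the result is a SubDataFrame, so writing to it changes the parent data frame.

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

v = filter(:x => x -> x > 1, df, view = true)
is_view = v isa SubDataFrame

# Row 1 of the view is row 1 of df, so this assignment is visible in df
v[1, :x] = 30
```

Use view=false (the default) when the result must be independent of df.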
source
filter(fun, gdf::GroupedDataFrame)
filter(cols => fun, gdf::GroupedDataFrame)

Return a new GroupedDataFrame containing only groups for which fun returns true.

If cols is not specified then the predicate fun is called with a SubDataFrame for each group.

If cols is specified then the predicate fun is called for each group with views of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Examples

julia> df = DataFrame(g=[1, 2], x=['a', 'b']);

julia> gd = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (1 row): g = 1
 Row │ g      x
     │ Int64  Char
─────┼─────────────
   1 │     1  a
⋮
Last Group (1 row): g = 2
 Row │ g      x
     │ Int64  Char
─────┼─────────────
   1 │     2  b

julia> filter(x -> x.x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
 Row │ g      x
     │ Int64  Char
─────┼─────────────
   1 │     1  a

julia> filter(:x => x -> x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
 Row │ g      x
     │ Int64  Char
─────┼─────────────
   1 │     1  a
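
Because the predicate receives a whole group (or whole column views), it can compute group-level summaries. A short sketch with a hypothetical data frame:

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 3, 3, 3], x = 1:6)
gd = groupby(df, :g)

# Predicate receives a SubDataFrame: keep groups with more than one row
multi_row = filter(sdf -> nrow(sdf) > 1, gd)

# Predicate receives a view of column x: keep groups whose x values sum to more than 3
large_sum = filter(:x => x -> sum(x) > 3, gd)
```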
source
Base.filter!Function
filter!(fun, df::AbstractDataFrame)
filter!(cols => fun, df::AbstractDataFrame)

Remove rows from data frame df for which fun returns false.

If cols is not specified then the predicate fun is passed DataFrameRows.

If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Passing cols leads to a more efficient execution of the operation for large data frames.

See also: filter

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     2  a
   4 │     1  b

julia> filter!(row -> row.x > 1, df)
2×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     2  a

julia> filter!(:x => x -> x == 3, df)
1×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     1  b

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     3  b
   2 │     1  c
   3 │     1  b
source
Base.firstFunction
first(df::AbstractDataFrame)

Get the first row of df as a DataFrameRow.

source
first(df::AbstractDataFrame, n::Integer)

Get a data frame with the n first rows of df.

source
Base.lastFunction
last(df::AbstractDataFrame)

Get the last row of df as a DataFrameRow.

source
last(df::AbstractDataFrame, n::Integer)

Get a data frame with the n last rows of df.

source
Base.Iterators.onlyFunction
only(df::AbstractDataFrame)

If df has a single row return it as a DataFrameRow; otherwise throw ArgumentError.
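
A brief sketch of first, last and only on a hypothetical data frame; the error for a multi-row argument follows the description above.

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

r = first(df)            # DataFrameRow for the first row
head2 = first(df, 2)     # data frame with the first two rows
tail2 = last(df, 2)      # data frame with the last two rows

# only succeeds when exactly one row is present
single = only(filter(:x => x -> x == 3, df))

# With four rows, only is documented to throw ArgumentError
threw_argument_error = try
    only(df)
    false
catch err
    err isa ArgumentError
end
```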

source
DataFrames.nonuniqueFunction
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)

Return a Vector{Bool} in which true entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal).

See also unique and unique!.

Arguments

  • df : AbstractDataFrame
  • cols : a selector specifying the column(s) to compare. Can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Examples

julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2

julia> df = vcat(df, df)
8×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2
   5 │     1      1
   6 │     2      2
   7 │     3      1
   8 │     4      2

julia> nonunique(df)
8-element Array{Bool,1}:
 0
 0
 0
 0
 1
 1
 1
 1

julia> nonunique(df, 2)
8-element Array{Bool,1}:
 0
 0
 1
 1
 1
 1
 1
 1
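
Since true marks rows that repeat an earlier row, negating the result selects the first occurrences. A minimal sketch with a hypothetical data frame:

```julia
using DataFrames

df = DataFrame(i = [1, 2, 1], x = [1, 2, 1])

dup = nonunique(df)   # the third row repeats the first
kept = df[.!dup, :]   # keep only first occurrences
```

The rows in kept are the same rows that unique(df) returns.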
source
Base.uniqueFunction
unique(df::AbstractDataFrame; view::Bool=false)
unique(df::AbstractDataFrame, cols; view::Bool=false)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Return a data frame containing only the first occurrence of unique rows in df. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[:, cols] is unique. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

For unique, if view=false a freshly allocated DataFrame is returned, and if view=true then a SubDataFrame view into df is returned.

unique! updates df in-place and does not support the view keyword argument.

See also nonunique.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.

Examples

julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2

julia> df = vcat(df, df)
8×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2
   5 │     1      1
   6 │     2      2
   7 │     3      1
   8 │     4      2

julia> unique(df)   # doesn't modify df
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2

julia> unique(df, 2)
2×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2

julia> unique!(df)  # modifies df
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2
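
The cols argument with a Symbol selector and the view keyword can be sketched as follows, on a hypothetical data frame:

```julia
using DataFrames

df = DataFrame(i = 1:4, x = [1, 2, 1, 2])

u = unique(df, :x)                 # first row for each distinct value of x
uv = unique(df, :x, view = true)   # same rows, as a view into df
is_view = uv isa SubDataFrame

unique!(df, :x)                    # in-place variant; it takes no view keyword
```

The view uv should not be used after unique! has shrunk df, since it refers to rows of the original data frame.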
source
Base.unique!Function
unique(df::AbstractDataFrame; view::Bool=false)
unique(df::AbstractDataFrame, cols; view::Bool=false)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Return a data frame containing only the first occurrence of unique rows in df. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[:, cols] is unique. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

For unique, if view=false a freshly allocated DataFrame is returned, and if view=true then a SubDataFrame view into df is returned.

unique! updates df in-place and does not support the view keyword argument.

See also nonunique.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.

Examples

julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2

julia> df = vcat(df, df)
8×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2
   5 │     1      1
   6 │     2      2
   7 │     3      1
   8 │     4      2

julia> unique(df)   # doesn't modify df
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2

julia> unique(df, 2)
2×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2

julia> unique!(df)  # modifies df
4×2 DataFrame
 Row │ i      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      1
   4 │     4      2
source

Working with missing values

Missings.allowmissingFunction
allowmissing(df::AbstractDataFrame, cols=:)

Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

Examples

julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> allowmissing(df)
2×1 DataFrame
 Row │ a
     │ Int64?
─────┼────────
   1 │      1
   2 │      2
source
DataFrames.allowmissing!Function
allowmissing!(df::DataFrame, cols=:)

Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.
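
Examples

A minimal sketch of the in-place conversion, assuming a small data frame built only for illustration:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Only column :a is converted; :b keeps its element type.
allowmissing!(df, :a)

# A missing value can now be stored in :a.
df[1, :a] = missing

println(eltype(df.a))
println(eltype(df.b))
```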

source
DataFrames.completecasesFunction
completecases(df::AbstractDataFrame, cols=:)

Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e

julia> completecases(df)
5-element BitArray{1}:
 false
 false
 false
  true
  true

julia> completecases(df, :x)
5-element BitArray{1}:
 false
  true
 false
  true
  true

julia> completecases(df, [:x, :y])
5-element BitArray{1}:
 false
 false
 false
  true
  true
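
The Boolean vector is typically used for indexing. The following sketch rebuilds the data frame from the example above and shows both the findall idiom mentioned earlier and row selection; the names idx and complete_x are illustrative.

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

# Indices of the rows that have no missing values in any column.
idx = findall(completecases(df))

# Rows that are complete with respect to column :x only.
complete_x = df[completecases(df, :x), :]

println(idx)
println(nrow(complete_x))
```

Unlike dropmissing, indexing with the mask does not change the element types of the columns.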
source
Missings.disallowmissingFunction
disallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)

Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

Examples

julia> df = DataFrame(a=Union{Int, Missing}[1, 2])
2×1 DataFrame
 Row │ a
     │ Int64?
─────┼────────
   1 │      1
   2 │      2

julia> disallowmissing(df)
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> df = DataFrame(a=[1, missing])
2×1 DataFrame
 Row │ a
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing

julia> disallowmissing(df, error=false)
2×1 DataFrame
 Row │ a
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing
source
DataFrames.disallowmissing!Function
disallowmissing!(df::DataFrame, cols=:; error::Bool=true)

Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.
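
Examples

A minimal sketch of the error=false behavior, assuming a data frame in which only one column actually contains a missing value:

```julia
using DataFrames

df = DataFrame(a = Union{Int, Missing}[1, 2], b = [1, missing])

# :a holds no missing values, so it is converted to Int.
# :b contains a missing value, so it is skipped instead of raising an error.
disallowmissing!(df, error=false)

println(eltype(df.a))
println(eltype(df.b))
```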

source
DataFrames.dropmissingFunction
dropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)

Return a data frame excluding rows with missing values in df.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned. In this case disallowmissing must be false.

If disallowmissing is true (the default when view is false) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.

See also: completecases and dropmissing!.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e

julia> dropmissing(df)
2×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     4      2  d
   2 │     5      1  e

julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
 Row │ i      x       y
     │ Int64  Int64?  String?
─────┼────────────────────────
   1 │     4       2  d
   2 │     5       1  e

julia> dropmissing(df, :x)
3×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String?
─────┼───────────────────────
   1 │     2      4  missing
   2 │     4      2  d
   3 │     5      1  e

julia> dropmissing(df, [:x, :y])
2×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     4      2  d
   2 │     5      1  e
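
The view=true case described above can be sketched as follows, reusing the data frame from the examples. With a view, disallowmissing defaults to false, so the column element types are unchanged.

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

# Select rows without a missing value in :x, sharing memory with df.
v = dropmissing(df, :x; view=true)

println(v isa SubDataFrame)
println(nrow(v))

# The element type of :x still allows missing values.
println(Missing <: eltype(v.x))
```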
source
DataFrames.dropmissing!Function
dropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)

Remove rows with missing values from data frame df and return it.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.

See also: dropmissing and completecases.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e

julia> dropmissing!(copy(df))
2×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     4      2  d
   2 │     5      1  e

julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
 Row │ i      x       y
     │ Int64  Int64?  String?
─────┼────────────────────────
   1 │     4       2  d
   2 │     5       1  e

julia> dropmissing!(copy(df), :x)
3×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String?
─────┼───────────────────────
   1 │     2      4  missing
   2 │     4      2  d
   3 │     5      1  e

julia> dropmissing!(df, [:x, :y])
2×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     4      2  d
   2 │     5      1  e
source

Iteration

Base.eachcolFunction
eachcol(df::AbstractDataFrame)

Return a DataFrameColumns object, a vector-like object that allows iterating an AbstractDataFrame column by column.

Indexing into DataFrameColumns objects using integer, Symbol or string returns the corresponding column (without copying). Indexing into DataFrameColumns objects using a multiple column selector returns a subsetted DataFrameColumns object with a new parent containing only the selected columns (without copying).

DataFrameColumns supports most of the AbstractVector API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

In particular findnext, findprev, findfirst, findlast, and findall functions are supported, and in findnext and findprev functions it is allowed to pass an integer, string, or Symbol as a reference index.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> eachcol(df)
4×2 DataFrameColumns
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
 [1, 2, 3, 4]
 [11, 12, 13, 14]

julia> map(eachcol(df)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
 3
 3

julia> sum.(eachcol(df))
2-element Array{Int64,1}:
 10
 50
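
The indexing and keys behavior described above can be sketched as follows; the variable name cols is illustrative.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)
cols = eachcol(df)

# keys returns column names as Symbols rather than integer positions.
println(keys(cols))

# Indexing by name or by position returns the column itself, not a copy.
println(cols[:x] === df.x)
println(cols[2] === df.y)
```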
source
Base.eachrowFunction
eachrow(df::AbstractDataFrame)

Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.

Because DataFrameRows have an eltype of Any, use copy(dfr::DataFrameRow) to obtain a named tuple, which supports iteration and property access like a DataFrameRow, but also passes information on the eltypes of the columns of df.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> eachrow(df)
4×2 DataFrameRows
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14

julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
 (x = 1, y = 11)
 (x = 2, y = 12)
 (x = 3, y = 13)
 (x = 4, y = 14)

julia> eachrow(view(df, [4, 3], [2, 1]))
2×2 DataFrameRows
 Row │ y      x
     │ Int64  Int64
─────┼──────────────
   1 │    14      4
   2 │    13      3
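
A short sketch of row-wise iteration with property access, followed by conversion of a single row to a NamedTuple; the names totals and nt are illustrative.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# Each row is a DataFrameRow, so columns can be read as properties.
totals = [row.x + row.y for row in eachrow(df)]

# copy turns a DataFrameRow into a NamedTuple that carries the column types.
nt = copy(first(eachrow(df)))

println(totals)
println(nt)
```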
source
Base.valuesFunction
values(dfc::DataFrameColumns)

Get a vector of columns from dfc.

source
Base.pairsFunction
pairs(dfc::DataFrameColumns)

Return an iterator of pairs associating the name of each column of dfc with the corresponding column vector, i.e. name => col where name is the column name of the column col.
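
Examples

A minimal sketch showing pairs and values on the same DataFrameColumns object; the names sums and vecs are illustrative.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# Iterate over name => column pairs and compute a per-column sum.
sums = [name => sum(col) for (name, col) in pairs(eachcol(df))]

# values returns the columns themselves as a vector of vectors.
vecs = values(eachcol(df))

println(sums)
println(length(vecs))
```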

source

Equality

Base.isapproxFunction
isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
         rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
         nans::Bool=false, norm::Function=norm)

Inexact equality comparison. df1 and df2 must have the same size and column names. Return true if isapprox with given keyword arguments applied to all pairs of columns stored in df1 and df2 returns true.
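
Examples

A minimal sketch of the comparison with default and explicit tolerances, assuming two small data frames built only for illustration:

```julia
using DataFrames

df1 = DataFrame(a = [1.0, 2.0])
df2 = DataFrame(a = [1.0 + 1e-10, 2.0])
df3 = DataFrame(a = [1.1, 2.0])

# With the default relative tolerance, tiny differences are accepted.
close_default = isapprox(df1, df2)

# With an explicit absolute tolerance, rtol defaults to 0.
close_tight = isapprox(df1, df3; atol=0.05)
close_loose = isapprox(df1, df3; atol=0.2)

println((close_default, close_tight, close_loose))
```

The tolerance is applied to each pair of columns as a whole, through the norm of their difference, rather than element by element.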

source