Functions
Multithreading support
By default, selected operations in DataFrames.jl automatically use multiple threads when available. It is task-based and implemented using the @spawn
macro from Julia Base. Functions that take user-defined functions and may run it in parallel accept a threads
keyword argument which allows disabling multithreading when the provided function requires serial execution or is not thread-safe.
This is a list of operations that currently make use of multi-threading:
DataFrame
constructor withcopycols=true
; also recursively all functions that call this constructor, e.g.copy
.getindex
when multiple columns are selected.groupby
(both when hashing is required and when fast path usingDataAPI.refpool
is used).*join
functions for composing output data frame (but currently not for finding matching rows in joined data frames).combine
,select[!]
, andtransform[!]
onGroupedDataFrame
when either of the conditions below is met:- multiple transformations are performed (each transformation is spawned in a separate task)
- a transformation produces one row per group and the passed transformation is a custom function (i.e. not for standard reductions, which use optimized single-threaded methods).
dropmissing
when the provided data frame has more than 1 column andview=false
(subsetting of individual columns is spawned in separate tasks).
In general at least Julia 1.4 is required to ensure that multi-threading is used and the Julia process must be started with more than one thread. Some operations turn on multi-threading only if enough rows in the processed data frame are present (the exact threshold when multi-threading is enabled is considered to be undefined and might change in the future).
Except for the list above, where multi-threading is used automatically, all functions provided by DataFrames.jl that update a data frame are not thread safe. This means that while they can be called from any thread, the caller is responsible for ensuring that a given DataFrame
object is never modified by one thread while others are using it (either for reading or writing). Using the same DataFrame
at the same time from different threads is safe as long as it is not modified.
Index
Base.Iterators.only
Base.Iterators.partition
Base.allunique
Base.append!
Base.copy
Base.deleteat!
Base.eachcol
Base.eachrow
Base.empty
Base.empty!
Base.filter
Base.filter!
Base.first
Base.get
Base.hcat
Base.insert!
Base.invpermute!
Base.isapprox
Base.isempty
Base.issorted
Base.keepat!
Base.keys
Base.last
Base.length
Base.names
Base.ndims
Base.pairs
Base.parent
Base.permute!
Base.permutedims
Base.pop!
Base.popat!
Base.popfirst!
Base.prepend!
Base.propertynames
Base.push!
Base.pushfirst!
Base.reduce
Base.repeat
Base.resize!
Base.reverse
Base.reverse!
Base.show
Base.similar
Base.size
Base.sort
Base.sort!
Base.sortperm
Base.stack
Base.unique
Base.unique!
Base.values
Base.vcat
DataAPI.allcombinations
DataAPI.antijoin
DataAPI.colmetadata
DataAPI.colmetadata!
DataAPI.colmetadatakeys
DataAPI.crossjoin
DataAPI.deletecolmetadata!
DataAPI.deletemetadata!
DataAPI.describe
DataAPI.emptycolmetadata!
DataAPI.emptymetadata!
DataAPI.innerjoin
DataAPI.leftjoin
DataAPI.metadata
DataAPI.metadata!
DataAPI.metadatakeys
DataAPI.ncol
DataAPI.nrow
DataAPI.outerjoin
DataAPI.rightjoin
DataAPI.rownumber
DataAPI.semijoin
DataFrames.allowmissing!
DataFrames.combine
DataFrames.completecases
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.fillcombinations
DataFrames.flatten
DataFrames.groupby
DataFrames.groupcols
DataFrames.groupindices
DataFrames.insertcols
DataFrames.insertcols!
DataFrames.leftjoin!
DataFrames.mapcols
DataFrames.mapcols!
DataFrames.nonunique
DataFrames.order
DataFrames.proprow
DataFrames.rename
DataFrames.rename!
DataFrames.repeat!
DataFrames.select
DataFrames.select!
DataFrames.subset
DataFrames.subset!
DataFrames.table_transformation
DataFrames.transform
DataFrames.transform!
DataFrames.unstack
DataFrames.valuecols
Missings.allowmissing
Missings.disallowmissing
Random.shuffle
Random.shuffle!
Constructing data frames
DataAPI.allcombinations
— Functionallcombinations(DataFrame, pairs::Pair...)
allcombinations(DataFrame; kwargs...)
Create a DataFrame
from all combinations of values in passed arguments. The first passed values vary fastest.
Arguments associating a column name with values to expand can be specified either as Pair
s passed as positional arguments, or as keyword arguments. Column names must be Symbol
s or strings and must be unique.
Column value can be a vector which is consumed as is or an object of any other type (except AbstractArray
). In the latter case the passed value is treated as having length one for expansion. As a particular rule values stored in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and treated as having length one.
See also: crossjoin
can be used to get the cartesian product of rows from passed data frames.
Examples
julia> allcombinations(DataFrame, a=1:2, b='a':'c')
6×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 a
3 │ 1 b
4 │ 2 b
5 │ 1 c
6 │ 2 c
julia> allcombinations(DataFrame, "a" => 1:2, "b" => 'a':'c', "c" => "const")
6×3 DataFrame
Row │ a b c
│ Int64 Char String
─────┼─────────────────────
1 │ 1 a const
2 │ 2 a const
3 │ 1 b const
4 │ 2 b const
5 │ 1 c const
6 │ 2 c const
Base.copy
— Functioncopy(df::DataFrame; copycols::Bool=true)
Copy data frame df
. If copycols=true
(the default), return a new DataFrame
holding copies of column vectors in df
. If copycols=false
, return a new DataFrame
sharing column vectors with df
.
Metadata: this function preserves all table-level and column-level metadata.
copy(dfr::DataFrameRow)
Construct a NamedTuple
with the same contents as the DataFrameRow
. This method returns a NamedTuple
so that the returned object is not affected by changes to the parent data frame of which dfr
is a view.
copy(key::GroupKey)
Construct a NamedTuple
with the same contents as the GroupKey
.
Base.similar
— Functionsimilar(df::AbstractDataFrame, rows::Integer=nrow(df))
Create a new DataFrame
with the same column names and column element types as df
. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Summary information
DataAPI.describe
— Functiondescribe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame
where each row represents a variable and each column a summary statistic.
Arguments
df
: theAbstractDataFrame
stats::Union{Symbol, Pair}...
: the summary statistics to report. Arguments can be:- A symbol from the list
:mean
,:std
,:min
,:q25
,:median
,:q75
,:max
,:sum
,:eltype
,:nunique
,:nuniqueall
,:first
,:last
,:nnonmissing
, and:nmissing
. The default statistics used are:mean
,:min
,:median
,:max
,:nmissing
, and:eltype
. :detailed
as the onlySymbol
argument to return all statistics except:first
,:last
,:sum
,:nuniqueall
, and:nnonmissing
.:all
as the onlySymbol
argument to return all statistics.- A
function => name
pair wherename
is aSymbol
or string. This will create a column of summary statistics with the provided name.
- A symbol from the list
cols
: a keyword argument allowing to select only a subset or transformation of columns fromdf
to describe. Can be any column selector or transformation accepted byselect
.
Details
For Real
columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real
, describe
will attempt to calculate all statistics, using nothing
as a fall-back in the case of an error.
When stats
contains :nunique
, describe
will report the number of unique values in a column. If a column's base type derives from Real
, :nunique
will return nothing
s. Use :nuniqueall
to report the number of unique values in all columns.
Missing values are filtered in the calculation of all statistics, however the column :nmissing
will report the number of missing values of that variable and :nnonmissing
the number of non-missing values.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing
: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Metadata: this function drops all metadata.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
julia> describe(df)
3×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ i 5.5 1 5.5 10 0 Int64
2 │ x 0.55 0.1 0.55 1.0 0 Float64
3 │ y a j 0 Char
julia> describe(df, :min, :max)
3×3 DataFrame
Row │ variable min max
│ Symbol Any Any
─────┼────────────────────
1 │ i 1 10
2 │ x 0.1 1.0
3 │ y a j
julia> describe(df, :min, sum => :sum)
3×3 DataFrame
Row │ variable min sum
│ Symbol Any Union…
─────┼───────────────────────
1 │ i 1 55
2 │ x 0.1 5.5
3 │ y a
julia> describe(df, :min, sum => :sum, cols=:x)
1×3 DataFrame
Row │ variable min sum
│ Symbol Float64 Float64
─────┼────────────────────────────
1 │ x 0.1 5.5
Base.isempty
— Functionisempty(df::AbstractDataFrame)
Return true
if data frame df
has zero rows, and false
otherwise.
Base.length
— Functionlength(dfr::DataFrameRow)
Return the number of elements of dfr
.
See also: size
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> length(dfr)
2
DataAPI.ncol
— Functionncol(df::AbstractDataFrame)
Return the number of columns in an AbstractDataFrame
df
.
Examples
julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
julia> ncol(df)
3
Base.ndims
— Functionndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})
Return the number of dimensions of a data frame, which is always 2
.
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})
Return the number of dimensions of a data frame row, which is always 1
.
DataAPI.nrow
— Functionnrow(df::AbstractDataFrame)
Return the number of rows in an AbstractDataFrame
df
.
Examples
julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
julia> nrow(df)
10
DataAPI.rownumber
— Functionrownumber(dfr::DataFrameRow)
Return a row number in the AbstractDataFrame
that dfr
was created from.
Note that this differs from the first element in the tuple returned by parentindices
. The latter gives the row number in the parent(dfr)
, which is the source DataFrame
where data that dfr
gives access to is stored.
Examples
julia> df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfr = df[2, :]
DataFrameRow
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
2 │ 2 5 8 11
julia> rownumber(dfr)
2
julia> parentindices(dfr)
(2, Base.OneTo(4))
julia> parent(dfr)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfv = @view df[2:3, 1:3]
2×3 SubDataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 5 8
2 │ 3 6 9
julia> dfrv = dfv[2, :]
DataFrameRow
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
3 │ 3 6 9
julia> rownumber(dfrv)
2
julia> parentindices(dfrv)
(3, 1:3)
julia> parent(dfrv)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
Base.show
— Functionshow([io::IO, ]df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true,
eltypes::Bool = true,
truncate::Int = 32,
kwargs...)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io
is omitted, the result is printed to stdout
, and allrows
, allcols
and allgroups
default to false
.
Arguments
io::IO
: The I/O stream to whichdf
will be printed.df::AbstractDataFrame
: The data frame to print.allrows::Bool
: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only ifio
does not have theIOContext
propertylimit
set.allcols::Bool
: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only ifio
does not have theIOContext
propertylimit
set.allgroups::Bool
: Whether to print all groups rather than the first and last, whendf
is aGroupedDataFrame
. By default this is the case only ifio
does not have theIOContext
propertylimit
set.rowlabel::Symbol = :Row
: The label to use for the column containing row numbers.summary::Bool = true
: Whether to print a brief string summary of the data frame.eltypes::Bool = true
: Whether to print the column types under column names.truncate::Int = 32
: the maximal display width the output can use before being truncated (in thetextwidth
sense, excluding…
). Iftruncate
is 0 or less, no truncation is applied.kwargs...
: Any keyword argument supported by the functionpretty_table
of PrettyTables.jl can be passed here to customize the output.
Examples
julia> using DataFrames
julia> df = DataFrame(A=1:3, B=["x", "y", "z"]);
julia> show(df, show_row_number=false)
3×2 DataFrame
A B
Int64 String
───────────────
1 x
2 y
3 z
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime
.
Arguments
io::IO
: The I/O stream to whichdf
will be printed.mime::MIME
: supported MIME types are:"text/plain"
,"text/html"
,"text/latex"
,"text/csv"
,"text/tab-separated-values"
(the last two MIME types do not support showing#undef
values)df::AbstractDataFrame
: The data frame to print.
Additionally selected MIME types support passing the following keyword arguments:
- MIME type
"text/plain"
accepts all listed keyword arguments and their behavior is identical as forshow(::IO, ::AbstractDataFrame)
- MIME type
"text/html"
accepts the following keyword arguments:eltypes::Bool = true
: Whether to print the column types under column names.summary::Bool = true
: Whether to print a brief string summary of the data frame.max_column_width::AbstractString = ""
: The maximum column width. It must be a string containing a valid CSS length. For example, passing "100px" will limit the width of all columns to 100 pixels. If empty, the columns will be rendered without limits.kwargs...
: Any keyword argument supported by the functionpretty_table
of PrettyTables.jl can be passed here to customize the output.
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A=1:3, B=["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A=1:3, B=["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
Base.size
— Functionsize(df::AbstractDataFrame[, dim])
Return a tuple containing the number of rows and columns of df
. Optionally a dimension dim
can be specified, where 1
corresponds to rows and 2
corresponds to columns.
Examples
julia> df = DataFrame(a=1:3, b='a':'c');
julia> size(df)
(3, 2)
julia> size(df, 1)
3
size(dfr::DataFrameRow[, dim])
Return a 1-tuple containing the number of elements of dfr
. If an optional dimension dim
is specified, it must be 1
, and the number of elements is returned directly as a number.
See also: length
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> size(dfr)
(2,)
julia> size(dfr, 1)
2
Working with column names
Base.names
— Functionnames(df::AbstractDataFrame, cols=:)
names(df::DataFrameRow, cols=:)
names(df::GroupedDataFrame, cols=:)
names(df::DataFrameRows, cols=:)
names(df::DataFrameColumns, cols=:)
names(df::GroupKey)
Return a freshly allocated Vector{String}
of names of columns contained in df
.
If cols
is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Cols
, Not
, and Between
). cols
can be:
- any column selector (
Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers); these column selectors are documented in the General rules section of the Indexing part of the DataFrames.jl manual - a
Type
, in which case names of columns whoseeltype
is a subtype ofT
are returned - a
Function
predicate taking the column name as a string and returningtrue
for columns that should be kept
See also propertynames
which returns a Vector{Symbol}
.
Examples
julia> df = DataFrame(x1=[1, missing, missing], x2=[3, 2, 4], x3=[3, missing, 2], x4=Union{Int, Missing}[2, 4, 4])
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64? Int64 Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 3 3 2
2 │ missing 2 missing 4
3 │ missing 4 2 4
julia> names(df)
4-element Vector{String}:
"x1"
"x2"
"x3"
"x4"
julia> names(df, Int) # pick columns whose element type is Int
1-element Vector{String}:
"x2"
julia> names(df, x -> x[end] == '2') # pick columns for which last character in their name is '2'
1-element Vector{String}:
"x2"
julia> fun(col) = sum(skipmissing(col)) >= 10
fun (generic function with 1 method)
julia> names(df, fun.(eachcol(df))) # pick columns for which sum of their elements is at least 10
1-element Vector{String}:
"x4"
julia> names(df, eltype.(eachcol(df)) .>: Missing) # pick columns that allow missing values
3-element Vector{String}:
"x1"
"x3"
"x4"
julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
2-element Vector{String}:
"x1"
"x3"
Base.propertynames
— Functionpropertynames(df::AbstractDataFrame)
Return a freshly allocated Vector{Symbol}
of names of columns contained in df
.
DataFrames.rename
— Functionrename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df
with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
df
: theAbstractDataFrame
; if it is aSubDataFrame
then renaming is only allowed if it was created using:
as a column selector.d
: anAbstractDict
or anAbstractVector
ofPair
s that maps the original names or column numbers to new namesf
: a function which for each column takes the old name as aString
and returns the new name that gets converted to aSymbol
vals
: new column names as a vector ofSymbol
s orAbstractString
s of the same length as the number of columns indf
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
If pairs are passed to rename
(as positional arguments or in a dictionary or a vector) then:
from
value can be aSymbol
, anAbstractString
or anInteger
;to
value can be aSymbol
or anAbstractString
.
Mixing symbols and strings in to
and from
is not allowed.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Column-level :note
-style metadata is considered to be attached to column number: when a column is renamed, its :note
-style metadata becomes associated to its new name.
See also: rename!
Examples
julia> df = DataFrame(i=1, x=2, y=3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, [:a, :b, :c])
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :i => "A", :x => "X")
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
Row │ i y x
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(uppercase, df)
1×3 DataFrame
Row │ I X Y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
DataFrames.rename!
— Functionrename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df
in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
df
: theAbstractDataFrame
d
: anAbstractDict
or anAbstractVector
ofPair
s that maps the original names or column numbers to new namesf
: a function which for each column takes the old name as aString
and returns the new name that gets converted to aSymbol
vals
: new column names as a vector ofSymbol
s orAbstractString
s of the same length as the number of columns indf
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
If pairs are passed to rename!
(as positional arguments or in a dictionary or a vector) then:
from
value can be aSymbol
, anAbstractString
or anInteger
;to
value can be aSymbol
or anAbstractString
.
Mixing symbols and strings in to
and from
is not allowed.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
). Column-level :note
-style metadata is considered to be attached to column number: when a column is renamed, its :note
-style metadata becomes associated to its new name.
See also: rename
Examples
julia> df = DataFrame(i=1, x=2, y=3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
Row │ a b a_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(uppercase, df)
1×3 DataFrame
Row │ A B A_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
Mutating and transforming data frames and grouped data frames
Base.append!
— Functionappend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2
to the end of df
. If the second argument table
is not an AbstractDataFrame
then it is converted using DataFrame(table, copycols=false)
before being appended.
The exact behavior of append!
depends on the cols
argument:
- If
cols == :setequal
(this is the default) thendf2
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thendf2
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifdf2
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thendf2
may contain more columns thandf
, but all column names that are present indf
must be present indf2
and only these are used. - If
cols == :subset
thenappend!
behaves like for:intersect
but if some column is missing indf2
then amissing
value is pushed todf
. - If
cols == :union
thenappend!
adds columns missing indf
that are present indf2
, for columns present indf
but missing indf2
amissing
value is pushed.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
The above rule has the following exceptions:
- If
df
has no columns then copies of columns fromdf2
are added to it. - If
df2
has no columns then callingappend!
leavesdf
unchanged.
Please note that append!
must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for columns present in df
are preserved. If new columns are added their :note
-style metadata is copied from the appended table. Other metadata is dropped.
See also: use push!
to add individual rows to a data frame, prepend!
to add a table at the beginning, and vcat
to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
Row │ A B
│ Float64 Int64
─────┼────────────────
1 │ 4.0 4
2 │ 5.0 5
3 │ 6.0 6
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
DataFrames.combine
— Functioncombine(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
combine(f::Callable, df::AbstractDataFrame;
renamecols::Bool=true, threads::Bool=true)
combine(gd::GroupedDataFrame, args...;
keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
combine(f::Base.Callable, gd::GroupedDataFrame;
keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df
or gd
specified by args
and return it. The result can have any number of rows that is determined by the values returned by passed transformations.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame
that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns, which must be a single name (as aSymbol
or a string), a vector of names orAsTable
. Additionally it can be aFunction
which takes a string or a vector of strings as an argument containing names of columns selected bycols
, and returns the target columns names (all accepted types exceptAsTable
are allowed). - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string), a vector of names orAsTable
. - column-independent operations
function => target_cols
or justfunction
for specificfunction
s where the input columns are omitted; withouttarget_cols
the new column has the same name asfunction
, otherwise it must be single name (as aSymbol
or a string). Supportedfunction
s are:nrow
to efficiently compute the number of rows in each group.proprow
to efficiently compute the proportion of rows in each group.eachindex
to return a vector holding the number of each row within each group.groupindices
to return the group number.
- vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group if aGroupedDataFrame
is processed, or with the data frame itself if anAbstractDataFrame
is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
Note! If the expression of the form x => y
is passed then except for the special convenience form nrow => target_cols
it is always interpreted as cols => function
. In particular the following expression function => target_cols
is not a valid transformation specification.
Note! If cols
or target_cols
are one of All
, Cols
, Between
, or Not
, broadcasting using .=>
is supported and is equivalent to broadcasting the result of names(df, cols)
or names(df, target_cols)
. This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
. The documentation of DataFrames.table_transformation
provides more information about this functionality, in particular covering performance considerations.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,Tables.AbstractRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1
, [:x1, :x2]
, :x1 => :x2
, :x1 => identity => :x2
, or :x1 => (x -> @view x[inds])
(note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols
keyword argument:
- if
copycols=true
then results of such transformations always perform a copy of the source column or its view; - if
copycols=false
then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===
, which excludes views of source columns) performs a copy;
Note that performing transform!
or select!
assumes that copycols=false
.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame
is passed and threads=true
(the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false
must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation
for details.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.threads::Bool=true
: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
if some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note
-style metadata. Column-level :note
-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity
or copy
to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame
if the output has the same name as a grouping column and keepkeys=true
, metadata is taken from original grouping column.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> combine(df, :a => sum, nrow, renamecols=false)
1×2 DataFrame
Row │ a nrow
│ Int64 Int64
─────┼──────────────
1 │ 6 3
julia> combine(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> combine(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> combine(df, All() .=> [minimum maximum])
1×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
julia> using Statistics
julia> combine(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> combine(df, AsTable(:) => ByRow(mean) => x -> join(x, "_"))
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> combine(first, df)
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> combine(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
2 │ 2 8 2
3 │ 3 10 2
4 │ 4 12 2
julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
⋮
Last Group (1 row): a = 4
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 12 2
julia> combine(gd) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
Row │ a x1
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 2 8
3 │ 3 10
4 │ 4 12
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 1.60944
2 │ 2 2.48491
3 │ 3 3.04452
4 │ 4 3.46574
julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 6
2 │ 2 2 8
3 │ 3 4 10
4 │ 4 2 12
julia> combine(gd) do sdf # dropping group when DataFrame() is returned
sdf.c[1] != 1 ? sdf : DataFrame()
end
6×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
3 │ 3 2 3
4 │ 3 2 7
5 │ 4 1 4
6 │ 4 1 8
auto-splatting, renaming and keepkeys
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 2 5 7
3 │ 1 2 3
4 │ 1 6 7
5 │ 2 3 5
6 │ 2 7 9
7 │ 1 4 5
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 2 2 5
3 │ 2 1 1 2
4 │ 2 1 1 6
5 │ 3 2 2 3
6 │ 3 2 2 7
7 │ 4 1 1 4
8 │ 4 1 1 8
julia> combine(gd, [:b, :c] .=> Ref) # preventing vector from being spread across multiple rows
4×3 DataFrame
Row │ a b_Ref c_Ref
│ Int64 SubArray… SubArray…
─────┼─────────────────────────────
1 │ 1 [2, 2] [1, 5]
2 │ 2 [1, 1] [2, 6]
3 │ 3 [2, 2] [3, 7]
4 │ 4 [1, 1] [4, 8]
julia> combine(gd, AsTable(Not(:a)) => Ref) # protecting result
4×2 DataFrame
Row │ a b_c_Ref
│ Int64 NamedTup…
─────┼─────────────────────────────────
1 │ 1 (b = [2, 2], c = [1, 5])
2 │ 2 (b = [1, 1], c = [2, 6])
3 │ 3 (b = [2, 2], c = [3, 7])
4 │ 4 (b = [1, 1], c = [4, 8])
julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 2 5 7
3 │ 2 1 2 3
4 │ 2 1 6 7
5 │ 3 2 3 5
6 │ 3 2 7 9
7 │ 4 1 4 5
8 │ 4 1 8 9
DataFrames.fillcombinations
— Functionfillcombinations(df::AbstractDataFrame, indexcols;
allowduplicates::Bool=false,
fill=missing)
Generate all combinations of levels of column(s) indexcols
in data frame df
. Levels and their order are determined by the levels
function (i.e. unique values sorted lexicographically by default, or a custom set of levels for e.g. CategoricalArray
columns), in addition to missing
if present.
For combinations of indexcols
not present in df
these columns are filled with the fill
value (missing
by default).
If allowduplicates=false
(the default) indexcols
may only contain unique combinations of indexcols
values. If allowduplicates=true
duplicates are allowed.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(x=1:2, y='a':'b', z=["x", "y"])
2×3 DataFrame
Row │ x y z
│ Int64 Char String
─────┼─────────────────────
1 │ 1 a x
2 │ 2 b y
julia> fillcombinations(df, [:x, :y])
4×3 DataFrame
Row │ x y z
│ Int64 Char String?
─────┼──────────────────────
1 │ 1 a x
2 │ 2 a missing
3 │ 1 b missing
4 │ 2 b y
julia> fillcombinations(df, [:y, :z], fill=0)
4×3 DataFrame
Row │ x y z
│ Int64? Char String
─────┼──────────────────────
1 │ 1 a x
2 │ 0 b x
3 │ 0 a y
4 │ 2 b y
DataFrames.flatten
— Functionflatten(df::AbstractDataFrame, cols; scalar::Type=Union{})
When columns cols
of data frame df
have iterable elements that define length
(for example a Vector
of Vector
s), return a DataFrame
where each element of each col
in cols
is flattened, meaning the column corresponding to col
becomes a longer vector where the original entries are concatenated. Elements of row i
of df
in columns other than cols
will be repeated according to the length of df[i, col]
. These lengths must therefore be the same for each col
in cols
, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame
will affect df
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If scalar
is passed then values that have this type in flattened columns are treated as scalars and broadcasted as many times as is needed to match lengths of values stored in other columns. If all values in a row are scalars, a single row is produced.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df1 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7, 8]
julia> flatten(df1, :b)
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Array…
─────┼──────────────────────
1 │ 1 1 [5, 6]
2 │ 1 2 [5, 6]
3 │ 2 3 [7, 8]
4 │ 2 4 [7, 8]
julia> flatten(df1, [:b, :c])
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 5
2 │ 1 2 6
3 │ 2 3 7
4 │ 2 4 8
julia> df2 = DataFrame(a=[1, 2], b=[("p", "q"), ("r", "s")])
2×2 DataFrame
Row │ a b
│ Int64 Tuple…
─────┼───────────────────
1 │ 1 ("p", "q")
2 │ 2 ("r", "s")
julia> flatten(df2, :b)
4×2 DataFrame
Row │ a b
│ Int64 String
─────┼───────────────
1 │ 1 p
2 │ 1 q
3 │ 2 r
4 │ 2 s
julia> df3 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7]
julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c are not the same in row 2
julia> df4 = DataFrame(a=[1, 2, 3],
b=[[1, 2], missing, missing],
c=[[5, 6], missing, [7, 8]])
3×3 DataFrame
Row │ a b c
│ Int64 Array…? Array…?
─────┼─────────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 missing missing
3 │ 3 missing [7, 8]
julia> flatten(df4, [:b, :c], scalar=Missing)
5×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 5
2 │ 1 2 6
3 │ 2 missing missing
4 │ 3 missing 7
5 │ 3 missing 8
Base.hcat
— Functionhcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate data frames.
If makeunique=false
(the default) column names of passed objects must be unique. If makeunique=true
then duplicate column names will be suffixed with _i
(i
starting at 1 for the first duplicate).
If copycols=true
(the default) then the DataFrame
returned by hcat
will contain copied columns from the source data frames. If copycols=false
then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame
might lead to the corruption of the other object.
Metadata: hcat
propagates table-level :note
-style metadata for keys that are present in all passed data frames and have the same value; it propagates column-level :note
-style metadata.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
Row │ A B A_1 B_1
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 1 4 4
2 │ 2 2 5 5
3 │ 3 3 6 6
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
Base.insert!
— Functioninsert!(df::DataFrame, index::Integer, row::Union{Tuple, AbstractArray}; promote::Bool=false)
insert!(df::DataFrame, index::Integer, row::Union{DataFrameRow, NamedTuple,
AbstractDict, Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row to df
at position index
in-place, taking the values from row
. index
must be a integer between 1
and nrow(df)+1
.
Column types of df
are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row
is neither a DataFrameRow
, NamedTuple
nor AbstractDict
then it must be a Tuple
or an AbstractArray
and columns are matched by order of appearance. In this case row
must contain the same number of elements as the number of columns in df
.
If row
is a DataFrameRow
, NamedTuple
, AbstractDict
, or Tables.AbstractRow
then values in row
are matched to columns in df
based on names. The exact behavior depends on the cols
argument value in the following way:
- If
cols == :setequal
(this is the default) thenrow
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thenrow
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifrow
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thenrow
may contain more columns thandf
, but all column names that are present indf
must be present inrow
and only they are used to populate a new row indf
. - If
cols == :subset
then the behavior is like for:intersect
but if some column is missing inrow
then amissing
value is pushed todf
. - If
cols == :union
then columns missing indf
that are present inrow
are added todf
(usingmissing
for existing rows) and amissing
value is pushed to columns missing inrow
that are present indf
.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
As a special case, if df
has no columns and row
is a NamedTuple
, DataFrameRow
, or Tables.AbstractRow
, columns are created for all values in row
, using their names and order.
Please note that this function must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: push!
, pushfirst!
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insert!(df, 2, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ true 0
3 │ b 2
4 │ c 3
julia> insert!(df, 5, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ true 0
3 │ b 2
4 │ c 3
5 │ a 1
julia> insert!(df, 1, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ 11 12
2 │ a 1
3 │ true 0
4 │ b 2
5 │ c 3
6 │ a 1
julia> insert!(df, 7, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ 11 12 missing
2 │ a 1 missing
3 │ true 0 missing
4 │ b 2 missing
5 │ c 3 missing
6 │ a 1 missing
7 │ 1.0 missing 1.0
julia> insert!(df, 3, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ 11 12 missing
2 │ a 1 missing
3 │ missing missing missing
4 │ true 0 missing
5 │ b 2 missing
6 │ c 3 missing
7 │ a 1 missing
8 │ 1.0 missing 1.0
DataFrames.insertcols
— Functioninsertcols(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
Insert a column into a copy of df
data frame using the insertcols!
function and return the newly created data frame.
If col
is omitted it is set to ncol(df)+1
(the column is inserted as the last column).
Arguments
df
: the data frame to which we want to add columnscol
: a position at which we want to insert a column, passed as an integer or a column name (a string or aSymbol
); the column selected withcol
and columns following it are shifted to the right indf
after the operationname
: the name of the new columnval
: anAbstractVector
giving the contents of the new column or a value of any type other thanAbstractArray
which will be repeated to fill a new vector; As a particular rule a values stored in aRef
or a0
-dimensionalAbstractArray
are unwrapped and treated in the same wayafter
: iftrue
columns are inserted aftercol
makeunique
: defines what to do ifname
already exists indf
; if it isfalse
an error will be thrown; if it istrue
a new unique name will be generated by adding a suffixcopycols
: whether vectors passed as columns should be copied
If val
is an AbstractRange
then the result of collect(val)
is inserted.
If df
is a SubDataFrame
then it must have been created with :
as column selector (otherwise an error is thrown). In this case the copycols
keyword argument is ignored (i.e. the added column is always copied) and the parent data frame's column is filled with missing
in rows that are filtered out by df
.
If df
isa DataFrame
that has no columns and only values other than AbstractVector
are passed then it is used to create a one-element column. If df
isa DataFrame
that has no columns and at least one AbstractVector
is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df)
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also insertcols!
.
Examples
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> insertcols(df, 1, :b => 'a':'c')
3×2 DataFrame
Row │ b a
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insertcols(df, :c => 2:4, :c => 3:5, makeunique=true)
3×3 DataFrame
Row │ a c c_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
2 │ 2 3 4
3 │ 3 4 5
julia> insertcols(df, :a, :d => 7:9, after=true)
3×2 DataFrame
Row │ a d
│ Int64 Int64
─────┼──────────────
1 │ 1 7
2 │ 2 8
3 │ 3 9
DataFrames.insertcols!
— Functioninsertcols!(df::AbstractDataFrame[, col], (name=>val)::Pair...;
after::Bool=false, makeunique::Bool=false, copycols::Bool=true)
Insert a column into a data frame in place. Return the updated data frame.
If col
is omitted it is set to ncol(df)+1
(the column is inserted as the last column).
Arguments
df
: the data frame to which we want to add columnscol
: a position at which we want to insert a column, passed as an integer or a column name (a string or aSymbol
); the column selected withcol
and columns following it are shifted to the right indf
after the operationname
: the name of the new columnval
: anAbstractVector
giving the contents of the new column or a value of any type other thanAbstractArray
which will be repeated to fill a new vector; As a particular rule a values stored in aRef
or a0
-dimensionalAbstractArray
are unwrapped and treated in the same wayafter
: iftrue
columns are inserted aftercol
makeunique
: defines what to do ifname
already exists indf
; if it isfalse
an error will be thrown; if it istrue
a new unique name will be generated by adding a suffixcopycols
: whether vectors passed as columns should be copied
If val
is an AbstractRange
then the result of collect(val)
is inserted.
If df
is a SubDataFrame
then it must have been created with :
as column selector (otherwise an error is thrown). In this case the copycols
keyword argument is ignored (i.e. the added column is always copied) and the parent data frame's column is filled with missing
in rows that are filtered out by df
.
If df
isa DataFrame
that has no columns and only values other than AbstractVector
are passed then it is used to create a one-element column. If df
isa DataFrame
that has no columns and at least one AbstractVector
is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df)
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
See also insertcols
.
Examples
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> insertcols!(df, 1, :b => 'a':'c')
3×2 DataFrame
Row │ b a
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
Row │ b c c_1 a
│ Char Int64 Int64 Int64
─────┼───────────────────────────
1 │ a 2 3 1
2 │ b 3 4 2
3 │ c 4 5 3
julia> insertcols!(df, :b, :d => 7:9, after=true)
3×5 DataFrame
Row │ b d c c_1 a
│ Char Int64 Int64 Int64 Int64
─────┼──────────────────────────────────
1 │ a 7 2 3 1
2 │ b 8 3 4 2
3 │ c 9 4 5 3
Base.invpermute!
— Functioninvpermute!(df::AbstractDataFrame, p)
Like permute!
, but the inverse of the given permutation is applied.
invpermute!
will produce a correct result even if some columns of passed data frame or permutation p
are identical (checked with ===
). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then invpermute!
result might be incorrect.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> permute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 1 6 11
4 │ 2 7 12
5 │ 4 9 14
julia> invpermute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
DataFrames.mapcols
— Functionmapcols(f::Union{Function, Type}, df::AbstractDataFrame)
Return a DataFrame
where each column of df
is transformed using function f
. f
must return AbstractVector
objects all with the same length or scalars (all values other than AbstractVector
are considered to be a scalar).
Note that mapcols
guarantees not to reuse the columns from df
in the returned DataFrame
. If f
returns its argument then it gets copied before being stored.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
DataFrames.mapcols!
— Functionmapcols!(f::Union{Function, Type}, df::DataFrame)
Update a DataFrame
in-place where each column of df
is transformed using function f
. f
must return AbstractVector
objects all with the same length or scalars (all values other than AbstractVector
are considered to be a scalar).
Note that mapcols!
reuses the columns from df
if they are returned by f
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols!(x -> x.^2, df);
julia> df
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
Base.permute!
— Functionpermute!(df::AbstractDataFrame, p)
Permute data frame df
in-place, according to permutation p
. Throws ArgumentError
if p
is not a permutation.
To return a new data frame instead of permuting df
in-place, use df[p, :]
.
permute!
will produce a correct result even if some columns of passed data frame or permutation p
are identical (checked with ===
). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then permute!
result might be incorrect.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> permute!(df, [5, 3, 1, 2, 4])
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 1 6 11
4 │ 2 7 12
5 │ 4 9 14
Base.prepend!
— Functionprepend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
prepend!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2
to the beginning of df
. If the second argument table
is not an AbstractDataFrame
then it is converted using DataFrame(table, copycols=false)
before being prepended.
The exact behavior of prepend!
depends on the cols
argument:
- If
cols == :setequal
(this is the default) thendf2
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thendf2
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifdf2
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thendf2
may contain more columns thandf
, but all column names that are present indf
must be present indf2
and only these are used. - If
cols == :subset
thenappend!
behaves like for:intersect
but if some column is missing indf2
then amissing
value is pushed todf
. - If
cols == :union
thenappend!
adds columns missing indf
that are present indf2
, for columns present indf
but missing indf2
amissing
value is pushed.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
The above rule has the following exceptions:
- If
df
has no columns then copies of columns fromdf2
are added to it. - If
df2
has no columns then callingprepend!
leavesdf
unchanged.
Please note that prepend!
must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for columns present in df
are preserved. If new columns are added their :note
-style metadata is copied from the appended table. Other metadata is dropped.
See also: use pushfirst!
to add individual rows at the beginning of a data frame, append!
to add a table at the end, and vcat
to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
Row │ A B
│ Float64 Int64
─────┼────────────────
1 │ 4.0 4
2 │ 5.0 5
3 │ 6.0 6
julia> prepend!(df1, df2);
julia> df1
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
4 │ 1 1
5 │ 2 2
6 │ 3 3
Base.push!
— Functionpush!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row at the end of df
in-place, taking the values from row
.
Column types of df
are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row
is neither a DataFrameRow
, NamedTuple
nor AbstractDict
then it must be a Tuple
or an AbstractArray
and columns are matched by order of appearance. In this case row
must contain the same number of elements as the number of columns in df
.
If row
is a DataFrameRow
, NamedTuple
, AbstractDict
, or Tables.AbstractRow
then values in row
are matched to columns in df
based on names. The exact behavior depends on the cols
argument value in the following way:
- If
cols == :setequal
(this is the default) thenrow
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thenrow
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifrow
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thenrow
may contain more columns thandf
, but all column names that are present indf
must be present inrow
and only they are used to populate a new row indf
. - If
cols == :subset
then the behavior is like for:intersect
but if some column is missing inrow
then amissing
value is pushed todf
. - If
cols == :union
then columns missing indf
that are present inrow
are added todf
(usingmissing
for existing rows) and amissing
value is pushed to columns missing inrow
that are present indf
.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
As a special case, if df
has no columns and row
is a NamedTuple
, DataFrameRow
, or Tables.AbstractRow
, columns are created for all values in row
, using their names and order.
Please note that this function must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: pushfirst!
, insert!
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> push!(df, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
julia> push!(df, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
5 │ a 1
julia> push!(df, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ true 0
5 │ a 1
6 │ 11 12
julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ a 1 missing
2 │ b 2 missing
3 │ c 3 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ 11 12 missing
7 │ 1.0 missing 1.0
julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ a 1 missing
2 │ b 2 missing
3 │ c 3 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ 11 12 missing
7 │ 1.0 missing 1.0
8 │ missing missing missing
Base.pushfirst!
— Functionpushfirst!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
pushfirst!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
Tables.AbstractRow};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add one row at the beginning of df
in-place, taking the values from row
.
Column types of df
are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row
is neither a DataFrameRow
, NamedTuple
nor AbstractDict
then it must be a Tuple
or an AbstractArray
and columns are matched by order of appearance. In this case row
must contain the same number of elements as the number of columns in df
.
If row
is a DataFrameRow
, NamedTuple
, AbstractDict
, or Tables.AbstractRow
then values in row
are matched to columns in df
based on names. The exact behavior depends on the cols
argument value in the following way:
- If
cols == :setequal
(this is the default) thenrow
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thenrow
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifrow
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thenrow
may contain more columns thandf
, but all column names that are present indf
must be present inrow
and only they are used to populate a new row indf
. - If
cols == :subset
then the behavior is like for:intersect
but if some column is missing inrow
then amissing
value is pushed todf
. - If
cols == :union
then columns missing indf
that are present inrow
are added todf
(usingmissing
for existing rows) and amissing
value is pushed to columns missing inrow
that are present indf
.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
As a special case, if df
has no columns and row
is a NamedTuple
, DataFrameRow
, or Tables.AbstractRow
, columns are created for all values in row
, using their names and order.
Please note that this function must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(A='a':'c', B=1:3)
3×2 DataFrame
Row │ A B
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> pushfirst!(df, (true, false), promote=true)
4×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ true 0
2 │ a 1
3 │ b 2
4 │ c 3
julia> pushfirst!(df, df[1, :])
5×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ true 0
2 │ true 0
3 │ a 1
4 │ b 2
5 │ c 3
julia> pushfirst!(df, (C="something", A=11, B=12), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Any Int64
─────┼─────────────
1 │ 11 12
2 │ true 0
3 │ true 0
4 │ a 1
5 │ b 2
6 │ c 3
julia> pushfirst!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼──────────────────────────
1 │ 1.0 missing 1.0
2 │ 11 12 missing
3 │ true 0 missing
4 │ true 0 missing
5 │ a 1 missing
6 │ b 2 missing
7 │ c 3 missing
julia> pushfirst!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Any Int64? Float64?
─────┼─────────────────────────────
1 │ missing missing missing
2 │ 1.0 missing 1.0
3 │ 11 12 missing
4 │ true 0 missing
5 │ true 0 missing
6 │ a 1 missing
7 │ b 2 missing
8 │ c 3 missing
Base.reduce
— Functionreduce(::typeof(vcat),
dfs::Union{AbstractVector{<:AbstractDataFrame},
Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing,
init::AbstractDataFrame=DataFrame())
Efficiently reduce the given vector or tuple of AbstractDataFrame
s with vcat
.
See the vcat
docstring for a description of keyword arguments cols
and source
.
The keyword argument init
is the initial value to use in the reductions. It must be a data frame that has zero rows. It is not taken into account when computing the value of the source
column nor when determining metadata of the produced data frame.
The column order, names, and types of the resulting DataFrame
, and the behavior of cols
and source
keyword arguments follow the rules specified for vcat
of AbstractDataFrame
s.
Metadata: vcat
propagates table-level :note
-style metadata for keys that are present in all passed data frames and have the same value. vcat
propagates column-level :note
-style metadata for keys that are present in all passed data frames that contain this column and have the same value.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> reduce(vcat, (df1, df2))
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> reduce(vcat, [df1, df2, df3], cols=:union, source=:source)
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
Base.repeat
— Functionrepeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df
. inner
specifies how many times each row is repeated, and outer
specifies how many times the full set of rows is repeated.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, inner=2, outer=3)
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating each row in df
the number of times specified by count
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
DataFrames.repeat!
— Functionrepeat!(df::DataFrame; inner::Integer=1, outer::Integer=1)
Update a data frame df
in-place by repeating its rows. inner
specifies how many times each row is repeated, and outer
specifies how many times the full set of rows is repeated. Columns of df
are freshly allocated.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat!(df, inner=2, outer=3);
julia> df
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat!(df::DataFrame, count::Integer)
Update a data frame df
in-place by repeating its rows the number of times specified by count
. Columns of df
are freshly allocated.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Example
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
Base.reverse
— Functionreverse(df::AbstractDataFrame, start=1, stop=nrow(df))
Return a data frame containing the rows in df
in reversed order. If start
and stop
are provided, only rows in the start:stop
range are affected.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> reverse(df)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 4 9 14
3 │ 3 8 13
4 │ 2 7 12
5 │ 1 6 11
julia> reverse(df, 2, 3)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 3 8 13
3 │ 2 7 12
4 │ 4 9 14
5 │ 5 10 15
Base.reverse!
— Functionreverse!(df::AbstractDataFrame, start=1, stop=nrow(df))
Mutate data frame in-place to reverse its row order. If start
and stop
are provided, only rows in the start:stop
range are affected.
reverse!
will produce a correct result even if some columns of passed data frame are identical (checked with ===
). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then reverse!
result might be incorrect.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
Examples
julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
julia> reverse!(df)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 4 9 14
3 │ 3 8 13
4 │ 2 7 12
5 │ 1 6 11
julia> reverse!(df, 2, 3)
5×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 5 10 15
2 │ 3 8 13
3 │ 4 9 14
4 │ 2 7 12
5 │ 1 6 11
DataFrames.select
— Functionselect(df::AbstractDataFrame, args...;
copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
select(args::Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
select(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
select(f::Base.Callable, gd::GroupedDataFrame;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df
or gd
specified by args
and return it. The result is guaranteed to have the same number of rows as df
, except when no columns are selected (in which case the result has zero rows).
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame
that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns, which must be a single name (as aSymbol
or a string), a vector of names orAsTable
. Additionally it can be aFunction
which takes a string or a vector of strings as an argument containing names of columns selected bycols
, and returns the target columns names (all accepted types exceptAsTable
are allowed). - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string), a vector of names orAsTable
. - column-independent operations
function => target_cols
or justfunction
for specificfunction
s where the input columns are omitted; withouttarget_cols
the new column has the same name asfunction
, otherwise it must be single name (as aSymbol
or a string). Supportedfunction
s are:nrow
to efficiently compute the number of rows in each group.proprow
to efficiently compute the proportion of rows in each group.eachindex
to return a vector holding the number of each row within each group.groupindices
to return the group number.
- vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group if aGroupedDataFrame
is processed, or with the data frame itself if anAbstractDataFrame
is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
Note! If the expression of the form x => y
is passed then except for the special convenience form nrow => target_cols
it is always interpreted as cols => function
. In particular the following expression function => target_cols
is not a valid transformation specification.
Note! If cols
or target_cols
are one of All
, Cols
, Between
, or Not
, broadcasting using .=>
is supported and is equivalent to broadcasting the result of names(df, cols)
or names(df, target_cols)
. This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
. The documentation of DataFrames.table_transformation
provides more information about this functionality, in particular covering performance considerations.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,Tables.AbstractRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1
, [:x1, :x2]
, :x1 => :x2
, :x1 => identity => :x2
, or :x1 => (x -> @view x[inds])
(note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols
keyword argument:
- if
copycols=true
then results of such transformations always perform a copy of the source column or its view; - if
copycols=false
then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===
, which excludes views of source columns) performs a copy;
Note that performing transform!
or select!
assumes that copycols=false
.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame
is passed and threads=true
(the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false
must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation
for details.
Keyword arguments
copycols::Bool=true
: whether columns of the source data frame should be copied if no transformation is applied to them.renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.threads::Bool=true
: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
if some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note
-style metadata. Column-level :note
-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity
or copy
to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame
if the output has the same name as a grouping column and keepkeys=true
, metadata is taken from original grouping column.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> select(df, 2)
3×1 DataFrame
Row │ b
│ Int64
─────┼───────
1 │ 4
2 │ 5
3 │ 6
julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> select(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> select(df, All() .=> [minimum maximum])
3×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
2 │ 1 4 3 6
3 │ 1 4 3 6
julia> using Statistics
julia> select(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> select(df, AsTable(:) => ByRow(mean) => x -> join(x, "_"))
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> select(first, df)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 1 4
3 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> select(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8)
8×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 2 1 4
5 │ 2 2 5
6 │ 1 1 6
7 │ 1 2 7
8 │ 2 1 8
julia> gd = groupby(df, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 1 1 6
5 │ 1 2 7
⋮
Last Group (3 rows): a = 2
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 4
2 │ 2 2 5
3 │ 2 1 8
specifying a name for target column
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
8×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 5.52943
2 │ 1 5.52943
3 │ 1 5.52943
4 │ 2 5.07517
5 │ 2 5.07517
6 │ 1 5.52943
7 │ 1 5.52943
8 │ 2 5.07517
julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 8 19
2 │ 1 8 19
3 │ 1 8 19
4 │ 2 4 17
5 │ 2 4 17
6 │ 1 8 19
7 │ 1 8 19
8 │ 2 4 17
multiple arguments, renaming and keepkeys
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 1 2 3
3 │ 2 3 5
4 │ 1 4 5
5 │ 2 5 7
6 │ 1 6 7
7 │ 2 7 9
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 1 1 2
3 │ 1 2 2 3
4 │ 2 1 1 4
5 │ 2 2 2 5
6 │ 1 1 1 6
7 │ 1 2 2 7
8 │ 2 1 1 8
julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 1 2 3
3 │ 1 2 3 5
4 │ 2 1 4 5
5 │ 2 2 5 7
6 │ 1 1 6 7
7 │ 1 2 7 9
8 │ 2 1 8 9
column-independent operations
julia> df = DataFrame(a=[1, 1, 1, 2, 2, 1, 1, 2],
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a);
julia> select(gd, nrow, proprow, groupindices, eachindex)
8×5 DataFrame
Row │ a nrow proprow groupindices eachindex
│ Int64 Int64 Float64 Int64 Int64
─────┼────────────────────────────────────────────────
1 │ 1 5 0.625 1 1
2 │ 1 5 0.625 1 2
3 │ 1 5 0.625 1 3
4 │ 2 3 0.375 2 1
5 │ 2 3 0.375 2 2
6 │ 1 5 0.625 1 4
7 │ 1 5 0.625 1 5
8 │ 2 3 0.375 2 3
DataFrames.select!
— Functionselect!(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
select!(args::Base.Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
select!(gd::GroupedDataFrame, args...; ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Mutate df
or gd
in place to retain only columns or transformations specified by args...
and return it. The result is guaranteed to have the same number of rows as df
or parent of gd
, except when no columns are selected (in which case the result has zero rows).
If a SubDataFrame
or GroupedDataFrame{SubDataFrame}
is passed, the parent data frame is updated using columns generated by args...
, following the same rules as indexing:
- for existing columns filtered-out rows are filled with values present in the old columns
- for new columns (which is only allowed if
SubDataFrame
was created with:
as column selector) filtered-out rows are filled withmissing
- dropped columns (which are only allowed if
SubDataFrame
was created with:
as column selector) are removed - if
SubDataFrame
was not created with:
as column selector thenselect!
is only allowed if the transformations keep exactly the same sequence of column names as is in the passeddf
If a GroupedDataFrame
is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame
objects constructed using the same parent data frame they might get corrupt.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame
that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns, which must be a single name (as aSymbol
or a string), a vector of names orAsTable
. Additionally it can be aFunction
which takes a string or a vector of strings as an argument containing names of columns selected bycols
, and returns the target columns names (all accepted types exceptAsTable
are allowed). - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string), a vector of names orAsTable
. - column-independent operations
function => target_cols
or justfunction
for specificfunction
s where the input columns are omitted; withouttarget_cols
the new column has the same name asfunction
, otherwise it must be single name (as aSymbol
or a string). Supportedfunction
s are:nrow
to efficiently compute the number of rows in each group.proprow
to efficiently compute the proportion of rows in each group.eachindex
to return a vector holding the number of each row within each group.groupindices
to return the group number.
- vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group if aGroupedDataFrame
is processed, or with the data frame itself if anAbstractDataFrame
is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
Note! If the expression of the form x => y
is passed then except for the special convenience form nrow => target_cols
it is always interpreted as cols => function
. In particular the following expression function => target_cols
is not a valid transformation specification.
Note! If cols
or target_cols
are one of All
, Cols
, Between
, or Not
, broadcasting using .=>
is supported and is equivalent to broadcasting the result of names(df, cols)
or names(df, target_cols)
. This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
. The documentation of DataFrames.table_transformation
provides more information about this functionality, in particular covering performance considerations.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,Tables.AbstractRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1
, [:x1, :x2]
, :x1 => :x2
, :x1 => identity => :x2
, or :x1 => (x -> @view x[inds])
(note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols
keyword argument:
- if
copycols=true
then results of such transformations always perform a copy of the source column or its view; - if
copycols=false
then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===
, which excludes views of source columns) performs a copy;
Note that performing transform!
or select!
assumes that copycols=false
.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame
is passed and threads=true
(the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false
must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation
for details.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.threads::Bool=true
: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
if some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note
-style metadata. Column-level :note
-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity
or copy
to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame
if the output has the same name as a grouping column and keepkeys=true
, metadata is taken from original grouping column.
See select
for examples.
Random.shuffle
— Functionshuffle([rng=GLOBAL_RNG,] df::AbstractDataFrame)
Return a copy of df
with randomly permuted rows. The optional rng
argument specifies a random number generator.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> using Random
julia> rng = MersenneTwister(1234);
julia> shuffle(rng, DataFrame(a=1:5, b=1:5))
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 1 1
3 │ 4 4
4 │ 3 3
5 │ 5 5
Random.shuffle!
— Functionshuffle!([rng=GLOBAL_RNG,] df::AbstractDataFrame)
Randomly permute rows of df
in-place. The optional rng
argument specifies a random number generator.
shuffle!
will produce a correct result even if some columns of passed data frame are identical (checked with ===
). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then shuffle!
result might be incorrect.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
Examples
julia> using Random
julia> rng = MersenneTwister(1234);
julia> shuffle!(rng, DataFrame(a=1:5, b=1:5))
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 1 1
3 │ 4 4
4 │ 3 3
5 │ 5 5
DataFrames.table_transformation
— Functiontable_transformation(df_sel::AbstractDataFrame, fun)
This is the function called when AsTable(...) => fun
is requested. The df_sel
argument is a data frame storing columns selected by the AsTable(...)
selector.
By default it calls default_table_transformation
. However, it is allowed to add special methods for specific types of fun
, as long as the result matches what would be produced by default_table_transformation
, except that it is allowed to perform eltype
conversion of the resulting vectors or value type promotions that are consistent with promote_type
.
It is guaranteed that df_sel
has at least one column.
The main use of special table_transformation
methods is to provide more efficient than the default implementations of requested fun
transformation.
This function might become a part of the public API of DataFrames.jl in the future, currently it should be considered experimental.
Fast paths are implemented within DataFrames.jl for the following functions fun
:
sum
,ByRow(sum)
,ByRow(sum∘skipmissing)
length
,ByRow(length)
,ByRow(length∘skipmissing)
mean
,ByRow(mean)
,ByRow(mean∘skipmissing)
ByRow(var)
,ByRow(var∘skipmissing)
ByRow(std)
,ByRow(std∘skipmissing)
ByRow(median)
,ByRow(median∘skipmissing)
minimum
,ByRow(minimum)
,ByRow(minimum∘skipmissing)
maximum
,ByRow(maximum)
,ByRow(maximum∘skipmissing)
fun∘collect
andByRow(fun∘collect)
wherefun
is any function
Note that in order to improve the performance ByRow(sum)
, ByRow(sum∘skipmissing)
, ByRow(mean)
, and ByRow(mean∘skipmissing)
perform all operations in the target element type. In some very rare cases (like mixing very large Int64
values and Float64
values) it can lead to a result different from the one that would be obtained by calling the function outside of DataFrames.jl. The way to avoid this precision loss is to use an anonymous function, e.g. instead of ByRow(sum)
use ByRow(x -> sum(x))
. However, in general for such scenarios even standard aggregation functions should not be considered to provide reliable output, and users are recommended to switch to higher precision calculations. An example of a case when standard sum
is affected by the situation discussed is:
julia> sum(Any[typemax(Int), typemax(Int), 1.0])
-1.0
julia> sum(Any[1.0, typemax(Int), typemax(Int)])
1.8446744073709552e19
DataFrames.transform
— Functiontransform(df::AbstractDataFrame, args...;
copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
transform(f::Callable, df::DataFrame;
renamecols::Bool=true, threads::Bool=true)
transform(gd::GroupedDataFrame, args...;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
transform(f::Base.Callable, gd::GroupedDataFrame;
copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true,
renamecols::Bool=true, threads::Bool=true)
Create a new data frame that contains columns from df
or gd
plus columns specified by args
and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select(df, :, args...)
or select(gd, :, args...)
.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame
that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns, which must be a single name (as aSymbol
or a string), a vector of names orAsTable
. Additionally it can be aFunction
which takes a string or a vector of strings as an argument containing names of columns selected bycols
, and returns the target columns names (all accepted types exceptAsTable
are allowed). - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string), a vector of names orAsTable
. - column-independent operations
function => target_cols
or justfunction
for specificfunction
s where the input columns are omitted; withouttarget_cols
the new column has the same name asfunction
, otherwise it must be single name (as aSymbol
or a string). Supportedfunction
s are:nrow
to efficiently compute the number of rows in each group.proprow
to efficiently compute the proportion of rows in each group.eachindex
to return a vector holding the number of each row within each group.groupindices
to return the group number.
- vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group if aGroupedDataFrame
is processed, or with the data frame itself if anAbstractDataFrame
is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
Note! If the expression of the form x => y
is passed then except for the special convenience form nrow => target_cols
it is always interpreted as cols => function
. In particular the following expression function => target_cols
is not a valid transformation specification.
Note! If cols
or target_cols
are one of All
, Cols
, Between
, or Not
, broadcasting using .=>
is supported and is equivalent to broadcasting the result of names(df, cols)
or names(df, target_cols)
. This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
. The documentation of DataFrames.table_transformation
provides more information about this functionality, in particular covering performance considerations.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,Tables.AbstractRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1
, [:x1, :x2]
, :x1 => :x2
, :x1 => identity => :x2
, or :x1 => (x -> @view x[inds])
(note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols
keyword argument:
- if
copycols=true
then results of such transformations always perform a copy of the source column or its view; - if
copycols=false
then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===
, which excludes views of source columns) performs a copy;
Note that performing transform!
or select!
assumes that copycols=false
.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame
is passed and threads=true
(the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false
must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation
for details.
Keyword arguments
copycols::Bool=true
: whether columns of the source data frame should be copied if no transformation is applied to them.renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.threads::Bool=true
: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
if some transformations require serial execution or are not thread-safe.
Note that when the first argument is a GroupedDataFrame
, keepkeys=false
is needed to be able to return a different value for the grouping column:
Metadata: this function propagates table-level :note
-style metadata. Column-level :note
-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity
or copy
to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame
if the output has the same name as a grouping column and keepkeys=true
, metadata is taken from original grouping column.
Examples
julia> gdf = groupby(DataFrame(x=1:2), :x)
GroupedDataFrame with 2 groups based on key: x
First Group (1 row): x = 1
Row │ x
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): x = 2
Row │ x
│ Int64
─────┼───────
1 │ 2
julia> transform(gdf, x -> (x=10,), keepkeys=false)
2×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 10
2 │ 10
julia> transform(gdf, x -> (x=10,), keepkeys=true)
ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x
See select
for more examples.
DataFrames.transform!
— Functiontransform!(df::AbstractDataFrame, args...;
renamecols::Bool=true, threads::Bool=true)
transform!(args::Callable, df::AbstractDataFrame;
renamecols::Bool=true, threads::Bool=true)
transform!(gd::GroupedDataFrame, args...;
ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)
transform!(f::Base.Callable, gd::GroupedDataFrame;
ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)
Mutate df
or gd
in place to add columns specified by args...
and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select!(df, :, args...)
or select!(gd, :, args...)
, except that column renaming performs a copy.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
; forGroupedDataFrame
if grouping columns are kept they are put as first columns in the result;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
; existing columns in the source data frame are put as first columns in the result;
As a special case, if a GroupedDataFrame
that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns, which must be a single name (as aSymbol
or a string), a vector of names orAsTable
. Additionally it can be aFunction
which takes a string or a vector of strings as an argument containing names of columns selected bycols
, and returns the target columns names (all accepted types exceptAsTable
are allowed). - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string), a vector of names orAsTable
. - column-independent operations
function => target_cols
or justfunction
for specificfunction
s where the input columns are omitted; withouttarget_cols
the new column has the same name asfunction
, otherwise it must be single name (as aSymbol
or a string). Supportedfunction
s are:nrow
to efficiently compute the number of rows in each group.proprow
to efficiently compute the proportion of rows in each group.eachindex
to return a vector holding the number of each row within each group.groupindices
to return the group number.
- vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group if aGroupedDataFrame
is processed, or with the data frame itself if anAbstractDataFrame
is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
Note! If the expression of the form x => y
is passed then except for the special convenience form nrow => target_cols
it is always interpreted as cols => function
. In particular the following expression function => target_cols
is not a valid transformation specification.
Note! If cols
or target_cols
are one of All
, Cols
, Between
, or Not
, broadcasting using .=>
is supported and is equivalent to broadcasting the result of names(df, cols)
or names(df, target_cols)
. This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
. The documentation of DataFrames.table_transformation
provides more information about this functionality, in particular covering performance considerations.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, aTables.AbstractRow
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,Tables.AbstractRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1
, [:x1, :x2]
, :x1 => :x2
, :x1 => identity => :x2
, or :x1 => (x -> @view x[inds])
(note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols
keyword argument:
- if
copycols=true
then results of such transformations always perform a copy of the source column or its view; - if
copycols=false
then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using===
, which excludes views of source columns) performs a copy;
Note that performing transform!
or select!
assumes that copycols=false
.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns and in this case transforming or renaming columns is not allowed.
If a GroupedDataFrame
is passed and threads=true
(the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false
must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
In order to improve the performance of the operations some transformations invoke optimized implementation, see DataFrames.table_transformation
for details.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.threads::Bool=true
: whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
if some transformations require serial execution or are not thread-safe.
Metadata: this function propagates table-level :note
-style metadata. Column-level :note
-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity
or copy
to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame
if the output has the same name as a grouping column and keepkeys=true
, metadata is taken from original grouping column.
See select
for examples.
Base.vcat
— Functionvcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)
Vertically concatenate AbstractDataFrame
s.
The cols
keyword argument determines the columns of the returned data frame:
:setequal
: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.:orderequal
: require all data frames to have the same column names and in the same order.:intersect
: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.:union
: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled withmissing
where necessary.- A vector of
Symbol
s or strings: only listed columns are kept. Columns not present in some data frames are filled withmissing
where necessary.
The source
keyword argument, if not nothing
(the default), specifies the additional column to be added in the last position in the resulting data frame that will identify the source data frame. It can be a Symbol
or an AbstractString
, in which case the identifier will be the number of the passed source data frame, or a Pair
consisting of a Symbol
or an AbstractString
and of a vector specifying the data frame identifiers (which do not have to be unique). The name of the source column is not allowed to be present in any source data frame.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type
, as with vcat
for AbstractVector
s.
vcat
ignores empty data frames when composing the result (except for metadata), making it possible to initialize an empty data frame at the beginning of a loop and vcat
onto it.
Metadata: vcat
propagates table-level :note
-style metadata for keys that are present in all passed data frames and have the same value. vcat
propagates column-level :note
-style metadata for keys that are present in all passed data frames that contain this column and have the same value.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> df4 = DataFrame()
0×0 DataFrame
julia> vcat(df1, df2)
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
Row │ A B C
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 missing
2 │ 2 2 missing
3 │ 3 3 missing
4 │ 7 missing 7
5 │ 8 missing 8
6 │ 9 missing 9
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 7
5 │ 8
6 │ 9
julia> vcat(df4, df1)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> vcat(df1, df2, df3, df4, cols=:union, source="source")
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
julia> vcat(df1, df2, df4, df3, cols=:union, source=:source => 'a':'d')
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Char
─────┼─────────────────────────────────
1 │ 1 1 missing a
2 │ 2 2 missing a
3 │ 3 3 missing a
4 │ 4 4 missing b
5 │ 5 5 missing b
6 │ 6 6 missing b
7 │ 7 missing 7 d
8 │ 8 missing 8 d
9 │ 9 missing 9 d
Reshaping data frames between tall and wide formats
Base.stack
— Functionstack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=String)
Stack a data frame df
, i.e. convert it from wide to long format.
Return the long-format DataFrame
with: columns for each of the id_vars
, column value_name
(:value
by default) holding the values of the stacked columns (measure_vars
), and column variable_name
(:variable
by default) a vector holding the name of the corresponding measure_vars
variable.
If view=true
then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors
that return views into the original data frame.
Arguments
df
: the AbstractDataFrame to be stackedmeasure_vars
: the columns to be stacked (the measurement variables), as a column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). If neithermeasure_vars
orid_vars
are given,measure_vars
defaults to all floating point columns.id_vars
: the identifier columns that are repeated during stacking, as a column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). Defaults to all variables that are notmeasure_vars
variable_name
: the name (Symbol
or string) of the new stacked column that shall hold the names of each ofmeasure_vars
value_name
: the name (Symbol
or string) of the new stacked column containing the values from each ofmeasure_vars
view
: whether the stacked data frame should be a view rather than contain freshly allocated vectors.variable_eltype
: determines the element type of columnvariable_name
. By default aPooledArray{String}
is created. Ifvariable_eltype=Symbol
aPooledVector{Symbol}
is created, and ifvariable_eltype=CategoricalValue{String}
aCategoricalArray{String}
is produced (callusing CategoricalArrays
first if needed) Passing any other typeT
will produce aPooledVector{T}
column as long as it supports conversion fromString
. Whenview=true
, aRepeatedVector{T}
is produced.
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for identifier columns are preserved.
Examples
julia> df = DataFrame(a=repeat(1:3, inner=2),
b=repeat(1:2, inner=3),
c=repeat(1:1, inner=6),
d=repeat(1:6, inner=1),
e=string.('a':'f'))
6×5 DataFrame
Row │ a b c d e
│ Int64 Int64 Int64 Int64 String
─────┼────────────────────────────────────
1 │ 1 1 1 1 a
2 │ 1 1 1 2 b
3 │ 2 1 1 3 c
4 │ 2 2 1 4 d
5 │ 3 2 1 5 e
6 │ 3 2 1 6 f
julia> stack(df, [:c, :d])
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, [:c, :d], [:a])
12×3 DataFrame
Row │ a variable value
│ Int64 String Int64
─────┼────────────────────────
1 │ 1 c 1
2 │ 1 c 1
3 │ 2 c 1
4 │ 2 c 1
5 │ 3 c 1
6 │ 3 c 1
7 │ 1 d 1
8 │ 1 d 2
9 │ 2 d 3
10 │ 2 d 4
11 │ 3 d 5
12 │ 3 d 6
julia> stack(df, Not([:a, :b, :e]))
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, Not([:a, :b, :e]), variable_name=:somemeasure)
12×5 DataFrame
Row │ a b e somemeasure value
│ Int64 Int64 String String Int64
─────┼──────────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
DataFrames.unstack
— Functionunstack(df::AbstractDataFrame, rowkeys, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
Unstack data frame df
, i.e. convert it from long to wide format.
Row and column keys are ordered in the order of their first appearance.
Positional arguments
df
: the AbstractDataFrame to be unstackedrowkeys
: the columns with a unique key for each row, if not given, find a key by grouping on anything not acolkey
orvalue
. Can be any column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). Ifrowkeys
contains no columns all rows are assumed to have the same key.colkey
: the column (Symbol
, string or integer) holding the column names in wide format, defaults to:variable
values
: the column storing values (Symbol
, string or integer), defaults to:value
Keyword arguments
renamecols
: a function called on each unique value incolkey
; it must return the name of the column to be created (typically as a string or aSymbol
). Duplicates in resulting names when converted toSymbol
are not allowed. By default no transformation is performed.allowmissing
: iffalse
(the default) then an error is thrown ifcolkey
containsmissing
values; iftrue
then a column referring tomissing
value is created.combine
: ifonly
(the default) then an error is thrown if combination ofrowkeys
andcolkey
contains duplicate entries. Otherwise the passed value must be a function that is called on a vector view containing all elements for each combination ofrowkeys
andcolkey
present in the data.fill
: missing row/column combinations are filled with this value. The default ismissing
. If thevalue
column is aCategoricalVector
andfill
is notmissing
then in order to keep unstacked value columns alsoCategoricalVector
thefill
must be passed asCategoricalValue
threads
: whethercombine
function may be run in separate tasks which can execute in parallel (possibly being applied to multiple groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set tofalse
ifcombine
requires serial execution or is not thread-safe.
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for row keys columns are preserved.
Deprecations
allowduplicates
keyword argument is deprecated; instead usecombine
keyword argument; an equivalent toallowduplicates=true
iscombine=last
and toallowduplicates=false
iscombine=only
(the default);
Examples
julia> wide = DataFrame(id=1:6,
a=repeat(1:3, inner=2),
b=repeat(1.0:2.0, inner=3),
c=repeat(1.0:1.0, inner=6),
d=repeat(1.0:3.0, inner=2))
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> long = stack(wide)
18×4 DataFrame
Row │ id a variable value
│ Int64 Int64 String Float64
─────┼─────────────────────────────────
1 │ 1 1 b 1.0
2 │ 2 1 b 1.0
3 │ 3 2 b 1.0
4 │ 4 2 b 2.0
5 │ 5 3 b 2.0
6 │ 6 3 b 2.0
7 │ 1 1 c 1.0
8 │ 2 1 c 1.0
⋮ │ ⋮ ⋮ ⋮ ⋮
12 │ 6 3 c 1.0
13 │ 1 1 d 1.0
14 │ 2 1 d 1.0
15 │ 3 2 d 2.0
16 │ 4 2 d 2.0
17 │ 5 3 d 3.0
18 │ 6 3 d 3.0
3 rows omitted
julia> unstack(long)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value)
6×4 DataFrame
Row │ id b c d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
julia> unstack(long, [:id, :a], :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
6×4 DataFrame
Row │ id _b _c _d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
Note that there are some differences between the widened results above.
julia> df = DataFrame(id=["1", "1", "2"],
variable=["Var1", "Var2", "Var1"],
value=[1, 2, 3])
3×3 DataFrame
Row │ id variable value
│ String String Int64
─────┼─────────────────────────
1 │ 1 Var1 1
2 │ 1 Var2 2
3 │ 2 Var1 3
julia> unstack(df, :variable, :value, fill=0)
2×3 DataFrame
Row │ id Var1 Var2
│ String Int64 Int64
─────┼──────────────────────
1 │ 1 1 2
2 │ 2 3 0
julia> df = DataFrame(cols=["a", "a", "b"], values=[1, 2, 4])
3×2 DataFrame
Row │ cols values
│ String Int64
─────┼────────────────
1 │ a 1
2 │ a 2
3 │ b 4
julia> unstack(df, :cols, :values, combine=copy)
1×2 DataFrame
Row │ a b
│ Array…? Array…?
─────┼──────────────────
1 │ [1, 2] [4]
julia> unstack(df, :cols, :values, combine=sum)
1×2 DataFrame
Row │ a b
│ Int64? Int64?
─────┼────────────────
1 │ 3 4
Base.permutedims
— Functionpermutedims(df::AbstractDataFrame,
[src_namescol::Union{Int, Symbol, AbstractString}],
[dest_namescol::Union{Symbol, AbstractString}];
makeunique::Bool=false, strict::Bool=true)
Turn df
on its side such that rows become columns and values in the column indexed by src_namescol
become the names of new columns. In the resulting DataFrame
, column names of df
will become the first column with name specified by dest_namescol
.
Arguments
df
: theAbstractDataFrame
src_namescol
: the column that will become the new header. If omitted then column names:x1
,:x2
, ... are generated automatically.dest_namescol
: the name of the first column in the returnedDataFrame
. Defaults to the same name assrc_namescol
. Not supported whensrc_namescol
is a vector or is omitted.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate). Not supported whensrc_namescol
is omitted.strict
: iftrue
(the default), an error will be raised if the values contained in thesrc_namescol
are not allSymbol
or allAbstractString
, or can all be converted toString
usingconvert
. Iffalse
then any values are accepted and the will be changed to strings using thestring
function. Not supported whensrc_namescol
is a vector or is omitted.
Note: The element types of columns in resulting DataFrame
(other than the first column if it is created from df
column names, which always has element type String
) will depend on the element types of all input columns based on the result of promote_type
. That is, if the source data frame contains Int
and Float64
columns, resulting columns will have element type Float64
. If the source has Int
and String
columns, resulting columns will have element type Any
.
Metadata: table-level :note
-style metadata is preserved and column-level metadata is dropped.
Examples
julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> permutedims(df)
2×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
julia> permutedims(df, [:p, :q])
2×2 DataFrame
Row │ p q
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Float64 Int64 Bool
─────┼───────────────────────────────
1 │ x 1.0 3 true
2 │ y 2.0 4 false
julia> permutedims(df1, 1) # note the column types
3×3 DataFrame
Row │ a x y
│ String Float64 Float64
─────┼──────────────────────────
1 │ b 1.0 2.0
2 │ c 3.0 4.0
3 │ d 1.0 0.0
julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Any Int64 Bool
─────┼───────────────────────────
1 │ x 1 3 true
2 │ y two 4 false
julia> permutedims(df2, 1, "different_name")
3×3 DataFrame
Row │ different_name x y
│ String Any Any
─────┼─────────────────────────────
1 │ b 1 two
2 │ c 3 4
3 │ d true false
Sorting
Base.issorted
— Functionissorted(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Test whether data frame df
sorted by column(s) cols
. Checking against multiple columns is done lexicographically.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers). If cols
selects no columns, check whether df
is sorted on all columns (this behaviour is deprecated and will change in future versions).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique
keyword allows for the situation to be caught. If checkunique
is true
and duplicate elements are found an error will be thrown. The use of the checkunique
keyword is only supported when neither the by
nor the lt
keywords are being used. Similarly, the use of order(...)
clauses that specify either by
or lt
are not supported, but specifying rev
by itself is allowed.
The by
keyword allows providing a function that will be applied to each cell before comparison; the lt
keyword allows providing a custom "less than" function. If both by
and lt
are specified, the lt
function is applied to the result of the by
function.
Keyword arguments specifying sorting order (rev
, lt
or by
) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols
.
Examples
julia> df = DataFrame(a=[1, 2, 3, 4], b=[4, 3, 2, 1])
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 3
3 │ 3 2
4 │ 4 1
julia> issorted(df)
true
julia> issorted(df, :a)
true
julia> issorted(df, :b)
false
julia> issorted(df, :b, rev=true)
true
DataFrames.order
— Functionorder(col::ColumnIndex; kwargs...)
Specify sorting order for a column col
in a data frame. kwargs
can be lt
, by
, rev
, and order
with values following the rules defined in sort!
.
Examples
julia> df = DataFrame(x=[-3, -1, 0, 2, 4], y=1:5)
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ -3 1
2 │ -1 2
3 │ 0 3
4 │ 2 4
5 │ 4 5
julia> sort(df, order(:x, rev=true))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 4 5
2 │ 2 4
3 │ 0 3
4 │ -1 2
5 │ -3 1
julia> sort(df, order(:x, by=abs))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 0 3
2 │ -1 2
3 │ 2 4
4 │ -3 1
5 │ 4 5
Base.sort
— Functionsort(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false,
checkunique::Bool=false)
Return a data frame containing the rows in df
sorted by column(s) cols
. Sorting on multiple columns is done lexicographically.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers). If cols
selects no columns, sort df
on all columns (this behaviour is deprecated and will change in future versions).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique
keyword allows for the situation to be caught. If checkunique
is true
and duplicate elements are found an error will be thrown. The use of the checkunique
keyword is only supported when neither the by
nor the lt
keywords are being used. Similarly, the use of order(...)
clauses that specify either by
or lt
are not supported, but specifying rev
by itself is allowed.
The by
keyword allows providing a function that will be applied to each cell before comparison; the lt
keyword allows providing a custom "less than" function. If both by
and lt
are specified, the lt
function is applied to the result of the by
function.
Keyword arguments specifying sorting order (rev
, lt
or by
) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols
.
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
.
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
Base.sort!
— Functionsort!(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Sort data frame df
by column(s) cols
by permuting its rows in-place. Sorting on multiple columns is done lexicographicallly.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers). If cols
selects no columns, sort df
on all columns (this behaviour is deprecated and will change in future versions).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique
keyword allows for the situation to be caught. If checkunique
is true
and duplicate elements are found an error will be thrown. The use of the checkunique
keyword is only supported when neither the by
nor the lt
keywords are being used. Similarly, the use of order(...)
clauses that specify either by
or lt
are not supported, but specifying rev
by itself is allowed.
The by
keyword allows providing a function that will be applied to each cell before comparison; the lt
keyword allows providing a custom "less than" function. If both by
and lt
are specified, the lt
function is applied to the result of the by
function.
Keyword arguments specifying sorting order (rev
, lt
or by
) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols
.
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
.
sort!
will produce a correct result even if some columns of passed data frame are identical (checked with ===
). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then sort!
result might be incorrect.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Metadata having other styles is dropped (from parent data frame when df
is a SubDataFrame
).
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort!(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
Base.sortperm
— Functionsortperm(df::AbstractDataFrame, cols=All();
alg::Union{Algorithm, Nothing}=nothing,
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
Return a permutation vector of row indices of data frame df
that puts them in sorted order according to column(s) cols
. Order on multiple columns is computed lexicographically.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers). If cols
selects no columns, return permutation vector based on sorting all columns (this behaviour is deprecated and will change in future versions).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below).
Since having repeated elements makes multiple sorting orders valid, the checkunique
keyword allows for the situation to be caught. If checkunique
is true
and duplicate elements are found an error will be thrown. The use of the checkunique
keyword is only supported when neither the by
nor the lt
keywords are being used. Similarly, the use of order(...)
clauses that specify either by
or lt
are not supported, but specifying rev
by itself is allowed.
The by
keyword allows providing a function that will be applied to each cell before comparison; the lt
keyword allows providing a custom "less than" function. If both by
and lt
are specified, the lt
function is applied to the result of the by
function.
Keyword arguments specifying sorting order (rev
, lt
or by
) can either be a single value, or a vector of length equal to the number of columns the operation is performed on. When a single value is passed, it applies to all columns. When a vector is passed, each entry applies to the column in the corresponding position in cols
.
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
.
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sortperm(df, :x)
4-element Vector{Int64}:
2
4
3
1
julia> sortperm(df, [:x, :y])
4-element Vector{Int64}:
4
2
3
1
julia> sortperm(df, [:x, :y], rev=true)
4-element Vector{Int64}:
1
3
2
4
julia> sortperm(df, [:x, order(:y, rev=true)])
4-element Vector{Int64}:
2
4
3
1
Joining
DataAPI.antijoin
— Functionantijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform an anti join of two data frame objects and return a DataFrame
containing the result. An anti join returns the subset of rows of df1
that do not match with the keys in df2
.
The order of rows in the result is kept from df1
.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: ignored as no columns are added todf1
columns (it is provided for consistency with other functions).validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf2
on
columns.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note
-style metadata are taken from df1
.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, semijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job, on = :ID)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
DataAPI.crossjoin
— Functioncrossjoin(df1::AbstractDataFrame, df2::AbstractDataFrame;
makeunique::Bool=false, renamecols=identity => identity)
crossjoin(df1, df2, dfs...; makeunique = false)
Perform a cross join of two or more data frame objects and return a DataFrame
containing the result. A cross join returns the cartesian product of rows from all passed data frames, where the first passed data frame is assigned to the dimension that changes the slowest and the last data frame is assigned to the dimension that changes the fastest.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
.
If more than two data frames are passed, the join is performed recursively with left associativity.
Metadata: table-level :note
-style metadata is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note
-style metadata is preserved from both tables.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
.
Examples
julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
Row │ X
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
Row │ Y
│ String
─────┼────────
1 │ a
2 │ b
julia> crossjoin(df1, df2)
6×2 DataFrame
Row │ X Y
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 b
3 │ 2 a
4 │ 2 b
5 │ 3 a
6 │ 3 b
DataAPI.innerjoin
— Functioninnerjoin(df1, df2; on, makeunique=false, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error,
order=:undefined)
innerjoin(df1, df2, dfs...; on, makeunique=false,
validate=(false, false), matchmissing=:error,
order=:undefined)
Perform an inner join of two or more data frame objects and return a DataFrame
containing the result. An inner join includes rows with keys that match in all passed data frames.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1
. This behavior may change in future releases.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). When joining only two data frames, aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf1
anddf2
on
columns.order
: if:undefined
(the default) the order of rows in the result is undefined and may change in future releases. If:left
then the order of rows from the left data frame is retained. If:right
then the order of rows from the right data frame is retained.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate
keyword argument is applied recursively with left associativity.
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for key columns is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note
-style metadata is preserved for all other columns.
See also: leftjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
2×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> innerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
2×3 DataFrame
Row │ ID NAME job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
DataAPI.leftjoin
— Functionleftjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error, order=:undefined)
Perform a left join of two data frame objects and return a DataFrame
containing the result. A left join includes all rows from df1
.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1
. This behavior may change in future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name, for whether a row appeared in onlydf1
("left_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf2
on
columns.order
: if:undefined
(the default) the order of rows in the result is undefined and may change in future releases. If:left
then the order of rows from the left data frame is retained. If:right
then the order of rows from the right data frame is retained (non-matching rows are put at the end).
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note
-style metadata is taken from df1
(including key columns), except for columns added to it from df2
, whose column-level :note
-style metadata is taken from df2
.
See also: innerjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String?
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> leftjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
DataFrames.leftjoin!
— Functionleftjoin!(df1, df2; on, makeunique=false, source=nothing,
matchmissing=:error)
Perform a left join of two data frame objects by updating the df1
with the joined columns from df2
.
A left join includes all rows from df1
and leaves all rows and columns from df1
untouched. Note that each row in df1
must have at most one match in df2
. Otherwise, this function would not be able to execute the join in-place since new rows would need to be added to df1
.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name, for whether a row appeared in onlydf1
("left_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf2
on
columns.
The columns added to df1
from df2
will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
Metadata: table-level and column-level :note
-style metadata are taken from df1
(including key columns), except for columns added to it from df2
, whose column-level :note
-style metadata is taken from df2
.
See also: leftjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin!(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin!(name, job2, on = :ID => :identifier, makeunique=true, source=:source)
3×5 DataFrame
Row │ ID Name Job Job_1 source
│ Int64 String String? String? String
─────┼───────────────────────────────────────────────
1 │ 1 John Doe Lawyer Lawyer both
2 │ 2 Jane Doe Doctor Doctor both
3 │ 3 Joe Blogs missing missing left_only
DataAPI.outerjoin
— Functionouterjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error, order=:undefined)
outerjoin(df1, df2, dfs...; on, makeunique = false,
validate = (false, false), matchmissing=:error, order=:undefined)
Perform an outer join of two or more data frame objects and return a DataFrame
containing the result. An outer join includes rows with keys that appear in any of the passed data frames.
The order of rows in the result is undefined and may change in future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the element type of these columns both df1
and df2
. This behavior may change in future releases.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). When joining only two data frames, aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
. This argument is only supported when joining exactly two data frames.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched.order
: if:undefined
(the default) the order of rows in the result is undefined and may change in future releases. If:left
then the order of rows from the left data frame is retained (non-matching rows are put at the end). If:right
then the order of rows from the right data frame is retained (non-matching rows are put at the end).
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator
keyword argument is not supported and validate
keyword argument is applied recursively with left associativity.
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for key columns is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note
-style metadata is preserved for all other columns.
See also: innerjoin
, leftjoin
, rightjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
Row │ ID Name Job
│ Int64 String? String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> outerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
4×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String?
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
julia> outerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
4×3 DataFrame
Row │ ID NAME job
│ Int64 String? String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
DataAPI.rightjoin
— Functionrightjoin(df1, df2; on, makeunique=false, source=nothing,
validate=(false, false), renamecols=(identity => identity),
matchmissing=:error, order=:undefined)
Perform a right join on two data frame objects and return a DataFrame
containing the result. A right join includes all rows from df2
.
The order of rows in the result is undefined and may change in future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df2
. This behavior may change in future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name for whether a row appeared in onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf1
on
columns.order
: if:undefined
(the default) the order of rows in the result is undefined and may change in future releases. If:left
then the order of rows from the left data frame is retained (non-matching rows are put at the end). If:right
then the order of rows from the right data frame is retained.
All columns of the returned data frame will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note
-style metadata is taken from df2
(including key columns), except for columns added to it from df1
, whose column-level :note
-style metadata is taken from df1
.
See also: innerjoin
, leftjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
DataAPI.semijoin
— Functionsemijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform a semi join of two data frame objects and return a DataFrame
containing the result. A semi join returns the subset of rows of df1
that match with the keys in df2
.
The order of rows in the result is kept from df1
.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). Aleft=>right
pair of names can be used instead of a name, for the case where a key has different names indf1
anddf2
(it is allowed to mix names and name pairs in a vector). Key values are compared usingisequal
.on
is a required argument.makeunique
: ignored as no columns are added todf1
columns (it is provided for consistency with other functions).indicator
: Default:nothing
. If aSymbol
or string, adds categorical indicator column with the given name for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched; if equal to:notequal
then missings are dropped indf2
on
columns.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
Metadata: table-level and column-level :note
-style metadata are taken from df1
.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job, on = :ID)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> job2 = DataFrame(identifier=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
Grouping
Base.get
— Functionget(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns.
key
may be a GroupKey
, NamedTuple
or Tuple
of grouping column values (in the same order as the cols
argument to groupby
). It may also be an AbstractDict
, in which case the order of the arguments does not matter.
Examples
julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[2]),
b=repeat([2, 1], outer=[3]),
c=1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 1 4
⋮
Last Group (2 rows): a = :baz
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ bar 1 2
2 │ bar 2 5
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (:qux,), nothing)
DataFrames.groupby
— Functiongroupby(d::AbstractDataFrame, cols;
sort::Union{Bool, Nothing, NamedTuple}=nothing,
skipmissing::Bool=false)
Return a GroupedDataFrame
representing a view of an AbstractDataFrame
split into row groups.
Arguments
df
: anAbstractDataFrame
to splitcols
: data frame columns to group by. Can be any column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). In particular if the selector picks no columns then a single-groupGroupedDataFrame
is created. As a special case, ifcols
is a single column or a vector of columns then it can contain columns wrapped inorder
that will be used to determine the order of groups ifsort
istrue
or aNamedTuple
(ifsort
isfalse
, then passingorder
is an error; ifsort
isnothing
then it is set totrue
whenorder
is passed).sort
: ifsort=true
sort groups according to the values of the grouping columnscols
; ifsort=false
groups are created in their order of appearance indf
; ifsort=nothing
(the default) then the fastest available grouping algorithm is picked and in consequence the order of groups in the result is undefined and may change in future releases; below a description of the current implementation is provided. Additionallysort
can be aNamedTuple
having some or all ofalg
,lt
,by
,rev
, andorder
fields. In this case the groups are sorted and their order follows thesortperm
order.skipmissing
: whether to skip groups withmissing
values in one of the grouping columnscols
Details
An iterator over a GroupedDataFrame
returns a SubDataFrame
view for each grouping into df
. Within each group, the order of rows in df
is preserved.
A GroupedDataFrame
also supports indexing by groups, select
, transform
, and combine
(which applies a function to each group and combines the result into a data frame).
GroupedDataFrame
also supports the dictionary interface. The keys are GroupKey
objects returned by keys(::GroupedDataFrame)
, which can also be used to get the values of the grouping columns for each group. Tuples
and NamedTuple
s containing the values of the grouping columns (in the same order as the cols
argument) are also accepted as indices. Finally, an AbstractDict
can be used to index into a grouped data frame where the keys are column names of the data frame. The order of the keys does not matter in this case.
In the current implementation if sort=nothing
groups are ordered following the order of appearance of values in the grouping columns, except when all grouping columns provide non-nothing
DataAPI.refpool
, in which case the order of groups follows the order of values returned by DataAPI.refpool
. As a particular application of this rule if all cols
are CategoricalVector
s then groups are always sorted. Integer columns with a narrow range also use this this optimization, so to the order of groups when grouping on integer columns is undefined. A column is considered to be an integer column when deciding on the grouping algorithm choice if its eltype
is a subtype of Union{Missing, Real}
, all its elements are either missing
or pass isinteger
test, and none of them is equal to -0.0
.
See also
combine
, select
, select!
, transform
, transform!
Examples
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
⋮
Last Group (2 rows): a = 4
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[1]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> last(gd)
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[(a=3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[Dict("a" => 3)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[(3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> k = first(keys(gd))
GroupKey: (a = 1,)
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> for g in gd
println(g)
end
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
DataFrames.groupcols
— Functiongroupcols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
used for grouping.
DataFrames.groupindices
— Functiongroupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd)
.
Rows appearing in group gd[i]
are attributed index i
. Rows not present in any group are attributed missing
(this can happen if skipmissing=true
was passed when creating gd
, or if gd
is a subset from a larger GroupedDataFrame
).
The groupindices => target_col_name
syntax (or just groupindices
without specifying the target column name) is also supported in the transformation mini-language when passing a GroupedDataFrame
to transformation functions (combine
, select
, etc.).
Examples
julia> df = DataFrame(id=["a", "c", "b", "b", "a"])
5×1 DataFrame
Row │ id
│ String
─────┼────────
1 │ a
2 │ c
3 │ b
4 │ b
5 │ a
julia> gdf = groupby(df, :id);
julia> combine(gdf, groupindices)
3×2 DataFrame
Row │ id groupindices
│ String Int64
─────┼──────────────────────
1 │ a 1
2 │ c 2
3 │ b 3
julia> select(gdf, groupindices => :gid)
5×2 DataFrame
Row │ id gid
│ String Int64
─────┼───────────────
1 │ a 1
2 │ c 2
3 │ b 3
4 │ b 3
5 │ a 1
Base.keys
— Functionkeys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame
gd
as a GroupKeys
object. Each key is a GroupKey
, which behaves like a NamedTuple
holding the values of the grouping columns for a given group. Unlike the equivalent Tuple
, NamedTuple
, and AbstractDict
, these keys can be used to index into gd
efficiently. The ordering of the keys is identical to the ordering of the groups of gd
under iteration and integer indexing.
Examples
julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[4]),
b=repeat([2, 1], outer=[6]),
c=1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
⋮
Last Group (2 rows): a = :baz, b = 1
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 1 6
2 │ baz 1 12
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
2-element Vector{Symbol}:
:a
:b
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame
:
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
julia> gd[keys(gd)[1]] == gd[1]
true
keys(dfc::DataFrameColumns)
Get a vector of column names of dfc
as Symbol
s.
Base.parent
— Functionparent(gd::GroupedDataFrame)
Return the parent data frame of gd
.
DataFrames.proprow
— Functionproprow
Compute the proportion of rows which belong to each group, i.e. its number of rows divided by the total number of rows in a GroupedDataFrame
.
This function can only be used in the transformation mini-language via the proprow => target_col_name
syntax (or just proprow
without specifying the target column name), when passing a GroupedDataFrame
to transformation functions (combine
, select
, etc.).
Examples
julia> df = DataFrame(id=["a", "c", "b", "b", "a", "b"])
6×1 DataFrame
Row │ id
│ String
─────┼────────
1 │ a
2 │ c
3 │ b
4 │ b
5 │ a
6 │ b
julia> gdf = groupby(df, :id);
julia> combine(gdf, proprow)
3×2 DataFrame
Row │ id proprow
│ String Float64
─────┼──────────────────
1 │ a 0.333333
2 │ c 0.166667
3 │ b 0.5
julia> select(gdf, proprow => :frac)
6×2 DataFrame
Row │ id frac
│ String Float64
─────┼──────────────────
1 │ a 0.333333
2 │ c 0.166667
3 │ b 0.5
4 │ b 0.5
5 │ a 0.333333
6 │ b 0.5
DataFrames.valuecols
— Functionvaluecols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
not used for grouping.
Filtering rows
Base.allunique
— Functionallunique(df::AbstractDataFrame, cols=:)
Return true
if none of the rows of df
are duplicated. Two rows are duplicates if all their columns contain equal values (according to isequal
) for all columns in cols
(by default, all columns).
Arguments
df
:AbstractDataFrame
cols
: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselect
.
See also unique
and nonunique
.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> allunique(df)
true
julia> allunique(df, :x)
false
julia> allunique(df, :i => ByRow(isodd))
false
Base.deleteat!
— Functiondeleteat!(df::DataFrame, inds)
Delete rows specified by inds
from a DataFrame
df
in place and return it.
Internally deleteat!
is called for all columns so inds
must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not
wrapping any valid selector.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> deleteat!(df, 2)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
Base.empty
— Functionempty(df::AbstractDataFrame)
Create a new DataFrame
with the same column names and column element types as df
but with zero rows.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Base.empty!
— Functionempty!(df::DataFrame)
Remove all rows from df
, making each of its columns empty.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> empty!(df)
0×2 DataFrame
Row │ a b
│ Int64 Int64
─────┴──────────────
julia> df.a, df.b
(Int64[], Int64[])
Base.filter
— Functionfilter(fun, df::AbstractDataFrame; view::Bool=false)
filter(cols => fun, df::AbstractDataFrame; view::Bool=false)
Return a data frame containing only rows from df
for which fun
returns true
.
If cols
is not specified then the predicate fun
is passed DataFrameRow
s. Elements of a DataFrameRow
may be accessed with dot syntax or column indexing inside fun
.
If cols
is specified then the predicate fun
is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
Passing cols
leads to a more efficient execution of the operation for large data frames.
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset
function instead as it is consistent with other DataFrames.jl functions (as opposed to filter
).
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: filter!
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter(row -> row["x"] > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
filter(fun, gdf::GroupedDataFrame; ungroup::Bool=false)
filter(cols => fun, gdf::GroupedDataFrame; ungroup::Bool=false)
Return only groups in gd
for which fun
returns true
as a GroupedDataFrame
if ungroup=false
(the default), or as a data frame if ungroup=true
.
If cols
is not specified then the predicate fun
is called with a SubDataFrame
for each group.
If cols
is specified then the predicate fun
is called for each group with views of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset
function instead as it is consistent with other DataFrames.jl functions (as opposed to filter
).
Examples
julia> df = DataFrame(g=[1, 2], x=['a', 'b']);
julia> gd = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
⋮
Last Group (1 row): g = 2
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 2 b
julia> filter(x -> x.x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> filter(:x => x -> x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> filter(:x => x -> x[1] == 'a', gd, ungroup=true)
1×2 DataFrame
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
Base.filter!
— Functionfilter!(fun, df::AbstractDataFrame)
filter!(cols => fun, df::AbstractDataFrame)
Remove rows from data frame df
for which fun
returns false
.
If cols
is not specified then the predicate fun
is passed DataFrameRow
s. Elements of a DataFrameRow
may be accessed with dot syntax or column indexing inside fun
.
If cols
is specified then the predicate fun
is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
Passing cols
leads to a more efficient execution of the operation for large data frames.
This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset!
function instead as it is consistent with other DataFrames.jl functions (as opposed to filter!
).
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: filter
Examples
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter!(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter!(row -> row["x"] > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter!(:x => x -> x == 3, df)
1×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"]);
julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"]);
julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
Base.keepat!
— Functionkeepat!(df::DataFrame, inds)
Delete rows at all indices not specified by inds
from a DataFrame
df
in place and return it.
Internally deleteat!
is called for all columns so inds
must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not
wrapping any valid selector.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> keepat!(df, [1, 3])
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
Base.first
— Functionfirst(df::AbstractDataFrame)
Get the first row of df
as a DataFrameRow
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
first(df::AbstractDataFrame, n::Integer; view::Bool=false)
Get a data frame with the n
first rows of df
.
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Base.last
— Functionlast(df::AbstractDataFrame)
Get the last row of df
as a DataFrameRow
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
last(df::AbstractDataFrame, n::Integer; view::Bool=false)
Get a data frame with the n
last rows of df
.
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
Metadata: this function preserves table-level and column-level :note
-style metadata.
DataFrames.nonunique
— Functionnonunique(df::AbstractDataFrame; keep::Symbol=:first)
nonunique(df::AbstractDataFrame, cols; keep::Symbol=:first)
Return a Vector{Bool}
in which true
entries indicate duplicate rows.
Duplicate rows are those for which at least another row contains equal values (according to isequal
) for all columns in cols
(by default, all columns). If keep=:first
(the default), only the first occurrence of a set of duplicate rows is indicated with a false
entry. If keep=:last
, only the last occurrence of a set of duplicate rows is indicated with a false
entry. If keep=:noduplicates
, only rows without any duplicates are indicated with a false
entry.
Arguments
df
:AbstractDataFrame
cols
: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselect
that returns at least one column ifdf
has at least one column.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> nonunique(df)
8-element Vector{Bool}:
0
0
0
0
1
1
1
1
julia> nonunique(df, keep=:last)
8-element Vector{Bool}:
1
1
1
1
0
0
0
0
julia> nonunique(df, 2)
8-element Vector{Bool}:
0
0
1
1
1
1
1
1
Base.Iterators.only
— Functiononly(df::AbstractDataFrame)
If df
has a single row return it as a DataFrameRow
; otherwise throw ArgumentError
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Base.pop!
— Functionpop!(df::DataFrame)
Remove the last row from df
and return a NamedTuple
created from this row.
Using this method for very wide data frames may lead to expensive compilation.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> pop!(df)
(a = 3, b = 6)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
Base.popat!
— Functionpopat!(df::DataFrame, i::Integer)
Remove the i
-th row from df
and return a NamedTuple
created from this row.
Using this method for very wide data frames may lead to expensive compilation.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> popat!(df, 2)
(a = 2, b = 5)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
Base.popfirst!
— Functionpopfirst!(df::DataFrame)
Remove the first row from df
and return a NamedTuple
created from this row.
Using this method for very wide data frames may lead to expensive compilation.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> popfirst!(df)
(a = 1, b = 4)
julia> df
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 3 6
Base.resize!
— Functionresize!(df::DataFrame, n::Integer)
Resize df
to have n
rows by calling resize!
on all columns of df
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> resize!(df, 2)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
DataFrames.subset
— Functionsubset(df::AbstractDataFrame, args...;
skipmissing::Bool=false, view::Bool=false, threads::Bool=true)
subset(gdf::GroupedDataFrame, args...;
skipmissing::Bool=false, view::Bool=false,
ungroup::Bool=true, threads::Bool=true)
Return a copy of data frame df
or parent of gdf
containing only rows for which all values produced by transformation(s) args
for a given row are true
. All transformations must produce vectors containing true
or false
. When the first argument is a GroupedDataFrame
, transformations are also allowed to return a single true
or false
value, which results in including or excluding a whole group.
If skipmissing=false
(the default) args
are required to produce results containing only Bool
values. If skipmissing=true
, additionally missing
is allowed and it is treated as false
(i.e. rows for which one of the conditions returns missing
are skipped).
Each argument passed in args
can be any specifier following the rules described for select
with the restriction that:
- specifying target column name is not allowed as
subset
does not create new columns; - every passed transformation must return a scalar or a vector (returning
AbstractDataFrame
,NamedTuple
,DataFrameRow
orAbstractMatrix
is not supported).
If view=true
a SubDataFrame
view is returned instead of a DataFrame
.
If ungroup=false
the resulting data frame is re-grouped based on the same grouping columns as gdf
and a GroupedDataFrame
is returned (preserving the order of groups from gdf
).
If threads=true
(the default) transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false
if some transformations require serial execution or are not thread-safe.
If a GroupedDataFrame
is passed then it must include all groups present in the parent
data frame, like in select!
.
Note that as the subset
function works in exactly the same way as other transformation functions defined in DataFrames.jl this is the preferred way to subset rows of a data frame or grouped data frame. In particular it uses a different set of rules for specifying transformations than filter
which is implemented in DataFrames.jl to ensure support for the standard Julia API for collections.
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: subset!
, filter
, select
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12])
4×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 1 true true true 1
2 │ 2 false true true 2
3 │ 3 true false missing 11
4 │ 4 false false missing 12
julia> subset(df, :x)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 1 true true true 1
2 │ 3 true false missing 11
julia> subset(df, :v => x -> x .> 3)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
julia> subset(df, :x, :y => ByRow(!))
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 3 true false missing 11
julia> subset(df, :x, :z, skipmissing=true)
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────
1 │ 1 true true true 1
julia> subset(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset(groupby(df, :y), :v => x -> x .> minimum(x))
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
julia> subset(groupby(df, :y), :v => x -> minimum(x) > 5)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
DataFrames.subset!
— Functionsubset!(df::AbstractDataFrame, args...;
skipmissing::Bool=false, threads::Bool=true)
subset!(gdf::GroupedDataFrame{DataFrame}, args...;
skipmissing::Bool=false, ungroup::Bool=true, threads::Bool=true)
Update data frame df
or the parent of gdf
in place to contain only rows for which all values produced by transformation(s) args
for a given row is true
. All transformations must produce vectors containing true
or false
. When the first argument is a GroupedDataFrame
, transformations are also allowed to return a single true
or false
value, which results in including or excluding a whole group.
If skipmissing=false
(the default) args
are required to produce results containing only Bool
values. If skipmissing=true
, additionally missing
is allowed and it is treated as false
(i.e. rows for which one of the conditions returns missing
are skipped).
Each argument passed in args
can be any specifier following the rules described for select
with the restriction that:
- specifying target column name is not allowed as
subset!
does not create new columns; - every passed transformation must return a scalar or a vector (returning
AbstractDataFrame
,NamedTuple
,DataFrameRow
orAbstractMatrix
is not supported).
If ungroup=false
the passed GroupedDataFrame
gdf
is updated (preserving the order of its groups) and returned.
If threads=true
(the default) transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false
if some transformations require serial execution or are not thread-safe.
If GroupedDataFrame
is subsetted then it must include all groups present in the parent
data frame, like in select!
. In this case the passed GroupedDataFrame
is updated to have correct groups after its parent is updated.
Note that as the subset!
function works in exactly the same way as other transformation functions defined in DataFrames.jl this is the preferred way to subset rows of a data frame or grouped data frame. In particular it uses a different set of rules for specifying transformations than filter!
which is implemented in DataFrames.jl to ensure support for the standard Julia API for collections.
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: subset
, filter!
, select!
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false])
4×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼─────────────────────
1 │ 1 true true
2 │ 2 false true
3 │ 3 true false
4 │ 4 false false
julia> subset!(df, :x, :y => ByRow(!));
julia> df
1×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼────────────────────
1 │ 3 true false
julia> df = DataFrame(id=1:4, y=[true, true, false, false], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×3 DataFrame
Row │ id y v
│ Int64 Bool Int64
─────┼─────────────────────
1 │ 2 true 2
2 │ 4 false 12
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
z=[true, true, missing, missing], v=1:4)
4×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼──────────────────────────────
1 │ 1 true true 1
2 │ 2 false true 2
3 │ 3 true missing 3
4 │ 4 false missing 4
julia> subset!(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset!(df, :x, :z, skipmissing=true);
julia> df
1×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼───────────────────────────
1 │ 1 true true 1
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> minimum(x) > 5);
julia> df
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
Base.unique
— Functionunique(df::AbstractDataFrame; view::Bool=false, keep::Symbol=:first)
unique(df::AbstractDataFrame, cols; view::Bool=false, keep::Symbol=:first)
Return a data frame containing only unique rows in df
.
Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal
) for all columns in cols
(by default, all columns). If keep=:first
(the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last
, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates
, only rows without any duplicates are kept.
If view=false
a freshly allocated DataFrame
is returned, and if view=true
then a SubDataFrame
view into df
is returned.
Arguments
df
: the AbstractDataFramecols
: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselect
that returns at least one column ifdf
has at least one column.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique(df) # doesn't modify df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, 2)
2×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
julia> unique(df, keep=:noduplicates)
0×2 DataFrame
Row │ i x
│ Int64 Int64
─────┴──────────────
Base.unique!
— Functionunique!(df::AbstractDataFrame; keep::Symbol=:first)
unique!(df::AbstractDataFrame, cols; keep::Symbol=:first)
Update df
in-place to contain only unique rows.
Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal
) for all columns in cols
(by default, all columns). If keep=:first
(the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last
, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates
, only rows without any duplicates are kept.
Arguments
df
: the AbstractDataFramecols
: column indicator (Symbol
,Int
,Vector{Symbol}
,Regex
, etc.) specifying the column(s) to compare. Can be any column selector or transformation accepted byselect
that returns at least one column ifdf
has at least one column.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique!(copy(df)) # modifies df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, keep=:noduplicates)
0×2 DataFrame
Row │ i x
│ Int64 Int64
─────┴──────────────
Working with missing values
Missings.allowmissing
— Functionallowmissing(df::AbstractDataFrame, cols=:)
Return a copy of data frame df
with columns cols
converted to element type Union{T, Missing}
from T
to allow support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> allowmissing(df)
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
DataFrames.allowmissing!
— Functionallowmissing!(df::DataFrame, cols=:)
Convert columns cols
of data frame df
from element type T
to Union{T, Missing}
to support missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
Metadata: this function preserves table-level and column-level :note
-style metadata.
DataFrames.completecases
— Functioncompletecases(df::AbstractDataFrame, cols=:)
Return a Boolean vector with true
entries indicating rows without missing values (complete cases) in data frame df
.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers) that returns at least one column if df
has at least one column.
See also: dropmissing
and dropmissing!
. Use findall(completecases(df))
to get the indices of the rows.
Examples
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> completecases(df)
5-element BitVector:
0
0
0
1
1
julia> completecases(df, :x)
5-element BitVector:
0
1
0
1
1
julia> completecases(df, [:x, :y])
5-element BitVector:
0
0
0
1
1
Missings.disallowmissing
— Functiondisallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)
Return a copy of data frame df
with columns cols
converted from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(a=Union{Int, Missing}[1, 2])
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
julia> disallowmissing(df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> df = DataFrame(a=[1, missing])
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
julia> disallowmissing(df, error=false)
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
DataFrames.disallowmissing!
— Functiondisallowmissing!(df::DataFrame, cols=:; error::Bool=true)
Convert columns cols
of data frame df
from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
Metadata: this function preserves table-level and column-level :note
-style metadata.
DataFrames.dropmissing
— Functiondropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)
Return a data frame excluding rows with missing values in df
.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned. In this case disallowmissing
must be false
.
If disallowmissing
is true
(the default when view
is false
) then columns specified in cols
will be converted so as not to allow for missing values using disallowmissing!
.
See also: completecases
and dropmissing!
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
Examples
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing(df)
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
DataFrames.dropmissing!
— Functiondropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Remove rows with missing values from data frame df
and return it.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If disallowmissing
is true
(the default) then the cols
columns will get converted using disallowmissing!
.
Metadata: this function preserves table-level and column-level :note
-style metadata.
See also: dropmissing
and completecases
.
julia> df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing!(copy(df))
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing!(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
Iteration
Base.eachcol
— Functioneachcol(df::AbstractDataFrame)
Return a DataFrameColumns
object that is a vector-like that allows iterating an AbstractDataFrame
column by column.
Indexing into DataFrameColumns
objects using integer, Symbol
or string returns the corresponding column (without copying). Indexing into DataFrameColumns
objects using a multiple column selector returns a subsetted DataFrameColumns
object with a new parent containing only the selected columns (without copying).
DataFrameColumns
supports most of the AbstractVector
API. The key differences are that it is read-only and that the keys
function returns a vector of Symbol
s (and not integers as for normal vectors).
In particular findnext
, findprev
, findfirst
, findlast
, and findall
functions are supported, and in findnext
and findprev
functions it is allowed to pass an integer, string, or Symbol
as a reference index.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachcol(df)
4×2 DataFrameColumns
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> collect(eachcol(df))
2-element Vector{AbstractVector}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Vector{Int64}:
3
3
julia> sum.(eachcol(df))
2-element Vector{Int64}:
10
50
Base.eachrow
— Functioneachrow(df::AbstractDataFrame)
Return a DataFrameRows
that iterates a data frame row by row, with each row represented as a DataFrameRow
.
Because DataFrameRow
s have an eltype
of Any
, use copy(dfr::DataFrameRow)
to obtain a named tuple, which supports iteration and property access like a DataFrameRow
, but also passes information on the eltypes
of the columns of df
.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachrow(df)
4×2 DataFrameRows
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> copy.(eachrow(df))
4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4, 3], [2, 1]))
2×2 DataFrameRows
Row │ y x
│ Int64 Int64
─────┼──────────────
1 │ 14 4
2 │ 13 3
Base.values
— Functionvalues(dfc::DataFrameColumns)
Get a vector of columns from dfc
.
Base.pairs
— Functionpairs(dfc::DataFrameColumns)
Return an iterator of pairs associating the name of each column of dfc
with the corresponding column vector, i.e. name => col
where name
is the column name of the column col
.
Base.Iterators.partition
— FunctionIterators.partition(df::AbstractDataFrame, n::Integer)
Iterate over df
data frame n
rows at a time, returning each block as a SubDataFrame
.
Examples
julia> collect(Iterators.partition(DataFrame(x=1:5), 2))
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}:
2×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
2×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 3
2 │ 4
1×1 SubDataFrame
Row │ x
│ Int64
─────┼───────
1 │ 5
Iterators.partition(dfr::DataFrameRows, n::Integer)
Iterate over DataFrameRows
dfr
n
rows at a time, returning each block as a DataFrameRows
over a view of rows of parent of dfr
.
Examples
julia> collect(Iterators.partition(eachrow(DataFrame(x=1:5)), 2))
3-element Vector{DataFrames.DataFrameRows{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}}:
2×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
2×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 3
2 │ 4
1×1 DataFrameRows
Row │ x
│ Int64
─────┼───────
1 │ 5
Equality
Base.isapprox
— Functionisapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
nans::Bool=false, norm::Function=norm)
Inexact equality comparison. df1
and df2
must have the same size and column names. Return true
if isapprox
with given keyword arguments applied to all pairs of columns stored in df1
and df2
returns true
.
Metadata
DataAPI.metadata
— Functionmetadata(df::AbstractDataFrame, key::AbstractString, [default]; style::Bool=false)
metadata(dfr::DataFrameRow, key::AbstractString, [default]; style::Bool=false)
metadata(dfc::DataFrameColumns, key::AbstractString, [default]; style::Bool=false)
metadata(dfr::DataFrameRows, key::AbstractString, [default]; style::Bool=false)
Return table-level metadata value associated with df
for key key
. If style=true
return a tuple of metadata value and metadata style.
SubDataFrame
and DataFrameRow
expose only :note
-style metadata of their parent.
If default
is passed then return it if key
does not exist; if style=true
return (default, :default)
.
See also: metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
```
DataAPI.metadatakeys
— Functionmetadatakeys(df::AbstractDataFrame)
metadatakeys(dfr::DataFrameRow)
metadatakeys(dfc::DataFrameColumns)
metadatakeys(dfr::DataFrameRows)
Return an iterator of table-level metadata keys which are set in the object.
Values can be accessed using metadata(df, key)
.
SubDataFrame
and DataFrameRow
expose only :note
-style metadata keys of their parent.
See also: metadata
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
DataAPI.metadata!
— Functionmetadata!(df::AbstractDataFrame, key::AbstractString, value; style::Symbol=:default)
metadata!(dfr::DataFrameRow, key::AbstractString, value; style::Symbol=:default)
metadata!(dfc::DataFrameColumns, key::AbstractString, value; style::Symbol=:default)
metadata!(dfr::DataFrameRows, key::AbstractString, value; style::Symbol=:default)
Set table-level metadata for object df
for key key
to have value value
and style style
(:default
by default) and return df
.
For SubDataFrame
and DataFrameRow
only :note
-style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.
See also: metadata
, metadatakeys
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
```
DataAPI.deletemetadata!
— Functiondeletemetadata!(df::AbstractDataFrame, key::AbstractString)
deletemetadata!(dfr::DataFrameRow, key::AbstractString)
deletemetadata!(dfc::DataFrameColumns, key::AbstractString)
deletemetadata!(dfr::DataFrameRows, key::AbstractString)
Delete table-level metadata from object df
for key key
and return df
. If key does not exist, return df
without modification.
For SubDataFrame
and DataFrameRow
only :note
-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata
, metadatakeys
, metadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> deletemetadata!(df, "name");
julia> metadatakeys(df)
()
```
DataAPI.emptymetadata!
— Functionemptymetadata!(df::AbstractDataFrame)
emptymetadata!(dfr::DataFrameRow)
emptymetadata!(dfc::DataFrameColumns)
emptymetadata!(dfr::DataFrameRows)
Delete all table-level metadata from object df
.
For SubDataFrame
and DataFrameRow
only :note
-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> metadatakeys(df)
()
julia> metadata!(df, "name", "example", style=:note);
julia> metadatakeys(df)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> metadata(df, "name")
"example"
julia> metadata(df, "name", style=true)
("example", :note)
julia> emptymetadata!(df);
julia> metadatakeys(df)
()
DataAPI.colmetadata
— Functioncolmetadata(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
colmetadata(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
Return column-level metadata value associated with df
for column col
and key key
.
SubDataFrame
and DataFrameRow
expose only :note
-style metadata of their parent.
If default
is passed then return it if key
does not exist for column col
; if style=true
return (default, :default)
. If col
does not exist in df
always throw an error.
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
```
DataAPI.colmetadatakeys
— Functioncolmetadatakeys(df::AbstractDataFrame, [col::ColumnIndex])
colmetadatakeys(dfr::DataFrameRow, [col::ColumnIndex])
colmetadatakeys(dfc::DataFrameColumns, [col::ColumnIndex])
colmetadatakeys(dfr::DataFrameRows, [col::ColumnIndex])
If col
is passed return an iterator of column-level metadata keys which are set for column col
. If col
is not passed return an iterator of col => colmetadatakeys(x, col)
pairs for all columns that have metadata, where col
are Symbol
.
Values can be accessed using colmetadata(df, col, key)
.
SubDataFrame
and DataFrameRow
expose only :note
-style metadata of their parent.
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadata!
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
```
DataAPI.colmetadata!
— Functioncolmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
colmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
colmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
colmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
Set column-level metadata in df
for column col
and key key
to have value value
and style style
(:default
by default) and return df
.
For SubDataFrame
and DataFrameRow
only :note
style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, deletecolmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
```
DataAPI.deletecolmetadata!
— Functiondeletecolmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString)
deletecolmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString)
deletecolmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString)
deletecolmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString)
Delete column-level metadata set in df
for column col
and key key
and return df
.
For SubDataFrame
and DataFrameRow
only :note
-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, emptycolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadatakeys(df)
()
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> deletecolmetadata!(df, :a, "name");
julia> colmetadatakeys(df)
()
DataAPI.emptycolmetadata!
— Functionemptycolmetadata!(df::AbstractDataFrame, [col::ColumnIndex])
emptycolmetadata!(dfr::DataFrameRow, [col::ColumnIndex])
emptycolmetadata!(dfc::DataFrameColumns, [col::ColumnIndex])
emptycolmetadata!(dfr::DataFrameRows, [col::ColumnIndex])
Delete column-level metadata set in df
for column col
and key key
and return df
.
For SubDataFrame
and DataFrameRow
only :note
-style metadata from their parent can be deleted (as other styles are not propagated to views).
See also: metadata
, metadatakeys
, metadata!
, deletemetadata!
, emptymetadata!
, colmetadata
, colmetadatakeys
, colmetadata!
, deletecolmetadata!
.
Examples
julia> df = DataFrame(a=1, b=2);
julia> colmetadata!(df, :a, "name", "example", style=:note);
julia> collect(colmetadatakeys(df))
1-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:a => ["name"]
julia> colmetadatakeys(df, :a)
KeySet for a Dict{String, Tuple{Any, Any}} with 1 entry. Keys:
"name"
julia> colmetadata(df, :a, "name")
"example"
julia> colmetadata(df, :a, "name", style=true)
("example", :note)
julia> emptycolmetadata!(df, :a);
julia> colmetadatakeys(df)
()