Functions
Base.Iterators.only
Base.append!
Base.copy
Base.delete!
Base.eachcol
Base.eachrow
Base.empty
Base.empty!
Base.filter
Base.filter!
Base.first
Base.get
Base.hcat
Base.isapprox
Base.issorted
Base.keys
Base.last
Base.length
Base.names
Base.ndims
Base.pairs
Base.parent
Base.permutedims
Base.propertynames
Base.push!
Base.reduce
Base.repeat
Base.show
Base.similar
Base.size
Base.sort
Base.sort!
Base.sortperm
Base.unique
Base.unique!
Base.values
Base.vcat
DataAPI.describe
DataFrames.allowmissing!
DataFrames.antijoin
DataFrames.combine
DataFrames.completecases
DataFrames.crossjoin
DataFrames.disallowmissing!
DataFrames.dropmissing
DataFrames.dropmissing!
DataFrames.flatten
DataFrames.groupby
DataFrames.groupcols
DataFrames.groupindices
DataFrames.innerjoin
DataFrames.insertcols!
DataFrames.leftjoin
DataFrames.mapcols
DataFrames.mapcols!
DataFrames.ncol
DataFrames.nonunique
DataFrames.nrow
DataFrames.order
DataFrames.outerjoin
DataFrames.rename
DataFrames.rename!
DataFrames.repeat!
DataFrames.rightjoin
DataFrames.rownumber
DataFrames.select
DataFrames.select!
DataFrames.semijoin
DataFrames.stack
DataFrames.subset
DataFrames.subset!
DataFrames.transform
DataFrames.transform!
DataFrames.unstack
DataFrames.valuecols
Missings.allowmissing
Missings.disallowmissing
Constructing data frames
Base.copy
— Functioncopy(df::DataFrame; copycols::Bool=true)
Copy data frame df
. If copycols=true
(the default), return a new DataFrame
holding copies of column vectors in df
. If copycols=false
, return a new DataFrame
sharing column vectors with df
.
copy(dfr::DataFrameRow)
Construct a NamedTuple
with the same contents as the DataFrameRow
. This method returns a NamedTuple
so that the returned object is not affected by changes to the parent data frame of which dfr
is a view.
copy(key::GroupKey)
Construct a NamedTuple
with the same contents as the GroupKey
.
Base.similar
— Functionsimilar(df::AbstractDataFrame, rows::Integer=nrow(df))
Create a new DataFrame
with the same column names and column element types as df
. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df
.
Summary information
DataAPI.describe
— Functiondescribe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)
Return descriptive statistics for a data frame as a new DataFrame
where each row represents a variable and each column a summary statistic.
Arguments
df
: theAbstractDataFrame
stats::Union{Symbol, Pair}...
: the summary statistics to report. Arguments can be:- A symbol from the list
:mean
,:std
,:min
,:q25
,:median
,:q75
,:max
,:eltype
,:nunique
,:first
,:last
, and:nmissing
. The default statistics used are:mean
,:min
,:median
,:max
,:nmissing
, and:eltype
. :all
as the onlySymbol
argument to return all statistics.- A
function => name
pair wherename
is aSymbol
or string. This will create a column of summary statistics with the provided name.
- A symbol from the list
cols
: a keyword argument allowing to select only a subset or transformation of columns fromdf
to describe. Can be any column selector or transformation accepted byselect
.
Details
For Real
columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real
, describe
will attempt to calculate all statistics, using nothing
as a fall-back in the case of an error.
When stats
contains :nunique
, describe
will report the number of unique values in a column. If a column's base type derives from Real
, :nunique
will return nothing
s.
Missing values are filtered in the calculation of all statistics, however the column :nmissing
will report the number of missing values of that variable.
If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing
: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.
Examples
julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
julia> describe(df)
3×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ i 5.5 1 5.5 10 0 Int64
2 │ x 0.55 0.1 0.55 1.0 0 Float64
3 │ y a j 0 Char
julia> describe(df, :min, :max)
3×3 DataFrame
Row │ variable min max
│ Symbol Any Any
─────┼────────────────────
1 │ i 1 10
2 │ x 0.1 1.0
3 │ y a j
julia> describe(df, :min, sum => :sum)
3×3 DataFrame
Row │ variable min sum
│ Symbol Any Union…
─────┼───────────────────────
1 │ i 1 55
2 │ x 0.1 5.5
3 │ y a
julia> describe(df, :min, sum => :sum, cols=:x)
1×3 DataFrame
Row │ variable min sum
│ Symbol Float64 Float64
─────┼────────────────────────────
1 │ x 0.1 5.5
Base.length
— Functionlength(dfr::DataFrameRow)
Return the number of elements of dfr
.
See also: size
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> length(dfr)
2
DataFrames.ncol
— Functionnrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame
df
.
See also size
.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
Base.ndims
— Functionndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})
Return the number of dimensions of a data frame, which is always 2
.
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})
Return the number of dimensions of a data frame row, which is always 1
.
DataFrames.nrow
— Functionnrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)
Return the number of rows or columns in an AbstractDataFrame
df
.
See also size
.
Examples
julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));
julia> size(df)
(10, 3)
julia> nrow(df)
10
julia> ncol(df)
3
DataFrames.rownumber
— Functionrownumber(dfr::DataFrameRow)
Return a row number in the AbstractDataFrame
that dfr
was created from.
Note that this differs from the first element in the tuple returned by parentindices
. The latter gives the row number in the parent(dfr)
, which is the source DataFrame
where data that dfr
gives access to is stored.
Examples
julia> df = DataFrame(reshape(1:12, 3, 4), :auto)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfr = df[2, :]
DataFrameRow
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
2 │ 2 5 8 11
julia> rownumber(dfr)
2
julia> parentindices(dfr)
(2, Base.OneTo(4))
julia> parent(dfr)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
julia> dfv = @view df[2:3, 1:3]
2×3 SubDataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 5 8
2 │ 3 6 9
julia> dfrv = dfv[2, :]
DataFrameRow
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
3 │ 3 6 9
julia> rownumber(dfrv)
2
julia> parentindices(dfrv)
(3, 1:3)
julia> parent(dfrv)
3×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
Base.show
— Functionshow([io::IO, ]df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true,
eltypes::Bool = true,
truncate::Int = 32,
kwargs...)
Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.
If io
is omitted, the result is printed to stdout
, and allrows
, allcols
and allgroups
default to false
.
Arguments
io::IO
: The I/O stream to whichdf
will be printed.df::AbstractDataFrame
: The data frame to print.allrows::Bool
: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only ifio
does not have theIOContext
propertylimit
set.allcols::Bool
: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only ifio
does not have theIOContext
propertylimit
set.allgroups::Bool
: Whether to print all groups rather than the first and last, whendf
is aGroupedDataFrame
. By default this is the case only ifio
does not have theIOContext
propertylimit
set.rowlabel::Symbol = :Row
: The label to use for the column containing row numbers.summary::Bool = true
: Whether to print a brief string summary of the data frame.eltypes::Bool = true
: Whether to print the column types under column names.truncate::Int = 32
: the maximal display width the output can use before being truncated (in thetextwidth
sense, excluding…
). Iftruncate
is 0 or less, no truncation is applied.kwargs...
: Any keyword argument supported by the functionpretty_table
of PrettyTables.jl can be passed here to customize the output.
Examples
julia> using DataFrames
julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);
julia> show(df, show_row_number=false)
3×2 DataFrame
A B
Int64 String
───────────────
1 x
2 y
3 z
show(io::IO, mime::MIME, df::AbstractDataFrame)
Render a data frame to an I/O stream in MIME type mime
.
Arguments
io::IO
: The I/O stream to whichdf
will be printed.mime::MIME
: supported MIME types are:"text/plain"
,"text/html"
,"text/latex"
,"text/csv"
,"text/tab-separated-values"
(the last two MIME types do not support showing#undef
values)df::AbstractDataFrame
: The data frame to print.
Additionally selected MIME types support passing the following keyword arguments:
- MIME type
"text/plain"
accepts all listed keyword arguments and therir behavior is identical as forshow(::IO, ::AbstractDataFrame)
- MIME type
"text/html"
acceptssummary
keyword argument which allows to choose whether to print a brief string summary of the data frame.
Examples
julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
& A & B\\
\hline
& Int64 & String\\
\hline
1 & 1 & x \\
2 & 2 & y \\
3 & 3 & z \\
\end{tabular}
14
julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
Base.size
— Functionsize(df::AbstractDataFrame[, dim])
Return a tuple containing the number of rows and columns of df
. Optionally a dimension dim
can be specified, where 1
corresponds to rows and 2
corresponds to columns.
Examples
julia> df = DataFrame(a=1:3, b='a':'c');
julia> size(df)
(3, 2)
julia> size(df, 1)
3
size(dfr::DataFrameRow[, dim])
Return a 1-tuple containing the number of elements of dfr
. If an optional dimension dim
is specified, it must be 1
, and the number of elements is returned directly as a number.
See also: length
Examples
julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
DataFrameRow
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> size(dfr)
(2,)
julia> size(dfr, 1)
2
Working with column names
Base.names
— Functionnames(df::AbstractDataFrame)
names(df::AbstractDataFrame, cols)
Return a freshly allocated Vector{String}
of names of columns contained in df
.
If cols
is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Cols
, Not
, and Between
). cols
can be:
- any column selector (
Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers) - a
Type
, in which case names of columns whoseeltype
is a subtype ofT
are returned - a
Function
predicate taking the column name as a string and returningtrue
for columns that should be kept
See also propertynames
which returns a Vector{Symbol}
.
Base.propertynames
— Functionpropertynames(df::AbstractDataFrame)
Return a freshly allocated Vector{Symbol}
of names of columns contained in df
.
DataFrames.rename
— Functionrename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)
Create a new data frame that is a copy of df
with changed column names. Each name is changed at most once. Permutation of names is allowed.
Arguments
df
: theAbstractDataFrame
; if it is aSubDataFrame
then renaming is only allowed if it was created using:
as a column selector.d
: anAbstractDict
or anAbstractVector
ofPair
s that maps the original names or column numbers to new namesf
: a function which for each column takes the old name as aString
and returns the new name that gets converted to aSymbol
vals
: new column names as a vector ofSymbol
s orAbstractString
s of the same length as the number of columns indf
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
If pairs are passed to rename
(as positional arguments or in a dictionary or a vector) then:
from
value can be aSymbol
, anAbstractString
or anInteger
;to
value can be aSymbol
or anAbstractString
.
Mixing symbols and strings in to
and from
is not allowed.
See also: rename!
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
Row │ i y x
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename(uppercase, df)
1×3 DataFrame
Row │ I X Y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
DataFrames.rename!
— Functionrename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)
Rename columns of df
in-place. Each name is changed at most once. Permutation of names is allowed.
Arguments
df
: theAbstractDataFrame
d
: anAbstractDict
or anAbstractVector
ofPair
s that maps the original names or column numbers to new namesf
: a function which for each column takes the old name as aString
and returns the new name that gets converted to aSymbol
vals
: new column names as a vector ofSymbol
s orAbstractString
s of the same length as the number of columns indf
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
If pairs are passed to rename!
(as positional arguments or in a dictionary or a vector) then:
from
value can be aSymbol
, anAbstractString
or anInteger
;to
value can be aSymbol
or anAbstractString
.
Mixing symbols and strings in to
and from
is not allowed.
See also: rename
Examples
julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
Row │ A X y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.
julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
Row │ a b a_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> rename!(uppercase, df)
1×3 DataFrame
Row │ A B A_1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
Mutating and transforming data frames and grouped data frames
Base.append!
— Functionappend!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))
Add the rows of df2
to the end of df
. If the second argument table
is not an AbstractDataFrame
then it is converted using DataFrame(table, copycols=false)
before being appended.
The exact behavior of append!
depends on the cols
argument:
- If
cols == :setequal
(this is the default) thendf2
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thendf2
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifdf2
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thendf2
may contain more columns thandf
, but all column names that are present indf
must be present indf2
and only these are used. - If
cols == :subset
thenappend!
behaves like for:intersect
but if some column is missing indf2
then amissing
value is pushed todf
. - If
cols == :union
thenappend!
adds columns missing indf
that are present indf2
, for columns present indf
but missing indf2
amissing
value is pushed.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
The above rule has the following exceptions:
- If
df
has no columns then copies of columns fromdf2
are added to it. - If
df2
has no columns then callingappend!
leavesdf
unchanged.
Please note that append!
must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
See also
Use push!
to add individual rows to a data frame and vcat
to vertically concatenate data frames.
Examples
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4.0:6.0, B=4:6)
3×2 DataFrame
Row │ A B
│ Float64 Int64
─────┼────────────────
1 │ 4.0 4
2 │ 5.0 5
3 │ 6.0 6
julia> append!(df1, df2);
julia> df1
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
DataFrames.combine
— Functioncombine(df::AbstractDataFrame, args...; renamecols::Bool=true)
combine(f::Callable, df::AbstractDataFrame; renamecols::Bool=true)
combine(gd::GroupedDataFrame, args...;
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
combine(f::Base.Callable, gd::GroupedDataFrame;
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
Create a new data frame that contains columns from df
or gd
specified by args
and return it. The result can have any number of rows that is determined by the values returned by passed transformations.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns. - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string). - a
nrow
ornrow => target_cols
form which efficiently computes the number of rows in a group; withouttarget_cols
the new column is called:nrow
, otherwise it must be single name (as aSymbol
or a string). - vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a DataFrame
with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
As a general rule if copycols=true
columns are copied and when copycols=false
columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true
. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
. As a special case in transform
and transform!
column renaming always copies columns to avoid storing aliased columns in the target data frame.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns.
If a GroupedDataFrame
is passed, a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions should therefore not modify global variables (i.e. they should be pure), or use locks to control parallel accesses. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> combine(df, :a => sum, nrow, renamecols=false)
1×2 DataFrame
Row │ a nrow
│ Int64 Int64
─────┼──────────────
1 │ 6 3
julia> combine(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> combine(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> combine(df, names(df) .=> [minimum maximum])
1×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
julia> using Statistics
julia> combine(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> combine(first, df)
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> combine(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
2 │ 2 8 2
3 │ 3 10 2
4 │ 4 12 2
julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 2
⋮
Last Group (1 row): a = 4
Row │ a c_sum nrow
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 12 2
julia> combine(gd) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
Row │ a x1
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 2 8
3 │ 3 10
4 │ 4 12
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 1.60944
2 │ 2 2.48491
3 │ 3 3.04452
4 │ 4 3.46574
julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 6
2 │ 2 2 8
3 │ 3 4 10
4 │ 4 2 12
julia> combine(gd) do sdf # dropping group when DataFrame() is returned
sdf.c[1] != 1 ? sdf : DataFrame()
end
6×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
3 │ 3 2 3
4 │ 3 2 7
5 │ 4 1 4
6 │ 4 1 8
auto-splatting, renaming and keepkeys
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 2 5 7
3 │ 1 2 3
4 │ 1 6 7
5 │ 2 3 5
6 │ 2 7 9
7 │ 1 4 5
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 2 2 5
3 │ 2 1 1 2
4 │ 2 1 1 6
5 │ 3 2 2 3
6 │ 3 2 2 7
7 │ 4 1 1 4
8 │ 4 1 1 8
julia> combine(gd, [:b, :c] .=> Ref) # preventing vector from being spread across multiple rows
4×3 DataFrame
Row │ a b_Ref c_Ref
│ Int64 SubArray… SubArray…
─────┼─────────────────────────────
1 │ 1 [2, 2] [1, 5]
2 │ 2 [1, 1] [2, 6]
3 │ 3 [2, 2] [3, 7]
4 │ 4 [1, 1] [4, 8]
julia> combine(gd, AsTable(Not(:a)) => Ref) # protecting result
4×2 DataFrame
Row │ a b_c_Ref
│ Int64 NamedTup…
─────┼─────────────────────────────────
1 │ 1 (b = [2, 2], c = [1, 5])
2 │ 2 (b = [1, 1], c = [2, 6])
3 │ 3 (b = [2, 2], c = [3, 7])
4 │ 4 (b = [1, 1], c = [4, 8])
julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 2 5 7
3 │ 2 1 2 3
4 │ 2 1 6 7
5 │ 3 2 3 5
6 │ 3 2 7 9
7 │ 4 1 4 5
8 │ 4 1 8 9
DataFrames.flatten
— Functionflatten(df::AbstractDataFrame, cols)
When columns cols
of data frame df
have iterable elements that define length
(for example a Vector
of Vector
s), return a DataFrame
where each element of each col
in cols
is flattened, meaning the column corresponding to col
becomes a longer vector where the original entries are concatenated. Elements of row i
of df
in columns other than cols
will be repeated according to the length of df[i, col]
. These lengths must therefore be the same for each col
in cols
, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame
will affect df
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
Examples
julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7, 8]
julia> flatten(df1, :b)
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Array…
─────┼──────────────────────
1 │ 1 1 [5, 6]
2 │ 1 2 [5, 6]
3 │ 2 3 [7, 8]
4 │ 2 4 [7, 8]
julia> flatten(df1, [:b, :c])
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 5
2 │ 1 2 6
3 │ 2 3 7
4 │ 2 4 8
julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
Row │ a b
│ Int64 Tuple…
─────┼───────────────────
1 │ 1 ("p", "q")
2 │ 2 ("r", "s")
julia> flatten(df2, :b)
4×2 DataFrame
Row │ a b
│ Int64 String
─────┼───────────────
1 │ 1 p
2 │ 1 q
3 │ 2 r
4 │ 2 s
julia> df3 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7]])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼───────────────────────
1 │ 1 [1, 2] [5, 6]
2 │ 2 [3, 4] [7]
julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c are not the same in row 2
Base.hcat
— Functionhcat(df::AbstractDataFrame...;
makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
makeunique::Bool=false, copycols::Bool=true)
Horizontally concatenate AbstractDataFrames
and optionally AbstractVector
s.
If AbstractVector
is passed then a column name for it is automatically generated as :x1
by default.
If makeunique=false
(the default) column names of passed objects must be unique. If makeunique=true
then duplicate column names will be suffixed with _i
(i
starting at 1 for the first duplicate).
If copycols=true
(the default) then the DataFrame
returned by hcat
will contain copied columns from the source data frames. If copycols=false
then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame
might lead to the corruption of the other object.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
Row │ A B A_1 B_1
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 1 4 4
2 │ 2 2 5 5
3 │ 3 3 6 6
julia> df3.A === df1.A
false
julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
julia> df3.A === df1.A
true
DataFrames.insertcols!
— Functioninsertcols!(df::DataFrame[, col], (name=>val)::Pair...;
makeunique::Bool=false, copycols::Bool=true)
Insert a column into a data frame in place. Return the updated DataFrame
. If col
is omitted it is set to ncol(df)+1
(the column is inserted as the last column).
Arguments
df
: the DataFrame to which we want to add columnscol
: a position at which we want to insert a column, passed as an integer or a column name (a string or aSymbol
); the column selected withcol
and columns following it are shifted to the right indf
after the operationname
: the name of the new columnval
: anAbstractVector
giving the contents of the new column or a value of any type other thanAbstractArray
which will be repeated to fill a new vector; As a particular rule a values stored in aRef
or a0
-dimensionalAbstractArray
are unwrapped and treated in the same way.makeunique
: Defines what to do ifname
already exists indf
; if it isfalse
an error will be thrown; if it istrue
a new unique name will be generated by adding a suffixcopycols
: whether vectors passed as columns should be copied
If val
is an AbstractRange
then the result of collect(val)
is inserted.
Examples
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> insertcols!(df, 1, :b => 'a':'c')
3×2 DataFrame
Row │ b a
│ Char Int64
─────┼─────────────
1 │ a 1
2 │ b 2
3 │ c 3
julia> insertcols!(df, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
Row │ b c c_1 a
│ Char Int64 Int64 Int64
─────┼───────────────────────────
1 │ a 2 3 1
2 │ b 3 4 2
3 │ c 4 5 3
DataFrames.mapcols
— Functionmapcols(f::Union{Function, Type}, df::AbstractDataFrame)
Return a DataFrame
where each column of df
is transformed using function f
. f
must return AbstractVector
objects all with the same length or scalars (all values other than AbstractVector
are considered to be a scalar).
Note that mapcols
guarantees not to reuse the columns from df
in the returned DataFrame
. If f
returns its argument then it gets copied before being stored.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols(x -> x.^2, df)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
DataFrames.mapcols!
— Functionmapcols!(f::Union{Function, Type}, df::DataFrame)
Update a DataFrame
in-place where each column of df
is transformed using function f
. f
must return AbstractVector
objects all with the same length or scalars (all values other than AbstractVector
are considered to be a scalar).
Note that mapcols!
reuses the columns from df
if they are returned by f
.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> mapcols!(x -> x.^2, df);
julia> df
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
4 │ 16 196
Base.push!
— Functionpush!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))
Add in-place one row at the end of df
taking the values from row
.
Column types of df
are preserved, and new values are converted if necessary. An error is thrown if conversion fails.
If row
is neither a DataFrameRow
, NamedTuple
nor AbstractDict
then it must be a Tuple
or an AbstractArray
and columns are matched by order of appearance. In this case row
must contain the same number of elements as the number of columns in df
.
If row
is a DataFrameRow
, NamedTuple
or AbstractDict
then values in row
are matched to columns in df
based on names. The exact behavior depends on the cols
argument value in the following way:
- If
cols == :setequal
(this is the default) thenrow
must contain exactly the same columns asdf
(but possibly in a different order). - If
cols == :orderequal
thenrow
must contain the same columns in the same order (forAbstractDict
this option requires thatkeys(row)
matchespropertynames(df)
to allow for support of ordered dicts; however, ifrow
is aDict
an error is thrown as it is an unordered collection). - If
cols == :intersect
thenrow
may contain more columns thandf
, but all column names that are present indf
must be present inrow
and only they are used to populate a new row indf
. - If
cols == :subset
thenpush!
behaves like for:intersect
but if some column is missing inrow
then amissing
value is pushed todf
. - If
cols == :union
then columns missing indf
that are present inrow
are added todf
(usingmissing
for existing rows) and amissing
value is pushed to columns missing inrow
that are present indf
.
If promote=true
and element type of a column present in df
does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df
. If promote=false
an error is thrown.
As a special case, if df
has no columns and row
is a NamedTuple
or DataFrameRow
, columns are created for all values in row
, using their names and order.
Please note that push!
must not be used on a DataFrame
that contains columns that are aliases (equal when compared with ===
).
Examples
julia> df = DataFrame(A=1:3, B=1:3);
julia> push!(df, (true, false))
4×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 0
julia> push!(df, df[1, :])
5×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 0
5 │ 1 1
julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 0
5 │ 1 1
6 │ 1 0
julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
Row │ A B C
│ Float64 Int64? Float64?
─────┼─────────────────────────────
1 │ 1.0 1 missing
2 │ 2.0 2 missing
3 │ 3.0 3 missing
4 │ 1.0 0 missing
5 │ 1.0 1 missing
6 │ 1.0 0 missing
7 │ 1.0 missing 1.0
julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
Row │ A B C
│ Float64? Int64? Float64?
─────┼───────────────────────────────
1 │ 1.0 1 missing
2 │ 2.0 2 missing
3 │ 3.0 3 missing
4 │ 1.0 0 missing
5 │ 1.0 1 missing
6 │ 1.0 0 missing
7 │ 1.0 missing 1.0
8 │ missing missing missing
Base.reduce
— Functionreduce(::typeof(vcat),
dfs::Union{AbstractVector{<:AbstractDataFrame},
Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)
Efficiently reduce the given vector or tuple of AbstractDataFrame
s with vcat
.
The column order, names, and types of the resulting DataFrame
, and the behavior of cols
and source
keyword arguments follow the rules specified for vcat
of AbstractDataFrame
s.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> reduce(vcat, (df1, df2))
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> reduce(vcat, [df1, df2, df3], cols=:union, source=:source)
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
Base.repeat
— Functionrepeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)
Construct a data frame by repeating rows in df
. inner
specifies how many times each row is repeated, and outer
specifies how many times the full set of rows is repeated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat(df::AbstractDataFrame, count::Integer)
Construct a data frame by repeating each row in df
the number of times specified by count
.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
DataFrames.repeat!
— Functionrepeat!(df::DataFrame; inner::Integer = 1, outer::Integer = 1)
Update a data frame df
in-place by repeating its rows. inner
specifies how many times each row is repeated, and outer
specifies how many times the full set of rows is repeated. Columns of df
are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat!(df, inner = 2, outer = 3);
julia> df
12×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 1 3
3 │ 2 4
4 │ 2 4
5 │ 1 3
6 │ 1 3
7 │ 2 4
8 │ 2 4
9 │ 1 3
10 │ 1 3
11 │ 2 4
12 │ 2 4
repeat!(df::DataFrame, count::Integer)
Update a data frame df
in-place by repeating its rows the number of times specified by count
. Columns of df
are freshly allocated.
Example
julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> repeat(df, 2)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
3 │ 1 3
4 │ 2 4
DataFrames.select
— Functionselect(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true)
select(args::Callable, df::DataFrame; renamecols::Bool=true)
select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true,
ungroup::Bool=true, renamecols::Bool=true)
select(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
Create a new data frame that contains columns from df
or gd
specified by args
and return it. The result is guaranteed to have the same number of rows as df
, except when no columns are selected (in which case the result has zero rows).
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns. - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string). - a
nrow
ornrow => target_cols
form which efficiently computes the number of rows in a group; withouttarget_cols
the new column is called:nrow
, otherwise it must be single name (as aSymbol
or a string). - vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a DataFrame
with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
As a general rule if copycols=true
columns are copied and when copycols=false
columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true
. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
. As a special case in transform
and transform!
column renaming always copies columns to avoid storing aliased columns in the target data frame.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns.
If a GroupedDataFrame
is passed, a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions should therefore not modify global variables (i.e. they should be pure), or use locks to control parallel accesses. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
Keyword arguments
copycols::Bool=true
: whether columns of the source data frame should be copied if no transformation is applied to them.renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> select(df, 2)
3×1 DataFrame
Row │ b
│ Int64
─────┼───────
1 │ 4
2 │ 5
3 │ 6
julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
Row │ c b
│ Float64 Int64
─────┼─────────────────
1 │ 0.841471 4
2 │ 0.909297 5
3 │ 0.14112 6
julia> select(df, :, [:a, :b] => (a, b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
Row │ a b a_b_function
│ Int64 Int64 Float64
─────┼────────────────────────────
1 │ 1 4 0.0
2 │ 2 5 2.0
3 │ 3 6 4.0
julia> select(df, names(df) .=> [minimum maximum])
3×4 DataFrame
Row │ a_minimum b_minimum a_maximum b_maximum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────
1 │ 1 4 3 6
2 │ 1 4 3 6
3 │ 1 4 3 6
julia> using Statistics
julia> select(df, AsTable(:) => ByRow(mean), renamecols=false)
3×1 DataFrame
Row │ a_b
│ Float64
─────┼─────────
1 │ 2.5
2 │ 3.5
3 │ 4.5
julia> select(first, df)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 1 4
3 │ 1 4
julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
julia> select(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats,
AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable)
3×3 DataFrame
Row │ stats mean std
│ NamedTup… Float64 Float64
─────┼───────────────────────────────────────────
1 │ (mean = 4.0, std = 3.0) 4.0 3.0
2 │ (mean = 5.0, std = 3.0) 5.0 3.0
3 │ (mean = 6.0, std = 3.0) 6.0 3.0
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8)
8×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 2 1 4
5 │ 2 2 5
6 │ 1 1 6
7 │ 1 2 7
8 │ 2 1 8
julia> gd = groupby(df, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 1 2
3 │ 1 2 3
4 │ 1 1 6
5 │ 1 2 7
⋮
Last Group (3 rows): a = 2
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 4
2 │ 2 2 5
3 │ 2 1 8
specifying a name for target column
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
8×2 DataFrame
Row │ a sum_log_c
│ Int64 Float64
─────┼──────────────────
1 │ 1 5.52943
2 │ 1 5.52943
3 │ 1 5.52943
4 │ 2 5.07517
5 │ 2 5.07517
6 │ 1 5.52943
7 │ 1 5.52943
8 │ 2 5.07517
julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
Row │ a b_sum c_sum
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 8 19
2 │ 1 8 19
3 │ 1 8 19
4 │ 2 4 17
5 │ 2 4 17
6 │ 1 8 19
7 │ 1 8 19
8 │ 2 4 17
multiple arguments, renaming and keepkeys
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
Row │ b1 c1 b_c_+
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 3
2 │ 1 2 3
3 │ 2 3 5
4 │ 1 4 5
5 │ 2 5 7
6 │ 1 6 7
7 │ 2 7 9
8 │ 1 8 9
broadcasting and column expansion
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a);
julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
Row │ a b min max
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 2
2 │ 1 1 1 2
3 │ 1 2 2 3
4 │ 2 1 1 4
5 │ 2 2 2 5
6 │ 1 1 1 6
7 │ 1 2 2 7
8 │ 2 1 1 8
julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
8×4 DataFrame
Row │ a b c b_c
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 1 3
2 │ 1 1 2 3
3 │ 1 2 3 5
4 │ 2 1 4 5
5 │ 2 2 5 7
6 │ 1 1 6 7
7 │ 1 2 7 9
8 │ 2 1 8 9
DataFrames.select!
— Functionselect!(df::DataFrame, args...; renamecols::Bool=true)
select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true)
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
Mutate df
or gd
in place to retain only columns or transformations specified by args...
and return it. The result is guaranteed to have the same number of rows as df
or parent of gd
, except when no columns are selected (in which case the result has zero rows).
If gd
is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame
objects constructed using the same parent data frame they might get corrupt.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns. - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string). - a
nrow
ornrow => target_cols
form which efficiently computes the number of rows in a group; withouttarget_cols
the new column is called:nrow
, otherwise it must be single name (as aSymbol
or a string). - vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a DataFrame
with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
As a general rule if copycols=true
columns are copied and when copycols=false
columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true
. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
. As a special case in transform
and transform!
column renaming always copies columns to avoid storing aliased columns in the target data frame.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns.
If a GroupedDataFrame
is passed, a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions should therefore not modify global variables (i.e. they should be pure), or use locks to control parallel accesses. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.
See select
for examples.
DataFrames.transform
— Functiontransform(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true)
transform(f::Callable, df::DataFrame; renamecols::Bool=true)
transform(gd::GroupedDataFrame, args...; copycols::Bool=true,
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
Create a new data frame that contains columns from df
or gd
plus columns specified by args
and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select(df, :, args...)
or select(gd, :, args...)
.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns. - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string). - a
nrow
ornrow => target_cols
form which efficiently computes the number of rows in a group; withouttarget_cols
the new column is called:nrow
, otherwise it must be single name (as aSymbol
or a string). - vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a DataFrame
with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
As a general rule if copycols=true
columns are copied and when copycols=false
columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true
. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
. As a special case in transform
and transform!
column renaming always copies columns to avoid storing aliased columns in the target data frame.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns.
If a GroupedDataFrame
is passed, a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions should therefore not modify global variables (i.e. they should be pure), or use locks to control parallel accesses. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
Keyword arguments
copycols::Bool=true
: whether columns of the source data frame should be copied if no transformation is applied to them.renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.keepkeys::Bool=true
: whether grouping columns ofgd
should be kept in the returned data frame.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.
Note that when the first argument is a GroupedDataFrame
, keepkeys=false
is needed to be able to return a different value for the grouping column:
julia> gdf = groupby(DataFrame(x=1:2), :x)
GroupedDataFrame with 2 groups based on key: x
First Group (1 row): x = 1
Row │ x
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): x = 2
Row │ x
│ Int64
─────┼───────
1 │ 2
julia> transform(gdf, x -> (x=10,), keepkeys=false)
2×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 10
2 │ 10
julia> transform(gdf, x -> (x=10,), keepkeys=true)
ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x
See select
for more examples.
DataFrames.transform!
— Functiontransform!(df::DataFrame, args...; renamecols::Bool=true)
transform!(args::Callable, df::DataFrame; renamecols::Bool=true)
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
Mutate df
or gd
in place to add columns specified by args...
and return it. The result is guaranteed to have the same number of rows as df
. Equivalent to select!(df, :, args...)
or select!(gd, :, args...)
.
Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.
All these operations are supported both for AbstractDataFrame
(when split and combine steps are skipped) and GroupedDataFrame
. Technically, AbstractDataFrame
is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys
and ungroup
keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.
In order to perform operations by groups you first need to create a GroupedDataFrame
object from your data frame using the groupby
function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.
Operations can then be applied on each group using one of the following functions:
combine
: does not put restrictions on number of rows returned, the order of rows is specified by the order of groups inGroupedDataFrame
; it is typically used to compute summary statistics by group;select
: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns;select!
is an in-place version ofselect
;transform
: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns;transform!
is an in-place version oftransform
.
All these functions take a specification of one or more functions to apply to each subset of the DataFrame
. This specification can be of the following forms:
- standard column selectors (integers,
Symbol
s, strings, vectors of integers, vectors ofSymbol
s, vectors of strings,All
,Cols
,:
,Between
,Not
and regular expressions) - a
cols => function
pair indicating thatfunction
should be called with positional arguments holding columnscols
, which can be any valid column selector; in this case target column name is automatically generated and it is assumed thatfunction
returns a single value or a vector; the generated name is created by concatenating source column name andfunction
name by default (see examples below). - a
cols => function => target_cols
form additionally explicitly specifying the target column or columns. - a
col => target_cols
pair, which renames the columncol
totarget_cols
, which must be single name (as aSymbol
or a string). - a
nrow
ornrow => target_cols
form which efficiently computes the number of rows in a group; withouttarget_cols
the new column is called:nrow
, otherwise it must be single name (as aSymbol
or a string). - vectors or matrices containing transformations specified by the
Pair
syntax described in points 2 to 5 - a function which will be called with a
SubDataFrame
corresponding to each group; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which caseSubDataFrame
avoids excessive compilation)
All functions have two types of signatures. One of them takes a GroupedDataFrame
as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function
or a Type
is passed as the first argument and a GroupedDataFrame
as the second argument (similar to map
).
As a special rule, with the cols => function
and cols => function => target_cols
syntaxes, if cols
is wrapped in an AsTable
object then a NamedTuple
containing columns selected by cols
is passed to function
.
What is allowed for function
to return is determined by the target_cols
value:
- If both
cols
andtarget_cols
are omitted (so only afunction
is passed), then returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
will produce multiple columns in the result. Returning any other value produces a single column. - If
target_cols
is aSymbol
or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, aNamedTuple
, or aDataFrameRow
raises an error. - If
target_cols
is a vector ofSymbol
s or strings orAsTable
it is assumed thatfunction
returns multiple columns. Iffunction
returns one ofAbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
then rules described in point 1 above apply. Iffunction
returns anAbstractVector
then each element of this vector must support thekeys
function, which must return a collection ofSymbol
s, strings or integers; the return value ofkeys
must be identical for all elements. Then as many columns are created as there are elements in the return value of thekeys
function. Iftarget_cols
isAsTable
then their names are set to be equal to the key names except ifkeys
returns integers, in which case they are prefixed byx
(so the column names are e.g.x1
,x2
, ...). Iftarget_cols
is a vector ofSymbol
s or strings then column names produced using the rules above are ignored and replaced bytarget_cols
(the number of columns must be the same as the length oftarget_cols
in this case). Iffun
returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and theTables.columntable
function is called on it to get the resulting columns and their names. The names are retained whentarget_cols
isAsTable
and are replaced iftarget_cols
is a vector ofSymbol
s or strings.
In all of these cases, function
can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and then treated as a single row.
select
/select!
and transform
/transform!
always return a DataFrame
with the same number and order of rows as the source (even if GroupedDataFrame
had its groups reordered).
For combine
, rows in the returned object appear in the order of groups in the GroupedDataFrame
. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame()
or NamedTuple()
is returned, in which case a given group is skipped.
It is allowed to mix single values and vectors if multiple transformations are requested. In this case single value will be repeated to match the length of columns specified by returned vectors.
To apply function
to each row instead of whole columns, it can be wrapped in a ByRow
struct. cols
can be any column indexing syntax, in which case function
will be passed one argument for each of the columns specified by cols
or a NamedTuple
of them if specified columns are wrapped in AsTable
. If ByRow
is used it is allowed for cols
to select an empty set of columns, in which case function
is called for each row without any arguments and an empty NamedTuple
is passed if empty set of columns is wrapped in AsTable
.
If a collection of column names is passed then requesting duplicate column names in target data frame are accepted (e.g. select!(df, [:a], :, r"a")
is allowed) and only the first occurrence is used. In particular a syntax to move column :col
to the first position in the data frame is select!(df, :col, :)
. On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a)
or select!(df, :a, :a => ByRow(sin) => :a)
are not allowed.
As a general rule if copycols=true
columns are copied and when copycols=false
columns are reused if possible. Note, however, that including the same column several times in the data frame via renaming or transformations that return the same object without copying may create column aliases even if copycols=true
. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c)
. As a special case in transform
and transform!
column renaming always copies columns to avoid storing aliased columns in the target data frame.
If df
is a SubDataFrame
and copycols=true
then a DataFrame
is returned and the same copying rules apply as for a DataFrame
input: this means in particular that selected columns will be copied. If copycols=false
, a SubDataFrame
is returned without copying columns.
If a GroupedDataFrame
is passed, a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum
and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions should therefore not modify global variables (i.e. they should be pure), or use locks to control parallel accesses. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame
inputs.
Keyword arguments
renamecols::Bool=true
: whether in thecols => function
form automatically generated column names should include the name of transformation functions or not.ungroup::Bool=true
: whether the return value of the operation ongd
should be a data frame or aGroupedDataFrame
.
See select
for examples.
Base.vcat
— Functionvcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol},
AbstractVector{<:AbstractString}}=:setequal,
source::Union{Nothing, Symbol, AbstractString,
Pair{<:Union{Symbol, AbstractString}, <:AbstractVector}}=nothing)
Vertically concatenate AbstractDataFrame
s.
The cols
keyword argument determines the columns of the returned data frame:
:setequal
: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.:orderequal
: require all data frames to have the same column names and in the same order.:intersect
: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.:union
: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled withmissing
where necessary.- A vector of
Symbol
s or strings: only listed columns are kept. Columns not present in some data frames are filled withmissing
where necessary.
The source
keyword argument, if not nothing
(the default), specifies the additional column to be added in the last position in the resulting data frame that will identify the source data frame. It can be a Symbol
or an AbstractString
, in which case the identifier will be the number of the passed source data frame, or a Pair
consisting of a Symbol
or an AbstractString
and of a vector specifying the data frame identifiers (which do not have to be unique). The name of the source column is not allowed to be present in any source data frame.
The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.
The element types of columns are determined using promote_type
, as with vcat
for AbstractVector
s.
vcat
ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat
onto it.
Example
julia> df1 = DataFrame(A=1:3, B=1:3)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> df2 = DataFrame(A=4:6, B=4:6)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 4
2 │ 5 5
3 │ 6 6
julia> df3 = DataFrame(A=7:9, C=7:9)
3×2 DataFrame
Row │ A C
│ Int64 Int64
─────┼──────────────
1 │ 7 7
2 │ 8 8
3 │ 9 9
julia> df4 = DataFrame()
0×0 DataFrame
julia> vcat(df1, df2)
6×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
5 │ 5 5
6 │ 6 6
julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
Row │ A B C
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 missing
2 │ 2 2 missing
3 │ 3 3 missing
4 │ 7 missing 7
5 │ 8 missing 8
6 │ 9 missing 9
julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 7
5 │ 8
6 │ 9
julia> vcat(df4, df1)
3×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
julia> vcat(df1, df2, df3, df4, cols=:union, source="source")
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Int64
─────┼─────────────────────────────────
1 │ 1 1 missing 1
2 │ 2 2 missing 1
3 │ 3 3 missing 1
4 │ 4 4 missing 2
5 │ 5 5 missing 2
6 │ 6 6 missing 2
7 │ 7 missing 7 3
8 │ 8 missing 8 3
9 │ 9 missing 9 3
julia> vcat(df1, df2, df4, df3, cols=:union, source=:source => 'a':'d')
9×4 DataFrame
Row │ A B C source
│ Int64 Int64? Int64? Char
─────┼─────────────────────────────────
1 │ 1 1 missing a
2 │ 2 2 missing a
3 │ 3 3 missing a
4 │ 4 4 missing b
5 │ 5 5 missing b
6 │ 6 6 missing b
7 │ 7 missing 7 d
8 │ 8 missing 8 d
9 │ 9 missing 9 d
Reshaping data frames between tall and wide formats
DataFrames.stack
— Functionstack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=String)
Stack a data frame df
, i.e. convert it from wide to long format.
Return the long-format DataFrame
with: columns for each of the id_vars
, column value_name
(:value
by default) holding the values of the stacked columns (measure_vars
), and column variable_name
(:variable
by default) a vector holding the name of the corresponding measure_vars
variable.
If view=true
then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors
that return views into the original data frame.
Arguments
df
: the AbstractDataFrame to be stackedmeasure_vars
: the columns to be stacked (the measurement variables), as a column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). If neithermeasure_vars
orid_vars
are given,measure_vars
defaults to all floating point columns.id_vars
: the identifier columns that are repeated during stacking, as a column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers). Defaults to all variables that are notmeasure_vars
variable_name
: the name (Symbol
or string) of the new stacked column that shall hold the names of each ofmeasure_vars
value_name
: the name (Symbol
or string) of the new stacked column containing the values from each ofmeasure_vars
view
: whether the stacked data frame should be a view rather than contain freshly allocated vectors.variable_eltype
: determines the element type of columnvariable_name
. By default aPooledArray{String}
is created. Ifvariable_eltype=Symbol
aPooledVector{Symbol}
is created, and ifvariable_eltype=CategoricalValue{String}
aCategoricalArray{String}
is produced (callusing CategoricalArrays
first if needed) Passing any other typeT
will produce aPooledVector{T}
column as long as it supports conversion fromString
. Whenview=true
, aRepeatedVector{T}
is produced.
Examples
julia> df = DataFrame(a = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
6×5 DataFrame
Row │ a b c d e
│ Int64 Int64 Int64 Int64 String
─────┼────────────────────────────────────
1 │ 1 1 1 1 a
2 │ 1 1 1 2 b
3 │ 2 1 1 3 c
4 │ 2 2 1 4 d
5 │ 3 2 1 5 e
6 │ 3 2 1 6 f
julia> stack(df, [:c, :d])
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, [:c, :d], [:a])
12×3 DataFrame
Row │ a variable value
│ Int64 String Int64
─────┼────────────────────────
1 │ 1 c 1
2 │ 1 c 1
3 │ 2 c 1
4 │ 2 c 1
5 │ 3 c 1
6 │ 3 c 1
7 │ 1 d 1
8 │ 1 d 2
9 │ 2 d 3
10 │ 2 d 4
11 │ 3 d 5
12 │ 3 d 6
julia> stack(df, Not([:a, :b, :e]))
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, Not([:a, :b, :e]), variable_name=:somemeasure)
12×5 DataFrame
Row │ a b e somemeasure value
│ Int64 Int64 String String Int64
─────┼──────────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
DataFrames.unstack
— Functionunstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity,
allowmissing::Bool=false, allowduplicates::Bool=false)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity,
allowmissing::Bool=false, allowduplicates::Bool=false)
unstack(df::AbstractDataFrame; renamecols::Function=identity,
allowmissing::Bool=false, allowduplicates::Bool=false)
Unstack data frame df
, i.e. convert it from long to wide format.
Row and column keys will be ordered in the order of their first appearance.
Positional arguments
df
: the AbstractDataFrame to be unstackedrowkeys
: the columns with a unique key for each row, if not given, find a key by grouping on anything not acolkey
orvalue
. Can be any column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers).colkey
: the column (Symbol
, string or integer) holding the column names in wide format, defaults to:variable
value
: the value column (Symbol
, string or integer), defaults to:value
Keyword arguments
renamecols
: a function called on each unique value incolkey
; it must return the name of the column to be created (typically as a string or aSymbol
). Duplicates in resulting names when converted toSymbol
are not allowed. By default no transformation is performed.allowmissing
: iffalse
(the default) then an error will be thrown ifcolkey
containsmissing
values; iftrue
then a column referring tomissing
value will be created.allowduplicates
: iffalse
(the default) then an error an error will be thrown if combination ofrowkeys
andcolkey
contains duplicate entries; iftrue
then then the last encounteredvalue
will be retained.
Examples
julia> wide = DataFrame(id = 1:6,
a = repeat(1:3, inner = 2),
b = repeat(1.0:2.0, inner = 3),
c = repeat(1.0:1.0, inner = 6),
d = repeat(1.0:3.0, inner = 2))
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64 Float64 Float64
─────┼─────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> long = stack(wide)
18×4 DataFrame
Row │ id a variable value
│ Int64 Int64 String Float64
─────┼─────────────────────────────────
1 │ 1 1 b 1.0
2 │ 2 1 b 1.0
3 │ 3 2 b 1.0
4 │ 4 2 b 2.0
5 │ 5 3 b 2.0
6 │ 6 3 b 2.0
7 │ 1 1 c 1.0
8 │ 2 1 c 1.0
⋮ │ ⋮ ⋮ ⋮ ⋮
12 │ 6 3 c 1.0
13 │ 1 1 d 1.0
14 │ 2 1 d 1.0
15 │ 3 2 d 2.0
16 │ 4 2 d 2.0
17 │ 5 3 d 3.0
18 │ 6 3 d 3.0
3 rows omitted
julia> unstack(long)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value)
6×4 DataFrame
Row │ id b c d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
julia> unstack(long, [:id, :a], :variable, :value)
6×5 DataFrame
Row │ id a b c d
│ Int64 Int64 Float64? Float64? Float64?
─────┼────────────────────────────────────────────
1 │ 1 1 1.0 1.0 1.0
2 │ 2 1 1.0 1.0 1.0
3 │ 3 2 1.0 1.0 2.0
4 │ 4 2 2.0 1.0 2.0
5 │ 5 3 2.0 1.0 3.0
6 │ 6 3 2.0 1.0 3.0
julia> unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
6×4 DataFrame
Row │ id _b _c _d
│ Int64 Float64? Float64? Float64?
─────┼─────────────────────────────────────
1 │ 1 1.0 1.0 1.0
2 │ 2 1.0 1.0 1.0
3 │ 3 1.0 1.0 2.0
4 │ 4 2.0 1.0 2.0
5 │ 5 2.0 1.0 3.0
6 │ 6 2.0 1.0 3.0
Note that there are some differences between the widened results above.
Base.permutedims
— Functionpermutedims(df::AbstractDataFrame, src_namescol::Union{Int, Symbol, AbstractString},
[dest_namescol::Union{Symbol, AbstractString}];
makeunique::Bool=false)
Turn df
on its side such that rows become columns and values in the column indexed by src_namescol
become the names of new columns. In the resulting DataFrame
, column names of df
will become the first column with name specified by dest_namescol
.
Arguments
df
: theAbstractDataFrame
src_namescol
: the column that will become the new header. This column's element type must beAbstractString
orSymbol
.dest_namescol
: the name of the first column in the returnedDataFrame
. Defaults to the same name assrc_namescol
.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
Note: The element types of columns in resulting DataFrame
(other than the first column, which always has element type String
) will depend on the element types of all input columns based on the result of promote_type
. That is, if the source data frame contains Int
and Float64
columns, resulting columns will have element type Float64
. If the source has Int
and String
columns, resulting columns will have element type Any
.
Examples
julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Float64 Int64 Bool
─────┼───────────────────────────────
1 │ x 1.0 3 true
2 │ y 2.0 4 false
julia> permutedims(df1, 1) # note the column types
3×3 DataFrame
Row │ a x y
│ String Float64 Float64
─────┼──────────────────────────
1 │ b 1.0 2.0
2 │ c 3.0 4.0
3 │ d 1.0 0.0
julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ a b c d
│ String Any Int64 Bool
─────┼───────────────────────────
1 │ x 1 3 true
2 │ y two 4 false
julia> permutedims(df2, 1, "different_name")
3×3 DataFrame
Row │ different_name x y
│ String Any Any
─────┼─────────────────────────────
1 │ b 1 two
2 │ c 3 4
3 │ d true false
Sorting
Base.issorted
— Functionissorted(df::AbstractDataFrame, cols;
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)
Test whether data frame df
sorted by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.
DataFrames.order
— Functionorder(col::ColumnIndex; kwargs...)
Specify sorting order for a column col
in a data frame. kwargs
can be lt
, by
, rev
, and order
with values following the rules defined in sort!
.
Examples
julia> df = DataFrame(x = [-3, -1, 0, 2, 4], y = 1:5)
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ -3 1
2 │ -1 2
3 │ 0 3
4 │ 2 4
5 │ 4 5
julia> sort(df, order(:x, rev=true))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 4 5
2 │ 2 4
3 │ 0 3
4 │ -1 2
5 │ -3 1
julia> sort(df, order(:x, by=abs))
5×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 0 3
2 │ -1 2
3 │ 2 4
4 │ -3 1
5 │ 4 5
Base.sort
— Functionsort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward, view::Bool=false)
Return a data frame containing the rows in df
sorted by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below).
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
See sort!
for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
Base.sort!
— Functionsort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Sort data frame df
by column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sort!(df, :x)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 1 c
3 │ 2 a
4 │ 3 b
julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
3 │ 1 c
4 │ 1 b
julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 c
2 │ 1 b
3 │ 2 a
4 │ 3 b
Base.sortperm
— Functionsortperm(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)
Return a permutation vector of row indices of data frame df
that puts them in sorted order according to column(s) cols
.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If alg
is nothing
(the default), the most appropriate algorithm is chosen automatically among TimSort
, MergeSort
and RadixSort
depending on the type of the sorting columns and on the number of rows in df
. If rev
is true
, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true)
in cols
, with c
the corresponding column index (see example below). See other methods for a description of other keyword arguments.
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> sortperm(df, :x)
4-element Vector{Int64}:
2
4
3
1
julia> sortperm(df, [:x, :y])
4-element Vector{Int64}:
4
2
3
1
julia> sortperm(df, [:x, :y], rev=true)
4-element Vector{Int64}:
1
3
2
4
julia> sortperm(df, [:x, order(:y, rev=true)])
4-element Vector{Int64}:
2
4
3
1
Joining
DataFrames.antijoin
— Functionantijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform an anti join of two data frame objects and return a DataFrame
containing the result. An anti join returns the subset of rows of df1
that do not match with the keys in df2
.
The order of rows in the result is undefined and may change in the future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, semijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job, on = :ID)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 3 Joe Blogs
DataFrames.crossjoin
— Functioncrossjoin(df1, df2, dfs...; makeunique = false)
Perform a cross join of two or more data frame objects and return a DataFrame
containing the result. A cross join returns the cartesian product of rows from all passed data frames, where the first passed data frame is assigned to the dimension that changes the slowest and the last data frame is assigned to the dimension that changes the fastest.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).
If more than two data frames are passed, the join is performed recursively with left associativity.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
.
Examples
julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
Row │ X
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
Row │ Y
│ String
─────┼────────
1 │ a
2 │ b
julia> crossjoin(df1, df2)
6×2 DataFrame
Row │ X Y
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 b
3 │ 2 a
4 │ 2 b
5 │ 3 a
6 │ 3 b
DataFrames.innerjoin
— Functioninnerjoin(df1, df2; on, makeunique=false, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error)
innerjoin(df1, df2, dfs...; on, makeunique=false,
validate=(false, false), matchmissing=:error)
Perform an inner join of two or more data frame objects and return a DataFrame
containing the result. An inner join includes rows with keys that match in all passed data frames.
The order of rows in the result is undefined and may change in the future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1
. This behavior may change in future releases.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate
keyword argument is applied recursively with left associativity.
See also: leftjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> innerjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
2×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
julia> innerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
2×3 DataFrame
Row │ ID NAME job
│ Int64 String String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
DataFrames.leftjoin
— Functionleftjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error)
Perform a left join of two data frame objects and return a DataFrame
containing the result. A left join includes all rows from df1
.
The order of rows in the result is undefined and may change in the future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1
. This behavior may change in future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name, for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
All columns of the returned data table will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin
, rightjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> leftjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String?
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
julia> leftjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
DataFrames.outerjoin
— Functionouterjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
renamecols=(identity => identity), matchmissing=:error)
outerjoin(df1, df2, dfs...; on, makeunique = false,
validate = (false, false), matchmissing=:error)
Perform an outer join of two or more data frame objects and return a DataFrame
containing the result. An outer join includes rows with keys that appear in any of the passed data frames.
The order of rows in the result is undefined and may change in the future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the element type of these columns both df1
and df2
. This behavior may change in future releases.
Arguments
df1
,df2
,dfs...
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed.on
is a required argument.makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
. This argument is only supported when joining exactly two data frames.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
All columns of the returned data table will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator
keyword argument is not supported and validate
keyword argument is applied recursively with left associativity.
See also: innerjoin
, leftjoin
, rightjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
Row │ ID Name Job
│ Int64 String? String?
─────┼───────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 3 Joe Blogs missing
4 │ 4 missing Farmer
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
DataFrames.rightjoin
— Functionrightjoin(df1, df2; on, makeunique=false, source=nothing,
validate=(false, false), renamecols=(identity => identity),
matchmissing=:error)
Perform a right join on two data frame objects and return a DataFrame
containing the result. A right join includes all rows from df2
.
The order of rows in the result is undefined and may change in the future releases.
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df2
. This behavior may change in future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).source
: Default:nothing
. If aSymbol
or string, adds indicator column with the given name for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.renamecols
: aPair
specifying how columns of left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or aSymbol
can be passed in which case it is appended to the original column name; alternatively a function can be passed in which case it is applied to each column name, which is passed to it as aString
. Note thatrenamecols
does not affecton
columns, whose names are always taken from the left data frame and left unchanged.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
All columns of the returned data table will support missing values.
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin
, leftjoin
, outerjoin
, semijoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String? String
─────┼─────────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
Row │ ID NAME job
│ Int64 String? String
─────┼─────────────────────────
1 │ 1 John Doe Lawyer
2 │ 2 Jane Doe Doctor
3 │ 4 missing Farmer
DataFrames.semijoin
— Functionsemijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)
Perform a semi join of two data frame objects and return a DataFrame
containing the result. A semi join returns the subset of rows of df1
that match with the keys in df2
.
The order of rows in the result is undefined and may change in the future releases.
Arguments
df1
,df2
: theAbstractDataFrames
to be joined
Keyword Arguments
on
: A column name to joindf1
anddf2
on. If the columns on whichdf1
anddf2
will be joined have different names, then aleft=>right
pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).makeunique
: iffalse
(the default), an error will be raised if duplicate names are found in columns not joined on; iftrue
, duplicate names will be suffixed with_i
(i
starting at 1 for the first duplicate).indicator
: Default:nothing
. If aSymbol
or string, adds categorical indicator column with the given name for whether a row appeared in onlydf1
("left_only"
), onlydf2
("right_only"
) or in both ("both"
). If the name is already in use, the column name will be modified ifmakeunique=true
.validate
: whether to check that columns passed as theon
argument define unique keys in each input data frame (according toisequal
). Can be a tuple or a pair, with the first element indicating whether to run check fordf1
and the second element fordf2
. By default no check is performed.matchmissing
: if equal to:error
throw an error ifmissing
is present inon
columns; if equal to:equal
thenmissing
is allowed and missings are matched (isequal
is used for comparisons of rows for equality)
It is not allowed to join on columns that contain NaN
or -0.0
in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector
.
When merging on
categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.
See also: innerjoin
, leftjoin
, rightjoin
, outerjoin
, antijoin
, crossjoin
.
Examples
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼──────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
3 │ 3 Joe Blogs
julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ ID Job
│ Int64 String
─────┼───────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job, on = :ID)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
Row │ identifier Job
│ Int64 String
─────┼────────────────────
1 │ 1 Lawyer
2 │ 2 Doctor
3 │ 4 Farmer
julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
Row │ ID Name
│ Int64 String
─────┼─────────────────
1 │ 1 John Doe
2 │ 2 Jane Doe
Grouping
Base.get
— Functionget(gd::GroupedDataFrame, key, default)
Get a group based on the values of the grouping columns.
key
may be a GroupKey
, NamedTuple
or Tuple
of grouping column values (in the same order as the cols
argument to groupby
). It may also be an AbstractDict
, in which case the order of the arguments does not matter.
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
b = repeat([2, 1], outer=[3]),
c = 1:6);
julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 1 4
⋮
Last Group (2 rows): a = :baz
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ bar 1 2
2 │ bar 2 5
julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 2 3
2 │ baz 1 6
julia> get(gd, (:qux,), nothing)
DataFrames.groupby
— Functiongroupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false)
Return a GroupedDataFrame
representing a view of an AbstractDataFrame
split into row groups.
Arguments
df
: anAbstractDataFrame
to splitcols
: data frame columns to group by. Can be any column selector (Symbol
, string or integer;:
,Cols
,All
,Between
,Not
, a regular expression, or a vector ofSymbol
s, strings or integers).sort
: whether to sort groups according to the values of the grouping columnscols
; ifsort=false
then the order of groups in the result is undefined and may change in future releases. In the current implementation groups are ordered following the order of appearance of values in the grouping columns, except when all grouping columns provide non-nothing
DataAPI.refpool
in which case the order of groups follows the order of values returned byDataAPI.refpool
. As a particular application of this rule if allcols
areCategoricalVector
s then groups are always sorted Integer columns with a narrow range also use this this optimization, so to the order of groups when grouping on integer columns is undefined.skipmissing
: whether to skip groups withmissing
values in one of the grouping columnscols
Details
An iterator over a GroupedDataFrame
returns a SubDataFrame
view for each grouping into df
. Within each group, the order of rows in df
is preserved.
cols
can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame
is created.
A GroupedDataFrame
also supports indexing by groups, select
, transform
, and combine
(which applies a function to each group and combines the result into a data frame).
GroupedDataFrame
also supports the dictionary interface. The keys are GroupKey
objects returned by keys(::GroupedDataFrame)
, which can also be used to get the values of the grouping columns for each group. Tuples
and NamedTuple
s containing the values of the grouping columns (in the same order as the cols
argument) are also accepted as indices. Finally, an AbstractDict
can be used to index into a grouped data frame where the keys are column names of the data frame. The order of the keys does not matter in this case.
See also
combine
, select
, select!
, transform
, transform!
Examples
julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
⋮
Last Group (2 rows): a = 4
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[1]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> last(gd)
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
julia> gd[(a=3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[Dict("a" => 3)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> gd[(3,)]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
julia> k = first(keys(gd))
GroupKey: (a = 1,)
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
julia> for g in gd
println(g)
end
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 1 2
2 │ 2 1 6
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 2 3
2 │ 3 2 7
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 4 1 4
2 │ 4 1 8
DataFrames.groupcols
— Functiongroupcols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
used for grouping.
DataFrames.groupindices
— Functiongroupindices(gd::GroupedDataFrame)
Return a vector of group indices for each row of parent(gd)
.
Rows appearing in group gd[i]
are attributed index i
. Rows not present in any group are attributed missing
(this can happen if skipmissing=true
was passed when creating gd
, or if gd
is a subset from a larger GroupedDataFrame
).
Base.keys
— Functionkeys(gd::GroupedDataFrame)
Get the set of keys for each group of the GroupedDataFrame
gd
as a GroupKeys
object. Each key is a GroupKey
, which behaves like a NamedTuple
holding the values of the grouping columns for a given group. Unlike the equivalent Tuple
, NamedTuple
, and AbstractDict
, these keys can be used to index into gd
efficiently. The ordering of the keys is identical to the ordering of the groups of gd
under iteration and integer indexing.
Examples
julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
b = repeat([2, 1], outer=[6]),
c = 1:12);
julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
⋮
Last Group (2 rows): a = :baz, b = 1
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ baz 1 6
2 │ baz 1 12
julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (a = :foo, b = 2)
GroupKey: (a = :bar, b = 1)
GroupKey: (a = :baz, b = 2)
GroupKey: (a = :foo, b = 1)
GroupKey: (a = :bar, b = 2)
GroupKey: (a = :baz, b = 1)
julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)
julia> keys(k)
2-element Vector{Symbol}:
:a
:b
julia> values(k) # Same as Tuple(k)
(:foo, 2)
julia> NamedTuple(k)
(a = :foo, b = 2)
julia> k.a
:foo
julia> k[:a]
:foo
julia> k[1]
:foo
Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame
:
julia> gd[k]
2×3 SubDataFrame
Row │ a b c
│ Symbol Int64 Int64
─────┼──────────────────────
1 │ foo 2 1
2 │ foo 2 7
julia> gd[keys(gd)[1]] == gd[1]
true
keys(dfc::DataFrameColumns)
Get a vector of column names of dfc
as Symbol
s.
Base.parent
— Functionparent(gd::GroupedDataFrame)
Return the parent data frame of gd
.
DataFrames.valuecols
— Functionvaluecols(gd::GroupedDataFrame)
Return a vector of Symbol
column names in parent(gd)
not used for grouping.
Filtering rows
Base.delete!
— Functiondelete!(df::DataFrame, inds)
Delete rows specified by inds
from a DataFrame
df
in place and return it.
Internally deleteat!
is called for all columns so inds
must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not
.
Examples
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> delete!(df, 2)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 3 6
Base.empty
— Functionempty(df::AbstractDataFrame)
Create a new DataFrame
with the same column names and column element types as df
but with zero rows.
Base.empty!
— Functionempty!(df::DataFrame)
Remove all rows from df
, making each of its columns empty.
Base.filter
— Functionfilter(fun, df::AbstractDataFrame; view::Bool=false)
filter(cols => fun, df::AbstractDataFrame; view::Bool=false)
Return a data frame containing only rows from df
for which fun
returns true
.
If cols
is not specified then the predicate fun
is passed DataFrameRow
s.
If cols
is specified then the predicate fun
is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned.
Passing cols
leads to a more efficient execution of the operation for large data frames.
See also: filter!
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
filter(fun, gdf::GroupedDataFrame)
filter(cols => fun, gdf::GroupedDataFrame)
Return a new GroupedDataFrame
containing only groups for which fun
returns true
.
If cols
is not specified then the predicate fun
is called with a SubDataFrame
for each group.
If cols
is specified then the predicate fun
is called for each group with views of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
Examples
julia> df = DataFrame(g=[1, 2], x=['a', 'b']);
julia> gd = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
⋮
Last Group (1 row): g = 2
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 2 b
julia> filter(x -> x.x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
julia> filter(:x => x -> x[1] == 'a', gd)
GroupedDataFrame with 1 group based on key: g
First Group (1 row): g = 1
Row │ g x
│ Int64 Char
─────┼─────────────
1 │ 1 a
Base.filter!
— Functionfilter!(fun, df::AbstractDataFrame)
filter!(cols => fun, df::AbstractDataFrame)
Remove rows from data frame df
for which fun
returns false
.
If cols
is not specified then the predicate fun
is passed DataFrameRow
s.
If cols
is specified then the predicate fun
is passed elements of the corresponding columns as separate positional arguments, unless cols
is an AsTable
selector, in which case a NamedTuple
of these arguments is passed. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers), and column duplicates are allowed if a vector of Symbol
s, strings, or integers is passed.
Passing cols
leads to a more efficient execution of the operation for large data frames.
See also: filter
Examples
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 2 a
4 │ 1 b
julia> filter!(row -> row.x > 1, df)
2×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 a
julia> filter!(:x => x -> x == 3, df)
1×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 1 c
3 │ 1 b
Base.first
— Functionfirst(df::AbstractDataFrame)
Get the first row of df
as a DataFrameRow
.
first(df::AbstractDataFrame, n::Integer)
Get a data frame with the n
first rows of df
.
Base.last
— Functionlast(df::AbstractDataFrame)
Get the last row of df
as a DataFrameRow
.
last(df::AbstractDataFrame, n::Integer)
Get a data frame with the n
last rows of df
.
Base.Iterators.only
— Functiononly(df::AbstractDataFrame)
If df
has a single row return it as a DataFrameRow
; otherwise throw ArgumentError
.
DataFrames.nonunique
— Functionnonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)
Return a Vector{Bool}
in which true
entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal
).
Arguments
df
:AbstractDataFrame
cols
: a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted byselect
.
Examples
julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> nonunique(df)
8-element Vector{Bool}:
0
0
0
0
1
1
1
1
julia> nonunique(df, 2)
8-element Vector{Bool}:
0
0
1
1
1
1
1
1
DataFrames.subset
— Functionsubset(df::AbstractDataFrame, args...; skipmissing::Bool=false, view::Bool=false)
subset(gdf::GroupedDataFrame, args...; skipmissing::Bool=false, view::Bool=false,
ungroup::Bool=true)
Return a copy of data frame df
or parent of gdf
containing only rows for which all values produced by transformation(s) args
for a given row are true
. All transformations must produce vectors containing true
or false
(and optionally missing
if skipmissing=true
).
Each argument passed in args
can be either a single column selector or a source_columns => function
transformation specifier following the rules described for select
.
Note that as opposed to filter
the subset
function works on whole columns (and selects rows within groups for GroupedDataFrame
rather than whole groups) and must return a vector.
If skipmissing=false
(the default) args
are required to produce vectors containing only Bool
values. If skipmissing=true
, additionally missing
is allowed and it is treated as false
(i.e. rows for which one of the conditions returns missing
are skipped).
If view=true
a SubDataFrame
view is returned instead of a DataFrame
.
If ungroup=false
the resulting data frame is re-grouped based on the same grouping columns as gdf
and a GroupedDataFrame
is returned.
If a GroupedDataFrame
is passed then it must include all groups present in the parent
data frame, like in select!
.
See also: subset!
, filter
, select
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12])
4×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 1 true true true 1
2 │ 2 false true true 2
3 │ 3 true false missing 11
4 │ 4 false false missing 12
julia> subset(df, :x)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 1 true true true 1
2 │ 3 true false missing 11
julia> subset(df, :v => x -> x .> 3)
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 3 true false missing 11
2 │ 4 false false missing 12
julia> subset(df, :x, :y => ByRow(!))
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼────────────────────────────────────
1 │ 3 true false missing 11
julia> subset(df, :x, :z, skipmissing=true)
1×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────
1 │ 1 true true true 1
julia> subset(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset(groupby(df, :y), :v => x -> x .> minimum(x))
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
DataFrames.subset!
— Functionsubset!(df::AbstractDataFrame, args...; skipmissing::Bool=false)
subset!(gdf::GroupedDataFrame{DataFrame}, args..., skipmissing::Bool=false,
ungroup::Bool=true)
Update data frame df
or the parent of gdf
in place to contain only rows for which all values produced by transformation(s) args
for a given row is true
. All transformations must produce vectors containing true
or false
(and optionally missing
if skipmissing=true
).
Each argument passed in args
can be either a single column selector or a source_columns => function
transformation specifier following the rules described for select
.
Note that as opposed to filter!
the subset!
function works on whole columns (and selects rows within groups for GroupedDataFrame
rather than whole groups) and must return a vector.
If skipmissing=false
(the default) args
are required to produce vectors containing only Bool
values. If skipmissing=true
, additionally missing
is allowed and it is treated as false
(i.e. rows for which one of the conditions returns missing
are skipped).
If ungroup=false
the resulting data frame is re-grouped based on the same grouping columns as gdf
and a GroupedDataFrame
is returned.
If GroupedDataFrame
is subsetted then it must include all groups present in the parent
data frame, like in select!
. Also note that in general the parent of the passed GroupedDataFrame
is mutated in place, which means that the GroupedDataFrame
is not safe to use as most likely it will get corrupted due to row removal in the parent.
See also: subset
, filter!
, select!
Examples
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false])
4×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼─────────────────────
1 │ 1 true true
2 │ 2 false true
3 │ 3 true false
4 │ 4 false false
julia> subset!(df, :x, :y => ByRow(!));
julia> df
1×3 DataFrame
Row │ id x y
│ Int64 Bool Bool
─────┼────────────────────
1 │ 3 true false
julia> df = DataFrame(id=1:4, y=[true, true, false, false], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×3 DataFrame
Row │ id y v
│ Int64 Bool Int64
─────┼─────────────────────
1 │ 2 true 2
2 │ 4 false 12
julia> df = DataFrame(id=1:4, x=[true, false, true, false],
z=[true, true, missing, missing], v=1:4)
4×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼──────────────────────────────
1 │ 1 true true 1
2 │ 2 false true 2
3 │ 3 true missing 3
4 │ 4 false missing 4
julia> subset!(df, :x, :z)
ERROR: ArgumentError: missing was returned in condition number 2 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset!(df, :x, :z, skipmissing=true);
julia> df
1×4 DataFrame
Row │ id x z v
│ Int64 Bool Bool? Int64
─────┼───────────────────────────
1 │ 1 true true 1
julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false],
z=[true, true, missing, missing], v=[1, 2, 11, 12]);
julia> subset!(groupby(df, :y), :v => x -> x .> minimum(x));
julia> df
2×5 DataFrame
Row │ id x y z v
│ Int64 Bool Bool Bool? Int64
─────┼─────────────────────────────────────
1 │ 2 false true true 2
2 │ 4 false false missing 12
Base.unique
— Functionunique(df::AbstractDataFrame; view::Bool=false)
unique(df::AbstractDataFrame, cols; view::Bool=false)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Return a data frame containing only the first occurrence of unique rows in df
. When cols
is specified, the returned DataFrame
contains complete rows, retaining in each case the first occurrence of a given combination of values in selected columns or their transformations. cols
can be any column selector or transformation accepted by select
.
For unique
, if view=false
a freshly allocated DataFrame
is returned, and if view=true
then a SubDataFrame
view into df
is returned.
unique!
updates df
in-place and does not support the view
keyword argument.
See also nonunique
.
Arguments
df
: the AbstractDataFramecols
: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.)
specifying the column(s) to compare.
Examples
julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique(df) # doesn't modify df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, 2)
2×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
julia> unique!(df) # modifies df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
Base.unique!
— Functionunique(df::AbstractDataFrame; view::Bool=false)
unique(df::AbstractDataFrame, cols; view::Bool=false)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)
Return a data frame containing only the first occurrence of unique rows in df
. When cols
is specified, the returned DataFrame
contains complete rows, retaining in each case the first occurrence of a given combination of values in selected columns or their transformations. cols
can be any column selector or transformation accepted by select
.
For unique
, if view=false
a freshly allocated DataFrame
is returned, and if view=true
then a SubDataFrame
view into df
is returned.
unique!
updates df
in-place and does not support the view
keyword argument.
See also nonunique
.
Arguments
df
: the AbstractDataFramecols
: column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.)
specifying the column(s) to compare.
Examples
julia> df = DataFrame(i = 1:4, x = [1, 2, 1, 2])
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> df = vcat(df, df)
8×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
5 │ 1 1
6 │ 2 2
7 │ 3 1
8 │ 4 2
julia> unique(df) # doesn't modify df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
julia> unique(df, 2)
2×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
julia> unique!(df) # modifies df
4×2 DataFrame
Row │ i x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 1
4 │ 4 2
Working with missing values
Missings.allowmissing
— Functionallowmissing(df::AbstractDataFrame, cols=:)
Return a copy of data frame df
with columns cols
converted to element type Union{T, Missing}
from T
to allow support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
Examples
julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> allowmissing(df)
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
DataFrames.allowmissing!
— Functionallowmissing!(df::DataFrame, cols=:)
Convert columns cols
of data frame df
from element type T
to Union{T, Missing}
to support missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
DataFrames.completecases
— Functioncompletecases(df::AbstractDataFrame, cols=:)
Return a Boolean vector with true
entries indicating rows without missing values (complete cases) in data frame df
.
If cols
is provided, only missing values in the corresponding columns areconsidered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
See also: dropmissing
and dropmissing!
. Use findall(completecases(df))
to get the indices of the rows.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> completecases(df)
5-element BitVector:
0
0
0
1
1
julia> completecases(df, :x)
5-element BitVector:
0
1
0
1
1
julia> completecases(df, [:x, :y])
5-element BitVector:
0
0
0
1
1
Missings.disallowmissing
— Functiondisallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)
Return a copy of data frame df
with columns cols
converted from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
Examples
julia> df = DataFrame(a=Union{Int, Missing}[1, 2])
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 2
julia> disallowmissing(df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> df = DataFrame(a=[1, missing])
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
julia> disallowmissing(df, error=false)
2×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
DataFrames.disallowmissing!
— Functiondisallowmissing!(df::DataFrame, cols=:; error::Bool=true)
Convert columns cols
of data frame df
from element type Union{T, Missing}
to T
to drop support for missing values.
cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If cols
is omitted all columns in the data frame are converted.
If error=false
then columns containing a missing
value will be skipped instead of throwing an error.
DataFrames.dropmissing
— Functiondropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)
Return a data frame excluding rows with missing values in df
.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If view=false
a freshly allocated DataFrame
is returned. If view=true
then a SubDataFrame
view into df
is returned. In this case disallowmissing
must be false
.
If disallowmissing
is true
(the default when view
is false
) then columns specified in cols
will be converted so as not to allow for missing values using disallowmissing!
.
See also: completecases
and dropmissing!
.
Examples
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing(df)
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing(df, :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
DataFrames.dropmissing!
— Functiondropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)
Remove rows with missing values from data frame df
and return it.
If cols
is provided, only missing values in the corresponding columns are considered. cols
can be any column selector (Symbol
, string or integer; :
, Cols
, All
, Between
, Not
, a regular expression, or a vector of Symbol
s, strings or integers).
If disallowmissing
is true
(the default) then the cols
columns will get converted using disallowmissing!
.
See also: dropmissing
and completecases
.
julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
julia> dropmissing!(copy(df))
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼────────────────────────
1 │ 4 2 d
2 │ 5 1 e
julia> dropmissing!(copy(df), :x)
3×3 DataFrame
Row │ i x y
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 2 4 missing
2 │ 4 2 d
3 │ 5 1 e
julia> dropmissing!(df, [:x, :y])
2×3 DataFrame
Row │ i x y
│ Int64 Int64 String
─────┼──────────────────────
1 │ 4 2 d
2 │ 5 1 e
Iteration
Base.eachcol
— Functioneachcol(df::AbstractDataFrame)
Return a DataFrameColumns
object that is a vector-like that allows iterating an AbstractDataFrame
column by column.
Indexing into DataFrameColumns
objects using integer, Symbol
or string returns the corresponding column (without copying). Indexing into DataFrameColumns
objects using a multiple column selector returns a subsetted DataFrameColumns
object with a new parent containing only the selected columns (without copying).
DataFrameColumns
supports most of the AbstractVector
API. The key differences are that it is read-only and that the keys
function returns a vector of Symbol
s (and not integers as for normal vectors).
In particular findnext
, findprev
, findfirst
, findlast
, and findall
functions are supported, and in findnext
and findprev
functions it is allowed to pass an integer, string, or Symbol
as a reference index.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachcol(df)
4×2 DataFrameColumns
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> collect(eachcol(df))
2-element Vector{AbstractVector{T} where T}:
[1, 2, 3, 4]
[11, 12, 13, 14]
julia> map(eachcol(df)) do col
maximum(col) - minimum(col)
end
2-element Vector{Int64}:
3
3
julia> sum.(eachcol(df))
2-element Vector{Int64}:
10
50
Base.eachrow
— Functioneachrow(df::AbstractDataFrame)
Return a DataFrameRows
that iterates a data frame row by row, with each row represented as a DataFrameRow
.
Because DataFrameRow
s have an eltype
of Any
, use copy(dfr::DataFrameRow)
to obtain a named tuple, which supports iteration and property access like a DataFrameRow
, but also passes information on the eltypes
of the columns of df
.
Examples
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> eachrow(df)
4×2 DataFrameRows
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
4 │ 4 14
julia> copy.(eachrow(df))
4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)
julia> eachrow(view(df, [4, 3], [2, 1]))
2×2 DataFrameRows
Row │ y x
│ Int64 Int64
─────┼──────────────
1 │ 14 4
2 │ 13 3
Base.values
— Functionvalues(dfc::DataFrameColumns)
Get a vector of columns from dfc
.
Base.pairs
— Functionpairs(dfc::DataFrameColumns)
Return an iterator of pairs associating the name of each column of dfc
with the corresponding column vector, i.e. name => col
where name
is the column name of the column col
.
Equality
Base.isapprox
— Functionisapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
nans::Bool=false, norm::Function=norm)
Inexact equality comparison. df1
and df2
must have the same size and column names. Return true
if isapprox
with given keyword arguments applied to all pairs of columns stored in df1
and df2
returns true
.