Indexing
General rules
The following rules explain target functionality of how getindex, setindex!, view, and broadcasting are intended to work with DataFrame, SubDataFrame and DataFrameRow objects.
The following values are a valid column index:
- a scalar, later denoted as col:
  - a Symbol;
  - an AbstractString;
  - an Integer that is not Bool;
- a vector, later denoted as cols:
  - a vector of Symbol (does not have to be a subtype of AbstractVector{Symbol});
  - a vector of AbstractString (does not have to be a subtype of AbstractVector{<:AbstractString});
  - a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
  - a vector of Bool (must be a subtype of AbstractVector{Bool});
  - a regular expression (will be expanded to a vector of matching column names);
  - a Not expression (see InvertedIndices.jl); Not(idx) selects all indices not in the passed idx; when passed as column selector Not(idx...) is equivalent to Not(Cols(idx...));
  - a Cols expression (see DataAPI.jl); Cols(idxs...) selects the union of the selections in idxs; in particular Cols() selects no columns and Cols(:) selects all columns; a special rule is Cols(predicate), where predicate is a predicate function; in this case the columns for which predicate returns true when passed the column name as a string are selected;
  - a Between expression (see DataAPI.jl); Between(first, last) selects the columns between first and last inclusively;
  - an All expression (see DataAPI.jl); All() selects all columns and is equivalent to a literal colon;
  - a literal colon : (selects all columns).
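The column selectors listed above can be tried out with names(df, cols), which returns the names of the selected columns. The following is a minimal sketch; the data frame and its column names are made up for illustration only.

```julia
using DataFrames

# Hypothetical data frame used only to illustrate the selectors.
df = DataFrame(a=1:3, b=4:6, x1=7:9, x2=10:12)

by_symbols = names(df, [:a, :b])               # vector of Symbol
by_strings = names(df, ["a", "b"])             # vector of AbstractString
by_ints    = names(df, [1, 2])                 # vector of Integer
by_bools   = names(df, [true, false, false, true])
by_regex   = names(df, r"^x")                  # columns whose name matches
by_not     = names(df, Not(:a))                # everything except :a
by_between = names(df, Between(:b, :x1))       # inclusive range of columns
by_cols    = names(df, Cols(:x2, r"^x"))       # union, in order of first selection
by_pred    = names(df, Cols(startswith("x")))  # predicate applied to names as strings
nothing_selected = names(df, Cols())           # no columns
everything = names(df, All())                  # same as names(df, :)
```

Note that Cols(:x2, r"^x") places x2 first, because the union keeps the order in which columns are first selected.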
The following values are a valid row index:
- a scalar, later denoted as row:
  - an Integer that is not Bool;
- a vector, later denoted as rows:
  - a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
  - a vector of Bool (must be a subtype of AbstractVector{Bool});
  - a Not expression (see InvertedIndices.jl);
  - a literal colon : (selects all rows with copying);
  - a literal exclamation mark ! (selects all rows without copying).
Additionally it is allowed to index into an AbstractDataFrame using a two-dimensional CartesianIndex.
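As a small illustration of the row selectors, the sketch below contrasts : (copy) with ! (no copy) and shows the vector, Not and CartesianIndex forms. The data frame is a made-up example.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3], b=["x", "y", "z"])

copied = df[:, :a]   # all rows, copied into a new vector
shared = df[!, :a]   # all rows, the stored column itself

same_values   = copied == shared    # true: equal contents
copy_is_alias = copied === df.a     # false: a distinct vector
bang_is_alias = shared === df.a     # true: no copy was made

by_ints  = df[[1, 3], :a]               # vector of Integer
by_bools = df[[true, false, true], :a]  # vector of Bool
by_not   = df[Not(2), :a]               # all rows except the second
cell     = df[CartesianIndex(2, 1)]     # row 2, column 1
```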
In the descriptions below df represents a DataFrame, sdf is a SubDataFrame and dfr is a DataFrameRow.
A literal colon : always expands to axes(df, 1) or axes(sdf, 1).
df.col works like df[!, col] and sdf.col works like sdf[!, col] in all cases. An exception is that under Julia 1.6 or earlier df.col .= v and sdf.col .= v perform in-place broadcasting if col is present in df/sdf and is a valid identifier (this inconsistency is not present under Julia 1.7 and later).
getindex and view
The following list specifies the behavior of getindex and view operations depending on argument types.
In particular, each description explicitly states whether the data is copied or reused without copying.
For performance reasons, accessing, via getindex or view, a single row and multiple cols of a DataFrame, a SubDataFrame or a DataFrameRow always returns a DataFrameRow (which is a view type).
getindex on DataFrame:
- df[row, col] -> the value contained in row row of column col, the same as df[!, col][row];
- df[CartesianIndex(row, col)] -> the same as df[row, col];
- df[row, cols] -> a DataFrameRow with parent df;
- df[rows, col] -> a copy of the vector df[!, col] with only the entries corresponding to rows selected, the same as df[!, col][rows];
- df[rows, cols] -> a DataFrame containing copies of columns cols with only the entries corresponding to rows selected;
- df[!, col] -> the vector contained in column col returned without copying; the same as df.col if col is a valid identifier;
- df[!, cols] -> create a new DataFrame with columns cols without copying of columns; the same as select(df, cols, copycols=false).
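The getindex rules for DataFrame listed above can be checked with a short sketch like the one below. The data frame and variable names are illustrative assumptions.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])

value = df[2, :b]              # a single value
row   = df[2, [:a, :b]]        # a DataFrameRow (a view type) with parent df
col   = df[[1, 3], :a]         # a copy of the selected entries
sub   = df[[1, 3], [:a]]       # a DataFrame holding copied columns
alias = df[!, [:a]]            # a DataFrame sharing column a with df

row_is_view    = row isa DataFrameRow && parent(row) === df
sub_is_copy    = sub.a !== df.a
alias_is_alias = alias.a === df.a
```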
view on DataFrame:
- @view df[row, col] -> a 0-dimensional view into df[!, col] in row row, the same as view(df[!, col], row);
- @view df[CartesianIndex(row, col)] -> the same as @view df[row, col];
- @view df[row, cols] -> the same as df[row, cols];
- @view df[rows, col] -> a view into df[!, col] with rows selected, the same as view(df[!, col], rows);
- @view df[rows, cols] -> a SubDataFrame with rows selected with parent df;
- @view df[!, col] -> a view into df[!, col] with all rows;
- @view df[!, cols] -> the same as @view df[:, cols].
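Because views reuse the parent's data, writing through a view changes the parent. The following minimal sketch illustrates the view rules for DataFrame on a made-up data frame.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])

cell = @view df[2, :a]        # 0-dimensional view into df.a
cell[] = 20                   # writes through to df

colview = @view df[2:3, :b]   # view into df.b with rows 2:3 selected
colview .= 0                  # also writes through to df

sdf = @view df[1:2, :]        # SubDataFrame with parent df
dfr = @view df[1, :]          # single row, multiple columns: DataFrameRow
```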
getindex on SubDataFrame:
- sdf[row, col] -> a value contained in row row of column col;
- sdf[CartesianIndex(row, col)] -> the same as sdf[row, col];
- sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
- sdf[rows, col] -> a copy of sdf[!, col] with only rows rows selected, the same as sdf[!, col][rows];
- sdf[rows, cols] -> a DataFrame containing columns cols and sdf[rows, col] as a vector for each col in cols;
- sdf[!, col] -> a view of entries corresponding to sdf in the vector parent(sdf)[!, col]; the same as sdf.col if col is a valid identifier;
- sdf[!, cols] -> create a new SubDataFrame with columns cols, the same parent as sdf, and the same rows selected; the same as select(sdf, cols, copycols=false).
view on SubDataFrame:
- @view sdf[row, col] -> a 0-dimensional view into sdf[!, col] at row row, the same as view(sdf[!, col], row);
- @view sdf[CartesianIndex(row, col)] -> the same as @view sdf[row, col];
- @view sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
- @view sdf[rows, col] -> a view into sdf[!, col] vector with rows selected, the same as view(sdf[!, col], rows);
- @view sdf[rows, cols] -> a SubDataFrame with parent parent(sdf);
- @view sdf[!, col] -> a view into sdf[!, col] vector with all rows;
- @view sdf[!, cols] -> the same as @view sdf[:, cols].
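The sketch below illustrates the SubDataFrame rules for getindex and view: indices are relative to the view, ! returns views into the parent, and nested views keep the original data frame as their parent. All names are illustrative assumptions.

```julia
using DataFrames

# Hypothetical parent and a view of its rows 2 and 4.
df = DataFrame(a=[1, 2, 3, 4], b=[10, 20, 30, 40])
sdf = @view df[[2, 4], :]

value   = sdf[1, :a]      # row 1 of sdf is row 2 of df
colview = sdf[!, :a]      # a view into df.a, no copy
colcopy = sdf[:, :a]      # a freshly allocated Vector
copied  = sdf[:, :]       # a DataFrame with copied columns
narrow  = sdf[!, [:a]]    # a SubDataFrame with the same parent and rows
inner   = @view sdf[2:2, :]  # a view of a view: the parent is still df
```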
getindex on DataFrameRow:
- dfr[col] -> the value contained in column col of dfr; the same as dfr.col if col is a valid identifier;
- dfr[cols] -> a DataFrameRow with parent parent(dfr).
view on DataFrameRow:
- @view dfr[col] -> a 0-dimensional view into parent(dfr)[DataFrames.row(dfr), col];
- @view dfr[cols] -> a DataFrameRow with parent parent(dfr).
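A short sketch of the DataFrameRow rules, using a made-up data frame: selecting several columns yields another DataFrameRow over the same parent, and a 0-dimensional view can be used to write a single cell.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2], b=[3, 4], c=[5, 6])
dfr = df[2, :]            # DataFrameRow for the second row

by_index    = dfr[:a]     # value in column a
by_property = dfr.a       # the same value via property access
narrow = dfr[[:a, :c]]    # a DataFrameRow with parent df

cell = @view dfr[:b]      # 0-dimensional view of the cell in row 2, column b
cell[] = 40               # writes through to df
```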
Note that views created with the column selector set to : track their parent: their columns change when columns are added, removed or renamed in the parent. If the column selector is anything other than : then the view points to the selected columns by their number at the moment of creation of the view.
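The difference between the two kinds of column selectors can be seen by adding a column to the parent after the views are created. This is a minimal sketch with made-up names.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2], b=[3, 4])

tracking = @view df[:, :]          # column selector is :
fixed    = @view df[:, [:a, :b]]   # explicit list of columns

df.c = [5, 6]                      # add a column to the parent

tracking_names = names(tracking)   # now includes the new column
fixed_names    = names(fixed)      # still the two columns selected at creation
```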
setindex!
The following list specifies the behavior of setindex! operations depending on argument types.
In particular, each description explicitly states whether the assignment is in-place.
Note that if a setindex! operation throws an error the target data frame may be partially changed, so it is unsafe to use it afterwards (the column length correctness will be preserved).
setindex! on DataFrame:
- df[row, col] = v -> set value of col in row row to v in-place;
- df[CartesianIndex(row, col)] = v -> the same as df[row, col] = v;
- df[row, cols] = v -> set row row of columns cols in-place; the same as dfr = df[row, cols]; dfr[:] = v;
- df[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector; if rows is : and col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds a copy of v; equivalent to df.col = copy(v) if col is a valid identifier;
- df[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
- df[!, col] = v -> replaces col with v without copying (with the exception that if v is an AbstractRange it gets converted to a Vector); also if col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds v; equivalent to df.col = v if col is a valid identifier; this is allowed if ncol(df) == 0 || length(v) == nrow(df);
- df[!, cols] = v -> replaces existing columns cols in data frame df with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match).
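The practical difference between in-place assignment with : and column replacement with ! can be observed with identity checks (===). The sketch below uses a made-up data frame and vector.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3])
a_before = df.a               # keep a reference to the stored column

df[2, :a] = 20                # in-place
df[:, :a] = [7, 8, 9]         # in-place: the stored vector is reused
still_same_vector = df.a === a_before

v = [1.5, 2.5, 3.5]
df[:, :b] = v                 # new column holds a copy of v
b_is_copy = df.b !== v && df.b == v
df[!, :c] = v                 # new column holds v itself
c_is_alias = df.c === v

df[!, :a] = 1:3               # column replaced; the range becomes a Vector
a_replaced = df.a !== a_before && df.a isa Vector{Int}
```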
setindex! on SubDataFrame:
- sdf[row, col] = v -> set value of col in row row to v in-place;
- sdf[CartesianIndex(row, col)] = v -> the same as sdf[row, col] = v;
- sdf[row, cols] = v -> the same as dfr = sdf[row, cols]; dfr[:] = v in-place;
- sdf[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector;
- sdf[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
- sdf[!, col] = v -> replaces col with v with copying; if col is present in sdf then filtered-out rows in the newly created vector are filled with values already present in that column and promote_type is used to determine the eltype of the new column; if col is not present in sdf then the operation is only allowed if sdf was created with : as column selector, in which case filtered-out rows are filled with missing; equivalent to sdf.col = v if col is a valid identifier; the operation is allowed if length(v) == nrow(sdf);
- sdf[!, cols] = v -> replaces existing columns cols in data frame sdf with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match); filtered-out rows in newly created vectors are filled with values already present in respective columns and promote_type is used to determine the eltype of the new columns.
The rules above mean that sdf[:, col] = v is an in-place operation if col is present in sdf, therefore it will be fast in general. On the other hand using sdf[!, col] = v or sdf.col = v will always allocate a new vector, which is more expensive computationally.
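The following sketch contrasts the two forms on a SubDataFrame. It assumes a DataFrames.jl version that implements the sdf[!, col] = v rules described above, and that sdf was created with : as the column selector; the names are made up for illustration.

```julia
using DataFrames

# Hypothetical parent and a view of its rows 1 and 3, created with : as column selector.
df = DataFrame(a=[1, 2, 3, 4])
sdf = @view df[[1, 3], :]
a_before = df.a

sdf[:, :a] = [10, 30]          # in-place: only the selected rows of df.a change
in_place = df.a === a_before
after_colon = copy(df.a)

sdf[!, :a] = [1.5, 3.5]        # allocates a new parent column; eltype is promoted
replaced = df.a !== a_before
after_bang = copy(df.a)

sdf[!, :b] = ["x", "y"]        # new column; filtered-out rows are filled with missing
```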
setindex! on DataFrameRow:
- dfr[col] = v -> set the value of col in dfr to v in-place; equivalent to dfr.col = v if col is a valid identifier;
- dfr[cols] = v -> set values of entries in columns cols in dfr by elements of v in place; v can be: 1) a Tuple or an AbstractArray, in which cases it must have a number of elements equal to length(dfr), 2) an AbstractDict, in which case column names must match, 3) a NamedTuple or DataFrameRow, in which case column names and order must match.
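As an illustration of the DataFrameRow rules, the sketch below assigns single values and then whole rows from a Tuple and from a NamedTuple. The data frame is a made-up example.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2], b=[3, 4])
dfr = df[2, :]

dfr[:a] = 20                      # single column, in-place in df
dfr.b = 40                        # the same via property syntax
after_scalars = (df[2, :a], df[2, :b])

dfr[[:a, :b]] = (200, 400)        # Tuple: matched by position
after_tuple = (df[2, :a], df[2, :b])

dfr[[:a, :b]] = (a=7, b=8)        # NamedTuple: names and order must match
after_namedtuple = (df[2, :a], df[2, :b])
```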
Broadcasting
The following broadcasting rules apply to AbstractDataFrame objects:
- AbstractDataFrame behaves in broadcasting like a two-dimensional collection compatible with matrices.
- If an AbstractDataFrame takes part in broadcasting then a DataFrame is always produced as a result. In this case the requested broadcasting operation produces an object with exactly two dimensions. An exception is when an AbstractDataFrame is used only as a source of broadcast assignment into an object of dimensionality higher than two.
- If multiple AbstractDataFrame objects take part in broadcasting then they have to have identical column names.
Note that if a broadcasting assignment operation throws an error the target data frame may be partially changed, so it is unsafe to use it afterwards (the column length correctness will be preserved).
Broadcasting DataFrameRow is currently not allowed (which is consistent with NamedTuple).
It is possible to assign a value to AbstractDataFrame and DataFrameRow objects using the .= operator. In such an operation AbstractDataFrame is considered as two-dimensional and DataFrameRow as single-dimensional.
The rule above means that, similar to single-dimensional objects in Base (e.g. vectors), DataFrameRow is considered to be column-oriented.
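A minimal sketch of these broadcasting rules follows. The data frames are made up for illustration; the last two lines show .= treating a data frame as two-dimensional and a row as one-dimensional.

```julia
using DataFrames

# Hypothetical data frames used only for illustration.
df  = DataFrame(a=[1, 2], b=[3, 4])
df2 = DataFrame(a=[10, 20], b=[30, 40])   # identical column names

plus_scalar = df .+ 1                 # result is a DataFrame
plus_matrix = df .+ [10 20; 30 40]    # compatible with a 2x2 matrix
plus_frame  = df .+ df2               # column names must match

sdf = @view df[1:1, :]
from_view = sdf .* 2                  # a SubDataFrame in, a DataFrame out

df .= 0                               # two-dimensional in-place assignment
df[1, :] .= [7, 8]                    # the row behaves like a 1-d collection
```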
Additional rules:
- in the df[CartesianIndex(row, col)] .= v, df[row, col] .= v syntaxes v is broadcasted into the contents of df[row, col] (this is consistent with Julia Base);
- in the df[row, cols] .= v syntax the assignment to df is performed in-place;
- in the df[rows, col] .= v and df[rows, cols] .= v syntaxes the assignment to df is performed in-place; if rows is : and col is a Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
- in the df[!, col] .= v syntax column col is replaced by a freshly allocated vector; if col is a Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
- the df[!, cols] .= v syntax replaces existing columns cols in data frame df with freshly allocated vectors;
- the df.col .= v syntax currently performs in-place assignment to an existing vector df.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in df then a new column will be created in df;
- in the sdf[CartesianIndex(row, col)] .= v, sdf[row, col] .= v and sdf[row, cols] .= v syntaxes the assignment to sdf is performed in-place;
- in the sdf[rows, col] .= v and sdf[rows, cols] .= v syntaxes the assignment to sdf is performed in-place; if rows is : and col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; the filtered-out rows are filled with missing;
- in the sdf[!, col] .= v syntax column col is replaced by a freshly allocated vector; the filtered-out rows are filled with values already present in col; if col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; in this case the filtered-out rows are filled with missing;
- the sdf[!, cols] .= v syntax replaces existing columns cols in data frame sdf with freshly allocated vectors; the filtered-out rows are filled with values already present in cols;
- the sdf.col .= v syntax currently performs in-place assignment to an existing vector sdf.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in sdf then a new column will be created in sdf if sdf was created with : as a column selector;
- the dfr.col .= v syntax is allowed and performs in-place assignment to a value extracted by dfr.col.
Note that sdf[!, col] .= v and sdf[!, cols] .= v syntaxes are not allowed as sdf can be only modified in-place.
If column indexing using Symbol or AbstractString names in cols is performed, the order of columns in the operation is specified by the order of names.
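The in-place and allocating forms of broadcast assignment described in the additional rules can be told apart with identity checks. The sketch below is illustrative only; the SubDataFrame part assumes a DataFrames.jl version that supports adding columns through a view created with : as the column selector.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3])
a_before = df.a

df[:, :a] .= 0                 # in-place: the stored vector is reused
in_place = df.a === a_before

df[!, :a] .= 1.5               # freshly allocated vector, so the eltype may change
replaced = df.a !== a_before && eltype(df.a) == Float64

df[:, :b] .= "x"               # column b is missing from df, so it is allocated and added
df[!, :c] .= 1                 # likewise for c

sdf = @view df[1:2, :]         # created with : as column selector
sdf[:, :a] .= 9.0              # in-place in the parent, rows 1:2 only
sdf[:, :d] .= 1                # new column; the filtered-out row gets missing
```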
Indexing GroupedDataFrames
A GroupedDataFrame can behave as either an AbstractVector or AbstractDict depending on the type of index used. Integers (or arrays of them) trigger vector-like indexing while Tuples and NamedTuples trigger dictionary-like indexing. An intermediate between the two is the GroupKey type returned by keys(::GroupedDataFrame), which behaves similarly to a NamedTuple but has performance on par with integer indexing.
The elements of a GroupedDataFrame are SubDataFrames of its parent.
- gd[i::Integer] -> Get the ith group.
- gd[key::NamedTuple] -> Get the group corresponding to the given values of the grouping columns. The fields of the NamedTuple must match the grouping columns passed to groupby (including order).
- gd[key::Tuple] -> Same as previous, but omitting the names on key.
- get(gd, key::Union{Tuple, NamedTuple}, default) -> Get group for key key, returning default if it does not exist.
- gd[key::GroupKey] -> Get the group corresponding to the GroupKey key (one of the elements of the vector returned by keys(::GroupedDataFrame)). This should be nearly as fast as integer indexing.
- gd[a::AbstractVector] -> Select multiple groups and return them in a new GroupedDataFrame object. Groups may be selected by integer position using an array of Integers or Bools, similar to a standard array. Alternatively the array may contain keys of any of the types supported for dictionary-like indexing (GroupKey, Tuple, or NamedTuple). Selected groups must be unique, and different types of indices cannot be mixed.
- gd[n::Not] -> Any of the above types wrapped in Not. The result will be a new GroupedDataFrame containing all groups in gd not selected by the wrapped index.
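The indexing forms above can be sketched as follows. The data is made up, and sort=true is passed to groupby only so that the order of groups in this example is predictable.

```julia
using DataFrames

# Hypothetical data; sort=true fixes the group order as "a", "b", "c".
df = DataFrame(g=["a", "b", "a", "c"], x=1:4)
gd = groupby(df, :g, sort=true)

first_group = gd[1]                       # vector-like: the first group
by_named    = gd[(g="b",)]                # dictionary-like with a NamedTuple
by_tuple    = gd[("b",)]                  # the same without names
fallback    = get(gd, ("zzz",), nothing)  # default for a key that does not exist

k = keys(gd)[3]                           # a GroupKey
by_key = gd[k]

some   = gd[[1, 3]]                       # a new GroupedDataFrame with two groups
others = gd[Not(1)]                       # all groups except the first
keyed  = gd[[("a",), ("c",)]]             # a vector of Tuple keys
```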
Common API for types defined in DataFrames.jl
This table presents return value types of calling names, propertynames, keys, length and ndims on types exposed to the user by DataFrames.jl:
Type | names | propertynames | keys | length | ndims |
---|---|---|---|---|---|
AbstractDataFrame | Vector{String} | Vector{Symbol} | undefined | undefined | 2 |
DataFrameRow | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 |
DataFrameRows | Vector{String} | Vector{Symbol} | vector of Int | Int | 1 |
DataFrameColumns | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 |
GroupedDataFrame | Vector{String} | tuple of fields | GroupKeys | Int | 1 |
GroupKeys | undefined | tuple of fields | vector of Int | Int | 1 |
GroupKey | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 |
Additionally, for the above types T (i.e. AbstractDataFrame, DataFrameRow, DataFrameRows, DataFrameColumns, GroupedDataFrame, GroupKeys, GroupKey) the following methods are defined:
- size(::T) returning a Tuple of Int.
- size(::T, ::Integer) returning an Int.
- axes(::T) returning a Tuple of Int vectors.
- axes(::T, ::Integer) returning an Int vector for a valid dimension (except DataFrameRows and GroupKeys for which Base.OneTo(1) is also returned for a dimension higher than a valid one because they are AbstractVector).
- firstindex(::T) returning 1 (except AbstractDataFrame for which it is undefined).
- firstindex(::T, ::Integer) returning 1 for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
- lastindex(::T) returning Int (except AbstractDataFrame for which it is undefined).
- lastindex(::T, ::Integer) returning Int for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
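A few of the entries in the table and the method list above can be checked directly. This is a minimal sketch on a made-up data frame, covering only AbstractDataFrame, DataFrameRow, DataFrameRows and DataFrameColumns.

```julia
using DataFrames

# Hypothetical data frame used only for illustration.
df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])
dfr = df[1, :]

df_names     = names(df)            # Vector{String}
df_propnames = propertynames(df)    # Vector{Symbol}
df_ndims     = ndims(df)            # 2
df_size      = size(df)             # Tuple of Int
df_rows      = size(df, 1)          # Int
df_col_axis  = axes(df, 2)          # axis of the column dimension
df_row_range = (firstindex(df, 1), lastindex(df, 1))

dfr_keys   = keys(dfr)              # Vector{Symbol}
dfr_length = length(dfr)            # number of columns in the row
dfr_ndims  = ndims(dfr)             # 1

n_rows = length(eachrow(df))        # DataFrameRows
n_cols = length(eachcol(df))        # DataFrameColumns
```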