Indexing
General rules
The following rules explain target functionality of how getindex, setindex!, view, and broadcasting are intended to work with DataFrame, SubDataFrame and DataFrameRow objects.
The rules for a valid type of index into a column are the following:
- a value, later denoted as `col`:
    - a `Symbol`;
    - an `AbstractString`;
    - an `Integer` that is not `Bool`;
- a vector, later denoted as `cols`:
    - a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`);
    - a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`);
    - a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`);
    - a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
    - a regular expression, which gets expanded to a vector of matching column names;
    - a `Not` expression (see InvertedIndices.jl);
    - an `All` or `Between` expression (see DataAPI.jl);
    - a colon literal `:`.
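As a minimal sketch of the column selectors listed above, assuming a recent DataFrames.jl release in which `names` returns strings (the data frame `df` and its column names are made up for illustration):

```julia
using DataFrames

# Hypothetical three-column data frame used only for illustration.
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6], y = [7, 8, 9])

# Single-column selectors (`col`): Symbol, string, or integer position.
by_symbol = df[!, :x1]
by_string = df[!, "x1"]
by_number = df[!, 1]

# Multi-column selectors (`cols`); each call returns a DataFrame.
sel_symbols = names(df[:, [:x1, :y]])
sel_strings = names(df[:, ["x1", "y"]])
sel_ints    = names(df[:, [1, 3]])
sel_mask    = names(df[:, [true, false, true]])   # Bool mask over columns
sel_regex   = names(df[:, r"^x"])                 # columns whose name starts with "x"
sel_not     = names(df[:, Not(:y)])               # every column except :y
sel_between = names(df[:, Between(:x1, :x2)])     # inclusive range of columns
sel_colon   = names(df[:, :])                     # all columns
```

`All` is left out of the sketch because its accepted arguments have differed between releases.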
The rules for a valid type of index into a row are the following:
- a value, later denoted as `row`:
    - an `Integer` that is not `Bool`;
- a vector, later denoted as `rows`:
    - a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`);
    - a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
    - a `Not` expression;
    - a colon literal `:`;
- an exclamation mark `!`.
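The row selectors can be illustrated with a short sketch. The data frame below is hypothetical; the point is only to show each selector kind and the difference between `:` (copy) and `!` (no copy) when a single column is requested:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [10, 20, 30, 40], b = ["w", "x", "y", "z"])

single   = df[2, :a]                            # `row`: one integer position
by_pos   = df[[1, 3], :a]                       # `rows`: vector of integers
by_mask  = df[[true, false, false, true], :a]   # `rows`: Bool mask, one entry per row
inverted = df[Not(1), :a]                       # `rows`: everything except row 1
copied   = df[:, :a]                            # `:` selects all rows and copies the column
shared   = df[!, :a]                            # `!` selects all rows without copying
```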
Additionally it is allowed to index into an AbstractDataFrame using a two-dimensional CartesianIndex.
In the descriptions below df represents a DataFrame, sdf is a SubDataFrame and dfr is a DataFrameRow.
: always expands to axes(df, 1) or axes(sdf, 1).
df.col works like df[!, col] and sdf.col works like sdf[!, col] in all cases except that df.col .= v and sdf.col .= v perform in-place broadcasting if col is present in df/sdf and is a valid identifier.
getindex and view
The following list specifies the behavior of getindex and view operations depending on argument types.
In particular, each description states explicitly whether the data is copied or reused without copying.
For performance reasons, accessing, via getindex or view, a single row and multiple cols of a DataFrame, a SubDataFrame or a DataFrameRow always returns a DataFrameRow (which is a view type).
getindex on DataFrame:
- `df[row, col]` -> the value contained in row `row` of column `col`, the same as `df[!, col][row]`;
- `df[CartesianIndex(row, col)]` -> the same as `df[row, col]`;
- `df[row, cols]` -> a `DataFrameRow` with parent `df`;
- `df[rows, col]` -> a copy of the vector `df[!, col]` with only the entries corresponding to `rows` selected, the same as `df[!, col][rows]`;
- `df[rows, cols]` -> a `DataFrame` containing copies of columns `cols` with only the entries corresponding to `rows` selected;
- `df[!, col]` -> the vector contained in column `col` returned without copying; the same as `df.col` if `col` is a valid identifier;
- `df[!, cols]` -> create a new `DataFrame` with columns `cols` without copying of columns; the same as `select(df, cols, copycols=false)`.
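The copy versus no-copy distinction in the `getindex` rules for `DataFrame` can be checked with a small sketch. The data and variable names are illustrative assumptions, not part of the API:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

value   = df[2, :a]             # a single value
rowview = df[2, [:a, :b]]       # DataFrameRow (a view) with parent df
colcopy = df[1:2, :a]           # new Vector holding copies of the selected entries
subcopy = df[1:2, [:a, :b]]     # new DataFrame with copied columns
shared  = df[!, [:a]]           # new DataFrame whose column is df.a itself

colcopy[1] = 100                # changes only the copy
shared.a[1] = -1                # writes through to df because the column is shared
```

After these two assignments `df.a` starts with `-1`, while `colcopy` and `subcopy` are unaffected by each other and by `df`.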
view on DataFrame:
- `@view df[row, col]` -> a `0`-dimensional view into `df[!, col]` in row `row`, the same as `view(df[!, col], row)`;
- `@view df[CartesianIndex(row, col)]` -> the same as `@view df[row, col]`;
- `@view df[row, cols]` -> the same as `df[row, cols]`;
- `@view df[rows, col]` -> a view into `df[!, col]` with `rows` selected, the same as `view(df[!, col], rows)`;
- `@view df[rows, cols]` -> a `SubDataFrame` with `rows` selected with parent `df`;
- `@view df[!, col]` -> a view into `df[!, col]` with all rows;
- `@view df[!, cols]` -> the same as `@view df[:, cols]`.
getindex on SubDataFrame:
- `sdf[row, col]` -> a value contained in row `row` of column `col`;
- `sdf[CartesianIndex(row, col)]` -> the same as `sdf[row, col]`;
- `sdf[row, cols]` -> a `DataFrameRow` with parent `parent(sdf)`;
- `sdf[rows, col]` -> a copy of `sdf[!, col]` with only rows `rows` selected, the same as `sdf[!, col][rows]`;
- `sdf[rows, cols]` -> a `DataFrame` containing columns `cols` and `sdf[rows, col]` as a vector for each `col` in `cols`;
- `sdf[!, col]` -> a view of entries corresponding to `sdf` in the vector `parent(sdf)[!, col]`; the same as `sdf.col` if `col` is a valid identifier;
- `sdf[!, cols]` -> create a new `SubDataFrame` with columns `cols`, the same parent as `sdf`, and the same rows selected; the same as `select(sdf, cols, copycols=false)`.
view on SubDataFrame:
- `@view sdf[row, col]` -> a `0`-dimensional view into `sdf[!, col]` at row `row`, the same as `view(sdf[!, col], row)`;
- `@view sdf[CartesianIndex(row, col)]` -> the same as `@view sdf[row, col]`;
- `@view sdf[row, cols]` -> a `DataFrameRow` with parent `parent(sdf)`;
- `@view sdf[rows, col]` -> a view into the `sdf[!, col]` vector with `rows` selected, the same as `view(sdf[!, col], rows)`;
- `@view sdf[rows, cols]` -> a `SubDataFrame` with parent `parent(sdf)`;
- `@view sdf[!, col]` -> a view into the `sdf[!, col]` vector with all rows;
- `@view sdf[!, cols]` -> the same as `@view sdf[:, cols]`.
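A brief sketch of how views relate to their parent may help here. It assumes nothing beyond the rules listed above; the data frame and variable names are hypothetical:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

sdf   = @view df[2:3, :]      # SubDataFrame with parent df
col   = @view df[2:3, :a]     # view into the vector df.a
cell  = @view df[1, :b]       # 0-dimensional view of a single entry
inner = @view sdf[1:1, :]     # view of a view; its parent is still df

sdf[1, :a] = 200              # writes to row 2 of df, since sdf starts at row 2
cell[] = -10                  # writes to df[1, :b] through the 0-dimensional view

sdf_col = sdf[!, :a]          # a view into the parent's column, not a copy
```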
getindex on DataFrameRow:
- `dfr[col]` -> the value contained in column `col` of `dfr`; the same as `dfr.col` if `col` is a valid identifier;
- `dfr[cols]` -> a `DataFrameRow` with parent `parent(dfr)`.
view on DataFrameRow:
- `@view dfr[col]` -> a `0`-dimensional view into `parent(dfr)[DataFrames.row(dfr), col]`;
- `@view dfr[cols]` -> a `DataFrameRow` with parent `parent(dfr)`.
Note that views created with columns selector set to : change their columns' count if columns are added/removed/renamed in the parent; if column selector is other than : then view points to selected columns by their number at the moment of creation of the view.
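The note above can be illustrated with a small sketch, assuming a recent DataFrames.jl release; the data and names are hypothetical. One view is created with `:` as the column selector and another with an explicit column list, and a column is then added to the parent:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

all_cols = @view df[1:2, :]           # column selector is `:`
fixed    = @view df[1:2, [:a, :b]]    # column selector is an explicit list

df.c = [7, 8, 9]                      # add a column to the parent

cols_after_all   = names(all_cols)    # follows the parent's current columns
cols_after_fixed = names(fixed)       # stays with the columns chosen at creation
```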
setindex!
The following list specifies the behavior of setindex! operations depending on argument types.
In particular a description explicitly mentions if the assignment is in-place.
Note that if a setindex! operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).
setindex! on DataFrame:
- `df[row, col] = v` -> set value of `col` in row `row` to `v` in-place;
- `df[CartesianIndex(row, col)] = v` -> the same as `df[row, col] = v`;
- `df[row, cols] = v` -> set row `row` of columns `cols` in-place; the same as `dfr = df[row, cols]; dfr[:] = v`;
- `df[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`; if `rows` is `:` and `col` is a `Symbol` or `AbstractString` that is not present in `df` then a new column in `df` is created and holds a `copy` of `v`; equivalent to `df.col = copy(v)` if `col` is a valid identifier;
- `df[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame` (in this case column names must match);
- `df[!, col] = v` -> replaces `col` with `v` without copying (with the exception that if `v` is an `AbstractRange` it gets converted to a `Vector`); also if `col` is a `Symbol` or `AbstractString` that is not present in `df` then a new column in `df` is created and holds `v`; equivalent to `df.col = v` if `col` is a valid identifier; this is allowed if `ncol(df) == 0 || length(v) == nrow(df)`;
- `df[!, cols] = v` -> replaces existing columns `cols` in data frame `df` with copying; `v` must be an `AbstractMatrix` or an `AbstractDataFrame` (in the latter case column names must match).
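The following sketch contrasts in-place assignment with column replacement on a `DataFrame`. The data and the helper variables `old_a`, `v` and `w` are illustrative assumptions:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
old_a = df.a                 # keep a reference to the original column vector

df[1, :a] = 10               # in-place: mutates the existing vector
df[2:3, :a] = [20, 30]       # in-place: mutates the existing vector

v = [7, 8, 9]
df[:, :c] = v                # :c is new, so a column holding a copy of v is created

w = [0, 0, 0]
df[!, :a] = w                # replaces column :a with w itself, without copying
```

After the last line `old_a` still holds `[10, 20, 30]`, but it is no longer a column of `df`.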
setindex! on SubDataFrame:
- `sdf[row, col] = v` -> set value of `col` in row `row` to `v` in-place;
- `sdf[CartesianIndex(row, col)] = v` -> the same as `sdf[row, col] = v`;
- `sdf[row, cols] = v` -> the same as `dfr = sdf[row, cols]; dfr[:] = v` in-place;
- `sdf[rows, col] = v` -> set rows `rows` of column `col` in-place; `v` must be an `AbstractVector`;
- `sdf[rows, cols] = v` -> set rows `rows` of columns `cols` in-place; `v` must be an `AbstractMatrix` or an `AbstractDataFrame` (in the latter case column names must match).
Note that sdf[!, col] = v, sdf[!, cols] = v and sdf.col = v are not allowed as sdf can be only modified in-place.
setindex! on DataFrameRow:
- `dfr[col] = v` -> set value of `col` in row `row` to `v` in-place; equivalent to `dfr.col = v` if `col` is a valid identifier;
- `dfr[cols] = v` -> set values of entries in columns `cols` in `dfr` by elements of `v` in place; `v` can be: 1) a `Tuple` or an `AbstractArray`, in which cases it must have a number of elements equal to `length(dfr)`, 2) an `AbstractDict`, in which case column names must match, 3) a `NamedTuple` or `DataFrameRow`, in which case column names and order must match.
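Because a `DataFrameRow` is a view, assigning to it changes the parent data frame. A minimal sketch with hypothetical data:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[2, :]               # DataFrameRow pointing at row 2 of df

dfr.a = 20                   # property syntax, valid because `a` is an identifier
dfr[:b] = 40                 # indexing syntax, same effect for column :b
after_single = (df[2, :a], df[2, :b])

dfr[:] = (21, 41)            # a Tuple with one element per column of dfr
after_tuple = (df[2, :a], df[2, :b])
```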
Broadcasting
The following broadcasting rules apply to AbstractDataFrame objects:
- `AbstractDataFrame` behaves in broadcasting like a two-dimensional collection compatible with matrices.
- If an `AbstractDataFrame` takes part in broadcasting then a `DataFrame` is always produced as a result. In this case the requested broadcasting operation produces an object with exactly two dimensions. An exception is when an `AbstractDataFrame` is used only as a source of broadcast assignment into an object of dimensionality higher than two.
- If multiple `AbstractDataFrame` objects take part in broadcasting then they have to have identical column names.
Note that if broadcasting assignment operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).
Broadcasting DataFrameRow is currently not allowed (which is consistent with NamedTuple).
It is possible to assign a value to AbstractDataFrame and DataFrameRow objects using the .= operator. In such an operation AbstractDataFrame is considered as two-dimensional and DataFrameRow as single-dimensional.
The rule above means that, similar to single-dimensional objects in Base (e.g. vectors), DataFrameRow is considered to be column-oriented.
Additional rules:
- in the `df[CartesianIndex(row, col)] .= v` and `df[row, col] .= v` syntaxes `v` is broadcasted into the contents of `df[row, col]` (this is consistent with Julia Base);
- in the `df[row, cols] .= v` syntax the assignment to `df` is performed in-place;
- in the `df[rows, col] .= v` and `df[rows, cols] .= v` syntaxes the assignment to `df` is performed in-place; if `rows` is `:` and `col` is a `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added; the length of the column is always the value of `nrow(df)` before the assignment takes place;
- in the `df[!, col] .= v` syntax column `col` is replaced by a freshly allocated vector; if `col` is a `Symbol` or `AbstractString` and it is missing from `df` then a new column is allocated and added; the length of the column is always the value of `nrow(df)` before the assignment takes place;
- the `df[!, cols] .= v` syntax replaces existing columns `cols` in data frame `df` with freshly allocated vectors;
- the `df.col .= v` syntax is allowed and performs in-place assignment to an existing vector `df.col`;
- in the `sdf[CartesianIndex(row, col)] .= v`, `sdf[row, col] .= v` and `sdf[row, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
- in the `sdf[rows, col] .= v` and `sdf[rows, cols] .= v` syntaxes the assignment to `sdf` is performed in-place;
- the `sdf.col .= v` syntax is allowed and performs in-place assignment to an existing vector `sdf.col`;
- the `dfr.col .= v` syntax is allowed and performs in-place assignment to a value extracted by `dfr.col`.
Note that sdf[!, col] .= v and sdf[!, cols] .= v syntaxes are not allowed as sdf can be only modified in-place.
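The difference between in-place broadcast assignment with `:` and column replacement with `!` can be sketched as follows. The data and the names `shifted`, `old_a`, `in_place` and `replaced` are hypothetical:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

shifted = df .+ 1            # broadcasting over a data frame yields a new DataFrame

old_a = df.a                 # reference to the current column vector
df[:, :a] .= 0               # in-place: the existing vector is overwritten
in_place = df.a === old_a

df[!, :a] .= 1.5             # replacement: a fresh vector is allocated for :a
replaced = df.a !== old_a
```

Since `!` allocates a new vector, the element type of the column may change (here from `Int` to `Float64`), which an in-place assignment into the integer column would not permit.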
If column indexing using Symbol or AbstractString names in cols is performed, the order of columns in the operation is specified by the order of names.
Indexing GroupedDataFrames
A GroupedDataFrame can behave as either an AbstractVector or AbstractDict depending on the type of index used. Integers (or arrays of them) trigger vector-like indexing while Tuples and NamedTuples trigger dictionary-like indexing. An intermediate between the two is the GroupKey type returned by keys(::GroupedDataFrame), which behaves similarly to a NamedTuple but has performance on par with integer indexing.
The elements of a GroupedDataFrame are SubDataFrames of its parent.
- `gd[i::Integer]` -> Get the `i`th group.
- `gd[key::NamedTuple]` -> Get the group corresponding to the given values of the grouping columns. The fields of the `NamedTuple` must match the grouping columns passed to `groupby` (including order).
- `gd[key::Tuple]` -> Same as previous, but omitting the names on `key`.
- `get(gd, key::Union{Tuple, NamedTuple}, default)` -> Get group for key `key`, returning `default` if it does not exist.
- `gd[key::GroupKey]` -> Get the group corresponding to the `GroupKey` `key` (one of the elements of the vector returned by `keys(::GroupedDataFrame)`). This should be nearly as fast as integer indexing.
- `gd[a::AbstractVector]` -> Select multiple groups and return them in a new `GroupedDataFrame` object. Groups may be selected by integer position using an array of `Integer`s or `Bool`s, similar to a standard array. Alternatively the array may contain keys of any of the types supported for dictionary-like indexing (`GroupKey`, `Tuple`, or `NamedTuple`). Selected groups must be unique, and different types of indices cannot be mixed.
- `gd[n::Not]` -> Any of the above types wrapped in `Not`. The result will be a new `GroupedDataFrame` containing all groups in `gd` not selected by the wrapped index.
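A short sketch of the vector-like and dictionary-like indexing styles, assuming a recent DataFrames.jl release. The data, the grouping column `g` and the variable names are hypothetical, and the sketch relies on key lookups rather than on any particular group order:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(g = ["a", "b", "a", "c"], v = [1, 2, 3, 4])
gd = groupby(df, :g)

first_group = gd[1]                       # vector-like: group at position 1
by_named    = gd[(g = "a",)]              # dictionary-like: NamedTuple key
by_tuple    = gd[("a",)]                  # dictionary-like: Tuple key
no_group    = get(gd, ("z",), nothing)    # default returned for a missing key
by_key      = gd[keys(gd)[1]]             # GroupKey taken from keys(gd)
picked      = gd[[("a",), ("c",)]]        # several groups -> new GroupedDataFrame
rest        = gd[Not(1)]                  # all groups except the first
```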
Common API for types defined in DataFrames.jl
This table presents return value types of calling names, propertynames and keys on types exposed to the user by DataFrames.jl:
| Type | names | propertynames | keys |
|---|---|---|---|
| AbstractDataFrame | Vector{String} | Vector{Symbol} | undefined |
| DataFrameRow | Vector{String} | Vector{Symbol} | Vector{Symbol} |
| DataFrameRows | Vector{String} | Vector{Symbol} | vector of Int |
| DataFrameColumns | Vector{String} | Vector{Symbol} | Vector{Symbol} |
| GroupedDataFrame | Vector{String} | tuple of fields | GroupKeys |
| GroupKeys | undefined | tuple of fields | vector of Int |
| GroupKey | Vector{String} | Vector{Symbol} | Vector{Symbol} |
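
A few rows of the table can be spot-checked with a sketch like the one below. It assumes a DataFrames.jl release in which `names` returns strings, and the data frame is hypothetical:

```julia
using DataFrames

# Hypothetical data used only for illustration.
df = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[1, :]                       # DataFrameRow
cols = eachcol(df)                   # DataFrameColumns

df_names     = names(df)             # column names as strings
df_propnames = propertynames(df)     # column names as symbols
dfr_names    = names(dfr)
dfr_keys     = keys(dfr)
cols_keys    = keys(cols)
```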