Indexing
General rules
The following rules describe how getindex, setindex!, view, and broadcasting are intended to work with DataFrame, SubDataFrame and DataFrameRow objects.
The following values are a valid column index:
- a scalar, later denoted as col:
  - a Symbol;
  - an AbstractString;
  - an Integer that is not Bool;
- a vector, later denoted as cols:
  - a vector of Symbol (does not have to be a subtype of AbstractVector{Symbol});
  - a vector of AbstractString (does not have to be a subtype of AbstractVector{<:AbstractString});
  - a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
  - a vector of Bool (must be a subtype of AbstractVector{Bool});
  - a regular expression (will be expanded to a vector of matching column names);
  - a Not expression (see InvertedIndices.jl); Not(idx) selects all indices not in the passed idx; when passed as column selector Not(idx...) is equivalent to Not(Cols(idx...));
  - a Cols expression (see DataAPI.jl); Cols(idxs...) selects the union of the selections in idxs; in particular Cols() selects no columns and Cols(:) selects all columns; a special rule is Cols(predicate), where predicate is a predicate function; in this case the columns whose names passed to predicate as strings return true are selected;
  - a Between expression (see DataAPI.jl); Between(first, last) selects the columns between first and last inclusively;
  - an All expression (see DataAPI.jl); All() selects all columns, equivalent to :;
  - a literal colon : (selects all columns).
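As a quick illustration of the column selectors listed above, here is a minimal sketch. The data frame and its column names (a, b, x1, x2) are made up for the example, and it assumes a DataFrames.jl version in which names returns strings.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = 1:3, b = 4:6, x1 = 7:9, x2 = 10:12)

# Scalar column indices (col): Symbol, string and integer pick the same column.
same_column = df[:, :a] == df[:, "a"] == df[:, 1]

# Vector-like column indices (cols).
by_symbols = names(df[:, [:a, :x1]])                   # vector of Symbol
by_bools   = names(df[:, [true, false, false, true]])  # vector of Bool
by_regex   = names(df[:, r"^x"])                       # regular expression
by_not     = names(df[:, Not(:a)])                     # everything except a
by_between = names(df[:, Between(:b, :x1)])            # inclusive range of columns
by_cols    = names(df[:, Cols(r"^x", :a)])             # union of selections
by_colon   = names(df[:, :])                           # all columns
```

Every selector above produces a new DataFrame (or, for a scalar col, a vector), so comparing the resulting names is a convenient way to check what was selected.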
The following values are a valid row index:
- a scalar, later denoted as row:
  - an Integer that is not Bool;
- a vector, later denoted as rows:
  - a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
  - a vector of Bool (must be a subtype of AbstractVector{Bool});
  - a Not expression (see InvertedIndices.jl);
  - a literal colon : (selects all rows with copying);
  - a literal exclamation mark ! (selects all rows without copying).
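The row selectors can be sketched in the same way. The example below uses a made-up single-column data frame and mainly shows the difference between : (copy) and ! (no copy).

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [10, 20, 30])

one_value  = df[2, :a]                    # scalar row index
by_ints    = df[[1, 3], :a]               # vector of Integer
by_bools   = df[[true, false, true], :a]  # vector of Bool
by_not     = df[Not(2), :a]               # Not expression

copied     = df[:, :a]   # `:` selects all rows and copies the column
not_copied = df[!, :a]   # `!` selects all rows and returns the stored vector
```

Both copied and not_copied hold the same values, but only not_copied is the very vector stored in df, which is also what df.a returns.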
Additionally it is allowed to index into an AbstractDataFrame using a two-dimensional CartesianIndex.
In the descriptions below df represents a DataFrame, sdf is a SubDataFrame and dfr is a DataFrameRow.
The : row selector always expands to axes(df, 1) or axes(sdf, 1).
df.col works like df[!, col] and sdf.col works like sdf[!, col] in all cases.
getindex and view
The following list specifies the behavior of getindex and view operations depending on argument types.
In particular, each description explicitly states whether the data is copied or reused without copying.
For performance reasons, accessing, via getindex or view, a single row and multiple cols of a DataFrame, a SubDataFrame or a DataFrameRow always returns a DataFrameRow (which is a view type).
getindex on DataFrame:
- df[row, col] -> the value contained in row row of column col, the same as df[!, col][row];
- df[CartesianIndex(row, col)] -> the same as df[row, col];
- df[row, cols] -> a DataFrameRow with parent df;
- df[rows, col] -> a copy of the vector df[!, col] with only the entries corresponding to rows selected, the same as df[!, col][rows];
- df[rows, cols] -> a DataFrame containing copies of columns cols with only the entries corresponding to rows selected;
- df[!, col] -> the vector contained in column col returned without copying; the same as df.col if col is a valid identifier;
- df[!, cols] -> create a new DataFrame with columns cols without copying of columns; the same as select(df, cols, copycols=false).
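The getindex rules for a DataFrame can be checked with a small sketch like the one below. The data frame is hypothetical; the point is the return type of each form and whether columns are shared with df.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

cell       = df[2, :b]                  # a single value
cell_cart  = df[CartesianIndex(2, 2)]   # the same cell via CartesianIndex
row_view   = df[1, [:a, :b]]            # DataFrameRow (a view type)
col_copy   = df[[1, 2], :a]             # copied Vector
sub_copy   = df[[1, 2], [:a]]           # new DataFrame with copied columns
col_shared = df[!, :a]                  # the stored vector, no copy
df_shared  = df[!, [:a]]                # new DataFrame sharing columns with df
df_copied  = df[:, [:a]]                # new DataFrame with copied columns
```

Here df_shared.a is the same vector object as df.a, while df_copied.a is an independent copy.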
view on DataFrame:
- @view df[row, col] -> a 0-dimensional view into df[!, col] in row row, the same as view(df[!, col], row);
- @view df[CartesianIndex(row, col)] -> the same as @view df[row, col];
- @view df[row, cols] -> the same as df[row, cols];
- @view df[rows, col] -> a view into df[!, col] with rows selected, the same as view(df[!, col], rows);
- @view df[rows, cols] -> a SubDataFrame with rows selected with parent df;
- @view df[!, col] -> a view into df[!, col] with all rows;
- @view df[!, cols] -> the same as @view df[:, cols].
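A minimal sketch of the view rules for a DataFrame follows. The data frame is made up; writing through each view shows that the parent is modified.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

cell_view = @view df[2, :a]      # 0-dimensional view into df[!, :a]
cell_view[] = 20                 # writes through to df

col_view = @view df[1:2, :b]     # view into df[!, :b]
col_view[1] = 40                 # writes through to df

sdf = @view df[1:2, :]           # SubDataFrame with parent df
dfr = @view df[3, [:a, :b]]      # DataFrameRow, same as df[3, [:a, :b]]
```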
getindex on SubDataFrame:
- sdf[row, col] -> a value contained in row row of column col;
- sdf[CartesianIndex(row, col)] -> the same as sdf[row, col];
- sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
- sdf[rows, col] -> a copy of sdf[!, col] with only rows rows selected, the same as sdf[!, col][rows];
- sdf[rows, cols] -> a DataFrame containing columns cols and sdf[rows, col] as a vector for each col in cols;
- sdf[!, col] -> a view of entries corresponding to sdf in the vector parent(sdf)[!, col]; the same as sdf.col if col is a valid identifier;
- sdf[!, cols] -> create a new SubDataFrame with columns cols, the same parent as sdf, and the same rows selected; the same as select(sdf, cols, copycols=false).
view on SubDataFrame:
- @view sdf[row, col] -> a 0-dimensional view into sdf[!, col] at row row, the same as view(sdf[!, col], row);
- @view sdf[CartesianIndex(row, col)] -> the same as @view sdf[row, col];
- @view sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
- @view sdf[rows, col] -> a view into sdf[!, col] vector with rows selected, the same as view(sdf[!, col], rows);
- @view sdf[rows, cols] -> a SubDataFrame with parent parent(sdf);
- @view sdf[!, col] -> a view into sdf[!, col] vector with all rows;
- @view sdf[!, cols] -> the same as @view sdf[:, cols].
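The SubDataFrame rules can be illustrated with the sketch below, which uses a made-up parent data frame and a view of its rows 2 and 3. It shows which operations copy and which keep pointing at the parent.

```julia
using DataFrames

# Hypothetical parent data frame and a view of its rows 2 and 3.
df = DataFrame(a = [1, 2, 3, 4], b = [5, 6, 7, 8])
sdf = @view df[2:3, :]

cell      = sdf[1, :a]            # value from row 2 of the parent
col_copy  = sdf[:, :a]            # copied Vector
col_view  = sdf[!, :a]            # view into parent(sdf)[!, :a]
row_view  = sdf[1, [:a]]          # DataFrameRow whose parent is df
df_copy   = sdf[1:2, [:a]]        # DataFrame with copied data
sub_again = sdf[!, [:a]]          # SubDataFrame with the same parent and rows
nested    = @view sdf[1:1, :]     # SubDataFrame with parent df, not sdf
```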
getindex on DataFrameRow:
- dfr[col] -> the value contained in column col of dfr; the same as dfr.col if col is a valid identifier;
- dfr[cols] -> a DataFrameRow with parent parent(dfr);
view on DataFrameRow:
- @view dfr[col] -> a 0-dimensional view into parent(dfr)[DataFrames.row(dfr), col];
- @view dfr[cols] -> a DataFrameRow with parent parent(dfr);
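For DataFrameRow, a short sketch under the same assumptions (a made-up data frame) looks like this:

```julia
using DataFrames

# Hypothetical data frame and a DataFrameRow pointing at its second row.
df = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[2, :]

by_symbol   = dfr[:a]            # value in column a
by_property = dfr.a              # the same value via property access
narrower    = dfr[[:b]]          # DataFrameRow with parent parent(dfr)

cell_view = @view dfr[:b]        # 0-dimensional view into the parent
cell_view[] = 40                 # writes through to df
```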
Note that views created with columns selector set to : change their columns' count if columns are added/removed/renamed in the parent; if column selector is other than : then view points to selected columns by their number at the moment of creation of the view.
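The difference between views created with : and with an explicit column selector can be sketched as follows. The data frame is hypothetical; the only thing being shown is how each view reacts when a column is added to the parent.

```julia
using DataFrames

# Hypothetical parent data frame.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

view_all      = @view df[1:2, :]          # column selector is `:`
view_selected = @view df[1:2, [:a, :b]]   # explicit column selector

# Add a column to the parent after both views were created.
df[!, :c] = [7, 8, 9]

ncols_all      = ncol(view_all)        # follows the parent
ncols_selected = ncol(view_selected)   # fixed at creation time
```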
setindex!
The following list specifies the behavior of setindex! operations depending on argument types.
In particular, each description explicitly states whether the assignment is in-place.
Note that if a setindex! operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).
setindex! on DataFrame:
- df[row, col] = v -> set value of col in row row to v in-place;
- df[CartesianIndex(row, col)] = v -> the same as df[row, col] = v;
- df[row, cols] = v -> set row row of columns cols in-place; the same as dfr = df[row, cols]; dfr[:] = v;
- df[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector; if rows is : and col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds a copy of v; equivalent to df.col = copy(v) if col is a valid identifier;
- df[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
- df[!, col] = v -> replaces col with v without copying (with the exception that if v is an AbstractRange it gets converted to a Vector); also if col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds v; equivalent to df.col = v if col is a valid identifier; this is allowed if ncol(df) == 0 || length(v) == nrow(df);
- df[!, cols] = v -> replaces existing columns cols in data frame df with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
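The contrast between in-place assignment with : and column replacement with ! is easiest to see by tracking object identity. The following is a minimal sketch with made-up data.

```julia
using DataFrames

# Hypothetical data frame and source vector.
df = DataFrame(a = [1, 2, 3])
v = [10, 20, 30]

df[!, :b] = v          # new column holds v itself (no copy)
df[:, :c] = v          # new column holds a copy of v

old_a = df.a
df[1, :a] = 100        # in-place update of a single cell
df[2:3, :a] = [200, 300]   # in-place update of selected rows
still_same = df.a === old_a

df[!, :a] = [1.5, 2.5, 3.5]   # replaces the column; element type may change
replaced = df.a !== old_a
```

After the last assignment old_a still holds [100, 200, 300], but it is no longer a column of df.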
setindex! on SubDataFrame:
- sdf[row, col] = v -> set value of col in row row to v in-place;
- sdf[CartesianIndex(row, col)] = v -> the same as sdf[row, col] = v;
- sdf[row, cols] = v -> the same as dfr = sdf[row, cols]; dfr[:] = v in-place;
- sdf[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector;
- sdf[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
- sdf[!, col] = v -> replaces col with v with copying; if col is present in sdf then filtered-out rows in the newly created vector are filled with values already present in that column and promote_type is used to determine the eltype of the new column; if col is not present in sdf then the operation is only allowed if sdf was created with : as column selector, in which case filtered-out rows are filled with missing; equivalent to sdf.col = v if col is a valid identifier; the operation is allowed if length(v) == nrow(sdf);
- sdf[!, cols] = v -> replaces existing columns cols in data frame sdf with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match); filtered-out rows in newly created vectors are filled with values already present in respective columns and promote_type is used to determine the eltype of the new columns;
The rules above mean that sdf[:, col] = v is an in-place operation if col is present in sdf, therefore it will be fast in general. On the other hand using sdf[!, col] = v or sdf.col = v will always allocate a new vector, which is more expensive computationally.
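The in-place path described above can be sketched as follows. The example uses a made-up parent data frame and deliberately sticks to the : row selector, so the parent column is updated without being replaced.

```julia
using DataFrames

# Hypothetical parent data frame and a view of its rows 2 and 3.
df = DataFrame(a = [1, 2, 3, 4])
sdf = @view df[2:3, :]

parent_col = df.a

sdf[1, :a] = -2            # in-place: writes to row 2 of the parent
sdf[:, :a] = [20, 30]      # in-place: writes to rows 2 and 3 of the parent

untouched_identity = df.a === parent_col
```

Rows 1 and 4 of df are not covered by sdf and keep their values.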
setindex! on DataFrameRow:
- dfr[col] = v -> set value of col in row row to v in-place; equivalent to dfr.col = v if col is a valid identifier;
- dfr[cols] = v -> set values of entries in columns cols in dfr by elements of v in place; v can be: 1) a Tuple or an AbstractArray, in which cases it must have a number of elements equal to length(dfr), 2) an AbstractDict, in which case column names must match, 3) a NamedTuple or DataFrameRow, in which case column names and order must match;
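A short sketch of assignment through a DataFrameRow follows, using a made-up two-row data frame. Only the Tuple, AbstractArray and NamedTuple forms of v are shown.

```julia
using DataFrames

# Hypothetical data frame and a DataFrameRow pointing at its first row.
df = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[1, :]

dfr[:a] = 10                       # single column, in-place
dfr.b = 30                         # the same via property syntax
after_scalar = (df.a[1], df.b[1])

dfr[[:a, :b]] = (100, 300)         # Tuple with length(dfr) elements
after_tuple = (df.a[1], df.b[1])

dfr[[:a, :b]] = [7, 8]             # AbstractArray with length(dfr) elements
after_array = (df.a[1], df.b[1])

dfr[[:a, :b]] = (a = 5, b = 6)     # NamedTuple: names and order must match
after_namedtuple = (df.a[1], df.b[1])
```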
Broadcasting
The following broadcasting rules apply to AbstractDataFrame objects:
- AbstractDataFrame behaves in broadcasting like a two-dimensional collection compatible with matrices.
- If an AbstractDataFrame takes part in broadcasting then a DataFrame is always produced as a result. In this case the requested broadcasting operation produces an object with exactly two dimensions. An exception is when an AbstractDataFrame is used only as a source of broadcast assignment into an object of dimensionality higher than two.
- If multiple AbstractDataFrame objects take part in broadcasting then they have to have identical column names.
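These rules can be sketched with a small, made-up data frame broadcast against a scalar, a 1×2 matrix, and another data frame with identical column names.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

plus_one = df .+ 1           # scalar is broadcast over every cell
scaled   = df .* [10 100]    # 1×2 matrix is broadcast across the rows
doubled  = df .+ df          # both operands have identical column names
```

In each case the result is a new DataFrame with the column names of df; df itself is not modified.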
Note that if broadcasting assignment operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).
Broadcasting DataFrameRow is currently not allowed (which is consistent with NamedTuple).
It is possible to assign a value to AbstractDataFrame and DataFrameRow objects using the .= operator. In such an operation AbstractDataFrame is considered as two-dimensional and DataFrameRow as single-dimensional.
The rule above means that, similar to single-dimensional objects in Base (e.g. vectors), DataFrameRow is considered to be column-oriented.
Additional rules:
- in the df[CartesianIndex(row, col)] .= v and df[row, col] .= v syntaxes v is broadcasted into the contents of df[row, col] (this is consistent with Julia Base);
- in the df[row, cols] .= v syntax the assignment to df is performed in-place;
- in the df[rows, col] .= v and df[rows, cols] .= v syntaxes the assignment to df is performed in-place; if rows is : and col is a Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
- in the df[!, col] .= v syntax column col is replaced by a freshly allocated vector; if col is a Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
- the df[!, cols] .= v syntax replaces existing columns cols in data frame df with freshly allocated vectors;
- the df.col .= v syntax currently performs in-place assignment to an existing vector df.col; this behavior is deprecated and a new column will be allocated in the future. If :col is not present in df then a new column will be created in df.
- in the sdf[CartesianIndex(row, col)] .= v, sdf[row, col] .= v and sdf[row, cols] .= v syntaxes the assignment to sdf is performed in-place;
- in the sdf[rows, col] .= v and sdf[rows, cols] .= v syntaxes the assignment to sdf is performed in-place; if rows is : and col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; the filtered-out rows are filled with missing;
- in the sdf[!, col] .= v syntax column col is replaced by a freshly allocated vector; the filtered-out rows are filled with values already present in col; if col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; in this case the filtered-out rows are filled with missing;
- the sdf[!, cols] .= v syntax replaces existing columns cols in data frame sdf with freshly allocated vectors; the filtered-out rows are filled with values already present in cols;
- the sdf.col .= v syntax currently performs in-place assignment to an existing vector sdf.col; this behavior is deprecated and a new column will be allocated in the future. If :col is not present in sdf then a new column will be created in sdf if sdf was created with : as a column selector.
- the dfr.col .= v syntax is allowed and performs in-place assignment to a value extracted by dfr.col.
Note that sdf[!, col] .= v and sdf[!, cols] .= v syntaxes are not allowed as sdf can be only modified in-place.
If column indexing using Symbol or AbstractString names in cols is performed, the order of columns in the operation is specified by the order of names.
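For a DataFrame, the difference between broadcast assignment with : and with ! can be sketched by tracking the identity of the column vector. The data below is made up, and the sketch is limited to the df[:, col] .= v and df[!, col] .= v forms.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
old_a = df.a

df[:, :a] .= 0              # in-place: the existing vector is overwritten
in_place = df.a === old_a

df[!, :a] .= 1.5            # column a is replaced by a freshly allocated vector
replaced = df.a !== old_a

df[!, :c] .= "x"            # column c does not exist yet, so it is created
```

Because df[!, :a] .= 1.5 allocates a new vector, the element type of column a can change, which is not possible with the in-place form.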
Indexing GroupedDataFrames
A GroupedDataFrame can behave as either an AbstractVector or AbstractDict depending on the type of index used. Integers (or arrays of them) trigger vector-like indexing while Tuples and NamedTuples trigger dictionary-like indexing. An intermediate between the two is the GroupKey type returned by keys(::GroupedDataFrame), which behaves similarly to a NamedTuple but has performance on par with integer indexing.
The elements of a GroupedDataFrame are SubDataFrames of its parent.
- gd[i::Integer] -> Get the ith group.
- gd[key::NamedTuple] -> Get the group corresponding to the given values of the grouping columns. The fields of the NamedTuple must match the grouping columns passed to groupby (including order).
- gd[key::Tuple] -> Same as previous, but omitting the names on key.
- get(gd, key::Union{Tuple, NamedTuple}, default) -> Get group for key key, returning default if it does not exist.
- gd[key::GroupKey] -> Get the group corresponding to the GroupKey key (one of the elements of the vector returned by keys(::GroupedDataFrame)). This should be nearly as fast as integer indexing.
- gd[a::AbstractVector] -> Select multiple groups and return them in a new GroupedDataFrame object. Groups may be selected by integer position using an array of Integers or Bools, similar to a standard array. Alternatively the array may contain keys of any of the types supported for dictionary-like indexing (GroupKey, Tuple, or NamedTuple). Selected groups must be unique, and different types of indices cannot be mixed.
- gd[n::Not] -> Any of the above types wrapped in Not. The result will be a new GroupedDataFrame containing all groups in gd not selected by the wrapped index.
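The indexing forms above can be tried on a small example. The data frame is made up, and sort=true is passed to groupby only so that the group order in the sketch is well defined (groups for "a", "b", "c" in that order).

```julia
using DataFrames

# Hypothetical data frame grouped by column g.
df = DataFrame(g = ["a", "b", "a", "c"], v = [1, 2, 3, 4])
gd = groupby(df, :g, sort = true)

first_group = gd[1]                       # vector-like indexing
by_named    = gd[(g = "b",)]              # dictionary-like, NamedTuple key
by_tuple    = gd[("b",)]                  # dictionary-like, Tuple key
fallback    = get(gd, ("z",), nothing)    # key that does not exist

key       = keys(gd)[3]                   # a GroupKey
by_key    = gd[key]
several   = gd[[1, 3]]                    # new GroupedDataFrame with two groups
all_but_1 = gd[Not(1)]                    # new GroupedDataFrame without group 1
```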
Common API for types defined in DataFrames.jl
This table presents return value types of calling names, propertynames, keys, length and ndims on types exposed to the user by DataFrames.jl:
| Type | names | propertynames | keys | length | ndims | 
|---|---|---|---|---|---|
| AbstractDataFrame | Vector{String} | Vector{Symbol} | undefined | undefined | 2 | 
| DataFrameRow | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 | 
| DataFrameRows | Vector{String} | Vector{Symbol} | vector of Int | Int | 1 | 
| DataFrameColumns | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 | 
| GroupedDataFrame | Vector{String} | tuple of fields | GroupKeys | Int | 1 | 
| GroupKeys | undefined | tuple of fields | vector of Int | Int | 1 | 
| GroupKey | Vector{String} | Vector{Symbol} | Vector{Symbol} | Int | 1 | 
Additionally, for the above types T (i.e. AbstractDataFrame, DataFrameRow, DataFrameRows, DataFrameColumns, GroupedDataFrame, GroupKeys, GroupKey) the following methods are defined:
- size(::T) returning a Tuple of Int.
- size(::T, ::Integer) returning an Int.
- axes(::T) returning a Tuple of Int vectors.
- axes(::T, ::Integer) returning an Int vector for a valid dimension (except DataFrameRows and GroupKeys for which Base.OneTo(1) is also returned for a dimension higher than a valid one because they are AbstractVector).
- firstindex(::T) returning 1 (except AbstractDataFrame for which it is undefined).
- firstindex(::T, ::Integer) returning 1 for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
- lastindex(::T) returning Int (except AbstractDataFrame for which it is undefined).
- lastindex(::T, ::Integer) returning Int for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
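A few entries of the table and of the method list can be spot-checked with the sketch below. It uses a made-up 2×2 data frame and covers only a subset of the types.

```julia
using DataFrames

# Hypothetical data frame, used only for illustration.
df = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[1, :]

df_names  = names(df)            # Vector{String}
df_props  = propertynames(df)    # Vector{Symbol}
df_size   = size(df)
df_ndims  = ndims(df)
df_rows   = axes(df, 1)
df_last   = lastindex(df, 1)

dfr_keys   = keys(dfr)           # Vector{Symbol}
dfr_length = length(dfr)
dfr_ndims  = ndims(dfr)

n_rows = length(eachrow(df))     # DataFrameRows
n_cols = length(eachcol(df))     # DataFrameColumns
```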