Types

Type hierarchy design

AbstractDataFrame is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.

DataFrame is the most fundamental subtype of AbstractDataFrame, which stores a set of columns as AbstractVector objects.

SubDataFrame is an AbstractDataFrame subtype representing a view into a DataFrame. It stores only a reference to the parent DataFrame and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view function or is returned by indexing into a GroupedDataFrame object.

GroupedDataFrame is a type that stores the result of a grouping operation performed on an AbstractDataFrame. It is intended to be created as a result of a call to the groupby function.

DataFrameRow is a view into a single row of an AbstractDataFrame. It stores only a reference to a parent DataFrame and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow type supports iteration over columns of the row and is similar in functionality to the NamedTuple type, but allows for modification of data stored in the parent DataFrame and reflects changes made to the parent after the creation of the view. Typically objects of the DataFrameRow type are encountered when returned by the eachrow function, or when accessing a single row of a DataFrame or SubDataFrame via getindex or view.
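The view semantics of SubDataFrame and DataFrameRow described above can be sketched as follows. This is a minimal illustration; the data frame and the variable names are made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

sdf = view(df, 2:3, :)   # SubDataFrame: rows 2 and 3, all columns
row = df[1, :]           # DataFrameRow: a view of row 1

row.a = 10               # writes through to the parent
sdf[1, :b] = "changed"   # row 1 of the view is row 2 of the parent
df[3, :a] = 30           # a later change to the parent is visible in the view
```

After these assignments both df and sdf see the same data, because neither view holds its own copy of the columns.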

The eachrow function returns a value of the DataFrameRows type, which serves as an iterator over rows of an AbstractDataFrame, returning DataFrameRow objects.

Similarly, the eachcol function returns a value of the DataFrameColumns type, which serves as an iterator over columns of an AbstractDataFrame.

The DataFrameRows and DataFrameColumns types are subtypes of AbstractVector and support its interface with the exception that they are read only. Note that they are not exported and should not be constructed directly; use the eachrow and eachcol functions instead.
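A short sketch of how the two iterators are typically used. The column names and the derived variables are illustrative only.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Each element produced by eachrow is a DataFrameRow.
row_sums = [row.a + row.b for row in eachrow(df)]

# Each element produced by eachcol is a column vector of df.
col_sums = [sum(col) for col in eachcol(df)]

# Integer indexing into the iterators.
second_row = eachrow(df)[2]
first_col = eachcol(df)[1]
```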

The RepeatedVector and StackedVector types are subtypes of AbstractVector and support its interface with the exception that they are read only. Note that they are not exported and should not be constructed directly; they appear as columns of a DataFrame returned by stack with view=true.

The ByRow type is a special type used for selection operations to signal that the wrapped function should be applied to each element (row) of the selection.

The AsTable type is a special type used for selection operations to signal that the columns selected by a wrapped selector should be passed as a NamedTuple to the function.
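The difference between a plain function, a ByRow wrapper and an AsTable selector can be sketched with select. This assumes a DataFrames.jl version in which select accepts source => function => destination pairs; the column name :total is chosen for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# Plain function: receives the whole columns as vectors.
whole = select(df, [:a, :b] => ((a, b) -> a .+ b) => :total)

# ByRow: the wrapped function receives one element of each column at a time.
byrow = select(df, [:a, :b] => ByRow((a, b) -> a + b) => :total)

# AsTable: the selected columns are passed as a single NamedTuple of vectors.
astable = select(df, AsTable([:a, :b]) => (nt -> nt.a .+ nt.b) => :total)
```

All three calls compute the same column; they differ only in what the function is handed.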

The design of handling of columns of a DataFrame

When a DataFrame is constructed, columns are copied by default. You can disable this behavior by setting the copycols keyword argument to false or by using the DataFrame! function. The exception is an AbstractRange passed as a column, which is always collected to a Vector.

Functions that transform a DataFrame to produce a new DataFrame also copy the columns, unless they are passed copycols=false (available only for functions that could perform a transformation without copying the columns). Examples of such functions are vcat, hcat, filter, dropmissing, getindex, copy and the DataFrame constructor mentioned above.
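The copying rules can be checked with ===, which tests whether two names refer to the same vector. The following is a small sketch with made-up variable names.

```julia
using DataFrames

x = [1, 2, 3]

copied = DataFrame(x = x)                    # default: the column is a copy of x
shared = DataFrame(x = x, copycols = false)  # the column is x itself

# An AbstractRange is collected to a Vector even with copycols=false.
r = DataFrame(x = 1:3, copycols = false)

# getindex with : as the row selector produces a new DataFrame with copied columns.
sub = copied[:, :]
```

Here copied.x === x is false while shared.x === x is true, so mutating x afterwards affects only shared.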

In contrast, functions that create a view of a DataFrame do not, by definition, make copies of the columns, and therefore require particular caution. This includes view, which returns a SubDataFrame or a DataFrameRow, and groupby, which returns a GroupedDataFrame.
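As a sketch of why caution is needed, the group obtained from a GroupedDataFrame below is a SubDataFrame whose parent is the original data frame, so changes to the parent show through. The column names g and v are illustrative; only a non-grouping column is modified.

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1, 2], v = [10, 20, 30, 40])

gd = groupby(df, :g)
first_group = gd[1]          # SubDataFrame holding the rows with g == 1

df[1, :v] = 100              # modify the parent after grouping
```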

A partial exception to this rule is the stack function with view=true which creates a DataFrame that contains views of the columns from the source DataFrame.
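A minimal sketch of this behavior, assuming the stack(df, measure_vars, id_vars; view=true) form. The result is a DataFrame, but its columns are lazy views built from the source columns, so they follow later changes to the source.

```julia
using DataFrames

wide = DataFrame(id = [1, 2], x = [10, 20], y = [30, 40])

long = stack(wide, [:x, :y], :id, view = true)

wide.x[1] = 11   # visible through long.value, which does not own its data
```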

In-place functions whose names end with ! (like sort!, dropmissing!, setindex!, push! or append!) may mutate the column vectors of the DataFrame they take as an argument. These functions are safe to call due to the rules described above, except when a view of the DataFrame is in use (via a SubDataFrame, a DataFrameRow or a GroupedDataFrame). In the latter case, calling such a function on the parent might corrupt the view, which may trigger errors, silently return invalid data or even cause Julia to crash. The same caution applies when a DataFrame was created using columns of another DataFrame without copying (for instance when copycols=false in functions such as DataFrame or hcat).

It is possible to directly access a column col of a DataFrame df (e.g. this can be useful in performance-critical code to avoid copying), using one of the following methods:

  • via the getproperty function using the syntax df.col;
  • via the getindex function using the syntax df[!, :col] (note this is in contrast to df[:, :col] which copies);
  • by creating a DataFrameColumns object using the eachcol function;
  • by calling the parent function on a view of a column of the DataFrame, e.g. parent(@view df[:, :col]);
  • by storing the reference to the column before creating a DataFrame with copycols=false.
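The access methods listed above can be compared with ===. This is a sketch with an illustrative data frame; each expression on the left is expected to be the stored column itself rather than a copy, except for df[:, :col].

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])

via_property = df.col
via_bang = df[!, :col]
via_colon = df[:, :col]                     # this one is a copy
via_eachcol = eachcol(df)[1]
via_parent = parent(@view df[:, :col])

# Keeping a reference before constructing with copycols=false.
v = [4, 5, 6]
df2 = DataFrame(col = v, copycols = false)
```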

A column obtained from a DataFrame using one of the above methods should not be mutated without caution because:

  • resizing a column vector will corrupt its parent DataFrame and any associated views, because methods only check the length of a column when it is added to the DataFrame and later assume that all columns have the same length;
  • reordering values in a column vector (e.g. using sort!) will break the consistency of rows with other columns, which will also affect views (if any);
  • changing values contained in a column vector is acceptable as long as it is not used as a grouping column in a GroupedDataFrame created based on the DataFrame.
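One way to stay within these rules is to change individual values through the column reference but to resize or reorder through functions that act on the whole DataFrame. The sketch below assumes no views of df are alive; the names are illustrative.

```julia
using DataFrames

df = DataFrame(id = [3, 1, 2], name = ["c", "a", "b"])

ids = df.id
ids[1] = 30                       # changing a value in place is acceptable

sort!(df, :id)                    # reorders every column consistently
push!(df, (id = 4, name = "d"))   # grows every column by one element
```

Calling sort!(ids) or push!(ids, 4) directly would instead change only one column and break the invariants described above.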

Types specification

DataFrames.AbstractDataFrame - Type
AbstractDataFrame

An abstract type for which all concrete types expose an interface for working with tabular data.

Common methods

An AbstractDataFrame is a two-dimensional table with Symbols or strings for column names.

The following are normally implemented for AbstractDataFrames:

  • describe : summarize columns
  • summary : show number of rows and columns
  • hcat : horizontal concatenation
  • vcat : vertical concatenation
  • repeat : repeat rows
  • names : column names
  • rename! : rename columns based on keyword arguments
  • length : number of columns
  • size : (nrows, ncols)
  • first : first n rows
  • last : last n rows
  • convert : convert to an array
  • completecases : boolean vector of complete cases (rows with no missings)
  • dropmissing : remove rows with missing values
  • dropmissing! : remove rows with missing values in-place
  • nonunique : indexes of duplicate rows
  • unique : remove duplicate rows
  • unique! : remove duplicate rows in-place
  • disallowmissing : drop support for missing values in columns
  • disallowmissing! : drop support for missing values in columns in-place
  • allowmissing : add support for missing values in columns
  • allowmissing! : add support for missing values in columns in-place
  • categorical : change column types to categorical
  • categorical! : change column types to categorical in-place
  • similar : a DataFrame with columns similar to those of the source data frame
  • filter : remove rows
  • filter! : remove rows in-place

Indexing and broadcasting

AbstractDataFrame can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame using getproperty and setproperty! functions. Columns can be selected using integers, Symbols, or strings. In broadcasting AbstractDataFrame behavior is similar to a Matrix.

A detailed description of getindex, setindex!, getproperty, setproperty!, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.
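A few of the indexing forms mentioned above, as a brief sketch. The data frame is illustrative, and string column selectors assume a version in which strings are accepted as column names.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

cell = df[2, :a]          # single row, single column
block = df[1:2, [:a]]     # rows 1:2 of column a, returned as a DataFrame
by_string = df[:, "b"]    # column selected by a string
by_integer = df[!, 1]     # column selected by position

doubled = df .* 2         # broadcasting works elementwise, as for a Matrix

df.c = df.a .+ df.b       # setproperty! adds a new column
```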

DataFrames.AsTable - Type
AsTable(cols)

A type used for selection operations to signal that the columns selected by the wrapped selector should be passed as a NamedTuple to the function.

DataFrames.ByRow - Type
ByRow

A type used for selection operations to signal that the wrapped function should be applied to each element (row) of the selection.

Note that ByRow always collects values returned by fun in a vector. Therefore, to allow for future extensions, returning NamedTuple or DataFrameRow from fun is currently disallowed.
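As a small sketch of the collecting behavior, a ByRow object can be used inside a transformation or, for illustration only, called directly on vectors. The column name :big is made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# Inside a transformation the wrapped function sees one value per row.
flagged = transform(df, :a => ByRow(x -> x > 1) => :big)

# Called directly, ByRow maps the wrapped function over the vectors
# and collects the results in a new vector.
roots = ByRow(sqrt)([1.0, 4.0, 9.0])
sums = ByRow(+)([1, 2], [10, 20])
```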

DataFrames.DataFrame - Type
DataFrame <: AbstractDataFrame

An AbstractDataFrame that stores a set of named columns.

The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.

Constructors

DataFrame(columns::AbstractVector, names::AbstractVector{Symbol};
          makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::AbstractVector, names::AbstractVector{<:AbstractString};
          makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::NTuple{N,AbstractVector}, names::NTuple{N,Symbol};
          makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::NTuple{N,AbstractVector}, names::NTuple{N,<:AbstractString};
          makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::Matrix, names::AbstractVector{Symbol}; makeunique::Bool=false)
DataFrame(columns::Matrix, names::AbstractVector{<:AbstractString};
          makeunique::Bool=false)
DataFrame(kwargs...)
DataFrame(pairs::Pair{Symbol,<:Any}...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::Pair{<:AbstractString,<:Any}...; makeunique::Bool=false,
          copycols::Bool=true)
DataFrame() # an empty DataFrame
DataFrame(column_eltypes::AbstractVector, names::AbstractVector{Symbol},
          nrows::Integer=0; makeunique::Bool=false)
DataFrame(column_eltypes::AbstractVector, names::AbstractVector{<:AbstractString},
          nrows::Integer=0; makeunique::Bool=false)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(table; makeunique::Bool=false, copycols::Bool=true)
DataFrame(::Union{DataFrame, SubDataFrame}; copycols::Bool=true)
DataFrame(::GroupedDataFrame; keepkeys::Bool=true)

Arguments

  • columns : a Vector with each column as contents or a Matrix
  • names : the column names
  • makeunique : if false (the default), an error will be raised if duplicates in names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • kwargs : the key gives the column names, and the value is the column contents; note that the copycols keyword argument indicates if vectors passed as columns should be copied so it is not possible to create a column whose name is :copycols using this constructor
  • t : elemental type of all columns
  • nrows, ncols : number of rows and columns
  • column_eltypes : element type of each column
  • categorical : a vector of Bool indicating which columns should be converted to CategoricalVector
  • ds : AbstractDict of columns
  • table : any type that implements the Tables.jl interface; in particular a tuple or vector of Pair{Symbol, <:AbstractVector} objects is a table.
  • copycols : whether vectors passed as columns should be copied; if set to false then the constructor will still copy the passed columns if it is not possible to construct a DataFrame without materializing new columns.

All columns in columns must be AbstractVectors and have the same length. The exceptions are the DataFrame(kwargs...) and DataFrame(pairs::Pair...) constructors, which additionally allow a column to be of any other type that is not an AbstractArray, in which case the passed value is automatically repeated to fill a new vector of the appropriate length. As a special rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way.
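The repetition rule can be sketched as follows. The column names are illustrative; the Ref wrapper is what allows a vector to be treated as a single value to repeat rather than as a column.

```julia
using DataFrames

# "x" is not an AbstractArray, so it is repeated to match the length of id.
# Ref([1, 2]) is unwrapped, and the vector [1, 2] is repeated as one value.
df = DataFrame(id = 1:3, label = "x", wrapped = Ref([1, 2]))

# The same rule applies to the Pair-based constructor.
df_pairs = DataFrame(:id => 1:3, :label => "x")
```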

Additionally DataFrame can be used to collect a GroupedDataFrame into a DataFrame. In this case the order of rows in the result follows the order of groups in the GroupedDataFrame passed.
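A minimal sketch of collecting a GroupedDataFrame. Passing sort=true to groupby is used here only to make the group order explicit; the column names are illustrative.

```julia
using DataFrames

df = DataFrame(g = [2, 1, 2, 1], v = 1:4)

gd = groupby(df, :g, sort = true)   # groups ordered as g == 1, then g == 2
collected = DataFrame(gd)           # rows follow the order of groups in gd
```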

Notes

The DataFrame constructor by default copies all column vectors passed to it. Pass copycols=false to reuse vectors without copying them.

If a column is passed to a DataFrame constructor with copycols=false, or is assigned as a whole using setindex! (e.g. df[!, :col] = v), then its reference is stored in the DataFrame. An exception to this rule is assignment of an AbstractRange as a column, in which case the range is collected to a Vector.

Because column types can vary, a DataFrame is not type stable. For performance-critical code, do not index into a DataFrame inside of loops.

Examples

df = DataFrame()
v = ["x","y","z"][rand(1:3, 10)]
df1 = DataFrame(Any[collect(1:10), v, rand(10)], [:A, :B, :C])
df2 = DataFrame(A = 1:10, B = v, C = rand(10))
summary(df1)
describe(df2)
first(df1, 10)
df1.B
df2[!, :C]
df1[:, :A]
df1[1:4, 1:2]
df1[Not(1:4), Not(1:2)]
df1[1:2, [:A,:C]]
df1[1:2, r"[AC]"]
df1[:, [:A,:C]]
df1[:, [1,3]]
df1[1:4, :]
df1[1:4, :C]
df1[1:4, :C] = 40. * df1[1:4, :C]
[df1; df2]  # vcat
hcat(df1, df2, makeunique=true)  # hcat; makeunique=true is needed because df1 and df2 share column names
size(df1)

DataFrames.DataFrameRow - Type
DataFrameRow{<:AbstractDataFrame,<:AbstractIndex}

A view of one row of an AbstractDataFrame.

A DataFrameRow is returned by getindex or view functions when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow function.

The DataFrameRow constructor can also be called directly:

DataFrameRow(parent::AbstractDataFrame, row::Integer, cols=:)

A DataFrameRow supports the iteration interface and can therefore be passed to functions that expect a collection as an argument. Its element type is always Any.

Indexing is one-dimensional like specifying a column of a DataFrame. You can also access the data in a DataFrameRow using the getproperty and setproperty! functions and convert it to a Tuple, NamedTuple, or Vector using the corresponding functions.

If the selection of columns in a parent data frame is passed as : (a colon) then DataFrameRow will always have all columns from the parent, even if they are added or removed after its creation.
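The effect of the colon selector can be sketched by adding a column to the parent after two rows have been created, one with : and one with an explicit list of columns. The names are illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

all_cols = df[1, :]           # created with a colon: follows the parent's columns
some_cols = df[1, [:a, :b]]   # created with an explicit selection: fixed columns

df.c = [5, 6]                 # add a column to the parent afterwards
```

Here all_cols gains the new column c, while some_cols keeps only a and b.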

Examples

julia> df = DataFrame(a = repeat([1, 2], outer=[2]),
                      b = repeat(["a", "b"], inner=[2]),
                      c = 1:4)
4×3 DataFrame
│ Row │ a     │ b      │ c     │
│     │ Int64 │ String │ Int64 │
├─────┼───────┼────────┼───────┤
│ 1   │ 1     │ a      │ 1     │
│ 2   │ 2     │ a      │ 2     │
│ 3   │ 1     │ b      │ 3     │
│ 4   │ 2     │ b      │ 4     │

julia> df[1, :]
DataFrameRow
│ Row │ a     │ b      │ c     │
│     │ Int64 │ String │ Int64 │
├─────┼───────┼────────┼───────┤
│ 1   │ 1     │ a      │ 1     │

julia> @view df[end, [:a]]
DataFrameRow
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 4   │ 2     │

julia> eachrow(df)[1]
DataFrameRow
│ Row │ a     │ b      │ c     │
│     │ Int64 │ String │ Int64 │
├─────┼───────┼────────┼───────┤
│ 1   │ 1     │ a      │ 1     │

julia> Tuple(df[1, :])
(1, "a", 1)

julia> NamedTuple(df[1, :])
(a = 1, b = "a", c = 1)

julia> Vector(df[1, :])
3-element Array{Any,1}:
 1
  "a"
 1

DataFrames.GroupedDataFrame - Type
GroupedDataFrame

The result of a groupby operation on an AbstractDataFrame; a view into the AbstractDataFrame grouped by rows.

Not meant to be constructed directly, see groupby.

DataFrames.GroupKey - Type
GroupKey{T<:GroupedDataFrame}

Key for one of the groups of a GroupedDataFrame. Contains the values of the corresponding grouping columns and behaves similarly to a NamedTuple, but using it to index its GroupedDataFrame is much more efficient than using the equivalent Tuple or NamedTuple.

Instances of this type are returned by keys(::GroupedDataFrame) and are not meant to be constructed directly.

See keys(::GroupedDataFrame) for more information.
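A short sketch of the three ways of selecting a group mentioned above. The data frame is illustrative.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a"], v = [1, 2, 3])
gd = groupby(df, :g)

k = keys(gd)[1]                     # GroupKey of the first group
key_value = k.g                     # grouping values are accessed like NamedTuple fields

by_key = gd[k]                      # indexing with the GroupKey
by_namedtuple = gd[(g = "a",)]      # equivalent NamedTuple
by_tuple = gd[("a",)]               # equivalent Tuple
```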

DataFrames.SubDataFrame - Type
SubDataFrame{<:AbstractDataFrame,<:AbstractIndex,<:AbstractVector{Int}} <: AbstractDataFrame

A view of an AbstractDataFrame. It is returned by a call to the view function on an AbstractDataFrame if collections of rows and columns are specified.

A SubDataFrame is an AbstractDataFrame, so expect that most DataFrame functions should work. Such methods include describe, summary, nrow, size, by, stack, and join.

If the selection of columns in a parent data frame is passed as : (a colon) then SubDataFrame will always have all columns from the parent, even if they are added or removed after its creation.

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
sdf1 = view(df, :, 2:3) # column subsetting
sdf2 = @view df[end:-1:1, [1,3]]  # row and column subsetting
sdf3 = groupby(df, :a)[1]  # indexing a GroupedDataFrame returns a SubDataFrame

DataFrames.DataFrameRows - Type
DataFrameRows{D<:AbstractDataFrame} <: AbstractVector{DataFrameRow{D,S}}

Iterator over rows of an AbstractDataFrame, with each row represented as a DataFrameRow.

A value of this type is returned by the eachrow function.

DataFrames.DataFrameColumns - Type
DataFrameColumns{<:AbstractDataFrame} <: AbstractVector{AbstractVector}

An AbstractVector that allows iteration over columns of an AbstractDataFrame. Indexing into DataFrameColumns objects using integer or symbol indices returns the corresponding column (without copying).

DataFrames.RepeatedVector - Type
RepeatedVector{T} <: AbstractVector{T}

An AbstractVector that is a view into another AbstractVector with repeated elements.

NOTE: Not exported.

Constructor

RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)

Arguments

  • parent : the AbstractVector that's repeated
  • inner : the number of times each element is repeated
  • outer : the number of times the whole vector is repeated after being expanded by inner

inner and outer have the same meaning as similarly named arguments to repeat.

Examples

RepeatedVector([1,2], 3, 1)   # [1,1,1,2,2,2]
RepeatedVector([1,2], 1, 3)   # [1,2,1,2,1,2]
RepeatedVector([1,2], 2, 2)   # [1,1,2,2,1,1,2,2]
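
The correspondence with repeat can be checked directly. This is only a sketch: RepeatedVector is not exported, so it is accessed through the module name, and the three-argument constructor is assumed to be the one listed above.

```julia
using DataFrames

rv = DataFrames.RepeatedVector([1, 2], 2, 2)

# inner is applied first, then outer, as with Base.repeat.
expected = repeat([1, 2], inner = 2, outer = 2)
```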

DataFrames.StackedVector - Type
StackedVector <: AbstractVector

An AbstractVector that is a linear, concatenated view into another set of AbstractVectors.

NOTE: Not exported.

Constructor

StackedVector(d::AbstractVector)

Arguments

  • d : a vector of one or more AbstractVectors to concatenate

Examples

StackedVector(Any[[1,2], [9,10], [11,12]])  # [1,2,9,10,11,12]