Types
DataFrames.AbstractDataFrame
DataFrames.DataFrame
DataFrames.DataFrameColumns
DataFrames.DataFrameRow
DataFrames.DataFrameRows
DataFrames.GroupKey
DataFrames.GroupedDataFrame
DataFrames.RepeatedVector
DataFrames.StackedVector
DataFrames.SubDataFrame
Type hierarchy design
AbstractDataFrame
is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.
DataFrame
is the most fundamental subtype of AbstractDataFrame
, which stores a set of columns as AbstractVector
objects.
SubDataFrame
is an AbstractDataFrame
subtype representing a view into a DataFrame
. It stores only a reference to the parent DataFrame
and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view
function or is returned by indexing into a GroupedDataFrame
object.
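The view semantics described above can be sketched as follows. This is a minimal illustration that assumes a reasonably recent DataFrames.jl release; the column names a and b are made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# A view of rows 2:3 and all columns; no column data is copied.
sdf = view(df, 2:3, :)

# Writing through the view updates the parent data frame.
sdf[1, :b] = 99

# parent returns the data frame that the view refers to.
same_parent = parent(sdf) === df
```

Because sdf only stores row and column indices, the assignment lands in the second row of df.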
GroupedDataFrame
is a type that stores the result of a grouping operation performed on an AbstractDataFrame
. It is intended to be created as a result of a call to the groupby
function.
DataFrameRow
is a view into a single row of an AbstractDataFrame
. It stores only a reference to a parent DataFrame
and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow
type supports iteration over columns of the row and is similar in functionality to the NamedTuple
type, but allows for modification of data stored in the parent DataFrame
and reflects changes made to the parent after the creation of the view. Typically objects of the DataFrameRow
type are encountered when returned by the eachrow
function, or when accessing a single row of a DataFrame
or SubDataFrame
via getindex
or view
.
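A short sketch of the difference between a DataFrameRow and a NamedTuple snapshot, assuming a reasonably recent DataFrames.jl release (the data is illustrative):

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

row = df[1, :]      # a DataFrameRow: a view into df, not a copy
row.a = 100         # writes into the parent data frame

snapshot = copy(row)   # a NamedTuple holding the current values

df[1, :b] = "z"     # change the parent after the view was created
```

After these steps row.b reflects the new value in the parent, while snapshot still holds the value that was present when copy was called.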
The eachrow
function returns a value of the DataFrameRows
type, which serves as an iterator over rows of an AbstractDataFrame
, returning DataFrameRow
objects.
Similarly, the eachcol
function returns a value of the DataFrameColumns
type, which serves as an iterator over columns of an AbstractDataFrame
. The return value can have two concrete types:
- If the eachcol function is called with the names argument set to true, then it returns a value of the DataFrameColumns{<:AbstractDataFrame, Pair{Symbol, AbstractVector}} type, which is an iterator returning a pair containing the column name and the column vector.
- If the eachcol function is called with the names argument set to false (the default), then it returns a value of the DataFrameColumns{<:AbstractDataFrame, AbstractVector} type, which is an iterator returning the column vector only.
The DataFrameRows
and DataFrameColumns
types are subtypes of AbstractVector
and support its interface with the exception that they are read-only. Note that they are not exported and should not be constructed directly, but obtained using the eachrow
and eachcol
functions.
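As an illustration, the two iterators can be used as in the sketch below. It calls eachcol without the names argument, so each element is a column vector; the data is made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# eachrow yields one DataFrameRow per row of df.
row_sums = [row.a + row.b for row in eachrow(df)]

# eachcol yields the column vectors of df.
col_sums = [sum(col) for col in eachcol(df)]

# Both iterators also support integer indexing.
first_row = eachrow(df)[1]
first_col = eachcol(df)[1]
```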
The RepeatedVector
and StackedVector
types are subtypes of AbstractVector
and support its interface with the exception that they are read-only. Note that they are not exported and should not be constructed directly; instead, they appear as columns of a DataFrame
returned by stack
with view=true
.
The design of handling of columns of a DataFrame
When a DataFrame
is constructed, columns are copied by default. You can disable this behavior by setting the copycols
keyword argument to false
or by using the DataFrame!
function. The exception is if an AbstractRange
is passed as a column, then it is always collected to a Vector
.
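The copying rules can be checked with identity comparisons. The following sketch uses only the keyword-argument constructor and the copycols keyword described above; the variable names are illustrative.

```julia
using DataFrames

x = [1, 2, 3]

df_copy  = DataFrame(x = x)                     # default: the vector is copied
df_alias = DataFrame(x = x, copycols = false)   # the vector is reused as-is

copied  = df_copy.x === x     # false: df_copy owns a separate vector
aliased = df_alias.x === x    # true: df_alias shares storage with x

# A range is collected to a Vector even when copycols=false is passed.
df_range = DataFrame(r = 1:3, copycols = false)
```

With copycols=false, later in-place changes to x are visible in df_alias, so this option is best reserved for cases where the source vector is not reused elsewhere.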
Functions that transform a DataFrame
to produce a new DataFrame
also copy the columns, unless they are passed copycols=false
(available only for functions that could perform a transformation without copying the columns). Examples of such functions are vcat
, hcat
, filter
, dropmissing
, join
, getindex
, copy
or the DataFrame
constructor mentioned above.
In contrast, functions that create a view of a DataFrame
by definition do not make copies of the columns, and therefore require particular caution. This includes view
, which returns a SubDataFrame
or a DataFrameRow
, and groupby
, which returns a GroupedDataFrame
.
A partial exception to this rule is the stack
function with view=true
which creates a DataFrame
that contains views of the columns from the source DataFrame
.
In-place functions whose names end with !
(like sort!
, dropmissing!
, setindex!
, push!
or append!
) may mutate the column vectors of the DataFrame
they take as an argument. These functions are safe to call due to the rules described above, except when a view of the DataFrame
is in use (via a SubDataFrame
, a DataFrameRow
or a GroupedDataFrame
). In the latter case, calling such a function on the parent might corrupt the view, which may trigger errors, silently return invalid data or even cause Julia to crash. The same caution applies when a DataFrame
was created using columns of another DataFrame
without copying (for instance when copycols=false
in functions such as DataFrame
or hcat
).
It is possible to directly access a column col
of a DataFrame
df
(e.g. this can be useful in performance critical code to avoid copying), using one of the following methods:
- via the getproperty function using the syntax df.col;
- via the getindex function using the syntax df[!, :col] (note this is in contrast to df[:, :col], which copies);
- by creating a DataFrameColumns object using the eachcol function;
- by calling the parent function on a view of a column of the DataFrame, e.g. parent(@view df[:, :col]);
- by storing the reference to the column before creating a DataFrame with copycols=false.
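The access methods listed above can be compared with identity checks, as in this minimal sketch (the data frame and column name are illustrative):

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])

by_property = df.col                       # getproperty: no copy
by_bang     = df[!, :col]                  # getindex with !: no copy
by_colon    = df[:, :col]                  # getindex with : makes a copy
by_eachcol  = eachcol(df)[1]               # element of a DataFrameColumns object
by_parent   = parent(@view df[:, :col])    # parent of a column view

# Only the copy has a different identity from the stored column.
all_same = by_property === by_bang === by_eachcol === by_parent
is_copy  = by_colon !== by_property
```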
A column obtained from a DataFrame
using one of the above methods should not be mutated without caution because:
- resizing a column vector will corrupt its parent DataFrame and any associated views, as methods only check the length of the column when it is added to the DataFrame and later assume that all columns have the same length;
- reordering values in a column vector (e.g. using sort!) will break the consistency of rows with other columns, which will also affect views (if any);
- changing values contained in a column vector is acceptable as long as it is not used as a grouping column in a GroupedDataFrame created based on the DataFrame.
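A sketch of what this means in practice: changing a value through a direct column reference is fine, while growing the table should go through the data frame itself so that every column is extended together. The data is illustrative.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])

col = df.x      # direct reference to the stored vector
col[2] = 20     # changing a value in place keeps df consistent

# Do not resize col directly (e.g. push!(col, 4)): df.y would still
# have 3 elements and df would be corrupted.
# Instead, add a row through the data frame, which extends every column.
push!(df, (x = 4, y = "d"))
```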
Types specification
DataFrames.AbstractDataFrame — Type.
AbstractDataFrame
An abstract type for which all concrete types expose an interface for working with tabular data.
Common methods
An AbstractDataFrame is a two-dimensional table with Symbols for column names. An AbstractDataFrame is also similar to an AbstractDict type in that it allows indexing by a key (the columns).
The following are normally implemented for AbstractDataFrames:
- describe: summarize columns
- summary: show number of rows and columns
- hcat: horizontal concatenation
- vcat: vertical concatenation
- repeat: repeat rows
- names: column names
- rename!: rename columns based on keyword arguments
- length: number of columns
- size: (nrows, ncols)
- first: first n rows
- last: last n rows
- convert: convert to an array
- completecases: boolean vector of complete cases (rows with no missings)
- dropmissing: remove rows with missing values
- dropmissing!: remove rows with missing values in-place
- nonunique: indexes of duplicate rows
- unique!: remove duplicate rows
- disallowmissing: drop support for missing values in columns
- allowmissing: add support for missing values in columns
- categorical: change column types to categorical
- similar: a DataFrame with similar columns as d
- filter: remove rows
- filter!: remove rows in-place
Indexing and broadcasting
AbstractDataFrame
can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame
using the getproperty
and setproperty!
functions. In broadcasting, an AbstractDataFrame
behaves similarly to a Matrix
.
A detailed description of getindex
, setindex!
, getproperty
, setproperty!
, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.
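The following sketch illustrates two-index access, property access and broadcasting on a small numeric data frame. It assumes a reasonably recent DataFrames.jl release; the data is made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

cell = df[2, :b]         # row selector and column selector: a single value
sub  = df[1:2, [:a]]     # a new DataFrame with two rows and one column

df.a = [10, 20, 30]      # setproperty! replaces a whole column

# Broadcasting treats the data frame like a matrix and returns a DataFrame.
df_plus = df .+ 1
```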
DataFrames.DataFrame — Type.
DataFrame <: AbstractDataFrame
An AbstractDataFrame that stores a set of named columns.
The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.
Constructors
DataFrame(columns::Vector, names::Vector{Symbol};
makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::NTuple{N,AbstractVector}, names::NTuple{N,Symbol};
makeunique::Bool=false, copycols::Bool=true)
DataFrame(columns::Matrix, names::Vector{Symbol}; makeunique::Bool=false)
DataFrame(kwargs...)
DataFrame(pairs::NTuple{N, Pair{Symbol, AbstractVector}}; copycols::Bool=true)
DataFrame() # an empty DataFrame
DataFrame(column_eltypes::Vector, names::AbstractVector{Symbol}, nrows::Integer=0;
makeunique::Bool=false)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(table; makeunique::Bool=false, copycols::Bool=true)
DataFrame(::Union{DataFrame, SubDataFrame}; copycols::Bool=true)
DataFrame(::GroupedDataFrame)
Arguments
- columns: a Vector with each column as contents or a Matrix
- names: the column names
- makeunique: if false (the default), an error will be raised if duplicates in names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- kwargs: the key gives the column names, and the value is the column contents; note that the copycols keyword argument indicates if vectors passed as columns should be copied, so it is not possible to create a column whose name is :copycols using this constructor
- t: elemental type of all columns
- nrows, ncols: number of rows and columns
- column_eltypes: element type of each column
- categorical: a vector of Bool indicating which columns should be converted to CategoricalVector
- ds: AbstractDict of columns
- table: any type that implements the Tables.jl interface; in particular a tuple or vector of Pair{Symbol, <:AbstractVector} objects is a table.
- copycols: whether vectors passed as columns should be copied; if set to false then the constructor will still copy the passed columns if it is not possible to construct a DataFrame without materializing new columns.
All columns in columns
should have the same length.
Notes
The DataFrame
constructor by default copies all column vectors passed to it. Pass copycols=false
to reuse vectors without copying them.
If a column is passed to a DataFrame
constructor or is assigned as a whole using setindex!
then its reference is stored in the DataFrame
. An exception to this rule is assignment of an AbstractRange
as a column, in which case the range is collected to a Vector
.
Because column types can vary, a DataFrame
is not type stable. For performance-critical code, do not index into a DataFrame
inside of loops.
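One common way to follow this advice is a function barrier: extract the column once and pass it to a function, so that the loop is compiled for the concrete vector type. The helper sum_of_squares below is a hypothetical name used only for illustration.

```julia
using DataFrames

# Hypothetical helper: the loop runs on a concretely typed vector,
# so the compiler can specialize it.
function sum_of_squares(x::AbstractVector)
    s = zero(eltype(x))
    for v in x
        s += v^2
    end
    return s
end

df = DataFrame(x = [1.0, 2.0, 3.0])

# The column is looked up once, outside the loop.
result = sum_of_squares(df.x)
```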
Examples
df = DataFrame()
v = ["x","y","z"][rand(1:3, 10)]
df1 = DataFrame(Any[collect(1:10), v, rand(10)], [:A, :B, :C])
df2 = DataFrame(A = 1:10, B = v, C = rand(10))
summary(df1)
describe(df2)
first(df1, 10)
df1.B
df2[!, :C]
df1[:, :A]
df1[1:4, 1:2]
df1[Not(1:4), Not(1:2)]
df1[1:2, [:A,:C]]
df1[1:2, r"[AC]"]
df1[:, [:A,:C]]
df1[:, [1,3]]
df1[1:4, :]
df1[1:4, :C]
df1[1:4, :C] = 40. * df1[1:4, :C]
[df1; df2] # vcat
[df1 df2] # hcat
size(df1)
DataFrames.DataFrameRow — Type.
DataFrameRow{<:AbstractDataFrame,<:AbstractIndex}
A view of one row of an AbstractDataFrame
.
A DataFrameRow
is returned by getindex
or view
functions when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow
function.
The DataFrameRow
constructor can also be called directly:
DataFrameRow(parent::AbstractDataFrame, row::Integer, cols=:)
A DataFrameRow
supports the iteration interface and can therefore be passed to functions that expect a collection as an argument.
Indexing is one-dimensional like specifying a column of a DataFrame
. You can also access the data in a DataFrameRow
using the getproperty
and setproperty!
functions and convert it to a NamedTuple
using the copy
function.
It is possible to create a DataFrameRow
with duplicate columns. All such columns will have a reference to the same entry in the parent DataFrame
.
If the selection of columns in a parent data frame is passed as :
(a colon) then DataFrameRow
will always have all columns from the parent, even if they are added or removed after its creation.
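The effect of the colon selector can be sketched as follows, assuming a reasonably recent DataFrames.jl release. The row created with a colon follows later changes to the parent's columns, while the row created with an explicit column list keeps its original selection.

```julia
using DataFrames

df = DataFrame(a = [1, 2])

row_all = df[1, :]       # columns selected with a colon
row_a   = df[1, [:a]]    # columns selected explicitly

df.b = [3, 4]            # add a column to the parent afterwards

n_all = length(row_all)  # counts the newly added column as well
n_a   = length(row_a)    # still only the explicitly selected column
```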
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
sdf1 = view(df, 2, :)
sdf2 = @view df[end, [:a]]
sdf3 = eachrow(df)[1]
sdf4 = DataFrameRow(df, 2, 1:2)
sdf5 = DataFrameRow(df, 1)
DataFrames.GroupedDataFrame — Type.
GroupedDataFrame
The result of a groupby
operation on an AbstractDataFrame
; a view into the AbstractDataFrame
grouped by rows.
Not meant to be constructed directly, see groupby
.
DataFrames.GroupKey — Type.
GroupKey{T<:GroupedDataFrame}
Key for one of the groups of a GroupedDataFrame
. Contains the values of the corresponding grouping columns and behaves similarly to a NamedTuple
, but using it to index its GroupedDataFrame
is much more efficient than using the equivalent Tuple
or NamedTuple
.
Instances of this type are returned by keys(::GroupedDataFrame)
and are not meant to be constructed directly.
See keys(::GroupedDataFrame)
for more information.
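A minimal sketch of working with group keys, assuming a DataFrames.jl release that provides keys(::GroupedDataFrame); the column names g and v are illustrative:

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a"], v = [1, 2, 3])
gd = groupby(df, :g)

# Each key exposes the grouping values like a NamedTuple and can be
# used to index the GroupedDataFrame it came from.
group_sizes = Dict(key.g => nrow(gd[key]) for key in keys(gd))

# The equivalent Tuple also works as an index.
group_a = gd[("a",)]
```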
DataFrames.SubDataFrame — Type.
SubDataFrame{<:AbstractDataFrame,<:AbstractIndex,<:AbstractVector{Int}} <: AbstractDataFrame
A view of an AbstractDataFrame
. It is returned by a call to the view
function on an AbstractDataFrame
if collections of rows and columns are specified.
A SubDataFrame
is an AbstractDataFrame
, so most DataFrame functions should work with it. Such methods include describe
, summary
, nrow
, size
, by
, stack
, and join
.
Indexing is just like a DataFrame
except that it is possible to create a SubDataFrame
with duplicate columns. All such columns will have a reference to the same entry in the parent DataFrame
.
If the selection of columns in a parent data frame is passed as :
(a colon) then SubDataFrame
will always have all columns from the parent, even if they are added or removed after its creation.
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
sdf1 = view(df, :, 2:3) # column subsetting
sdf2 = @view df[end:-1:1, [1,3]] # row and column subsetting
sdf3 = groupby(df, :a)[1] # indexing a GroupedDataFrame returns a SubDataFrame
DataFrames.DataFrameRows — Type.
DataFrameRows{D<:AbstractDataFrame} <: AbstractVector{DataFrameRow{D,S}}
Iterator over rows of an AbstractDataFrame
, with each row represented as a DataFrameRow
.
A value of this type is returned by the eachrow
function.
DataFrames.DataFrameColumns — Type.
DataFrameColumns{<:AbstractDataFrame, V} <: AbstractVector{V}
Iterator over columns of an AbstractDataFrame
constructed using eachcol(df, true)
if V
is a Pair{Symbol,AbstractVector}
. Then each returned value is a pair consisting of column name and column vector. If V
is an AbstractVector
(a value returned by eachcol(df, false)
) then each returned value is a column vector.
DataFrames.RepeatedVector — Type.
RepeatedVector{T} <: AbstractVector{T}
An AbstractVector that is a view into another AbstractVector with repeated elements.
NOTE: Not exported.
Constructor
RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)
Arguments
- parent: the AbstractVector that's repeated
- inner: the number of times each element is repeated
- outer: the number of times the whole vector is repeated after being expanded by inner
inner
and outer
have the same meaning as similarly named arguments to repeat
.
Examples
RepeatedVector([1,2], 3, 1) # [1,1,1,2,2,2]
RepeatedVector([1,2], 1, 3) # [1,2,1,2,1,2]
RepeatedVector([1,2], 2, 2) # [1,1,2,2,1,1,2,2]
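Since inner and outer have the same meaning as in repeat, the expected contents can be cross-checked against Base's repeat function, which materializes the result instead of creating a view:

```julia
# inner repeats each element; outer then repeats the expanded vector.
only_inner = repeat([1, 2], inner = 3, outer = 1)
only_outer = repeat([1, 2], inner = 1, outer = 3)
both       = repeat([1, 2], inner = 2, outer = 2)
```

Here both is [1, 1, 2, 2, 1, 1, 2, 2]: each element is first doubled by inner, and the resulting vector is then repeated twice by outer.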
DataFrames.StackedVector — Type.
StackedVector <: AbstractVector{Any}
An AbstractVector{Any} that is a linear, concatenated view into another set of AbstractVectors.
NOTE: Not exported.
Constructor
StackedVector(d::AbstractVector...)
Arguments
- d...: one or more AbstractVectors
Examples
StackedVector(Any[[1,2], [9,10], [11,12]]) # [1,2,9,10,11,12]