Types
DataFrames.AbstractDataFrame
DataFrames.AsTable
DataFrames.DataFrame
DataFrames.DataFrameColumns
DataFrames.DataFrameRow
DataFrames.DataFrameRows
DataFrames.GroupKey
DataFrames.GroupKeys
DataFrames.GroupedDataFrame
DataFrames.RepeatedVector
DataFrames.StackedVector
DataFrames.SubDataFrame
Type hierarchy design
AbstractDataFrame
is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.
DataFrame
is the most fundamental subtype of AbstractDataFrame
, which stores a set of columns as AbstractVector
objects. Indexing of all stored columns must be 1-based. Also, all functions exposed by the DataFrames.jl API make sure to collect
passed AbstractRange
source columns before storing them in a DataFrame
.
SubDataFrame
is an AbstractDataFrame
subtype representing a view into a DataFrame
. It stores only a reference to the parent DataFrame
and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view
function or is returned by indexing into a GroupedDataFrame
object.
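As a minimal sketch of the description above (the data frame contents and variable names are illustrative assumptions), a SubDataFrame created with view shares data with its parent:

```julia
using DataFrames

df = DataFrame(a=1:4, b=["w", "x", "y", "z"])

# A view over rows 2:3 and column :b; no column data is copied.
sdf = view(df, 2:3, [:b])

# The view keeps a reference to the parent data frame.
same_parent = parent(sdf) === df

# Writing through the view changes the parent (row 1 of the view is row 2 of df).
sdf[1, :b] = "changed"
parent_value = df[2, :b]
```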
GroupedDataFrame
is a type that stores the result of a grouping operation performed on an AbstractDataFrame
. It is intended to be created as a result of a call to the groupby
function.
DataFrameRow
is a view into a single row of an AbstractDataFrame
. It stores only a reference to a parent DataFrame
and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow
type supports iteration over columns of the row and is similar in functionality to the NamedTuple
type, but allows for modification of data stored in the parent DataFrame
and reflects changes done to the parent after the creation of the view. Typically objects of the DataFrameRow
type are encountered when returned by the eachrow
function, or when accessing a single row of a DataFrame
or SubDataFrame
via getindex
or view
.
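The following sketch illustrates the view semantics of DataFrameRow described above; the column names and values are made up for the example:

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])

row = df[1, :]        # a DataFrameRow: a view of row 1, not a copy
row.a = 100           # writes through to the parent data frame
df[1, :b] = "z"       # a later change to the parent is visible through the row

# NamedTuple(row) takes an immutable snapshot of the current values.
nt = NamedTuple(row)
```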
The eachrow
function returns a value of the DataFrameRows
type, which serves as an iterator over rows of an AbstractDataFrame
, returning DataFrameRow
objects. The DataFrameRows
type is a subtype of AbstractVector
and supports its interface with the exception that it is read-only.
Similarly, the eachcol
function returns a value of the DataFrameColumns
type, which is not an AbstractVector
, but supports most of its API. The key differences are that it is read-only and that the keys
function returns a vector of Symbol
s (and not integers as for normal vectors).
Note that DataFrameRows
and DataFrameColumns
are not exported and should not be constructed directly; obtain them instead using the eachrow
and eachcol
functions.
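A short sketch of the differences between the two iterators described above, using an illustrative two-column data frame:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])

rows = eachrow(df)   # DataFrameRows
cols = eachcol(df)   # DataFrameColumns

rows_is_vector = rows isa AbstractVector   # DataFrameRows is an AbstractVector
cols_is_vector = cols isa AbstractVector   # DataFrameColumns is not
col_keys = keys(cols)                      # column names as Symbols

# Both objects can be iterated.
row_sums = [r.a + r.b for r in rows]
col_sums = [sum(c) for c in cols]
```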
The RepeatedVector
and StackedVector
types are subtypes of AbstractVector
and support its interface with the exception that they are read-only. Note that they are not exported and should not be constructed directly; instead, they appear as columns of a DataFrame
returned by stack
with view=true
.
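For illustration, the sketch below stacks a small, made-up data frame with view=true and inspects the resulting column types. Because the types are not exported, they are referenced with the DataFrames. prefix:

```julia
using DataFrames

df = DataFrame(id=[1, 2], b=[3, 4], c=[5, 6])

# With view=true the result holds lazy views instead of freshly allocated columns.
long = stack(df, [:b, :c]; view=true)

id_is_repeated = long.id isa DataFrames.RepeatedVector
value_is_stacked = long.value isa DataFrames.StackedVector

# collect materializes the views into ordinary vectors.
ids = collect(long.id)
vals = collect(long.value)
```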
The ByRow
type is a special type used for selection operations to signal that the wrapped function should be applied to each element (row) of the selection.
The AsTable
type is a special type used for selection operations to signal that the columns selected by a wrapped selector should be passed as a NamedTuple
to the function or to signal that it is requested to expand the return value of a transformation into multiple columns.
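A minimal sketch contrasting a whole-column transformation with the ByRow and AsTable wrappers; the column names and the functions applied are assumptions chosen for the example:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[10, 20, 30])

# Without ByRow the function receives the whole column.
whole = select(df, :a => (col -> col .- sum(col)) => :centered)

# With ByRow the function is applied to each element.
byrow = select(df, :a => ByRow(x -> x^2) => :a_sq)

# AsTable in source position passes each row of the selected columns as a NamedTuple.
astable = select(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :total)
```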
The design of handling of columns of a DataFrame
When a DataFrame
is constructed, columns are copied by default. You can disable this behavior by setting the copycols
keyword argument to false
. The exception is when an AbstractRange
is passed as a column: it is always collected to a Vector
.
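The copying rules above can be checked with identity comparisons, as in this small sketch (the vector v is an illustrative assumption):

```julia
using DataFrames

v = [1, 2, 3]

df_copy = DataFrame(a=v)                      # default: the column is a copy of v
df_alias = DataFrame(a=v, copycols=false)     # the column is v itself
df_range = DataFrame(a=1:3, copycols=false)   # a range is collected regardless

is_copy = df_copy.a !== v
is_alias = df_alias.a === v
range_collected = df_range.a isa Vector{Int}
```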
Also, functions that transform a DataFrame
to produce a new DataFrame
perform a copy of the columns, unless they are passed copycols=false
(available only for functions that could perform a transformation without copying the columns). Examples of such functions are vcat
, hcat
, filter
, dropmissing
, getindex
, copy
or the DataFrame
constructor mentioned above.
The generic single-argument constructor DataFrame(table)
has copycols=nothing
by default, meaning that columns are copied unless table
signals that a copy of columns doesn't need to be made (this is done by wrapping the source table in Tables.CopiedColumns
). CSV.jl does this when CSV.read(file, DataFrame)
is called, since columns are built only for the purpose of use in a DataFrame
constructor. Another example is Arrow.Table
, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized, which can be accomplished via DataFrame(arrow_table, copycols=true)
.
In contrast, functions that create a view of a DataFrame
by definition do not make copies of the columns, and therefore require particular caution. This includes view
, which returns a SubDataFrame
or a DataFrameRow
, and groupby
, which returns a GroupedDataFrame
.
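The sketch below shows that views share data with their parent. It only changes values in a non-grouping column, which is the kind of mutation this section treats as safe; the column names are illustrative:

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2], x=[10, 20, 30])

gdf = groupby(df, :g)   # GroupedDataFrame: a view into df
row = view(df, 3, :)    # DataFrameRow: a view of row 3

# Changing a value in the parent is visible through the grouped view.
df.x[1] = 99
first_group_x = collect(gdf[1].x)

# Changing a value through the row view is visible in the parent.
row.x = 0
```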
A partial exception to this rule is the stack
function with view=true
which creates a DataFrame
that contains views of the columns from the source DataFrame
.
In-place functions whose names end with !
(like sort!
, dropmissing!
, setindex!
, push!
, or append!
) may mutate the column vectors of the DataFrame
they take as an argument. These functions are safe to call due to the rules described above, except when a view of the DataFrame
is in use (via a SubDataFrame
, a DataFrameRow
or a GroupedDataFrame
). In the latter case, calling such a function on the parent might corrupt the view, which may trigger errors, silently return invalid data, or even cause Julia to crash. The same caution applies when a DataFrame
was created using columns of another DataFrame
without copying (for instance when copycols=false
in functions such as DataFrame
or hcat
).
It is possible to get direct access to a column col
of a DataFrame
df
(e.g. this can be useful in performance critical code to avoid copying), using one of the following methods:
- via the getproperty function using the syntax df.col;
- via the getindex function using the syntax df[!, :col] (note this is in contrast to df[:, :col], which copies);
- by creating a DataFrameColumns object using the eachcol function;
- by calling the parent function on a view of a column of the DataFrame, e.g. parent(@view df[:, :col]);
- by storing the reference to the column before creating a DataFrame with copycols=false.
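As a quick check of the access methods listed above, the following sketch compares object identity for a hypothetical column named col:

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])

c1 = df.col                       # getproperty: no copy
c2 = df[!, :col]                  # getindex with !: no copy
c3 = eachcol(df)[:col]            # through DataFrameColumns: no copy
c4 = parent(@view df[:, :col])    # parent of a column view: no copy
c5 = df[:, :col]                  # getindex with : copies

all_same = c1 === c2 === c3 === c4
copy_is_distinct = c5 !== c1
```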
A column obtained from a DataFrame
using one of the above methods should not be mutated without caution because:
- resizing a column vector will corrupt its parent DataFrame and any associated views, as methods only check the length of the column when it is added to the DataFrame and later assume that all columns have the same length;
- reordering values in a column vector (e.g. using sort!) will break the consistency of rows with other columns, which will also affect views (if any);
- changing values contained in a column vector is acceptable as long as it is not used as a grouping column in a GroupedDataFrame created based on the DataFrame.
Types specification
DataFrames.AbstractDataFrame
— Type
AbstractDataFrame
An abstract type for which all concrete types expose an interface for working with tabular data.
An AbstractDataFrame
is a two-dimensional table with Symbol
s or strings for column names.
DataFrames.jl defines two types that are subtypes of AbstractDataFrame
: DataFrame
and SubDataFrame
.
Indexing and broadcasting
AbstractDataFrame
can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame
using getproperty
and setproperty!
functions. Columns can be selected using integers, Symbol
s, or strings. In broadcasting, AbstractDataFrame
behaves similarly to a Matrix
.
A detailed description of getindex
, setindex!
, getproperty
, setproperty!
, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.
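A brief sketch of the access patterns mentioned above, using an illustrative data frame:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])

cell = df[2, :b]            # row and column selector: a single value
col_by_name = df[:, "a"]    # a string column selector; : copies the column
sub = df[1:2, [:a]]         # a multi-column selector returns a data frame

df.c = df.a .+ df.b         # setproperty! adds a new column

# Broadcasting treats the data frame like a matrix and returns a new DataFrame.
shifted = df .+ 1
```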
DataFrames.AsTable
— Type
AsTable(cols)
A type having a special meaning in source => transformation => destination
selection operations supported by combine
, select
, select!
, transform
, transform!
, subset
, and subset!
.
If AsTable(cols)
is used in source
position it signals that the columns selected by the wrapped selector cols
should be passed as a NamedTuple
to the function.
If AsTable
is used in destination
position it means that the result of the transformation
operation is a vector of containers (or a single container if ByRow(transformation)
is used) that should be expanded into multiple columns using keys
to get column names.
Examples
julia> df1 = DataFrame(a=1:3, b=11:13)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
julia> df2 = select(df1, AsTable([:a, :b]) => ByRow(identity))
3×1 DataFrame
Row │ a_b_identity
│ NamedTuple…
─────┼─────────────────
1 │ (a = 1, b = 11)
2 │ (a = 2, b = 12)
3 │ (a = 3, b = 13)
julia> select(df2, :a_b_identity => AsTable)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
3 │ 3 13
julia> select(df1, AsTable([:a, :b]) => ByRow(nt -> map(x -> x^2, nt)) => AsTable)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 121
2 │ 4 144
3 │ 9 169
DataFrames.DataFrame
— Type
DataFrame <: AbstractDataFrame
An AbstractDataFrame
that stores a set of named columns.
The columns are normally AbstractVector
s stored in memory, particularly a Vector
, PooledVector
or CategoricalVector
.
Constructors
DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(; kwargs..., copycols::Bool=true)
DataFrame(table; copycols::Union{Bool, Nothing}=nothing)
DataFrame(table, names::AbstractVector;
makeunique::Bool=false, copycols::Union{Bool, Nothing}=nothing)
DataFrame(columns::AbstractVecOrMat, names::AbstractVector;
makeunique::Bool=false, copycols::Bool=true)
DataFrame(::DataFrameRow; copycols::Bool=true)
DataFrame(::GroupedDataFrame; copycols::Bool=true, keepkeys::Bool=true)
Keyword arguments
- copycols : whether vectors passed as columns should be copied; by default set to true and the vectors are copied; if set to false then the constructor will still copy the passed columns if it is not possible to construct a DataFrame without materializing new columns. Note the copycols=nothing default in the Tables.jl compatible constructor; it is provided as certain input table types may have already made a copy of columns or the columns may otherwise be immutable, in which case columns are not copied by default. To force a copy in such cases, or to get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly.
- makeunique : if false (the default), an error will be raised if duplicate column names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

(note that not all constructors support these keyword arguments)
Details on behavior of different constructors
It is allowed to pass a vector of Pair
s, a list of Pair
s as positional arguments, or a list of keyword arguments. In this case, each pair is considered to represent a column name to column value mapping, and the column name must be a Symbol
or string. Alternatively, a dictionary can be passed to the constructor, in which case its entries are considered to define the column name and column value pairs. If the dictionary is a Dict
then column names will be sorted in the returned DataFrame
.
In all the constructors described above, a column value can be a vector, which is consumed as is, or an object of any other type (except AbstractArray
). In the latter case the passed value is automatically repeated to fill a new vector of the appropriate length. As a particular rule values stored in a Ref
or a 0
-dimensional AbstractArray
are unwrapped and treated in the same way.
It is also allowed to pass a vector of vectors or a matrix as the first argument. In this case the second argument must be a vector of Symbol
s or strings specifying column names, or the symbol :auto
to generate column names x1
, x2
, ... automatically. Note that in this case if the first argument is a matrix and copycols=false
the columns of the created DataFrame
will be views of columns of the source matrix.
If a single positional argument is passed to a DataFrame
constructor, then it is assumed to be of a type that implements the Tables.jl interface, through which the returned DataFrame
is materialized.
If two positional arguments are passed, where the second argument is an AbstractVector
, then the first argument is taken to be a table as described in the previous paragraph, and column names of the resulting data frame are taken from the vector passed as the second positional argument.
Finally it is allowed to construct a DataFrame
from a DataFrameRow
or a GroupedDataFrame
. In the latter case the keepkeys
keyword argument specifies whether the resulting DataFrame
should contain the grouping columns of the passed GroupedDataFrame
and the order of rows in the result follows the order of groups in the GroupedDataFrame
passed.
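The last two constructors can be sketched as follows; the grouping column g and value column x are assumptions made for the example:

```julia
using DataFrames

df = DataFrame(g=[1, 2, 1, 2], x=1:4)
gdf = groupby(df, :g)

# Rows follow the order of groups; grouping columns are kept by default.
with_keys = DataFrame(gdf)

# keepkeys=false drops the grouping columns from the result.
without_keys = DataFrame(gdf; keepkeys=false)

# A DataFrameRow is turned into a one-row DataFrame.
from_row = DataFrame(df[1, :])
```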
Notes
The DataFrame
constructor by default copies all column vectors passed to it. Pass the copycols=false
keyword argument (where supported) to reuse vectors without copying them.
By default an error will be raised if duplicates in column names are found. Pass makeunique=true
keyword argument (where supported) to accept duplicate names, in which case they will be suffixed with _i
(i
starting at 1 for the first duplicate).
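A small sketch of duplicate-name handling in the vector of vectors constructor, with made-up data:

```julia
using DataFrames

# With makeunique=true the second :a column is renamed instead of raising an error.
dup = DataFrame([[1, 2], [3, 4]], [:a, :a]; makeunique=true)

# Without makeunique=true duplicate names raise an error.
threw = try
    DataFrame([[1, 2], [3, 4]], [:a, :a])
    false
catch
    true
end
```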
If an AbstractRange
is passed to a DataFrame
constructor as a column it is always collected to a Vector
(even if copycols=false
). As a general rule AbstractRange
values are always materialized to a Vector
by all functions in DataFrames.jl before being stored in a DataFrame
.
DataFrame
can store only columns that use 1-based indexing. Attempting to store a vector using non-standard indexing raises an error.
The DataFrame
type is designed to allow column types to vary and to be dynamically changed even after it is constructed. Therefore DataFrame
s are not type stable. For performance-critical code that requires type-stability either use the functionality provided by select
/transform
/combine
functions, use Tables.columntable
and Tables.namedtupleiterator
functions, use barrier functions, or provide type assertions to the variables that hold columns extracted from a DataFrame
.
Metadata: this function preserves all table and column-level metadata. As a special case if a GroupedDataFrame
is passed then only :note
-style metadata from parent of the GroupedDataFrame
is preserved.
Examples
julia> DataFrame((a=[1, 2], b=[3, 4])) # Tables.jl table constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> DataFrame([(a=1, b=0), (a=2, b=0)]) # Tables.jl table constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame("a" => 1:2, "b" => 0) # Pair constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([:a => 1:2, :b => 0]) # vector of Pairs constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame(Dict(:a => 1:2, :b => 0)) # dictionary constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame(a=1:2, b=0) # keyword argument constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([[1, 2], [0, 0]], [:a, :b]) # vector of vectors constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([1 0; 2 0], :auto) # matrix constructor
2×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
DataFrames.DataFrameRow
— Type
DataFrameRow{<:AbstractDataFrame, <:AbstractIndex}
A view of one row of an AbstractDataFrame
.
A DataFrameRow
is returned by getindex
or view
functions when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow
function.
The DataFrameRow
constructor can also be called directly:
DataFrameRow(parent::AbstractDataFrame, row::Integer, cols=:)
A DataFrameRow
supports the iteration interface and can therefore be passed to functions that expect a collection as an argument. Its element type is always Any
.
Indexing is one-dimensional like specifying a column of a DataFrame
. You can also access the data in a DataFrameRow
using the getproperty
and setproperty!
functions and convert it to a Tuple
, NamedTuple
, or Vector
using the corresponding functions.
If the selection of columns in a parent data frame is passed as :
(a colon) then DataFrameRow
will always have all columns from the parent, even if they are added or removed after its creation.
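The effect of the column selector can be sketched by comparing a row created with : against one created with an explicit list of columns (the names here are illustrative):

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

row_all = df[1, :]           # columns selected with a colon
row_some = df[1, [:a, :b]]   # columns selected explicitly

# Add a column to the parent after both rows were created.
df.c = [5, 6]

n_all = length(row_all)      # follows the parent and now includes :c
n_some = length(row_some)    # keeps the original selection
```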
Examples
julia> df = DataFrame(a=repeat([1, 2], outer=[2]),
b=repeat(["a", "b"], inner=[2]),
c=1:4)
4×3 DataFrame
Row │ a b c
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 a 1
2 │ 2 a 2
3 │ 1 b 3
4 │ 2 b 4
julia> df[1, :]
DataFrameRow
Row │ a b c
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 a 1
julia> @view df[end, [:a]]
DataFrameRow
Row │ a
│ Int64
─────┼───────
4 │ 2
julia> eachrow(df)[1]
DataFrameRow
Row │ a b c
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 a 1
julia> Tuple(df[1, :])
(1, "a", 1)
julia> NamedTuple(df[1, :])
(a = 1, b = "a", c = 1)
julia> Vector(df[1, :])
3-element Vector{Any}:
1
"a"
1
DataFrames.GroupedDataFrame
— Type
GroupedDataFrame
The result of a groupby
operation on an AbstractDataFrame
; a view into the AbstractDataFrame
grouped by rows.
Not meant to be constructed directly, see groupby
.
One can get the names of columns used to create GroupedDataFrame
using the groupcols
function. Similarly the groupindices
function returns a vector of group indices for each row of the parent data frame.
After its creation, a GroupedDataFrame
reflects the grouping of rows that was valid at its creation time. Therefore grouping columns of its parent data frame must not be mutated, and rows must not be added nor removed from it. To safeguard the user against such cases, if the number of rows in the parent data frame changes then trying to use GroupedDataFrame
will throw an error. However, one can add columns to or remove columns from the parent data frame without invalidating the GroupedDataFrame
provided that columns used for grouping are not changed.
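A minimal sketch of groupcols and groupindices, and of adding a non-grouping column to the parent after grouping; the data are made up:

```julia
using DataFrames

df = DataFrame(g=["x", "y", "x", "y"], v=1:4)
gdf = groupby(df, :g)

cols = groupcols(gdf)      # names of the grouping columns
idx = groupindices(gdf)    # group index for each row of the parent

# Adding a column that is not used for grouping keeps gdf valid.
df.w = [10, 20, 30, 40]
first_group_w = collect(gdf[1].w)
```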
DataFrames.GroupKey
— Type
GroupKey{T<:GroupedDataFrame}
Key for one of the groups of a GroupedDataFrame
. Contains the values of the corresponding grouping columns and behaves similarly to a NamedTuple
, but using it to index its GroupedDataFrame
is more efficient than using the equivalent Tuple
and NamedTuple
, and much more efficient than using the equivalent AbstractDict
.
Instances of this type are returned by keys(::GroupedDataFrame)
and are not meant to be constructed directly.
Indexing fields of GroupKey
is allowed using an integer, a Symbol
, or a string. It is also possible to access the data in a GroupKey
using the getproperty
function. A GroupKey
can be converted to a Tuple
, NamedTuple
, a Vector
, or a Dict
. When converted to a Dict
, the keys of the Dict
are Symbol
s.
See keys(::GroupedDataFrame)
for more information.
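The sketch below obtains a GroupKey from keys and uses it, along with the equivalent Tuple and NamedTuple, to index a GroupedDataFrame. The column names are assumptions made for the example:

```julia
using DataFrames

df = DataFrame(g=["x", "y", "x"], v=[1, 2, 3])
gdf = groupby(df, :g)

k = keys(gdf)[1]   # the GroupKey of the first group

# Fields can be read by property, integer, Symbol, or string.
fields = (k.g, k[1], k[:g], k["g"])

as_tuple = Tuple(k)
as_nt = NamedTuple(k)

# All three forms select the same group; the GroupKey is the most efficient.
v_by_key = collect(gdf[k].v)
v_by_tuple = collect(gdf[("x",)].v)
v_by_nt = collect(gdf[(g="x",)].v)
```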
DataFrames.GroupKeys
— Type
GroupKeys{T<:GroupedDataFrame} <: AbstractVector{GroupKey{T}}
A vector containing all GroupKey
objects for a given GroupedDataFrame
.
See keys(::GroupedDataFrame)
for more information.
DataFrames.SubDataFrame
— Type
SubDataFrame{<:AbstractDataFrame, <:AbstractIndex, <:AbstractVector{Int}} <: AbstractDataFrame
A view of an AbstractDataFrame
. It is returned by a call to the view
function on an AbstractDataFrame
if collections of rows and columns are specified.
A SubDataFrame
is an AbstractDataFrame
, so most DataFrame functions should work with it. Such functions include describe
, summary
, nrow
, size
, combine
, stack
, and innerjoin
.
If the selection of columns in a parent data frame is passed as :
(a colon) then SubDataFrame
will always have all columns from the parent, even if they are added or removed after its creation.
Examples
julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
b=repeat([2, 1], outer=[4]),
c=1:8)
8×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 2 1 2
3 │ 3 2 3
4 │ 4 1 4
5 │ 1 2 5
6 │ 2 1 6
7 │ 3 2 7
8 │ 4 1 8
julia> sdf1 = view(df, :, 2:3) # column subsetting
8×2 SubDataFrame
Row │ b c
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
3 │ 2 3
4 │ 1 4
5 │ 2 5
6 │ 1 6
7 │ 2 7
8 │ 1 8
julia> sdf2 = @view df[end:-1:1, [1, 3]] # row and column subsetting
8×2 SubDataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 4 8
2 │ 3 7
3 │ 2 6
4 │ 1 5
5 │ 4 4
6 │ 3 3
7 │ 2 2
8 │ 1 1
julia> sdf3 = groupby(df, :a)[1] # indexing a GroupedDataFrame returns a SubDataFrame
2×3 SubDataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 1 2 5
DataFrames.DataFrameRows
— Type
DataFrameRows{D<:AbstractDataFrame} <: AbstractVector{DataFrameRow}
Iterator over rows of an AbstractDataFrame
, with each row represented as a DataFrameRow
.
A value of this type is returned by the eachrow
function.
DataFrames.DataFrameColumns
— Type
DataFrameColumns{<:AbstractDataFrame}
A vector-like object that allows iteration over columns of an AbstractDataFrame
.
Indexing into DataFrameColumns
objects using integer, Symbol
or string returns the corresponding column (without copying). Indexing into DataFrameColumns
objects using a multiple column selector returns a subsetted DataFrameColumns
object with a new parent containing only the selected columns (without copying).
DataFrameColumns
supports most of the AbstractVector
API. The key differences are that it is read-only and that the keys
function returns a vector of Symbol
s (and not integers as for normal vectors).
In particular findnext
, findprev
, findfirst
, findlast
, and findall
functions are supported, and in findnext
and findprev
functions it is allowed to pass an integer, string, or Symbol
as a reference index.
DataFrames.RepeatedVector
— Type
RepeatedVector{T} <: AbstractVector{T}
An AbstractVector that is a view into another AbstractVector with repeated elements.
NOTE: Not exported.
Constructor
RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)
Arguments
- parent : the AbstractVector that's repeated
- inner : the number of times each element is repeated
- outer : the number of times the whole vector is repeated after being expanded by inner
inner
and outer
have the same meaning as similarly named arguments to repeat
.
Examples
RepeatedVector([1, 2], 3, 1) # [1, 1, 1, 2, 2, 2]
RepeatedVector([1, 2], 1, 3) # [1, 2, 1, 2, 1, 2]
RepeatedVector([1, 2], 2, 2) # [1, 1, 2, 2, 1, 1, 2, 2]
DataFrames.StackedVector
— Type
StackedVector <: AbstractVector
An AbstractVector
that is a linear, concatenated view into another set of AbstractVectors.
NOTE: Not exported.
Constructor
StackedVector(d::AbstractVector)
Arguments
- d : an AbstractVector whose elements are the one or more AbstractVectors to concatenate
Examples
StackedVector(Any[[1, 2], [9, 10], [11, 12]]) # [1, 2, 9, 10, 11, 12]