Types
- DataFrames.AbstractDataFrame
- DataFrames.DataFrame
- DataFrames.DataFrameColumns
- DataFrames.DataFrameRow
- DataFrames.DataFrameRows
- DataFrames.GroupedDataFrame
- DataFrames.RepeatedVector
- DataFrames.StackedVector
- DataFrames.SubDataFrame
Type hierarchy design
AbstractDataFrame is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.
DataFrame is the most fundamental subtype of AbstractDataFrame, which stores a set of columns as AbstractVector objects.
SubDataFrame is an AbstractDataFrame subtype representing a view into a DataFrame. It stores only a reference to the parent DataFrame and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view function or is returned by indexing into a GroupedDataFrame object.
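As a minimal sketch of the view semantics described above (the data frame and column names are illustrative, and the exact indexing API varies between DataFrames.jl versions), a SubDataFrame created with view shares storage with its parent:

```julia
using DataFrames

# Illustrative data; the names `a` and `b` are not from the documentation.
df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# A view of rows 2 and 3 and all columns; no column data is copied.
sdf = view(df, 2:3, :)

# Writing through the view modifies the parent data frame.
sdf[1, :b] = 99

println(df[2, :b])           # row 1 of the view is row 2 of the parent
println(parent(sdf) === df)  # the view keeps a reference to its parent
```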
GroupedDataFrame is a type that stores the result of a grouping operation performed on an AbstractDataFrame. It is intended to be created as a result of a call to the groupby function.
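A short sketch of obtaining a GroupedDataFrame through groupby follows. The column names are made up for illustration, and the order in which groups are returned is deliberately not relied upon, since it may differ between versions:

```julia
using DataFrames

# Illustrative data: two groups (g == 1 and g == 2), two rows each.
df = DataFrame(g = [1, 2, 1, 2], x = [10, 20, 30, 40])

gd = groupby(df, :g)

# Iterating a GroupedDataFrame yields one SubDataFrame per group.
group_sizes = [size(sub, 1) for sub in gd]
all_views = all(sub isa SubDataFrame for sub in gd)

println(length(gd))
println(group_sizes)
```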
DataFrameRow is a view into a single row of an AbstractDataFrame. It stores only a reference to a parent DataFrame and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow type supports iteration over columns of the row and is similar in functionality to the NamedTuple type, but allows for modification of data stored in the parent DataFrame and reflects changes made to the parent after the creation of the view. Typically objects of the DataFrameRow type are encountered when returned by the eachrow function, or when accessing a single row of a DataFrame or SubDataFrame via getindex or view.
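The two-way link between a DataFrameRow and its parent can be illustrated with the following sketch. It assumes property access on rows (row.a) is available, as in DataFrames.jl versions that support getproperty and setproperty!; the data is illustrative:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

row = eachrow(df)[2]    # a DataFrameRow viewing row 2 of df

row.a = 20              # writing through the row modifies the parent
df[2, :b] = "changed"   # a later change made directly to the parent...

println(df[2, :a])
println(row.b)          # ...is visible through the existing row view
```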
The eachrow function returns a value of the DataFrameRows type, which serves as an iterator over rows of an AbstractDataFrame, returning DataFrameRow objects.
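For instance, a row-wise computation might be sketched as below. The columns qty and price are hypothetical, and the generator form is used so the snippet behaves the same in a script and at the REPL:

```julia
using DataFrames

df = DataFrame(qty = [2, 3, 4], price = [1.5, 2.0, 0.5])

rows = eachrow(df)   # iterator of DataFrameRow objects

# Each iteration yields a DataFrameRow; fields are read by column name.
total = sum(row.qty * row.price for row in rows)

println(length(rows))    # one element per row of df
println(rows[2].price)   # rows can also be accessed by position
println(total)
```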
Similarly, the eachcol function returns a value of the DataFrameColumns type, which serves as an iterator over columns of an AbstractDataFrame. The return value can have two concrete types:
- If the eachcol function is called with the names argument set to true (currently the default, but in the future the default will change to false), then it returns a value of the DataFrameColumns{<:AbstractDataFrame, Pair{Symbol, AbstractVector}} type, which is an iterator returning a pair containing the column name and the column vector.
- If the eachcol function is called with the names argument set to false, then it returns a value of the DataFrameColumns{<:AbstractDataFrame, AbstractVector} type, which is an iterator returning the column vector only.
The DataFrameRows and DataFrameColumns types are subtypes of AbstractVector and support its interface with the exception that they are read-only. Note that they are not exported and should not be constructed directly; obtain them by calling the eachrow and eachcol functions.
The RepeatedVector and StackedVector types are subtypes of AbstractVector and support its interface with the exception that they are read-only. Note that they are not exported and should not be constructed directly; they appear as columns of a DataFrame returned by stackdf and meltdf.
Types specification
DataFrames.AbstractDataFrame — Type.
AbstractDataFrame
An abstract type for which all concrete types expose an interface for working with tabular data.
Common methods
An AbstractDataFrame is a two-dimensional table with Symbols for column names. An AbstractDataFrame is also similar to an AbstractDict (called Associative in Julia versions before 0.7) in that it allows indexing by a key (the columns).
The following are normally implemented for AbstractDataFrames:
- describe: summarize columns
- dump: show structure
- hcat: horizontal concatenation
- vcat: vertical concatenation
- repeat: repeat rows
- names: column names
- names!: set column names
- rename!: rename columns based on keyword arguments
- eltypes: eltype of each column
- length: number of columns
- size: (nrows, ncols)
- first: first n rows
- last: last n rows
- convert: convert to an array
- completecases: boolean vector of complete cases (rows with no missings)
- dropmissing: remove rows with missing values
- dropmissing!: remove rows with missing values in-place
- nonunique: indexes of duplicate rows
- unique!: remove duplicate rows
- similar: a DataFrame with similar columns as d
- filter: remove rows
- filter!: remove rows in-place
Indexing
Table columns are accessed (getindex) by a single index that can be a symbol identifier, an integer, or a vector of each. If a single column is selected, just the column object is returned. If multiple columns are selected, an AbstractDataFrame is returned.
d[:colA]
d[3]
d[[:colA, :colB]]
d[[1:3; 5]]
Rows and columns can be indexed like a Matrix with the added feature of indexing columns by name.
d[1:3, :colA]
d[3,3]
d[3,:]
d[3,[:colA, :colB]]
d[:, [:colA, :colB]]
d[[1:3; 5], :]
setindex works similarly.
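The Matrix-like forms above can be tried on a small table. This is a sketch under the assumption that the two-index (row, column) forms behave as described here; the data frame d and its contents are illustrative:

```julia
using DataFrames

d = DataFrame(colA = 1:5,
              colB = ["a", "b", "c", "d", "e"],
              colC = [0.1, 0.2, 0.3, 0.4, 0.5])

col_part = d[1:3, :colA]             # rows 1 to 3 of one column: a vector
cell     = d[3, 3]                   # a single cell, addressed by position
sub      = d[[1:3; 5], [:colA, :colB]]  # rows 1, 2, 3, 5 of two columns

d[3, :colA] = 30                     # setindex! on a single cell

println(col_part)
println(size(sub))
```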
DataFrames.DataFrame — Type.
DataFrame <: AbstractDataFrame
An AbstractDataFrame that stores a set of named columns.
The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.
Constructors
DataFrame(columns::Vector, names::Vector{Symbol}; makeunique::Bool=false)
DataFrame(columns::Matrix, names::Vector{Symbol}; makeunique::Bool=false)
DataFrame(kwargs...)
DataFrame(pairs::Pair{Symbol}...; makeunique::Bool=false)
DataFrame() # an empty DataFrame
DataFrame(t::Type, nrows::Integer, ncols::Integer) # an empty DataFrame of arbitrary size
DataFrame(column_eltypes::Vector, names::Vector, nrows::Integer; makeunique::Bool=false)
DataFrame(column_eltypes::Vector, cnames::Vector, categorical::Vector, nrows::Integer;
makeunique::Bool=false)
DataFrame(ds::AbstractDict)
DataFrame(table; makeunique::Bool=false)
Arguments
- columns: a Vector with each column as contents or a Matrix
- names: the column names
- makeunique: if false (the default), an error will be raised if duplicates in names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
- kwargs: the key gives the column names, and the value is the column contents
- t: elemental type of all columns
- nrows, ncols: number of rows and columns
- column_eltypes: elemental type of each column
- categorical: Vector{Bool} indicating which columns should be converted to CategoricalVector
- ds: AbstractDict of columns
- table: any type that implements the Tables.jl interface
Each column in columns should be the same length.
Notes
A DataFrame is a lightweight object. As long as columns are not manipulated, creation of a DataFrame from existing AbstractVectors is inexpensive. For example, indexing on columns is inexpensive, but indexing by rows is expensive because copies are made of each column.
If a column is passed to a DataFrame constructor or is assigned as a whole using setindex! then its reference is stored in the DataFrame. An exception to this rule is assignment of an AbstractRange as a column, in which case the range is collected to a Vector.
Because column types can vary, a DataFrame is not type stable. For performance-critical code, do not index into a DataFrame inside of loops.
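One common way to follow this advice is a function barrier: extract the column once and pass it to a helper, so the loop runs on a concretely typed vector. The helper sum_squares below is hypothetical and only illustrates the pattern:

```julia
using DataFrames

# Hypothetical helper: inside this function the element type of `x`
# is known to the compiler, so the loop can be specialized.
function sum_squares(x::AbstractVector)
    s = zero(eltype(x))
    for v in x
        s += v * v
    end
    return s
end

df = DataFrame(x = [1.0, 2.0, 3.0])

# The column is extracted once, outside the loop, and passed across
# the function barrier instead of indexing into df on every iteration.
result = sum_squares(df.x)
println(result)
```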
Examples
df = DataFrame()
v = ["x","y","z"][rand(1:3, 10)]
df1 = DataFrame(Any[collect(1:10), v, rand(10)], [:A, :B, :C])
df2 = DataFrame(A = 1:10, B = v, C = rand(10))
dump(df1)
dump(df2)
describe(df2)
first(df1, 10)
df1[:A] + df2[:C]
df1[1:4, 1:2]
df1[[:A,:C]]
df1[1:2, [:A,:C]]
df1[:, [:A,:C]]
df1[:, [1,3]]
df1[1:4, :]
df1[1:4, :C]
df1[1:4, :C] = 40. * df1[1:4, :C]
[df1; df2] # vcat
[df1 df2] # hcat
size(df1)
DataFrames.DataFrameRow — Type.
DataFrameRow{<:AbstractDataFrame,<:AbstractIndex}
A view of one row of an AbstractDataFrame.
A DataFrameRow is constructed with view or getindex when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow function.
A DataFrameRow supports the iteration interface and can therefore be passed to functions that expect a collection as an argument.
Indexing is one-dimensional like specifying a column of a DataFrame. You can also access the data in a DataFrameRow using the getproperty and setproperty! functions and convert it to a NamedTuple using the copy function.
It is possible to create a DataFrameRow with duplicate columns. All such columns will have a reference to the same entry in the parent DataFrame.
If the selection of columns in a parent data frame is passed as : (a colon) then DataFrameRow will always have all columns from the parent, even if they are added or removed after its creation.
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
sdf1 = view(df, 2, :)
sdf2 = @view df[end, [:a]]
sdf3 = eachrow(df)[1]
DataFrames.GroupedDataFrame — Type.
GroupedDataFrame
The result of a groupby operation on an AbstractDataFrame; a view into the AbstractDataFrame grouped by rows.
Not meant to be constructed directly, see groupby.
DataFrames.SubDataFrame — Type.
SubDataFrame{<:AbstractDataFrame,<:AbstractIndex,<:AbstractVector{Int}} <: AbstractDataFrame
A view of an AbstractDataFrame. It is returned by a call to the view function on an AbstractDataFrame if collections of rows and columns are specified.
A SubDataFrame is an AbstractDataFrame, so most DataFrame functions should work on it. Such methods include describe, dump, nrow, size, by, stack, and join.
Indexing is just like a DataFrame except that it is possible to create a SubDataFrame with duplicate columns. All such columns will have a reference to the same entry in the parent DataFrame.
If the selection of columns in a parent data frame is passed as : (a colon) then SubDataFrame will always have all columns from the parent, even if they are added or removed after its creation.
Examples
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
sdf1 = view(df, 2:3) # column subsetting
sdf2 = @view df[end:-1:1, [1,3]] # row and column subsetting
sdf3 = groupby(df, :a)[1] # indexing a GroupedDataFrame returns a SubDataFrame
DataFrames.DataFrameRows — Type.
DataFrameRows{D<:AbstractDataFrame,S<:AbstractIndex} <: AbstractVector{DataFrameRow{D,S}}
Iterator over rows of an AbstractDataFrame, with each row represented as a DataFrameRow.
A value of this type is returned by the eachrow function.
DataFrames.DataFrameColumns — Type.
DataFrameColumns{<:AbstractDataFrame, V} <: AbstractVector{V}
Iterator over columns of an AbstractDataFrame. If V is a Pair{Symbol,AbstractVector} (a value constructed using eachcol(df, true)) then each returned value is a pair consisting of column name and column vector. If V is an AbstractVector (a value returned by eachcol(df, false)) then each returned value is a column vector.
DataFrames.RepeatedVector — Type.
RepeatedVector{T} <: AbstractVector{T}
An AbstractVector that is a view into another AbstractVector with repeated elements.
NOTE: Not exported.
Constructor
RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)
Arguments
- parent: the AbstractVector that's repeated
- inner: the number of times each element is repeated
- outer: the number of times the whole vector is repeated after being expanded by inner
inner and outer have the same meaning as similarly named arguments to repeat.
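Because inner and outer follow the semantics of repeat, the expected contents can be checked with Base's repeat itself. The sketch below uses only Base Julia and materializes ordinary vectors rather than RepeatedVector views:

```julia
x = [1, 2]

only_inner = repeat(x, inner = 3)             # each element repeated 3 times
only_outer = repeat(x, outer = 3)             # whole vector repeated 3 times
both       = repeat(x, inner = 2, outer = 2)  # inner is applied first, then outer

println(only_inner)  # [1, 1, 1, 2, 2, 2]
println(only_outer)  # [1, 2, 1, 2, 1, 2]
println(both)        # [1, 1, 2, 2, 1, 1, 2, 2]
```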
Examples
RepeatedVector([1,2], 3, 1) # [1,1,1,2,2,2]
RepeatedVector([1,2], 1, 3) # [1,2,1,2,1,2]
RepeatedVector([1,2], 2, 2) # [1,1,2,2,1,1,2,2]
DataFrames.StackedVector — Type.
StackedVector <: AbstractVector{Any}
An AbstractVector{Any} that is a linear, concatenated view into another set of AbstractVectors.
NOTE: Not exported.
Constructor
StackedVector(d::AbstractVector...)
Arguments
d...: one or more AbstractVectors
Examples
StackedVector(Any[[1,2], [9,10], [11,12]]) # [1,2,9,10,11,12]