Getting Started
Installation
The DataFrames package is available through the Julia package system and can be installed using the following commands:
using Pkg
Pkg.add("DataFrames")
Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed using DataFrames
to bring all of the relevant variables into your current namespace.
By default DataFrames.jl limits the number of rows and columns when displaying a data frame in a Jupyter Notebook to 25 and 100, respectively. You can override this behavior by changing the values of the ENV["DATAFRAMES_COLUMNS"]
and ENV["DATAFRAMES_ROWS"]
variables to hold the maximum number of columns and rows of the output. All columns or rows will be printed if those numbers are equal or lower than 0.
Alternatively, you may want to set the maximum number of data frame rows to print to 100
and the maximum number of columns to print to 1000
for every Julia session using some Jupyter kernel file (numbers 100
and 1000
are only examples and can be adjusted). In such case add a "DATAFRAME_COLUMNS": "1000", "DATAFRAMES_ROWS": "100"
entry to the "env"
variable in this Jupyter kernel file. See here for information about location and specification of Jupyter kernels.
The package PrettyTables.jl renders the DataFrame
in the Jupyter notebook. Users can customize the output by passing keywords arguments kwargs...
to the function show
: show(stdout, MIME("text/html"), df; kwargs...)
, where df
is the DataFrame
. Any argument supported by PrettyTables.jl in the HTML backend can be used here. Hence, for example, if the user wants to change the color of all numbers smaller than 0 to red in Jupyter, they can execute: show(stdout, MIME("text/html"), df; highlighters = hl_lt(0, HtmlDecoration(color = "red")))
after using PrettyTables
. For more information about the available options, check PrettyTables.jl documentation.
The DataFrame
Type
Objects of the DataFrame
type represent a data table as a series of vectors, each corresponding to a column or variable. The simplest way of constructing a DataFrame
is to pass column vectors using keyword arguments or pairs:
julia> using DataFrames
julia> DataFrame(a=1:4, b=["M", "F", "F", "M"]) # keyword argument constructor
4×2 DataFrame
Row │ a b
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 F
3 │ 3 F
4 │ 4 M
Here are examples of other commonly used ways to construct a data frame:
julia> DataFrame((a=[1, 2], b=[3, 4])) # Tables.jl table constructor from a named tuple of vectors
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 3
2 │ 2 4
julia> DataFrame([(a=1, b=0), (a=2, b=0)]) # Tables.jl table constructor from a vector of named tuples
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame("a" => 1:2, "b" => 0) # Pair constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([:a => 1:2, :b => 0]) # vector of Pairs constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame(Dict(:a => 1:2, :b => 0)) # dictionary constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([[1, 2], [0, 0]], [:a, :b]) # vector of vectors constructor
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
julia> DataFrame([1 0; 2 0], :auto) # matrix constructor
2×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 2 0
Columns can be directly (i.e. without copying) extracted using df.col
, df."col"
, df[!, :col]
or df[!, "col"]
(this rule applies to getting data from a data frame, not writing data to a data frame). The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as :col
, :var"col"
or Symbol("col")
) or strings (written as "col"
). In the forms df."col"
and :var"col"
variable interpolation into a string using $
does not work. Columns can also be extracted using an integer index specifying their position.
Since df[!, :col]
does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df
. To get a copy of the column use df[:, :col]
: changing the vector returned by this syntax does not change df
.
julia> df = DataFrame(A=1:4, B=["M", "F", "F", "M"])
4×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 F
3 │ 3 F
4 │ 4 M
julia> df.A
4-element Vector{Int64}:
1
2
3
4
julia> df."A"
4-element Vector{Int64}:
1
2
3
4
julia> df.A === df[!, :A]
true
julia> df.A === df[:, :A]
false
julia> df.A == df[:, :A]
true
julia> df.A === df[!, "A"]
true
julia> df.A === df[:, "A"]
false
julia> df.A == df[:, "A"]
true
julia> df.A === df[!, 1]
true
julia> df.A === df[:, 1]
false
julia> df.A == df[:, 1]
true
julia> firstcolumn = :A
:A
julia> df[!, firstcolumn] === df.A
true
julia> df[:, firstcolumn] === df.A
false
julia> df[:, firstcolumn] == df.A
true
Column names can be obtained as strings using the names
function:
julia> names(df)
2-element Vector{String}:
"A"
"B"
You can also filter column names by passing a column selector condition as a second argument. See the names
docstring for a detailed list of available conditions. Here we give some selected examples:
julia> names(df, r"A") # a regular expression selector
1-element Vector{String}:
"A"
julia> names(df, Int) # a selector using column element type
1-element Vector{String}:
"A"
julia> names(df, Not(:B)) # selector keeping all columns except :B
1-element Vector{String}:
"A"
To get column names as Symbol
s use the propertynames
function:
julia> propertynames(df)
2-element Vector{Symbol}:
:A
:B
DataFrames.jl allows to use Symbol
s (like :A
) and strings (like "A"
) for all column indexing operations for convenience. However, using Symbol
s is slightly faster and should generally be preferred, if not generating them via string manipulation.
Constructing Column by Column
It is also possible to start with an empty DataFrame
and add columns to it one by one:
julia> df = DataFrame()
0×0 DataFrame
julia> df.A = 1:8
1:8
julia> df[:, :B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Vector{String}:
"M"
"F"
"F"
"M"
"F"
"M"
"M"
"F"
julia> df[!, :C] .= 0
8-element Vector{Int64}:
0
0
0
0
0
0
0
0
julia> df
8×3 DataFrame
Row │ A B C
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 M 0
2 │ 2 F 0
3 │ 3 F 0
4 │ 4 M 0
5 │ 5 F 0
6 │ 6 M 0
7 │ 7 M 0
8 │ 8 F 0
The DataFrame
we build in this way has 8 rows and 3 columns. This can be checked using the size
function:
julia> size(df, 1)
8
julia> size(df, 2)
3
julia> size(df)
(8, 3)
In the above example notice that the df[!, :C] .= 0
expression created a new column in the data frame by broadcasting a scalar.
When setting a column of a data frame the df[!, :C]
and df.C
syntaxes are equivalent and they would replace (or create) the :C
column in df
. This is different from using df[:, :C]
to set a column in a data frame, which updates the contents of column in-place if it already exists.
Here is an example showing this difference. Let us try changing the :B
column to a binary variable.
julia> df[:, :B] = df.B .== "F"
ERROR: MethodError: Cannot `convert` an object of type Bool to an object of type String
julia> df[:, :B] .= df.B .== "F"
ERROR: MethodError: Cannot `convert` an object of type Bool to an object of type String
The above operations did not work because when you use :
as row selector the :B
column is updated in-place, and it only supports storing strings.
On the other hand the following works:
julia> df.B = df.B .== "F"
8-element BitVector:
0
1
1
0
1
0
0
1
julia> df
8×3 DataFrame
Row │ A B C
│ Int64 Bool Int64
─────┼─────────────────────
1 │ 1 false 0
2 │ 2 true 0
3 │ 3 true 0
4 │ 4 false 0
5 │ 5 true 0
6 │ 6 false 0
7 │ 7 false 0
8 │ 8 true 0
As you can see because we used df.B
on the right-hand side of the assignment the :B
column was replaced. The same effect would be achieved if we used df[!, :B]
instead or if we used broadcasted assignment .=
.
In the Indexing section of the manual you can find all details about all the available indexing options.
Constructing Row by Row
It is also possible to fill a DataFrame
row by row. Let us construct an empty data frame with two columns (note that the first column can only contain integers and the second one can only contain strings):
julia> df = DataFrame(A=Int[], B=String[])
0×2 DataFrame
Row │ A B
│ Int64 String
─────┴───────────────
Rows can then be added as tuples or vectors, where the order of elements matches that of columns. To add new rows at the end of a data frame use push!
:
julia> push!(df, (1, "M"))
1×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
julia> push!(df, [2, "N"])
2×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 N
Rows can also be added as Dict
s, where the dictionary keys match the column names:
julia> push!(df, Dict(:B => "F", :A => 3))
3×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 N
3 │ 3 F
Note that constructing a DataFrame
row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrame
s this may be a consideration.
If you want to add rows at the beginning of a data frame use pushfirst!
and to insert a row in an arbitrary location use insert!
.
You can also add whole tables to a data frame using the append!
and prepend!
functions.
Constructing from another table type
DataFrames supports the Tables.jl interface for interacting with tabular data. This means that a DataFrame
can be used as a "source" to any package that expects a Tables.jl interface input, (file format packages, data manipulation packages, etc.). A DataFrame
can also be a sink for any Tables.jl interface input. Some example uses are:
df = DataFrame(a=[1, 2, 3], b=[:a, :b, :c])
# write DataFrame out to CSV file
CSV.write("dataframe.csv", df)
# store DataFrame in an SQLite database table
SQLite.load!(df, db, "dataframe_table")
# transform a DataFrame through Query.jl package
df = df |> @map({a=_.a + 1, _.b}) |> DataFrame
A particular common case of a collection that supports the Tables.jl interface is a vector of NamedTuple
s:
julia> v = [(a=1, b=2), (a=3, b=4)]
2-element Vector{@NamedTuple{a::Int64, b::Int64}}:
(a = 1, b = 2)
(a = 3, b = 4)
julia> df = DataFrame(v)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
You can also easily convert a data frame back to a vector of NamedTuple
s:
julia> using Tables
julia> Tables.rowtable(df)
2-element Vector{@NamedTuple{a::Int64, b::Int64}}:
(a = 1, b = 2)
(a = 3, b = 4)