Getting Started
Installation
The DataFrames package is available through the Julia package system and can be installed using the following commands:
using Pkg
Pkg.add("DataFrames")Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed using DataFrames to bring all of the relevant variables into your current namespace.
By default Jupyter Notebook will limit the number of rows and columns when displaying a data frame to roughly fit the screen size (like in the REPL).
You can override this behavior by changing the values of the ENV["COLUMNS"] and ENV["LINES"] variables to hold the maximum width and height of output in characters respectively.
Alternatively, you may want to set the maximum number of data frame rows to print to 100 and the maximum output width in characters to 1000 for every Julia session using some Jupyter kernel file (numbers 100 and 1000 are only examples and can be adjusted). In such case add a "COLUMNS": "1000", "LINES": "100" entry to the "env" variable in this Jupyter kernel file. See here for information about location and specification of Jupyter kernels.
The DataFrame Type
Objects of the DataFrame type represent a data table as a series of vectors, each corresponding to a column or variable. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs:
julia> using DataFrames
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 F
3 │ 3 F
4 │ 4 MColumns can be directly (i.e. without copying) accessed via df.col, df."col", df[!, :col] or df[!, "col"]. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as :col, :var"col" or Symbol("col")) or strings (written as "col"). Note that in the forms df."col" and :var"col" variable interpolation into a string using $ does not work. Columns can also be accessed using an integer index specifying their position.
Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
julia> df.A
4-element Vector{Int64}:
1
2
3
4
julia> df."A"
4-element Vector{Int64}:
1
2
3
4
julia> df.A === df[!, :A]
true
julia> df.A === df[:, :A]
false
julia> df.A == df[:, :A]
true
julia> df.A === df[!, "A"]
true
julia> df.A === df[:, "A"]
false
julia> df.A == df[:, "A"]
true
julia> df.A === df[!, 1]
true
julia> df.A === df[:, 1]
false
julia> df.A == df[:, 1]
true
julia> firstcolumn = :A
:A
julia> df[!, firstcolumn] === df.A
true
julia> df[:, firstcolumn] === df.A
false
julia> df[:, firstcolumn] == df.A
trueColumn names can be obtained as strings using the names function:
julia> names(df)
2-element Vector{String}:
"A"
"B"To get column names as Symbols use the propertynames function:
julia> propertynames(df)
2-element Vector{Symbol}:
:A
:BDataFrames.jl allows to use Symbols (like :A) and strings (like "A") for all column indexing operations for convenience. However, using Symbols is slightly faster and should generally be preferred, if not generating them via string manipulation.
Constructing Column by Column
It is also possible to start with an empty DataFrame and add columns to it one by one:
julia> df = DataFrame()
0×0 DataFrame
julia> df.A = 1:8
1:8
julia> df.B = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Vector{String}:
"M"
"F"
"F"
"M"
"F"
"M"
"M"
"F"
julia> df
8×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 F
3 │ 3 F
4 │ 4 M
5 │ 5 F
6 │ 6 M
7 │ 7 M
8 │ 8 FThe DataFrame we build in this way has 8 rows and 2 columns. This can be checked using the size function:
julia> size(df, 1)
8
julia> size(df, 2)
2
julia> size(df)
(8, 2)
Constructing Row by Row
It is also possible to fill a DataFrame row by row. Let us construct an empty data frame with two columns (note that the first column can only contain integers and the second one can only contain strings):
julia> df = DataFrame(A = Int[], B = String[])
0×2 DataFrameRows can then be added as tuples or vectors, where the order of elements matches that of columns:
julia> push!(df, (1, "M"))
1×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
julia> push!(df, [2, "N"])
2×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 NRows can also be added as Dicts, where the dictionary keys match the column names:
julia> push!(df, Dict(:B => "F", :A => 3))
3×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 M
2 │ 2 N
3 │ 3 FNote that constructing a DataFrame row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrames this may be a consideration.
Constructing from another table type
DataFrames supports the Tables.jl interface for interacting with tabular data. This means that a DataFrame can be used as a "source" to any package that expects a Tables.jl interface input, (file format packages, data manipulation packages, etc.). A DataFrame can also be a sink for any Tables.jl interface input. Some example uses are:
df = DataFrame(a=[1, 2, 3], b=[:a, :b, :c])
# write DataFrame out to CSV file
CSV.write("dataframe.csv", df)
# store DataFrame in an SQLite database table
SQLite.load!(df, db, "dataframe_table")
# transform a DataFrame through Query.jl package
df = df |> @map({a=_.a + 1, _.b}) |> DataFrameA particular common case of a collection that supports the Tables.jl interface is a vector of NamedTuples:
julia> v = [(a=1, b=2), (a=3, b=4)]
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 1, b = 2)
(a = 3, b = 4)
julia> df = DataFrame(v)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4You can also easily convert a data frame back to a vector of NamedTuples:
julia> using Tables
julia> Tables.rowtable(df)
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 1, b = 2)
(a = 3, b = 4)