Getting Started

Installation

The DataFrames package is available through the Julia package system and can be installed using the following commands:

using Pkg
Pkg.add("DataFrames")

Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed using DataFrames to bring all of the relevant variables into your current namespace.

Note

By default Jupyter Notebook will limit the number of rows and columns when displaying a data frame to roughly fit the screen size (like in the REPL).

You can override this behavior by changing the values of the ENV["COLUMNS"] and ENV["LINES"] variables to hold the maximum width and height of output in characters respectively.

Alternatively, you may want to set the maximum number of data frame rows to print to 100 and the maximum output width in characters to 1000 for every Julia session using some Jupyter kernel file (numbers 100 and 1000 are only examples and can be adjusted). In such case add a "COLUMNS": "1000", "LINES": "100" entry to the "env" variable in this Jupyter kernel file. See here for information about location and specification of Jupyter kernels.

The `DataFrame` Type

Objects of the DataFrame type represent a data table as a series of vectors, each corresponding to a column or variable. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs:

julia> using DataFrames

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M

Columns can be directly (i.e. without copying) accessed via df.col, df."col", df[!, :col] or df[!, "col"]. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name. Note that column names can be either symbols (written as :col, :var"col" or Symbol("col")) or strings (written as "col"). Note that in the forms df."col" and :var"col" variable interpolation into a string using $ does not work. Columns can also be accessed using an integer index specifying their position.

Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.

julia> df.A
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df."A"
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df.A === df[!, :A]
true

julia> df.A === df[:, :A]
false

julia> df.A == df[:, :A]
true

julia> df.A === df[!, "A"]
true

julia> df.A === df[:, "A"]
false

julia> df.A == df[:, "A"]
true

julia> df.A === df[!, 1]
true

julia> df.A === df[:, 1]
false

julia> df.A == df[:, 1]
true

julia> firstcolumn = :A
:A

julia> df[!, firstcolumn] === df.A
true

julia> df[:, firstcolumn] === df.A
false

julia> df[:, firstcolumn] == df.A
true

Column names can be obtained as strings using the names function:

julia> names(df)
2-element Vector{String}:
 "A"
 "B"

To get column names as Symbols use the propertynames function:

julia> propertynames(df)
2-element Vector{Symbol}:
 :A
 :B

Note

DataFrames.jl allows to use Symbols (like :A) and strings (like "A") for all column indexing operations for convenience. However, using Symbols is slightly faster and should generally be preferred, if not generating them via string manipulation.

Constructing Column by Column

It is also possible to start with an empty DataFrame and add columns to it one by one:

julia> df = DataFrame()
0×0 DataFrame

julia> df.A = 1:8
1:8

julia> df.B = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Vector{String}:
 "M"
 "F"
 "F"
 "M"
 "F"
 "M"
 "M"
 "F"

julia> df
8×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M
   5 │     5  F
   6 │     6  M
   7 │     7  M
   8 │     8  F

The DataFrame we build in this way has 8 rows and 2 columns. This can be checked using the size function:

julia> size(df, 1)
8

julia> size(df, 2)
2

julia> size(df)
(8, 2)

Constructing Row by Row

It is also possible to fill a DataFrame row by row. Let us construct an empty data frame with two columns (note that the first column can only contain integers and the second one can only contain strings):

julia> df = DataFrame(A = Int[], B = String[])
0×2 DataFrame

Rows can then be added as tuples or vectors, where the order of elements matches that of columns:

julia> push!(df, (1, "M"))
1×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M

julia> push!(df, [2, "N"])
2×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  N

Rows can also be added as Dicts, where the dictionary keys match the column names:

julia> push!(df, Dict(:B => "F", :A => 3))
3×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  N
   3 │     3  F

Note that constructing a DataFrame row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrames this may be a consideration.

Constructing from another table type

DataFrames supports the Tables.jl interface for interacting with tabular data. This means that a DataFrame can be used as a "source" to any package that expects a Tables.jl interface input, (file format packages, data manipulation packages, etc.). A DataFrame can also be a sink for any Tables.jl interface input. Some example uses are:

df = DataFrame(a=[1, 2, 3], b=[:a, :b, :c])

# write DataFrame out to CSV file
CSV.write("dataframe.csv", df)

# store DataFrame in an SQLite database table
SQLite.load!(df, db, "dataframe_table")

# transform a DataFrame through Query.jl package
df = df |> @map({a=_.a + 1, _.b}) |> DataFrame

A particular common case of a collection that supports the Tables.jl interface is a vector of NamedTuples:

julia> v = [(a=1, b=2), (a=3, b=4)]
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)
 (a = 3, b = 4)

julia> df = DataFrame(v)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

You can also easily convert a data frame back to a vector of NamedTuples:

julia> using Tables

julia> Tables.rowtable(df)
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)
 (a = 3, b = 4)