Getting Started

Getting Started

Installation

The DataFrames package is available through the Julia package system and can be installed using the following command:

Pkg.add("DataFrames")

Throughout the rest of this tutorial, we will assume that you have installed the DataFrames package and have already typed using DataFrames to bring all of the relevant variables into your current namespace.

The DataFrame Type

The DataFrame type can be used to represent data tables, each column of which is a vector. You can specify the columns using keyword arguments or pairs:

julia> using DataFrames

julia> DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │
│ 3   │ 3 │ F │
│ 4   │ 4 │ M │

Constructing Column by Column

It is also possible to construct a DataFrame one column at a time.

julia> df = DataFrame()
0×0 DataFrames.DataFrame


julia> df[:A] = 1:8
1:8

julia> df[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Array{String,1}:
 "M"
 "F"
 "F"
 "M"
 "F"
 "M"
 "M"
 "F"

julia> df
8×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │
│ 3   │ 3 │ F │
│ 4   │ 4 │ M │
│ 5   │ 5 │ F │
│ 6   │ 6 │ M │
│ 7   │ 7 │ M │
│ 8   │ 8 │ F │

The DataFrame we build in this way has 8 rows and 2 columns. You can check this using the size function:

julia> size(df, 1) == 8
true

julia> size(df, 2) == 2
true

julia> size(df) == (8, 2)
true

Constructing Row by Row

It is also possible to construct a DataFrame row by row.

First a DataFrame with empty columns is constructed:

julia> df = DataFrame(A = Int[], B = String[])
0×2 DataFrames.DataFrame

Rows can then be added as Vectors, where the row order matches the columns order:

julia> push!(df, [1, "M"])
1×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │

Rows can also be added as Dicts, where the dictionary keys match the column names:

julia> push!(df, Dict(:B => "F", :A => 2))
2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │

Note that constructing a DataFrame row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrames this may be a consideration.

Working with Data Frames

Taking a Subset

We can also look at small subsets of the data in a couple of different ways:

julia> head(df)
6×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │
│ 3   │ 3 │ F │
│ 4   │ 4 │ M │
│ 5   │ 5 │ F │
│ 6   │ 6 │ M │

julia> tail(df)
6×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 3 │ F │
│ 2   │ 4 │ M │
│ 3   │ 5 │ F │
│ 4   │ 6 │ M │
│ 5   │ 7 │ M │
│ 6   │ 8 │ F │

julia> df[1:3, :]
3×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │
│ 3   │ 3 │ F │

Summarizing with describe

Having seen what some of the rows look like, we can try to summarize the entire data set using describe:

2×8 DataFrames.DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
├─────┼──────────┼──────┼─────┼────────┼─────┼─────────┼──────────┼────────┤
│ 1   │ A        │ 4.5  │ 1   │ 4.5    │ 8   │         │          │ Int64  │
│ 2   │ B        │      │ F   │        │ M   │ 2       │          │ String │

To access individual columns of the dataset, you refer to the column names by their symbol or by their numerical index. Here we extract the first column, :A, and use it to compute the mean and variance.

julia> mean(df[:A]) == mean(df[1]) == 4.5
true

julia> var(df[:A]) ==  var(df[1]) == 6.0
true

Column-Wise Operations

We can also apply a function to each column of a DataFrame with the colwise function. For example:

julia> df = DataFrame(A = 1:4, B = 4.0:-1.0:1.0)
4×2 DataFrames.DataFrame
│ Row │ A │ B   │
├─────┼───┼─────┤
│ 1   │ 1 │ 4.0 │
│ 2   │ 2 │ 3.0 │
│ 3   │ 3 │ 2.0 │
│ 4   │ 4 │ 1.0 │

julia> colwise(sum, df)
2-element Array{Real,1}:
 10
 10.0

Importing and Exporting Data (I/O)

For reading and writing tabular data from CSV and other delimited text files, use the CSV.jl package.

If you have not used the CSV.jl package before then you may need to install it first:

Pkg.add("CSV")

The CSV.jl functions are not loaded automatically and must be imported into the session.

using CSV

A dataset can now be read from a CSV file at path input using

CSV.read(input)

A DataFrame can be written to a CSV file at path output using

df = DataFrame(x = 1, y = 2)
CSV.write(output, df)

The behavior of CSV functions can be adapted via keyword arguments. For more information, use the REPL help-mode or checkout the online CSV.jl documentation.

Loading a Classic Data Set

To see more of the functionality for working with DataFrame objects, we need a more complex data set to work with. We can access Fisher's iris data set using the following functions:

julia> using DataFrames, CSV

julia> iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"));

julia> head(iris)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa  │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa  │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa  │

Querying DataFrames

While the DataFrames package provides basic data manipulation capabilities, users are encouraged to use the Query.jl, which provides a LINQ-like interface to a large number of data sources, including DataFrame instances. See the Querying frameworks section for more information.