The Split-Apply-Combine Strategy

Many data analysis tasks involve splitting a data set into groups, applying some functions to each of the groups and then combining the results. A standardized framework for handling this sort of computation is described in the paper "The Split-Apply-Combine Strategy for Data Analysis", written by Hadley Wickham.

The DataFrames package supports the Split-Apply-Combine strategy through the by function, which takes in three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) a function or expression to apply to each subset of the DataFrame.

We show several examples of the by function applied to the iris dataset below:

julia> using DataFrames, CSV

julia> iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"));

julia> head(iris)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa  │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa  │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa  │

julia> tail(iris)
6×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species   │
├─────┼─────────────┼────────────┼─────────────┼────────────┼───────────┤
│ 1   │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ virginica │
│ 2   │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ virginica │
│ 3   │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ virginica │
│ 4   │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ virginica │
│ 5   │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ virginica │
│ 6   │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ virginica │


julia> by(iris, :Species, size)
3×2 DataFrames.DataFrame
│ Row │ Species    │ x1      │
├─────┼────────────┼─────────┤
│ 1   │ setosa     │ (50, 5) │
│ 2   │ versicolor │ (50, 5) │
│ 3   │ virginica  │ (50, 5) │

julia> by(iris, :Species, df -> mean(df[:PetalLength]))
3×2 DataFrames.DataFrame
│ Row │ Species    │ x1    │
├─────┼────────────┼───────┤
│ 1   │ setosa     │ 1.462 │
│ 2   │ versicolor │ 4.26  │
│ 3   │ virginica  │ 5.552 │

julia> by(iris, :Species, df -> DataFrame(N = size(df, 1)))
3×2 DataFrames.DataFrame
│ Row │ Species    │ N  │
├─────┼────────────┼────┤
│ 1   │ setosa     │ 50 │
│ 2   │ versicolor │ 50 │
│ 3   │ virginica  │ 50 │

The by function also support the do block form:

julia> by(iris, :Species) do df
          DataFrame(m = mean(df[:PetalLength]), s² = var(df[:PetalLength]))
       end
3×3 DataFrames.DataFrame
│ Row │ Species    │ m     │ s²        │
├─────┼────────────┼───────┼───────────┤
│ 1   │ setosa     │ 1.462 │ 0.0301592 │
│ 2   │ versicolor │ 4.26  │ 0.220816  │
│ 3   │ virginica  │ 5.552 │ 0.304588  │

A second approach to the Split-Apply-Combine strategy is implemented in the aggregate function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns of the form $name_$function. For named functions like mean this will produce columns with names like SepalLength_mean. For anonymous functions like x -> sqrt(x)^e, which Julia tracks and references by a numerical identifier e.g. #12, the produced columns will be SepalLength_#12. We show several examples of the aggregate function applied to the iris dataset below:

julia> aggregate(iris, :Species, length)
3×5 DataFrames.DataFrame
│ Row │ Species    │ SepalLength_length │ SepalWidth_length │ PetalLength_length │ PetalWidth_length │
├─────┼────────────┼────────────────────┼───────────────────┼────────────────────┼───────────────────┤
│ 1   │ setosa     │ 50                 │ 50                │ 50                 │ 50                │
│ 2   │ versicolor │ 50                 │ 50                │ 50                 │ 50                │
│ 3   │ virginica  │ 50                 │ 50                │ 50                 │ 50                │

julia> aggregate(iris, :Species, [sum, mean])
3×9 DataFrames.DataFrame
│ Row │ Species    │ SepalLength_sum │ SepalWidth_sum │ PetalLength_sum │ PetalWidth_sum │ SepalLength_mean │ SepalWidth_mean │ PetalLength_mean │ PetalWidth_mean │
├─────┼────────────┼─────────────────┼────────────────┼─────────────────┼────────────────┼──────────────────┼─────────────────┼──────────────────┼─────────────────┤
│ 1   │ setosa     │ 250.3           │ 171.4          │ 73.1            │ 12.3           │ 5.006            │ 3.428           │ 1.462            │ 0.246           │
│ 2   │ versicolor │ 296.8           │ 138.5          │ 213.0           │ 66.3           │ 5.936            │ 2.77            │ 4.26             │ 1.326           │
│ 3   │ virginica  │ 329.4           │ 148.7          │ 277.6           │ 101.3          │ 6.588            │ 2.974           │ 5.552            │ 2.026           │

If you only want to split the data set into subsets, use the groupby function:

julia> for subdf in groupby(iris, :Species)
           println(size(subdf, 1))
       end
50
50
50