Reshaping and Pivoting Data

Reshape data from wide to long format using the stack function:

julia> using DataFrames, CSV

julia> path = joinpath(pkgdir(DataFrames), "docs", "src", "assets", "iris.csv");

julia> iris = CSV.read(path, DataFrame)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String15
─────┼──────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  Iris-setosa
   2 │         4.9         3.0          1.4         0.2  Iris-setosa
   3 │         4.7         3.2          1.3         0.2  Iris-setosa
   4 │         4.6         3.1          1.5         0.2  Iris-setosa
   5 │         5.0         3.6          1.4         0.2  Iris-setosa
   6 │         5.4         3.9          1.7         0.4  Iris-setosa
   7 │         4.6         3.4          1.4         0.3  Iris-setosa
   8 │         5.0         3.4          1.5         0.2  Iris-setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
 144 │         6.8         3.2          5.9         2.3  Iris-virginica
 145 │         6.7         3.3          5.7         2.5  Iris-virginica
 146 │         6.7         3.0          5.2         2.3  Iris-virginica
 147 │         6.3         2.5          5.0         1.9  Iris-virginica
 148 │         6.5         3.0          5.2         2.0  Iris-virginica
 149 │         6.2         3.4          5.4         2.3  Iris-virginica
 150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                        135 rows omitted

julia> stack(iris, 1:4)
600×3 DataFrame
 Row │ Species         variable     value
     │ String15        String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa     SepalLength      5.1
   2 │ Iris-setosa     SepalLength      4.9
   3 │ Iris-setosa     SepalLength      4.7
   4 │ Iris-setosa     SepalLength      4.6
   5 │ Iris-setosa     SepalLength      5.0
   6 │ Iris-setosa     SepalLength      5.4
   7 │ Iris-setosa     SepalLength      4.6
   8 │ Iris-setosa     SepalLength      5.0
  ⋮  │       ⋮              ⋮          ⋮
 594 │ Iris-virginica  PetalWidth       2.3
 595 │ Iris-virginica  PetalWidth       2.5
 596 │ Iris-virginica  PetalWidth       2.3
 597 │ Iris-virginica  PetalWidth       1.9
 598 │ Iris-virginica  PetalWidth       2.0
 599 │ Iris-virginica  PetalWidth       2.3
 600 │ Iris-virginica  PetalWidth       1.8
                            585 rows omitted

The second optional argument to stack indicates the columns to be stacked. These are normally referred to as the measured variables. Column names can also be given:

julia> stack(iris, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth])
600×3 DataFrame
 Row │ Species         variable     value
     │ String15        String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa     SepalLength      5.1
   2 │ Iris-setosa     SepalLength      4.9
   3 │ Iris-setosa     SepalLength      4.7
   4 │ Iris-setosa     SepalLength      4.6
   5 │ Iris-setosa     SepalLength      5.0
   6 │ Iris-setosa     SepalLength      5.4
   7 │ Iris-setosa     SepalLength      4.6
   8 │ Iris-setosa     SepalLength      5.0
  ⋮  │       ⋮              ⋮          ⋮
 594 │ Iris-virginica  PetalWidth       2.3
 595 │ Iris-virginica  PetalWidth       2.5
 596 │ Iris-virginica  PetalWidth       2.3
 597 │ Iris-virginica  PetalWidth       1.9
 598 │ Iris-virginica  PetalWidth       2.0
 599 │ Iris-virginica  PetalWidth       2.3
 600 │ Iris-virginica  PetalWidth       1.8
                            585 rows omitted

Note that all columns can be of different types. Type promotion follows the rules of vcat.

The stacked DataFrame that results includes all of the columns not specified to be stacked. These are repeated for each stacked column. These are normally referred to as identifier (id) columns. In addition to the id columns, two additional columns labeled :variable and :values contain the column identifier and the stacked columns.

A third optional argument to stack represents the id columns that are repeated. This makes it easier to specify which variables you want included in the long format:

julia> stack(iris, [:SepalLength, :SepalWidth], :Species)
300×3 DataFrame
 Row │ Species         variable     value
     │ String15        String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa     SepalLength      5.1
   2 │ Iris-setosa     SepalLength      4.9
   3 │ Iris-setosa     SepalLength      4.7
   4 │ Iris-setosa     SepalLength      4.6
   5 │ Iris-setosa     SepalLength      5.0
   6 │ Iris-setosa     SepalLength      5.4
   7 │ Iris-setosa     SepalLength      4.6
   8 │ Iris-setosa     SepalLength      5.0
  ⋮  │       ⋮              ⋮          ⋮
 294 │ Iris-virginica  SepalWidth       3.2
 295 │ Iris-virginica  SepalWidth       3.3
 296 │ Iris-virginica  SepalWidth       3.0
 297 │ Iris-virginica  SepalWidth       2.5
 298 │ Iris-virginica  SepalWidth       3.0
 299 │ Iris-virginica  SepalWidth       3.4
 300 │ Iris-virginica  SepalWidth       3.0
                            285 rows omitted

If you prefer to specify the id columns then use Not with stack like this:

julia> stack(iris, Not(:Species))
600×3 DataFrame
 Row │ Species         variable     value
     │ String15        String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa     SepalLength      5.1
   2 │ Iris-setosa     SepalLength      4.9
   3 │ Iris-setosa     SepalLength      4.7
   4 │ Iris-setosa     SepalLength      4.6
   5 │ Iris-setosa     SepalLength      5.0
   6 │ Iris-setosa     SepalLength      5.4
   7 │ Iris-setosa     SepalLength      4.6
   8 │ Iris-setosa     SepalLength      5.0
  ⋮  │       ⋮              ⋮          ⋮
 594 │ Iris-virginica  PetalWidth       2.3
 595 │ Iris-virginica  PetalWidth       2.5
 596 │ Iris-virginica  PetalWidth       2.3
 597 │ Iris-virginica  PetalWidth       1.9
 598 │ Iris-virginica  PetalWidth       2.0
 599 │ Iris-virginica  PetalWidth       2.3
 600 │ Iris-virginica  PetalWidth       1.8
                            585 rows omitted

unstack converts from a long format to a wide format. The default is requires specifying which columns are an id variable, column variable names, and column values:

julia> iris.id = 1:size(iris, 1)
1:150

julia> longdf = stack(iris, Not([:Species, :id]))
600×4 DataFrame
 Row │ Species         id     variable     value
     │ String15        Int64  String       Float64
─────┼─────────────────────────────────────────────
   1 │ Iris-setosa         1  SepalLength      5.1
   2 │ Iris-setosa         2  SepalLength      4.9
   3 │ Iris-setosa         3  SepalLength      4.7
   4 │ Iris-setosa         4  SepalLength      4.6
   5 │ Iris-setosa         5  SepalLength      5.0
   6 │ Iris-setosa         6  SepalLength      5.4
   7 │ Iris-setosa         7  SepalLength      4.6
   8 │ Iris-setosa         8  SepalLength      5.0
  ⋮  │       ⋮           ⋮         ⋮          ⋮
 594 │ Iris-virginica    144  PetalWidth       2.3
 595 │ Iris-virginica    145  PetalWidth       2.5
 596 │ Iris-virginica    146  PetalWidth       2.3
 597 │ Iris-virginica    147  PetalWidth       1.9
 598 │ Iris-virginica    148  PetalWidth       2.0
 599 │ Iris-virginica    149  PetalWidth       2.3
 600 │ Iris-virginica    150  PetalWidth       1.8
                                   585 rows omitted

julia> unstack(longdf, :id, :variable, :value)
150×5 DataFrame
 Row │ id     SepalLength  SepalWidth  PetalLength  PetalWidth
     │ Int64  Float64?     Float64?    Float64?     Float64?
─────┼─────────────────────────────────────────────────────────
   1 │     1          5.1         3.5          1.4         0.2
   2 │     2          4.9         3.0          1.4         0.2
   3 │     3          4.7         3.2          1.3         0.2
   4 │     4          4.6         3.1          1.5         0.2
   5 │     5          5.0         3.6          1.4         0.2
   6 │     6          5.4         3.9          1.7         0.4
   7 │     7          4.6         3.4          1.4         0.3
   8 │     8          5.0         3.4          1.5         0.2
  ⋮  │   ⋮         ⋮           ⋮            ⋮           ⋮
 144 │   144          6.8         3.2          5.9         2.3
 145 │   145          6.7         3.3          5.7         2.5
 146 │   146          6.7         3.0          5.2         2.3
 147 │   147          6.3         2.5          5.0         1.9
 148 │   148          6.5         3.0          5.2         2.0
 149 │   149          6.2         3.4          5.4         2.3
 150 │   150          5.9         3.0          5.1         1.8
                                               135 rows omitted

If the remaining columns are unique, you can skip the id variable and use:

julia> unstack(longdf, :variable, :value)
150×6 DataFrame
 Row │ Species         id     SepalLength  SepalWidth  PetalLength  PetalWidth
     │ String15        Int64  Float64?     Float64?    Float64?     Float64?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ Iris-setosa         1          5.1         3.5          1.4         0.2
   2 │ Iris-setosa         2          4.9         3.0          1.4         0.2
   3 │ Iris-setosa         3          4.7         3.2          1.3         0.2
   4 │ Iris-setosa         4          4.6         3.1          1.5         0.2
   5 │ Iris-setosa         5          5.0         3.6          1.4         0.2
   6 │ Iris-setosa         6          5.4         3.9          1.7         0.4
   7 │ Iris-setosa         7          4.6         3.4          1.4         0.3
   8 │ Iris-setosa         8          5.0         3.4          1.5         0.2
  ⋮  │       ⋮           ⋮         ⋮           ⋮            ⋮           ⋮
 144 │ Iris-virginica    144          6.8         3.2          5.9         2.3
 145 │ Iris-virginica    145          6.7         3.3          5.7         2.5
 146 │ Iris-virginica    146          6.7         3.0          5.2         2.3
 147 │ Iris-virginica    147          6.3         2.5          5.0         1.9
 148 │ Iris-virginica    148          6.5         3.0          5.2         2.0
 149 │ Iris-virginica    149          6.2         3.4          5.4         2.3
 150 │ Iris-virginica    150          5.9         3.0          5.1         1.8
                                                               135 rows omitted

You can even skip passing the :variable and :value values as positional arguments, as they will be used by default, and write:

julia> unstack(longdf)
150×6 DataFrame
 Row │ Species         id     SepalLength  SepalWidth  PetalLength  PetalWidth
     │ String15        Int64  Float64?     Float64?    Float64?     Float64?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ Iris-setosa         1          5.1         3.5          1.4         0.2
   2 │ Iris-setosa         2          4.9         3.0          1.4         0.2
   3 │ Iris-setosa         3          4.7         3.2          1.3         0.2
   4 │ Iris-setosa         4          4.6         3.1          1.5         0.2
   5 │ Iris-setosa         5          5.0         3.6          1.4         0.2
   6 │ Iris-setosa         6          5.4         3.9          1.7         0.4
   7 │ Iris-setosa         7          4.6         3.4          1.4         0.3
   8 │ Iris-setosa         8          5.0         3.4          1.5         0.2
  ⋮  │       ⋮           ⋮         ⋮           ⋮            ⋮           ⋮
 144 │ Iris-virginica    144          6.8         3.2          5.9         2.3
 145 │ Iris-virginica    145          6.7         3.3          5.7         2.5
 146 │ Iris-virginica    146          6.7         3.0          5.2         2.3
 147 │ Iris-virginica    147          6.3         2.5          5.0         1.9
 148 │ Iris-virginica    148          6.5         3.0          5.2         2.0
 149 │ Iris-virginica    149          6.2         3.4          5.4         2.3
 150 │ Iris-virginica    150          5.9         3.0          5.1         1.8
                                                               135 rows omitted

Passing view=true to stack returns a data frame whose columns are views into the original wide data frame. Here is an example:

julia> stack(iris, view=true)
600×4 DataFrame
 Row │ Species         id     variable     value
     │ String15        Int64  String       Float64
─────┼─────────────────────────────────────────────
   1 │ Iris-setosa         1  SepalLength      5.1
   2 │ Iris-setosa         2  SepalLength      4.9
   3 │ Iris-setosa         3  SepalLength      4.7
   4 │ Iris-setosa         4  SepalLength      4.6
   5 │ Iris-setosa         5  SepalLength      5.0
   6 │ Iris-setosa         6  SepalLength      5.4
   7 │ Iris-setosa         7  SepalLength      4.6
   8 │ Iris-setosa         8  SepalLength      5.0
  ⋮  │       ⋮           ⋮         ⋮          ⋮
 594 │ Iris-virginica    144  PetalWidth       2.3
 595 │ Iris-virginica    145  PetalWidth       2.5
 596 │ Iris-virginica    146  PetalWidth       2.3
 597 │ Iris-virginica    147  PetalWidth       1.9
 598 │ Iris-virginica    148  PetalWidth       2.0
 599 │ Iris-virginica    149  PetalWidth       2.3
 600 │ Iris-virginica    150  PetalWidth       1.8
                                   585 rows omitted

This saves memory. To create the view, several AbstractVectors are defined:

:variable column – EachRepeatedVector This repeats the variables N times where N is the number of rows of the original AbstractDataFrame.

:value column – StackedVector This is provides a view of the original columns stacked together.

Id columns – RepeatedVector This repeats the original columns N times where N is the number of columns stacked.

To do aggregation, use the split-apply-combine functions in combination with unstack or use the combine keyword argument in unstack. Here is an example:

julia> using Statistics

julia> d = stack(iris, Not(:Species))
750×3 DataFrame
 Row │ Species         variable     value
     │ String15        String       Float64
─────┼──────────────────────────────────────
   1 │ Iris-setosa     SepalLength      5.1
   2 │ Iris-setosa     SepalLength      4.9
   3 │ Iris-setosa     SepalLength      4.7
   4 │ Iris-setosa     SepalLength      4.6
   5 │ Iris-setosa     SepalLength      5.0
   6 │ Iris-setosa     SepalLength      5.4
   7 │ Iris-setosa     SepalLength      4.6
   8 │ Iris-setosa     SepalLength      5.0
  ⋮  │       ⋮              ⋮          ⋮
 744 │ Iris-virginica  id             144.0
 745 │ Iris-virginica  id             145.0
 746 │ Iris-virginica  id             146.0
 747 │ Iris-virginica  id             147.0
 748 │ Iris-virginica  id             148.0
 749 │ Iris-virginica  id             149.0
 750 │ Iris-virginica  id             150.0
                            735 rows omitted

julia> agg = combine(groupby(d, [:variable, :Species]), :value => mean => :vmean)
15×3 DataFrame
 Row │ variable     Species          vmean
     │ String       String15         Float64
─────┼───────────────────────────────────────
   1 │ SepalLength  Iris-setosa        5.006
   2 │ SepalLength  Iris-versicolor    5.936
   3 │ SepalLength  Iris-virginica     6.588
   4 │ SepalWidth   Iris-setosa        3.418
   5 │ SepalWidth   Iris-versicolor    2.77
   6 │ SepalWidth   Iris-virginica     2.974
   7 │ PetalLength  Iris-setosa        1.464
   8 │ PetalLength  Iris-versicolor    4.26
   9 │ PetalLength  Iris-virginica     5.552
  10 │ PetalWidth   Iris-setosa        0.244
  11 │ PetalWidth   Iris-versicolor    1.326
  12 │ PetalWidth   Iris-virginica     2.026
  13 │ id           Iris-setosa       25.5
  14 │ id           Iris-versicolor   75.5
  15 │ id           Iris-virginica   125.5

julia> unstack(agg, :variable, :Species, :vmean)
5×4 DataFrame
 Row │ variable     Iris-setosa  Iris-versicolor  Iris-virginica
     │ String       Float64?     Float64?         Float64?
─────┼───────────────────────────────────────────────────────────
   1 │ SepalLength        5.006            5.936           6.588
   2 │ SepalWidth         3.418            2.77            2.974
   3 │ PetalLength        1.464            4.26            5.552
   4 │ PetalWidth         0.244            1.326           2.026
   5 │ id                25.5             75.5           125.5

julia> unstack(d, :variable, :Species, :value, combine=mean)
5×4 DataFrame
 Row │ variable     Iris-setosa  Iris-versicolor  Iris-virginica
     │ String       Float64?     Float64?         Float64?
─────┼───────────────────────────────────────────────────────────
   1 │ SepalLength        5.006            5.936           6.588
   2 │ SepalWidth         3.418            2.77            2.974
   3 │ PetalLength        1.464            4.26            5.552
   4 │ PetalWidth         0.244            1.326           2.026
   5 │ id                25.5             75.5           125.5

To turn an AbstractDataFrame on its side, use permutedims.

julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b        c      d
     │ String  Float64  Int64  Bool
─────┼───────────────────────────────
   1 │ x           1.0      3   true
   2 │ y           2.0      4  false

julia> permutedims(df1, 1)
3×3 DataFrame
 Row │ a       x        y
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ b           1.0      2.0
   2 │ c           3.0      4.0
   3 │ d           1.0      0.0

Note that the column indexed by src_colnames in the original df becomes the column names in the permuted result, and the column names of the original become a new column. Typically, this would be used on columns with homogeneous element types, since the element types of the other columns are the result of promote_type on all the permuted columns. Note also that, by default, the new column created from the column names of the original df has the same name as src_namescol. An optional positional argument dest_namescol can alter this:

julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2×4 DataFrame
 Row │ a       b    c      d
     │ String  Any  Int64  Bool
─────┼───────────────────────────
   1 │ x       1        3   true
   2 │ y       two      4  false

julia> permutedims(df2, 1, "different_name")
3×3 DataFrame
 Row │ different_name  x     y
     │ String          Any   Any
─────┼─────────────────────────────
   1 │ b               1     two
   2 │ c               3     4
   3 │ d               true  false