Categorical Data
Often, we have to deal with factors that take on a small number of levels:
julia> v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
6-element Array{String,1}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
The naive encoding used in an Array
represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the CategoricalArray
type does:
julia> using CategoricalArrays
julia> cv = CategoricalArray(v)
6-element CategoricalArray{String,1,UInt32}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
CategoricalArrays
support missing values.
julia> cv = CategoricalArray(["Group A", missing, "Group A",
"Group B", "Group B", missing])
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
In addition to representing repeated data efficiently, the CategoricalArray
type allows us to determine efficiently the allowed levels of the variable at any time using the levels
function (note that levels may or may not be actually used in the data):
julia> levels(cv)
2-element Array{String,1}:
"Group A"
"Group B"
The levels!
function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables.
julia> levels!(cv, ["Group B", "Group A"]);
julia> levels(cv)
2-element Array{String,1}:
"Group B"
"Group A"
julia> sort(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group B"
"Group B"
"Group A"
"Group A"
missing
missing
By default, a CategoricalArray
is able to represent 2<sup>32</sup> different levels. You can use less memory by calling the compress
function:
julia> cv = compress(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt8}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
Instead of using the CategoricalArray
constructor directly you can use categorical
function. It additionally accepts one positional argument compress
which when set to true
is equivalent to calling compress
on the new vector:
julia> cv1 = categorical(["A", "B"], true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
If the ordered
keyword argument is set to true
, the resulting CategoricalArray
will be ordered, which means that its levels can be tested for order (rather than throwing an error):
julia> cv2 = categorical(["A", "B"], true, ordered=true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
julia> cv1[1] < cv1[2]
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
julia> cv2[1] < cv2[2]
true
You can check if a CategoricalArray
is ordered using the isordered
function and change between ordered and unordered using ordered!
function.
julia> isordered(cv1)
false
julia> ordered!(cv1, true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
julia> isordered(cv1)
true
julia> cv1[1] < cv1[2]
true
Often, you will have factors encoded inside a DataFrame
with Vector
columns instead of CategoricalVector
columns. You can convert one or more columns of the DataFrame
using the categorical!
function, which modifies the input DataFrame
in-place. Compression can be applied by setting the compress
keyword argument to true
.
julia> using DataFrames
julia> df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
B = ["X", "X", "X", "Y", "Y", "Y"])
6×2 DataFrame
│ Row │ A │ B │
│ │ String │ String │
├─────┼────────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> categorical!(df, :A) # change the column `:A` to be categorical
6×2 DataFrame
│ Row │ A │ B │
│ │ Categorical… │ String │
├─────┼──────────────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
If columns are not specified, all columns with an AbstractString
element type are converted to be categorical. In the example below we also enable compression:
julia> categorical!(df, compress=true)
6×2 DataFrame
│ Row │ A │ B │
│ │ Categorical… │ Categorical… │
├─────┼──────────────┼──────────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> eltype.(eachcol(df))
2-element Array{DataType,1}:
CategoricalString{UInt8}
CategoricalString{UInt8}
Using categorical arrays is important for working with the GLM package. When fitting regression models, CategoricalVector
columns in the input are translated into 0/1 indicator columns in the ModelMatrix
with one column for each of the levels of the CategoricalVector
. This allows one to analyze categorical data efficiently.
See the CategoricalArrays package for more information regarding categorical arrays.