Categorical Data
Often, we have to deal with factors that take on a small number of levels:
julia> v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
6-element Vector{String}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
The naive encoding used in an Array represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the CategoricalArray type does:
julia> using CategoricalArrays
julia> cv = CategoricalArray(v)
6-element CategoricalArray{String,1,UInt32}:
"Group A"
"Group A"
"Group A"
"Group B"
"Group B"
"Group B"
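The "indices into a pool of levels" idea can be made visible with a minimal sketch. It assumes the exported levelcode function, which returns the integer position of a value within levels(cv); the variable names pool, codes, and decoded are illustrative only, and the package's internal storage may differ in detail.

```julia
using CategoricalArrays

v = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
cv = CategoricalArray(v)

# The pool of distinct levels, stored once.
pool = levels(cv)

# One small integer per entry: the position of that entry's level in the pool.
codes = levelcode.(cv)

# Looking the codes up in the pool reconstructs the original strings.
decoded = pool[codes]
decoded == v
```

Each entry therefore costs one integer rather than one full string, and the strings themselves are stored only once.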
CategoricalArrays support missing values:
julia> cv = CategoricalArray(["Group A", missing, "Group A",
"Group B", "Group B", missing])
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
In addition to representing repeated data efficiently, the CategoricalArray type lets us efficiently look up the allowed levels of the variable at any time using the levels function (note that a level may be allowed without actually appearing in the data):
julia> levels(cv)
2-element Vector{String}:
"Group A"
"Group B"
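The remark about levels that are allowed but unused can be illustrated with a short sketch. It assumes the exported droplevels! function, which removes levels that no longer occur in the data; the array and variable names here are illustrative and not part of the example above.

```julia
using CategoricalArrays

x = categorical(["Group A", "Group B"])

# Overwrite the only "Group A" entry; the level itself stays in the pool.
x[1] = "Group B"

allowed = copy(levels(x))   # still lists both levels
present = unique(x)         # only the values that occur in the data

# Remove levels that are no longer used.
droplevels!(x)
remaining = copy(levels(x))
```

After the assignment, allowed still contains "Group A" even though no entry uses it; only droplevels! shrinks the set of levels.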
The levels! function also allows changing the order of appearance of the levels, which can be useful for display purposes or when working with ordered variables:
julia> levels!(cv, ["Group B", "Group A"])
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
julia> levels(cv)
2-element Vector{String}:
"Group B"
"Group A"
julia> sort(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Group B"
"Group B"
"Group A"
"Group A"
missing
missing
By default, a CategoricalArray is able to represent $2^{32}$ different levels. You can use less memory by calling the compress function, which returns a copy that uses the smallest reference type able to hold the existing levels:
julia> cv = compress(cv)
6-element CategoricalArray{Union{Missing, String},1,UInt8}:
"Group A"
missing
"Group A"
"Group B"
"Group B"
missing
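To see the effect of compress on a larger array, the following sketch compares the memory footprint before and after. Base.summarysize is used here only as a rough measure, and the array size of 2000 entries is an arbitrary choice for illustration.

```julia
using CategoricalArrays

# A longer vector with only two distinct levels.
big = categorical(repeat(["Group A", "Group B"], 1000))

# Same values, but stored with a narrower reference type.
small = compress(big)

same_values = small == big
saves_memory = Base.summarysize(small) < Base.summarysize(big)
```

The compressed copy compares equal to the original element by element; only the integer type used for the references changes.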
Instead of using the CategoricalArray constructor directly, you can use the categorical function. It additionally accepts a keyword argument compress which, when set to true, is equivalent to calling compress on the new vector:
julia> cv1 = categorical(["A", "B"], compress=true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
If the ordered keyword argument is set to true, the resulting CategoricalArray will be ordered, which means that its levels can be tested for order (rather than throwing an error):
julia> cv2 = categorical(["A", "B"], ordered=true)
2-element CategoricalArray{String,1,UInt32}:
"A"
"B"
julia> cv1[1] < cv1[2]
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
julia> cv2[1] < cv2[2]
true
You can check whether a CategoricalArray is ordered using the isordered function, and switch between ordered and unordered using the ordered! function:
julia> isordered(cv1)
false
julia> ordered!(cv1, true)
2-element CategoricalArray{String,1,UInt8}:
"A"
"B"
julia> isordered(cv1)
true
julia> cv1[1] < cv1[2]
true
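Order comparisons on an ordered array follow the order of the levels rather than the alphabetical order of the underlying strings. A minimal sketch, using hypothetical "low"/"medium"/"high" data that is not part of the examples above:

```julia
using CategoricalArrays

sizes = categorical(["low", "high", "medium"], ordered=true)

# Levels default to sorted order ("high", "low", "medium");
# put them into their intended order instead.
levels!(sizes, ["low", "medium", "high"])

low_before_high = sizes[1] < sizes[2]      # "low" < "high" under the level order
high_after_medium = sizes[2] > sizes[3]    # "high" > "medium" under the level order

# Sorting also follows the level order.
sorted = sort(sizes)
```

Here sorted lists the values as "low", "medium", "high", whereas sorting the plain strings would put "high" first.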
Using categorical arrays is important for working with the GLM package. When fitting regression models, CategoricalVector columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the CategoricalVector. This allows one to analyze categorical data efficiently.
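The translation into 0/1 indicator columns can be sketched by hand. This is only an illustration of the idea, built with levelcode and a comprehension; it is not how GLM constructs its model matrix, and GLM's own result may differ in detail (for instance, in how a reference level is handled). The names g, lv, and X are illustrative.

```julia
using CategoricalArrays

g = categorical(["A", "B", "A", "C"])
lv = levels(g)

# One row per observation, one column per level:
# entry (i, j) is 1 when observation i belongs to level j, and 0 otherwise.
X = [Int(levelcode(x) == j) for x in g, j in 1:length(lv)]
```

Each row of X contains exactly one 1, in the column of the level that observation belongs to.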
See the CategoricalArrays package for more information regarding categorical arrays.