Pooling Data (Representing Factors)
Often, we have to deal with factors that take on a small number of levels:
dv=@data(["Group A","Group A","Group A","Group B","Group B","Group B"])
The naive encoding used in a DataArray
represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the PooledDataArray
does:
pdv=@pdata(["Group A","Group A","Group A","Group B","Group B","Group B"])
In addition to representing repeated data efficiently, the PooledDataArray
allows us to determine the levels of the factor at any time using the levels
function:
levels(pdv)
By default, a PooledDataArray
is able to represent 232differents levels. You can use less memory by calling the compact
function:
pdv=compact(pdv)
Often, you will have factors encoded inside a DataFrame with DataArray
columns instead of PooledDataArray
columns. You can do conversion of a single column using the pool
function:
pdv=pool(dv)
Or you can edit the columns of a DataFrame
in-place using the pool!
function:
df=DataFrame(A=[1,1,1,2,2,2],B=["X","X","X","Y","Y","Y"])pool!(df,[:A,:B])
Pooling columns is important for working with the GLM package When fitting regression models, PooledDataArray
columns in the input are translated into 0/1 indicator columns in the ModelMatrix
with one column for each of the levels of the PooledDataArray
. This allows one to analyze categorical data efficiently.