This section compares DataFrames.jl with other data manipulation frameworks in Python, R, and Stata.

A sample data set can be created using the following code:

using DataFrames
using Statistics

df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing], id = 'a':'f')
df2 = DataFrame(grp = [1, 3], w = [10, 11])

Comparison with the Python package pandas

The following table compares the main functions of DataFrames.jl with the Python package pandas (version 1.1.0):

import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                   index = list('abcdef'))
df2 = pd.DataFrame({'grp': [1, 3], 'w': [10, 11]})

Because pandas supports multi-index, this example data frame is set up with a to f as row indices rather than a separate id column.

Accessing data

Cell indexing by locationdf.iloc[1, 1]df[2, 2]
Row slicing by locationdf.iloc[1:3]df[2:3, :]
Column slicing by locationdf.iloc[:, 1:]df[:, 2:end]
Row indexing by labeldf.loc['c']df[findfirst(==('c'),, :]
Column indexing by labeldf.loc[:, 'x']df[:, :x]
Column slicing by labeldf.loc[:, ['x', 'z']]df[:, [:x, :z]]
df.loc[:, 'x':'z']df[:, Between(:x, :z)]
Mixed indexingdf.loc['c'][1]df[findfirst(==('c'),, 2]

Note that Julia uses 1-based indexing, inclusive on both ends. A special keyword end can be used to indicate the last index. Likewise, the begin keyword can be used to indicate the first index.

In addition, when indexing a data frame with the findfirst function, a single DataFrameRow object is returned. In the case that id is not unique, you can use the findall function or boolean indexing instead. It would then return a DataFrame object containing all matched rows. The following two lines of code are functionally equivalent:

df[findall(==('c'),, :]
df[ .== 'c', :]

DataFrames.jl's indexing always produces a consistent and predictable return type. By contrast, pandas' loc function returns a Series object when there is exactly one 'c' value in the index, and it returns a DataFrame object when there are multiple rows having the index value of 'c'.

Common operations

Reduce multiple valuesdf['z'].mean(skipna = False)mean(df.z)
df[['z']].agg(['mean'])combine(df, :z => mean ∘ skipmissing)
Add new columnsdf.assign(z1 = df['z'] + 1)transform(df, :z => (v -> v .+ 1) => :z1)
Rename columnsdf.rename(columns = {'x': 'x_new'})rename(df, :x => :x_new)
Pick & transform columnsdf.assign(x_mean = df['x'].mean())[['x_mean', 'y']]select(df, :x => mean, :y)
Sort rowsdf.sort_values(by = 'x')sort(df, :x)
df.sort_values(by = ['grp', 'x'], ascending = [True, False])sort(df, [:grp, order(:x, rev = true)])
Drop missing rowsdf.dropna()dropmissing(df)
Select unique rowsdf.drop_duplicates()unique(df)

Note that pandas skips NaN values in its analytic functions by default. By contrast, Julia functions do not skip NaN's. If necessary, you can filter out the NaN's before processing, for example, mean(Iterators.filter(!isnan, x)).

Pandas uses NaN for representing both missing data and the floating point "not a number" value. Julia defines a special value missing for representing missing data. DataFrames.jl respects general rules in Julia in propagating missing values by default. If necessary, the skipmissing function can be used to remove missing data. See the Missing Data section for more information.

In addition, pandas keeps the original column name after applying a function. DataFrames.jl appends a suffix to the column name by default. To keep it simple, the examples above do not synchronize the column names between pandas and DataFrames.jl (you can pass renamecols=false keyword argument to select, transform and combine functions to retain old column names).

Mutating operations

Add new columnsdf['z1'] = df['z'] + 1df.z1 = df.z .+ 1
transform!(df, :z => (x -> x .+ 1) => :z1)
df.insert(1, 'const', 10)insertcols!(df, 2, :const => 10)
Rename columnsdf.rename(columns = {'x': 'x_new'}, inplace = True)rename!(df, :x => :x_new)
Sort rowsdf.sort_values(by = 'x', inplace = True)sort!(df, :x)
Drop missing rowsdf.dropna(inplace = True)dropmissing!(df)
Select unique rowsdf.drop_duplicates(inplace = True)unique!(df)

Generally speaking, DataFrames.jl follows the Julia convention of using ! in the function name to indicate mutation behavior.

Grouping data and aggregation

DataFrames.jl provides a groupby function to apply operations over each group independently. The result of groupby is a GroupedDataFrame object which may be processed using the combine, transform, or select functions. The following table illustrates some common grouping and aggregation usages.

Aggregate by groupsdf.groupby('grp')['x'].mean()combine(groupby(df, :grp), :x => mean)
Rename column after aggregationdf.groupby('grp')['x'].mean().rename("my_mean")combine(groupby(df, :grp), :x => mean => :my_mean)
Add aggregated data as columndf.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')transform(groupby(df, :grp), :x => mean)
...and select output columnsdf.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')[['grp', 'x_mean']]select(groupby(df, :grp), :id, :x => mean)

Note that pandas returns a Series object for 1-dimensional result unless reset_index is called afterwards. The corresponding DataFrames.jl examples return an equivalent DataFrame object. Consider the first example:

>>> df.groupby('grp')['x'].mean()
1    4
2    3
Name: x, dtype: int64

For DataFrames.jl, it looks like this:

julia> combine(groupby(df, :grp), :x => mean)
2×2 DataFrame
│ Row │ grp   │ x_mean  │
│     │ Int64 │ Float64 │
│ 1   │ 1     │ 4.0     │
│ 2   │ 2     │ 3.0     │

In DataFrames.jl, the GroupedDataFrame object supports an efficient key lookup. Hence, it performs well when you need to perform lookups repeatedly.

More advanced commands

This section includes more complex examples.

Complex Functiondf[['z']].agg(lambda v: np.mean(np.cos(v)))combine(df, :z => v -> mean(cos, skipmissing(v)))
Aggregate multiple columnsdf.agg({'x': max, 'y': min})combine(df, :x => maximum, :y => minimum)
df[['x', 'y']].mean()combine(df, [:x, :y] .=> mean)
df.filter(regex=("^x")).mean()combine(df, names(df, r"^x") .=> mean)
Apply function over multiple variablesdf.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1])transform(df, [:x, :y] => cor)
Row-wise operationdf.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1))transform(df, [:x, :y] => ByRow(min))
df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1))transform(df, AsTable([:x, :y]) => ByRow(argmax))
DataFrame as inputdf.groupby('grp').head(2)combine(d -> first(d, 2), groupby(df, :grp))
DataFrame as outputdf[['x']].agg(lambda x: [min(x), max(x)])combine(:x => x -> (x = [minimum(x), maximum(x)],), df)

Note that pandas preserves the same row order after groupby whereas DataFrames.jl shows them grouped by the provided keys after the combine operation, but select and transform retain an original row ordering.

Joining data frames

DataFrames.jl supports join operations similar to a relational database.

Inner joinpd.merge(df, df2, how = 'inner', on = 'grp')innerjoin(df, df2, on = :grp)
Outer joinpd.merge(df, df2, how = 'outer', on = 'grp')outerjoin(df, df2, on = :grp)
Left joinpd.merge(df, df2, how = 'left', on = 'grp')leftjoin(df, df2, on = :grp)
Right joinpd.merge(df, df2, how = 'right', on = 'grp')rightjoin(df, df2, on = :grp)
Semi join (filtering)df[df.grp.isin(df2.grp)]semijoin(df, df2, on = :grp)
Anti join (filtering)df[~df.grp.isin(df2.grp)]antijoin(df, df2, on = :grp)

For multi-column joins, both pandas and DataFrames.jl accept an array for the on keyword argument.

In the cases of semi joins and anti joins, the isin function in pandas can still be used as long as the join keys are combined in a tuple. In DataFrames.jl, it just works normally with an array of join keys specified in the on keyword argument.

Comparison with the R package dplyr

The following table compares the main functions of DataFrames.jl with the R package dplyr (version 1):

df <- tibble(grp = rep(1:2, 3), x = 6:1, y = 4:9,
             z = c(3:7, NA), id = letters[1:6])
Reduce multiple valuessummarize(df, mean(x))combine(df, :x => mean)
Add new columnsmutate(df, x_mean = mean(x))transform(df, :x => mean => :x_mean)
Rename columnsrename(df, x_new = x)rename(df, :x => :x_new)
Pick columnsselect(df, x, y)select(df, :x, :y)
Pick & transform columnstransmute(df, mean(x), y)select(df, :x => mean, :y)
Pick rowsfilter(df, x >= 1)filter(:x => >=(1), df)
Sort rowsarrange(df, x)sort(df, :x)

As in dplyr, some of these functions can be applied to grouped data frames, in which case they operate by group:

Reduce multiple valuessummarize(group_by(df, grp), mean(x))combine(groupby(df, :grp), :x => mean)
Add new columnsmutate(group_by(df, grp), mean(x))transform(groupby(df, :grp), :x => mean)
Pick & transform columnstransmute(group_by(df, grp), mean(x), y)select(groupby(df, :grp), :x => mean, :y)

The table below compares more advanced commands:

Complex Functionsummarize(df, mean(x, na.rm = T))combine(df, :x => x -> mean(skipmissing(x)))
Transform several columnssummarize(df, max(x), min(y))combine(df, :x => maximum, :y => minimum)
summarize(df, across(c(x, y), mean))combine(df, [:x, :y] .=> mean)
summarize(df, across(starts_with("x"), mean))combine(df, names(df, r"^x") .=> mean)
summarize(df, across(c(x, y), list(max, min)))combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate functionmutate(df, cor(x, y))transform(df, [:x, :y] => cor)
Row-wisemutate(rowwise(df), min(x, y))transform(df, [:x, :y] => ByRow(min))
mutate(rowwise(df), which.max(c_across(matches("^x"))))transform(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as inputsummarize(df, head(across(), 2))combine(d -> first(d, 2), df)
DataFrame as outputsummarize(df, tibble(value = c(min(x), max(x))))combine(:x => x -> (value = [minimum(x), maximum(x)],), df)

Comparison with Stata (version 8 and above)

The following table compares the main functions of DataFrames.jl with Stata:

Reduce multiple valuescollapse (mean) xcombine(df, :x => mean)
Add new columnsegen x_mean = mean(x)transform!(df, :x => mean => :x_mean)
Rename columnsrename x x_newrename!(df, :x => :x_new)
Pick columnskeep x yselect!(df, :x, :y)
Pick rowskeep if x >= 1filter!(:x => >=(1), df)
Sort rowssort xsort!(df, :x)

Note that the suffix ! (i.e. transform!, select!, etc) ensures that the operation transforms the dataframe in place, as in Stata

Some of these functions can be applied to grouped data frames, in which case they operate by group:

Add new columnsegen x_mean = mean(x), by(grp)transform!(groupby(df, :grp), :x => mean)
Reduce multiple valuescollapse (mean) x, by(grp)combine(groupby(df, :grp), :x => mean)

The table below compares more advanced commands:

Transform certain rowsreplace x = 0 if x <= 0transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)
Transform several columnscollapse (max) x (min) ycombine(df, :x => maximum, :y => minimum)
collapse (mean) x ycombine(df, [:x, :y] .=> mean)
collapse (mean) x*combine(df, names(df, r"^x") .=> mean)
collapse (max) x y (min) x ycombine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate functionegen z = corr(x y)transform!(df, [:x, :y] => cor => :z)
Row-wiseegen z = rowmin(x y)transform!(df, [:x, :y] => ByRow(min) => :z)