Skip to content

Grouping & Aggregation

Split your data into groups and compute summary statistics with a single pipeline.

What you'll learn

  • Grouping rows with table_group_by
  • Computing aggregates: count, sum, mean, min, max, median
  • Sorting grouped results
  • Deriving columns before grouping for custom breakdowns
  • Cross-tabulation patterns
  • Pivoting data with table_pivot

Setup

This tutorial uses the employees.csv and sales.csv sample files from the Data Tutorials index.

Basic group-by

The most common analytics task: split rows by a column, then summarize each group. In Tova, this is a two-step pipeline -- table_group_by followed by table_agg.

tova
async fn main() {
  employees = await read("data/employees.csv")

  dept_summary = employees
    |> tableGroupBy("department")
    |> tableAgg({
      headcount: aggCount(),
      avg_salary: aggMean("salary"),
      max_salary: aggMax("salary"),
      min_salary: aggMin("salary"),
      median_salary: aggMedian("salary"),
      avg_score: aggMean("performance_score")
    })
    |> tableSortBy("avg_salary", {desc: true})

  print(dept_summary)
}

Expected output:

_groupheadcountavg_salarymax_salarymin_salarymedian_salaryavg_score
Engineering10152700210000920001500004.34
Sales510700016000072000880003.82
Marketing511720019500088000980004.06

The grouped column appears as _group in the output. Results are sorted by average salary, highest first.

Aggregation functions reference

Tova provides six built-in aggregation functions. Each takes a column name (except agg_count).

FunctionArgumentDescription
aggCount()noneNumber of rows in each group
aggSum("col")column nameSum of values in the column
aggMean("col")column nameArithmetic mean of values
aggMin("col")column nameMinimum value
aggMax("col")column nameMaximum value
aggMedian("col")column nameMedian value (middle element)

You can use any combination of these inside a single table_agg call. Each key in the object becomes a column name in the output.

Group by different columns

You are not limited to one grouping column. Any column works -- just change the string passed to table_group_by.

tova
async fn main() {
  employees = await read("data/employees.csv")

  city_counts = employees
    |> tableGroupBy("city")
    |> tableAgg({
      headcount: aggCount(),
      total_payroll: aggSum("salary"),
      avg_performance: aggMean("performance_score")
    })
    |> tableSortBy("headcount", {desc: true})

  print(city_counts)
}

Expected output:

_groupheadcounttotal_payrollavg_performance
San Francisco58550004.76
Austin55620003.64
Chicago55330003.80
New York56250004.36

Here is another example, this time grouping sales by product:

tova
async fn main() {
  sales = await read("data/sales.csv")

  product_revenue = sales
    |> tableGroupBy("product")
    |> tableAgg({
      total_revenue: aggSum("amount"),
      num_deals: aggCount(),
      avg_deal_size: aggMean("amount"),
      total_units: aggSum("quantity")
    })
    |> tableSortBy("total_revenue", {desc: true})

  print(product_revenue)
}

Expected output:

_grouptotal_revenuenum_dealsavg_deal_sizetotal_units
Cloud Platform2200004550007
CRM Pro15600043900010
Server Rack1950003650005
Security Suite12400034133311
Analytics Tool17400044350015
Office Suite159004397527
Laptop Pro3000031000012

Derived columns before grouping

Sometimes the column you want to group by does not exist yet. Use table_derive to create it first, then group.

A common pattern is extracting a time period from a date column:

tova
async fn main() {
  sales = await read("data/sales.csv")

  monthly = sales
    |> tableDerive({month: fn(r) r.date.slice(0, 7)})
    |> tableGroupBy("month")
    |> tableAgg({revenue: aggSum("amount"), deals: aggCount()})
    |> tableSortBy("_group")

  print(monthly)
}

Expected output:

_grouprevenuedeals
2024-011170002
2024-02917004
2024-031835004
2024-041493004
2024-051214004
2024-062521007

The table_derive call adds a month column by slicing the first 7 characters of each date (e.g., "2024-01-15" becomes "2024-01"). Note the lambda syntax: fn(r) r.date.slice(0, 7) -- no arrow, just the function keyword followed by parameters and the body expression.

Sorting by "_group" orders the results by the group key itself, giving chronological order.

Cross-tabulation

To analyze data across two dimensions, combine the dimensions into a single group key using table_derive, then group on that derived column.

tova
async fn main() {
  employees = await read("data/employees.csv")

  cross = employees
    |> tableDerive({dept_city: fn(r) "{r.department}|{r.city}"})
    |> tableGroupBy("dept_city")
    |> tableAgg({headcount: aggCount(), avg_salary: aggMean("salary")})
    |> tableSortBy("headcount", {desc: true})

  print(cross)
}

Expected output (first few rows):

_groupheadcountavg_salary
Engineering|Austin3114000
Engineering|San Francisco3176667
Sales|New York2124000
Marketing|Chicago393667
.........

The pattern is: derive a combined key with a delimiter (here |), then group on it. You can split the key back apart in downstream processing if needed.

Pivot tables

For a spreadsheet-style pivot, use table_pivot. This transforms grouped data so that one column's distinct values become output columns.

tova
async fn main() {
  sales = await read("data/sales.csv")

  pivoted = sales
    |> tablePivot({index: "region", columns: "category", values: "amount"})

  print(pivoted)
}

The three arguments control the pivot:

ParameterDescription
indexColumn whose values become rows (left axis)
columnsColumn whose distinct values become new column headers
valuesColumn to aggregate into the cells (summed by default)

This gives you one row per region with revenue broken out by product category -- similar to a pivot table in Excel or Google Sheets.

Try it yourself

  1. Top-performing departments: Group employees by department and find which department has the highest average performance_score.

  2. Quarterly sales: Derive a quarter column from the sales date (hint: use .slice(0, 7) to get the month, then map months to quarters), group by it, and compute total revenue per quarter.

  3. City payroll breakdown: Group employees by city, compute total_payroll using aggSum("salary"), and sort by total payroll descending. Which city has the highest total payroll?

  4. Product units pivot: Pivot the sales data with region as the index, product as the columns, and quantity as the values to see how many units of each product sold in each region.

Next

Next: Joins & Combining Data

Released under the MIT License.