This component is currently in beta and may be subject to changes or instability.

The Outlier Report component surfaces unusual values at two levels of granularity: within individual columns, and across combinations of columns. A row may look unremarkable in any single column while still being anomalous as a whole; the multivariate analysis is specifically designed to catch those cases. The output is a warning, not a verdict. Values and rows flagged here are worth a closer look, not outright dismissal.

## What it computes

The component runs two independent analyses.

Per-column outlier report. Each column is analyzed with a method chosen by its type:
  • Numeric columns — Tukey’s fences (Q1 − 1.5 × IQR, Q3 + 1.5 × IQR) define the normal range. Values outside these fences are counted as outliers, and the 10 values furthest from the median are returned as extreme_values.
  • Categorical columns — rare categories are identified and returned.
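As a rough illustration of the numeric case, the sketch below computes Tukey's fences and the values furthest from the median using only the Python standard library. The function name `tukey_report`, the quartile interpolation method (`"inclusive"`), and the sample data are assumptions made for illustration; the component's exact quartile convention is not stated here.

```python
import statistics

def tukey_report(values, k=1.5, n_extreme=10):
    """Sketch of the numeric per-column analysis (hypothetical helper)."""
    # Quartile convention is an assumption; other methods shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    outliers = [v for v in values if v < lower or v > upper]
    # Rank every value by its distance from the median, furthest first.
    ranked = sorted(enumerate(values), key=lambda p: abs(p[1] - median), reverse=True)
    return {
        "lower_fence": lower,
        "upper_fence": upper,
        "outlier_count": len(outliers),
        "extreme_values": [{"value": v, "row_index": i} for i, v in ranked[:n_extreme]],
    }

# Made-up sample: nine ordinary values and one obvious outlier.
report = tukey_report([10, 12, 11, 13, 12, 11, 10, 14, 13, 95])
print(report["lower_fence"], report["upper_fence"], report["outlier_count"])
```

Note that `extreme_values` is ranked by distance from the median, so it can contain values that lie inside the fences when a column has fewer outliers than `n_extreme`.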
Multivariate outlier report. An isolation forest is trained on all numeric columns and assigns every row an anomaly score between 0 and 1. Rows whose score is close enough to 1, as set by anomaly_score_cutoff, are flagged as anomalous. For each flagged row, the features the isolation forest identifies as most likely responsible for the anomaly are reported as likely_responsible_columns.

Columns that are neither numeric nor categorical (e.g. timestamps) and unique-identifier columns are excluded from both analyses.
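To make the scoring step concrete, here is a minimal pure-Python sketch of the textbook isolation forest score, where a row that is isolated after few random splits scores close to 1. It is not the component's implementation: the function names, the fixed seed, and the toy data are all assumptions, and the sketch omits the column attribution step.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected depth of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, limit, rng):
    if len(rows) <= 1 or depth >= limit:
        return {"size": len(rows)}
    # Only features that still vary within this node can separate its rows.
    spans = [(f, min(r[f] for r in rows), max(r[f] for r in rows))
             for f in range(len(rows[0]))]
    spans = [s for s in spans if s[1] < s[2]]
    if not spans:
        return {"size": len(rows)}
    feature, lo, hi = rng.choice(spans)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, limit, rng),
            "right": build_tree(right, depth + 1, limit, rng)}

def path_length(row, node, depth=0):
    if "feature" not in node:
        # Unsplit leaves get credit for the depth they would have needed.
        return depth + average_path_length(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_scores(rows, n_estimators=100, max_samples=256, seed=0):
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(max_samples, len(rows))
    limit = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, limit, rng)
             for _ in range(n_estimators)]
    norm = average_path_length(sample_size)
    return [2.0 ** (-sum(path_length(r, t) for t in trees) / len(trees) / norm)
            for r in rows]

# Toy data: 64 points in the unit square plus one far-away row at index 64.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(64)] + [(10.0, 10.0)]
scores = anomaly_scores(rows)
flagged = [i for i, s in enumerate(scores) if s >= 0.595]
```

In this toy run the far-away row receives the highest score because almost any random split separates it from the rest immediately.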

## Inputs

All inputs are optional. The component runs with sensible defaults and most users will not need to change anything.
| Input | Description | Default |
| --- | --- | --- |
| `n_estimators` | Number of decision trees used to detect anomalies. More trees means more stable results but slower runtime. | 100 |
| `max_samples` | Number of rows each tree is trained on. Scales automatically with dataset size, from 256 (small datasets) up to 2,048 (very large ones). | auto |
| `anomaly_score_cutoff` | How unusual a row must be before it's flagged. Lower values flag more rows; higher values flag fewer. | 0.595 |
| `top_k_likely_responsible_columns` | How many columns to show as likely contributors for each flagged row. | 3 |
| `max_depth` | Controls how deep into each tree the attribution analysis looks. Scales automatically with dataset size. | auto |

## Output shape

```json
{
  "per_column_outlier_report": [
    {
      "column": "price_per_sqft",
      "lower_fence": 87.50,
      "upper_fence": 612.50,
      "outlier_count": 38,
      "outlier_pct": 0.76,
      "extreme_values": [
        { "value": 1840.0, "row_index": 7823 },
        { "value": 1705.0, "row_index": 2104 }
      ]
    },
    {
      "column": "property_type",
      "rare_values": [
        { "value": "Houseboat", "count": 4, "frequency": 0.00008 },
        { "value": "Timeshare", "count": 2, "frequency": 0.00004 }
      ]
    }
  ],
  "multivariate_outliers_report": {
    "anomaly_score_cutoff": 0.5946,
    "outlier_count": 61,
    "outlier_pct": 1.22
  },
  "multivariate_outliers": [
    {
      "row_index": 7823,
      "anomaly_score": 0.847,
      "likely_responsible_columns": [
        { "column": "sale_price", "prevalence_pct": 54.1 },
        { "column": "lot_size_sqft", "prevalence_pct": 31.2 },
        { "column": "year_built", "prevalence_pct": 14.7 }
      ]
    }
  ],
  "anomaly_score_distribution": [
    { "anomaly_score": 0.005, "proportion": 0.031 },
    { "anomaly_score": 0.015, "proportion": 0.044 }
  ]
}
```
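For readers consuming this output programmatically, the following sketch turns the `multivariate_outliers` array into one readable line per flagged row. It assumes the report is available as a JSON string with the field names shown in the example; the helper name `describe_flagged_rows` is hypothetical, and the embedded report is a trimmed copy of the example values.

```python
import json

# Trimmed copy of the example report; only the fields used below are kept.
report = json.loads("""
{
  "multivariate_outliers_report": {
    "anomaly_score_cutoff": 0.5946, "outlier_count": 61, "outlier_pct": 1.22
  },
  "multivariate_outliers": [
    {
      "row_index": 7823,
      "anomaly_score": 0.847,
      "likely_responsible_columns": [
        {"column": "sale_price", "prevalence_pct": 54.1},
        {"column": "lot_size_sqft", "prevalence_pct": 31.2},
        {"column": "year_built", "prevalence_pct": 14.7}
      ]
    }
  ]
}
""")

def describe_flagged_rows(report):
    """Return one line per flagged row, most anomalous first (hypothetical helper)."""
    rows = sorted(report["multivariate_outliers"],
                  key=lambda r: r["anomaly_score"], reverse=True)
    lines = []
    for row in rows:
        cols = ", ".join(f'{c["column"]} ({c["prevalence_pct"]}%)'
                         for c in row["likely_responsible_columns"])
        lines.append(f'row {row["row_index"]}: score {row["anomaly_score"]:.3f}; '
                     f'likely columns: {cols}')
    return lines

lines = describe_flagged_rows(report)
for line in lines:
    print(line)
```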

## Display

The Outlier Report renders four blocks:

- **Per-column outlier report table** — one row per column showing fences and outlier counts (numeric), or rare value lists (categorical).
- **Multivariate outliers summary** — key/value block with `outlier_count`, `outlier_pct`, and `anomaly_score_cutoff`.
- **Anomaly score distribution chart** — area chart of the full score distribution, useful for assessing whether the cutoff is well-placed.
- **Multivariate outliers table** — one row per flagged row with its anomaly score and likely responsible columns.
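One way to judge cutoff placement without the chart is to add up how much of the score distribution sits at or above a candidate cutoff. The sketch below assumes each `anomaly_score` in `anomaly_score_distribution` labels one bin and each `proportion` is that bin's share of rows; the helper name and the distribution values are invented for illustration and are not real component output.

```python
def share_at_or_above(distribution, cutoff):
    """Sum the proportions of all bins whose anomaly_score is at or above the cutoff."""
    return sum(b["proportion"] for b in distribution if b["anomaly_score"] >= cutoff)

# Hypothetical, hand-made distribution (proportions sum to 1).
distribution = [
    {"anomaly_score": 0.405, "proportion": 0.50},
    {"anomaly_score": 0.455, "proportion": 0.30},
    {"anomaly_score": 0.555, "proportion": 0.15},
    {"anomaly_score": 0.655, "proportion": 0.04},
    {"anomaly_score": 0.845, "proportion": 0.01},
]

# Compare how many rows each candidate cutoff would flag.
for cutoff in (0.5, 0.595, 0.7):
    print(cutoff, round(share_at_or_above(distribution, cutoff), 4))

tail = share_at_or_above(distribution, cutoff=0.595)
```

A cutoff that sits in a sparse region between the bulk of the scores and a thin upper tail changes the flagged share only slightly when nudged, which is usually a sign it is reasonably placed.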

## Filtering from chat

Summand can query the report in two ways:

```text
# Per-column report for one column
analyze({
    component: "outlier_report",
    params: { column_name: "year_built" }
})

# Multivariate outlier detail for one anomalous row,
# identified by its 0-indexed row index
analyze({
    component: "outlier_report",
    params: { multivariate_outlier: 10291 }
})
```

By default, Summand sees the summary fields (`per_column_outlier_report` and `multivariate_outliers_report`), which are compact enough for quick answers. Drilling into a specific column or row fetches only that record.

## Compute profile

| Profile | CPU | Memory | Scales with rows |
| --- | --- | --- | --- |
| Fargate | 4 vCPU | 16 GB | Yes |

## Use cases

  • Data quality review — catch data-entry errors, unit mismatches, or upstream pipeline bugs before they affect downstream models.
  • Clinical / operational anomaly detection — surface patients or events with unusual combinations of values that wouldn’t trigger any single-column alert.
  • Pre-modelling sanity check — identify extreme values that could distort training before fitting a predictive model.
  • Rare-category discovery — find low-frequency labels in categorical columns that may need consolidation or special handling.

## Common gotchas

A row can be flagged as a multivariate outlier without being an outlier in any single column. This is expected and is one of the main reasons to run the multivariate analysis. A row with height = 6'2" and weight = 135 lb can be perfectly normal in each column individually while being anomalous in combination. The isolation forest detects the combination; the per-column Tukey fences do not.
The multivariate analysis can be skipped entirely. The isolation forest requires at least one numeric column and at least one row in which all numeric columns are non-null. If every row has a missing value in at least one numeric column, or if there are no numeric columns at all, the multivariate analysis cannot run. The per-column report still runs for any columns it can handle.
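The precondition can be checked before submitting a dataset. The sketch below assumes rows are dictionaries and that missing values appear as `None` or float NaN; both helper names are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption about null encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def can_run_multivariate(rows, numeric_columns):
    """True if there is a numeric column and at least one row complete across all of them."""
    if not numeric_columns:
        return False
    return any(all(not is_missing(row.get(col)) for col in numeric_columns)
               for row in rows)

# Each column has some data, but no single row is complete.
sparse = [{"a": 1.0, "b": None}, {"a": float("nan"), "b": 3.0}]
# The second row is complete across both numeric columns.
usable = [{"a": 1.0, "b": None}, {"a": 2.0, "b": 3.0}]

print(can_run_multivariate(sparse, ["a", "b"]))  # False
print(can_run_multivariate(usable, ["a", "b"]))  # True
```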
A high-cardinality categorical column may report zero rare values. The 1% rare-category threshold is applied to cumulative frequency, and ties at the boundary are excluded together. For example, a column with 50 categories that each appear once in a 100-row dataset reports zero rare values: all singletons are tied and their combined frequency exceeds 1%, so all are excluded. This is intentional: it avoids surfacing noise as signal in high-cardinality columns.
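One plausible reading of this rule is sketched below: categories are visited from rarest to most common in groups of equal count, and a whole group is accepted only if the running total stays within the threshold. The function name, the inclusive `<=` comparison at exactly 1%, and the sample data are assumptions; the component's actual tie handling may differ in detail.

```python
from collections import Counter
from itertools import groupby

def rare_categories(values, threshold=0.01):
    """Sketch of cumulative-frequency rare-category selection (hypothetical helper)."""
    counts = Counter(values)
    total = len(values)
    rare, cumulative = [], 0
    # Visit categories from rarest to most common, one tie group at a time.
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    for count, group in groupby(ordered, key=lambda kv: kv[1]):
        group = list(group)
        group_total = count * len(group)
        # A tie group is accepted or rejected as a whole.
        if (cumulative + group_total) / total > threshold:
            break
        cumulative += group_total
        rare.extend({"value": v, "count": c, "frequency": c / total} for v, c in group)
    return rare

# 50 singleton categories in 100 rows: the tied group covers 50% of rows.
singletons = [f"cat_{i}" for i in range(50)] + ["common"] * 50
# Two genuinely rare labels in 10,000 rows (made-up counts).
listings = ["House"] * 9994 + ["Houseboat"] * 4 + ["Timeshare"] * 2

print(rare_categories(singletons))
print([r["value"] for r in rare_categories(listings)])
```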