What it computes
The component runs two independent analyses.

Per-column outlier report: for each column, the appropriate method is applied based on column type:
- Numeric columns: Tukey’s fences (Q1 − 1.5 × IQR, Q3 + 1.5 × IQR) define the normal range. Values outside these fences are counted as outliers, and the 10 values furthest from the median are returned as extreme_values.
- Categorical columns: rare categories are identified and returned.

Multivariate outlier report: an isolation forest scores each row across the numeric columns. Rows whose anomaly score exceeds anomaly_score_cutoff are flagged, and each flagged row is returned with its likely_responsible_columns.

Columns that are neither numeric nor categorical (e.g. timestamps), as well as unique identifier columns, are excluded from both analyses.
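
The numeric branch of the per-column report can be sketched as follows. This is a minimal illustration, not the component's code: the function name `tukey_outlier_report` and the returned keys other than `extreme_values` are hypothetical, the quartile interpolation method is an assumption, and the sketch assumes `extreme_values` are picked from among the outliers.

```python
from statistics import median, quantiles


def tukey_outlier_report(values, k=1.5, top_n=10):
    """Summarise outliers in one numeric column using Tukey's fences."""
    # Quartiles by linear interpolation. The component's exact quantile
    # method is not documented here, so this choice is an assumption.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr

    # Anything outside [lower_fence, upper_fence] counts as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]

    # Assumption: extreme values are taken from the outliers and ranked
    # by absolute distance from the column median.
    med = median(values)
    extreme = sorted(outliers, key=lambda v: abs(v - med), reverse=True)[:top_n]

    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outlier_count": len(outliers),
        "extreme_values": extreme,
    }


report = tukey_outlier_report([10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 95])
print(report)
```

In this sample the quartiles are 11 and 13, so the fences sit at 8 and 16 and only the value 95 falls outside them.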
Inputs
All inputs are optional. The component runs with sensible defaults and most users will not need to change anything.
| Input | Description | Default |
|---|---|---|
| n_estimators | Number of decision trees used to detect anomalies. More trees means more stable results but slower runtime. | 100 |
| max_samples | Number of rows each tree is trained on. Scales automatically with dataset size, from 256 (small datasets) up to 2 048 (very large ones). | auto |
| anomaly_score_cutoff | How unusual a row must be before it’s flagged. Lower values flag more rows; higher values flag fewer. | 0.595 |
| top_k_likely_responsible_columns | How many columns to show as likely contributors for each flagged row. | 3 |
| max_depth | Controls how deep into each tree the attribution analysis looks. Scales automatically with dataset size. | auto |
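
To give a feel for what a cutoff of 0.595 means, the sketch below uses the standard isolation forest score, s = 2^(−E[h(x)] / c(n)), where E[h(x)] is the mean depth at which a row is isolated and c(n) normalises for the sample size. Whether the component computes its score exactly this way is an assumption, and the helper names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """c(n): expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Score in (0, 1]; requires n >= 2. Shallower isolation gives a higher score."""
    return 2.0 ** (-mean_depth / average_path_length(n))


def is_flagged(mean_depth, n, cutoff=0.595):
    # Assumption: a row is flagged when its score reaches the cutoff.
    return anomaly_score(mean_depth, n) >= cutoff


# With max_samples = 256, a row isolated after about 4 splits is flagged,
# while one that needs about 10 splits is not.
print(is_flagged(4.0, 256), is_flagged(10.0, 256))
```

Under this scoring a row of average depth scores exactly 0.5, so a cutoff of 0.595 only flags rows that are isolated noticeably faster than average.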
Output shape
The output includes two summary reports (per_column_outlier_report and multivariate_outliers_report), which are compact enough for quick answers. Drilling into a specific column or row fetches only that record.
Compute profile
| Profile | CPU | Memory | Scales with rows |
|---|---|---|---|
| Fargate | 4 vCPU | 16 GB | Yes |
Use cases
- Data quality review — catch data-entry errors, unit mismatches, or upstream pipeline bugs before they affect downstream models.
- Clinical / operational anomaly detection — surface patients or events with unusual combinations of values that wouldn’t trigger any single-column alert.
- Pre-modelling sanity check — identify extreme values that could distort training before fitting a predictive model.
- Rare-category discovery — find low-frequency labels in categorical columns that may need consolidation or special handling.
Common gotchas
A row shows as a multivariate outlier but not in any per-column report
This is expected and is one of the main reasons to run the multivariate analysis. A row with height = 6'2", weight = 135 lb can be perfectly normal for each column individually while being anomalous in combination. The isolation forest detects the combination; the per-column Tukey fences do not.
The multivariate report says 'reason_not_computed'
The isolation forest requires at least one numeric column and at least one row where all numeric columns are non-null simultaneously. If every row is missing a value in at least one numeric column, or if there are no numeric columns at all, the multivariate analysis cannot run. The per-column report still runs for any columns it can handle.
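
The two preconditions can be checked up front with a sketch like the one below. The function name, the row representation (a list of dicts with `None` for missing values), and the reason strings are all hypothetical; only the two conditions themselves come from the description above.

```python
def multivariate_precheck(rows, numeric_columns):
    """Return None if the multivariate analysis can run, else a reason string."""
    if not numeric_columns:
        return "no numeric columns"

    # At least one row must have every numeric column populated.
    has_complete_row = any(
        all(row.get(col) is not None for col in numeric_columns)
        for row in rows
    )
    if not has_complete_row:
        return "no row has all numeric columns non-null"
    return None


rows = [
    {"height": 74, "weight": None},
    {"height": None, "weight": 135},
]
print(multivariate_precheck(rows, ["height", "weight"]))
```

Here each column has values, but no single row is complete, so the check reports that the analysis cannot run.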
Rare categories threshold seems too strict or too loose
The 1% threshold is applied to cumulative frequency, and ties at the boundary are excluded together. A column with many singleton categories (e.g. 50 categories each appearing once in a 100-row dataset) will report zero rare values — all singletons are tied and their combined frequency exceeds 1%, so all are excluded. This is intentional: it avoids surfacing noise as signal in high-cardinality columns.
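
One plausible reading of this rule is sketched below: categories are visited from least to most frequent, one tie group at a time, and a group is kept only if the running total of covered rows stays within 1% of the dataset. The function name and the exact boundary handling (less than or equal to the threshold) are assumptions.

```python
from collections import Counter
from itertools import groupby


def rare_categories(values, threshold=0.01):
    """Return categories whose cumulative frequency stays within the threshold."""
    counts = Counter(values)
    total = sum(counts.values())
    budget = threshold * total  # rows that all rare categories may cover together

    rare, covered = [], 0
    # Walk categories from least to most frequent, one tie group at a time.
    by_count = sorted(counts.items(), key=lambda kv: kv[1])
    for count, group in groupby(by_count, key=lambda kv: kv[1]):
        names = [name for name, _ in group]
        group_rows = count * len(names)
        if covered + group_rows > budget:
            break  # the whole tie group is excluded together
        rare.extend(names)
        covered += group_rows
    return rare


# 50 singleton categories in a 100-row column: the tie group covers 50% of rows.
singletons = [f"id_{i}" for i in range(50)] + ["common"] * 50
print(rare_categories(singletons))

# A 1,000-row column with a few low-frequency labels.
skewed = ["a"] * 992 + ["b"] * 5 + ["c"] * 2 + ["d"]
print(rare_categories(skewed))
```

The first call returns an empty list, matching the singleton example above, while the second returns the three low-frequency labels because together they cover only 8 of 1,000 rows.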