What it computes
The measure used for a given pair depends on the types of the two columns:

| Column pair | Measure | Range |
|---|---|---|
| Numeric–numeric | Pearson’s correlation coefficient | −1 to +1 |
| Categorical–categorical | Cramér’s V statistic | 0 to +1 |
| Numeric–categorical | Correlation ratio | 0 to +1 |
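
As a rough illustration of the three measures in the table, here is a minimal pure-Python sketch. The function names are hypothetical and this is not the component's actual implementation; in particular it assumes the uncorrected form of Cramér's V and the square-root form (eta) of the correlation ratio, either of which the component may handle differently.

```python
import math
from collections import Counter, defaultdict


def pearson(xs, ys):
    """Pearson's r for two equal-length numeric sequences (range -1 to +1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cramers_v(a, b):
    """Cramér's V without bias correction for two categorical sequences (0 to 1)."""
    n = len(a)
    joint = Counter(zip(a, b))
    row, col = Counter(a), Counter(b)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k))


def correlation_ratio(categories, values):
    """Correlation ratio (eta) of a numeric column grouped by a categorical one (0 to 1)."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    # Variation of the group means around the grand mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)
```

All three functions divide by a variance-like term, so they fail with a division by zero when a column is constant. That matches the zero-variance case described under Common gotchas.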
Inputs
None. The component runs on the full dataset with no configuration required.
Output shape
summary_table contains each pair once, sorted by descending absolute association value. associations contains each pair twice (both orderings of the columns), so that agent lookups filtering on column_1 alone can find any pair. The artifact is written as correlation_association_matrix.json.
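
The relationship between the two lists can be sketched as follows. Only summary_table, associations, and column_1 are names taken from this page; column_2, association, method, and the build_outputs helper are hypothetical, and the sketch assumes every pair has a computed value.

```python
def build_outputs(pairs):
    """pairs: list of (col_a, col_b, value, method), one tuple per unordered pair."""
    # One row per pair, strongest association (by absolute value) first.
    summary_table = sorted(
        ({"column_1": a, "column_2": b, "association": v, "method": m}
         for a, b, v, m in pairs),
        key=lambda row: abs(row["association"]),
        reverse=True,
    )
    # Two rows per pair, one for each column ordering.
    associations = []
    for a, b, v, m in pairs:
        associations.append({"column_1": a, "column_2": b, "association": v, "method": m})
        associations.append({"column_1": b, "column_2": a, "association": v, "method": m})
    return {"summary_table": summary_table, "associations": associations}
```

Because both orderings are stored, filtering associations on column_1 equal to a given column returns every pair that column takes part in.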
Display
The component renders a sortable table of all column pairs where the association could be computed. Each row shows the two column names, the association value, and the method. The table is sorted by absolute association strength by default.
Filtering from chat
Summand can query associations in two ways:
The association_between_two_columns parameter expects the exact column names joined with & as the separator. The order does not matter, since the associations list stores both orderings.
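
A lookup driven by the association_between_two_columns value might look like the sketch below. The find_association helper and the column_2 field name are hypothetical, and the sketch assumes column names do not themselves contain the & character.

```python
def find_association(associations, association_between_two_columns):
    """Return the row for a pair given as 'first_column&second_column', or None."""
    parts = association_between_two_columns.split("&")
    if len(parts) != 2:
        raise ValueError("expected exactly two column names joined with '&'")
    col_a, col_b = parts
    # Both orderings are stored, so a single-direction match is enough.
    for row in associations:
        if row["column_1"] == col_a and row["column_2"] == col_b:
            return row
    return None
```

With both orderings present in the list, passing the two names in either order resolves to a row carrying the same association value.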
Use cases
- Preliminary feature selection — identify columns that vary strongly together and drop redundant ones before training a model.
- Identify co-occurring groups — e.g. patients with diagnoses A and B who are statistically likely to also have diagnosis C.
- Measure group impact on outcomes — how does insurance type or care setting relate to length of stay?
- Exploratory data analysis — surface unexpected relationships in a new dataset.
Compute profile
| Profile | Resources |
|---|---|
| Fargate | 2 CPU, 8 GB |
Common gotchas
A column I expected to appear is missing from the matrix
Two things will cause a column to be excluded: it’s neither numeric nor categorical (e.g. a timestamp column) or it’s a unique identifier — a column where every value is distinct. Primary-key-style ID columns fall into this last category.
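
The two exclusion rules can be expressed as a small check. This is a sketch under assumptions: is_excluded and the kind labels are hypothetical, and how the component actually infers a column's type is not described here.

```python
def is_excluded(values, kind):
    """kind: 'numeric', 'categorical', or any other label such as 'timestamp'."""
    # Rule 1: the column is neither numeric nor categorical.
    if kind not in ("numeric", "categorical"):
        return True
    # Rule 2: every non-null value is distinct, as in an ID column.
    non_null = [v for v in values if v is not None]
    return len(non_null) > 1 and len(set(non_null)) == len(non_null)
```

Read literally, the second rule would also exclude a continuous numeric column that happens to have no repeated values. Whether the component treats such columns differently is not stated on this page.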
An association shows 'not computed' instead of a value
This happens when at least one column in the pair has zero variance (all rows share the same value), or when there are no rows where both columns have non-null values simultaneously. The reason_not_computed field in the associations list explains which case applies.
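
The two conditions can be checked before any measure is computed, roughly as sketched below. The function name mirrors the reason_not_computed field, but the returned strings are hypothetical placeholders; the actual values the component writes are not documented on this page.

```python
def reason_not_computed(col_a, col_b):
    """Return a reason string if the pair cannot be scored, else None."""
    # Keep only rows where both columns have a value.
    paired = [(a, b) for a, b in zip(col_a, col_b) if a is not None and b is not None]
    if not paired:
        return "no_overlapping_rows"
    # A column whose non-null values are all identical has zero variance.
    for col in (col_a, col_b):
        if len({v for v in col if v is not None}) == 1:
            return "zero_variance"
    return None
```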
High association doesn't mean causation
Correlation, Cramér’s V, and the correlation ratio all measure co-movement or predictability, not causation. A high value tells you the columns move together; it says nothing about why.