Correlation-association matrix

This component is currently in beta and may be subject to changes or instability.

The Correlation-Association Matrix component answers the question: which columns move together, and how strongly? It does not answer questions of causation: two things happening together does not mean one causes the other. For every column pair in the dataset, the component selects one of three association measures based on the column types involved, then assembles a summary table sorted by absolute association strength.

What it computes

The measure used for a given pair depends on the types of the two columns:

Column pair	Measure	Range
Numeric–numeric	Pearson’s correlation coefficient	−1 to +1
Categorical–categorical	Cramér’s V statistic (with Bergsma’s bias correction)	0 to +1
Numeric–categorical	Correlation ratio	0 to +1

Notes:

Columns that are neither numeric nor categorical (e.g. timestamps, timedeltas) are skipped.
Because Cramér’s V is bias-corrected, a value of 0 can mean either no association or an association too weak to distinguish from noise at this sample size.
Entries computed with Cramér’s V or the correlation ratio also report how many categories each categorical column contains. Together with the sample size, this helps interpret how meaningful the association value is.
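
The three measures in the table can be sketched in plain Python as follows. This is a minimal illustration of the standard textbook formulas, not the component's actual implementation. The function names are hypothetical, and the correlation ratio is shown in its square-root form (eta), which is an assumption: the source does not say whether eta or eta squared is reported.

```python
import math
from collections import Counter, defaultdict


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Divides by zero if either column has zero variance.
    return cov / math.sqrt(var_x * var_y)


def correlation_ratio(categories, values):
    """Correlation ratio (eta, assumed form) between a categorical and a numeric list."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    # Spread of the group means around the grand mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)


def cramers_v_corrected(a, b):
    """Cramér's V with Bergsma's bias correction for two categorical lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals, col_totals = Counter(a), Counter(b)
    chi2 = 0.0
    for row_key, row_n in row_totals.items():
        for col_key, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint.get((row_key, col_key), 0) - expected) ** 2 / expected
    n_rows, n_cols = len(row_totals), len(col_totals)
    phi2 = chi2 / n
    # Bias correction: subtract the expected value of phi2 under independence
    # and clamp at zero, so weak associations can come out as exactly 0.
    phi2_corr = max(0.0, phi2 - (n_rows - 1) * (n_cols - 1) / (n - 1))
    rows_corr = n_rows - (n_rows - 1) ** 2 / (n - 1)
    cols_corr = n_cols - (n_cols - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(rows_corr - 1, cols_corr - 1))
```

The clamp in `cramers_v_corrected` is the reason a reported 0 does not distinguish "no association" from "too weak to detect at this sample size".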

Inputs

None. The component runs on the full dataset with no configuration required.

Output shape

{
  "summary_table": [
    {
      "column_1": "length_of_stay",
      "column_2": "insurance_type",
      "association": 0.431,
      "method": "correlation_ratio"
    },
    {
      "column_1": "heart_rate",
      "column_2": "cholesterol",
      "association": 0.287,
      "method": "correlation_coefficient"
    },
    {
      "column_1": "blood_type",
      "column_2": "icd_code",
      "association": 0.104,
      "method": "cramer_v_statistic"
    }
  ],
  "associations": [
    {
      "column_pair": "length_of_stay & insurance_type",
      "column_1": "length_of_stay",
      "column_1_type": "numeric",
      "column_2": "insurance_type",
      "column_2_type": "categorical",
      "sample_size": 4821,
      "association_is_computed": true,
      "association_type": "correlation_ratio",
      "association_value": 0.431,
      "number_of_categories": 4
    },
    {
      "column_pair": "blood_type & icd_code",
      "column_1": "blood_type",
      "column_1_type": "categorical",
      "column_2": "icd_code",
      "column_2_type": "categorical",
      "sample_size": 4821,
      "association_is_computed": true,
      "association_type": "cramer_v_statistic",
      "association_value": 0.104,
      "number_of_categories_in_column_1": 4,
      "number_of_categories_in_column_2": 812
    },
    ...
  ]
}

summary_table contains each pair once, sorted by descending absolute association value. associations contains each pair twice (both orderings of the columns), so that agent lookups filtering on column_1 alone can find any pair. The artifact is written as correlation_association_matrix.json.
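
The relationship between the two lists can be sketched as below: collapse the two orderings of each pair into one row, then sort by absolute value. This is an illustration only. The helper name `build_summary_table` is hypothetical, and skipping entries whose `association_is_computed` is false is an assumption, since the source does not state whether uncomputed pairs appear in `summary_table`.

```python
def build_summary_table(associations):
    """Collapse both orderings of each pair into one row, strongest first."""
    seen = set()
    rows = []
    for entry in associations:
        # Assumption: pairs without a computed value are left out of the summary.
        if not entry.get("association_is_computed"):
            continue
        # A frozenset ignores column order, so (a, b) and (b, a) share one key.
        key = frozenset((entry["column_1"], entry["column_2"]))
        if key in seen:
            continue
        seen.add(key)
        rows.append({
            "column_1": entry["column_1"],
            "column_2": entry["column_2"],
            "association": entry["association_value"],
            "method": entry["association_type"],
        })
    # Sorting on the absolute value keeps strong negative correlations near the top.
    return sorted(rows, key=lambda row: abs(row["association"]), reverse=True)
```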

Display

The component renders a sortable table of all column pairs where the association could be computed. Each row shows the two column names, the association value, and the method. The table is sorted by absolute association strength by default.

Filtering from chat

Summand can query associations in two ways:

# Association between two specific columns
analyze({
    component: "correlation_association_matrix",
    params: { association_between_two_columns: "heart_rate & cholesterol" }
})

# All associations involving one column
analyze({
    component: "correlation_association_matrix",
    params: { all_associations_with_column: "length_of_stay" } 
})

The association_between_two_columns parameter expects the exact column names joined with & as the separator, as in the example above. The order of the two names does not matter, since the associations list stores both orderings.
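
To see why the order does not matter, here is a minimal sketch of the two lookups run directly against the associations list. The function names and the abbreviated sample entries are illustrative assumptions; this is not how the chat tool is implemented.

```python
# Abbreviated entries modeled on the output shape; both orderings are stored.
ASSOCIATIONS = [
    {"column_pair": "heart_rate & cholesterol", "column_1": "heart_rate",
     "column_2": "cholesterol", "association_value": 0.287},
    {"column_pair": "cholesterol & heart_rate", "column_1": "cholesterol",
     "column_2": "heart_rate", "association_value": 0.287},
]


def find_pair(associations, pair):
    """Return the entry whose column_pair equals 'a & b' exactly, or None."""
    for entry in associations:
        if entry["column_pair"] == pair:
            return entry
    return None


def all_with_column(associations, column):
    """Return every entry that has the given column in the column_1 position."""
    # Filtering on column_1 alone is enough because each pair is stored twice.
    return [entry for entry in associations if entry["column_1"] == column]


first = find_pair(ASSOCIATIONS, "heart_rate & cholesterol")
second = find_pair(ASSOCIATIONS, "cholesterol & heart_rate")
print(first["association_value"] == second["association_value"])
```

An exact string match succeeds for either ordering, so no normalization of the pair string is needed on the caller's side.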

Use cases

Preliminary feature selection — identify columns that vary strongly together and drop redundant ones before training a model.
Identify co-occurring groups — e.g. patients with diagnoses A and B who are statistically likely to also have diagnosis C.
Measure group impact on outcomes — how does insurance type or care setting relate to length of stay?
Exploratory data analysis — surface unexpected relationships in a new dataset.

Compute profile

Platform	Profile
Fargate	2 CPU, 8 GB

The component runs in pandas on the curated Parquet. The number of column pairs grows as O(n_columns^2), so runtime on wide datasets (hundreds of columns) grows quadratically, although the work per pair is cheap.
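
As a rough sizing aid, the number of pairs can be computed directly. The sketch below assumes every column is eligible, so the figures are upper bounds; the helper name `pair_counts` is hypothetical.

```python
import math


def pair_counts(n_columns):
    """Upper bounds on pair counts for a dataset with n_columns eligible columns."""
    unique_pairs = math.comb(n_columns, 2)  # n * (n - 1) / 2
    return {
        "unique_pairs": unique_pairs,
        # The associations list stores both orderings of each pair.
        "associations_entries": 2 * unique_pairs,
    }


print(pair_counts(10))
print(pair_counts(300))
```

Going from 10 to 300 columns raises the number of unique pairs from 45 to 44,850.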

Common gotchas

A column I expected to appear is missing from the matrix

Two things will cause a column to be excluded: it is neither numeric nor categorical (e.g. a timestamp column), or it is a unique identifier, that is, a column where every value is distinct. Primary-key-style ID columns fall into the latter category.

An association shows 'not computed' instead of a value

This happens in any of the following cases:

At least one column in the pair has zero variance (all rows share the same value).
There are no rows where both columns have non-null values simultaneously.
One column’s cardinality exactly equals the number of valid rows for that pair (it is trivially, perfectly predictive of the other, so there is nothing informative to measure).
For a categorical pair, the contingency table implied by the two columns’ cardinalities would be too large to compute safely.

The reason_not_computed field in the associations list explains which case applies.
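
The checks described above can be sketched as a single function. Everything specific here is an assumption made for illustration: the reason labels, the order of the checks, the `max_cells` threshold, evaluating variance on the overlapping rows only, and applying the cardinality check to categorical columns only. The actual values of reason_not_computed are not documented in this page.

```python
def reason_not_computed(col_a, col_b, type_a, type_b, max_cells=1_000_000):
    """Return a hypothetical reason label, or None if the pair can be measured."""
    # Keep only rows where both columns have a value.
    pairs = [(a, b) for a, b in zip(col_a, col_b) if a is not None and b is not None]
    if not pairs:
        return "no_overlapping_rows"
    distinct_a = len({a for a, _ in pairs})
    distinct_b = len({b for _, b in pairs})
    if distinct_a == 1 or distinct_b == 1:
        return "zero_variance"
    # Assumption: only categorical columns are checked for full cardinality.
    if (type_a == "categorical" and distinct_a == len(pairs)) or (
        type_b == "categorical" and distinct_b == len(pairs)
    ):
        return "cardinality_equals_row_count"
    # Assumption: max_cells is a made-up limit on contingency table size.
    if type_a == type_b == "categorical" and distinct_a * distinct_b > max_cells:
        return "contingency_table_too_large"
    return None
```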

High association doesn't mean causation

Correlation, Cramér’s V, and the correlation ratio all measure co-movement or predictability, not causation. A high value tells you the columns move together; it says nothing about why.
