Note: This component is currently in beta and may be subject to changes or instability.

The Correlation-Association Matrix component answers the question: which columns move together, and how strongly? It does not answer questions of causation: two things happening together does not mean one causes the other. For every column pair in the dataset, the component selects one of three association measures based on the column types involved, then assembles a summary table sorted by absolute association strength.

What it computes

The measure used for a given pair depends on the types of the two columns:
| Column pair | Measure | Range |
| --- | --- | --- |
| Numeric–numeric | Pearson’s correlation coefficient | −1 to +1 |
| Categorical–categorical | Cramér’s V statistic | 0 to +1 |
| Numeric–categorical | Correlation ratio | 0 to +1 |
Note: Columns that are neither numeric nor categorical (e.g. timestamps, timedeltas) are skipped.
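As a rough illustration of how the three measures differ, here is a minimal pure-Python sketch. The function names and the `pick_measure` helper are hypothetical, and the exact formulas the component uses (for example, whether Cramér’s V is bias-corrected) are not stated on this page, so this assumes the textbook definitions. It does not guard against zero-variance input.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson's r for two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cramers_v(a, b):
    """Cramér's V (uncorrected) from the chi-squared statistic of the contingency table."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    chi2 = 0.0
    for i, n_i in count_a.items():
        for j, n_j in count_b.items():
            expected = n_i * n_j / n
            chi2 += (joint[(i, j)] - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k))

def correlation_ratio(categories, values):
    """Correlation ratio (eta): share of numeric variance explained by the groups."""
    grand_mean = sum(values) / len(values)
    groups = {}
    for category, value in zip(categories, values):
        groups.setdefault(category, []).append(value)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)

def pick_measure(type_1, type_2):
    """Hypothetical dispatch on column types; returns None for skipped pairs."""
    kinds = {type_1, type_2}
    if kinds == {"numeric"}:
        return "correlation_coefficient"
    if kinds == {"categorical"}:
        return "cramer_v_statistic"
    if kinds == {"numeric", "categorical"}:
        return "correlation_ratio"
    return None  # e.g. a timestamp column is involved
```

Only Pearson’s coefficient carries a sign; the other two measures report strength alone, which is why the summary table sorts on absolute value.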

Inputs

None. The component runs on the full dataset with no configuration required.

Output shape

{
  "summary_table": [
    {
      "column_1": "length_of_stay",
      "column_2": "insurance_type",
      "association": 0.431,
      "method": "correlation_ratio"
    },
    {
      "column_1": "heart_rate",
      "column_2": "cholesterol",
      "association": 0.287,
      "method": "correlation_coefficient"
    },
    {
      "column_1": "blood_type",
      "column_2": "icd_code",
      "association": 0.104,
      "method": "cramer_v_statistic"
    }
  ],
  "associations": [
    {
      "column_pair": "length_of_stay & insurance_type",
      "column_1": "length_of_stay",
      "column_1_type": "numeric",
      "column_2": "insurance_type",
      "column_2_type": "categorical",
      "sample_size": 4821,
      "association_is_computed": true,
      "association_type": "correlation_ratio",
      "association_value": 0.431
    },
    ...
  ]
}
summary_table contains each pair once, sorted by descending absolute association value. associations contains each pair twice (both orderings of the columns), so that agent lookups filtering on column_1 alone can find any pair. The artifact is written as correlation_association_matrix.json.
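The relationship between the two lists can be sketched as follows. This is an illustration only: the `build_output` function and the shape of its input (`toy_pairs`, with `method` and `value` keys) are assumptions made for the example, and the data is made up. The output field names are the ones shown in the JSON above.

```python
def build_output(pairs):
    """Assemble summary_table (each pair once) and associations (both orderings)."""
    summary_table = sorted(
        (
            {"column_1": p["column_1"], "column_2": p["column_2"],
             "association": p["value"], "method": p["method"]}
            for p in pairs
        ),
        key=lambda row: abs(row["association"]),  # sign is ignored for ranking
        reverse=True,
    )
    associations = []
    for p in pairs:
        first = (p["column_1"], p["column_1_type"])
        second = (p["column_2"], p["column_2_type"])
        # Emit the pair twice so a filter on column_1 alone finds it either way.
        for (c1, t1), (c2, t2) in ((first, second), (second, first)):
            associations.append({
                "column_pair": f"{c1} & {c2}",
                "column_1": c1, "column_1_type": t1,
                "column_2": c2, "column_2_type": t2,
                "sample_size": p["sample_size"],
                "association_is_computed": True,
                "association_type": p["method"],
                "association_value": p["value"],
            })
    return {"summary_table": summary_table, "associations": associations}

# Made-up input for illustration.
toy_pairs = [
    {"column_1": "a", "column_1_type": "numeric", "column_2": "c",
     "column_2_type": "categorical", "sample_size": 100,
     "method": "correlation_ratio", "value": 0.35},
    {"column_1": "a", "column_1_type": "numeric", "column_2": "b",
     "column_2_type": "numeric", "sample_size": 100,
     "method": "correlation_coefficient", "value": -0.62},
]
output = build_output(toy_pairs)
```

In this toy input the negative coefficient ranks first because its absolute value is larger.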

Display

The component renders a sortable table of all column pairs where the association could be computed. Each row shows the two column names, the association value, and the method. The table is sorted by absolute association strength by default.

Filtering from chat

Summand can query associations in two ways:
# Association between two specific columns
analyze({
    component: "correlation_association_matrix",
    params: { association_between_two_columns: "heart_rate & cholesterol" }
})

# All associations involving one column
analyze({
    component: "correlation_association_matrix",
    params: { all_associations_with_column: "length_of_stay" } 
})
The association_between_two_columns parameter expects the exact column names joined with & as separator — the order does not matter since the associations list stores both orderings.
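To see why the order does not matter, here is a minimal sketch of how the two filters could resolve against the associations list. The two function bodies are assumptions for illustration, not the component’s actual implementation, and the records are trimmed, made-up examples.

```python
# Trimmed, made-up records: each pair appears in both orderings.
associations = [
    {"column_pair": "x & y", "column_1": "x", "column_2": "y", "association_value": 0.5},
    {"column_pair": "y & x", "column_1": "y", "column_2": "x", "association_value": 0.5},
    {"column_pair": "x & z", "column_1": "x", "column_2": "z", "association_value": 0.2},
    {"column_pair": "z & x", "column_1": "z", "column_2": "x", "association_value": 0.2},
]

def association_between_two_columns(records, pair_spec):
    """Exact match on column_pair; works for either ordering of the names."""
    return next((r for r in records if r["column_pair"] == pair_spec), None)

def all_associations_with_column(records, column):
    """Filtering on column_1 alone is enough because both orderings are stored."""
    return [r for r in records if r["column_1"] == column]

forward = association_between_two_columns(associations, "x & y")
backward = association_between_two_columns(associations, "y & x")
with_x = all_associations_with_column(associations, "x")
```

Because the match is exact, a misspelled column name or a different separator returns no record in this sketch.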

Use cases

  • Preliminary feature selection — identify columns that vary strongly together and drop redundant ones before training a model.
  • Identify co-occurring groups — e.g. patients with diagnoses A and B who are statistically likely to also have diagnosis C.
  • Measure group impact on outcomes — how does insurance type or care setting relate to length of stay?
  • Exploratory data analysis — surface unexpected relationships in a new dataset.
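For the feature-selection use case, a minimal sketch of a greedy pass over summary_table might look like the following. The `redundant_columns` helper, the 0.9 threshold, and the choice to drop column_2 are all assumptions made for the example, and the rows are made up. Values from different measures are not strictly comparable, so treat the result as a shortlist to review.

```python
def redundant_columns(summary_table, threshold=0.9):
    """Greedy pass: for each strongly associated pair, mark column_2 as redundant."""
    dropped = []
    for row in summary_table:
        if abs(row["association"]) < threshold:
            break  # table is sorted by descending absolute value, so the rest are weaker
        if row["column_1"] in dropped or row["column_2"] in dropped:
            continue  # one side is already marked, so this pair is resolved
        dropped.append(row["column_2"])
    return dropped

# Made-up rows, already sorted by descending absolute association.
toy_summary = [
    {"column_1": "w", "column_2": "x", "association": 0.97},
    {"column_1": "x", "column_2": "y", "association": -0.95},
    {"column_1": "w", "column_2": "y", "association": 0.93},
    {"column_1": "y", "column_2": "z", "association": 0.20},
]
to_drop = redundant_columns(toy_summary)
```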

Compute profile

| Profile | Resources |
| --- | --- |
| Fargate | 2 CPU, 8 GB |
The component runs in pandas on the curated Parquet. Pairwise computation scales as O(n_columns ^ 2); on wide datasets (hundreds of columns) runtime grows accordingly, but the per-pair work is cheap.
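To make the quadratic growth concrete, this small sketch counts the unordered column pairs for a given width, assuming every column is eligible. The `pair_count` helper is hypothetical.

```python
from itertools import combinations

def pair_count(n_columns):
    """Number of unordered column pairs: n * (n - 1) / 2."""
    return n_columns * (n_columns - 1) // 2

# Cross-check the closed form against explicit enumeration for a small case.
assert pair_count(10) == len(list(combinations(range(10), 2)))

for width in (10, 100, 300):
    print(width, "columns ->", pair_count(width), "pairs")
```

Under that assumption, 10 columns give 45 pairs and 300 columns give 44,850.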

Common gotchas

Two things will cause a column to be excluded: it’s neither numeric nor categorical (e.g. a timestamp column), or it’s a unique identifier, meaning a column where every value is distinct. Primary-key-style ID columns fall into the latter category.
An association is not computed for a pair (association_is_computed is false) when at least one column in the pair has zero variance (all rows share the same value), or when there are no rows where both columns have non-null values simultaneously. The reason_not_computed field in the associations list explains which case applies.
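The two cases can be sketched as a pre-check on a pair of columns. This is an assumption-laden illustration: the returned strings are hypothetical (the actual values of reason_not_computed are not listed on this page), nulls are modelled as `None`, and variance is checked on the jointly non-null rows.

```python
def reason_not_computed(column_1, column_2):
    """Return a hypothetical reason string, or None if the pair can be computed."""
    # Keep only rows where both columns have a value.
    rows = [(a, b) for a, b in zip(column_1, column_2)
            if a is not None and b is not None]
    if not rows:
        return "no_overlapping_rows"
    # A column with a single distinct value has zero variance.
    if len({a for a, _ in rows}) == 1 or len({b for _, b in rows}) == 1:
        return "zero_variance"
    return None

no_overlap = reason_not_computed([1, None], [None, 2])
constant = reason_not_computed([5, 5, 5], [1, 2, 3])
computable = reason_not_computed([1, 2, 3], ["a", "b", "a"])
```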
Correlation, Cramér’s V, and the correlation ratio all measure co-movement or predictability, not causation. A high value tells you the columns move together; it says nothing about why.