Missing-value patterns

This component is currently in beta and may be subject to changes or instability. The Missing-Value Patterns component answers the question: are missing values random or do they follow a pattern?

Use cases

Retail data quality — Are product weight and dimensions always missing together, or are these independent supplier data problems?
Selective disclosure — Are customers who skip the household income question also refusing to answer questions about their spending habits?
Sensor fault attribution — Are multiple conveyor-belt sensors dropping out together, suggesting a shared infrastructure failure?

What it computes

$p$ -value for Little’s MCAR test: values below 5% suggest that data is not missing at random.
Missing-value pattern frequencies: which combinations of columns go missing together most often?
Conditional missingness matrix: when one column goes missing, which other column is most often missing?
Missingness correlation matrix: are some column pairs always missing together, or never missing at the same time? This is done by turning each column into a boolean indicator via the question “is this row missing in this column?” and then computing Pearson’s correlation coefficient between the resulting boolean columns. A value of +1 indicates that the two columns always go missing together; a value of -1 indicates that for every row, exactly one of the two columns is missing: they are never both missing at the same time, and never both observed.

Display

The missing-value patterns component displays several pieces of information in the Catalog.

Result of Little’s MCAR test: the $p$ -value, the sample size, and which numeric columns were used. Only numeric columns with non-zero variance are included; constant columns and entirely-missing columns are excluded.
Missing-value pattern frequencies: how often does each missing-value pattern occur, as a percentage of all rows? (For datasets with many distinct missing-value patterns, only the 100 most common patterns are provided, as well as any patterns where only a single column is missing — even if it is not among the 100 most common.)
Conditional missingness matrix: when the reference column is missing, how often is the conditionally missing columns also missing?
Missingness correlation matrix: how often two columns go missing together and how often only one of them goes missing.

Inputs

None. The component runs on the full dataset with no configuration required.

Output shape

{
    "little_mcar_test": {
        "p_value": float,
        "sample_size": int,
        "n_columns_used": int,
        "columns_used": list[str],  # numeric columns with non-zero variance used in the test; constant columns and entirely-missing columns are excluded
    },
    "missing_value_pattern_frequencies": [
        # Sorted by descending frequency_as_percentage; ties broken lexicographically by missing_columns.
        {
            "frequency_as_percentage": float,
            "missing_columns": str,  # column names joined by " & " in lexicographic order;
                                     # empty string for complete rows
        },
        # ...
    ],
    "missing_value_pattern_frequencies_top_10": [...],  # same schema, first 10 entries
    "conditional_missingness_matrix": [
        # Not sorted. Use conditional_missingness_matrix_top_10 for the highest-rate entries.
        {
            "directional_column_pair": str,  # "col_a -> col_b"
            "reference_column": str,
            "conditionally_missing_column": str,
            "conditional_missing_rate": float,  # P(col_b missing | col_a missing)
            "n_reference_rows_used": int,       # rows where col_a is missing
        },
        # ...
    ],
    "conditional_missingness_matrix_top_10": [...],  # same schema; sorted by descending conditional_missing_rate, ties broken lexicographically by (reference_column, conditionally_missing_column)
    "missingness_correlation_matrix": [
        # Not sorted. Each unordered pair stored twice (both orderings) so that filtering on
        # column_1 alone finds every correlation involving that column.
        # Use missingness_correlation_matrix_top_10 for the entries with the largest absolute value
        # of the coefficient.
        {
            "column_pair": str,  # "col_1 & col_2"
            "column_1": str,
            "column_2": str,
            "missingness_correlation_coefficient": float,
        },
        # ...
    ],
    "missingness_correlation_matrix_top_10": [
        # Unique pairs only (stored once), sorted by descending absolute value of the coefficient; ties broken lexicographically by (column_1, column_2).
        {
            "column_pair": str,
            "column_1": str,
            "column_2": str,
            "missingness_correlation_coefficient": float,
        },
        # ...
    ],
}

When Little’s MCAR test cannot be run, little_mcar_test contains a single reason_not_computed field instead of the fields above:

{
    "little_mcar_test": {
        "reason_not_computed": str
        # Possible values:
        # "[NoValidNumericColumns] ..."              — no numeric columns with non-zero variance
        # "[NoMissingValuesInValidNumericColumns] ..." — included columns are fully observed
        # "[CollinearityPresent] ..."                — covariance matrix is singular
        # "[NegativeDegreesOfFreedom] ..."           — internal error; should not occur in practice
    },
    ...
}

Filtering from chat

Summand can dig into the missing-value patterns in a variety of ways:

# How often a specific missing-value pattern appears (columns listed alphabetically)
analyze({
    component: "missing_value_patterns",
    params: { missing_value_pattern: "dimensions & weight" }
})

# How often the second column is missing given that the first column is missing
analyze({
    component: "missing_value_patterns",
    params: { conditional_missingness_directed_pair: "household_income -> spending_habits" }
})

# How often all other columns go missing given that this reference column is missing
analyze({
    component: "missing_value_patterns",
    params: { conditional_missingness_reference_column: "household_income" }
})

# How often this column goes missing given that one of the other columns is missing
analyze({
    component: "missing_value_patterns",
    params: { conditionally_missing_column: "spending_habits" }
})

# The correlation between the boolean missingness indicators of two specific columns
analyze({
    component: "missing_value_patterns",
    params: { missingness_correlation_column_pair: "dimensions & weight" }
})

# The correlation between the boolean missingness indicators of one column and all other columns
analyze({
    component: "missing_value_patterns",
    params: { missingness_correlation_column: "weight" }
})

Compute profile

Profile	Profile
Fargate	2 CPU, 8 GB

​Use cases

​What it computes

​Display

​Inputs

​Output shape

​Filtering from chat

​Compute profile