This component is currently in beta and may be subject to changes or instability.

The Distribution Profile component answers the question: what is the shape of each column’s data distribution?

Use cases

  • Fraud detection preprocessing — Is the transaction amount column heavy-tailed enough to warrant a log transformation, or is a Normal assumption reasonable?
  • Sensor data validation — Are the readings from a temperature sensor unimodal as expected, or does a second peak suggest a recurring calibration fault?
  • Feature engineering — Which columns are highly skewed and would benefit from normalisation before being fed into a model?

What it computes

  • Skewness and kurtosis: measures of distributional shape — how lopsided and how heavy-tailed each numeric column is.
  • Best-fit parametric distribution: the best-fitting distribution from a candidate set, along with its fitted parameters.
  • Shapiro-Wilk normality test: a p-value for the null hypothesis that each numeric column is normally distributed; a small p-value is evidence against normality.
  • Modality: whether each numeric column’s distribution is unimodal, bimodal, or multimodal.
  • Kernel density estimate (KDE): a smooth estimate of each numeric column’s probability density.
  • Histograms: a 30-bin histogram for each numeric column, and a proportion histogram for categorical columns with 10 or fewer unique categories.
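As an illustration of the first item above, skewness and kurtosis can be derived from a column's central moments. The sketch below uses only the Python standard library and assumes population (biased) moments and excess kurtosis (Normal = 0). Which estimator the component actually uses is not stated here, so treat `shape_moments` as a hypothetical helper rather than the component's implementation.

```python
def shape_moments(values):
    """Return (skewness, excess_kurtosis) from population central moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    if m2 == 0:
        raise ValueError("values are constant; shape is undefined")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5            # > 0 means a longer right tail
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # > 0 means heavier tails than a Normal
    return skewness, excess_kurtosis


# Illustrative data only: a symmetric sample and a right-tailed sample.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

sym_skew, sym_kurt = shape_moments(symmetric)
tail_skew, tail_kurt = shape_moments(right_tailed)
print(f"symmetric:    skewness={sym_skew:.3f}, excess kurtosis={sym_kurt:.3f}")
print(f"right-tailed: skewness={tail_skew:.3f}, excess kurtosis={tail_kurt:.3f}")
```

A symmetric sample gives a skewness of zero, while a single large value on the right pushes the skewness positive.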

Display

The Distribution Profile component displays the following in the Catalog:
  • Distribution profiles table: skewness, kurtosis, best-fit distribution and its parameters, Shapiro-Wilk p-value, modality, and top peaks; one row per numeric column.
  • Fitted distributions: all candidate parametric distributions ranked by BIC, with fitted parameters and goodness-of-fit p-value; one table per numeric column.
  • Numeric histograms: a 30-bin histogram per numeric column.
  • Categorical histograms: a proportion histogram per low-cardinality categorical column (10 or fewer unique categories).
  • Kernel density estimates: a smooth density curve per numeric column.
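To make the BIC ranking in the fitted distributions table more concrete, here is a minimal sketch of how candidates can be scored. It assumes the common convention BIC = k·ln(n) − 2·ln(L), under which a lower value indicates a better trade-off between fit and parameter count. The component's own sign convention and candidate set are not stated here, and the two candidates and the helper names (`bic`, `normal_fit`, `exponential_fit`) are chosen purely for illustration.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC = k * ln(n) - 2 * ln(L); lower is better under this convention."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def normal_fit(values):
    """Maximum-likelihood Normal fit: returns (log_likelihood, n_params)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n  # MLE variance (divides by n)
    log_l = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    return log_l, 2


def exponential_fit(values):
    """Maximum-likelihood Exponential fit (assumes all values are positive)."""
    n = len(values)
    rate = n / sum(values)
    log_l = n * math.log(rate) - rate * sum(values)
    return log_l, 1


# Illustrative data only: readings clustered tightly around 10.
values = [9.1, 9.8, 10.0, 10.2, 10.9]
candidates = {"normal": normal_fit, "exponential": exponential_fit}

scores = {}
for name, fit in candidates.items():
    log_l, k = fit(values)
    scores[name] = bic(log_l, k, len(values))

ranked = sorted(scores, key=scores.get)  # ascending BIC: best candidate first
print(ranked, scores)
```

For data clustered this tightly around its mean, the Normal candidate scores a much lower BIC than the Exponential, despite using one more parameter.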

Inputs

Name | Type | Required | Default | Description
peak_prominence_cutoff | number | No | 0.1 | Controls how pronounced a peak in the distribution must be to count toward modality. Increase to ignore minor bumps; decrease to detect subtler ones.
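The effect of the cutoff can be pictured as a simple filter over detected peaks. The sketch below is an assumption about the general idea, not the component's actual algorithm: it assumes peak prominences are on the same scale as the cutoff, that a peak counts when its prominence is at least the cutoff, and that a column with at most one qualifying peak is treated as unimodal. `classify_modality` and the sample peaks are hypothetical.

```python
def classify_modality(peaks, cutoff=0.1):
    """Count peaks whose prominence meets the cutoff and label the result."""
    kept = [p for p in peaks if p["prominence"] >= cutoff]
    if len(kept) <= 1:
        return "unimodal"
    if len(kept) == 2:
        return "bimodal"
    return "multimodal"


# Illustrative peaks only, shaped like the entries in the "peaks" output field.
peaks = [
    {"location": 20.5, "prominence": 0.80},
    {"location": 35.0, "prominence": 0.15},
    {"location": 50.0, "prominence": 0.04},
]

for cutoff in (0.01, 0.1, 0.2):
    print(cutoff, classify_modality(peaks, cutoff))
```

With these sample peaks, raising the cutoff from 0.01 to 0.1 to 0.2 moves the label from multimodal to bimodal to unimodal, which mirrors the "increase to ignore minor bumps" guidance.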

Output shape

{
    "distribution_profiles": [
        {
            "column": str,
            "sample_size": int,
            "skewness": float,
            "kurtosis": float,
            "best_fit_distribution": str | None,
            "best_fit_distribution_parameters": [{"name": str, "value": float}] | None,
            "best_fit_sample_size": int,
            "shapiro_wilk_p_value": float,            # present when Shapiro-Wilk was computed
            "shapiro_wilk_reason_not_computed": str,  # present instead when Shapiro-Wilk was not computed
            "shapiro_wilk_sample_size": int,
            "modality": "unimodal" | "bimodal" | "multimodal",
            "peaks": [
                {"location": float, "prominence": float},
                # ...  top 20 most prominent peaks, ordered by descending prominence
            ],
        },
        # ...  numeric columns only
    ],
    "fitted_distributions": [
        {
            "column": str,
            "sample_size": int,
            "reason_not_computed": str,  # present when column has fewer than 2 observed values
            "distributions": [
                {
                    "distribution_name": str,
                    "bayesian_information_criterion": float,        # absent when fitting failed for this distribution
                    "parameters": [{"name": str, "value": float}],  # absent when fitting failed for this distribution
                    "goodness_of_fit_p_value": float,               # absent when fitting failed for this distribution
                    "reason_not_computed": str,                     # present when fitting failed for this distribution
                },
                # ...  one entry per candidate distribution, ranked by descending BIC
            ],
        },
        # ...  numeric columns only
    ],
    "numeric_histograms": [
        {
            "column": str,
            "plot_data": [
                {
                    "left": float,
                    "center": float,
                    "right": float,
                    "percentage": float,
                },
                # ...  30 entries
            ],
        },
        # ...  numeric columns only
    ],
    "categorical_histograms": [
        {
            "column": str,
            "plot_data": [
                {
                    "category": str,
                    "percentage": float,
                },
                # ...  one entry per unique category, ordered by descending percentage
            ],
        },
        # ...  categorical columns with 10 or fewer unique categories only
    ],
    "kernel_density_estimates": [
        {
            "column": str,
            "bandwidth": float,
            "plot_data": [
                {
                    "sample_point": float,
                    "density_estimate": float,
                },
                # ...  100 entries
            ],
        },
        # ...  numeric columns only
    ],
}
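Because several fields in this shape are conditionally present, downstream code should read them defensively. The following sketch assumes the artifact has already been loaded into a Python dict with the structure above. The helper names, thresholds, and sample values are hypothetical and only illustrate one way to consume the output.

```python
def skewed_non_normal_columns(artifact, skew_threshold=1.0, alpha=0.05):
    """Columns that are strongly skewed and for which Shapiro-Wilk rejects normality."""
    flagged = []
    for profile in artifact["distribution_profiles"]:
        # The p-value key is absent when Shapiro-Wilk was not computed.
        p_value = profile.get("shapiro_wilk_p_value")
        if p_value is None:
            continue
        if abs(profile["skewness"]) > skew_threshold and p_value < alpha:
            flagged.append(profile["column"])
    return flagged


def successful_fits(artifact, column):
    """Candidate fits for one column, skipping entries where fitting failed."""
    for entry in artifact["fitted_distributions"]:
        if entry["column"] == column:
            return [
                d for d in entry.get("distributions", [])
                if "reason_not_computed" not in d
            ]
    return []


# Hypothetical artifact fragment with made-up values, for illustration only.
artifact = {
    "distribution_profiles": [
        {"column": "transaction_amount", "skewness": 3.2, "shapiro_wilk_p_value": 0.001},
        {"column": "temperature", "skewness": 0.1, "shapiro_wilk_p_value": 0.40},
        {"column": "sparse_col", "skewness": 2.5,
         "shapiro_wilk_reason_not_computed": "too few values"},
    ],
    "fitted_distributions": [
        {"column": "transaction_amount", "sample_size": 1000, "distributions": [
            {"distribution_name": "lognormal", "bayesian_information_criterion": 1520.0},
            {"distribution_name": "beta", "reason_not_computed": "fit did not converge"},
        ]},
    ],
}

print(skewed_non_normal_columns(artifact))
print([d["distribution_name"] for d in successful_fits(artifact, "transaction_amount")])
```

Using `dict.get` for the p-value and checking for `reason_not_computed` keeps the code from raising `KeyError` on columns where a test or fit was skipped.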

Filtering from chat

Summand can query the distribution profile artifact in a variety of ways:
# Skewness, kurtosis, best-fit distribution, Shapiro-Wilk p-value, modality, and peaks for a specific column
analyze({
    component: "distribution_profile",
    params: { distribution_shape_for_column: "transaction_amount" }
})

# All candidate distributions ranked by BIC, with fitted parameters and goodness-of-fit p-value
analyze({
    component: "distribution_profile",
    params: { list_of_fitted_distributions_for_column: "transaction_amount" }
})

# 30-bin histogram for a specific numeric column
analyze({
    component: "distribution_profile",
    params: { numeric_histogram_for_column: "transaction_amount" }
})

# Proportion histogram for a specific categorical column
analyze({
    component: "distribution_profile",
    params: { categorical_histogram_for_column: "product_category" }
})

# KDE plot data and bandwidth for a specific numeric column
analyze({
    component: "distribution_profile",
    params: { kernel_density_estimate_for_column: "transaction_amount" }
})

Compute profile

Profile | Value
Fargate | 2 CPU, 8 GB