Components reference

Every dataset’s semantic layer is built from the components below. Each one is independently versioned and stored in S3. Components that are exposed to AI agents are addressable by name through the MCP get_semantic_data tool; ebm_model and umap_embedding are not exposed.

At a glance

| Component | Purpose | Required | Depends on |
| --- | --- | --- | --- |
| `column_stats` | Per-column univariate statistics | No | none |
| `feature_metadata` | LLM-generated dataset & feature descriptions | Yes | none |
| `ebm_model` | Trained Explainable Boosting Machine | Yes | none |
| `ebm_graphs` | Feature-effect graphs and importances | Yes | `ebm_model` |
| `umap_embedding` | 2D UMAP projection for cluster views | No | none |
| `semantic_metadata` | Dataset-level overview and run metadata | Yes | none |

`column_stats`

Per-column univariate statistics computed from the full source data, before feature selection. Useful for data profiling and quick checks (“how many nulls in email?”, “what’s the median revenue?”).

Filename: column_stats.json · Content type: application/json

Shape

{
  "totalRows": 500000,
  "totalColumns": 42,
  "columns": [
    {
      "name": "age",
      "dtype": "int64",
      "count": 498500,
      "nullCount": 1500,
      "nullPct": 0.3,
      "uniqueCount": 89,
      "uniquePct": 0.02,
      "min": 18, "max": 99,
      "mean": 45.3, "median": 44, "std": 18.2,
      "q1": 30, "q3": 61,
      "zeroCount": 0, "zeroPct": 0.0,
      "isTarget": false
    }
  ]
}

Categorical columns include topValues (the ten most frequent values with counts) instead of numeric quantiles.

Querying

| Param | Type | Effect |
| --- | --- | --- |
| `column_name` | string | Returns stats for a single column instead of all columns |

get_semantic_data(component="column_stats", params={"column_name": "age"})
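
As a minimal sketch of what the column_name filter amounts to, the snippet below looks up one entry in an already-loaded column_stats payload and recomputes nullPct from the counts. The helper name find_column is hypothetical. Treating nullPct as a percentage of totalRows is an inference from the example above (1500 / 500000 × 100 = 0.3), not documented behaviour.

```python
import json

# Trimmed copy of the example payload shown under "Shape".
COLUMN_STATS = json.loads("""
{
  "totalRows": 500000,
  "totalColumns": 42,
  "columns": [
    {"name": "age", "dtype": "int64", "count": 498500, "nullCount": 1500, "nullPct": 0.3}
  ]
}
""")


def find_column(stats, column_name):
    """Return the stats entry for one column, or None if absent (hypothetical helper)."""
    for column in stats["columns"]:
        if column["name"] == column_name:
            return column
    return None


age = find_column(COLUMN_STATS, "age")
# Assumption: nullPct is nullCount expressed as a percentage of totalRows.
recomputed_null_pct = round(age["nullCount"] / COLUMN_STATS["totalRows"] * 100, 2)
```

In the example payload, count and nullCount add up to totalRows, which is a cheap consistency check when profiling.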

`feature_metadata`

Human-readable descriptions of the dataset and each selected feature, generated by Claude with structured outputs. User-supplied context is merged in and takes precedence over auto-generated descriptions.

Filename: feature_metadata.json · Content type: application/json

Shape

{
  "selectedFeatures": ["age", "income", "credit_score", "employment_status"],
  "context": {
    "dataset_context": "500,000 financial records from 2020-2024, target is loan_default (binary).",
    "feature_contexts": {
      "age": "Age of the applicant in years, ranging from 18 to 99.",
      "income": "Annual household income in USD."
    }
  },
  "categoricalMappings": {
    "employment_status": {
      "type": "categorical",
      "examples": ["employed", "self-employed", "unemployed", "retired"],
      "n_unique": 4
    }
  },
  "artifacts": {
    "income": {
      "sentinel_values": [],
      "extreme_outliers": [{"value": 9999999.0, "count": 5, "std_from_median": 45.3}],
      "artifact_summary": "1 extreme repeated outlier detected",
      "is_binary": false
    }
  }
}

artifacts flags data-quality oddities (sentinel values like 999999, repeated extreme outliers) that the pipeline detected during preparation.

Querying

No params — the artifact is small and always returned in full.
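
The precedence rule for user-supplied context can be pictured as a dictionary merge in which user entries overwrite generated ones. This is a minimal sketch, not the pipeline's code: merge_context and both input dictionaries are hypothetical, and only the dataset_context and feature_contexts keys come from the shape above.

```python
def merge_context(generated, user_supplied):
    """Merge two context dicts; user-supplied values win on conflict (hypothetical helper)."""
    merged_features = dict(generated.get("feature_contexts", {}))
    merged_features.update(user_supplied.get("feature_contexts", {}))
    return {
        # Fall back to the generated dataset description only when the user gave none.
        "dataset_context": user_supplied.get("dataset_context")
        or generated.get("dataset_context", ""),
        "feature_contexts": merged_features,
    }


generated = {
    "dataset_context": "Auto-generated dataset description.",
    "feature_contexts": {
        "age": "Auto-generated description of age.",
        "income": "Annual household income in USD.",
    },
}
user_supplied = {"feature_contexts": {"age": "Applicant age at application time."}}

context = merge_context(generated, user_supplied)
```

Features the user did not describe keep their generated text, so partial user context is enough to override individual entries.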

`ebm_model`

The trained Explainable Boosting Machine for the dataset. This is the model behind every feature-effect graph and SHAP-like explanation in Summand.

Filename: model.pkl · Content type: application/octet-stream

What it contains

A pickled InterpretML EBM (from the Python package interpret), trained on up to 50,000 sampled rows of the selected features against the target column. Configuration:

max_bins=256, max_interaction_bins=32, interactions=5
outer_bags=8, learning_rate=0.04, smoothing_rounds=500
early_stopping_rounds=100, min_samples_leaf=4
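
For orientation, the configuration above maps onto constructor keywords of InterpretML's EBM estimators. The sketch below is an assumption about how the values might be passed, not Summand's training code: build_ebm is a hypothetical helper, and it assumes a recent interpret release in which all of these keywords (including smoothing_rounds) are accepted.

```python
# Hyperparameters copied from the configuration listed above.
EBM_PARAMS = {
    "max_bins": 256,
    "max_interaction_bins": 32,
    "interactions": 5,
    "outer_bags": 8,
    "learning_rate": 0.04,
    "smoothing_rounds": 500,
    "early_stopping_rounds": 100,
    "min_samples_leaf": 4,
}


def build_ebm(task):
    """Return an unfitted EBM for 'classification' or 'regression' (hypothetical helper)."""
    # Imported lazily so the parameter dict is usable without interpret installed.
    from interpret.glassbox import (
        ExplainableBoostingClassifier,
        ExplainableBoostingRegressor,
    )

    if task == "classification":
        return ExplainableBoostingClassifier(**EBM_PARAMS)
    if task == "regression":
        return ExplainableBoostingRegressor(**EBM_PARAMS)
    raise ValueError(f"unsupported task: {task!r}")
```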

Constraints

The model is not exposed to AI agents because it is a binary blob with no useful summary. Training is also gated by preflight checks, which reject unsuitable targets and prune unusable columns:

Multi-class targets (>2 classes) are rejected; binary classification or regression only.
Targets need at least 50 non-null samples and ≥2 distinct values.
Numeric targets need non-zero variance.
Datetime columns, zero-variance columns, and high-cardinality nominals (>10,000 unique values) are dropped before training.
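
The target checks above can be sketched as a single function that returns the first reason for rejection. This is an illustration under stated assumptions rather than the pipeline's implementation: preflight_target is a hypothetical name, the rejection messages are invented, and the task type is passed in explicitly because the text does not say how it is inferred.

```python
import math
import statistics


def preflight_target(values, task):
    """Return a rejection reason for the target column, or None if it passes.

    `task` is "classification" or "regression" (supplied by the caller in this sketch).
    """
    # Treat both None and float NaN as nulls.
    non_null = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if len(non_null) < 50:
        return "fewer than 50 non-null target samples"
    if task == "regression" and statistics.pvariance(non_null) == 0:
        return "numeric target has zero variance"
    distinct = set(non_null)
    if len(distinct) < 2:
        return "target has fewer than 2 distinct values"
    if task == "classification" and len(distinct) > 2:
        return "multi-class targets are not supported"
    return None
```

For example, preflight_target([0, 1, 2] * 20, "classification") reports the multi-class rejection, while a 60-row binary target passes and returns None.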

After training, selectedFeatures in the manifest is reconciled to the features the EBM actually kept.

`ebm_graphs`

The visualisable form of the EBM: per-feature score curves with confidence bands, plus pairwise interaction surfaces. This is the artifact behind every feature-effect chart in the Summand UI and the most common thing AI agents query.

Filename: graphs.json.gz · Content type: application/json · Encoding: gzip

Shape

{
  "model_id": "2026-01-15T10:30:45Z",
  "feature_names": ["age", "income", "credit_score", "age & income"],
  "feature_importances": [0.45, 0.38, 0.12, 0.05],
  "features": [
    {
      "name": "age",
      "type": "univariate",
      "importance": 0.45,
      "names": ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"],
      "scores": [-0.2, -0.1, 0.0, 0.15, 0.3, 0.25],
      "upper_bounds": [-0.15, -0.05, 0.05, 0.2, 0.35, 0.3],
      "lower_bounds": [-0.25, -0.15, -0.05, 0.1, 0.25, 0.2],
      "scores_range": [-0.2, 0.3],
      "density": {
        "names": ["18", "30", "45", "60", "75"],
        "scores": [5000, 45000, 120000, 80000, 20000]
      }
    }
  ],
  "metadata": {
    "total_features": 4,
    "main_features": 3,
    "interaction_features": 1
  }
}

Interaction terms appear with names like "age & income", type: "interaction", and include bin_edges for heatmap rendering.

Querying

| Param | Type | Effect |
| --- | --- | --- |
| `feature_name` | string | Return only the named feature or interaction |

get_semantic_data(component="ebm_graphs", params={"feature_name": "income"})

The full artifact can be tens of megabytes; always filter by feature_name when an agent only needs one curve.
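
For readers working with the raw file rather than the MCP tool, here is a minimal sketch of decompressing the artifact, picking out one feature, and ranking features by importance. The helper names are hypothetical, and the in-memory bytes stand in for the contents of graphs.json.gz.

```python
import gzip
import json


def load_graphs(raw_bytes):
    """Decompress and parse a gzip-encoded graphs artifact."""
    return json.loads(gzip.decompress(raw_bytes).decode("utf-8"))


def select_feature(graphs, feature_name):
    """Return the entry in `features` whose name matches, or None."""
    return next((f for f in graphs["features"] if f["name"] == feature_name), None)


def rank_by_importance(graphs):
    """Pair feature_names with feature_importances, most important first."""
    pairs = zip(graphs["feature_names"], graphs["feature_importances"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Stand-in for the bytes of graphs.json.gz, built from a trimmed copy of the example.
example = {
    "feature_names": ["age", "income", "credit_score", "age & income"],
    "feature_importances": [0.45, 0.38, 0.12, 0.05],
    "features": [{"name": "age", "type": "univariate", "importance": 0.45}],
}
raw = gzip.compress(json.dumps(example).encode("utf-8"))

graphs = load_graphs(raw)
age_curve = select_feature(graphs, "age")
ranking = rank_by_importance(graphs)
```

Because feature_names and feature_importances are parallel lists, the ranking needs only the summary fields and never touches the much larger features array.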

`umap_embedding`

A 2D UMAP projection of the dataset, suitable for plotting clusters and inspecting data structure visually.

Filename: umap_embedding.json · Content type: application/json

Shape

{
  "coordinates": [[1.2, 3.4], [-0.5, 2.1], [2.8, -1.2]],
  "colors": [0, 1, 0]
}

coordinates is a list of [x, y] pairs, one per embedded row. colors is a parallel list holding each row's target value, used to colour points in the viewer.

How it’s computed

Categorical columns are converted to numeric codes; missing values are imputed with the median.
Constant columns are dropped before fitting.
UMAP runs with n_components=2, n_neighbors=min(15, n_rows-1), min_dist=0.1.
Datasets larger than 50,000 rows are subsampled before fitting (UMAP is roughly O(n^1.14); a 650k-row dataset took over an hour without sampling).
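
The parameter rule and the subsampling threshold can be sketched without running UMAP itself. The code below is an illustration under assumptions: both helper names are hypothetical, the sample is assumed to be a uniform random draw of exactly 50,000 rows, and n_neighbors is assumed to be computed from the row count after sampling. The text does not specify the sample size or the strategy.

```python
import random

MAX_ROWS = 50_000  # subsampling threshold from the text


def umap_settings(n_rows):
    """Return the UMAP keyword arguments described above for n_rows input rows."""
    return {
        "n_components": 2,
        # Capped so tiny datasets never ask for more neighbours than they have.
        "n_neighbors": min(15, n_rows - 1),
        "min_dist": 0.1,
    }


def sample_row_indices(n_rows, seed=0):
    """Pick rows to embed: all of them, or a uniform random 50,000 (assumed strategy)."""
    if n_rows <= MAX_ROWS:
        return list(range(n_rows))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), MAX_ROWS))


indices = sample_row_indices(650_000)
settings = umap_settings(len(indices))
```

The resulting dictionary uses the keyword names of umap-learn's UMAP class, so it could be passed as umap.UMAP(**settings) where that package is installed.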

Querying

Not exposed to AI agents — the artifact is purely for visual rendering.

`semantic_metadata`

A small record of dataset-level facts about the run itself: when it was computed, against what target, with how many features.

Filename: semantic_metadata.json · Content type: application/json

Shape

{
  "version": "2026-01-15T10:30:45Z",
  "datasetId": "ds_abc123xyz",
  "createdAt": "2026-01-15T10:30:45Z",
  "datasetName": "Financial Risk 2024",
  "targetColumn": "loan_default",
  "totalRows": 500000,
  "totalFeatures": 42,
  "selectedFeaturesCount": 15
}

totalRows and totalFeatures reflect the source dataset, not the sampled working set used for model training.

Querying

No params — always returned in full.

The manifest

Every component appears in manifest.json with status, size, content type, agent-query metadata, and frontend display hints. A trimmed example:

{
  "manifestVersion": "1.0",
  "datasetId": "ds_abc123xyz",
  "version": "2026-01-15T10:30:45Z",
  "computeDurationSeconds": 150.5,
  "totalRows": 500000,
  "selectedFeatures": ["age", "income", "credit_score"],
  "components": {
    "ebm_graphs": {
      "status": "completed",
      "filename": "graphs.json.gz",
      "contentType": "application/json",
      "contentEncoding": "gzip",
      "sizeBytes": 5200000,
      "agentConfig": {
        "summary_fields": ["feature_names", "feature_importances", "metadata"],
        "params": [
          {
            "name": "feature_name",
            "type": "string",
            "collection_path": "features",
            "filter_key": "name"
          }
        ],
        "max_tokens_estimate": 50000
      }
    }
  }
}

The agentConfig block is what get_semantic_data consults to decide which fields to return when no params are supplied, and how to filter the artifact when params are.
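
One way to picture that lookup is the sketch below, which returns either the summary fields or a filtered slice of a collection. It is a guess at the behaviour described, not the tool's implementation: apply_agent_config is a hypothetical name, and collection_path is assumed to be a single top-level key.

```python
def apply_agent_config(artifact, agent_config, params=None):
    """Return a summary or a filtered slice of an artifact (hypothetical helper)."""
    params = params or {}
    for spec in agent_config.get("params", []):
        if spec["name"] in params:
            wanted = params[spec["name"]]
            # Assumption: collection_path names a top-level list in the artifact.
            collection = artifact.get(spec["collection_path"], [])
            return [item for item in collection if item.get(spec["filter_key"]) == wanted]
    # No recognised param supplied: return only the summary fields.
    return {k: artifact[k] for k in agent_config.get("summary_fields", []) if k in artifact}


AGENT_CONFIG = {
    "summary_fields": ["feature_names", "feature_importances", "metadata"],
    "params": [
        {
            "name": "feature_name",
            "type": "string",
            "collection_path": "features",
            "filter_key": "name",
        }
    ],
}
ARTIFACT = {
    "feature_names": ["age", "income"],
    "feature_importances": [0.45, 0.38],
    "features": [{"name": "age"}, {"name": "income"}],
    "metadata": {"total_features": 2},
}

summary = apply_agent_config(ARTIFACT, AGENT_CONFIG)
income_only = apply_agent_config(ARTIFACT, AGENT_CONFIG, {"feature_name": "income"})
```

Keeping the filter rules in the manifest rather than in code means a new component can become queryable by describing its collection and key, at least under the assumptions of this sketch.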

Use components from MCP

Reference for the get_semantic_data tool that wraps all of this.
