Every dataset’s semantic layer is built from the components below. Each one is independently versioned, stored in S3, and addressable by name through the MCP get_semantic_data tool.

At a glance

| Component | Purpose | Required | Depends on |
| --- | --- | --- | --- |
| column_stats | Per-column univariate statistics | No | |
| feature_metadata | LLM-generated dataset & feature descriptions | Yes | |
| ebm_model | Trained Explainable Boosting Machine | Yes | |
| ebm_graphs | Feature-effect graphs and importances | Yes | ebm_model |
| umap_embedding | 2D UMAP projection for cluster views | No | |
| semantic_metadata | Dataset-level overview and run metadata | Yes | |

column_stats

Per-column univariate statistics computed from the full source data, before feature selection. Useful for data profiling and quick checks (“how many nulls in email?”, “what’s the median revenue?”).

Filename: column_stats.json · Content type: application/json

Shape

{
  "totalRows": 500000,
  "totalColumns": 42,
  "columns": [
    {
      "name": "age",
      "dtype": "int64",
      "count": 498500,
      "nullCount": 1500,
      "nullPct": 0.3,
      "uniqueCount": 89,
      "uniquePct": 0.02,
      "min": 18, "max": 99,
      "mean": 45.3, "median": 44, "std": 18.2,
      "q1": 30, "q3": 61,
      "zeroCount": 0, "zeroPct": 0.0,
      "isTarget": false
    }
  ]
}
Categorical columns include topValues (the ten most frequent values with counts) instead of numeric quantiles.
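
The percentage fields in the example appear to be derived from the counts and totalRows, on a 0-100 scale rounded to two decimals (1,500 / 500,000 gives 0.3; 89 / 500,000 gives roughly 0.02). That relationship is inferred from the sample values and is not documented. The Python sketch below checks it against a trimmed copy of the example; `derived_percentages` and `check_percentages` are hypothetical helper names.

```python
import json

# Trimmed copy of the Shape example above.
COLUMN_STATS = json.loads("""
{
  "totalRows": 500000,
  "columns": [
    {"name": "age", "count": 498500, "nullCount": 1500, "nullPct": 0.3,
     "uniqueCount": 89, "uniquePct": 0.02, "zeroCount": 0, "zeroPct": 0.0}
  ]
}
""")


def derived_percentages(column, total_rows):
    """Recompute the *Pct fields from the matching *Count fields.

    Assumption: percentages are on a 0-100 scale relative to totalRows and
    rounded to two decimals. This is inferred from the sample, not documented.
    """
    return {
        key + "Pct": round(100.0 * column[key + "Count"] / total_rows, 2)
        for key in ("null", "unique", "zero")
    }


def check_percentages(stats):
    """Return the names of columns whose stored *Pct fields disagree."""
    mismatched = []
    for column in stats["columns"]:
        expected = derived_percentages(column, stats["totalRows"])
        if any(abs(column[k] - v) > 0.005 for k, v in expected.items()):
            mismatched.append(column["name"])
    return mismatched


print(check_percentages(COLUMN_STATS))  # [] when every column is consistent
```

A check like this is a cheap sanity test before an agent quotes a percentage back to a user.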

Querying

| Param | Type | Effect |
| --- | --- | --- |
| column_name | string | Returns stats for a single column instead of all columns |

get_semantic_data(component="column_stats", params={"column_name": "age"})

feature_metadata

Human-readable descriptions of the dataset and each selected feature, generated by Claude with structured outputs. User-supplied context is merged in and takes precedence over auto-generated descriptions.

Filename: feature_metadata.json · Content type: application/json

Shape

{
  "selectedFeatures": ["age", "income", "credit_score", "employment_status"],
  "context": {
    "dataset_context": "500,000 financial records from 2020-2024, target is loan_default (binary).",
    "feature_contexts": {
      "age": "Age of the applicant in years, ranging from 18 to 99.",
      "income": "Annual household income in USD."
    }
  },
  "categoricalMappings": {
    "employment_status": {
      "type": "categorical",
      "examples": ["employed", "self-employed", "unemployed", "retired"],
      "n_unique": 4
    }
  },
  "artifacts": {
    "income": {
      "sentinel_values": [],
      "extreme_outliers": [{"value": 9999999.0, "count": 5, "std_from_median": 45.3}],
      "artifact_summary": "1 extreme repeated outlier detected",
      "is_binary": false
    }
  }
}
artifacts flags data-quality oddities (sentinel values like 999999, repeated extreme outliers) that the pipeline detected during preparation.
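
The precedence rule (user-supplied context wins over auto-generated text) and the artifacts block can be illustrated with a short Python sketch. This is not the pipeline's code: the helper names are hypothetical, the user-supplied string is invented for illustration, and skipping empty user entries is an assumption.

```python
def merge_feature_contexts(auto_generated, user_supplied):
    """Merge per-feature descriptions; user-supplied text wins on conflict."""
    merged = dict(auto_generated)
    # Assumption: empty user entries are skipped so they cannot erase a description.
    merged.update({name: text for name, text in user_supplied.items() if text})
    return merged


def features_with_artifacts(feature_metadata):
    """List features whose artifacts entry reports sentinels or extreme outliers."""
    flagged = []
    for name, info in feature_metadata.get("artifacts", {}).items():
        if info.get("sentinel_values") or info.get("extreme_outliers"):
            flagged.append(name)
    return flagged


auto = {
    "age": "Age of the applicant in years, ranging from 18 to 99.",
    "income": "Annual household income in USD.",
}
user = {"income": "Gross annual income, self-reported."}  # illustrative only
merged = merge_feature_contexts(auto, user)
print(merged["income"])  # Gross annual income, self-reported.

metadata = {
    "artifacts": {
        "income": {
            "sentinel_values": [],
            "extreme_outliers": [{"value": 9999999.0, "count": 5, "std_from_median": 45.3}],
        }
    }
}
print(features_with_artifacts(metadata))  # ['income']
```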

Querying

No params — the artifact is small and always returned in full.

ebm_model

The trained Explainable Boosting Machine for the dataset. This is the model behind every feature-effect graph and SHAP-like explanation in Summand.

Filename: model.pkl · Content type: application/octet-stream

What it contains

A pickled InterpretML EBM, trained on up to 50,000 sampled rows of the selected features against the target column. Configuration:
  • max_bins=256, max_interaction_bins=32, interactions=5
  • outer_bags=8, learning_rate=0.04, smoothing_rounds=500
  • early_stopping_rounds=100, min_samples_leaf=4

Constraints

The model is not exposed to AI agents: it’s a binary blob with no useful summary. Training is also gated by several preflight checks, which reject unsuitable targets and drop unusable columns:
  • Multi-class targets (>2 classes) are rejected; binary classification or regression only.
  • Targets need at least 50 non-null samples and ≥2 distinct values.
  • Numeric targets need non-zero variance.
  • Datetime columns, zero-variance columns, and high-cardinality nominals (>10,000 unique values) are dropped before training.
After training, selectedFeatures in the manifest is reconciled to the features the EBM actually kept.
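
The preflight rules above can be sketched in plain Python. This is an illustration under stated assumptions, not Summand's implementation: `PreflightError`, `check_target`, and `drop_reason` are hypothetical names, a target is treated as numeric only when every non-null value is an int or float, and "nominal" is taken to mean string-valued.

```python
import statistics


class PreflightError(ValueError):
    """Raised when a target column cannot be modelled (hypothetical name)."""


def check_target(values):
    """Apply the documented target checks; return 'binary' or 'regression'."""
    non_null = [v for v in values if v is not None]
    if len(non_null) < 50:
        raise PreflightError("target needs at least 50 non-null samples")
    distinct = set(non_null)
    if len(distinct) < 2:
        raise PreflightError("target needs at least 2 distinct values")
    # Assumption: bools count as class labels, not numbers.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric:
        # Redundant after the distinct-values check, kept to mirror the list above.
        if statistics.pvariance(non_null) == 0:
            raise PreflightError("numeric target has zero variance")
        return "binary" if len(distinct) == 2 else "regression"
    if len(distinct) > 2:
        raise PreflightError("multi-class targets are not supported")
    return "binary"


def drop_reason(values, is_datetime=False, max_nominal_cardinality=10_000):
    """Return why a feature column would be dropped, or None to keep it."""
    if is_datetime:
        return "datetime"
    distinct = {v for v in values if v is not None}
    if len(distinct) <= 1:
        return "zero variance"
    if all(isinstance(v, str) for v in distinct) and len(distinct) > max_nominal_cardinality:
        return "high cardinality"
    return None


print(check_target([0, 1] * 30))  # binary
print(drop_reason(["a"] * 100))   # zero variance
```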

ebm_graphs

The visualisable form of the EBM: per-feature score curves with confidence bands, plus pairwise interaction surfaces. This is the artifact behind every feature-effect chart in the Summand UI and the most common thing AI agents query.

Filename: graphs.json.gz · Content type: application/json · Encoding: gzip

Shape

{
  "model_id": "2026-01-15T10:30:45Z",
  "feature_names": ["age", "income", "credit_score", "age & income"],
  "feature_importances": [0.45, 0.38, 0.12, 0.05],
  "features": [
    {
      "name": "age",
      "type": "univariate",
      "importance": 0.45,
      "names": ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"],
      "scores": [-0.2, -0.1, 0.0, 0.15, 0.3, 0.25],
      "upper_bounds": [-0.15, -0.05, 0.05, 0.2, 0.35, 0.3],
      "lower_bounds": [-0.25, -0.15, -0.05, 0.1, 0.25, 0.2],
      "scores_range": [-0.2, 0.3],
      "density": {
        "names": ["18", "30", "45", "60", "75"],
        "scores": [5000, 45000, 120000, 80000, 20000]
      }
    }
  ],
  "metadata": {
    "total_features": 4,
    "main_features": 3,
    "interaction_features": 1
  }
}
Interaction terms appear with names like "age & income", type: "interaction", and include bin_edges for heatmap rendering.
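
As a rough illustration of working with this shape, the Python sketch below decompresses a gzipped payload, ranks features by importance, and finds the bin with the highest score. It assumes that names and scores are parallel lists for univariate features, as they are in the example; the helper names are hypothetical, and the sample bytes stand in for a downloaded graphs.json.gz.

```python
import gzip
import json


def load_graphs(raw_bytes):
    """Decompress and parse the bytes of a gzipped JSON artifact."""
    return json.loads(gzip.decompress(raw_bytes).decode("utf-8"))


def rank_features(graphs):
    """Pair feature_names with feature_importances, most important first."""
    pairs = zip(graphs["feature_names"], graphs["feature_importances"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def peak_bin(feature):
    """Return (bin label, score) for the bin with the highest score."""
    scores = feature["scores"]
    index = max(range(len(scores)), key=scores.__getitem__)
    return feature["names"][index], scores[index]


# Stand-in for the downloaded artifact: the Shape example, trimmed and gzipped.
sample = {
    "feature_names": ["age", "income", "credit_score", "age & income"],
    "feature_importances": [0.45, 0.38, 0.12, 0.05],
    "features": [
        {
            "name": "age",
            "type": "univariate",
            "names": ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"],
            "scores": [-0.2, -0.1, 0.0, 0.15, 0.3, 0.25],
        }
    ],
}
raw = gzip.compress(json.dumps(sample).encode("utf-8"))

graphs = load_graphs(raw)
print(rank_features(graphs)[0])         # ('age', 0.45)
print(peak_bin(graphs["features"][0]))  # ('56-65', 0.3)
```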

Querying

| Param | Type | Effect |
| --- | --- | --- |
| feature_name | string | Return only the named feature or interaction |

get_semantic_data(component="ebm_graphs", params={"feature_name": "income"})
The full artifact can be tens of megabytes; always filter by feature_name when an agent only needs one curve.

umap_embedding

A 2D UMAP projection of the dataset, suitable for plotting clusters and inspecting data structure visually.

Filename: umap_embedding.json · Content type: application/json

Shape

{
  "coordinates": [[1.2, 3.4], [-0.5, 2.1], [2.8, -1.2]],
  "colors": [0, 1, 0]
}
coordinates is a list of [x, y] pairs, one per row. colors is a parallel list of the same length holding each row’s target value, used to colour points in the viewer.

How it’s computed

  • Categorical columns are converted to numeric codes; missing values are imputed with the median.
  • Constant columns are dropped before fitting.
  • UMAP runs with n_components=2, n_neighbors=min(15, n_rows-1), min_dist=0.1.
  • Datasets larger than 50,000 rows are subsampled before fitting (UMAP is roughly O(n^1.14); a 650k-row dataset took over an hour without sampling).
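
The preparation steps above can be sketched in plain Python, stopping short of the UMAP fit itself. Several details are assumptions not stated in the list: sampling is uniform random with a fixed seed, category codes follow sorted label order, and an all-null column is treated as constant. All helper names are hypothetical.

```python
import random
import statistics

MAX_ROWS = 50_000  # subsampling threshold from the list above


def umap_params(n_rows):
    """Keyword arguments matching the documented UMAP settings."""
    return {"n_components": 2, "n_neighbors": min(15, n_rows - 1), "min_dist": 0.1}


def subsample(rows, max_rows=MAX_ROWS, seed=0):
    """Randomly keep at most max_rows rows; smaller datasets pass through."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


def encode_column(values):
    """Turn one column into floats: category codes, then median imputation."""
    present = [v for v in values if v is not None]
    if not present:
        return [0.0] * len(values)  # all-null column: constant, dropped later
    if any(isinstance(v, str) for v in present):
        codes = {label: i for i, label in enumerate(sorted(set(present)))}
        values = [None if v is None else codes[v] for v in values]
        present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [float(median if v is None else v) for v in values]


def prepare(columns):
    """Encode every column and drop the constant ones."""
    encoded = {name: encode_column(vals) for name, vals in columns.items()}
    return {name: vals for name, vals in encoded.items() if len(set(vals)) > 1}


columns = {
    "age": [25, None, 40],
    "status": ["employed", "retired", None],
    "const": [1, 1, 1],
}
print(prepare(columns))       # "const" is dropped; gaps are filled with medians
print(umap_params(n_rows=3))  # n_neighbors is capped at n_rows - 1
```

The resulting dictionary could be passed to a UMAP implementation as keyword arguments, with the encoded columns transposed into rows.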

Querying

Not exposed to AI agents — the artifact is purely for visual rendering.

semantic_metadata

A small record of dataset-level facts about the run itself: when it was computed, against what target, with how many features.

Filename: semantic_metadata.json · Content type: application/json

Shape

{
  "version": "2026-01-15T10:30:45Z",
  "datasetId": "ds_abc123xyz",
  "createdAt": "2026-01-15T10:30:45Z",
  "datasetName": "Financial Risk 2024",
  "targetColumn": "loan_default",
  "totalRows": 500000,
  "totalFeatures": 42,
  "selectedFeaturesCount": 15
}
totalRows and totalFeatures reflect the source dataset, not the sampled working set used for model training.

Querying

No params — always returned in full.

The manifest

Every component appears in manifest.json with status, size, content type, agent-query metadata, and frontend display hints. A trimmed example:
{
  "manifestVersion": "1.0",
  "datasetId": "ds_abc123xyz",
  "version": "2026-01-15T10:30:45Z",
  "computeDurationSeconds": 150.5,
  "totalRows": 500000,
  "selectedFeatures": ["age", "income", "credit_score"],
  "components": {
    "ebm_graphs": {
      "status": "completed",
      "filename": "graphs.json.gz",
      "contentType": "application/json",
      "contentEncoding": "gzip",
      "sizeBytes": 5200000,
      "agentConfig": {
        "summary_fields": ["feature_names", "feature_importances", "metadata"],
        "params": [
          {
            "name": "feature_name",
            "type": "string",
            "collection_path": "features",
            "filter_key": "name"
          }
        ],
        "max_tokens_estimate": 50000
      }
    }
  }
}
The agentConfig block is what get_semantic_data consults to decide which fields to return when no params are supplied, and how to filter the artifact when they are.
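
One way to picture that behaviour is the Python sketch below. It is a guess at the selection logic based only on the fields shown in the manifest, not the actual get_semantic_data implementation: `apply_agent_config` is a hypothetical name, collection_path is assumed to be a single top-level key, and params not declared in agentConfig are ignored.

```python
def apply_agent_config(artifact, agent_config, params=None):
    """Select the part of an artifact to return, guided by agentConfig.

    With no params, return only the summary_fields. With params, filter the
    list at collection_path to items whose filter_key equals the given value.
    """
    params = params or {}
    if not params:
        fields = agent_config.get("summary_fields", [])
        return {name: artifact[name] for name in fields if name in artifact}

    result = {}
    for spec in agent_config.get("params", []):
        if spec["name"] not in params:
            continue
        wanted = params[spec["name"]]
        collection = artifact.get(spec["collection_path"], [])
        result[spec["collection_path"]] = [
            item for item in collection if item.get(spec["filter_key"]) == wanted
        ]
    return result


AGENT_CONFIG = {
    "summary_fields": ["feature_names", "feature_importances", "metadata"],
    "params": [
        {
            "name": "feature_name",
            "type": "string",
            "collection_path": "features",
            "filter_key": "name",
        }
    ],
}

# Trimmed stand-in for the ebm_graphs artifact.
ARTIFACT = {
    "feature_names": ["age", "income"],
    "feature_importances": [0.45, 0.38],
    "features": [
        {"name": "age", "importance": 0.45},
        {"name": "income", "importance": 0.38},
    ],
    "metadata": {"total_features": 2},
}

print(sorted(apply_agent_config(ARTIFACT, AGENT_CONFIG)))
print(apply_agent_config(ARTIFACT, AGENT_CONFIG, {"feature_name": "income"}))
```

The first call returns only the summary fields, which keeps the response small; the second returns just the matching entry from features.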

Use components from MCP

Reference for the get_semantic_data tool that wraps all of this.