Every dataset’s semantic layer is built from the components below. Each one is independently versioned, stored in S3, and addressable by name through the MCP get_semantic_data tool.
At a glance
| Component | Purpose | Required | Depends on |
|---|---|---|---|
| column_stats | Per-column univariate statistics | No | None |
| feature_metadata | LLM-generated dataset & feature descriptions | Yes | None |
| ebm_model | Trained Explainable Boosting Machine | Yes | None |
| ebm_graphs | Feature-effect graphs and importances | Yes | ebm_model |
| umap_embedding | 2D UMAP projection for cluster views | No | None |
| semantic_metadata | Dataset-level overview and run metadata | Yes | None |
column_stats
Per-column univariate statistics computed from the full source data, before feature selection. Useful for data profiling and quick checks (“how many nulls in email?”, “what’s the median revenue?”).
Filename: column_stats.json · Content type: application/json
Shape
Non-numeric columns report topValues (the ten most frequent values with counts) instead of numeric quantiles.
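As a rough illustration of the numeric versus non-numeric split, the sketch below computes a per-column summary using only the standard library. Only topValues comes from the text above; the other keys (count, nulls, median) and the entry layout inside topValues are assumptions made for this example, not the actual schema of column_stats.json.

```python
from collections import Counter
import statistics


def column_summary(values):
    """Summarise one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "nulls": len(values) - len(present)}

    # Treat a column as numeric only if every present value is an int or float
    # (bool is excluded because it is a subclass of int).
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        summary["median"] = statistics.median(present)
    else:
        # Ten most frequent values with their counts.
        summary["topValues"] = [
            {"value": value, "count": count}
            for value, count in Counter(present).most_common(10)
        ]
    return summary
```

This covers the two profiling questions mentioned earlier: the null count for a text column such as email, and the median for a numeric column such as revenue.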
Querying
| Param | Type | Effect |
|---|---|---|
| column_name | string | Returns stats for a single column instead of all columns |
feature_metadata
Human-readable descriptions of the dataset and each selected feature, generated by Claude with structured outputs. User-supplied context is merged in and takes precedence over auto-generated descriptions.
Filename: feature_metadata.json · Content type: application/json
Shape
The artifacts field flags data-quality oddities (sentinel values like 999999, repeated extreme outliers) that the pipeline detected during preparation.
Querying
No params: the artifact is small and always returned in full.
ebm_model
The trained Explainable Boosting Machine for the dataset. This is the model behind every feature-effect graph and SHAP-like explanation in Summand.
Filename: model.pkl · Content type: application/octet-stream
What it contains
A pickled interpret-ml EBM, trained on up to 50,000 sampled rows of the selected features against the target column. Configuration:
- max_bins=256, max_interaction_bins=32, interactions=5
- outer_bags=8, learning_rate=0.04, smoothing_rounds=500
- early_stopping_rounds=100, min_samples_leaf=4
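For orientation, the configuration above might map onto the interpret package roughly as follows. This is a sketch, not Summand's training code: it assumes a recent interpret release whose ExplainableBoostingClassifier and ExplainableBoostingRegressor accept all of these keyword arguments, and the build_ebm helper and its task argument are hypothetical names introduced here.

```python
# Hyperparameters as listed above; everything else in this sketch is an assumption.
EBM_PARAMS = {
    "max_bins": 256,
    "max_interaction_bins": 32,
    "interactions": 5,
    "outer_bags": 8,
    "learning_rate": 0.04,
    "smoothing_rounds": 500,
    "early_stopping_rounds": 100,
    "min_samples_leaf": 4,
}

# Row cap on the sampled training set, as stated above.
MAX_TRAINING_ROWS = 50_000


def build_ebm(task):
    """Return an unfitted EBM for 'classification' (binary) or 'regression'."""
    if task not in ("classification", "regression"):
        raise ValueError(f"unsupported task: {task!r}")

    # Imported lazily so the constants can be inspected without interpret installed.
    from interpret.glassbox import (
        ExplainableBoostingClassifier,
        ExplainableBoostingRegressor,
    )

    if task == "classification":
        return ExplainableBoostingClassifier(**EBM_PARAMS)
    return ExplainableBoostingRegressor(**EBM_PARAMS)
```

Calling build_ebm("regression").fit(X, y) on at most MAX_TRAINING_ROWS sampled rows would then approximate the training step described above.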
Constraints
The model is not exposed to AI agents: it’s a binary blob with no useful summary. It’s also subject to several preflight rejections:
- Multi-class targets (>2 classes) are rejected; binary classification or regression only.
- Targets need at least 50 non-null samples and ≥2 distinct values.
- Numeric targets need non-zero variance.
- Datetime columns, zero-variance columns, and high-cardinality nominals (>10,000 unique values) are dropped before training.
selectedFeatures in the manifest is reconciled to the features the EBM actually kept.
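The preflight rules above can be sketched as two small checks, one for the target and one for candidate features. This is an illustrative reading of the rules, not the pipeline's implementation: the function names, the explicit task argument, the reason strings, and the use of "contains a string" as the test for a nominal column are all assumptions.

```python
from datetime import date, datetime
import statistics

MIN_TARGET_SAMPLES = 50           # at least 50 non-null target samples
MAX_NOMINAL_CARDINALITY = 10_000  # nominals above this are dropped


def preflight_target(values, task):
    """Return a list of rejection reasons; an empty list means the target passes."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    reasons = []
    if len(present) < MIN_TARGET_SAMPLES:
        reasons.append("fewer than 50 non-null samples")
    if len(distinct) < 2:
        reasons.append("fewer than 2 distinct values")
    if task == "classification" and len(distinct) > 2:
        reasons.append("multi-class target (>2 classes)")
    # Regression targets are assumed to be numeric in this sketch.
    if task == "regression" and len(present) >= 2 and statistics.pvariance(present) == 0:
        reasons.append("zero variance")
    return reasons


def should_drop_feature(values):
    """True if a feature column would be dropped before training."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    if any(isinstance(v, (date, datetime)) for v in present):
        return True  # datetime column
    if len(distinct) <= 1:
        return True  # zero-variance column
    is_nominal = any(isinstance(v, str) for v in present)
    return is_nominal and len(distinct) > MAX_NOMINAL_CARDINALITY
```

Returning every reason instead of stopping at the first one makes it easier to report all problems with a target in a single pass.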
ebm_graphs
The visualisable form of the EBM: per-feature score curves with confidence bands, plus pairwise interaction surfaces. This is the artifact behind every feature-effect chart in the Summand UI and the most common thing AI agents query.
Filename: graphs.json.gz · Content type: application/json · Encoding: gzip
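Because the artifact is gzip-encoded JSON, a consumer reading the raw file has to decompress it before parsing. A minimal standard-library sketch follows; the load_gzipped_json helper and the local path are assumptions, and nothing about the decoded payload's structure is implied.

```python
import gzip
import json


def load_gzipped_json(path):
    """Decompress a gzip file and parse its contents as UTF-8 JSON."""
    # "rt" opens the gzip stream in text mode so json.load can read it directly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, load_gzipped_json("graphs.json.gz") would return the decoded document once a copy has been downloaded locally.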
Shape
Pairwise interaction entries are named like "age & income", carry type: "interaction", and include bin_edges for heatmap rendering.
Querying
| Param | Type | Effect |
|---|---|---|
| feature_name | string | Return only the named feature or interaction |
Pass feature_name when an agent only needs one curve.
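To make the query parameter concrete, here is a sketch of the JSON-RPC request an MCP client might send for a single feature. The tools/call envelope follows the Model Context Protocol; the component argument key is a hypothetical name chosen for this example, and the real tool may require further arguments (for instance a dataset identifier) that are not shown here.

```python
def build_tool_call(component, request_id=1, **params):
    """Build an MCP tools/call request for get_semantic_data."""
    # "component" is a hypothetical argument name; check the tool reference.
    arguments = {"component": component}
    # Optional query params such as feature_name or column_name; None is skipped.
    arguments.update({key: value for key, value in params.items() if value is not None})
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "get_semantic_data", "arguments": arguments},
    }


request = build_tool_call("ebm_graphs", feature_name="age & income")
```

The same helper would cover column_stats by passing column_name instead of feature_name.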
umap_embedding
A 2D UMAP projection of the dataset, suitable for plotting clusters and inspecting data structure visually.
Filename: umap_embedding.json · Content type: application/json
Shape
coordinates is a list of [x, y] pairs, one per row. colors is the parallel list of target values, used to colour points in the viewer.
How it’s computed
- Categorical columns are converted to numeric codes; missing values are imputed with the median.
- Constant columns are dropped before fitting.
- UMAP runs with n_components=2, n_neighbors=min(15, n_rows-1), min_dist=0.1.
- Datasets larger than 50,000 rows are subsampled before fitting (UMAP is roughly O(n^1.14); a 650k-row dataset took over an hour without sampling).
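The steps above can be sketched in plain Python, stopping just short of the UMAP fit itself. The helper names, the fixed random seed, the alphabetical assignment of category codes, and the order of the checks are assumptions made for this example; the resulting parameters would be passed to umap.UMAP(**umap_params(n_rows)) from the umap-learn package.

```python
import random
import statistics

MAX_ROWS = 50_000  # larger datasets are subsampled before fitting


def umap_params(n_rows):
    """UMAP settings as listed above; n_neighbors shrinks for tiny datasets."""
    return {"n_components": 2, "n_neighbors": min(15, n_rows - 1), "min_dist": 0.1}


def subsample_indices(n_rows, max_rows=MAX_ROWS, seed=0):
    """Row indices to keep: all rows, or a random sample of max_rows."""
    if n_rows <= max_rows:
        return list(range(n_rows))
    return sorted(random.Random(seed).sample(range(n_rows), max_rows))


def prepare_column(values):
    """Return a fully numeric column, or None if the column is constant."""
    present = [v for v in values if v is not None]
    if present and not all(isinstance(v, (int, float)) for v in present):
        # Categorical column: map each distinct label to an integer code.
        labels = sorted({str(v) for v in present})
        codes = {label: index for index, label in enumerate(labels)}
        values = [None if v is None else codes[str(v)] for v in values]
        present = [v for v in values if v is not None]
    if len(set(present)) <= 1:
        return None  # constant columns are dropped
    median = statistics.median(present)
    # Missing values are imputed with the column median.
    return [median if v is None else float(v) for v in values]
```

The min(15, n_rows-1) guard matters because UMAP cannot use more neighbours than there are other rows in the dataset.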
Querying
Not exposed to AI agents; the artifact is purely for visual rendering.
semantic_metadata
A small record of dataset-level facts about the run itself: when it was computed, against what target, with how many features.
Filename: semantic_metadata.json · Content type: application/json
Shape
totalRows and totalFeatures reflect the source dataset, not the sampled working set used for model training.
Querying
No params; always returned in full.
The manifest
Every component appears in manifest.json with status, size, content type, agent-query metadata, and frontend display hints. A trimmed example:
The agentConfig block is what get_semantic_data consults to decide which fields to return when no params are supplied, and how to filter the artifact when params are.
Use components from MCP
Reference for the get_semantic_data tool that wraps all of this.
