Semantic layer overview

The semantic layer is the structured set of artifacts Summand computes for every dataset: column statistics, feature descriptions, a trained interpretable model, feature-effect graphs, a 2D embedding, and dataset-level metadata. Together these power dashboards, AI insights, and the MCP server. Each artifact is produced by a semantic component: a self-contained unit that takes the curated dataset and emits a JSON (or binary) artifact, which is versioned and stored in S3.

Why components

Splitting the semantic layer into discrete components has a few practical consequences:

Computed once, served many times. Models are trained on a schedule; downstream calls just read artifacts.
Independent failure. Optional components can fail without breaking the rest of the layer.
Versioned. Every refresh produces a new immutable version with its own manifest.
AI-queryable. Each component declares how an AI agent can summarize and filter its data, so MCP tools can return the right slice without dumping the whole artifact into context.

What gets produced

Every successful run yields a manifest plus one artifact per component, written to:

s3://summand-artifacts/semantic-layers/{datasetId}/versions/{version}/
├── manifest.json
├── column_stats.json
├── feature_metadata.json
├── model.pkl
├── graphs.json.gz
├── umap_embedding.json
└── semantic_metadata.json

The manifest is the index. It lists every component, its status, size, content type, and the metadata an AI agent needs to query it. A pointer at semantic-layers/{datasetId}/current/version.json always resolves to the latest completed version.
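
As a rough illustration of the layout above, the following Python sketch builds the object keys a client would read. The bucket name and key patterns come from the layout shown earlier; the helper names and the assumption that version.json carries a "version" field are hypothetical, since the pointer file's schema is not specified here.

```python
# Bucket name as shown in the storage layout above.
BUCKET = "summand-artifacts"


def version_prefix(dataset_id: str, version: str) -> str:
    """Key prefix under which one immutable version's artifacts live."""
    return f"semantic-layers/{dataset_id}/versions/{version}/"


def current_pointer_key(dataset_id: str) -> str:
    """Key of the pointer that resolves to the latest completed version."""
    return f"semantic-layers/{dataset_id}/current/version.json"


def manifest_key(dataset_id: str, pointer: dict) -> str:
    """Key of the manifest for the version named by the pointer.

    ASSUMPTION: the parsed pointer file exposes the version under a
    "version" field; the real schema may differ.
    """
    return version_prefix(dataset_id, pointer["version"]) + "manifest.json"
```

For example, manifest_key("ds_123", {"version": "v7"}) yields semantic-layers/ds_123/versions/v7/manifest.json, where "ds_123" and "v7" are made-up identifiers.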

Component reference

Full breakdown of every component, its fields, and how to query it.

Pipeline shape

A run is orchestrated as a DAG. Components declare their dependencies and the orchestrator topologically sorts them:

column_stats      ─┐
feature_metadata  ─┤
umap_embedding    ─┼─→ (parallelizable, all independent)
semantic_metadata ─┘
ebm_model ─────────→ ebm_graphs

Required components (feature_metadata, ebm_model, ebm_graphs, semantic_metadata) fail the entire run if they fail. Optional components (column_stats, umap_embedding) are best-effort: if one fails, the run continues without it.
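
The ordering and failure rules can be sketched in a few lines of Python using the standard library's graphlib. This is a minimal illustration, not Summand's orchestrator: the dependency map is one reading of the pipeline shape described above (only ebm_graphs depends on another component), and run_layer, the runners mapping, and the status strings are hypothetical names. It runs components sequentially for simplicity, although the independent ones could run in parallel.

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components it depends on.
# ASSUMPTION: ebm_graphs -> ebm_model is the only dependency edge.
DEPENDENCIES = {
    "column_stats": set(),
    "feature_metadata": set(),
    "umap_embedding": set(),
    "semantic_metadata": set(),
    "ebm_model": set(),
    "ebm_graphs": {"ebm_model"},
}

REQUIRED = {"feature_metadata", "ebm_model", "ebm_graphs", "semantic_metadata"}


def run_layer(runners):
    """Run every component in dependency order.

    `runners` maps a component name to a zero-argument callable
    (hypothetical interface). Returns a dict of name -> status.
    """
    statuses = {}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        try:
            runners[name]()
            statuses[name] = "completed"
        except Exception as exc:
            if name in REQUIRED:
                # A required component failing aborts the whole run.
                raise RuntimeError(f"required component {name} failed") from exc
            # Optional components are best-effort: record and move on.
            statuses[name] = "failed"
    return statuses
```

With this shape, a failing umap_embedding runner is recorded as failed while the rest of the run completes, whereas a failing ebm_model runner raises before ebm_graphs is attempted.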

How clients consume it

1. Fetch the dataset record and read its semanticLayer.manifest.
2. Pick a component by name.
3. Request a presigned URL for that component’s artifact.
4. Decompress (gzip if the manifest says so) and apply any filtering described in agentConfig.params.

For AI clients, this is wrapped behind the MCP get_semantic_data tool, which handles fetching, decompression, and parameter filtering automatically.
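
For clients that do this by hand, the decompress-and-filter step might look like the Python sketch below. It starts from raw bytes already downloaded through the presigned URL. The "contentEncoding" key on the manifest entry and the select_fields helper are assumptions made for illustration; the actual manifest schema and the shape of agentConfig.params are defined in the component reference, not here.

```python
import gzip
import json


def decode_artifact(raw: bytes, entry: dict):
    """Turn downloaded artifact bytes into parsed JSON.

    ASSUMPTION: the manifest entry marks compression with
    "contentEncoding": "gzip"; the real field name may differ.
    """
    if entry.get("contentEncoding") == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw)


def select_fields(data: dict, fields=None) -> dict:
    """Keep only the requested top-level keys.

    Hypothetical stand-in for whatever filtering agentConfig.params
    describes for a given component.
    """
    if fields is None:
        return data
    return {key: value for key, value in data.items() if key in fields}
```

This only applies to JSON artifacts; a binary artifact such as model.pkl would need its own loader.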

Refresh cadence

Connectors can be configured to recompute the semantic layer on a schedule (hourly, daily, weekly). Scheduled refresh reuses already-curated data, so it skips the ingestion cost and only re-runs the components themselves — typically 2–5 minutes per dataset.
