UMAP embedding

umap_embedding takes the curated Parquet for a dataset, projects every row into two dimensions using UMAP, and emits the coordinates. The result drives the cluster scatter plot on the dataset page and grounds Summand’s “show me similar rows” answers. UMAP is unsupervised — it doesn’t need a target column. It looks at the geometry of the rows in feature space and finds a 2-D layout that preserves local neighborhoods.

What it’s good for

Cluster discovery — see at a glance whether your data has natural groupings, and how many.
Outlier visualization — outliers tend to land far from the dense regions of the projection.
Sanity-checking labels — if you have a categorical column you suspect is meaningful, color the scatter plot by it and see whether the colors form coherent regions.

UMAP is not a tool for measuring distances quantitatively in 2-D. The projection compresses high-dimensional structure; absolute distance numbers don’t carry the meaning they have in feature space. Use it for shape, not for numerics.

Inputs

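None. The component takes no parameters and runs on the curated dataset as a whole, subsampling only when the dataset exceeds roughly 50,000 rows (see Output shape).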

Output shape

{
  "coordinates": [
    [0.42, -1.18],
    [0.51, -1.06],
    [-2.31,  0.74],
    ...
  ]
}

One [x, y] pair per row, in the same order as the curated Parquet. The artifact is umap_embedding.json. For datasets with more than ~50,000 rows, the component subsamples to keep the projection tractable, so the array holds one pair per sampled row rather than one per source row. The sampled-row indices are recorded in the manifest so downstream consumers can align points back to their source rows.
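A minimal sketch of that alignment step, assuming the manifest exposes the sampled-row indices as a plain list. The parameter name sampled_indices and the helper align_points are hypothetical; only the "coordinates" key comes from the artifact shown above.

```python
import json

def align_points(artifact, sampled_indices=None):
    """Map each embedded point back to its row index in the curated Parquet.

    artifact: parsed umap_embedding.json, i.e. {"coordinates": [[x, y], ...]}
    sampled_indices: row indices kept by subsampling (hypothetical manifest
        field), or None when the dataset was embedded in full.
    """
    coords = artifact["coordinates"]
    if sampled_indices is None:
        # No subsampling: point i corresponds to row i.
        sampled_indices = range(len(coords))
    if len(sampled_indices) != len(coords):
        raise ValueError("expected one sampled index per coordinate pair")
    return {row: (x, y) for row, (x, y) in zip(sampled_indices, coords)}

# Illustrative data only: three points, as in the output shape above.
artifact = json.loads(
    '{"coordinates": [[0.42, -1.18], [0.51, -1.06], [-2.31, 0.74]]}'
)
full = align_points(artifact)                                  # rows 0, 1, 2
subsampled = align_points(artifact, sampled_indices=[4, 17, 250])
```

Keying points by source row index means a consumer can join them to the Parquet rows the same way whether or not the dataset was subsampled.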

Display

The UMAP component ships with a bespoke React viewer (UMAPVisualization):

Scatter plot with zoom and pan.
Color-by-column — pick any column from the schema to color points; categorical columns get a discrete legend, numeric columns a gradient.
Hover for row preview — a tooltip shows the nearest point’s actual row values.
Lasso select — drag-select a region; the panel below shows the selected rows in a table.
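The lasso selection above amounts to a point-in-polygon test over the coordinates. The sketch below illustrates the idea with the standard ray-casting algorithm. It is not the UMAPVisualization implementation, and the function names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside the closed polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def lasso_select(coordinates, polygon):
    """Return the indices of points that fall inside the lasso polygon."""
    return [i for i, p in enumerate(coordinates) if point_in_polygon(p, polygon)]

# Illustrative data: a square lasso around the first two points.
coordinates = [[0.42, -1.18], [0.51, -1.06], [-2.31, 0.74]]
lasso = [(0.0, -2.0), (1.0, -2.0), (1.0, 0.0), (0.0, 0.0)]
selected = lasso_select(coordinates, lasso)
```

The returned indices are positions in the coordinates array, so for a subsampled dataset they still need to be mapped through the sampled-row indices before looking up source rows.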

Filtering from chat

UMAP isn’t directly queryable from chat, because raw coordinates aren’t useful in a text response. Summand can refer to the artifact (“there are three visible clusters in this dataset’s UMAP”), but the visualization itself lives on the dataset page. For this reason the component’s agent_config is set to enabled: false, and analyze({ component: "umap_embedding", ... }) is intentionally a no-op.

Compute profile

Profile	CPU / Memory	Notes
Fargate	4 CPU, 16 GB	Scales with row count; subsampling at 50k rows keeps cost bounded

After EBM model fitting, UMAP is the most compute-intensive component in the catalog. For datasets under 10k rows it finishes in a minute or two; beyond that, run time grows with row count until the 50k subsample cap bounds it.

Common gotchas

The plot looks like a single dense blob

Most often this means your dataset has one dominant cluster — common for filtered or single-segment datasets. Color by a categorical column (region, plan tier, product category) to see whether structure exists within the blob; if not, try running UMAP on a view that includes more diverse rows.

Categorical colors look noisy

UMAP groups by all features at once, not by your color column. If the color column isn’t strongly correlated with the rest of the feature space, the colors won’t form coherent regions. That’s a finding, not a bug — it means the color column doesn’t drive the broader structure.
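One rough way to put a number on "coherent regions" is to check how often a point's nearest neighbors in the projection share its label. The sketch below is an illustrative heuristic, not something the component computes. The function name and the choice of k are assumptions. It uses only which points are nearest, not the distance values, in keeping with the caution about 2-D distances above.

```python
import math

def neighbor_label_agreement(coordinates, labels, k=5):
    """Mean fraction of each point's k nearest 2-D neighbors sharing its label."""
    n = len(coordinates)
    if n != len(labels):
        raise ValueError("need exactly one label per point")
    k = min(k, n - 1)
    if k < 1:
        raise ValueError("need at least two points")
    total = 0.0
    for i, p in enumerate(coordinates):
        # Brute-force neighbor search: fine for a sample, slow for 50k points.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, coordinates[j]),
        )
        same = sum(labels[j] == labels[i] for j in others[:k])
        total += same / k
    return total / n

# Illustrative data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
coherent = neighbor_label_agreement(points, ["a", "a", "a", "b", "b", "b"], k=2)
noisy = neighbor_label_agreement(points, ["a", "b", "a", "b", "a", "b"], k=2)
```

A score near 1 suggests the color column follows the layout. A low score suggests the column does not drive the structure UMAP found; how low depends on how many categories there are and how balanced they are.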

Two runs look completely different

UMAP is stochastic. Re-running on the same data produces a layout with the same cluster structure, but the layout as a whole may come out rotated or mirrored. The clusters themselves are reproducible; the absolute axes aren’t.

Subsampled — can I get the full N=200,000 projection?

Not today. The 50k cap is a deliberate cost ceiling. If you need a denser projection, contact support@summand.com — Enterprise customers can have it raised under contract.
