umap_embedding takes the curated Parquet for a dataset, projects every row into two dimensions using UMAP, and emits the coordinates. The result drives the cluster scatter plot on the dataset page and grounds Summand’s “show me similar rows” answers.
UMAP is unsupervised — it doesn’t need a target column. It looks at the geometry of the rows in feature space and finds a 2-D layout that preserves local neighborhoods.
What it’s good for
- Cluster discovery — see at a glance whether your data has natural groupings, and how many.
- Outlier visualization — outliers tend to land far from the dense regions of the projection.
- Sanity-checking labels — if you have a categorical column you suspect is meaningful, color the scatter plot by it and see whether the colors form coherent regions.
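The outlier use case can also be approximated outside the viewer. The sketch below is an illustration, not part of the component: it scores each projected point by its mean distance to its k nearest neighbors, so points far from dense regions get high scores. The helper name and the sample points are hypothetical, and distances in a UMAP layout are only a rough guide, so treat high scores as rows worth inspecting.

```python
import math


def knn_distance_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbors.

    Brute force (every pair is compared), so only suitable for small samples.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical projection: four points in one dense region, one far away.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (6.0, 6.0)]
scores = knn_distance_scores(points)
most_isolated = max(range(len(points)), key=scores.__getitem__)
print(most_isolated)  # 4
```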
Inputs
None. The component runs on the full curated dataset.
Output shape
One [x, y] pair per row, in the same order as the rows of the curated Parquet. The artifact is umap_embedding.json.
For datasets with more than ~50,000 rows, the component subsamples to keep the projection tractable. The sampled-row indices are recorded in the manifest so downstream consumers can align points back to their source rows.
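Aligning points back to source rows might look like the following sketch. It assumes umap_embedding.json parses to a list of [x, y] pairs, that the sampled indices are available as a list, and that point i corresponds to the i-th sampled index. The helper name and the `sampled_indices` argument are hypothetical; check the actual manifest for where the indices are stored.

```python
import json


def align_points(embedding_json, sampled_indices=None):
    """Map each source-row index to its (x, y) coordinates.

    sampled_indices is None when the dataset was not subsampled; point i
    then belongs to row i of the curated Parquet.
    """
    points = json.loads(embedding_json)
    if sampled_indices is None:
        sampled_indices = range(len(points))
    if len(sampled_indices) != len(points):
        raise ValueError("expected one sampled index per embedded point")
    # Assumption: points are stored in the same order as the sampled indices.
    return {row: (x, y) for row, (x, y) in zip(sampled_indices, points)}


# Hypothetical payload: three points sampled from rows 4, 10 and 17.
embedding = "[[0.1, 2.0], [1.5, -0.3], [-2.2, 0.7]]"
aligned = align_points(embedding, sampled_indices=[4, 10, 17])
print(aligned[10])  # (1.5, -0.3)
```

Keying the result by source-row index lets a consumer join coordinates to rows without caring whether subsampling happened.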
Display
The UMAP component ships with a bespoke React viewer (UMAPVisualization):
- Scatter plot with zoom and pan.
- Color-by-column — pick any column from the schema to color points; categorical columns get a discrete legend, numeric columns a gradient.
- Hover for row preview — a tooltip shows the nearest point’s actual row values.
- Lasso select — drag-select a region; the panel below shows the selected rows in a table.
Filtering from chat
UMAP isn’t directly queryable from chat, because the raw coordinates aren’t useful in a text response. Summand can mention that the artifact exists (“there are three visible clusters in this dataset’s UMAP”), but the visualization itself lives on the dataset page. For this reason the component’s agent_config is enabled: false, and analyze({ component: "umap_embedding", ... }) is intentionally a no-op.
Compute profile
| Profile | CPU / Memory | Notes |
|---|---|---|
| Fargate | 4 CPU, 16 GB | Scales with row count; subsampling at 50k rows keeps cost bounded |
Common gotchas
The plot looks like a single dense blob
Most often this means your dataset has one dominant cluster — common for filtered or single-segment datasets. Color by a categorical column (region, plan tier, product category) to see whether structure exists within the blob; if not, try running UMAP on a view that includes more diverse rows.
Categorical colors look noisy
UMAP groups by all features at once, not by your color column. If the color column isn’t strongly correlated with the rest of the feature space, the colors won’t form coherent regions. That’s a finding, not a bug — it means the color column doesn’t drive the broader structure.
Two runs look completely different
UMAP is stochastic. Re-running on the same data produces a layout with the same cluster structure, but the whole picture may be rotated or mirrored. The clusters themselves are reproducible; the absolute axes aren’t.
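One way to check that two runs agree despite different axes is to compare neighborhoods instead of coordinates. The sketch below is an illustration under stated assumptions, not something the component provides: it measures how many of each point's k nearest neighbors are shared between two layouts, a quantity that rotation and mirroring leave unchanged. The function names and sample coordinates are hypothetical, and both runs are assumed to list the same rows in the same order.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(set(others[:k]))
    return result


def neighborhood_overlap(run_a, run_b, k=2):
    """Mean fraction of k nearest neighbors shared between two layouts."""
    a, b = knn_sets(run_a, k), knn_sets(run_b, k)
    return sum(len(x & y) / k for x, y in zip(a, b)) / len(a)


# Hypothetical layout with two clusters of three points each.
run_a = [(0.0, 0.0), (0.1, 0.3), (0.4, 0.1), (5.0, 5.0), (5.2, 5.5), (5.6, 5.1)]
# The same layout rotated 90 degrees and then mirrored, which swaps x and y.
run_b = [(y, x) for x, y in run_a]

print(neighborhood_overlap(run_a, run_b))  # 1.0
```

An overlap close to 1.0 suggests the two runs found the same local structure; a much lower value suggests the layouts differ in more than orientation. The brute-force neighbor search compares every pair of points, so run it on a sample, not the full projection.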
Subsampled — can I get the full N=200,000 projection?
Not today. The 50k cap is a deliberate cost ceiling. If you need a denser projection, contact support@summand.com — Enterprise customers can have it raised under contract.