column_stats is the baseline component. It runs on first ingest of any dataset (no experiment configuration needed) and powers the Overview tab, dataset summaries shown to Summand, and the schema-level views the rest of the product builds on.

What it computes

For every column in the dataset:
  • Common stats: name, dtype, count, nullCount, nullPct, uniqueCount, uniquePct.
  • Numeric columns: min, max, mean, median, std, first and third quartiles (q1, q3), zeroCount, zeroPct.
  • Categorical / string columns: top 10 distinct values with counts (topValues).
Plus dataset-level totals: totalRows, totalColumns.
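
As a rough illustration of the list above, the per-column numbers could be derived as follows. This is a minimal sketch in standard-library Python, not the component's own code (which, per this page, computes the stats in pandas). The function name column_summary, the use of total rows as the denominator for every percentage, the linear-interpolation quartiles, and the sample standard deviation are all assumptions. dtype is omitted because plain Python lists carry no dtype.

```python
import statistics
from collections import Counter


def column_summary(name, values, top_n=10):
    """Summarize one column; `values` is a list where None marks a null."""
    total = len(values)
    present = [v for v in values if v is not None]

    def pct(n):
        # Assumption: percentages are taken over all rows, nulls included.
        return round(100.0 * n / total, 2) if total else 0.0

    summary = {
        "name": name,
        "count": len(present),
        "nullCount": total - len(present),
        "nullPct": pct(total - len(present)),
        "uniqueCount": len(set(present)),
        "uniquePct": pct(len(set(present))),
    }

    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        if len(present) >= 2:
            # "inclusive" quartiles interpolate linearly between data points.
            q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
            std = statistics.stdev(present)  # sample standard deviation
        else:
            q1 = q3 = present[0]
            std = 0.0
        zeros = sum(1 for v in present if v == 0)
        summary.update({
            "min": min(present), "max": max(present),
            "mean": statistics.mean(present),
            "median": statistics.median(present),
            "std": std, "q1": q1, "q3": q3,
            "zeroCount": zeros, "zeroPct": pct(zeros),
        })
    else:
        # Non-numeric columns get the most frequent values instead.
        top = Counter(present).most_common(top_n)
        summary["topValues"] = [{"value": v, "count": c} for v, c in top]
    return summary
```

Calling column_summary("tier", ["free", "pro", "free", None]) yields a count of 3, a nullCount of 1, and a topValues list led by "free".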

Inputs

None. column_stats takes no configuration — it always runs on the full dataset.

Output shape

{
  "totalRows": 50000,
  "totalColumns": 24,
  "columns": [
    {
      "name": "revenue",
      "dtype": "float64",
      "count": 49872,
      "nullCount": 128,
      "nullPct": 0.26,
      "uniqueCount": 12451,
      "uniquePct": 24.94,
      "min": 0.0,
      "max": 9842.5,
      "mean": 142.7,
      "median": 89.5,
      "std": 218.3,
      "q1": 32.0,
      "q3": 184.2,
      "zeroCount": 4,
      "zeroPct": 0.01
    },
    {
      "name": "tier",
      "dtype": "object",
      "count": 50000,
      "nullCount": 0,
      "nullPct": 0.0,
      "uniqueCount": 3,
      "uniquePct": 0.01,
      "topValues": [
        { "value": "free", "count": 38240 },
        { "value": "pro", "count": 9810 },
        { "value": "enterprise", "count": 1950 }
      ]
    }
  ]
}
The artifact is written as column_stats.json to the dataset’s versioned S3 path.
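
Because the artifact is plain JSON, any JSON parser can consume it. The sketch below assumes the file has already been downloaded (the exact S3 location is not given on this page) and uses a hypothetical helper, columns_by_missingness, to rank columns by nullPct. The inline string is a trimmed copy of the example above.

```python
import json


def columns_by_missingness(stats):
    """Return (name, nullPct) pairs, most-missing column first."""
    pairs = [(col["name"], col["nullPct"]) for col in stats["columns"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Trimmed copy of the example artifact; with a local download this would be
# json.load() on the opened column_stats.json file instead.
raw = """
{
  "totalRows": 50000,
  "totalColumns": 24,
  "columns": [
    {"name": "tier", "dtype": "object", "nullCount": 0, "nullPct": 0.0},
    {"name": "revenue", "dtype": "float64", "nullCount": 128, "nullPct": 0.26}
  ]
}
"""
stats = json.loads(raw)
print(columns_by_missingness(stats))  # [('revenue', 0.26), ('tier', 0.0)]
```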

Where it shows up

  • Dataset detail → Overview tab reads the latest column_stats to render the column-by-column summary.
  • Summand chat queries it via the analyze tool — “What’s the missingness of revenue?” resolves to a filtered read of the artifact.
  • The Predictors component uses column stats internally to make feature-engineering decisions (numeric vs. categorical, low- vs. high-cardinality).
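
How Predictors makes these decisions is not documented here. Purely as an illustration of the kind of rule that column stats make possible, the sketch below classifies a single column entry from the artifact. The function name, the list of numeric dtype prefixes, and the cardinality threshold of 20 are all hypothetical and not Summand defaults.

```python
# Hypothetical: pandas-style dtype names such as "int64" or "float64".
NUMERIC_PREFIXES = ("int", "uint", "float")
# Hypothetical cutoff between low- and high-cardinality categoricals.
LOW_CARDINALITY_MAX = 20


def classify_column(col):
    """Classify one entry of the `columns` array by dtype and uniqueCount."""
    if col["dtype"].startswith(NUMERIC_PREFIXES):
        return "numeric"
    if col["uniqueCount"] <= LOW_CARDINALITY_MAX:
        return "categorical_low_cardinality"
    return "categorical_high_cardinality"
```

Under these assumed rules, the revenue column from the example output would be classified as numeric and tier (3 distinct values) as a low-cardinality categorical.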

Compute profile

Profile   Memory   Timeout
Lambda    2 GB     120 s
For most datasets this finishes in a few seconds. The component reads the curated Parquet, computes the stats in pandas, and writes the JSON.

Filtering from chat

Summand can ask for a single column instead of the whole stats blob:
analyze({ component: "column_stats", target: ..., params: { column_name: "revenue" } })
The agent receives only that column’s stats, so the response stays small even on wide tables.
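
Conceptually, the filtered read amounts to selecting one entry from the columns array. The sketch below shows that selection on an already-parsed artifact. select_column is a hypothetical helper; how the analyze tool applies column_name internally is not specified on this page.

```python
from typing import Optional


def select_column(stats: dict, column_name: str) -> Optional[dict]:
    """Return the stats entry whose name matches, or None if absent."""
    for col in stats["columns"]:
        if col["name"] == column_name:
            return col
    return None


# Tiny stand-in for a parsed column_stats.json.
example = {
    "totalRows": 50000,
    "totalColumns": 24,
    "columns": [
        {"name": "revenue", "dtype": "float64", "nullPct": 0.26},
        {"name": "tier", "dtype": "object", "nullPct": 0.0},
    ],
}
print(select_column(example, "revenue")["nullPct"])  # 0.26
```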