Structured queries and typed fields

Filter, sort, and aggregate your documents by their structured values — exact counts, rankings, and ranges — without ever running a semantic search. Structured queries are deterministic: the same query over the same data always returns the same answer, because no embedding is involved.

Semantic search answers "what is this about?". Structured queries answer "how many, which ones, sorted how, grouped by what?" — the questions a spreadsheet or a SQL GROUP BY answers. "How many paid invoices over $500 last quarter, by region, highest total first" is not a similarity question, and no amount of vector search will answer it exactly. Aether lets you ask it directly, over the same store your agent already searches.


When to use structured queries

Reach for a structured query when the answer depends on exact values, not meaning:

  • Filtering on a field: status = "paid", amount >= 500, created_at in a date range.
  • Sorting by a typed value: highest amount first, oldest due_date last.
  • Counting and aggregating: totals, averages, min/max, distinct counts — optionally grouped.
  • Exact, complete results: structured queries return the entire matching set (paged), not a top-k by relevance.

Keep using semantic search when the answer depends on meaning: "notes that discuss anxiety", "documents similar to this one". The two compose: the same filter grammar you use here also narrows a search call (see Filtering search with the same grammar).


Step 1 — Declare the fields you want to query

A field is a typed, indexed view over a value your documents already carry. You declare a field once per workspace; Aether then extracts and indexes that value from every existing and future document, so filters and aggregates over it are exact and fast.

Each field has a name, a type, and a source:

Type         Accepts                                              Ordering
string       a text value                                         lexicographic
int          a whole number (or a whole-valued number like 3.0)   numeric
float        any number                                           numeric
bool         true / false                                         false < true
datetime     an RFC 3339 timestamp string                         chronological
string_list  an array of text values (tag-like)

A field's source is where its value comes from:

  • { "metadata": "<key>" } — lift the value from a document's structured metadata (the typed key/value map you attach at insert time).
  • { "regex": "<pattern>" } — extract the first capture group of a regular expression run over the document's text (for example, pulling a ticket number out of a subject line).

For example, this call declares four metadata-backed fields and one regex-backed field:

client.schema.declare_fields([
    {"name": "amount",   "type": "float",       "source": {"metadata": "amount"}},
    {"name": "status",   "type": "string",      "source": {"metadata": "status"}},
    {"name": "region",   "type": "string",      "source": {"metadata": "region"}},
    {"name": "labels",   "type": "string_list", "source": {"metadata": "labels"}},
    {"name": "ticket",   "type": "string",      "source": {"regex": "TICKET-(\\d+)"}},
])
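
Before declaring a regex-backed field, it can help to try the pattern locally. The following is a minimal sketch using Python's standard re module. It assumes Aether runs the pattern as an unanchored search and keeps the first capture group, as described above; the extract_first_group helper and the sample subject lines are hypothetical and not part of the Aether client.

```python
import re
from typing import Optional

# Same pattern as the "ticket" field above; a raw string avoids double escaping.
TICKET_PATTERN = re.compile(r"TICKET-(\d+)")


def extract_first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical subject lines, used only for illustration.
samples = [
    "Re: TICKET-4821 login fails after reset",
    "Weekly status update",
]
for subject in samples:
    print(repr(subject), "->", extract_first_group(TICKET_PATTERN, subject))
```

A document whose text does not match the pattern yields None here, which corresponds to a document that has no value for the field.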

Declaring a field triggers a background backfill over your existing documents. list_fields reports each field's live coverage (how many documents have a value), mismatch_count, and backfill progress:

for field in client.schema.list_fields():
    print(field.name, field.type, field.coverage, field.mismatch_count)

Re-declaring a field name replaces its definition and re-backfills. delete_field(name) removes it and returns the remaining fields.

A bad value never fails ingest

If a document's source value can't be coerced to the field's type — "cheap" for a float field, say — that document is simply treated as not having a value for the field (it's excluded from filters and aggregates on it), and it increments the field's mismatch_count. Inserting a document never fails because of a declared field, so declaring a field can't break your write path.
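
The rule above can be pictured with a small local model. This is a sketch under assumptions, not Aether's implementation: the coerce and index_document helpers are hypothetical, the exact coercion rules (for instance, whether the string "12" would be accepted for a float field) are guesses, and datetime.fromisoformat is only an approximation of RFC 3339 parsing.

```python
from datetime import datetime
from typing import Any, Optional


def coerce(value: Any, field_type: str) -> Optional[Any]:
    """Return the typed value, or None when the value does not fit the type."""
    if field_type == "string":
        return value if isinstance(value, str) else None
    if field_type == "bool":
        return value if isinstance(value, bool) else None
    if field_type in ("int", "float"):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if field_type == "float":
            return float(value)
        if isinstance(value, int):
            return value
        # int also accepts whole-valued numbers such as 3.0.
        return int(value) if value.is_integer() else None
    if field_type == "datetime":
        if not isinstance(value, str):
            return None
        try:
            # fromisoformat rejects a trailing "Z" before Python 3.11.
            return datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
    if field_type == "string_list":
        ok = isinstance(value, list) and all(isinstance(v, str) for v in value)
        return value if ok else None
    raise ValueError(f"unknown field type: {field_type}")


def index_document(metadata: dict, fields: list, mismatch_counts: dict) -> dict:
    """Build the typed view of one document; a bad value is skipped, never raised."""
    typed = {}
    for field in fields:
        key = field["source"]["metadata"]
        if key not in metadata:
            continue  # never set: no value, and not a mismatch
        value = coerce(metadata[key], field["type"])
        if value is None:
            name = field["name"]
            mismatch_counts[name] = mismatch_counts.get(name, 0) + 1
        else:
            typed[field["name"]] = value
    return typed


fields = [{"name": "amount", "type": "float", "source": {"metadata": "amount"}}]
mismatches: dict = {}
print(index_document({"amount": "cheap"}, fields, mismatches), mismatches)
print(index_document({"amount": 12}, fields, mismatches), mismatches)
```

In this model the document with "cheap" is still indexed; it just carries no amount, and the mismatch counter for the field goes up by one.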


Step 2 — Filter with the unified grammar

The filter grammar is one small JSON shape used everywhere Aether takes a filter. A filter is either a leaf — one comparison — or a combinator that nests other filters.

A leaf: one comparison

JSON
{ "field": "amount", "op": "gte", "value": 100 }

field is a declared field or one of the always-available built-ins (see below). op is one of:

Operator   Meaning                        value
eq         equal to                       a scalar
neq        not equal to                   a scalar
in         equal to any of                an array of scalars
gt / gte   greater than / or equal        a scalar
lt / lte   less than / or equal           a scalar
between    within an inclusive range      a 2-element array [low, high]
exists     has (or lacks) a value         true (default) or false
contains   list contains the value        a string (only on string_list fields)
prefix     string starts with the value   a string (only on string fields)

Comparisons are typed, which is the whole point: int and float fields order numerically, so 9 is less than 10 (not the string surprise where "9" sorts after "10"), and datetime fields order chronologically.
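
The difference between typed and string ordering is easy to reproduce in plain Python. This sketch does not call Aether; it only shows why the same values sort differently as strings, as numbers, and as timestamps. The sample values are made up for illustration.

```python
from datetime import datetime

amounts = [10, 9, 100]
as_strings = [str(a) for a in amounts]

print(sorted(as_strings))  # lexicographic: compares character by character
print(sorted(amounts))     # numeric: compares by value

# Two timestamps with different UTC offsets. Compared as text, the 08:30 one
# sorts first; compared as instants, 09:00+02:00 (07:00 UTC) is the earlier one.
stamps = ["2026-04-01T09:00:00+02:00", "2026-04-01T08:30:00+00:00"]
parsed = sorted(datetime.fromisoformat(s) for s in stamps)

print(sorted(stamps)[0])
print(parsed[0].isoformat())
```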

Combinators: and, or, not

Nest leaves with and / or (arrays) and not (a single filter). They compose to any depth:

JSON
{ "and": [
    { "field": "status", "op": "eq",  "value": "paid" },
    { "or": [
        { "field": "amount", "op": "gte", "value": 500 },
        { "field": "region", "op": "in",  "value": ["us-east", "us-west"] }
    ]},
    { "not": { "field": "labels", "op": "contains", "value": "test" } }
] }

Built-in fields

Every document exposes these without a declaration:

Field         Type
created_at    datetime
updated_at    datetime
source        string
content_type  string
tags          string_list
entity_id     string

Missing values

A document with no value for a field (never set, or a type mismatch) does not match any comparison on that field. It follows that wrapping such a comparison in not does match the document. The one leaf that can select these documents directly is exists with value false. So a comparison never silently coerces missing data into a false positive.
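
To make the grammar and the missing-value rule concrete, here is a small local evaluator. It is a sketch of the semantics described in this step, not Aether's implementation: the matches function is hypothetical, it works on plain dicts of already-typed values, and it skips the type checks a real query would enforce.

```python
def matches(doc: dict, flt: dict) -> bool:
    """Evaluate a filter against one document's already-typed field values."""
    # Combinators nest other filters to any depth.
    if "and" in flt:
        return all(matches(doc, f) for f in flt["and"])
    if "or" in flt:
        return any(matches(doc, f) for f in flt["or"])
    if "not" in flt:
        return not matches(doc, flt["not"])

    # Leaf: one comparison on one field.
    op = flt["op"]
    present = doc.get(flt["field"]) is not None
    if op == "exists":
        return present == flt.get("value", True)
    if not present:
        return False  # a missing value matches no comparison
    actual, value = doc[flt["field"]], flt.get("value")
    if op == "eq":
        return actual == value
    if op == "neq":
        return actual != value
    if op == "in":
        return actual in value
    if op == "gt":
        return actual > value
    if op == "gte":
        return actual >= value
    if op == "lt":
        return actual < value
    if op == "lte":
        return actual <= value
    if op == "between":
        low, high = value
        return low <= actual <= high
    if op == "contains":
        return value in actual  # actual is a list of strings
    if op == "prefix":
        return actual.startswith(value)
    raise ValueError(f"unknown operator: {op}")


# The nested filter from the combinators example above.
flt = {"and": [
    {"field": "status", "op": "eq", "value": "paid"},
    {"or": [
        {"field": "amount", "op": "gte", "value": 500},
        {"field": "region", "op": "in", "value": ["us-east", "us-west"]},
    ]},
    {"not": {"field": "labels", "op": "contains", "value": "test"}},
]}

# Hypothetical documents; the third has no amount and no labels.
docs = [
    {"status": "paid", "amount": 750.0, "region": "eu-west", "labels": ["q2"]},
    {"status": "paid", "amount": 120.0, "region": "us-east", "labels": ["test"]},
    {"status": "paid", "region": "us-west"},
]
results = [matches(d, flt) for d in docs]
print(results)
```

The third document is worth a close look: its missing amount fails the gte comparison, but its missing labels makes the contains comparison false, so the surrounding not accepts it.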


Step 3 — Query documents (Mode A)

Call query with a filter, an optional typed sort, and limit / offset to get the matching documents one page at a time. sort is a list of { by, dir } keys (dir is asc or desc, default asc); documents missing a sort field come last in either direction. The page carries total (the full matching count) and has_more, so you can page through the entire set.

page = client.query(
    filter={"and": [
        {"field": "status", "op": "eq",  "value": "paid"},
        {"field": "amount", "op": "gte", "value": 100},
    ]},
    sort=[{"by": "amount", "dir": "desc"}],
    limit=20,
)

print(page.total, page.has_more)
for doc in page:
    print(doc.doc_id, doc.metadata.get("amount"))

Omit filter to page over every document in scope. To walk the whole result set, repeat with offset += limit while has_more is true.
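
The paging loop described above can be wrapped in a small generator. This is a sketch under assumptions: iter_all is a hypothetical helper, and FakePage and fake_query are in-memory stand-ins so the loop can run without a server. It assumes the real query accepts limit and offset as keyword arguments and that the returned page is iterable and exposes has_more, as in the example above.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class FakePage:
    """Hypothetical stand-in for the page object returned by query."""
    docs: List[dict]
    total: int
    has_more: bool

    def __iter__(self):
        return iter(self.docs)


def fake_query(limit: int, offset: int = 0) -> FakePage:
    """In-memory stand-in for client.query, used only to exercise the loop."""
    data = [{"doc_id": f"doc-{i}"} for i in range(45)]
    chunk = data[offset:offset + limit]
    return FakePage(chunk, total=len(data), has_more=offset + len(chunk) < len(data))


def iter_all(query: Callable[..., FakePage], limit: int = 20, **kwargs) -> Iterator[dict]:
    """Yield every matching document, advancing offset until has_more is false."""
    offset = 0
    while True:
        page = query(limit=limit, offset=offset, **kwargs)
        yield from page
        if not page.has_more:
            break
        offset += limit


print(sum(1 for _ in iter_all(fake_query, limit=20)))
```

With a real client, the same helper would be called as iter_all(client.query, filter=..., sort=...), provided the assumptions about the keyword arguments hold.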


Step 4 — Aggregate (Mode B)

Add an aggregate list and query switches to aggregation mode: instead of documents, it returns computed rows. Optionally group_by up to two fields to get one row per group.

Aggregate operators: count, count_distinct, sum, avg, min, max. The numeric operators (sum, avg, min, max) require an int or float field; give each an optional as to name its output. A document missing the aggregated field is excluded from that aggregate (a count of the group still counts it).

result = client.query(
    filter={"field": "created_at", "op": "gte", "value": "2026-04-01T00:00:00Z"},
    group_by=["region"],
    aggregate=[
        {"op": "count"},
        {"op": "sum", "field": "amount", "as": "total"},
        {"op": "avg", "field": "amount", "as": "avg_amount"},
    ],
    sort=[{"by": "total", "dir": "desc"}],
    limit=20,
)

for group in result.groups:
    print(group.keys["region"], group.aggregates["total"])

sort in Mode B orders by an aggregate output name or a group key; limit caps the number of groups returned. The result carries total_groups and scanned (how many documents were considered). A sum over an int field stays an integer; sum and avg otherwise accumulate as floating point.
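
The aggregation rules can be checked against a small local model. This is a sketch, not Aether's implementation: the aggregate function is hypothetical and runs over plain dicts. Two details are assumptions that the text does not specify: an aggregate without as is named after its operator, and documents missing a group key are grouped under None.

```python
from collections import defaultdict


def aggregate(docs, group_by, aggregates):
    """Group documents and compute one row of aggregates per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc.get(k) for k in group_by)].append(doc)

    rows = []
    for key, members in groups.items():
        row = {"keys": dict(zip(group_by, key)), "aggregates": {}}
        for agg in aggregates:
            op = agg["op"]
            name = agg.get("as", op)  # assumption: default output name is the operator
            if op == "count":
                result = len(members)  # counts every document in the group
            else:
                # Documents missing the aggregated field are excluded.
                field = agg["field"]
                values = [d[field] for d in members if d.get(field) is not None]
                if op == "count_distinct":
                    result = len(set(values))
                elif op == "sum":
                    result = sum(values)  # int inputs stay int, float inputs stay float
                elif op == "avg":
                    result = sum(values) / len(values) if values else None
                elif op == "min":
                    result = min(values) if values else None
                elif op == "max":
                    result = max(values) if values else None
                else:
                    raise ValueError(f"unknown aggregate: {op}")
            row["aggregates"][name] = result
        rows.append(row)
    return rows


# Hypothetical documents; the third has no amount.
docs = [
    {"region": "us-east", "amount": 500.0},
    {"region": "us-east", "amount": 250.0},
    {"region": "us-east"},
    {"region": "eu-west", "amount": 900.0},
]
rows = aggregate(docs, ["region"], [
    {"op": "count"},
    {"op": "sum", "field": "amount", "as": "total"},
    {"op": "avg", "field": "amount", "as": "avg_amount"},
])
# Order groups by an aggregate output name, highest total first.
rows.sort(key=lambda r: r["aggregates"]["total"], reverse=True)
for row in rows:
    print(row["keys"], row["aggregates"])
```

In the us-east group the document without an amount still counts toward count, but the average is taken over the two documents that have a value.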


Guardrails fail loud

Structured queries never return a partial answer dressed up as a complete one. Anything Aether can't answer exactly returns a 400 with a precise message, so you fix the query rather than trust a silently truncated result:

  • an unknown field (not declared, not a built-in);
  • a type-mismatched literal (comparing a float field to "cheap");
  • a non-numeric aggregate (sum over a string field);
  • exceeding the group cap or the candidate-scan cap — narrow the filter or add a partition.

Exact means exact

Because the guardrails return 400 instead of a truncated 200, a structured query is safe to build a total or a ranking on: if it returns a result, that result is complete and exact. If your filter is too broad to answer within the caps, you get an error telling you to narrow it — never a wrong number.


Filtering search with the same grammar

The grammar above is not exclusive to query. search and document listings accept the same filter. The grammar is a superset of the simpler metadata filters, so a structured range predicate can narrow a semantic search's pre-filter too:

results = client.search(
    "refund policy",
    k=10,
    filter={"field": "amount", "op": "gte", "value": 500},
)

Need the exact count or a ranking rather than the top-k by relevance? That's a structured query. Need the most relevant matches? That's search. Same filter, two read paths over one store.


Partition scoping

On a partition-scoped handle, query and every schema call are automatically scoped to that partition, exactly like the rest of the client:

scoped = client.partition("client-7")
page = scoped.query(filter={"field": "status", "op": "eq", "value": "open"})

See multi-tenant patterns for how partitions isolate one customer's data from another's.


Reference

The full request/response shapes and every parameter are in the Structured Query API reference.