# `window`

## Description

Applies a **window function** over a set of related rows and **adds the result as a new column** to every row in the result set.

- A `window` operation **does not collapse rows** (unlike `groupby`). It enriches each row with a computed value.
- You can optionally:
- **Partition** the data into independent groups using `by`
- **Order** rows inside each partition using `orderby`
- Define a **frame** to limit which rows are considered for each output row (e.g., “previous 10 rows”)

Note

A single `window` operator supports **one window function call** inside `calc`. To add multiple window-based columns, chain multiple `window` operators.

Warning

If you use an `orderby` clause, the `window` operator will **re-order** the result set by the `orderby` expressions.

## Syntax

```dataprime
window
  [ by <expr> [, <expr> ...] ]
  [ orderby <expr> [asc|desc] [, <expr> [asc|desc] ...] ]
  [ <frame-type> <frame-function> ]
  calc <window-function-expression> as <alias>
```

## Arguments

| Name           | Type                         | Required  | Description                                                                                                                                                                                                                   |
| -------------- | ---------------------------- | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| by             | list[scalar-expression]      | **false** | Partitions the current result set into independent groups (“buckets”). The window calculation is performed separately within each partition. If omitted, the window is computed over the entire result set (no partitioning). |
| orderby        | list\[scalar-expression (asc | desc)\]   | **false**                                                                                                                                                                                                                     |
| frame-type     | enum                         | **false** | Defines how the window **frame** is interpreted.                                                                                                                                                                              |
| frame-function | frame-function               | **false** | Defines which rows are included in the calculation \*\*relative to the current row.                                                                                                                                           |

### frame-type

- `rows` — frame boundaries are based on row counts (e.g., previous 10 rows)
- `range` — frame boundaries are based on the ordered value (commonly time intervals when ordering by timestamp)
- `groups` — frame boundaries are based on peer groups according to the `orderby` clause

If omitted, a default frame is applied (see `frame-function`).

### frame-function

Supported frame functions:

- `between(from, to)` — explicit frame bounds (both `from` and `to` must be **literals**)
- `all()` — equivalent to `between(null, null)`
- `running()` — cumulative window up to the current row (equivalent to `between(to=0)`)
- `previous(n)` — previous *n* rows / groups / interval, excluding the current row (`between(-n, -1)`)
- `next(n)` — next *n* rows / groups / interval, excluding the current row (`between(1, n)`)
- `lookbehind(n)` — includes current row and *n* rows / groups / interval behind (`between(-n, 0)`)
- `lookahead(n)` — includes current row and *n* rows / groups / interval ahead (`between(0, n)`)
- `lookaround(n)` — includes current row and *n* rows / groups / interval on both sides (`between(-n, n)`)

Mapping of `between(from, to)` to SQL-style semantics:

- `from = null` → unbounded preceding
- `to = null` → unbounded following
- `from/to = 0` → current row
- `from/to = -1` → 1 preceding
- `from/to = 1` → 1 following

Defaults (Postgres-like semantics):

- If `orderby` is **omitted** → default frame is `rows all()`
- If `orderby` is **present** → default frame is `groups running()`

Validations:

- `lookahead`, `lookbehind`, `lookaround`, `next`, `previous`: `n` must be a **non-negative literal**
- `between(from, to)`: `from` and `to` must be **literals**

Rules:

- Must contain **exactly one** window function call.
- Can include normal scalar operations around the window function (e.g., `value / avg(value)`).
- The result can be named using `as <alias>`.

Supported window functions include:

**Ranking**

- `dense_rank()` — Returns an integer rank (starting at `1`) within the partition, **without gaps**. Peer rows (rows with equal `orderby` values) share the same rank, and the next distinct `orderby` value increases the rank by **1**.
- `percent_rank()` — Returns the relative rank as a fraction in `[0, 1]`, computed as `(rank() - 1) / (count(*) - 1)` within the partition. Returns `0` when the partition contains a single row. Because it’s based on `rank()`, it may **not** reach `1` if the last rows are peers (ties).
- `rank()` — Returns an integer rank (starting at `1`) within the partition, **with gaps**. Peer rows share the same rank, and the next rank jumps by the number of peer rows (e.g., ordered values `[10, 10, 20]` → ranks `[1, 1, 3]`).
- `row_number()` — Returns a unique sequential integer for each row (starting at `1`) in the partition, ordered by `orderby`. Rows that compare equal in `orderby` still get different row numbers; add additional `orderby` expressions to make tie-breaking deterministic.
- `cume_dist()` — Returns the cumulative distribution as a fraction in `(0, 1]`, computed as `count(rows with orderby value <= current row’s value) / count(*)` within the partition. All peer rows return the same value.
- `ntile(n: number)` — Splits the ordered partition into `n` buckets as evenly as possible and returns the **1-based** bucket number for the current row. Bucket sizes differ by at most `1`; if the rows don’t divide evenly, earlier buckets contain one extra row. `n` must be a positive integer.

**Value**

- `lag(expr, offset?: number = 1, default?: any = null)`
- `lead(expr, offset?: number = 1, default?: any = null)`
- `first_value(expr)`
- `nth_value(expr, n: number)`
- `last_value(expr)`

**Aggregations**

- Aggregation functions can also be used as window functions.
- All aggregation functions work for **cumulative** frames (e.g., `all()`, `running()`).
- For **sliding** frames (where rows can leave the window), only these aggregates are supported: `avg()`, `count()`, `max()`, `min()`, `sum()`, `stddev()`, `variance()`, `sample_stddev()`, `sample_variance()`

## Example 1

**Use case: Keep one DataPrime query per user based on bytes scanned**

Partition queries by user, rank each user’s queries using the `orderby` clause, and then keep only the row with `rank == 1`.

### Example data

```json
{ "clientInfo": { "userEmail": "alice@example.com" }, "queryInfo": { "queryId": "q1", "interfaceType": "dataprime", "queryOutcome": { "storage": { "stats": { "bytesScanned": 1200 } } } } },
{ "clientInfo": { "userEmail": "alice@example.com" }, "queryInfo": { "queryId": "q2", "interfaceType": "dataprime", "queryOutcome": { "storage": { "stats": { "bytesScanned": 900 } } } } },
{ "clientInfo": { "userEmail": "bob@example.com" }, "queryInfo": { "queryId": "q3", "interfaceType": "dataprime", "queryOutcome": { "storage": { "stats": { "bytesScanned": 5000 } } } } },
{ "clientInfo": { "userEmail": "bob@example.com" }, "queryInfo": { "queryId": "q4", "interfaceType": "dataprime", "queryOutcome": { "storage": { "stats": { "bytesScanned": 3000 } } } } }
```

### Example query

```dataprime
source system/engine.queries
| filter queryInfo.interfaceType == 'dataprime'
| window by clientInfo.userEmail
    orderby queryInfo.queryOutcome.storage.stats.bytesScanned
    calc rank() as rank
| filter rank == 1
| choose clientInfo.userEmail as user, queryInfo.queryId,queryInfo.queryOutcome.storage.stats.bytesScanned
```

### Example output

| user              | queryId | bytesScanned |
| ----------------- | ------- | ------------ |
| alice@example.com | q2      | 900          |
| bob@example.com   | q4      | 3000         |

## Example 2

**Use case: Keep the top 3 most recent log entries per application**

Partition logs by `applicationname`, rank each entry from newest to oldest using `dense_rank()`, then keep only the first 3 ranks per partition. This pattern returns top N rows per group — distinct from the `top` command, which returns the top N groups overall.

### Example data

```json
{ "timestamp": "2026-05-11T12:40:23Z", "applicationname": "cx-metrics" },
{ "timestamp": "2026-05-11T12:39:51Z", "applicationname": "cx-metrics" },
{ "timestamp": "2026-05-11T12:39:31Z", "applicationname": "cx-metrics" },
{ "timestamp": "2026-05-11T12:38:00Z", "applicationname": "cx-metrics" },
{ "timestamp": "2026-05-11T12:39:09Z", "applicationname": "claude-code" },
{ "timestamp": "2026-05-11T12:39:08Z", "applicationname": "claude-code" },
{ "timestamp": "2026-05-11T12:30:00Z", "applicationname": "claude-code" }
```

### Example query

```dataprime
source application_logs
| window by applicationname
    orderby timestamp desc
    calc dense_rank() as rank
| filter rank <= 3
| choose timestamp, applicationname
```

### Example output

| timestamp            | applicationname |
| -------------------- | --------------- |
| 2026-05-11T12:40:23Z | cx-metrics      |
| 2026-05-11T12:39:51Z | cx-metrics      |
| 2026-05-11T12:39:31Z | cx-metrics      |
| 2026-05-11T12:39:09Z | claude-code     |
| 2026-05-11T12:39:08Z | claude-code     |
| 2026-05-11T12:30:00Z | claude-code     |

## Example 3

**Use case: Detect slow HTTP requests compared to the previous 10 requests (per method)**

Order requests by time, then compute the average duration over the previous 10 rows in the same HTTP method partition. Finally, filter requests that are more than 3× slower than that baseline.

### Example data

```json
{ "method": "GET", "timestamp": "2026-02-21T10:00:00Z", "duration_ms": 120 },
{ "method": "GET", "timestamp": "2026-02-21T10:00:01Z", "duration_ms": 110 },
{ "method": "GET", "timestamp": "2026-02-21T10:00:02Z", "duration_ms": 115 },
{ "method": "GET", "timestamp": "2026-02-21T10:00:03Z", "duration_ms": 900 },
{ "method": "POST", "timestamp": "2026-02-21T10:00:00Z", "duration_ms": 200 },
{ "method": "POST", "timestamp": "2026-02-21T10:00:01Z", "duration_ms": 210 },
{ "method": "POST", "timestamp": "2026-02-21T10:00:02Z", "duration_ms": 205 },
{ "method": "POST", "timestamp": "2026-02-21T10:00:03Z", "duration_ms": 800 }
```

### Example query

```dataprime
source http_logs
| window by method
    orderby timestamp asc
    rows previous(10)
    calc avg(duration_ms) as avg_prev_10_ms
| filter duration_ms > avg_prev_10_ms * 3
| choose method, timestamp, duration_ms, avg_prev_10_ms
```

### Example output

| method | timestamp            | duration_ms | avg_prev_10_ms |
| ------ | -------------------- | ----------- | -------------- |
| GET    | 2026-02-21T10:00:03Z | 900         | 115            |
| POST   | 2026-02-21T10:00:03Z | 800         | 205            |
