# MainMarket Published Price Indices — Methodology

**Version 1.0 · 2026-05-04**

This document defines the methodology for MainMarket's published monthly price indices. It is the authoritative reference for institutional buyers, journalists, and analysts who need to understand exactly how each index is computed.

Indices currently published:
- **`eggs_cage_free_large_dozen`** — single-SKU egg index
- **`soda_12pk_12oz_cans`** — multi-SKU soda category index
- **`basket_low_income`** — USDA Thrifty Food Plan-anchored basket at value-tier chains
- **`basket_high_income`** — USDA Liberal Food Plan-anchored basket at premium-tier chains

For machine-readable definitions of each index (constituents, floor/ceiling bounds, chain filters), see the `published_indices` registry exposed at `GET /v1/indices/{slug}` (returned in `meta.constituents`).

---

## 1. Index types and aggregation formulas

MainMarket publishes three types of indices. Each uses a distinct aggregation method.

### 1.1 Single-SKU index (`single_sku`)

A single-SKU index tracks the price of one canonical product (one UPC) over time. The Egg Index is single-SKU.

**Formula:**

For a given (slug, period, region):

```
value_usd = MEDIAN(regular_price)  over rows P satisfying:
    canonical_products.upc = constituent.upc
  AND store.state ∈ region_states (or any state if region = 'national')
  AND store.chain matches published_indices.chain_filter (or any chain if filter is NULL)
  AND last_scraped_at within period  (i.e. scrape month = period)
  AND floor_usd ≤ regular_price ≤ ceiling_usd
```

We use the **median** rather than the mean because the median is robust to outliers. Note that the BLS Average Retail Food Prices series reports average (mean) prices, so MainMarket values are not computed identically to that series. Outliers are excluded before aggregation by the floor/ceiling bounds (§3), and the median additionally protects against residual data-quality issues.
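As a minimal sketch of the rule above, assuming the rows have already been filtered to the right UPC, region, chain, and period, the single-SKU value reduces to a bounded median. The function name `single_sku_value` and the sample prices are hypothetical and chosen only for illustration.

```python
from statistics import median

def single_sku_value(prices, floor_usd, ceiling_usd):
    """Median of in-bounds prices; None if no observation survives the bounds."""
    in_bounds = [p for p in prices if floor_usd <= p <= ceiling_usd]
    return median(in_bounds) if in_bounds else None

# Hypothetical observations for one UPC (not real MainMarket data).
prices = [2.99, 3.49, 3.29, 45.00, 3.19, 3.79]
value_usd = single_sku_value(prices, floor_usd=1.00, ceiling_usd=12.00)
print(value_usd)  # the 45.00 row is rejected by the ceiling before the median
```

Returning `None` for an empty set is an assumption of this sketch; the document does not say how a period with no in-bounds rows is handled.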

### 1.2 Multi-SKU same-format index (`multi_sku_same_format`)

A multi-SKU same-format index tracks an aggregate price across N UPCs that are sold in identical pack format. The Soda Index is multi-SKU same-format (5 UPCs, all in 12pk × 12oz cans).

**Formula:**

For each constituent UPC `i`, compute the per-SKU median exactly as in §1.1:

```
median_i = MEDIAN(regular_price for UPC_i across stores in region/chain_filter)
```

Compute store-distribution weight per SKU:

```
n_stores_i = COUNT(DISTINCT store_id) for UPC_i in region
weight_i   = n_stores_i / Σ n_stores_j  (sum across all constituent SKUs)
```

Then:

```
value_usd = Σ (median_i × weight_i)
```

This is the **distribution-weighted basket** formula. It captures the *de facto* market by weighting each SKU by how broadly it's stocked, preserving the consumer's-eye-view of category pricing. A SKU stocked at 700 stores contributes more to the index than one stocked at 400.
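The weighting can be illustrated with a short Python sketch. It assumes the per-SKU medians and distinct-store counts have already been computed; `distribution_weighted_value` and the two-SKU data are hypothetical, not part of the MainMarket codebase.

```python
def distribution_weighted_value(skus):
    """skus: list of (median_price, n_stores) pairs, one per constituent UPC."""
    total_stores = sum(n for _, n in skus)
    if total_stores == 0:
        return None  # assumption: no stores means no value
    return sum(med * n / total_stores for med, n in skus)

# Hypothetical category: one SKU at 700 stores, one at 400 stores.
skus = [(10.00, 700), (12.00, 400)]
value_usd = distribution_weighted_value(skus)
print(round(value_usd, 2))
```

Because the 700-store SKU carries more weight, the result sits below the unweighted mean of the two medians ($11.00).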

### 1.3 Basket total cost (`basket_total_cost`)

A basket total cost index reports the aggregate dollar cost of a basket of N constituent UPCs at qty 1 each (or specified quantities). The Low Income and High Income baskets are basket_total_cost type.

**Formula:**

For each constituent UPC `i`, compute the per-SKU median as in §1.1, **filtered by `chain_filter`**:

```
median_i = MEDIAN(regular_price for UPC_i across stores in region
                  WHERE chain.market_position matches chain_filter)
```

Then:

```
value_usd = Σ (median_i × qty_i)
```

This is a fixed-basket aggregation, similar in spirit to how the FAO Food Price Index aggregates within a sub-index and how the BLS food-at-home CPI aggregates across items, although those indices weight by trade or expenditure shares rather than by unit quantities. The result is a real dollar amount (the cost of buying the basket once at the basket's eligible chains), not a unitless index value.

A normalized indexed value is computed at query time (§5).
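The basket formula can be sketched as follows, assuming each constituent's prices are already chain-filtered and within bounds. The dictionary keys `prices` and `qty` and the sample values are hypothetical.

```python
from statistics import median

def basket_total_cost(constituents):
    """Sum of per-constituent median price times quantity, in USD."""
    total = 0.0
    for c in constituents:
        if not c["prices"]:
            # Assumption of this sketch: a basket with a missing
            # constituent has no defined total.
            return None
        total += median(c["prices"]) * c["qty"]
    return round(total, 2)

# Hypothetical two-item basket (not real constituents).
basket = [
    {"prices": [1.99, 2.49, 2.29], "qty": 1},
    {"prices": [3.00, 3.50, 4.00], "qty": 2},
]
value_usd = basket_total_cost(basket)
print(value_usd)
```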

---

## 2. Region definitions

Indices are computed for `national` plus the 4 US Census Bureau regions used by BLS:

| Region | States |
|---|---|
| **Northeast** | CT, ME, MA, NH, NJ, NY, PA, RI, VT |
| **Midwest** | IL, IN, IA, KS, MI, MN, MO, NE, ND, OH, SD, WI |
| **South** | AL, AR, DE, DC, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV |
| **West** | AK, AZ, CA, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY |

`national` aggregates across all 50 states plus DC (no region filter).

A store contributes to exactly one regional index (the region containing its state) and to the national index. State assignment is from `stores.state`.
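The table above translates directly into a lookup. This is a minimal sketch; the lowercase region keys and the helper name `regions_for_store` are assumptions made for illustration.

```python
REGION_STATES = {
    "northeast": "CT ME MA NH NJ NY PA RI VT".split(),
    "midwest": "IL IN IA KS MI MN MO NE ND OH SD WI".split(),
    "south": "AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV".split(),
    "west": "AK AZ CA CO HI ID MT NV NM OR UT WA WY".split(),
}

# Invert the table: each state code maps to exactly one region.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}

def regions_for_store(state):
    """Regional index plus the national index for a store's state code."""
    return [STATE_TO_REGION[state.upper()], "national"]

print(regions_for_store("TN"))
```

A state code outside the table raises `KeyError` here; how the production pipeline treats such stores is not specified in this document.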

---

## 3. Sanity bounds (per-constituent floor and ceiling)

Every constituent UPC has a `floor_usd` and `ceiling_usd` defined in `published_indices.constituents`. Any SPP row outside `[floor_usd, ceiling_usd]` is excluded from the median calculation **and logged** to `index_snapshot_dropped_rows` with the reason (`'below_floor'` or `'above_ceiling'`).

Bounds are set per UPC based on observed P5/P95 of historical scrapes plus a margin for promotional volatility. Bounds are conservative: their goal is to reject obvious data errors (case-of-cases mislistings, decimal-point typos) without rejecting legitimate sale prices.

**Example bounds in v1:**

| Index | UPC | Floor | Ceiling | Rationale |
|---|---|---|---|---|
| Egg | 715141514643 | $1.00 | $12.00 | Observed range $2.00–$7.49; padded for promo + premium markets |
| Soda (each SKU) | various | $3.00 | $15.00 | Observed range $4.00–$14.99 across all 5 SKUs |

The initial bounds caught a single $45.00 Mountain Dew row at Save-A-Lot Crossville, TN, scraped 2026-04-20 — almost certainly a case-of-cases mislisting. That row appears in the audit log but was not used in any published value.

The full audit log is queryable for transparency. When a snapshot is computed, its API response includes `meta.dropped_rows: N` so consumers know how many observations were excluded.
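The bounds check and the audit trail it produces can be sketched as below. The row shape (`store_id`, `regular_price`) and the function name `apply_bounds` are assumptions; only the reason strings and the `dropped_rows` key come from this document.

```python
def apply_bounds(rows, floor_usd, ceiling_usd):
    """Split rows into kept rows and (row, reason) pairs for the audit log."""
    kept, dropped = [], []
    for row in rows:
        price = row["regular_price"]
        if price < floor_usd:
            dropped.append((row, "below_floor"))
        elif price > ceiling_usd:
            dropped.append((row, "above_ceiling"))
        else:
            kept.append(row)
    return kept, dropped

# Hypothetical soda rows checked against the $3.00 to $15.00 bounds.
rows = [
    {"store_id": 1, "regular_price": 9.99},
    {"store_id": 2, "regular_price": 45.00},
    {"store_id": 3, "regular_price": 0.99},
]
kept, dropped = apply_bounds(rows, floor_usd=3.00, ceiling_usd=15.00)
meta = {"dropped_rows": len(dropped)}
print(meta)
```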

---

## 4. Coverage scoring

Every snapshot reports `coverage_score`, a 0–1 metric representing the fraction of in-scope stores that contributed to the snapshot.

**For single-SKU and multi-SKU indices:**

```
coverage_score = n_stores_with_observation_this_period / n_stores_carrying_this_constituent_ever
```

**For basket indices:**

```
coverage_score = MEAN over constituents of (per-constituent coverage)
```

Each `published_indices` row carries a `coverage_threshold` (default 0.6 for SKU indices, 0.5 for baskets). Snapshots below the threshold are still computed and stored. They are either published by manual override, with `publish_status = 'forced_publish'`, or included in the API response with the low coverage shown in `meta`. The API response always includes `coverage_score` so consumers can weight thin snapshots appropriately.
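A minimal sketch of both coverage formulas, assuming the store counts are already available per constituent. The function names and sample counts are hypothetical.

```python
def sku_coverage(n_observed, n_ever):
    """Stores observed this period divided by stores that ever carried the UPC."""
    return n_observed / n_ever if n_ever else 0.0

def basket_coverage(per_constituent):
    """Mean of per-constituent coverage; input is (n_observed, n_ever) pairs."""
    scores = [sku_coverage(obs, ever) for obs, ever in per_constituent]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical three-constituent basket checked against the 0.5 default.
score = basket_coverage([(80, 100), (30, 100), (50, 100)])
below_threshold = score < 0.5
print(round(score, 3), below_threshold)
```

Treating a zero denominator as zero coverage is an assumption of the sketch.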

### Beta period (April 2026) vs Live period (May 2026 onward)

April 2026 is the **beta** period for v1. Some chain crons did not fire on a normal schedule during April; coverage is therefore non-uniform. April snapshots are stored with `publish_status = 'beta'` and excluded from the default API response unless `?include_beta=true` is set.

**May 2026 is the official live inception** for all v1 indices. Snapshots from May onward are `publish_status = 'auto'` and form the published series. The base period (`base_period`) is set to `'2026-05'` for all v1 indices; the indexed value (§5) is normalized against the May 2026 absolute USD value.

---

## 5. Indexed value normalization

Each snapshot stores its absolute `value_usd`. The indexed value (where base period = 100) is computed at query time:

```
value_indexed = value_usd / published_indices.base_value_usd × 100
```

`base_value_usd` is locked at the first publishable snapshot for the base period (May 2026 for all v1 indices; see §4). It does not change unless the index methodology version is bumped (§6), at which point a new base may be set.

The API response always returns both `value_usd` (absolute) and `value_indexed` (normalized) so consumers can use whichever is appropriate for their purpose. Hedge funds and economists typically prefer indexed; journalists and consumers typically prefer absolute.
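The normalization is a single division. The sketch below assumes two-decimal rounding and returns `None` when no base value has been locked yet; both choices are illustrative, not confirmed behavior.

```python
def value_indexed(value_usd, base_value_usd):
    """Index value with base period = 100; None if the base is not yet set."""
    if base_value_usd is None or base_value_usd <= 0:
        return None
    return round(value_usd / base_value_usd * 100, 2)

# Hypothetical values: a $10.50 snapshot against a $10.00 base.
print(value_indexed(10.50, 10.00))
print(value_indexed(10.26, None))
```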

---

## 6. Methodology versioning

Constituent UPCs, floor/ceiling bounds, and aggregation methods are immutable per `methodology_version`. When any of these change, we **bump the version** on `published_indices.current_methodology_version`. Snapshots store the version active at compute time and stay frozen at that version forever.

**This means:**

- A series labeled `methodology_version = 1` always uses the same constituents and rules
- A version bump effectively starts a new series alongside the old one
- Old snapshots are never silently revised when the methodology changes
- The API can serve historical snapshots at their original methodology version

When a constituent UPC is discontinued (e.g. Eggland's Best ceases producing Brown Cage Free 12ct), the policy is:
1. Continue computing using the discontinued UPC for as long as residual SPP coverage allows (typically 3 months)
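2. Add the replacement UPC as an additional constituent for an overlap period (3 months); because constituents are immutable per version, this addition itself requires a `methodology_version` bump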
3. Bump methodology_version to remove the discontinued UPC

The methodology document is also versioned. The current document version is shown at the top.

---

## 7. Revision policy

Snapshot values are **immutable at the row level** and the series is **append-only**; the only fields updated after publication are the revocation markers described below. If a published value is later found to be wrong (e.g. a chain reported corrupted prices that passed our floor/ceiling checks), the policy is:

1. Mark the original snapshot `revoked = true`, set `revoked_at` and `revoked_reason`
2. Compute a new corrected snapshot for the same (slug, period, region, methodology_version), with `replaces_snapshot_id` pointing back at the revoked one
3. Default API responses return only non-revoked snapshots
4. Audit/transparency consumers can request `?include_revoked=true` to see the revision history

This mirrors institutional revision practice (BEA, BLS, IMF). Published values are never silently overwritten.
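The revoke-and-replace flow can be sketched with an in-memory list standing in for the snapshot table. The `Snapshot` class, its reduced field set (`revoked_at` is omitted for brevity), and the helper names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Snapshot:
    id: int
    value_usd: float
    revoked: bool = False
    revoked_reason: Optional[str] = None
    replaces_snapshot_id: Optional[int] = None

def revise(series, bad_id, corrected_value, reason):
    """Mark `bad_id` revoked and append a corrected snapshot pointing back at it."""
    new_id = max(s.id for s in series) + 1
    out = [
        replace(s, revoked=True, revoked_reason=reason) if s.id == bad_id else s
        for s in series
    ]
    out.append(Snapshot(new_id, corrected_value, replaces_snapshot_id=bad_id))
    return out

def default_view(series, include_revoked=False):
    """Default responses hide revoked rows; audit consumers can include them."""
    return [s for s in series if include_revoked or not s.revoked]

series = revise([Snapshot(1, 10.26)], bad_id=1, corrected_value=10.31,
                reason="corrupted chain feed")
print([s.id for s in default_view(series)])
```

The original value stays in the series with its `value_usd` untouched, so the revision history remains reconstructable.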

---

## 8. Refresh cadence and snapshot timing

**Internal collection cadence:** weekly. Every Sunday at 02:00 UTC, the `web_price_snapshots` cron (migration 170) freezes a copy of all currently-scraped store prices.

**Published index cadence:** monthly. On the 1st of each month at 04:00 UTC, the `compute_index_snapshot` cron computes the prior month's snapshot for every active index × region combination. A month's snapshot is therefore available on the first day of the following month.

Within-month snapshots may be available via `?include_preview=true` for QA but are not part of the published series.
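For readers reproducing the schedule, the period a monthly run targets can be derived from the run date. This is a small sketch; the helper name `prior_period` is hypothetical, and the `YYYY-MM` format follows the `period` values shown elsewhere in this document.

```python
from datetime import date, timedelta

def prior_period(run_date):
    """Return the 'YYYY-MM' period for the month before `run_date`."""
    last_day_of_prior_month = run_date.replace(day=1) - timedelta(days=1)
    return last_day_of_prior_month.strftime("%Y-%m")

print(prior_period(date(2026, 6, 1)))
print(prior_period(date(2027, 1, 1)))
```

Stepping back one day from the first of the month handles the year boundary without special cases.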

---

## 9. Anomaly detection

When a new snapshot is computed, two anomaly checks run automatically:

1. **MoM delta check.** If `|value_usd - prior_period_value_usd| / prior_period_value_usd > 0.10`, the snapshot is flagged `flagged_for_review = true` with `flag_reason = 'mom_delta_>10pct'`.
2. **Coverage check.** If `coverage_score < published_indices.coverage_threshold`, the snapshot is flagged with `flag_reason = 'low_coverage'`.

Flagged snapshots are still published by default but include `meta.flagged: true, meta.flag_reason: ...` in the API response. Flagging triggers a manual review by the data team to confirm the value before consumers act on it. Investigating flagged snapshots is part of the standard monthly QA cadence.
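Both checks can be expressed in a few lines. The sketch assumes a snapshot may trip both checks at once and therefore returns a list of reasons; the document shows a single `flag_reason`, so that return shape and the function name `anomaly_flags` are assumptions.

```python
def anomaly_flags(value_usd, prior_value_usd, coverage_score, coverage_threshold):
    """Return the flag reasons that apply to a newly computed snapshot."""
    reasons = []
    # The MoM check is skipped when there is no prior period to compare with.
    if prior_value_usd and abs(value_usd - prior_value_usd) / prior_value_usd > 0.10:
        reasons.append("mom_delta_>10pct")
    if coverage_score < coverage_threshold:
        reasons.append("low_coverage")
    return reasons

# Hypothetical snapshot: a 15% month-over-month move with healthy coverage.
print(anomaly_flags(11.50, 10.00, coverage_score=0.9, coverage_threshold=0.6))
```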

---

## 10. Data sources and methodology citations

MainMarket's index methodology draws on the following authoritative sources:

- **U.S. Bureau of Labor Statistics (BLS)** — Average Retail Food Prices, [`APU` series](https://download.bls.gov/pub/time.series/ap/). The Egg Index methodology (single-SKU, monthly, regional medians) is adapted from BLS practice for `APU0000708111` (Eggs, Grade A, Large per dozen).
- **U.S. Department of Agriculture, Food and Nutrition Service, Center for Nutrition Policy and Promotion (USDA FNS CNPP)** — [Food Plans](https://www.fns.usda.gov/cnpp/usda-food-plans-cost-food-monthly-reports), specifically the Thrifty Food Plan 2021 Reevaluation and the Liberal Food Plan. The Low Income Basket constituents are anchored on Thrifty Plan categories; the High Income Basket on Liberal Plan categories.
- **U.S. Census Bureau** — Census Region definitions used for regional rollups.
- **Food and Agriculture Organization of the UN (FAO)** — [Food Price Index](https://www.fao.org/worldfoodsituation/foodpricesindex/) methodology. The MainMarket index types and the convention of publishing both absolute and indexed values are modeled on FAO practice.

---

## 11. Worked example — Soda Index, April 2026, National (beta)

This worked example walks through one snapshot computation by hand against the live MainMarket data so a reader can independently verify the formula.

**Inputs:**

| SKU | UPC | n_stores (April) | Median price |
|---|---|---|---|
| Pepsi Real Sugar Cola 12pk | 00012000030680 | 663 | $9.99 |
| Mountain Dew Diet 12pk | 00012000809972 | 720 | $9.99 |
| Dr Pepper Zero Sugar 12pk | 078000035261 | 506 | $10.49 |
| Diet Coke Caffeine Free 12pk | 00049000006131 | 460 | $10.49 |
| Diet Coke Soda Fridge Pack 12pk | 049000028911 | 698 | $10.49 |

(Medians shown reflect actual April 2026 SPP data. The one per-SKU outlier, the $45 Mountain Dew row described in §3, is excluded by the ceiling check.)

**Step 1: total store count for distribution weights:**

```
Σ n_stores = 663 + 720 + 506 + 460 + 698 = 3047
```

**Step 2: per-SKU weights:**

| SKU | n_stores | weight |
|---|---|---|
| Pepsi Real Sugar | 663 | 0.2176 |
| Mtn Dew Diet | 720 | 0.2363 |
| Dr Pepper Zero | 506 | 0.1660 |
| Diet Coke Caf-Free | 460 | 0.1510 |
| Diet Coke Fridge | 698 | 0.2291 |

**Step 3: weighted sum:**

```
value_usd = (9.99 × 0.2176) + (9.99 × 0.2363) + (10.49 × 0.1660)
          + (10.49 × 0.1510) + (10.49 × 0.2291)
        ≈ 2.174 + 2.361 + 1.741 + 1.584 + 2.403
        ≈ $10.26
```
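Readers who prefer to check the arithmetic programmatically can reproduce Steps 1 to 3 with the sketch below. It uses the medians and store counts from the Inputs table and unrounded weights, so intermediate figures may differ from the rounded table values in the fourth decimal place.

```python
# (median price, n_stores) pairs copied from the Inputs table.
inputs = [
    (9.99, 663),   # Pepsi Real Sugar
    (9.99, 720),   # Mtn Dew Diet
    (10.49, 506),  # Dr Pepper Zero
    (10.49, 460),  # Diet Coke Caf-Free
    (10.49, 698),  # Diet Coke Fridge
]

total_stores = sum(n for _, n in inputs)             # Step 1
weights = [n / total_stores for _, n in inputs]      # Step 2
value_usd = sum(price * w for (price, _), w in zip(inputs, weights))  # Step 3

print(total_stores, round(value_usd, 2))  # 3047 10.26
```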

**Snapshot row written to `index_snapshots`:**

```
slug:                  soda_12pk_12oz_cans
period:                2026-04
region:                national
methodology_version:   1
value_usd:             10.26
n_observations:        3047  (sum of per-SKU stores; same store may appear under multiple SKUs)
n_stores:              ~1500 (DISTINCT stores across all 5 SKUs)
n_chains:              ~25
coverage_score:        ~1.0 (full coverage)
method:                distribution_weighted_basket
publish_status:        beta  (April is beta period)
constituents_breakdown: [...]  (per-SKU stats as JSONB)
```

This snapshot's `value_indexed` (§5) is undefined until the May 2026 base value is set.

---

## 12. Limitations (honest disclosure)

- **US grocery only.** No restaurants, no convenience stores, no international markets.
- **Packaged retail goods.** Prepared deli items, weighted produce (sold by the pound at variable rates), and in-store-only specials (not on the chain's online catalog) are generally outside the catalog and outside the indices.
- **`chain_level` chains.** Some chains use uniform pricing across all stores. For those chains, only one representative store contributes to the index. This is disclosed via `meta.constituents_breakdown[i].chain_level_chains` for transparency.
- **Freshness varies by chain.** The observations behind a single snapshot may have been scraped on different days within the period. The index value reflects "best available current price" within the month, not a single-instant snapshot.
- **Beta period (April 2026) coverage is non-uniform.** Some chain crons did not fire on a normal schedule. Beta snapshots are excluded from the default API response and should be treated as preview data.
- **Income basket chain assignments are categorical.** A real low-income shopper may shop at multiple chain tiers. The `chain_filter` reflects the dominant shopping environment per income tier, not the only one. We may publish a "blended income basket" in v1.5 that mixes chain tiers per real shopping behavior.

---

## 13. Contact + data access

Methodology questions, audit requests, or institutional licensing inquiries: **zach@axe.software**.

The full `published_indices` registry (including current constituents, bounds, and chain filters per index) is queryable via `GET /v1/indices` (coming soon) or directly inspectable in `meta.constituents` of any `GET /v1/indices/{slug}` response.

Audit log of dropped rows is available on request.

---

*Methodology v1.0. Last updated 2026-05-04.*
