A comprehensive catalog of federal and local public data sources for building a national CBSA-level housing-market dashboard.
A national, automatable "CBSA dashboard" for the top 25 U.S. housing markets is feasible using public data, but you will almost certainly need a two-tier architecture: (a) a federal backbone that is consistent across geographies (permitting, prices, mortgage origination activity, national construction pipeline), and (b) a local-systems layer that is inconsistent but can unlock starts/under-construction/completions and "effective inventory" at metro scale when jurisdictions expose inspection and certificate-of-occupancy milestones.
The most automation-friendly, high-coverage backbone sources are: (i) the Building Permits Survey (BPS) micro/compiled files for permits at CBSA/county/place granularity, (ii) HMDA public APIs and file services for mortgage originations and lender activity by MSA/MD and county, (iii) FHFA HPI for home price dynamics at metro/county/ZIP/tract, and (iv) FRED APIs as a convenient distribution layer for many housing series and release calendars.
For the specific metrics you listed — permits, starts (quarterly), under construction, completions/closings, finished vacant inventory, and sales closings with mortgage vs. cash proxies — public sources map onto the two tiers above; the sections below walk through each source in turn.
Your stated target is the "top 25 U.S. housing markets (CBSAs)." That "top 25" definition is unspecified; in practice you should treat it as a parameterized cohort defined by a rule (population, housing stock, permit volume, transaction volume, etc.) so the pipeline remains stable when rankings change. The federal statistical definition and membership of CBSAs change over time, so your system needs a vintage-aware CBSA dimension.
A robust canonical geography layer usually needs a vintage-aware CBSA dimension, a CBSA-to-county bridge keyed by delineation vintage, and effective-dated cohort membership. The "top 25 CBSAs" list itself should be stored as a dataset with effective dating (effective_start/effective_end) rather than hardcoded.
Placeholder cohort (since the exact list is unspecified):
| Rank | CBSA code | CBSA name | Selection rule | Notes |
|---|---|---|---|---|
| 1 | <CBSA_01> | <CBSA_NAME_01> | <rule> | <placeholder> |
| 2 | <CBSA_02> | <CBSA_NAME_02> | <rule> | <placeholder> |
| … | … | … | … | … |
| 25 | <CBSA_25> | <CBSA_NAME_25> | <rule> | <placeholder> |
This section focuses on public, primary sources (or federal "official distribution layers") that are realistic to automate nationally.
Difficulty scoring is on a 1–5 scale (1 = straightforward bulk/API; 5 = restricted access, heavy normalization, or institution-by-institution constraints).
| Source | Primary metrics it supports | Cadence | Geo granularity | Automation interface | Automation difficulty |
|---|---|---|---|---|---|
| BPS (Building Permits) | Permit authorizations (monthly to quarterly; single vs multifamily typically available in BPS outputs) | Monthly revised; annual final | U.S., state, CBSA, county, place | Bulk downloads via Census files (Excel + ASCII); compiled master ZIP | 1–2 |
| NRC / Survey of Construction outputs | Starts, under construction, completions (authoritative) | Monthly; some quarterly tables | U.S. + Census regions (not CBSA) | Downloadable tables (XLSX) | 2 (national), 4 (metro via modeling) |
| NRS / Survey of Construction outputs | New homes sold; new-home inventory; "for sale by stage of construction" (incl. completed) | Monthly; some quarterly tables | U.S. + Census regions (not CBSA) | Downloadable tables (XLSX) | 2 (national), 4 (metro via modeling) |
| HUD–USPS vacancy/no-stat | Vacancy and "no-stat" counts; growth/decline signals; "no-stat includes under construction" | Quarterly | Typically very granular (down to neighborhood geographies in the HUD product), but access-gated | Portal download + license constraints | 5 (access restricted) |
| HMDA (public) | Mortgage originations/denials; lender activity; financed home-purchase proxy | Annual (filing year); query APIs support subsetting | Nationwide; MSA/MD, state, county via filters | Public file service + public query API; additional static datasets via S3 | 2–3 |
| FHFA HPI | Price index changes; metro/county/ZIP/tract price dynamics | Monthly + quarterly | National, division, state, metro, county, ZIP, tract | Direct downloads + dataset catalogs | 1–2 |
| FRED | Convenient API to many housing/macroeconomic series (including some NRC/NRS-derived series) | Varies by series | Varies (national; some regional/geo series) | REST API w/ key; multiple formats | 1–2 |
Below are concrete, automatable endpoints (or stable download locations) and the specific fields/series that tend to matter for your use case.
BPS explicitly states that data are available monthly/YTD/annual at CBSA (formerly MSA), county, and place levels, making it the primary national backbone for metro permitting.
BPS release cadence is operationally important: preliminary permits appear with the NRC press release on the 12th workday (U.S./region only), while revised permits by metro/county/place are released on the 17th workday; annual final is usually the first workday of May.
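The workday arithmetic above is easy to automate for scheduling. A minimal sketch that counts weekdays only — it ignores federal holidays, which the real Census release calendar observes, so treat it as an approximation:

```python
from datetime import date, timedelta

def nth_workday(year: int, month: int, n: int) -> date:
    """Return the nth weekday (Mon-Fri) of a month, ignoring holidays."""
    d = date(year, month, 1)
    count = 0
    while True:
        if d.weekday() < 5:  # Mon=0 .. Fri=4
            count += 1
            if count == n:
                return d
        d += timedelta(days=1)

# BPS preliminary permits ship with the NRC release (~12th workday);
# revised metro/county/place permits follow (~17th workday).
print(nth_workday(2025, 1, 12))  # 2025-01-16
print(nth_workday(2025, 1, 17))  # 2025-01-23
```

Schedule the ingestion job a day or two after these dates to absorb holiday slippage.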
Programmatic download surfaces:
# Census BPS bulk directories (machine-friendly listings)
https://www2.census.gov/econ/bps/
# Master compiled dataset (large ZIP) + doc + sample
https://www2.census.gov/econ/bps/Master%20Data%20Set/
- BPS Compiled_YYYYMM.zip
- Compiled Data Documentation.docx
- Compiled File Sample.csv
The public directory structure includes separate folders for CBSA, county, place, etc., which is ideal for a scheduled ingestion job.
Key transformations: parse the compiled files against the documentation layout, apply the vintage-aware CBSA/county mapping, separate single-family from multifamily units, and aggregate monthly issuance into quarterly totals.
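The quarterly aggregation step is trivial once rows are parsed; a minimal sketch (the column names here are assumptions — the real compiled file's layout is defined in the documentation DOCX shipped alongside the ZIP):

```python
from collections import defaultdict

def quarterly_permits(rows):
    """Aggregate monthly permit rows to (cbsa, year, quarter) totals.
    Each row: dict with assumed keys 'cbsa', 'year', 'month', 'units'."""
    out = defaultdict(int)
    for r in rows:
        q = (int(r["month"]) - 1) // 3 + 1  # 1..12 -> Q1..Q4
        out[(r["cbsa"], int(r["year"]), q)] += int(r["units"])
    return dict(out)

# 26420 = Houston CBSA; unit counts are illustrative
sample = [
    {"cbsa": "26420", "year": "2024", "month": "10", "units": "3500"},
    {"cbsa": "26420", "year": "2024", "month": "11", "units": "3200"},
    {"cbsa": "26420", "year": "2024", "month": "12", "units": "2900"},
]
print(quarterly_permits(sample))  # {('26420', 2024, 4): 9600}
```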
NRC outputs are the authoritative federal series for starts and completions and explicitly cover new, privately-owned units (including apartments/condos) and exclude HUD-code manufactured homes.
The historical series page provides downloadable tables for permits, starts, units under construction, and completions, plus quarterly tables by purpose/design.
Download endpoints:
# Examples of NRC historical XLSX tables (files are hosted under /construction/nrc/xls/)
https://www.census.gov/construction/nrc/xls/permits_cust.xlsx
https://www.census.gov/construction/nrc/xls/starts_cust.xlsx
https://www.census.gov/construction/nrc/xls/under_cust.xlsx
https://www.census.gov/construction/nrc/xls/comps_cust.xlsx
# Quarterly starts by purpose and design (example)
https://www.census.gov/construction/nrc/xls/starts_quarterly_cust.xlsx
NRC is indispensable for calibrating metro-level estimates, because it provides the national/region-level ground truth for the pipeline that permits eventually flow into.
The NRS program explicitly provides: new houses sold and for sale, and houses for sale by stage of construction, which is the closest federal proxy to "finished vacant new-home inventory" (completed, for-sale, unsold) in a consistent series.
Key downloadable tables include "sold and for sale by stage of construction."
# Examples of NRS historical XLSX tables (hosted under /construction/nrs/xls/)
https://www.census.gov/construction/nrs/xls/sold_cust.xlsx
https://www.census.gov/construction/nrs/xls/fsale_cust.xlsx
https://www.census.gov/construction/nrs/xls/stage_cust.xlsx
For your end metrics, NRS is best used in two ways: as national/region ground truth for calibrating metro-level models, and as the source of stage-of-construction shares (e.g., the completed share of for-sale inventory) to apply to metro pipelines.
HMDA public data is hosted on the FFIEC HMDA platform and includes loan-level information modified for privacy; it is designed for public disclosure and policy analysis.
The HMDA "file serving" documentation provides stable endpoints for retrieving institution-specific Modified LAR files and states that other files are served from a public S3 bucket prefix.
# Modified LAR file serving (institution-by-institution)
https://ffiec.cfpb.gov/file/modifiedLar/year/{year}/institution/{lei}/csv
https://ffiec.cfpb.gov/file/modifiedLar/year/{year}/institution/{lei}/csv/header
https://ffiec.cfpb.gov/file/modifiedLar/year/{year}/institution/{lei}/txt
https://ffiec.cfpb.gov/file/modifiedLar/year/{year}/institution/{lei}/txt/header
# Public S3 prefix for other HMDA publication files
https://files.ffiec.cfpb.gov/
The HMDA Data Browser API documentation is particularly valuable for your use case because it supports both nationwide and geography-filtered aggregations as well as CSV subsets:
# Aggregations
GET https://ffiec.cfpb.gov/v2/data-browser-api/view/nationwide/aggregations
GET https://ffiec.cfpb.gov/v2/data-browser-api/view/aggregations
# CSV subsets
GET https://ffiec.cfpb.gov/v2/data-browser-api/view/nationwide/csv
GET https://ffiec.cfpb.gov/v2/data-browser-api/view/csv
Geographic filters include msamds (5-digit MSA/MD), states, and counties (5-digit county FIPS) — meaning you can produce CBSA-level views either directly via MSA/MD or by aggregating counties that belong to the CBSA delineation.
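A small URL builder makes the county-aggregation path concrete. The parameter names (counties, years, actions_taken) follow the public Data Browser documentation, but verify them against the current API before relying on them:

```python
import urllib.parse

BASE = "https://ffiec.cfpb.gov/v2/data-browser-api/view"

def hmda_csv_url(counties, years, actions_taken=("1",)):
    """Build a Data Browser CSV-subset URL filtered to 5-digit county FIPS.
    actions_taken '1' = loan originated (financed-purchase proxy)."""
    params = {
        "counties": ",".join(counties),
        "years": ",".join(str(y) for y in years),
        "actions_taken": ",".join(actions_taken),
    }
    return BASE + "/csv?" + urllib.parse.urlencode(params)

# Harris County, TX (48201) -- one member county of the Houston CBSA
print(hmda_csv_url(["48201"], [2023]))
```

Summing results over a CBSA's member counties (from the bridge table) yields the CBSA-level view when a direct MSA/MD filter is not a clean match.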
FHFA HPI provides a broad, transparent methodology and explicitly offers geographic detail down to metro area, county, ZIP code, and census tract, alongside national/state/division levels.
A particularly automation-friendly direct download is the master CSV published by FHFA (also indexed in Data.gov's catalog).
# FHFA master HPI CSV
https://www.fhfa.gov/hpi/download/monthly/hpi_master.csv
FHFA also publishes its own release dates for monthly and quarterly updates, which helps you schedule ingestion and backfills.
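Filtering the master CSV down to one metro is a one-pass operation. The column names below (hpi_type, level, place_id, yr, period, index_nsa) are assumptions based on the published file; check the downloaded header before hardcoding them:

```python
import csv, io

def metro_hpi(csv_text, cbsa_code):
    """Extract (year, period, index_nsa) tuples for one metro from
    hpi_master.csv text. Column names are assumed; verify on download."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        (int(r["yr"]), int(r["period"]), float(r["index_nsa"]))
        for r in rows
        if r["level"] == "MSA" and r["place_id"] == cbsa_code
    ]

# Illustrative two-row sample in the assumed layout
sample = """hpi_type,level,place_name,place_id,yr,period,index_nsa
traditional,MSA,Houston TX,26420,2024,1,310.5
traditional,State,Texas,TX,2024,1,290.1
"""
print(metro_hpi(sample, "26420"))  # [(2024, 1, 310.5)]
```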
FRED's API requires an API key and provides standard endpoints such as fred/series/observations, which can return XML/JSON/XLSX/CSV depending on parameters.
# Example endpoint for time series observations
https://api.stlouisfed.org/fred/series/observations?series_id={SERIES_ID}&api_key={KEY}&file_type=json
FRED's API documentation enumerates release/series endpoints and also describes a Maps API for geographically keyed series where available.
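A request builder for the observations endpoint; series_id HOUST (total housing starts, SAAR) is a real FRED series, and observation_start is a documented parameter:

```python
import urllib.parse

def fred_observations_url(series_id, api_key, start=None):
    """Build a fred/series/observations request returning JSON."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    if start:
        params["observation_start"] = start  # YYYY-MM-DD
    return ("https://api.stlouisfed.org/fred/series/observations?"
            + urllib.parse.urlencode(params))

print(fred_observations_url("HOUST", "YOUR_KEY", "2020-01-01"))
```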
A recurring theme in housing-related public data is that agencies publish data through federal catalogs and geospatial hubs, which are highly automatable if you standardize on a small set of access patterns.
Data.gov's catalog pages often include direct downloads (CSV/ZIP/GeoJSON/Shapefile) and stable "identifier" keys that link to upstream systems. For example, "Residential Construction Permits by County" is described on its catalog page and offers direct download resources.
The same page provides a machine-usable ArcGIS download URL pattern (format parameter, spatial reference, and where=1=1).
# ArcGIS Open Data download API pattern (example from the dataset page)
https://opendata.arcgis.com/api/v3/datasets/{ITEMID}_{LAYER}/downloads/data?format=fgdb&spatialRefId=4326&where=1%3D1
https://opendata.arcgis.com/api/v3/datasets/{ITEMID}_{LAYER}/downloads/data?format=shp&spatialRefId=4326&where=1%3D1
Data.gov itself is backed by CKAN; its user guide documents the CKAN API base URL and advises using package_search (since package_list is disabled). For higher-volume crawling of metadata, GSA also exposes an API gateway endpoint requiring an API key in headers.
Many federal and local open data portals use the same "query a feature layer" semantics. The ArcGIS REST API documents the canonical FeatureServer/<layerId>/query endpoint and how to query IDs vs. feature sets. Common parameters like output format f=json and optional tokens for authenticated resources are also standardized.
This matters because once you build a robust ArcGIS connector, you can reuse it across the many federal, state, and local portals that expose the same feature-service semantics.
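A generic connector largely reduces to a paged query builder. A minimal sketch using the documented resultOffset/resultRecordCount paging (the layer URL below is a placeholder; real servers cap page size via maxRecordCount and flag truncation with exceededTransferLimit in the response):

```python
import urllib.parse

def arcgis_query_url(layer_url, offset=0, page_size=1000, where="1=1"):
    """Build a FeatureServer layer query URL with offset paging.
    Check exceededTransferLimit in the JSON response to decide whether
    to fetch the next page."""
    params = {
        "where": where,
        "outFields": "*",
        "f": "json",
        "resultOffset": offset,
        "resultRecordCount": page_size,
    }
    return layer_url + "/query?" + urllib.parse.urlencode(params)

url = arcgis_query_url(
    "https://example.com/arcgis/rest/services/Permits/FeatureServer/0",
    offset=2000)
print(url)
```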
To automate construction pipeline metrics at metro scale, your biggest leverage comes from standardizing how you ingest permit lifecycle events (application → issuance → inspections → final/CO) from dozens of issuing authorities, even if each publishes differently.
Accela's REST "Getting Started" documentation is unusually explicit about required headers and auth patterns: it uses headers such as Authorization (access token), x-accela-appid, and environment/agency headers, with content negotiation via Content-Type/Accept.
It also documents offset-based pagination with max limits (up to 1000 per request) and provides request samples.
For automation governance, Accela's docs also mention response rate limit headers (x-ratelimit-limit, x-ratelimit-remaining, x-ratelimit-reset) and warn that agencies can customize record type definitions — meaning your normalization layer cannot assume consistent "permit type" taxonomies across jurisdictions.
Automation implication: Accela itself is "API-friendly," but your integration effort is dominated by per-agency configuration and taxonomy mapping.
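A connector skeleton for the header and pagination conventions the docs describe (the record-listing path itself varies per deployment and is deliberately not shown here):

```python
def accela_page_params(limit=1000, offset=0):
    """Offset pagination as documented, with the 1000-per-request cap."""
    return {"limit": min(limit, 1000), "offset": offset}

def accela_headers(access_token, app_id):
    """Headers named in Accela's Getting Started docs; environment/agency
    headers required by some deployments are omitted for brevity."""
    return {
        "Authorization": access_token,
        "x-accela-appid": app_id,
        "Accept": "application/json",
    }

print(accela_page_params(limit=5000, offset=2000))  # {'limit': 1000, 'offset': 2000}
print(accela_headers("tok", "my-app")["x-accela-appid"])  # my-app
```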
A key reality for automation is that many local governments run permitting workflows through enterprise suites; Tyler's API Catalog describes a "Permits and Code Enforcement API Toolkit" that provides programmatic access to building permits (including new construction/additions), inspections, and code enforcement resources and processes.
Even when full APIs are not openly documented, public-facing portals often expose consistent milestone interactions. A public PDF describing EnerGov and its portal capabilities notes that ePortal supports online permit and plan submission/payment and online inspection requests/cancellations — i.e., the same lifecycle events you need to reconstruct starts/under construction/completions.
Some jurisdictions also expose permitting-related datasets through ArcGIS feature services with vendor-branded service namespaces (illustrative for discovery, not universal).
OpenGov maintains a developer portal and publishes API-facing documentation (though some pages rely on client-side rendering). Their materials describe API usage in terms of product suites including Permitting & Licensing and API key handling.
Automation implication: you should treat OpenGov similarly to Accela: build a reusable connector if you can get credentials, but assume per-jurisdiction onboarding and schema exploration.
When a city publishes permit data publicly, it is often via a general-purpose platform rather than the permitting vendor API. Two high-yield patterns:
Socrata's documentation describes consistent endpoint construction using dataset identifiers and supports multiple formats (CSV/JSON/GeoJSON/XML).
It also provides a concrete approach to rate management: application tokens increase throttling limits and can be passed via X-App-Token or query parameters depending on API version.
Pagination and maximum limits depend on the endpoint version, and Socrata documents these constraints.
Automation implication: Socrata is one of the easiest sources to automate (once you locate the dataset and interpret fields), but coverage varies by city.
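A SoQL query-builder sketch; the domain and dataset identifier below are placeholders, and the field name in the $where example is an assumption that must be mapped per dataset:

```python
import urllib.parse

def socrata_url(domain, dataset_id, limit=1000, offset=0, where=None):
    """Build a SODA resource query with stable-order paging.
    Pass the app token via the X-App-Token header on the actual request."""
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    if where:
        params["$where"] = where  # SoQL filter
    return (f"https://{domain}/resource/{dataset_id}.json?"
            + urllib.parse.urlencode(params))

# Placeholder domain/dataset; 'permit_type' is a hypothetical field name
print(socrata_url("data.example.gov", "abcd-1234",
                  where="permit_type = 'New Construction'"))
```

Ordering by the system field :id keeps offset paging stable across requests.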
Some local governments publish permit datasets conforming to the BLDS open data specification; the Data.gov listing for one city explicitly states its building permit application dataset conforms to BLDS and clarifies a real-world modeling issue: "applications" may map to multiple permits.
Automation implication: BLDS is a practical canonical schema for "issued permits" and "applications," but you will still need extensions for inspections, certificates of occupancy, and parcel/legal identifiers.
A scalable approach is to normalize everything into an event-sourced permit lifecycle, because different systems publish different "current status" fields but often expose timestamps for status changes.
Recommended canonical objects: a raw event stream (source system, jurisdiction, record ID, event type, timestamp, original payload) and a normalized lifecycle fact keyed by jurisdiction, record, event type, and timestamp — mirroring the raw_permit_events and fact_permit_lifecycle tables in the schema section.
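The normalization core is a per-jurisdiction status-to-event mapping. The status strings below are illustrative, not drawn from any real system; in practice this table is built during onboarding for each source:

```python
# Illustrative source statuses -> canonical event vocabulary
CANONICAL_EVENTS = {
    "application received": "applied",
    "permit issued": "issued",
    "inspection passed": "inspection_passed",
    "certificate of occupancy": "co_issued",
    "finaled": "finaled",
}

def to_event(raw_status: str) -> str:
    """Map a source-system status string to a canonical event type;
    unmapped statuses are surfaced as 'unknown' for triage."""
    return CANONICAL_EVENTS.get(raw_status.strip().lower(), "unknown")

print(to_event("Permit Issued"))   # issued
print(to_event("Closed - Misc"))   # unknown
```

Routing unmapped statuses to an "unknown" queue rather than dropping them keeps taxonomy drift visible.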
If your goal is true closings (not just financed originations), the most direct public signal is the recording of deeds and related instruments, but automation is uneven.
A county clerk-recorder description illustrates the "ideal" public-record model: the recorder preserves and archives documents relating to real property transactions; documents are recorded, indexed, digitally archived, and made available to the public (often with fees), and recording makes a document part of the public record.
However, automation constraints can be severe. One large county explicitly states it no longer offers online search of the official record index; records are only available for purchase in person or by mail for privacy protections.
National automation strategy: treat recorder ingestion as a CBSA-by-CBSA (and county-by-county) program with a prioritized rollout, rather than assuming a single national API.
Because NRC/NRS do not provide CBSA-level starts/under-construction/completions, you need a modeling layer that combines federal truth series with metro permit issuance and (where possible) local milestone data.
This approach respects the fact that NRC provides official pipeline totals while BPS provides local permitting volumes; the model is effectively a translation layer between "authorized" and "in progress/completed" at metro scale.
HUD's USPS data description notes that "no-stat" can include addresses for homes under construction and suggests comparing total address counts and no-stat changes to distinguish growth vs. decline areas.
If you qualify for access, you can use changes in "no-stat residential" counts as a supplementary signal for construction activity and demolition, but the dataset's access restrictions and methodological caveats (e.g., USPS operational changes) make it a high-friction dependency.
A workable "metro finished vacant new-home inventory" estimate often looks like: cumulative modeled completions minus cumulative absorptions (recorder closings or HMDA-derived financed purchases), cross-checked against the NRS completed share of for-sale inventory applied to the metro pipeline.
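The estimate above is a simple stock-flow recursion; a sketch with illustrative numbers (clamping at zero absorbs measurement error in either input series):

```python
def finished_vacant_series(completions, closings, initial=0):
    """Stock update: FV_t = max(FV_{t-1} + completions_t - closings_t, 0).
    Inputs are per-period counts, e.g. modeled completions and
    recorder/HMDA-derived absorptions."""
    fv, out = initial, []
    for c, s in zip(completions, closings):
        fv = max(fv + c - s, 0)
        out.append(fv)
    return out

# Illustrative quarterly counts, not real data
print(finished_vacant_series([300, 280, 310], [250, 320, 290], initial=120))
# [170, 130, 150]
```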
These are practical engineering estimates for building a reliable, monitored ingestion pipeline (not a one-off download script). Times assume one experienced data engineer plus basic infrastructure.
| Source / layer | Typical automation work | Difficulty | "Time to first reliable ingestion" |
|---|---|---|---|
| BPS bulk files | Scheduled downloads + parsing + CBSA vintaging handling + quarterly aggregation | 1–2 | ~2–5 days |
| NRC/NRS tables | Scheduled downloads + parsing + versioning + release lag handling | 2 | ~3–7 days |
| FHFA HPI master CSV | Scheduled pull + incremental update detection + geo key normalization | 1–2 | ~2–4 days |
| HMDA Data Browser API | Parameterized query builder; caching; streaming CSV handling; MSA/MD-to-CBSA mapping | 2–3 | ~1–3 weeks |
| Data.gov discovery | Metadata harvesting; rules to pick authoritative datasets per CBSA; monitoring link rot | 2–3 | ~1–2 weeks |
| ArcGIS Open Data ingestion | Generic ArcGIS connector (query + downloads) + schema inference + paging/stats | 2–3 | ~1–2 weeks |
| Socrata ingestion | Generic Socrata connector + dataset discovery + field mapping to permit schema | 2–3 | ~1–2 weeks |
| Vendor permitting APIs (Accela/OpenGov/Tyler) | Credentialed per-jurisdiction onboarding + taxonomy mapping + delta sync | 3–5 | ~2–6 weeks per platform + onboarding per agency |
| County recorder ingestion | Highly variable portal automation + legal/ToS review + per-county customization | 4–5 | ~2–8+ weeks per county (often not scalable purely by engineering) |
Accela's API is technically well specified (headers, pagination, rate-limit headers), but the hard part is agency-specific configuration and semantic normalization. Socrata is typically "easy-mode" from a pure API perspective (identifiers, multiple formats, app token throttling control).
A warehouse-friendly schema that supports both "federal backbone" and "local lifecycle" data:
-- Geography and cohort
create table dim_cbsa (
cbsa_code varchar primary key,
cbsa_name varchar,
delineation_vintage date,
effective_start date,
effective_end date
);
create table bridge_cbsa_county (
cbsa_code varchar,
county_fips char(5),
delineation_vintage date,
primary key (cbsa_code, county_fips, delineation_vintage)
);
-- Federal: permits (BPS)
create table fact_bps_permits (
ref_month date, -- first day of month
cbsa_code varchar,
county_fips char(5) null, -- null for CBSA-level rows
place_fips varchar null, -- null for CBSA/county-level rows
units_authorized integer,
source_file varchar,
revised_flag boolean,
load_ts timestamp
);
-- expressions are not allowed in a primary key, so enforce the grain
-- with a functional unique index instead
create unique index ux_fact_bps_permits on fact_bps_permits
(ref_month, cbsa_code, coalesce(county_fips, '-'), coalesce(place_fips, '-'), revised_flag);
-- Federal: construction pipeline (NRC)
create table fact_nrc_pipeline (
ref_month date,
geo_level varchar, -- US, REGION
geo_code varchar, -- e.g. US, NE, MW, S, W
metric varchar, -- permits, starts, under_construction, completions, auth_not_started
units_saar numeric,
units_nsa numeric null,
source_table varchar,
load_ts timestamp,
primary key (ref_month, geo_level, geo_code, metric)
);
-- Federal: new residential sales (NRS)
create table fact_nrs_inventory (
ref_month date,
geo_level varchar, -- US, REGION
geo_code varchar,
metric varchar, -- sold, for_sale, for_sale_completed, for_sale_under_construction, etc.
value numeric,
source_table varchar,
load_ts timestamp,
primary key (ref_month, geo_level, geo_code, metric)
);
-- Federal: HMDA aggregates (for CBSA dashboards)
create table fact_hmda_aggregations (
activity_year integer,
geo_level varchar, -- MSA_MD, STATE, COUNTY, CBSA_DERIVED
geo_code varchar,
filter_hash varchar, -- stable signature of parameter set
loans_count bigint,
loan_amount_sum numeric,
load_ts timestamp,
primary key (activity_year, geo_level, geo_code, filter_hash)
);
-- Local: event-sourced permitting lifecycle (raw + normalized)
create table raw_permit_events (
source_system varchar, -- Accela, Socrata, ArcGIS, etc.
jurisdiction_id varchar,
record_id varchar,
event_type varchar, -- applied, issued, inspection_scheduled, inspection_passed, co_issued, finaled
event_ts timestamp,
payload_json jsonb,
load_ts timestamp
);
create table fact_permit_lifecycle (
jurisdiction_id varchar,
record_id varchar,
cbsa_code varchar,
county_fips char(5),
event_type varchar,
event_ts timestamp,
work_type varchar,
units integer null,
valuation numeric null,
primary key (jurisdiction_id, record_id, event_type, event_ts)
);
-- Recorder / deeds (where available)
create table fact_deed_transfers (
county_fips char(5),
recording_date date,
document_type varchar,
sales_price numeric null,
loan_amount numeric null,
is_cash boolean null,
parcel_id varchar null,
load_ts timestamp
);
The following diagram describes the overall data flow from federal and local sources through the platform's ingestion, normalization, and serving layers.
flowchart LR
subgraph FederalBackbone
BPS[BPS permits files]
NRC[NRC tables]
NRS[NRS tables]
HMDA[HMDA APIs + files]
FHFA[FHFA HPI downloads]
FRED[FRED API]
CBSADef[CBSA delineation files]
end
subgraph LocalLayer
OPD["Open data portals (Socrata / ArcGIS / CKAN)"]
Vendor["Permitting vendor APIs (Accela / OpenGov / Tyler)"]
Recorder[County recorder portals]
end
subgraph Platform
Ingest[Scheduled ingestion + backfills]
Raw["Raw lake (original files + JSON)"]
Normalize["Normalization + entity resolution (address/parcel/geo keys)"]
Facts["Fact tables + aggregates (monthly/quarterly CBSA metrics)"]
Serve[API + dashboard layer]
Monitor[QA + anomaly detection, release lag + schema drift]
end
BPS --> Ingest
NRC --> Ingest
NRS --> Ingest
HMDA --> Ingest
FHFA --> Ingest
FRED --> Ingest
CBSADef --> Normalize
OPD --> Ingest
Vendor --> Ingest
Recorder --> Ingest
Ingest --> Raw --> Normalize --> Facts --> Serve
Facts --> Monitor
Raw --> Monitor
The diagram above is expressed in Mermaid syntax. The data flow proceeds from federal backbone and local-layer sources through scheduled ingestion into a raw lake, then through normalization into fact tables, and finally to an API/dashboard serving layer with QA monitoring.
Use release-aware scheduling to reduce churn. BPS documents the distinction between preliminary and revised permit releases and their typical workday timing, which is important for your "Q4 last year" reporting cutoffs. FHFA publishes a forward calendar of monthly/quarterly release dates.
gantt
title Typical public-data refresh cadence (conceptual)
dateFormat YYYY-MM-DD
axisFormat %b %Y
section Monthly
BPS revised permits (metro/county/place) :active, 2026-01-01, 2026-12-31
NRC monthly construction pipeline :active, 2026-01-01, 2026-12-31
NRS monthly new-home sales/inventory :active, 2026-01-01, 2026-12-31
FHFA monthly HPI :active, 2026-01-01, 2026-12-31
section Quarterly / annual
NRC quarterly purpose/design tables :active, 2026-01-01, 2026-12-31
FHFA quarterly HPI :active, 2026-01-01, 2026-12-31
HMDA annual filing year publication :active, 2026-01-01, 2026-12-31
HUD-USPS vacancy/no-stat (restricted) :active, 2026-01-01, 2026-12-31
The Gantt chart above is expressed in Mermaid syntax. It shows that BPS, NRC, NRS, and FHFA HPI refresh monthly, while NRC quarterly tables, FHFA quarterly HPI, HMDA annual publications, and HUD-USPS vacancy data refresh on longer cycles.
| Source | Cadence | Typical release timing |
|---|---|---|
| BPS revised permits (metro/county/place) | Monthly | 17th workday of month |
| NRC construction pipeline | Monthly | 12th workday of month |
| NRS new-home sales/inventory | Monthly | Monthly release |
| FHFA HPI | Monthly + Quarterly | Published release calendar |
| NRC quarterly tables (purpose/design) | Quarterly | Quarterly release |
| HMDA annual filing | Annual | Annual publication |
| HUD-USPS vacancy/no-stat | Quarterly | Access restricted |
Use this card to operationalize rollout across the top-25 CBSAs and keep a structured view of what is automatable.
cbsa_availability_card:
cbsa_code: "<CBSA>"
cbsa_name: "<NAME>"
delineation_vintage: "2023-07-01" # example
counties:
- county_fips: "_____" # list all member counties
county_name: "<COUNTY>"
recorder_access:
mode: ["open_search", "login_required", "paywall", "in_person_only"]
automation_risk: ["low", "medium", "high"]
notes: "<constraints>"
permitting_authorities:
- authority_name: "<CITY/COUNTY DEPT>"
system_type: ["Accela", "Tyler/EnerGov", "OpenGov", "custom", "unknown"]
public_data_surface:
- type: ["Socrata", "ArcGIS", "CKAN", "CSV", "HTML"]
endpoint: "<url>"
auth: ["none", "app_token", "token"]
fields_present:
permits_issued: true/false
valuation: true/false
units: true/false
inspections: true/false
certificate_of_occupancy_or_final: true/false
normalization_notes: "<permit types/status mapping>"
federal_backbone_coverage:
bps_cbsa_permits: true
hmda_msamd_or_county: true
fhfa_hpi_metro_or_county: true
usps_vacancy_restricted: "unknown/eligible/not_eligible"
computed_metrics_feasibility:
permits_qtr: "high"
starts_qtr: "medium (modeled)"
under_construction_stock: "medium (modeled)"
completions_qtr: "medium (modeled or local)"
finished_vacant_inventory: "medium/low (depends on local CO + closings)"
closings_total: "low/medium (depends on recorder)"
overall_score:
data_completeness: 0-100
automation_effort_weeks: "<estimate>"
Finally, for the original motivating example — "how many homes started in Q4 last year in Houston; how many closed; how many under construction; how many finished vacant" — the federal system can give you Q4 permits for the Houston CBSA directly (BPS), but Q4 starts/under-construction/completions and finished vacant inventory require either modeling calibrated to NRC/NRS or local milestone data plus recorder/HMDA-derived absorptions.