
Schema Storage Across Backends

Horizon Epoch stores schemas consistently regardless of the underlying storage backend, but the level of enforcement varies. This document explains how schemas are stored, introspected, and enforced across different backends.

Storage Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                           Metadata Database                               │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  table_versions                                                     │  │
│  │  ┌──────────────────────────────────────────────────────────────┐  │  │
│  │  │ id | table_id | commit_id | schema (JSONB)                   │  │  │
│  │  │ ----------------------------------------------------------   │  │  │
│  │  │ 1  | users    | abc123    | {"fields": [...], ...}           │  │  │
│  │  │ 2  | users    | def456    | {"fields": [...], ...}           │  │  │
│  │  │ 3  | orders   | abc123    | {"fields": [...], ...}           │  │  │
│  │  └──────────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                                          │
      ┌─────────────┬──────────────┬──────┴──────┬─────────────┬─────────────┐
      │             │              │             │             │             │
      ▼             ▼              ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌────────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│   SQL    │  │   SQL    │  │   Object   │  │   File   │  │   File   │  │   File   │
│  (Full)  │  │(Partial) │  │  Storage   │  │ Storage  │  │ Storage  │  │ Storage  │
├──────────┤  ├──────────┤  │ (Metadata) │  │ (Meta.)  │  │ (Meta.)  │  │ (Meta.)  │
│PostgreSQL│  │  MySQL   │  ├────────────┤  ├──────────┤  ├──────────┤  ├──────────┤
│  MySQL   │  │  SQLite  │  │     S3     │  │  Azure   │  │   GCS    │  │  Local   │
│  MSSQL   │  │          │  │            │  │   Blob   │  │          │  │    FS    │
└──────────┘  └──────────┘  └────────────┘  └──────────┘  └──────────┘  └──────────┘

Key Principle: Schema is always stored in the metadata database as JSONB. The storage backend may or may not enforce the schema.


PostgreSQL: Full Schema Support

PostgreSQL provides ConstraintSupportLevel::Full - all schema elements are enforced by the database engine.

Schema Storage

Schemas are stored in two places:

  1. Metadata database: JSONB in table_versions.schema
  2. Live database: Actual DDL (tables, constraints, indexes)
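The round trip through the metadata database can be sketched in a few lines. The snippet below is illustrative only: it uses an in-memory SQLite table in place of the real metadata database (which stores the schema as JSONB in PostgreSQL), and the schema document is a simplified guess at the `{"fields": [...], ...}` structure shown in the diagram above.

```python
import json
import sqlite3

# SQLite stands in for the real metadata database; the schema document
# is stored as a JSON string rather than JSONB.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE table_versions ("
    "  id INTEGER PRIMARY KEY, table_id TEXT, commit_id TEXT, schema TEXT)"
)

# One schema version per (table, commit) pair, as in the diagram above.
users_schema = {
    "fields": [
        {"name": "id", "type": "int64", "nullable": False},
        {"name": "email", "type": "varchar(255)", "nullable": False},
    ],
    "primary_key": ["id"],
}
db.execute(
    "INSERT INTO table_versions (table_id, commit_id, schema) VALUES (?, ?, ?)",
    ("users", "abc123", json.dumps(users_schema)),
)

# Reading a schema back is a plain lookup plus a JSON decode.
row = db.execute(
    "SELECT schema FROM table_versions WHERE table_id = ? AND commit_id = ?",
    ("users", "abc123"),
).fetchone()
schema = json.loads(row[0])
print([f["name"] for f in schema["fields"]])  # ['id', 'email']
```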

Introspection

Horizon Epoch introspects PostgreSQL schemas using system catalogs:

-- Columns from information_schema
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'users';

-- Constraints from pg_catalog
SELECT conname, contype, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'users'::regclass;

-- Indexes from pg_indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'users';

-- Enum types from pg_type
SELECT typname, enumlabel
FROM pg_type t
JOIN pg_enum e ON t.oid = e.enumtypid
WHERE typname = 'order_status';

What’s Captured

| Element               | Captured | Enforced | Notes                          |
|-----------------------|----------|----------|--------------------------------|
| Column names          | Yes      | Yes      |                                |
| Column types          | Yes      | Yes      | Full PostgreSQL type mapping   |
| Nullability           | Yes      | Yes      | NOT NULL constraints           |
| Defaults              | Yes      | Yes      | Literals, functions, sequences |
| Primary keys          | Yes      | Yes      |                                |
| Foreign keys          | Yes      | Yes      | With referential actions       |
| Unique constraints    | Yes      | Yes      | Including NULLS NOT DISTINCT   |
| Check constraints     | Yes      | Yes      | Full expression capture        |
| Exclusion constraints | Yes      | Yes      | GiST-based                     |
| Indexes               | Yes      | Yes      | B-tree, GIN, GiST, BRIN        |
| Enum types            | Yes      | Yes      | With value order               |

CLI Examples

# Show schema from PostgreSQL table
epoch schema show users

# Output:
# Table: users (PostgreSQL)
# Source: Live database introspection
#
# Columns:
#   id         SERIAL        PRIMARY KEY
#   email      VARCHAR(255)  NOT NULL
#   name       VARCHAR(255)  NOT NULL
#   status     user_status   DEFAULT 'active'
#   created_at TIMESTAMP     DEFAULT CURRENT_TIMESTAMP
#
# Constraints:
#   pk_users              PRIMARY KEY (id)
#   uq_users_email        UNIQUE (email)
#   chk_users_email       CHECK (email ~ '^[^@]+@[^@]+$')
#   fk_users_org          FOREIGN KEY (org_id) REFERENCES orgs (id)
#
# Indexes:
#   idx_users_email       btree (email)
#   idx_users_status      btree (status) WHERE status = 'active'
#
# Enum Types:
#   user_status           (active, inactive, suspended)

Python Example

import asyncio
from horizon_epoch import Client

async def inspect_schema():
    async with Client.connect("postgresql://localhost/horizon_epoch") as client:
        # Get constraints for a table
        constraints = await client.get_constraints("users", branch="main")

        print(f"Unique Constraints: {len(constraints.unique_constraints)}")
        print(f"Foreign Keys: {len(constraints.foreign_keys)}")
        print(f"Check Constraints: {len(constraints.check_constraints)}")

asyncio.run(inspect_schema())

S3/Delta Lake: Metadata-Only Support

S3/Delta Lake provides ConstraintSupportLevel::MetadataOnly - schemas are stored for documentation and merge operations but not enforced at the storage level.

Schema Storage

Schemas are stored in:

  1. Metadata database: JSONB in table_versions.schema
  2. Delta Lake: Parquet schema (columns and types only)
  3. Delta Lake table properties: Constraint metadata
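Because Parquet can only represent columns, types, and nullability, the schema document is effectively split on write: the column-level part goes into the Arrow/Parquet schema, and everything else rides along as table properties. A rough sketch of that projection, assuming a simplified schema-document shape (not Horizon Epoch's internal format):

```python
def parquet_level_view(schema: dict) -> dict:
    """Keep only what an Arrow/Parquet schema can carry: column names,
    types, and nullability. Constraints must travel separately."""
    return {"fields": [
        {"name": f["name"], "type": f["type"], "nullable": f.get("nullable", True)}
        for f in schema["fields"]
    ]}

full_schema = {
    "fields": [
        {"name": "id", "type": "uuid", "nullable": False},
        {"name": "event_type", "type": "varchar(100)", "nullable": True},
    ],
    "primary_key": ["id"],                                    # metadata only
    "foreign_keys": [{"name": "fk_events_user", "columns": ["user_id"]}],
}

view = parquet_level_view(full_schema)
print(sorted(view))  # ['fields'] -- constraints did not survive
```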

What’s Stored vs Enforced

| Element               | Stored | Enforced | Notes                         |
|-----------------------|--------|----------|-------------------------------|
| Column names          | Yes    | Yes      | Via Arrow/Parquet schema      |
| Column types          | Yes    | Yes      | Mapped to Arrow types         |
| Nullability           | Yes    | Partial  | Arrow schema nullability      |
| Defaults              | Yes    | No       | Stored as metadata only       |
| Primary keys          | Yes    | No       | Used as merge key, not unique |
| Foreign keys          | Yes    | No       | Stored for documentation      |
| Unique constraints    | Yes    | No       | Duplicates allowed            |
| Check constraints     | Yes    | No       | Expressions not evaluated     |
| Exclusion constraints | Yes    | No       | PostgreSQL-specific           |
| Indexes               | Yes    | No       | Delta Lake handles indexing   |
| Enum types            | Yes    | Partial  | Arrow dictionary encoding     |

Schema Conversion

Horizon Epoch converts schemas to/from Arrow format:

Horizon Epoch Schema ←→ Apache Arrow Schema ←→ Parquet Files

Type Mapping:

Horizon EpochArrow TypeNotes
int8Int8
int16Int16
int32Int32
int64Int64
float32Float32
float64Float64
decimal(p,s)Decimal128Precision preserved
varchar(n)Utf8Length not enforced
textLargeUtf8
booleanBoolean
timestampTimestampWith timezone
dateDate32
timeTime64Nanoseconds
uuidFixedSizeBinary(16)
jsonbLargeUtf8Stored as JSON string
byteaLargeBinary
enumDictionaryValues preserved
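The mapping can be read as a lookup function. The sketch below is not Horizon Epoch's actual converter: base types map by name, `varchar(n)` drops its length (matching the "length not enforced" note), and `decimal(p,s)` keeps precision and scale.

```python
import re

# Base-type mapping from the table above (illustrative, not the real converter).
ARROW_TYPES = {
    "int8": "Int8", "int16": "Int16", "int32": "Int32", "int64": "Int64",
    "float32": "Float32", "float64": "Float64",
    "text": "LargeUtf8", "boolean": "Boolean",
    "timestamp": "Timestamp", "date": "Date32", "time": "Time64",
    "uuid": "FixedSizeBinary(16)", "jsonb": "LargeUtf8",
    "bytea": "LargeBinary", "enum": "Dictionary",
}

def to_arrow_type(epoch_type: str) -> str:
    # varchar(n): the length is not enforced, so the parameter is dropped.
    if epoch_type.startswith("varchar"):
        return "Utf8"
    # decimal(p,s): precision and scale are preserved in Decimal128.
    m = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", epoch_type)
    if m:
        return f"Decimal128({m.group(1)}, {m.group(2)})"
    return ARROW_TYPES[epoch_type]

print(to_arrow_type("varchar(255)"))   # Utf8
print(to_arrow_type("decimal(10,2)"))  # Decimal128(10, 2)
```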

CLI Examples

# Show schema from S3/Delta table
epoch schema show events

# Output:
# Table: events (S3/Delta Lake)
# Source: Stored metadata (constraints not enforced)
#
# Columns:
#   id         UUID
#   user_id    INT64
#   event_type VARCHAR(100)
#   payload    JSONB
#   created_at TIMESTAMP
#
# Constraints (metadata only - not enforced):
#   pk_events             PRIMARY KEY (id)
#   fk_events_user        FOREIGN KEY (user_id) REFERENCES users (id)
#
# Indexes (not created - Delta Lake manages indexing):
#   idx_events_user       btree (user_id)
#   idx_events_type       btree (event_type)
#
# Note: S3/Delta Lake does not enforce constraints at the storage level.
# Constraints are stored for documentation and merge conflict detection.

Python Example

import asyncio
from horizon_epoch import Client

async def check_delta_constraints():
    async with Client.connect("postgresql://localhost/horizon_epoch") as client:
        # Get constraints from Delta table
        constraints = await client.get_constraints("events", branch="main")

        # Constraints are stored but not enforced by S3/Delta
        print(f"Foreign Keys: {len(constraints.foreign_keys)}")
        print(f"Unique Constraints: {len(constraints.unique_constraints)}")

        # For S3/Delta backends, constraints are metadata-only
        # Validation happens at write time (application level)

asyncio.run(check_delta_constraints())

Constraint Storage in Delta Lake

Constraints are stored as Delta Lake table properties:

{
  "horizon_epoch.constraints": {
    "primary_key": ["id"],
    "foreign_keys": [
      {
        "name": "fk_events_user",
        "columns": ["user_id"],
        "references_table": "users",
        "references_columns": ["id"]
      }
    ],
    "unique_constraints": [],
    "check_constraints": [
      {
        "name": "chk_events_type",
        "expression": "event_type IN ('click', 'view', 'purchase')"
      }
    ]
  }
}
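Since the constraints travel as an ordinary JSON property, reading them back requires no special tooling. A minimal sketch using the property document above:

```python
import json

# The Delta table properties shown above, as returned from table metadata.
raw = """
{
  "horizon_epoch.constraints": {
    "primary_key": ["id"],
    "foreign_keys": [
      {"name": "fk_events_user", "columns": ["user_id"],
       "references_table": "users", "references_columns": ["id"]}
    ],
    "unique_constraints": [],
    "check_constraints": [
      {"name": "chk_events_type",
       "expression": "event_type IN ('click', 'view', 'purchase')"}
    ]
  }
}
"""

constraints = json.loads(raw)["horizon_epoch.constraints"]
for fk in constraints["foreign_keys"]:
    print(f"{fk['name']}: {fk['columns']} -> "
          f"{fk['references_table']}({fk['references_columns'][0]})")
# fk_events_user: ['user_id'] -> users(id)
```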

Cross-Backend Operations

Merge Behavior

Constraint conflicts are detected regardless of enforcement level:

# Branch A (PostgreSQL) modifies constraint
epoch checkout main
psql -c "ALTER TABLE users ADD CONSTRAINT chk_age CHECK (age >= 18)"
epoch commit -m "Add age constraint"

# Branch B (same table) modifies same constraint
epoch checkout feature
psql -c "ALTER TABLE users ADD CONSTRAINT chk_age CHECK (age >= 21)"
epoch commit -m "Different age constraint"

# Merge detects conflict
epoch merge feature
# CONFLICT (constraint): chk_age modified in both branches

Even for S3/Delta tables where constraints aren’t enforced, merge conflicts are detected:

# Both branches modify the same constraint metadata
# Merge requires resolution, even though neither is enforced
epoch merge feature
# CONFLICT (constraint): chk_events_type modified in both branches
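The rule behind both cases is the same three-way comparison: a constraint conflicts when both branches changed it relative to the common base and their results differ. A sketch of that check over plain `{name: definition}` maps (illustrative data shapes, not Horizon Epoch internals):

```python
def constraint_conflicts(base: dict, ours: dict, theirs: dict) -> list:
    """Three-way merge over {constraint_name: definition} maps: a conflict
    is any constraint changed on both branches to different definitions."""
    conflicts = []
    for name in set(ours) | set(theirs):
        changed_ours = ours.get(name) != base.get(name)
        changed_theirs = theirs.get(name) != base.get(name)
        if changed_ours and changed_theirs and ours.get(name) != theirs.get(name):
            conflicts.append(name)
    return sorted(conflicts)

main = {"chk_age": "CHECK (age >= 18)"}
feature = {"chk_age": "CHECK (age >= 21)"}
print(constraint_conflicts({}, main, feature))  # ['chk_age']
```

The same function reports no conflict when only one branch changed a constraint, or when both made the identical change.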

Generating DDL for S3 Schemas

You can generate PostgreSQL DDL from schemas stored with S3/Delta tables:

# Generate DDL from S3 table schema
epoch schema show events --generate-ddl

# Output:
# -- DDL for events (from S3/Delta metadata)
# CREATE TABLE events (
#     id UUID PRIMARY KEY,
#     user_id BIGINT NOT NULL,
#     event_type VARCHAR(100),
#     payload JSONB,
#     created_at TIMESTAMP
# );
#
# ALTER TABLE events ADD CONSTRAINT fk_events_user
#     FOREIGN KEY (user_id) REFERENCES users (id);
#
# CREATE INDEX idx_events_user ON events (user_id);

This is useful for:

  • Creating PostgreSQL replicas of S3 data
  • Validating schema compatibility
  • Documentation
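A rough sketch of how such DDL could be assembled from a stored schema document (hypothetical field shapes; the type map renders `int64` as PostgreSQL's `BIGINT` as one plausible mapping, and covers only what the example needs):

```python
# Sketch: render CREATE TABLE DDL from a simplified schema document.
# Field shapes and the type map are assumptions for this example.
PG_TYPES = {"uuid": "UUID", "int64": "BIGINT", "jsonb": "JSONB",
            "varchar(100)": "VARCHAR(100)", "timestamp": "TIMESTAMP"}

def generate_ddl(table: str, schema: dict) -> str:
    cols = []
    for f in schema["fields"]:
        col = f"    {f['name']} {PG_TYPES[f['type']]}"
        if f["name"] in schema.get("primary_key", []):
            col += " PRIMARY KEY"
        elif not f.get("nullable", True):
            col += " NOT NULL"
        cols.append(col)
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

schema = {
    "fields": [
        {"name": "id", "type": "uuid"},
        {"name": "user_id", "type": "int64", "nullable": False},
        {"name": "payload", "type": "jsonb"},
    ],
    "primary_key": ["id"],
}
print(generate_ddl("events", schema))
```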

Constraint Support Level API

from horizon_epoch import ConstraintSupportLevel

# Check what a backend supports
level = client.get_constraint_support("postgresql://...")

# Methods available
level.can_enforce()       # True for Full only
level.can_store()         # True for Full, MetadataOnly, Partial
level.can_execute_ddl()   # True for Full only
level.validation_mode()   # "database", "application", or "none"

# Use in application logic
if level.can_enforce():
    # Let database handle constraint enforcement
    client.insert(data)
else:
    # Validate in application before insert
    violations = client.validate_constraints(data)
    if violations:
        raise ValueError(f"Constraint violations: {violations}")
    client.insert(data)
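If you want to mirror this logic without a client handy, the support levels can be modeled as a small enum. This is an illustrative re-implementation of the behavior described above, not the library's own class (it also simplifies `validation_mode()` to two of its three possible values):

```python
from enum import Enum

class SupportLevel(Enum):
    """Illustrative model of the constraint support levels on this page."""
    FULL = "full"                     # e.g. PostgreSQL: enforced by the engine
    PARTIAL = "partial"               # subset enforced by the backend
    METADATA_ONLY = "metadata_only"   # e.g. S3/Delta: stored, never enforced

    def can_enforce(self) -> bool:
        return self is SupportLevel.FULL

    def can_store(self) -> bool:
        # Every level stores the schema in the metadata database.
        return True

    def can_execute_ddl(self) -> bool:
        return self is SupportLevel.FULL

    def validation_mode(self) -> str:
        return "database" if self.can_enforce() else "application"

level = SupportLevel.METADATA_ONLY
print(level.can_enforce(), level.validation_mode())  # False application
```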

Best Practices

1. Know Your Enforcement Level

support = client.get_constraint_support(table_location)
if support != ConstraintSupportLevel.Full:
    logger.warning(f"Constraints not enforced for {table_location}")

2. Validate Before Writing to S3

# For S3/Delta tables, validate in application
def insert_to_s3_table(client, table, data):
    schema = client.get_schema(table)

    # Validate primary key uniqueness (the storage layer will not)
    existing_keys = client.query(f"SELECT id FROM {table}")
    if data['id'] in existing_keys:
        raise ValueError("Duplicate primary key")

    # Validate that referenced foreign-key rows exist
    # (single-column keys, for brevity)
    for fk in schema.foreign_keys:
        ref_keys = client.query(
            f"SELECT {fk.references_columns[0]} FROM {fk.references_table}")
        if data[fk.columns[0]] not in ref_keys:
            raise ValueError(f"Foreign key violation: {fk.name}")

    # Then insert
    client.insert(table, data)

3. Use PostgreSQL for Critical Constraints

If constraint enforcement is critical, use PostgreSQL tables:

# Analytics data - S3 is fine (no strict constraints needed)
client.register_table("events", "s3://bucket/events")

# User data - PostgreSQL for enforcement
client.register_table("users", "postgresql://mydb/public.users")

4. Document Constraint Semantics

In mixed-backend environments, document which constraints are enforced:

## Data Model

| Table | Backend | Constraints Enforced |
|-------|---------|---------------------|
| users | PostgreSQL | All (FK, UK, CHECK) |
| events | S3/Delta | None (metadata only) |
| orders | PostgreSQL | All |
| analytics | S3/Delta | None |

See Also