Schema Storage Across Backends
Horizon Epoch stores schemas consistently regardless of the underlying storage backend, but the level of enforcement varies. This document explains how schemas are stored, introspected, and enforced across different backends.
Storage Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│                             Metadata Database                             │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                           table_versions                            │  │
│  │  ┌──────────────────────────────────────────────────────────────┐  │  │
│  │  │ id | table_id | commit_id | schema (JSONB)                   │  │  │
│  │  │ ------------------------------------------------------------ │  │  │
│  │  │ 1  | users    | abc123    | {"fields": [...], ...}           │  │  │
│  │  │ 2  | users    | def456    | {"fields": [...], ...}           │  │  │
│  │  │ 3  | orders   | abc123    | {"fields": [...], ...}           │  │  │
│  │  └──────────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
     ┌────────────┬────────────┬─────┴──────┬────────────┬────────────┐
     │            │            │            │            │            │
     ▼            ▼            ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   SQL    │ │   SQL    │ │  Object  │ │   File   │ │   File   │ │   File   │
│  (Full)  │ │ (Part.)  │ │ Storage  │ │ Storage  │ │ Storage  │ │ Storage  │
├──────────┤ ├──────────┤ │(Metadata)│ │ (Meta.)  │ │ (Meta.)  │ │ (Meta.)  │
│PostgreSQL│ │  MySQL   │ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤
│  MySQL   │ │  SQLite  │ │    S3    │ │  Azure   │ │   GCS    │ │  Local   │
│  MSSQL   │ │          │ │          │ │   Blob   │ │          │ │    FS    │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
Key Principle: Schema is always stored in the metadata database as JSONB. The storage backend may or may not enforce the schema.
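Because the schema always lives in the metadata database, it can be read back with plain SQL. A minimal sketch with asyncpg, assuming the metadata database is PostgreSQL and uses the table_versions layout shown above:

import asyncio
import json

import asyncpg  # assumption: the metadata database is PostgreSQL

async def latest_schema(table_id: str) -> dict:
    conn = await asyncpg.connect("postgresql://localhost/horizon_epoch")
    try:
        # Most recent schema version for the table, per the layout above
        row = await conn.fetchrow(
            "SELECT schema FROM table_versions"
            " WHERE table_id = $1 ORDER BY id DESC LIMIT 1",
            table_id,
        )
        # asyncpg returns JSONB values as JSON text by default
        return json.loads(row["schema"])
    finally:
        await conn.close()

print(asyncio.run(latest_schema("users"))["fields"])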
PostgreSQL: Full Schema Support
PostgreSQL provides ConstraintSupportLevel::Full - all schema elements are enforced by the database engine.
Schema Storage
Schemas are stored in two places:
- Metadata database: JSONB in table_versions.schema
- Live database: Actual DDL (tables, constraints, indexes)
Introspection
Horizon Epoch introspects PostgreSQL schemas using system catalogs:
-- Columns from information_schema
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'users';
-- Constraints from pg_catalog
SELECT conname, contype, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'users'::regclass;
-- Indexes from pg_indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'users';
-- Enum types from pg_type
SELECT typname, enumlabel
FROM pg_type t
JOIN pg_enum e ON t.oid = e.enumtypid
WHERE typname = 'order_status'
ORDER BY e.enumsortorder;
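The same catalog queries can be driven directly from Python. A sketch using asyncpg against the live database (an illustration, not the Horizon Epoch introspection code itself):

import asyncio

import asyncpg  # assumption: direct access to the live PostgreSQL database

async def introspect(table: str) -> dict:
    conn = await asyncpg.connect("postgresql://localhost/mydb")
    try:
        columns = await conn.fetch(
            "SELECT column_name, data_type, is_nullable, column_default"
            " FROM information_schema.columns WHERE table_name = $1",
            table,
        )
        constraints = await conn.fetch(
            "SELECT conname, contype, pg_get_constraintdef(oid) AS definition"
            " FROM pg_constraint WHERE conrelid = to_regclass($1)",
            table,
        )
        return {
            "columns": [dict(r) for r in columns],
            "constraints": [dict(r) for r in constraints],
        }
    finally:
        await conn.close()

print(asyncio.run(introspect("users")))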
What’s Captured
| Element | Captured | Enforced | Notes |
|---|---|---|---|
| Column names | Yes | Yes | |
| Column types | Yes | Yes | Full PostgreSQL type mapping |
| Nullability | Yes | Yes | NOT NULL constraints |
| Defaults | Yes | Yes | Literals, functions, sequences |
| Primary keys | Yes | Yes | |
| Foreign keys | Yes | Yes | With referential actions |
| Unique constraints | Yes | Yes | Including NULLS NOT DISTINCT |
| Check constraints | Yes | Yes | Full expression capture |
| Exclusion constraints | Yes | Yes | GiST-based |
| Indexes | Yes | Yes | B-tree, GIN, GiST, BRIN |
| Enum types | Yes | Yes | With value order |
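Because these elements are enforced by the database engine itself, a violating write fails at insert time. A quick sketch with asyncpg (independent of the Horizon Epoch client), using the users table from the CLI example below:

import asyncio

import asyncpg  # sketch: talks to PostgreSQL directly, not through Horizon Epoch

async def demo_enforcement():
    conn = await asyncpg.connect("postgresql://localhost/mydb")
    try:
        await conn.execute(
            "INSERT INTO users (email, name) VALUES ($1, $2)",
            "a@example.com", "Alice",
        )
        # Violates uq_users_email: the engine rejects it, no application check needed
        await conn.execute(
            "INSERT INTO users (email, name) VALUES ($1, $2)",
            "a@example.com", "Duplicate Alice",
        )
    except asyncpg.UniqueViolationError as exc:
        print(f"Rejected by PostgreSQL: {exc}")
    finally:
        await conn.close()

asyncio.run(demo_enforcement())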
CLI Examples
# Show schema from PostgreSQL table
epoch schema show users
# Output:
# Table: users (PostgreSQL)
# Source: Live database introspection
#
# Columns:
# id SERIAL PRIMARY KEY
# email VARCHAR(255) NOT NULL
# name VARCHAR(255) NOT NULL
# status user_status DEFAULT 'active'
# created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
#
# Constraints:
# pk_users PRIMARY KEY (id)
# uq_users_email UNIQUE (email)
# chk_users_email CHECK (email ~ '^[^@]+@[^@]+$')
# fk_users_org FOREIGN KEY (org_id) REFERENCES orgs (id)
#
# Indexes:
# idx_users_email btree (email)
# idx_users_status btree (status) WHERE status = 'active'
#
# Enum Types:
# user_status (active, inactive, suspended)
Python Example
import asyncio
from horizon_epoch import Client

async def inspect_schema():
    async with Client.connect("postgresql://localhost/horizon_epoch") as client:
        # Get constraints for a table
        constraints = await client.get_constraints("users", branch="main")
        print(f"Unique Constraints: {len(constraints.unique_constraints)}")
        print(f"Foreign Keys: {len(constraints.foreign_keys)}")
        print(f"Check Constraints: {len(constraints.check_constraints)}")

asyncio.run(inspect_schema())
S3/Delta Lake: Metadata-Only Support
S3/Delta Lake provides ConstraintSupportLevel::MetadataOnly - schemas are stored for documentation and merge operations but not enforced at the storage level.
Schema Storage
Schemas are stored in:
- Metadata database: JSONB in table_versions.schema
- Delta Lake: Parquet schema (columns and types only)
- Delta Lake table properties: Constraint metadata
What’s Stored vs Enforced
| Element | Stored | Enforced | Notes |
|---|---|---|---|
| Column names | Yes | Yes | Via Arrow/Parquet schema |
| Column types | Yes | Yes | Mapped to Arrow types |
| Nullability | Yes | Partial | Arrow schema nullability |
| Defaults | Yes | No | Stored as metadata only |
| Primary keys | Yes | No | Used as merge key, not unique |
| Foreign keys | Yes | No | Stored for documentation |
| Unique constraints | Yes | No | Duplicates allowed |
| Check constraints | Yes | No | Expressions not evaluated |
| Exclusion constraints | Yes | No | PostgreSQL-specific |
| Indexes | Yes | No | Delta Lake handles indexing |
| Enum types | Yes | Partial | Arrow dictionary encoding |
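The practical consequence is that nothing at the storage layer rejects a write that violates the stored constraints. A sketch using the deltalake package (an assumed writer; any Delta-compatible writer behaves the same way), appending two rows that share a primary-key value:

import pyarrow as pa
from deltalake import write_deltalake  # assumption: delta-rs Python bindings

batch = pa.table({
    "id": ["evt-1", "evt-1"],          # same "primary key" value twice
    "event_type": ["click", "view"],
})

# Both rows are appended successfully: the stored PRIMARY KEY is metadata
# only, so uniqueness has to be checked in the application before writing.
write_deltalake("s3://bucket/events", batch, mode="append")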
Schema Conversion
Horizon Epoch converts schemas to/from Arrow format:
Horizon Epoch Schema ←→ Apache Arrow Schema ←→ Parquet Files
Type Mapping:
| Horizon Epoch | Arrow Type | Notes |
|---|---|---|
| int8 | Int8 | |
| int16 | Int16 | |
| int32 | Int32 | |
| int64 | Int64 | |
| float32 | Float32 | |
| float64 | Float64 | |
| decimal(p,s) | Decimal128 | Precision preserved |
| varchar(n) | Utf8 | Length not enforced |
| text | LargeUtf8 | |
| boolean | Boolean | |
| timestamp | Timestamp | With timezone |
| date | Date32 | |
| time | Time64 | Nanoseconds |
| uuid | FixedSizeBinary(16) | |
| jsonb | LargeUtf8 | Stored as JSON string |
| bytea | LargeBinary | |
| enum | Dictionary | Values preserved |
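As a concrete sketch of this mapping, here is how the events table used later in this document would look as a pyarrow schema (the status column is hypothetical, added only to show dictionary encoding):

import pyarrow as pa

# Arrow schema for the events table, following the mapping above
events_schema = pa.schema([
    pa.field("id", pa.binary(16)),                         # uuid -> FixedSizeBinary(16)
    pa.field("user_id", pa.int64()),                       # int64 -> Int64
    pa.field("event_type", pa.utf8()),                     # varchar(100) -> Utf8, length not enforced
    pa.field("payload", pa.large_utf8()),                  # jsonb -> LargeUtf8 (JSON string)
    pa.field("created_at", pa.timestamp("us", tz="UTC")),  # timestamp -> Timestamp with timezone
    # Hypothetical enum column, to show dictionary encoding:
    pa.field("status", pa.dictionary(pa.int32(), pa.utf8())),
])
print(events_schema)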
CLI Examples
# Show schema from S3/Delta table
epoch schema show events
# Output:
# Table: events (S3/Delta Lake)
# Source: Stored metadata (constraints not enforced)
#
# Columns:
# id UUID
# user_id INT64
# event_type VARCHAR(100)
# payload JSONB
# created_at TIMESTAMP
#
# Constraints (metadata only - not enforced):
# pk_events PRIMARY KEY (id)
# fk_events_user FOREIGN KEY (user_id) REFERENCES users (id)
#
# Indexes (not created - Delta Lake manages indexing):
# idx_events_user btree (user_id)
# idx_events_type btree (event_type)
#
# Note: S3/Delta Lake does not enforce constraints at the storage level.
# Constraints are stored for documentation and merge conflict detection.
Python Example
import asyncio
from horizon_epoch import Client

async def check_delta_constraints():
    async with Client.connect("postgresql://localhost/horizon_epoch") as client:
        # Get constraints from Delta table
        constraints = await client.get_constraints("events", branch="main")
        # Constraints are stored but not enforced by S3/Delta
        print(f"Foreign Keys: {len(constraints.foreign_keys)}")
        print(f"Unique Constraints: {len(constraints.unique_constraints)}")
        # For S3/Delta backends, constraints are metadata-only
        # Validation happens at write time (application level)

asyncio.run(check_delta_constraints())
Constraint Storage in Delta Lake
Constraints are stored as Delta Lake table properties:
{
  "horizon_epoch.constraints": {
    "primary_key": ["id"],
    "foreign_keys": [
      {
        "name": "fk_events_user",
        "columns": ["user_id"],
        "references_table": "users",
        "references_columns": ["id"]
      }
    ],
    "unique_constraints": [],
    "check_constraints": [
      {
        "name": "chk_events_type",
        "expression": "event_type IN ('click', 'view', 'purchase')"
      }
    ]
  }
}
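These properties can be read back from the Delta log with standard tooling. A sketch using the deltalake package (an assumption about tooling, not part of the Horizon Epoch API):

import json

from deltalake import DeltaTable  # assumption: delta-rs Python bindings

dt = DeltaTable("s3://bucket/events")
config = dt.metadata().configuration                  # table properties, str -> str
raw = config.get("horizon_epoch.constraints")         # property name from the example above
if raw is not None:
    constraints = json.loads(raw)
    for fk in constraints.get("foreign_keys", []):
        print(f"{fk['name']}: {fk['columns']} -> {fk['references_table']}")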
Cross-Backend Operations
Merge Behavior
Constraint conflicts are detected regardless of enforcement level:
# Branch A (PostgreSQL) adds a constraint
epoch checkout main
psql -c "ALTER TABLE users ADD CONSTRAINT chk_age CHECK (age >= 18)"
epoch commit -m "Add age constraint"
# Branch B adds a conflicting definition of the same constraint
epoch checkout feature
psql -c "ALTER TABLE users ADD CONSTRAINT chk_age CHECK (age >= 21)"
epoch commit -m "Different age constraint"
# Merge detects conflict
epoch merge feature
# CONFLICT (constraint): chk_age modified in both branches
Even for S3/Delta tables where constraints aren’t enforced, merge conflicts are detected:
# Both branches modify the same constraint metadata
# Merge requires resolution, even though neither is enforced
epoch merge feature
# CONFLICT (constraint): chk_events_type modified in both branches
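Conceptually, this is a three-way comparison of constraint metadata against the merge base, independent of whether either side enforces anything. A simplified sketch of the idea (not Horizon Epoch's internal implementation):

def constraint_conflicts(base: dict, ours: dict, theirs: dict) -> list:
    """Three-way diff over {constraint_name: definition} maps."""
    conflicts = []
    for name in sorted(set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        # Conflict only when both sides changed the constraint, and differently
        if o != b and t != b and o != t:
            conflicts.append(name)
    return conflicts

# The scenario above: both branches introduce chk_age with different bodies
print(constraint_conflicts(
    base={},
    ours={"chk_age": "CHECK (age >= 18)"},
    theirs={"chk_age": "CHECK (age >= 21)"},
))  # -> ['chk_age']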
Generating DDL for S3 Schemas
You can generate PostgreSQL DDL from schemas stored with S3/Delta tables:
# Generate DDL from S3 table schema
epoch schema show events --generate-ddl
# Output:
# -- DDL for events (from S3/Delta metadata)
# CREATE TABLE events (
# id UUID PRIMARY KEY,
# user_id BIGINT NOT NULL,
# event_type VARCHAR(100),
# payload JSONB,
# created_at TIMESTAMP
# );
#
# ALTER TABLE events ADD CONSTRAINT fk_events_user
# FOREIGN KEY (user_id) REFERENCES users (id);
#
# CREATE INDEX idx_events_user ON events (user_id);
This is useful for:
- Creating PostgreSQL replicas of S3 data
- Validating schema compatibility
- Documentation
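Under the hood, DDL generation is mostly a type-mapping exercise over the stored schema. A simplified sketch (hypothetical helper; the real generator also handles constraints, defaults, and identifier quoting):

# Assumed subset of the stored-type -> PostgreSQL type mapping
PG_TYPES = {
    "uuid": "UUID",
    "int64": "BIGINT",
    "varchar(100)": "VARCHAR(100)",
    "jsonb": "JSONB",
    "timestamp": "TIMESTAMP",
}

def render_create_table(table: str, columns: list) -> str:
    # columns: [{"name": ..., "type": ..., "nullable": bool}] (assumed shape)
    parts = []
    for col in columns:
        null = "" if col.get("nullable", True) else " NOT NULL"
        parts.append(f"  {col['name']} {PG_TYPES[col['type']]}{null}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(parts) + "\n);"

print(render_create_table("events", [
    {"name": "id", "type": "uuid", "nullable": False},
    {"name": "user_id", "type": "int64", "nullable": False},
    {"name": "payload", "type": "jsonb"},
]))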
Constraint Support Level API
from horizon_epoch import ConstraintSupportLevel

# Check what a backend supports
level = client.get_constraint_support("postgresql://...")

# Methods available
level.can_enforce()      # True for Full only
level.can_store()        # True for Full, MetadataOnly, Partial
level.can_execute_ddl()  # True for Full only
level.validation_mode()  # "database", "application", or "none"

# Use in application logic
if level.can_enforce():
    # Let database handle constraint enforcement
    client.insert(data)
else:
    # Validate in application before insert
    violations = client.validate_constraints(data)
    if violations:
        raise ValueError(f"Constraint violations: {violations}")
    client.insert(data)
Best Practices
1. Know Your Enforcement Level
support = client.get_constraint_support(table_location)
if support != ConstraintSupportLevel.Full:
    logger.warning(f"Constraints not enforced for {table_location}")
2. Validate Before Writing to S3
# For S3/Delta tables, validate in application
def insert_to_s3_table(client, table, data):
    schema = client.get_schema(table)
    # Validate primary key uniqueness
    existing_keys = client.query(f"SELECT id FROM {table}")
    if data['id'] in existing_keys:
        raise ValueError("Duplicate primary key")
    # Validate foreign keys exist
    for fk in schema.foreign_keys:
        # Check referenced table
        pass
    # Then insert
    client.insert(table, data)
3. Use PostgreSQL for Critical Constraints
If constraint enforcement is critical, use PostgreSQL tables:
# Analytics data - S3 is fine (no strict constraints needed)
client.register_table("events", "s3://bucket/events")
# User data - PostgreSQL for enforcement
client.register_table("users", "postgresql://mydb/public.users")
4. Document Constraint Semantics
In mixed-backend environments, document which constraints are enforced:
## Data Model
| Table | Backend | Constraints Enforced |
|-------|---------|---------------------|
| users | PostgreSQL | All (FK, UK, CHECK) |
| events | S3/Delta | None (metadata only) |
| orders | PostgreSQL | All |
| analytics | S3/Delta | None |
See Also
- Storage Adapters - Backend implementation details
- Constraint Reference - Constraint types
- Multi-Backend Setup - Configuring multiple backends