Multi-Backend Architecture

Horizon Epoch is designed to manage data across multiple, heterogeneous storage backends through a unified interface.

Design Philosophy

Separation of Concerns

┌─────────────────────────────────────────────────────────────┐
│                    Horizon Epoch                             │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                 Metadata Layer                          │ │
│  │  - Commits, branches, tags                             │ │
│  │  - Table registrations                                  │ │
│  │  - Change tracking indices                              │ │
│  │  - Version graph                                        │ │
│  └────────────────────────────────────────────────────────┘ │
│                           │                                  │
│                           │ (references, not data)          │
│                           ▼                                  │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                  Storage Layer                          │ │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │ │
│  │  │PostgreSQL│ │  MySQL   │ │SQL Server│ │  SQLite  │  │ │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │ │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │ │
│  │  │  AWS S3  │ │  Azure   │ │   GCS    │ │  Local   │  │ │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Key insight: The metadata layer stores what changed and when, while storage adapters handle where the data physically lives.
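
This split implies a narrow adapter interface that the metadata layer can call without knowing backend specifics. Below is a minimal sketch of that boundary; StorageAdapter, read_table, and apply_changes are illustrative names, not the actual API:

from abc import ABC, abstractmethod

# Illustrative adapter boundary; names are assumptions, not Horizon Epoch's API.
class StorageAdapter(ABC):
    """One adapter per backend family (PostgreSQL, S3, ...)."""

    @abstractmethod
    def read_table(self, path: str, version: str) -> list:
        """Resolve a table path to rows at a given version."""

    @abstractmethod
    def apply_changes(self, path: str, changes: list) -> str:
        """Write a change set; return a backend-local reference for metadata."""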

Benefits

  1. No Data Migration - Keep data where it is
  2. Best Tool for the Job - Use PostgreSQL for transactional workloads, S3 for analytics
  3. Gradual Adoption - Add version control to existing infrastructure
  4. Unified Operations - Same commands work across all backends

Architecture Components

Storage Registry

Central registry of all configured backends:

config = Config(
    metadata_url="postgresql://localhost/horizon_epoch"
).add_postgres(
    "prod_users",      # Logical name
    "postgresql://prod/users"
).add_postgres(
    "prod_orders",
    "postgresql://prod/orders"
).add_s3(
    "analytics",
    bucket="company-analytics"
)

Storage Location

Each table has a storage location that identifies:

  1. Protocol - Which adapter to use
  2. Backend Name - Which configured backend
  3. Path - Table identifier within the backend

postgresql://prod_users/public.users
│            │          │
protocol     backend    table path
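
As an illustration (this parser is a sketch, not part of the client library), a location string splits cleanly along those three parts:

from urllib.parse import urlparse

# Illustrative only: split a storage location into (protocol, backend, path).
def parse_location(location: str) -> tuple[str, str, str]:
    parsed = urlparse(location)
    # scheme selects the adapter, netloc names the configured backend,
    # and the path (minus its leading slash) identifies the table.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

assert parse_location("postgresql://prod_users/public.users") == (
    "postgresql", "prod_users", "public.users"
)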

Table Registration

Tables are registered with their location:

# PostgreSQL table
client.register_table("users", "postgresql://prod_users/public.users")

# S3 Delta table
client.register_table("events", "s3://analytics/delta/events")

Metadata References

Metadata stores references to data, not the data itself:

-- In metadata database
SELECT * FROM epoch_tables;
┌──────────┬────────────────────────────────────────┐
│ name     │ location                                │
├──────────┼────────────────────────────────────────┤
│ users    │ postgresql://prod_users/public.users   │
│ orders   │ postgresql://prod_orders/public.orders │
│ events   │ s3://analytics/delta/events            │
└──────────┴────────────────────────────────────────┘

Cross-Backend Operations

Branching

Branches span all registered tables:

# Creates a branch that covers:
# - users (PostgreSQL)
# - orders (PostgreSQL)
# - events (S3)
client.create_branch("feature/new-reporting")

Each backend maintains its own overlay:

  • PostgreSQL: Overlay tables
  • S3: Separate Delta log
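
As a usage sketch (checkout is hypothetical here, and whether query() accepts writes is an assumption; this page only shows create_branch), branch writes land in the overlay while main is unaffected:

client.create_branch("feature/new-reporting")
client.checkout("feature/new-reporting")   # hypothetical method

# This write goes to the branch's overlay, not the base table
client.query("UPDATE users SET plan = 'beta' WHERE id = 1")

client.checkout("main")                    # hypothetical method
client.query("SELECT plan FROM users WHERE id = 1")  # unchanged on main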

Committing

Commits can include changes from multiple backends:

# Changes to users (PostgreSQL) and events (S3)
# are captured in a single commit
client.commit(message="Update user events schema")

The commit metadata tracks which backends have changes:

{
  "commit_id": "abc123",
  "message": "Update user events schema",
  "changes": {
    "postgresql://prod_users": ["users"],
    "s3://analytics": ["events"]
  }
}

Diffing

Diff operations aggregate across backends:

diff = client.diff("main", "feature/branch")

# Returns changes from all backends
for table_diff in diff.table_diffs:
    print(f"{table_diff.location}: {table_diff.status}")

Merging

Merges are coordinated across backends in four steps, sketched in code below:

  1. Compute changes per backend
  2. Detect conflicts per backend
  3. Apply changes per backend (in transaction if supported)
  4. Create unified merge commit
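
A rough sketch of that coordination loop (every name below is illustrative; compute_changes, detect_conflicts, and apply_changes are assumptions about the internals, not documented API):

# Hypothetical merge coordination, not Horizon Epoch internals.
class MergeConflictError(Exception):
    pass

def merge(source_branch, target_branch, backends, metadata):
    # Steps 1-2: plan and validate every backend before touching any data.
    plans = {}
    for backend in backends:
        changes = backend.compute_changes(source_branch, target_branch)
        conflicts = backend.detect_conflicts(changes)
        if conflicts:
            raise MergeConflictError(f"{backend.name}: {conflicts}")
        plans[backend] = changes

    # Step 3: apply per backend, transactionally where the backend supports it.
    applied = {b.name: b.apply_changes(plans[b]) for b in plans}

    # Step 4: one merge commit referencing every backend's result.
    return metadata.create_commit(
        message=f"Merge {source_branch} into {target_branch}",
        changes=applied,
    )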

Consistency Model

Within a Backend

Operations within a single backend use that backend’s consistency guarantees:

  • PostgreSQL: ACID transactions
  • S3/Delta: Serializable via Delta protocol

Across Backends

Cross-backend operations provide best-effort consistency:

┌─────────────┐     ┌─────────────┐
│ PostgreSQL  │     │     S3      │
│   commit    │     │   commit    │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └────────┬──────────┘
                │
         ┌──────▼──────┐
         │   Metadata  │
         │   Commit    │
         └─────────────┘

If one backend fails:

  • The operation is marked as partial
  • Rollback is attempted where possible
  • User is notified of partial state

try:
    client.commit(message="Multi-backend update")
except PartialCommitError as e:
    print(f"Committed to: {e.successful_backends}")
    print(f"Failed on: {e.failed_backends}")
    # Manual intervention needed

Configuration Patterns

Separate Environments

# Development
[storage.postgres.dev_db]
url = "postgresql://localhost/dev"

# Staging
[storage.postgres.staging_db]
url = "postgresql://staging-db.internal/staging"

# Production
[storage.postgres.prod_db]
url = "postgresql://prod-db.internal/production"
aws_secret_id = "prod-db-credentials"

Mixed Workloads

# Transactional data in PostgreSQL
[storage.postgres.transactional]
url = "postgresql://oltp-db/production"

# Analytics in S3
[storage.s3.analytics]
bucket = "company-analytics"
region = "us-east-1"

# Archive in Glacier
[storage.s3.archive]
bucket = "company-archive"
region = "us-east-1"
storage_class = "GLACIER_IR"

Cross-Region

[storage.s3.us_data]
bucket = "data-us"
region = "us-east-1"

[storage.s3.eu_data]
bucket = "data-eu"
region = "eu-west-1"

Routing and Discovery

Explicit Routing

Specify backend when registering tables:

client.register_table("users", "postgresql://prod_users/public.users")

Pattern-Based Routing

Configure default routing patterns:

[routing]
# Tables starting with "raw_" go to S3
"raw_*" = "s3://analytics"

# Everything else goes to PostgreSQL
"*" = "postgresql://default"

Auto-Discovery

Discover tables from backends:

# List tables in a backend
tables = client.discover_tables("postgresql://prod_db")

# Register discovered tables
for table in tables:
    client.register_table(table.name, table.location)

Performance Considerations

Query Routing

Queries are routed to the appropriate backend:

# Routed to PostgreSQL
client.query("SELECT * FROM users")

# Routed to S3
client.query("SELECT * FROM events")

Cross-Backend Queries

Currently, joins across backends are not supported in a single query. Use application-level joining:

# Query each backend
users = client.query("SELECT * FROM users")
events = client.query("SELECT * FROM events WHERE user_id IN (...)")

# Join in the application; for example, if the results are pandas DataFrames:
result = users.merge(events, on="user_id")

Caching

Per-backend connection pooling and caching:

[storage.postgres.prod_db]
pool_size = 20
cache_schema = true

[storage.s3.analytics]
cache_metadata = true
cache_ttl = 300

Limitations

  1. No cross-backend transactions - ACID only within single backend
  2. No cross-backend joins - Query each backend separately
  3. Best-effort cross-backend consistency - Cross-backend commits may be partially applied
  4. Network latency - Operations touch multiple backends

Best Practices

  1. Group related tables - Tables that are often queried together should be in the same backend
  2. Consider latency - Place backends close to where they’re accessed
  3. Plan for failures - Have recovery procedures for partial commits
  4. Monitor all backends - Track health and performance per backend
  5. Document routing - Make it clear which tables are where