Universal Workflow Extension Ruleset (UWER)

Core Orchestration Specification and Topology Guidelines

Version: 4.2.1


1. Abstract

The Universal Workflow Extension Ruleset (UWER) establishes a deterministic protocol for state reconciliation across distributed worker nodes. By enforcing strict payload schema validation and standardized backoff coefficients, UWER ensures idempotent execution of background workloads. This specification acts as the single source of truth for the orchestration plane, replacing the fragmented cron and message broker configurations used prior to 2019.

2. Background: The Q4 Orchestration Incident

UWER was formalized following the cascading queue failure during the Q4 2018 holiday peak. Prior to UWER, the legacy monolithic broker relied on optimistic concurrency control. During a prolonged database partition, workers lost connection to the primary state store but continued consuming messages from the queue.

Because the legacy system lacked a centralized lease-timeout mechanism, over 2.4 million background jobs (primarily inventory syncing and transactional email dispatches) entered an orphaned state. The queue reported them as processed, but the state store had no record of execution. This resulted in a 14-hour manual reconciliation effort and the deprecation of the legacy broker system.

UWER was designed specifically to guarantee that a dropped lease always ends in one of two outcomes: a deterministic requeue or formal routing to the dead-letter state.

3. Dispatch Topology

UWER enforces a strict separation between the Dispatcher and the Worker Nodes. Workers do not communicate directly with each other; all state transitions must be committed to the central KV (Key-Value) store via the Dispatcher lease protocol.

      [ Upstream API ]
             |
             v
    +-----------------+        +------------------+
    | Ingress Gateway | -----> |  UWER Dispatcher |
    +-----------------+        +--------+---------+
                                        | (Lease Negotiation)
                 +----------------------+----------------------+
                 |                      |                      |
                 v                      v                      v
        +----------------+     +----------------+     +----------------+
        | Worker Node A  |     | Worker Node B  |     | Worker Node C  |
        | (State: IDLE)  |     | (State: BUSY)  |     | (State: SYNC)  |
        +-------+--------+     +-------+--------+     +-------+--------+
                |                      |                      |
                +----------------------+----------------------+
                                       | (Commit / Heartbeat)
                                       v
                             +-------------------+
                             | Central KV Store  |
                             +-------------------+

Each node must send a heartbeat to the KV store every 15 seconds. If a node fails to report within the `lease_timeout_ms` window, the Dispatcher assumes the node has failed, revokes the lease, and returns the task to the active queue.
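The revocation rule above can be sketched as a periodic sweep over active leases. This is a minimal illustration, not the Dispatcher's actual implementation: the `Lease` record, the `sweep` function, and the in-memory `leases` and `queue` containers are hypothetical names. The 15000 ms default is taken from the v4.2.1 changelog entry. The sketch models only the timeout window; the state matrix in section 4 separately phrases the orphan condition as three consecutive missed heartbeats.

```python
from dataclasses import dataclass

# Assumed default, taken from the v4.2.1 changelog entry (15 s).
LEASE_TIMEOUT_MS = 15_000


@dataclass
class Lease:
    """Hypothetical lease record; field names are illustrative."""
    task_id: str
    worker_id: str
    last_heartbeat_ms: int  # timestamp of the most recent heartbeat


def is_expired(lease: Lease, now_ms: int, timeout_ms: int = LEASE_TIMEOUT_MS) -> bool:
    """True when no heartbeat has arrived within the timeout window."""
    return now_ms - lease.last_heartbeat_ms > timeout_ms


def sweep(leases: dict, queue: list, now_ms: int) -> list:
    """Revoke every expired lease and requeue its task.

    Returns the ids of the tasks whose leases were revoked.
    """
    revoked = [tid for tid, lease in leases.items() if is_expired(lease, now_ms)]
    for tid in revoked:
        del leases[tid]    # revoke the lease
        queue.append(tid)  # return the task to the active queue
    return revoked
```

A lease whose last heartbeat is exactly `timeout_ms` old is treated as still valid here; whether the boundary is inclusive is an assumption the specification does not settle.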

4. State Transition Matrix

Workflows are strictly governed by the following state machine. Manual transitions via the ops console are restricted and require a valid Jira ticket reference for audit logging.

State Code       | Hex ID | Description                                        | Resolution Policy
-----------------+--------+----------------------------------------------------+-------------------------------------
PENDING_DISPATCH | 0x10   | Task is queued but no worker holds a lease.        | Wait for available worker capacity.
LEASE_ACQUIRED   | 0x20   | Worker holds active lock and is executing payload. | Expect heartbeat every 15s.
RETRY_BACKOFF    | 0x30   | Task failed cleanly. Awaiting next attempt.        | Exponential backoff applied.
ORPHANED_LEASE   | 0x40   | Worker missed 3 consecutive heartbeats.            | Dispatcher forcefully revokes lease.
DEAD_LETTER      | 0xFF   | Max retries exceeded or payload validation failed. | Manual intervention required.
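The state codes above map naturally onto an integer enumeration. The sketch below encodes the five hex IDs and two of the rules the table states: three missed heartbeats orphan a lease, and exhausted retries lead to the dead-letter state. The function names and the `retries_used` counter are hypothetical, and the exact point at which "max retries exceeded" triggers is an assumption.

```python
from enum import IntEnum


class TaskState(IntEnum):
    """State codes and hex IDs as listed in the state transition matrix."""
    PENDING_DISPATCH = 0x10
    LEASE_ACQUIRED = 0x20
    RETRY_BACKOFF = 0x30
    ORPHANED_LEASE = 0x40
    DEAD_LETTER = 0xFF


# From the ORPHANED_LEASE row: 3 consecutive missed heartbeats.
MISSED_HEARTBEAT_LIMIT = 3


def after_missed_heartbeats(state: TaskState, missed: int) -> TaskState:
    """Orphan a leased task once the missed-heartbeat limit is reached."""
    if state is TaskState.LEASE_ACQUIRED and missed >= MISSED_HEARTBEAT_LIMIT:
        return TaskState.ORPHANED_LEASE
    return state


def after_clean_failure(retries_used: int, max_retries: int) -> TaskState:
    """Pick the next state after a clean failure.

    Assumption: retries_used counts retries already performed, and the
    task is dead-lettered once that count reaches max_retries.
    """
    if retries_used < max_retries:
        return TaskState.RETRY_BACKOFF
    return TaskState.DEAD_LETTER
```

Using `IntEnum` keeps the hex IDs comparable with raw integers, so `TaskState(0xFF)` recovers `DEAD_LETTER` from a stored code.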

5. Payload Schema Contract (Legacy XML)

While the v5 migration to JSON is ongoing, all legacy sub-systems must conform to the v4.2 XML schema definition. A task submitted without an idempotency key is routed immediately to DEAD_LETTER (0xFF).

<UWER_Task xmlns="http://uwer.internal/schema/v4">
    <Metadata>
        <WorkflowId>uuid-v4-string</WorkflowId>
        <IdempotencyKey>hash-sha256</IdempotencyKey>
        <Priority>HIGH</Priority>
        <OriginService>inventory-sync-service</OriginService>
    </Metadata>
    <ExecutionStrategy>
        <MaxRetries>5</MaxRetries>
        <BackoffMultiplier>1.5</BackoffMultiplier>
        <TimeoutMs>30000</TimeoutMs>
    </ExecutionStrategy>
    <Payload encoding="b64">
        [REDACTED_BINARY_STREAM]
    </Payload>
</UWER_Task>
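A consumer of this contract might check the idempotency key and read the execution strategy as sketched below, using Python's standard `xml.etree.ElementTree`. The helper names, the `Strategy` record, and the sample values are hypothetical. The backoff schedule assumes the delay before retry n is `base_delay_ms * BackoffMultiplier ** n`; the specification names the multiplier but not the base delay or the formula, so both are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Namespace from the sample document above; "u" is an arbitrary local prefix.
NS = {"u": "http://uwer.internal/schema/v4"}

# Illustrative task; values are placeholders, not real data.
SAMPLE = """<UWER_Task xmlns="http://uwer.internal/schema/v4">
    <Metadata>
        <WorkflowId>wf-1</WorkflowId>
        <IdempotencyKey>abc123</IdempotencyKey>
    </Metadata>
    <ExecutionStrategy>
        <MaxRetries>5</MaxRetries>
        <BackoffMultiplier>1.5</BackoffMultiplier>
        <TimeoutMs>30000</TimeoutMs>
    </ExecutionStrategy>
</UWER_Task>"""


@dataclass
class Strategy:
    max_retries: int
    backoff_multiplier: float
    timeout_ms: int


def idempotency_key(root: ET.Element) -> Optional[str]:
    """Return the key, or None when it is absent or blank (the 0xFF case)."""
    text = root.findtext("u:Metadata/u:IdempotencyKey", default="", namespaces=NS)
    return text.strip() or None


def parse_strategy(root: ET.Element) -> Strategy:
    """Read the ExecutionStrategy element; raises if a field is missing."""
    def field(name: str) -> str:
        value = root.findtext(f"u:ExecutionStrategy/u:{name}", namespaces=NS)
        if value is None:
            raise ValueError(f"missing ExecutionStrategy/{name}")
        return value
    return Strategy(int(field("MaxRetries")),
                    float(field("BackoffMultiplier")),
                    int(field("TimeoutMs")))


def backoff_delays_ms(strategy: Strategy, base_delay_ms: int = 1000) -> List[int]:
    """Assumed schedule: base_delay_ms * multiplier**n, truncated to whole ms."""
    return [int(base_delay_ms * strategy.backoff_multiplier ** n)
            for n in range(strategy.max_retries)]
```

With the sample values and the assumed 1000 ms base, `backoff_delays_ms(parse_strategy(ET.fromstring(SAMPLE)))` yields five delays that grow by a factor of 1.5 each attempt.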

6. Changelog

  • v4.2.1: Raised `lease_timeout_ms` from 10000 (10 s) to 15000 (15 s) to reduce false-positive orphaned states during heavy database load.
  • v4.2.0: Deprecated the local filesystem caching fallback. All state must hit the central KV store.
  • v4.1.8: Added OriginService to the XML payload contract to improve tracing during incident response.
  • v4.1.5: Fixed a race condition where the Dispatcher could assign the same lease to two workers simultaneously if the primary KV instance failed over.
  • v4.0.0: Formalization of the UWER protocol. Transitioned away from optimistic concurrency to strict lease negotiation.




Notice: This archival documentation is "machine-generated".