Introducing a Reliable Transport Layer Between Nodes and the Central System
Context
As the system moved beyond pilot usage, distributed nodes began sending increasing volumes of device data to the central platform for further processing. The existing approach relied on direct push mechanisms and custom synchronization logic, which became hard to scale under heavy load and hard to recover after failures.
Decision
Move node-to-central data delivery from direct push mechanisms to a buffered transport model, implemented using Apache Kafka.
Alternatives Considered
Continue using direct push (WebSocket / HTTP)
Pros:
- Real-time delivery
- No additional infrastructure
- Simple initial implementation
Cons:
- Tight coupling between nodes and the central system
- Limited buffering under load
- Harder recovery after outages
- Scaling requires coordination across services
Extend custom retry and reconciliation logic
Pros:
- Avoids new platform components
- Preserves existing patterns
Cons:
- Retry behavior remains inconsistent
- Operational complexity grows
- Failure handling stays implicit
- Observability remains difficult
Reasoning
Apache Kafka introduced a durable buffer between data producers (nodes) and consumers (central processing). This moved delivery responsibility out of the nodes' direct request paths, so device data could be ingested regardless of whether the central system was available. Failures could be isolated: retry topics absorbed transient errors, and dead-letter queues held malformed or unrecoverable entries.
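The decoupling described here can be illustrated with a minimal in-memory sketch. It is not Kafka client code: `TopicLog`, the `"central"` group name, and the sample readings are hypothetical stand-ins for a single-partition topic with one consumer group, and the in-memory list only imitates the append-only log and committed offsets that Kafka persists durably.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """In-memory stand-in for a single-partition topic (not a Kafka API)."""
    records: list = field(default_factory=list)
    committed: dict = field(default_factory=dict)  # consumer group -> next offset

    def publish(self, record):
        """Append a record and return its offset; never waits on consumers."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Return records starting at the group's committed offset."""
        start = self.committed.get(group, 0)
        return self.records[start:start + max_records]

    def commit(self, group, count):
        """Advance the group's offset once `count` records are processed."""
        self.committed[group] = self.committed.get(group, 0) + count


log = TopicLog()

# A node keeps publishing while the central consumer is offline.
for reading in ({"device": "d1", "temp": 21.5}, {"device": "d2", "temp": 19.0}):
    log.publish(reading)

# The central consumer comes back and resumes from its committed offset.
batch = log.poll("central")
log.commit("central", len(batch))
remaining = log.poll("central")  # nothing left until a node publishes again
```

The point of the sketch is that `publish` succeeds without any consumer being present, and the consumer's position is tracked separately from the producer's writes.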
Why This Decision Mattered
Distributed nodes generated device data that required centralized storage and processing.
Direct push approaches tied node availability and load directly to the central system.
Implementation Approach
Kafka was introduced as the transport backbone:
- Nodes publish device data to topics
- The central system consumes and stores data
- Retry topics handle transient failures
- Dead-letter topics isolate malformed or unrecoverable entries
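The routing between the main, retry, and dead-letter topics listed above might look like the following sketch. Everything named here is an assumption made for illustration: the topic names, the retry budget `MAX_ATTEMPTS`, the two exception types, and the `store` handler do not come from the actual system, and a real consumer would publish to the returned topic with a Kafka client rather than just returning its name.

```python
RETRY_TOPIC = "device-data.retry"  # hypothetical topic name
DLQ_TOPIC = "device-data.dlq"      # hypothetical topic name
MAX_ATTEMPTS = 3                   # assumed retry budget


class TransientError(Exception):
    """Failure that may succeed later (e.g. storage briefly unavailable)."""


class MalformedRecord(Exception):
    """Failure that will never succeed, no matter how often it is retried."""


def store(record, storage_up=True):
    """Hypothetical central handler: validate the record, then persist it."""
    if "device_id" not in record:
        raise MalformedRecord("missing device_id")
    if not storage_up:
        raise TransientError("central storage unavailable")


def route(record, attempts, handler):
    """Process a record; return the topic it should go to next, or None."""
    try:
        handler(record)
        return None  # processed successfully, nothing to forward
    except MalformedRecord:
        return DLQ_TOPIC  # retrying cannot help
    except TransientError:
        # `attempts` counts earlier failed tries; this failure is one more.
        if attempts + 1 >= MAX_ATTEMPTS:
            return DLQ_TOPIC
        return RETRY_TOPIC


def failing_store(record):
    """Simulate an outage of central storage."""
    return store(record, storage_up=False)
```

For example, `route({"device_id": "d1"}, 0, failing_store)` returns the retry topic, while the same call with `attempts=2` exhausts the assumed budget and returns the dead-letter topic. Keeping the decision in one function makes the failure handling explicit rather than spread across application code.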
Outcome
- Decoupled nodes from central availability
- Enabled horizontal scaling of data processing
- Improved resilience under load
- Clearly separated transport and processing responsibilities
- Shifted delivery complexity from application logic to shared transport infrastructure