Project Overview & PDR
MIO — Project Overview & Product Development Requirements
Last updated: 2026-05-13 (P9.5 admin control plane + TUI shipped)
Status: POC. Design is locked for the core inbound/outbound path and admin introspection; the control plane is ready for tenant management.
Problem & Motivation
Every chat platform webhook has a hard ack deadline:
- Slack: 3 seconds
- Discord: 3 seconds
- Zoho Cliq: ~5 seconds
- Telegram: ~30 seconds (loose, but retry-heavy)
LLM-driven responses take 2–30 seconds. Coupling the webhook ack to an LLM call means:
- First slow run drops the message (timeout)
- Channel retries fire while the first request is still in-flight
- Duplicate consumers corrupt the AI conversation state
- One chatty user starves others (no per-tenant fairness)
MIO’s answer: Decouple the transport layer from intelligence. The gateway durably persists each message to a bus and acks within the channel deadline; the AI service consumes from the bus on its own schedule.
Vision & Scope
What MIO is:
- A messaging I/O platform that normalizes inbound messages from any chat channel into a canonical envelope
- A durable bus (NATS JetStream) that decouples receive from processing
- A sender pool that dispatches outbound commands back to channels
- A sidecar (media-vault) that persists attachments across platform TTL boundaries
- A thin SDK for AI services to consume/publish without importing channel-specific code
What MIO is not:
- An AI agent framework (that’s MIU’s job)
- A UI or workspace admin console (owned by MIU)
- A staging cluster (solo-dev scale with fast rollback)
- A managed cloud service (cloud-agnostic K8s by design)
- A dedicated BigQuery sink (GCS + external tables, pipeline owned by deployer)
Target Users
- AI service teams — Build agents without understanding channel-specific quirks
- Channel-ops engineers — Deploy new channel adapters, manage inbound/outbound flows
- Deployment teams — Self-host on GKE or run locally for development
- Analytics consumers — Tap the bus for training data, audit logs, cold storage
Functional Requirements
Inbound Path
- Receive webhook POST from channel with HMAC signature
- Verify signature (HMAC-SHA256)
- Normalize to canonical `Message` proto envelope
- Idempotent dedup via `(account_id, source_message_id)` UNIQUE constraint
- Publish to `MESSAGES_INBOUND` stream within channel deadline (p99 < 500ms)
- Silently 200 OK on duplicate to prevent channel retry loops
- Support attachments with platform-TTL-aware fetching
- Enrich attachment URLs with persistent storage references
- Reconcile provider history gaps outside the webhook hot path
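The verify, dedup, publish, ack sequence above can be sketched as follows. This is a minimal illustration, not the gateway's implementation: the in-memory SQLite table, the `inbound_seen` name, the `handle_webhook` signature, the hex-encoded signature format, and the 401 response on a bad signature are all assumptions made for the example.

```python
import hashlib
import hmac
import sqlite3


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest keeps the comparison constant-time.
    return hmac.compare_digest(expected, signature_hex)


def open_dedup_store() -> sqlite3.Connection:
    """In-memory stand-in for the operational store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbound_seen ("
        " account_id TEXT NOT NULL,"
        " source_message_id TEXT NOT NULL,"
        " UNIQUE (account_id, source_message_id))"
    )
    return conn


def handle_webhook(conn, secret, body, signature_hex,
                   account_id, source_message_id, publish) -> int:
    """Return the HTTP status the gateway would send back to the channel."""
    if not verify_signature(secret, body, signature_hex):
        return 401  # assumed status for a bad signature
    try:
        # The connection context manager commits on success and rolls back
        # if publish raises, so a failed publish can be retried later.
        with conn:
            conn.execute(
                "INSERT INTO inbound_seen VALUES (?, ?)",
                (account_id, source_message_id),
            )
            publish(body)
    except sqlite3.IntegrityError:
        # UNIQUE violation: a channel retry. Ack without publishing again.
        return 200
    return 200
```

Recording the key and publishing inside one transaction means a retry after a failed publish is not mistaken for a duplicate. A real gateway would also have to decide what to do when the publish succeeds but the commit fails.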
Outbound Path
- AI service publishes `SendCommand` to `MESSAGES_OUTBOUND` stream
- Sender pool (per-channel consumer) drains work queue
- Respect per-account rate limits (channel-specific)
- Retry transient errors (5xx, network timeout) with exponential backoff
- Dead-letter permanent failures (4xx) as terminal errors
- Support message edits via `edit_of_external_id` reference
- Support channel-agnostic rich outbound content (`SendCommand.rich_content`)
- Render URL-backed outbound attachments in supported adapters
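The retry and dead-letter rules above could look roughly like this. It is a sketch under stated assumptions: `send` is a hypothetical callable that returns an HTTP status or raises `TimeoutError`, `dead_letter` is a hypothetical callback, and the base delay, factor, cap, and attempt count are illustrative values, not MIO's configuration.

```python
import time


def backoff_delays(attempts: int, base: float = 0.5,
                   factor: float = 2.0, cap: float = 30.0) -> list[float]:
    """Exponential delays: base, base*factor, ... each capped at `cap` seconds."""
    return [min(cap, base * factor**i) for i in range(attempts)]


def deliver(send, command, dead_letter, max_attempts: int = 5,
            sleep=time.sleep) -> bool:
    """Try to send `command`; return True on success, False if dead-lettered."""
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            status = send(command)
        except TimeoutError:
            status = None  # network timeout: treated as transient
        if status is not None and 200 <= status < 300:
            return True
        if status is not None and 400 <= status < 500:
            # Permanent failure: retrying cannot help, so park it immediately.
            dead_letter(command, f"permanent failure: HTTP {status}")
            return False
        # 5xx, timeouts, and anything unexpected: back off and retry.
        if attempt < max_attempts - 1:
            sleep(delays[attempt])
    dead_letter(command, "retries exhausted")
    return False
```

The sketch follows the rule as written and treats every 4xx as permanent. An adapter would probably want to carve out HTTP 429 (rate limited) as retryable, and adding jitter to the delays helps avoid synchronized retries across senders.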
Multi-tenancy & Fairness
- Per-account rate limits (one workspace must not starve others)
- Per-conversation ordering via `MaxAckPending=1` on AI consumer
- Account-scoped idempotency keys
- Tenant isolation in deployments
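One common way to enforce per-account limits is a token bucket keyed by account. The sketch below assumes that approach; the source does not say which algorithm MIO uses, and `AccountLimiter`, its parameters, and the injected clock are hypothetical names for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float      # tokens refilled per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AccountLimiter:
    """One bucket per account, so a busy account cannot drain another's budget."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(account_id)
        if bucket is None:
            # New accounts start with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[account_id] = bucket
        return bucket.allow(now)
```

Because each account has its own bucket, exhausting one account's budget leaves every other account's budget untouched, which is the fairness property the requirement asks for.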
Observability
- OpenTelemetry trace context propagation (channel → gateway → bus → AI → outbound)
- Prometheus metrics (inbound latency, outbound send count, JetStream consumer lag)
- Structured JSON logging (slog in Go, structlog in Python)
- Label discipline (only `channel_type`, `direction`, `outcome`; no cardinality bombs)
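Label discipline can be enforced in code rather than by convention. The following is a standard-library sketch of the idea only; `LabeledCounter` is a hypothetical helper, not a Prometheus client API, and a real service would wrap its metrics library in a similar check.

```python
from collections import Counter

# The only label names the requirements allow on metrics.
ALLOWED_LABELS = frozenset({"channel_type", "direction", "outcome"})


class LabeledCounter:
    """Counter that rejects any label outside the allowlist."""

    def __init__(self, name: str):
        self.name = name
        self.values: Counter = Counter()

    def inc(self, **labels: str) -> None:
        extra = set(labels) - ALLOWED_LABELS
        if extra:
            # Fail loudly: labels such as account_id or user_id would create
            # one time series per distinct value.
            raise ValueError(f"{self.name}: disallowed labels {sorted(extra)}")
        self.values[tuple(sorted(labels.items()))] += 1
```

Rejecting unknown labels at the call site surfaces a cardinality mistake in tests instead of in the metrics backend.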
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Gateway inbound p99 latency | < 500ms | Must ack within channel deadline (3–5s) |
| Attachment availability | ≥ 7 days after receipt | Spans LLM processing + retry window |
| AI consumer ordering | Per-conversation FIFO | Correct conversation thread |
| Per-account rate limiting | Enforced per channel API limits | Prevent workspace starvation |
| Data durability | 7-day retention on bus | Replay for iteration, re-training |
| Cloud portability | No vendor lock-in | Run on GKE, EKS, AKS unchanged |
| Wire format stability | WIRE_JSON breaking checks | SDK versioning safety |
| Schema evolution | Proto field numbers never reused | Historical data queryable |
Architectural Principles
- Gateway is dumb. Only signature → normalize → publish → ack. Intelligence lives in consumers.
- Consumers own their logic. No shared business logic in the gateway; pull from the bus.
- Idempotency at the edge. `(account_id, source_message_id)` is the idempotency key, not request UUID.
- Bus is the contract. All consumers see the same protobuf `Message` and `SendCommand` envelopes.
- Per-workspace fairness. Rate limits are per-account, not global. Slow consumers don’t block others.
- Two-step UX for latency. LLM run > 1s? Send “thinking…” first, then edit-in-place with the answer.
- Storage tiers are separate. Operational (Postgres) ≠ Bus (NATS JetStream) ≠ Lake (GCS).
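The two-step UX principle can be sketched with `asyncio`: wait up to a threshold for the answer, and only fall back to a placeholder when the threshold passes. Here `run_llm`, `send`, and `edit` are hypothetical coroutines; in MIO terms the edit would presumably be a `SendCommand` that references the placeholder via `edit_of_external_id`.

```python
import asyncio


async def reply_with_placeholder(run_llm, send, edit,
                                 threshold_s: float = 1.0) -> str:
    """Send the answer directly if it is fast; otherwise placeholder, then edit."""
    task = asyncio.create_task(run_llm())
    # asyncio.wait does not cancel the task on timeout; it just stops waiting.
    done, _pending = await asyncio.wait({task}, timeout=threshold_s)
    if done:
        answer = task.result()
        await send(answer)
        return answer
    # Slow path: acknowledge to the user now, replace the text when ready.
    message_id = await send("thinking...")
    answer = await task
    await edit(message_id, answer)
    return answer
```

A fast answer costs one outbound message; a slow one costs two operations (send plus edit) but gives the user immediate feedback.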
Success Metrics
POC Milestone (P8–P9)
- End-to-end Cliq message → echo reply within 3-second budget
- Attachment fetchable ≥ 7 days after receipt
- Gateway p99 inbound latency < 500ms (measured)
- Per-account rate limiting enforced (tested via `TestFairness` bench)
- Zero data loss on channel signature verify → publish → ack
Production Readiness (future)
- SLO: p99 inbound < 500ms, p95 outbound < 2s per channel API
- RTO: < 5 minutes on gateway crash (new pod + rejoin NATS cluster)
- RPO: < 1 minute (stream retention 7d for inbound, 24h for outbound)
- Scalability: 1k msg/sec inbound per account (burst), 10k msg/sec cluster-wide
Stakeholders & Boundaries
MIO owns (this repo)
- Message envelope schema (protobuf `Message`, `SendCommand`)
- Gateway (inbound signature verify, normalize, publish; outbound send)
- SDKs (sdk-go, sdk-py) for consuming/publishing
- Bus provisioning (NATS JetStream streams, retention, consumer defs)
- Archive sink (sink-gcs) to cold storage
- Attachment sidecar (media-vault) for TTL-aware persistence
- Control plane (admin server) for tenant/channel/message introspection
MIU owns (separate repo)
- AI agent logic (LangGraph, Hatchet integration)
- Intelligence routing (which agent processes which message)
- Conversation state (Postgres operational storage)
- UI/UX (workspace admin console, agent management)
Deployer owns (infra repo, not mio)
- Kubernetes cluster (GKE, EKS, etc.)
- Secret management (age keys, webhook secrets, credentials)
- CNPG Postgres instance + backups
- GCS bucket provisioning
- BigQuery loader pipeline (external to mio)
- FluxCD reconciliation (Helm releases)
Contract between MIO and MIU
    MESSAGES_INBOUND stream (mio.v1.Message)
        ↓
    [AI consumes via sdk-py]
    [Think]
        ↓
    [AI publishes via sdk-py]
    MESSAGES_OUTBOUND stream (mio.v1.SendCommand)
        ↓
    [Gateway sender-pool consumes]
    [Deliver to channel API]

SDK contract: `Client.consume()` and `Client.publish()` accept/return canonical proto messages. No channel-specific logic in either library.
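From the AI service's side, the contract reduces to a consume/publish loop over canonical envelopes. The sketch below uses an in-memory stand-in to show that shape. `InMemoryClient`, the dataclass fields, and the method signatures are assumptions for illustration only; they are not the real sdk-py API or the `mio.v1` schema.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Stand-in for mio.v1.Message (fields are illustrative)."""
    account_id: str
    conversation_id: str
    text: str


@dataclass(frozen=True)
class SendCommand:
    """Stand-in for mio.v1.SendCommand (fields are illustrative)."""
    account_id: str
    conversation_id: str
    text: str


class InMemoryClient:
    """Hypothetical client that mimics consume()/publish() without a bus."""

    def __init__(self):
        self.inbound: deque = deque()
        self.outbound: list = []

    def consume(self):
        # Yield queued inbound messages in arrival order.
        while self.inbound:
            yield self.inbound.popleft()

    def publish(self, command: SendCommand) -> None:
        self.outbound.append(command)


def run_echo_agent(client) -> int:
    """Consume every inbound message and publish an echo reply for each."""
    handled = 0
    for msg in client.consume():
        reply = SendCommand(msg.account_id, msg.conversation_id, f"echo: {msg.text}")
        client.publish(reply)
        handled += 1
    return handled
```

The agent touches only `Message` and `SendCommand`; nothing in it knows whether the message arrived from Slack, Cliq, or any other channel.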
Open Questions (Design, not Implementation)
- Per-thread sharding: Stay global `MaxAckPending=1` or shard by `conversation_id` when throughput demands? (Decide at 1k msg/s milestone)
- Edit semantics resolver: Each channel (Slack, Cliq, Telegram, Discord) has different edit models. How does the gateway know which resolver to invoke? (Design at P5+, not now)
- Dead-letter strategy: Separate `MESSAGES_DLQ` stream vs. in-place `terminated` flag + quarantine table? (Defer until first real permanent failure)
- Attachment backend swap: S3, Azure Blob, Backblaze B2 factory pattern ready; do we need day-one multi-backend? (Defer until operational need)
References
- System Architecture — Full design doc with mermaid diagrams
- Code Standards — Governance rules, adapter pattern, subject grammar
- Deployment Guide — GKE reference, secret rotation, HA paths
- Codebase Summary — Component layout and API surfaces