Architecting a Resilient Discord AI Bot: A Solutions Architect Case Study
Most AI bot writeups focus on model quality. This one is about system behaviour.
When I built this Discord + Ollama bot, the real challenge was not “can the model answer?” It was “can the system stay predictable when multiple users hit it at once, when inference is slow, and when platform limits are strict?”
That is exactly where Solutions Architecture work matters.
Context and constraints
The product goal was simple: people should be able to mention the bot in Discord and get useful responses with natural conversational continuity.
The operating constraints were not simple:
- Discord has hard message limits and strict interaction patterns.
- Local LLM inference (Ollama-hosted) has bounded throughput.
- User behaviour is bursty and concurrent.
- Failure modes are common: timeouts, deleted messages, permission changes, malformed inputs.
So the architecture had to optimise for controlled behaviour under stress, not best-case demos.
Architecture diagrams (PlantUML)
The design is documented in three PlantUML diagrams:
1) Component boundary map
2) Message request sequence
3) Failure and control flow
My role as Solutions Architect
I treated this as an architecture exercise across functional and non-functional requirements.
Functional:
- mention/reply interaction model
- streaming responses
- multi-turn context
- model switching for admins
Non-functional:
- reliability under concurrent demand
- graceful degradation on failures
- abuse prevention and safety controls
- maintainability through clear boundaries
- operational simplicity for a single-node deployment
My job was to turn those requirements into explicit boundaries and policies so the implementation stayed understandable and supportable.
Architectural principles applied
1) Separation of concerns
I split behaviour into focused layers: events, commands, services, middleware, and utilities. This kept integration points clear and made change safer over time.
2) Event-driven request flow
The core runtime hangs off Discord events. That gave a clean, traceable lifecycle from inbound message to outbound streamed response.
3) Backpressure via explicit queueing
Inference throughput is the bottleneck, so I made it explicit. Requests are serialised with queue limits and request timeouts instead of pretending unlimited concurrency exists.
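The queueing policy above can be sketched with asyncio primitives. This is a minimal illustration, not the bot's actual implementation: the class name, limits, and rejection behaviour are all assumptions for the sake of the example.

```python
import asyncio


class InferenceQueue:
    """Serialise LLM requests: one in flight, a bounded backlog, a per-request timeout."""

    def __init__(self, max_pending: int = 10, timeout_s: float = 60.0) -> None:
        self._lock = asyncio.Lock()                      # one inference at a time
        self._backlog = asyncio.Semaphore(max_pending)   # bounded waiting room
        self._timeout_s = timeout_s

    async def submit(self, job):
        # Reject fast when the backlog is full instead of queueing unboundedly.
        if self._backlog.locked():
            raise RuntimeError("queue full; try again later")
        async with self._backlog:
            async with self._lock:
                # The timeout covers the inference call itself, so a stuck
                # request cannot hold the lock forever.
                return await asyncio.wait_for(job(), timeout=self._timeout_s)
```

Rejecting at admission time keeps latency bounded for everyone already in the queue, which is the "controlled behaviour under stress" the design aims for.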
4) Bounded conversation memory
Conversation history is persisted but aggressively pruned by both message count and token budget. Context quality remains useful while memory growth stays controlled.
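The dual pruning rule can be expressed in a few lines. This sketch assumes a simple whitespace token counter as a stand-in; a real bot would use the model's tokenizer.

```python
def prune_history(messages, max_messages=20, max_tokens=2000,
                  count_tokens=lambda text: len(text.split())):
    """Keep the newest messages, bounded by both a count window and a token budget.

    `count_tokens` is a placeholder; swap in the serving model's tokenizer.
    """
    kept, budget = [], max_tokens
    for msg in reversed(messages[-max_messages:]):   # walk newest-first
        cost = count_tokens(msg)
        if cost > budget:
            break                                    # token budget exhausted
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))                      # restore chronological order
```

Walking newest-first means the budget is always spent on the most recent context, which is what keeps multi-turn replies relevant.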
5) Defensive input/output controls
Inputs are sanitised and truncated. Outputs are sanitised to neutralise @everyone and @here mentions. Rate limits and allowlists are applied before expensive work starts.
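Both sides of the boundary can be guarded cheaply. The functions below are illustrative; the zero-width-space trick is one common way to break Discord's mention parsing without altering the visible text.

```python
import re

ZWSP = "\u200b"  # zero-width space; splits the mention token without visible change


def sanitize_input(text: str, max_chars: int = 4000) -> str:
    """Strip control characters and truncate before any expensive work starts."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return text[:max_chars].strip()


def sanitize_output(text: str) -> str:
    """Neutralise mass mentions so model output can never ping a whole server."""
    return (text.replace("@everyone", f"@{ZWSP}everyone")
                .replace("@here", f"@{ZWSP}here"))
```

Discord's API also supports an allowed-mentions setting on send, which is worth using in addition to output sanitisation rather than instead of it.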
6) Layered failure handling
Errors are handled at multiple levels: request, command, message edit/send, and process-level handlers with graceful shutdown behaviour.
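The request-level layer can be captured as a wrapper so that no single failed message takes down the event loop. This is a sketch of the pattern, not the bot's code; the fallback strings are invented for the example.

```python
import asyncio
import logging

log = logging.getLogger("bot")


def guarded(fallback: str):
    """Wrap a handler so any failure degrades to a friendly reply, never a crash."""
    def wrap(handler):
        async def inner(message):
            try:
                return await handler(message)
            except asyncio.TimeoutError:
                return "The model timed out; please try again."
            except Exception:
                log.exception("handler failed")   # full traceback for operators
                return fallback                   # calm message for the user
        return inner
    return wrap
```

Command-level and process-level layers stack the same idea at coarser granularity: each layer converts an exception into the most graceful behaviour available at its scope.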
7) Configuration-driven behaviour
Model parameters, limits, and environment settings are externalised in config. Runtime behaviour becomes tunable without code churn.
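One way to externalise those knobs is a single frozen config object read from the environment. The variable names and defaults below are illustrative, not the bot's actual keys.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BotConfig:
    """All tunables in one place; hypothetical names, read once at startup."""
    model: str = os.getenv("OLLAMA_MODEL", "llama3")
    max_history: int = int(os.getenv("BOT_MAX_HISTORY", "20"))
    token_budget: int = int(os.getenv("BOT_TOKEN_BUDGET", "2000"))
    request_timeout_s: float = float(os.getenv("BOT_REQUEST_TIMEOUT", "60"))
```

Freezing the dataclass makes the configuration immutable after startup, so a limit can only change through a deliberate restart, not a stray assignment.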
Key decisions and tradeoffs
Serialised queueing vs higher throughput
I chose serialised processing because Ollama inference is effectively the constrained shared resource in this design.
Tradeoff:
- Pros: predictable latency envelope, fewer cascading failures, no model host thrashing.
- Cons: reduced parallel throughput under heavy burst traffic.
This was acceptable because consistency and reliability mattered more than peak concurrent throughput for this use case.
Persistent history plus pruning vs stateless requests
I persisted conversation history in SQLite and applied dual pruning (sliding window + token budget).
Tradeoff:
- Pros: continuity feels natural and context remains relevant.
- Cons: requires lifecycle management, cleanup, and bounded retention policies.
For a chat bot, continuity is part of product quality, so bounded persistence was the right compromise.
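The persistence side of this tradeoff fits in a small SQLite layer. The schema and window value here are assumptions for illustration; the real tables may differ.

```python
import sqlite3


def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create an illustrative per-channel history table."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS history (
        channel_id TEXT NOT NULL,
        role       TEXT NOT NULL,
        content    TEXT NOT NULL,
        created_at TEXT DEFAULT (datetime('now')))""")
    return db


def append_and_prune(db, channel_id, role, content, window=20):
    """Insert a turn, then drop everything older than the sliding window."""
    db.execute("INSERT INTO history (channel_id, role, content) VALUES (?, ?, ?)",
               (channel_id, role, content))
    # LIMIT -1 OFFSET n selects all rows past the newest `window` rows.
    db.execute("""DELETE FROM history WHERE rowid IN (
                    SELECT rowid FROM history WHERE channel_id = ?
                    ORDER BY rowid DESC LIMIT -1 OFFSET ?)""",
               (channel_id, window))
    db.commit()
```

Pruning at write time keeps the retention policy enforced continuously instead of relying on a separate cleanup job.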
Strict rate limits and guard clauses vs maximal openness
Per-user cooldowns and hourly caps run early, alongside guild/channel allowlists and input checks, before any inference work is queued.
Tradeoff:
- Pros: protects capacity, curbs abuse, preserves fair usage.
- Cons: occasionally blocks legitimate high-frequency users.
I prioritised system stability and predictable service quality for the majority.
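Combining a cooldown with a rolling hourly cap takes only a small amount of bookkeeping. The limits below are placeholder values, not the bot's actual policy.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Per-user cooldown plus a rolling hourly cap; limits are illustrative."""

    def __init__(self, cooldown_s: float = 5.0, hourly_cap: int = 30):
        self.cooldown_s = cooldown_s
        self.hourly_cap = hourly_cap
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self._events[user_id]
        while q and now - q[0] > 3600:          # expire events outside the window
            q.popleft()
        if q and now - q[-1] < self.cooldown_s:
            return False                        # cooldown not yet elapsed
        if len(q) >= self.hourly_cap:
            return False                        # hourly cap reached
        q.append(now)
        return True
```

Because the check runs before anything is queued, a blocked request costs almost nothing, which is exactly why guard clauses belong ahead of the expensive path.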
Streaming UX vs platform limits
Streaming improves perceived responsiveness, but Discord imposes a 2,000-character message limit.
Decision:
- stream incremental edits while content remains within limits
- sanitise output and split large responses into coherent chunks
This kept UX responsive while respecting hard platform constraints.
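The splitting step can be sketched as a function that prefers paragraph boundaries and only hard-splits when a single paragraph exceeds the limit. The implementation is a simplified illustration of the approach described above.

```python
DISCORD_LIMIT = 2000  # Discord's hard per-message character limit


def split_message(text: str, limit: int = DISCORD_LIMIT) -> list:
    """Split text into <= limit chunks, preferring paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= limit:
            current = candidate                 # paragraph still fits this chunk
            continue
        if current:
            chunks.append(current)              # flush the full chunk
            current = ""
        while len(para) > limit:                # lone oversized paragraph: hard split
            chunks.append(para[:limit])
            para = para[limit:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

A fuller version would also avoid splitting inside code fences; the same flush-then-hard-split structure extends naturally to that case.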
Outcome
The resulting system behaves like an operational product instead of a fragile demo:
- predictable under concurrent usage
- resilient across common runtime failures
- maintainable because responsibilities are explicit
- adaptable through configuration and modular services
Most importantly, architecture decisions translated directly into user-visible reliability.
What I’d evolve next
If I were extending this architecture now, I would prioritise:
- Observability and SLOs: structured telemetry, queue latency histograms, error budgets.
- Smarter model routing: workload-aware model selection and fallback policies.
- Admin operations surface: richer diagnostic and control commands in Discord.
- Deeper resilience patterns: retry budgets, circuit-breaker semantics, and clearer degradation modes.
The core lesson remains the same: in AI-enabled systems, reliability architecture is product architecture.