When it fits
- You want strong per-session isolation — each bot runs in its own VM, so misbehavior in one session can’t affect another.
- Your traffic is bursty enough that maintaining a warm pool would mostly burn money on idle capacity.
- You’re comfortable with cold-start latency in the seconds range on each new session (mitigated by image size discipline and the provider’s start-up speed).
- You don’t want to operate a long-running fleet — the provider’s machines API is the only thing you talk to.
When it doesn’t
- Cold-start latency dominates your UX. If users expect a bot to answer within a second of pressing “call”, per-session VMs are usually too slow without significant work (small images, warm reserves, optimistic client connect + poll).
- Your concurrency exceeds the provider’s per-account API rate limits or instance-count quotas. These are routinely raised on request but worth checking before committing.
- Your bots are extremely lightweight — at some point the VM-per-session overhead dominates the actual work.
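A quick back-of-envelope check can tell you whether the quota concern above applies before you commit. The sketch below uses Little's law (steady-state concurrency equals arrival rate times mean session duration). Every number in it, and the `ProviderLimits` type itself, is a hypothetical placeholder; substitute the limits your provider actually publishes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderLimits:
    # Both values are hypothetical; read the real ones from your
    # provider's quota page or ask their support.
    max_instances: int           # instance-count quota for the account
    max_creates_per_sec: float   # rate limit on machine-create API calls


def capacity_check(sessions_per_sec: float, avg_session_sec: float,
                   limits: ProviderLimits) -> dict:
    # Little's law: concurrent sessions = arrival rate * mean duration.
    concurrent = math.ceil(sessions_per_sec * avg_session_sec)
    return {
        "expected_concurrent": concurrent,
        "fits_instance_quota": concurrent <= limits.max_instances,
        # One create call per session, so the session arrival rate is
        # also the create-call rate.
        "fits_create_rate": sessions_per_sec <= limits.max_creates_per_sec,
    }


report = capacity_check(
    sessions_per_sec=0.5,   # assumed peak: one new session every 2 seconds
    avg_session_sec=300,    # assumed 5-minute average session
    limits=ProviderLimits(max_instances=100, max_creates_per_sec=1.0),
)
print(report)
```

With these assumed inputs the create rate fits but the instance quota does not, which is the kind of result worth finding out before launch rather than after.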
How it usually looks
The dispatcher:
- Receives POST /start (or whatever your equivalent is).
- Authenticates the request.
- Creates whatever transport-side resources the bot will need (e.g. a Daily room and tokens).
- Calls the cloud provider’s machines API with the bot image and a command that passes --room-url / --token / etc. into the bot.
- Waits for the machine to enter a “started” state (most providers expose a synchronous wait endpoint).
- Returns the join URL to the client.
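The dispatcher steps above can be sketched roughly as follows. This is a minimal illustration, not any provider's real API: the `Transport` and `Machines` interfaces, the `Room` type, the `python bot.py` entrypoint, the shared-key auth scheme, and the fake implementations are all assumptions made so the example runs on its own. In practice you would back the two interfaces with your transport's and your cloud provider's actual clients.

```python
import hmac
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Room:
    url: str
    bot_token: str
    client_token: str


class Transport(Protocol):       # hypothetical wrapper around e.g. a room API
    def create_room(self) -> Room: ...


class Machines(Protocol):        # hypothetical wrapper around a machines API
    def create(self, image: str, cmd: list[str]) -> str: ...
    def wait_started(self, machine_id: str, timeout_s: float) -> bool: ...


@dataclass
class Dispatcher:
    transport: Transport
    machines: Machines
    api_key: str
    bot_image: str
    start_timeout_s: float = 30.0

    def start(self, presented_key: str) -> dict[str, str]:
        # Authenticate the request (constant-time comparison).
        if not hmac.compare_digest(presented_key.encode(), self.api_key.encode()):
            raise PermissionError("invalid API key")
        # Create the transport-side resources the bot will need.
        room = self.transport.create_room()
        # Launch one machine for this session, passing the room details in.
        cmd = ["python", "bot.py", "--room-url", room.url, "--token", room.bot_token]
        machine_id = self.machines.create(self.bot_image, cmd)
        # Block until the machine reports that it has started.
        if not self.machines.wait_started(machine_id, self.start_timeout_s):
            raise TimeoutError(f"machine {machine_id} did not reach 'started'")
        # Hand the join details back to the client.
        return {"room_url": room.url, "token": room.client_token}


# In-memory fakes so the sketch runs without any cloud account.
class FakeTransport:
    def create_room(self) -> Room:
        return Room("https://example.invalid/room-1", "bot-tok", "client-tok")


class FakeMachines:
    def __init__(self) -> None:
        self.launched: list[tuple[str, list[str]]] = []

    def create(self, image: str, cmd: list[str]) -> str:
        self.launched.append((image, cmd))
        return f"machine-{len(self.launched)}"

    def wait_started(self, machine_id: str, timeout_s: float) -> bool:
        return True


machines = FakeMachines()
dispatcher = Dispatcher(FakeTransport(), machines, api_key="s3cret",
                        bot_image="registry.example.invalid/bot:latest")
result = dispatcher.start("s3cret")
print(result)
```

Keeping the provider calls behind two small interfaces means the request flow can be tested without starting real machines, and the same dispatcher can be pointed at a different provider later.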
Tradeoffs worth being explicit about
- Cold-start vs. cost. No warm capacity means low idle cost but a slow first response on every new session. Mitigations: keep images small, pre-cache pipeline models at build time, and optimistically return the join URL once dispatch has been requested (letting the client poll for “bot ready”).
- Isolation vs. response time. Fresh VM per session is the cleanest possible isolation model, but every session pays the cloud provider’s full instance-startup latency on the way in. There’s no way to amortize that across sessions without giving up the per-session isolation that’s the whole point.
- Rate limits and quotas. Your machines-API rate limit and instance-count quota become your real concurrency ceiling. Find out what both are before you commit to this pattern.
- Image discipline. Image size directly affects cold-start time. Multi-stage builds, baking only what the bot needs at runtime, and keeping VAD/STT models cached are all material here.
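If you take the optimistic-return mitigation mentioned under cold-start vs. cost, the client needs some way to wait for “bot ready”. One possible shape is a poll loop with capped exponential backoff, sketched below. The `is_ready` callable is a stand-in for whatever readiness signal you expose (a status endpoint, a participant-joined event, and so on), and the timeout and delay defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 30.0,
                     initial_delay_s: float = 0.25,
                     max_delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        # Double the delay each round, up to the cap.
        delay = min(delay * 2, max_delay_s)


# Demo with a fake readiness check that succeeds on the third poll.
# Sleeping is replaced by recording the requested delays, so this runs instantly.
calls = {"n": 0}


def fake_is_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


delays: list[float] = []
ready = wait_until_ready(fake_is_ready, sleep=delays.append)
print(ready, delays)
```

Starting with a short delay keeps the added latency small when the machine comes up quickly, while the cap stops a slow start from turning into a burst of status requests.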
See also
- Fly.io worked example — end-to-end walkthrough of the pattern on Fly Machines.
- Modal — similar pattern on Modal’s function infrastructure.
- Cerebrium — similar with GPU support if your pipeline needs it.