Scaling Live Video Compositing Without a DevOps Team

Scaling live video compositing infrastructure is harder than it looks – it breaks most of the assumptions that make ordinary backends easy to scale.

For the past few months I've been working on this at Software Mansion, and we ran into a stack of problems that I think most teams hit when they try to build it themselves: orchestration, scaling, recovery, the usual suspects. I want to walk through what we learned, but first, a bit of context.

What even is "live video compositing"?

Live video compositing is the process of combining multiple video streams into one in real time. Take N video streams: your webcam, your favorite streamer's livestream, and a very necessary AI avatar. We feed them to a magical black box and receive a single video stream as the output: you and your AI avatar friend watching your favorite streamer side by side.

In the real world, when that "magical black box" runs in the cloud, it's called server-side compositing. It's the same heavy-lifting tech Google deployed for YouTube TV's Multiview feature. Instead of forcing your phone or laptop to decode three intense video feeds at once, a cloud server (running something like FFmpeg) "sews" them together in real time and delivers a single, lightweight stream to your screen.

[Figure: two video players joined through a central node; the participant tile and label attached to the left player end up composited into the right player's video frame. Caption: Many streams in, one frame out – that's real-time compositing.]

That's video compositing. For it to be live, ~~it has to not be dead~~ it has to be fast enough, let's say less than 1 second (though for some scenarios it might make sense to say less than 10 seconds).
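To make the "many streams in, one frame out" idea concrete, here is a toy sketch in Python. It is not how Smelter or FFmpeg work internally: frames are tiny grids of grayscale numbers, the only layout is side by side, and every name in it is made up for illustration. A real compositor does this for every output frame, on a GPU, on a deadline.

```python
from typing import List

Pixel = int                # one grayscale value per pixel, to keep the toy small
Frame = List[List[Pixel]]  # a frame is a list of rows


def composite_side_by_side(frames: List[Frame]) -> Frame:
    """Place the input frames next to each other, left to right."""
    if not frames:
        raise ValueError("need at least one input frame")
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("this toy layout needs frames of equal height")
    # Row i of the output is row i of every input, concatenated.
    return [
        [pixel for frame in frames for pixel in frame[row]]
        for row in range(height)
    ]


# Three "streams", one frame each: webcam, streamer, AI avatar.
webcam = [[1, 1], [1, 1]]
streamer = [[2, 2, 2], [2, 2, 2]]
avatar = [[3], [3]]

output = composite_side_by_side([webcam, streamer, avatar])
```

Here `output` is a single 2-row frame that is 6 pixels wide: the three inputs, stitched together.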

Server-side vs. client-side video compositing

Most video compositing happens locally, on the client side.

If you've ever used OBS, congrats – you've probably done client-side video compositing! It all happens on your machine, which is why you might've heard your fans spin up.

Client-side compositing is a sane default, but not every use case can – or should – run on consumer hardware. More demanding workloads, like high-quality broadcasts, need dedicated hardware. That's where server-side compositing comes in. Other use cases benefit from it too: compositing a livestream from a video conferencing room on the server gives every viewer a unified feed and cuts bandwidth costs.

This post is about server-side compositing – because the only way to "scale" on the client is to buy a better graphics card.

Who even needs live video compositing?

I do! But in all seriousness, the main use cases are:

Live broadcasting – sports broadcasts, co-streaming, anything that needs dynamic overlays or combines multiple concurrent streams.
AI and generated content – newer territory, but those chatbots with live avatars need some form of video compositing under the hood.
Interactive livestreaming – game shows, auctions, watch parties, anything where viewers' camera feeds get pulled into the broadcast in real time.

One thing you might notice about these use cases is that they all tend to operate at large scale. Sports are everywhere, and so is collaborative livestreaming. I know you wouldn't believe me, but AI is also pretty popular right now.

Scaling live video compositing is hard

Now that we know what live video compositing is, how do we go about actually integrating it into our application?

Luckily for me, my colleagues at Software Mansion are building Smelter, a live compositing server with a declarative API. We could also reach for FFmpeg, GStreamer, or even OBS as the underlying compositing engine – but each comes with its own headaches.

We get a proof of concept with Smelter up and running in a few days, and it works like a charm! Now we just need to do the same thing, but instead of 1 Smelter instance, we need 20.

Quick question: what makes a workload easy to scale?

Statelessness

You've probably heard this one. I won't rehash why state is a problem in a multi-server setting – there are plenty of good explanations out there. TL;DR: state makes request routing complex.

Usually it's easy to sidestep: offload the state to a database and let someone else figure it out.

That doesn't work for real-time video compositing. A database can't handle the raw throughput of video – the streams have to land on the same compositing server, frame by frame, so it can render the output in real time. And changing a layout on the fly means mutating state that lives on one specific server.
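Here is a minimal sketch of what that stickiness means for routing, assuming a hypothetical `CompositionRouter` that we would have to run ourselves (it is not part of Smelter). Creation picks a server once; after that, every input, output, and layout update for the composition has to go to that same server.

```python
from typing import Dict, List


class CompositionRouter:
    """Pins each composition to one server and routes everything for it there."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("need at least one server")
        self._load: Dict[str, int] = {server: 0 for server in servers}
        self._placement: Dict[str, str] = {}

    def create(self, composition_id: str) -> str:
        """Place a new composition on the least-loaded server."""
        if composition_id in self._placement:
            return self._placement[composition_id]
        server = min(self._load, key=lambda s: self._load[s])
        self._placement[composition_id] = server
        self._load[server] += 1
        return server

    def route(self, composition_id: str) -> str:
        """Every later request for this composition must hit the same server."""
        return self._placement[composition_id]  # KeyError if it was never created


router = CompositionRouter(["smelter-a", "smelter-b"])
first = router.create("watch-party-1")
second = router.create("watch-party-2")
```

Note that the placement table is itself state that has to live somewhere, which is exactly the routing complexity a stateless service gets to skip.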

Latency tolerance

No one likes latency – but some tolerate it less than others. If an API request takes 5 seconds, you'll assume someone wrote their backend in Python. If a video call has 5 seconds of delay, you'll leave and never come back.

Real-time video compositing is much closer to the video call end of that spectrum than a "normal" backend. You don't want to be watching a football match and get a goal notification on your phone before the ball even hits the net.

Latency also shapes how immersive the experience feels. A livestream where the chat reacts in sync with what's happening on screen feels alive; one where comments lag a few seconds behind feels like watching a recording with strangers shouting at the wrong moments. Every second of latency is a second of immersion lost.

So clients need to connect to servers that are geographically close, and we can't afford too many network hops inside our own infrastructure either. In other words: this isn't a problem you solve by throwing it into a single data center.
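As a rough sketch of "geographically close", assume we can measure round-trip time from a client to each region. The region names, the numbers, and the `pick_region` helper below are all invented for illustration:

```python
from typing import Dict


def pick_region(rtt_ms: Dict[str, float], budget_ms: float) -> str:
    """Return the region with the lowest measured round-trip time."""
    if not rtt_ms:
        raise ValueError("no regions to choose from")
    region = min(rtt_ms, key=lambda r: rtt_ms[r])
    if rtt_ms[region] > budget_ms:
        # Even the closest region is too far away for a live experience.
        raise LookupError(f"no region within {budget_ms} ms")
    return region


# Made-up measurements from one client's point of view.
measured = {"eu-central": 18.0, "us-east": 95.0, "ap-south": 140.0}
nearest = pick_region(measured, budget_ms=50.0)
```

With a single data center, every client outside its neighborhood would land in the `LookupError` branch.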

Draining

The more servers you have, the harder it gets to update them.

First, you ignore the problem. Then you start hearing about how all the cool kids have zero-downtime deployments. You don't really need them – but you want to be cool, so you implement zero-downtime deployments:

1. Spin up a new server
2. Stop routing requests to the old server
3. Wait for the old server to finish processing its requests
4. Stop the old server

Now you're cool.

One question – how long did you wait in step 3? If it was an HTTP API server, probably a few seconds.

If it was a live video compositing server, you waited a bit longer. Like a few minutes. Or a few hours.
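The drain logic itself is simple; it's the waiting that hurts. Here is a minimal sketch with a hypothetical `CompositingServer` class (not a Smelter API) that refuses new compositions once draining starts and only becomes safe to stop when the last live one ends:

```python
from typing import Set


class CompositingServer:
    """Tracks live compositions so we know when it is safe to stop."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.accepting = True
        self.active: Set[str] = set()

    def start_composition(self, composition_id: str) -> None:
        if not self.accepting:
            raise RuntimeError(f"{self.name} is draining")
        self.active.add(composition_id)

    def end_composition(self, composition_id: str) -> None:
        self.active.discard(composition_id)

    def begin_drain(self) -> None:
        """Stop taking new compositions; existing ones keep running."""
        self.accepting = False

    def can_stop(self) -> bool:
        """True only once draining has begun and nothing is live anymore."""
        return not self.accepting and not self.active


old = CompositingServer("old")
old.start_composition("football-final")
old.begin_drain()
stoppable_during_match = old.can_stop()   # the match is still on
old.end_composition("football-final")
stoppable_after_match = old.can_stop()    # now the server can go away
```

For an HTTP server, the gap between `begin_drain()` and `can_stop()` returning true is a few seconds. Here it's however long the broadcast runs.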

There are more properties that make scaling easier or harder, but I'll stop here – the ones above already make my job sound hard enough.

Scalable video compositing infrastructure

The outlined issues make scaling horizontally more of a challenge. We're lazy, so we don't want to reinvent the wheel. Ideally, someone has already deployed similarly behaving workloads at scale and we can "borrow" their ideas...

If you squint hard enough, a live video compositing server is basically a multiplayer game server for something like an online FPS.

Don't believe me? Let's compare:

| Server | Resource usage | Stateful? | Sensitive to latency? | Takes long to drain? |
| --- | --- | --- | --- | --- |
| Smelter | Very high (GPU-bound) | Yes | Yes – sub-second | Yes – minutes to hours |
| Game server | High | Yes | Yes – sub-100ms | Yes – until match ends |

They even run on a fixed framerate clock 🤯

The similarity is uncanny, so we can leverage it when designing our architecture. How do popular multiplayer games support hundreds of thousands of concurrent players with low latency?

Decouple latency-sensitive traffic (player inputs, game state) from everything else (matchmaking, login, leaderboards).
Route players to servers geographically close to them to minimize round-trip time.
Send latency-sensitive traffic directly to the game server, bypassing load balancers.
Keep a pool of warm servers – idle instances ready to take new traffic at a moment's notice.

There are dedicated tools for this – Agones and AWS GameLift are the big ones.
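To show the warm-pool idea in isolation, here is a toy sketch. It does not use the Agones or GameLift APIs; `WarmPool` and the `smelter-N` names are invented, and "booting" a server is just a string being created. The point is that allocation hands out an instance that is already running, and the slow boot happens off the request path.

```python
from collections import deque
from itertools import count
from typing import Deque, List


class WarmPool:
    """Keeps a fixed number of idle servers ready for new compositions."""

    def __init__(self, buffer_size: int) -> None:
        if buffer_size < 1:
            raise ValueError("a warm pool needs at least one idle server")
        self._buffer_size = buffer_size
        self._ids = count(1)
        self._ready: Deque[str] = deque()
        self.allocated: List[str] = []
        self._refill()

    def _boot(self) -> str:
        # Stand-in for actually starting an instance, which is the slow part.
        return f"smelter-{next(self._ids)}"

    def _refill(self) -> None:
        while len(self._ready) < self._buffer_size:
            self._ready.append(self._boot())

    def allocate(self) -> str:
        """Hand out an already-running server, then top the pool back up."""
        server = self._ready.popleft()
        self.allocated.append(server)
        self._refill()
        return server

    @property
    def ready_count(self) -> int:
        return len(self._ready)


pool = WarmPool(buffer_size=2)
first_server = pool.allocate()
second_server = pool.allocate()
```

In a real system the refill would be asynchronous, and the buffer size would be a trade-off between how fast you can absorb a spike and how many idle GPUs you are willing to pay for.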

From games to video compositing

Let's adapt these ideas to our video compositing scenario. In our case, instead of players we have inputs/outputs, and instead of matches we have compositions.

We use anycast to help select a Smelter instance close to the origin of a composition creation request.
Latency-sensitive inputs and outputs (e.g. WHIP clients) connect directly to the Smelter instance's unicast IP.
Standard HTTP requests are routed more traditionally, through load balancers.

[Figure: three network paths from clients to Smelter instances across regions. (1) Control plane: standard HTTP goes through a load balancer that handles auth and routing. (2) Composition creation: an anycast IP picks the nearest region. (3) Media: a WHIP client connects directly to a Smelter instance over unicast, with no load balancer in the path.]
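The three paths boil down to one decision per kind of traffic. Here is a small sketch of that decision; the hostname and IP addresses are placeholders from documentation ranges, and `entry_point` is a hypothetical helper, not something Smelter ships.

```python
from enum import Enum
from typing import Optional


class Traffic(Enum):
    CONTROL = "standard HTTP request"
    CREATE_COMPOSITION = "composition creation"
    MEDIA = "media in/out (e.g. WHIP)"


# Placeholder addresses, not real endpoints.
LOAD_BALANCER_HOST = "api.example.com"
ANYCAST_IP = "192.0.2.1"


def entry_point(traffic: Traffic, instance_ip: Optional[str] = None) -> str:
    """Return where a client should send this kind of traffic."""
    if traffic is Traffic.CONTROL:
        return LOAD_BALANCER_HOST  # auth and routing happen here
    if traffic is Traffic.CREATE_COMPOSITION:
        return ANYCAST_IP          # the network picks the nearest region
    if instance_ip is None:
        raise ValueError("media needs the unicast IP of the instance hosting the composition")
    return instance_ip             # straight to the instance, no load balancer


control_target = entry_point(Traffic.CONTROL)
create_target = entry_point(Traffic.CREATE_COMPOSITION)
media_target = entry_point(Traffic.MEDIA, instance_ip="203.0.113.7")
```

The media path is the only one that needs to know which instance hosts the composition, so the creation response has to carry that address back to the client somehow.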

Conclusion

Hopefully the above was neat and interesting. The catch: a scalable setup has a lot of moving parts, and that complexity brings a lot of extra operational overhead (and we haven't even touched observability).

Once the infrastructure's there, maintenance starts eating away at time previously spent on other things like your business logic. You hire one DevOps engineer. Then a whole team of them. Just to keep things rolling.

The only real way to get rid of operational burden is to hand it to someone else. You don't want to spend your time operating your database; it should just work. Same goes for video compositing – for most teams, it's a means to an end.

That's why we've been exposing Smelter's live compositing API through Fishjam, our managed real-time media platform. The goal is simple: to take the infrastructure burden of real-time video off your plate.

If you're interested in scalable real-time video compositing without scalable headaches, sign up for early access.

If you're more interested in self-hosting, then I can't recommend software-mansion/smelter enough. It makes video compositing feel like web dev – in a good way.