Why SUMO’s Rendered Videos Should Never Be Used as RL Training Data

Introduction

The rendered images or videos generated by SUMO have no place in an RL pipeline. They may look clear, structured, and almost “dataset-like,” yet they fundamentally lack the properties that make a visual observation meaningful to an agent. What they provide is appearance, not information.

This is not an insight that requires failed experiments with extracting learning signals from SUMO’s frames. Rather, it becomes clear the moment one understands how SUMO represents its world internally. The simulator was never built around vision; it was built around structure: precise, explicit, symbolic structure. And because of this, its video output occupies a different conceptual layer entirely.


Visualization Hides Semantics

The rendered frame that SUMO shows us is, in essence, a human-friendly sketch of a world that exists somewhere else. The world SUMO actually simulates consists of continuous lane coordinates, exact positions and velocities, acceleration profiles, signal states, right-of-way relationships, and route intentions that evolve according to a coherent traffic model. This internal world is numerical and symbolic down to its core.

None of this structure survives the trip into the visual renderer. A car in a SUMO frame is just a colored rectangle, detached from the lane graph it belongs to, stripped of its intentions and priorities, reduced to a geometry that no longer reveals why it will or will not proceed through an intersection. A traffic signal becomes a green or red dot with no connection to the control logic that governs it. Even spatial quantities, such as gaps or queue lengths, degrade into visual approximations that depend on camera perspective rather than on the simulator’s own measurements. It is tempting to think that pixels “contain enough information,” but the truth is that the renderer deliberately hides most of what matters.
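
To make the contrast concrete, the quantities the renderer flattens away are each one API call away. The sketch below uses SUMO’s TraCI Python interface; the configuration file name is a hypothetical placeholder, but the calls themselves are standard TraCI.

```python
import traci  # SUMO's standard Python control API

# A minimal sketch, assuming a hypothetical "scenario.sumocfg" configuration.
traci.start(["sumo", "-c", "scenario.sumocfg"])
traci.simulationStep()

for vid in traci.vehicle.getIDList():
    x, y = traci.vehicle.getPosition(vid)   # exact position (m)
    speed = traci.vehicle.getSpeed(vid)      # exact speed (m/s)
    lane = traci.vehicle.getLaneID(vid)      # its place in the lane graph
    route = traci.vehicle.getRoute(vid)      # the edges it intends to traverse
    print(vid, (x, y), speed, lane, route)

for tls in traci.trafficlight.getIDList():
    state = traci.trafficlight.getRedYellowGreenState(tls)  # e.g. "GrGr"
    phase = traci.trafficlight.getPhase(tls)                # index into its program
    print(tls, state, phase)

traci.close()
```

The lane membership, the route, and the signal phase index are exactly the quantities that no rendered frame can recover.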


What RL Actually Needs

RL does not learn from appearance; it learns from state. And a useful state is one that preserves the causal relationships driving the environment. With SUMO, those relationships live entirely in its internal representation: the topology of the road network, the numerical values describing motion, the logic that governs priority and right-of-way, and the exact timing and sequencing of traffic lights. These are not afterthoughts; they are the environment.

A video frame, regardless of how cleanly rendered, cannot express the rules that determine how traffic flows. It does not encode which vehicle is yielding, which is accelerating, which is constrained by downstream occupancy, or which is simply following a route that is invisible to the eye. Observing the render is, at best, witnessing the shadow of a system whose logic has already been stripped away.

Training an RL agent on such shadows is not simply inefficient; it is conceptually misaligned. The agent is forced to reconstruct the environment’s logic from incomplete projections of it, even though SUMO already provides the fully formed state in precise numerical terms. The agent ends up solving a perception problem that exists only because we discarded the very information the simulator was designed to offer.
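
By contrast, an observation built directly from the simulator preserves this structure. Here is a minimal sketch of such an observation function for a single signalized intersection; the traffic-light and lane IDs are hypothetical placeholders, and the feature choice (queue lengths, mean speeds, current phase) is just one common option, not the definitive design.

```python
import numpy as np
import traci

# Hypothetical IDs for illustration; a real network defines its own.
TLS_ID = "center"
INCOMING_LANES = ["north_in_0", "south_in_0", "east_in_0", "west_in_0"]

def observe() -> np.ndarray:
    """Build an RL observation from SUMO's internal state rather than pixels.

    Every entry is the simulator's own measurement: per-lane queue lengths,
    per-lane mean speeds, and the current signal phase index.
    """
    queues = [traci.lane.getLastStepHaltingNumber(l) for l in INCOMING_LANES]
    speeds = [traci.lane.getLastStepMeanSpeed(l) for l in INCOMING_LANES]
    phase = traci.trafficlight.getPhase(TLS_ID)
    return np.array(queues + speeds + [phase], dtype=np.float32)
```

No convolutional stack, no estimation error: each feature is read off the environment’s own ground truth.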


Reconstructing What the Simulator Already Knows

If one insists on using SUMO’s images as input, the learning problem becomes needlessly inverted. To make sense of a frame, an agent must infer positions, velocities, lanes, intentions, and interactions, all of which SUMO already calculates. This re-derivation is not only redundant; it cannot be performed without error, because the necessary information never existed in the pixels in the first place. The renderer was never intended to expose it.

Vision makes sense when there is no alternative, such as in real-world driving where sensors are noisy and perception is inherently uncertain. But SUMO is not an uncertain environment. It is fully observable. Every variable that matters is crisp, deterministic, and accessible. To replace this with pixels is to voluntarily abandon the clarity that simulation exists to provide.
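
That determinism can be checked directly. A sketch, again assuming a hypothetical “scenario.sumocfg” and using SUMO’s --seed option: with a fixed seed, two runs should trace out identical state trajectories, leaving nothing noisy for a perception stack to estimate.

```python
import traci

def rollout(seed: int, steps: int = 100):
    """Run the simulation with a fixed seed and record exact vehicle states."""
    traci.start(["sumo", "-c", "scenario.sumocfg", "--seed", str(seed)])
    trace = []
    for _ in range(steps):
        traci.simulationStep()
        trace.append(sorted(
            (vid, traci.vehicle.getPosition(vid), traci.vehicle.getSpeed(vid))
            for vid in traci.vehicle.getIDList()
        ))
    traci.close()
    return trace

# With the same binary and config, identical seeds should reproduce the
# trajectory exactly.
assert rollout(42) == rollout(42)
```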

It is important to remember why SUMO has visualization at all. The renderer exists so that humans can watch the simulation unfold. It is a debugging tool, a qualitative sanity check, a way to illustrate traffic scenarios for presentations and reports. The rendering is not a sensory modality of the environment, and it was never meant to be. Its purpose is interpretive, not generative.

If an agent were to rely on SUMO’s visual output, it would be relying on something designed explicitly for human interpretation. And human interpretation thrives on abstractions, simplifications, and aesthetic conventions, all of which run counter to what machine learning expects from observational data.


Comment

The visual output of SUMO does not fail because it is low quality. It fails because it is the wrong abstraction. It belongs to a layer of the system intended for people, not agents. The real substance of SUMO, its logic, its topology, its numerics, and its decisions, is found in its internal state, not in its renderings.

This is why SUMO’s images should never be used as RL input. Not because they are flawed, but because they are unrelated to the simulation’s meaning. To rely on them is to ignore the very nature of the environment we are trying to learn from.



