Intro
When infrastructure engineers at robotics teams first learn about Server Configuration Management (CM)/Automation tools like Ansible, Salt, Chef, and Puppet, they jump for joy.
They think they’ve found their panacea: “Robots are just edge servers, right? Let’s use CM tools for everything: provisioning, deployment, configs, and inventory management. Done.” Time to go home, kick up their feet, and let a CM tool handle their work for them!
That optimism isn’t entirely misplaced. These tools are solid. Robotics teams commonly use them in different parts of their CI/CD pipelines.
However, things start to break down when using CM for robotics application configuration; this is where the mental model of “robots as servers” falls apart.
Robots need dynamic, device-specific, and runtime-dependent configurations, which differ from the static config templates CM tools were built for.
In this blog, we’ll discuss why CM tools fall short for robotics configs—and what we need instead.
Incorrect Networking Assumptions
CM tools were built to configure servers sitting comfortably in data centers, like the AWS us-east-1 region in Virginia. Every square inch of these facilities is supported by world-class networking infrastructure and teams of experts on standby to fix whatever goes wrong.
At the time of writing, the last major outage in us-east-1 was in June 2023. It lasted two hours.
In robotics, a mere two-hour network drop is a walk in the park. Depending on your use case, it’s common for your robot to drop off the network for hours, even days.
CM tools, built with data center networking assumptions, aren’t designed for this level of flakiness.
Push-Based vs. Pull-Based Model
Some CM tools, like Ansible and Salt in its default mode, rely on a push-based model: the server (cloud) pushes configuration updates to the client (robot).
In robotics, this is risky. If the network’s shaky, you might push a new configuration while part of your fleet is offline. Now you’ve got different robots running different versions of your software. Worse, if a robot goes offline mid-update, it might end up in a broken state that needs manual recovery.
A pull-based model is a better fit: the client polls the server and pulls down new configurations whenever an update is ready. Puppet and Chef work this way.
But whether they push or pull, these tools still assume a reliable network — and that’s where things fail.
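To make the pull model concrete, here is a minimal sketch in Python of a polling client from the robot's point of view. The endpoint, the `apply_config` stub, and the version field are hypothetical, not part of any CM tool's API; the point is only that when the robot is offline, a failed pull is a non-event it can retry later, rather than a failed push the server has to reconcile.

```python
import time

import requests  # assumed HTTP client; the endpoint below is hypothetical

CONFIG_URL = "https://config.example.com/robots/{robot_id}/desired"

def apply_config(desired: dict) -> None:
    """Placeholder for whatever applies the config on the robot."""
    print(f"applying config version {desired['version']}")

def pull_loop(robot_id: str, interval_s: float = 300.0) -> None:
    """Pull-based client: the robot decides when to ask for new config.

    If the network is down, the request fails and the loop simply tries
    again on the next tick; nothing is left half-pushed on the server side.
    """
    current_version = None
    while True:
        try:
            resp = requests.get(CONFIG_URL.format(robot_id=robot_id), timeout=10)
            resp.raise_for_status()
            desired = resp.json()
            if desired["version"] != current_version:
                apply_config(desired)
                current_version = desired["version"]
        except requests.RequestException:
            pass  # offline: retry on the next tick instead of failing a remote push
        time.sleep(interval_s)
```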
Queuing
Let’s zoom in on a pull-based system. It defaults to pulling configurations from the server at some regular interval (Puppet’s is every 30 minutes). However, in the event of networking or power drops, we’d prefer that the client poll the server for a new update immediately upon restarting or re-establishing the network connection.
But we don’t usually want the robot to apply the new configuration immediately once it has been downloaded. Robots are stateful systems. They might be mid-mission, interacting with a human, or doing something sensitive in the physical world. Applying a configuration change at the wrong moment can have dangerous consequences.
Servers don’t face this problem as sharply. They run in predictable environments, with workloads that can be drained or paused with less disruption.
So, for robots, instead of applying updates immediately, we want a local queue. The robot should pull down the config and store it locally. Then, when it’s safe, the robot can apply it.
CM tools don’t give you this out of the box. No retry logic. No local job queue. If you care about when and how updates get applied (you should!), you’re left writing that logic yourself.
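Since CM tools don't provide this, here is a rough sketch, again in Python, of what the missing piece looks like: downloaded configs land in a local on-disk queue and are only applied when the robot reports it is safe. The queue path, file naming, and the `robot_is_safe_to_update` check are all hypothetical placeholders for whatever your stack uses (mission state, e-stop status, human proximity).

```python
import json
import time
from pathlib import Path

QUEUE_DIR = Path("/var/lib/robot/config-queue")  # hypothetical on-disk queue location

def enqueue_config(payload: dict) -> None:
    """Store a downloaded config locally instead of applying it right away."""
    QUEUE_DIR.mkdir(parents=True, exist_ok=True)
    # Zero-pad the version so lexicographic sort matches version order.
    path = QUEUE_DIR / f"{int(payload['version']):08d}.json"
    path.write_text(json.dumps(payload))

def robot_is_safe_to_update() -> bool:
    """Placeholder: check mission state, e-stop, human proximity, etc."""
    return True

def apply_when_safe(apply_fn) -> None:
    """Drain the queue in version order, but only when the robot says it's safe."""
    while True:
        pending = sorted(QUEUE_DIR.glob("*.json"))
        if pending and robot_is_safe_to_update():
            payload = json.loads(pending[0].read_text())
            apply_fn(payload)
            pending[0].unlink()  # remove from the queue only after a successful apply
        time.sleep(5)
```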
Safe Updates and On-Device Versioning
Atomic Updates
It’s best practice to have atomic updates and rollback functionality. We don’t want to incrementally update configs and be left in a bad state if the update completes partially. We also want to easily roll back to a previous configuration version if our application isn’t behaving as expected.
An atomic update is all or nothing. If any part of the update fails, the whole system is rolled back, leaving it in its original, working state.
Salt, Puppet, and Chef apply updates sequentially, like scripts, and if a step fails, they move on. That approach makes sense in a server farm, where you treat machines like cattle. Robots are more like pets, so we need to treat configuration updates with care; we can't simply move on if an update only partially completes.
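One common way to get atomicity for file-based configs is the "stage, then swap a symlink" pattern. The sketch below assumes a hypothetical /etc/robot layout where the application always reads through an `active` symlink; the new config tree is written off to the side and only becomes visible in a single atomic rename, so the application never sees a half-written mix, and rollback is just re-pointing the link.

```python
import os
from pathlib import Path

CONFIG_ROOT = Path("/etc/robot")       # hypothetical layout
ACTIVE_LINK = CONFIG_ROOT / "active"   # symlink -> the config directory in use

def apply_atomically(version: str, files: dict[str, str]) -> None:
    """Write the new config to its own directory, then flip a symlink.

    The running application only ever sees the old tree or the new tree,
    never a partially written one.
    """
    staging = CONFIG_ROOT / version
    staging.mkdir(parents=True, exist_ok=True)
    for name, contents in files.items():
        (staging / name).write_text(contents)

    tmp_link = CONFIG_ROOT / "active.tmp"
    if tmp_link.exists() or tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(staging)
    os.replace(tmp_link, ACTIVE_LINK)  # atomic rename on POSIX filesystems

def rollback(previous_version: str) -> None:
    """Point the active symlink back at an older, untouched config tree."""
    apply_atomically(previous_version, {})
```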
Device Versioning
CM tools rely on a central manifest that defines the desired state for each device. Because they are centralized command-and-control systems, they don't track much about the device's current state, which means versioned, per-device history is not a first-class citizen.
So if a robot misses config v3 and the latest version is v4, it will naively try to jump ahead—but that’s not always safe. You’d prefer to give the device the data for both v3 and v4 to upgrade sequentially, ensuring a smooth update.
This lack of versioned per-device history also makes rollbacks difficult. Rollbacks require versioning and snapshots. We want to know the device’s version history, its last good state, and establish a safe way to reapply it.
With traditional CM, you have to create version tracking, define migration paths, and handle rollbacks yourself. That’s critical path infrastructure we’d prefer not to build ourselves!
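For a sense of how little is actually needed conceptually, here is a minimal sketch of the per-device bookkeeping in question. The `DeviceHistory` class and its fields are hypothetical, but they capture the idea: knowing the robot is still on v2 lets the server plan a sequential upgrade through v3 and v4 instead of jumping straight to the latest version.

```python
from dataclasses import dataclass, field

@dataclass
class DeviceHistory:
    """Per-device record of which config versions have been applied."""
    device_id: str
    applied: list[int] = field(default_factory=list)

    @property
    def current(self) -> int:
        return self.applied[-1] if self.applied else 0

    def plan_upgrade(self, latest: int) -> list[int]:
        """Return every intermediate version still needed, in order."""
        return list(range(self.current + 1, latest + 1))

# Example: a robot that missed v3 while offline
history = DeviceHistory(device_id="robot-017", applied=[1, 2])
print(history.plan_upgrade(latest=4))  # -> [3, 4]: step through versions, don't jump
```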
Drift Detection
Unlike servers, robots have configs that are edited locally. Whether an operator or end user modifies a setting in the business logic layer (like a CAD input in construction) or a robot recalibrates itself, these changes are not in lockstep with the cloud.
When local configs and cloud versions are out of sync, we call this configuration drift. Drift itself isn't the problem; what matters is being able to detect and resolve it when it happens.
Teams have varying policies for resolving drift. Some want the robot to be the source of truth. Others want the cloud to win. Some even prefer one-off, snowflake configs. But all of them need visibility. CM tools don’t give you that. They only track the cloud’s desired state. They have no idea what the device is running right now.
That means you can’t handle drift—you don’t even know it exists. And if you don’t know what’s running on the device, you can’t be confident that your next update will work.
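The detection half of this is simple once both sides are visible. Here is a small sketch that fingerprints the local and desired configs and reports which keys differ; the function names are hypothetical, and the resolution policy (robot wins, cloud wins, keep the snowflake) is deliberately left to the team.

```python
import hashlib
import json
from typing import Optional

def config_fingerprint(config: dict) -> str:
    """Stable hash of a config, so local and cloud copies can be compared."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_drift(local: dict, desired: dict) -> Optional[dict]:
    """Report drift instead of silently overwriting either side."""
    if config_fingerprint(local) == config_fingerprint(desired):
        return None
    return {
        "drifted_keys": [
            k for k in set(local) | set(desired) if local.get(k) != desired.get(k)
        ],
        # Resolution (robot wins, cloud wins, keep as a snowflake) is a policy
        # decision; the point is that the drift is now visible at all.
    }
```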
Analytics
Weak device versioning also negatively affects our ability to run analytics.
A robot’s application configs are its analytics primitives. When debugging a field issue, measuring uptime, or running an A/B test, you correlate performance with the config in place at the time. That requires device-level versioning.
Again, CM tools miss this. Without it, we can't trace metrics back to the exact config that was running. We’re forced to hack around this with brittle logging or tagging systems that don’t scale. The result: we lose precision in our data, and any insight we draw about fleet performance is less trustworthy.
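The fix is to tag telemetry with the active config version at the source. A minimal, hypothetical sketch: every metric the robot emits carries the fingerprint of the config that was live when the measurement was taken, so "uptime grouped by config version" becomes a simple query instead of a log-archaeology exercise.

```python
import time

def emit_metric(name: str, value: float, config_version: str) -> dict:
    """Attach the active config version to every metric at the source."""
    return {
        "ts": time.time(),
        "metric": name,
        "value": value,
        "config_version": config_version,  # e.g. the fingerprint of the applied config
    }

# Example: correlate a field failure with the config that was live at the time
print(emit_metric("nav.replan_count", 3, config_version="v4-8f2a91"))
```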
Schema Management
CM tools are process runners. They don’t know or care what’s in the config—they’ll happily deploy garbage as long as the YAML parses. But our deployment process can’t be that naive. A valid file != valid config. We must ensure that our configuration schemas are correct and safe to roll out, without piling more friction onto the software engineer deploying them.
CM tools don’t natively integrate with configuration schemas like JSON Schema, Protobufs, or CUE that enforce types, valid values, or structural constraints. That means you can accidentally set max_torque: 1389, launching a robot arm through a wall!
CM tools don’t have a built-in way to check if a new config is compatible with the previous config and the application code. If there’s an incompatibility, errors get caught at runtime instead of deploy time, where the stakes are much higher.
This puts the onus on software engineers to do the checking themselves. But engineers don’t want to add config validation infrastructure to their laundry list of tasks, and they don’t want to cross their fingers every time they deploy.
They want something that just works, and control over the schema to make sure it does.
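As an illustration of what deploy-time checking looks like, here is a sketch using the jsonschema Python package against a hypothetical arm-controller schema. The schema, field names, and limits are made up for the example; the point is that the out-of-range max_torque is rejected before it ever reaches a robot.

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for an arm controller config
ARM_SCHEMA = {
    "type": "object",
    "properties": {
        "max_torque": {"type": "number", "minimum": 0, "maximum": 150},
        "timeout_s": {"type": "number", "minimum": 1},
    },
    "required": ["max_torque", "timeout_s"],
    "additionalProperties": False,
}

def check_before_deploy(config: dict) -> None:
    """Reject out-of-range or malformed values at deploy time, not on the robot."""
    try:
        validate(instance=config, schema=ARM_SCHEMA)
    except ValidationError as err:
        raise SystemExit(f"refusing to deploy: {err.message}")

# Exits with a message that max_torque exceeds the schema's maximum
check_before_deploy({"max_torque": 1389, "timeout_s": 25})
```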
Overrides
Overrides should be simple. You should be able to specify: “These five robots at Customer A get a camera config X” or “Any robot running version 2.1 should increase its timeout to 25s.” You shouldn’t need to write a new file for every robot or fork your entire structure to change one value.
But that’s exactly what traditional tools make you do.
Salt lets you override through Pillars, but you end up writing conditionals in YAML or Jinja and hardcoding robot IDs into top.sls files. There's no layering, no merging — just string-matching and crossed fingers.
Puppet uses Hiera, which sounds fancy until you realize it means a new file per override. You want to tweak something for ten robots? You’re now ten files deep in a folder maze. There's no way to say “robots with camera=flir” or “battery < 50%.”
Chef does support layered overrides (default, normal, override, and force_override), which sounds like a good idea. But in practice it’s hard to trace where a value came from. Overrides can come from cookbooks, roles, environments, or node-specific JSON. There’s no unified view or merge semantics you can reason about. So when something goes wrong, debugging turns into an expedition through scattered files and Chef's run-time state.
Ansible is even simpler — and not in a good way. Host vars and group vars overwrite each other. There’s no merging, no targeting, and no live visibility.
They all assume that configuration is something you set once, like a shipping label. But robotics doesn’t work like that. Robots need dynamic, layered, and functional overrides.
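To show what "layered and targeted" can mean concretely, here is a small sketch of attribute-based overrides. The selectors, attribute names, and merge order are hypothetical, but they express the two examples above directly: a layer for Customer A's robots and a layer for anything on version 2.1, with a single, inspectable merge order instead of values scattered across files.

```python
def matches(robot: dict, selector: dict) -> bool:
    """A selector targets robots by attributes, e.g. {"customer": "A"} or {"version": "2.1"}."""
    return all(robot.get(k) == v for k, v in selector.items())

def resolve_config(base: dict, layers: list, robot: dict) -> dict:
    """Merge the base config with every override layer whose selector matches this robot.

    Later layers win on conflicting keys, which gives one predictable merge
    order you can actually reason about and debug.
    """
    result = dict(base)
    for selector, override in layers:
        if matches(robot, selector):
            result.update(override)
    return result

base = {"camera": "default", "timeout_s": 10}
layers = [
    ({"customer": "A"}, {"camera": "config-X"}),   # "these five robots at Customer A"
    ({"version": "2.1"}, {"timeout_s": 25}),       # "any robot on 2.1 gets a 25s timeout"
]
robot = {"id": "robot-042", "customer": "A", "version": "2.1"}
print(resolve_config(base, layers, robot))
# -> {'camera': 'config-X', 'timeout_s': 25}
```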
Conclusion
CM tools are still useful in robotics. They’re powerful for orchestrating software updates, doing OS-level configurations, and installing dependencies.
But they weren’t built to safely manage configs in robotics for fleets of stateful, network-constrained robots. They lack the needed robustness, composability, and targeting.
Robotics teams need something better.
That’s why we’re building Miru: configuration management purpose-built for robotics. We help robotics teams manage, version, and deploy their application configs. If this has been challenging for you, reach out! We’d love to help.