From July 8, 2025 8:50 to July 8, 2025 15:44, customers observed stale authorization query results because a bad deployment caused some nodes to stop processing new fact and policy message updates.
The Elastic Container Service (ECS) scheduler terminated the writer task on affected EC2 instances while those instances still had active reader tasks. As a result, the readers stopped observing new messages and returned query results based on stale data. Queries that depended only on facts received before the writer task was terminated remained accurate.
<aside> ℹ️
Oso deploys its edge nodes as ECS tasks on EC2 instances. Each EC2 instance has a dedicated writer task responsible for processing messages from Kafka and updating the environment databases. The reader tasks on the EC2 instance share access to the environment databases. The writer tasks are scheduled using the DAEMON scheduling strategy to ensure that there is exactly one writer per EC2 host.
In normal operations, there are two pools of EC2 instances, with one of the pools being empty. On upgrades, Oso launches EC2 instances in the empty pool and shifts traffic over to the newly populated EC2 instance pool before draining the pool running the old version.
Additional information about the layout of our infrastructure is available on our website.
</aside>
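For readers unfamiliar with the DAEMON strategy, here is a minimal sketch of how a writer service like the one described in the note above could be registered with boto3. The cluster, service, and task definition names are hypothetical and only illustrate the one-writer-per-host layout.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# A DAEMON service places exactly one copy of the task on every EC2
# container instance in the cluster, so each host runs one writer task.
# Names below are illustrative, not Oso's actual identifiers.
ecs.create_service(
    cluster="edge-pool-a",
    serviceName="writer",
    taskDefinition="writer:42",
    schedulingStrategy="DAEMON",
    launchType="EC2",
)
```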
On July 7, 2025 15:22, Oso applied an infrastructure change to increase compute capacity in us-east-1 by increasing the number of EC2 instances. This change also updated the task definition of the writer service because a new Oso image was available.
When Oso applied the change, the scale-up happened before the service update, causing a few new EC2 instances to start with the old writer version. The task definition for that pool’s writer service was then updated. Normally, this update would target a service with zero tasks and complete instantly, but the sequencing of these operations resulted in the state shown in the diagram below.
State of the EC2 instance pools after the infrastructure update.
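As a rough sketch of that ordering (assuming the pool is an EC2 Auto Scaling group; the resource names and counts below are hypothetical), the change effectively ran the scale-up before the task definition update, so instances launched in between started writer tasks from the previous task definition:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ecs = boto3.client("ecs", region_name="us-east-1")

# Step 1: scale up the instance pool. New EC2 instances join the ECS
# cluster and the DAEMON scheduler starts a writer task on each one,
# still from the OLD task definition.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="edge-pool-b",
    DesiredCapacity=8,
)

# Step 2: point the writer service at the new image. Instead of updating
# a zero-task service instantly, this kicks off a rolling deployment
# against a service that now has running tasks.
ecs.update_service(
    cluster="edge-pool-b",
    service="writer",
    taskDefinition="writer:43",
)
```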
By default, ECS updates this type of service by first stopping all tasks and then creating new ones. Because that would cause write downtime, the writer service is configured to never go below 99% capacity during deployments. This floor acts as a safety mechanism that effectively “disables” ECS deployments for writer service variants with more than zero tasks; it was never meant to actually be exercised. However, due to the race described above, the ECS scheduler now had a stuck, still-pending deployment against the active writer service.
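As an illustration of that floor (again with hypothetical names; this is not necessarily how Oso manages the setting), the minimum healthy percent lives in the ECS service’s deployment configuration:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Require 99% of writer tasks to stay running during a deployment.
# With only a handful of tasks, ECS cannot stop any task without
# dropping below the floor, so the rolling deployment stalls -- the
# intended "safety brake". Once a single task is worth ~1% or less of
# the pool, the scheduler is free to start replacing tasks.
ecs.update_service(
    cluster="edge-pool-b",
    service="writer",
    deploymentConfiguration={
        "minimumHealthyPercent": 99,
        "maximumPercent": 100,
    },
)
```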
Oso performs a refresh of all the instances by doubling the size of the pool and then reducing it back to the desired size. The refresh process is how Oso updates indexes and ensures the databases are optimally configured. In practice, there are many more instances than depicted in the diagram, so when Oso doubled the number of instances, one task represented ~1% of the pool, which allowed the ECS scheduler to start terminating tasks that it previously could not. Some of the terminated writer tasks were responsible for updating environment databases used by active reader tasks; when a writer task was terminated, the dependent reader tasks returned stale results until the EC2 instance itself was terminated.
EC2 instances with reader tasks that returned stale results.
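A rough sketch of that refresh pattern, assuming the pool is an EC2 Auto Scaling group and using hypothetical names and counts:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

POOL = "edge-pool-b"  # hypothetical pool name
DESIRED = 50          # illustrative steady-state instance count

# Double the pool, wait for the new instances to come up with freshly
# built environment databases, then shrink back to the desired size.
# At double size, one writer task is ~1% of the pool, so the stuck
# deployment's 99% floor no longer blocks task terminations.
autoscaling.set_desired_capacity(AutoScalingGroupName=POOL, DesiredCapacity=DESIRED * 2)
# ... wait for the new instances to become healthy ...
autoscaling.set_desired_capacity(AutoScalingGroupName=POOL, DesiredCapacity=DESIRED)
```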
On July 8, 2025 13:08, the Oso team was notified of an increase in stale results. At 13:21, the on-call engineer terminated the EC2 instance represented by the dark blue line above. After terminating the offending EC2 instance, the on-call engineer observed that additional instances started to exhibit the same behavior as their writer tasks were terminated. The on-call engineer continued terminating individual EC2 instances as they appeared on the dashboard and preemptively performed a full instance pool refresh in us-east-1. The issue was fully resolved when the full refresh completed on July 8, 2025 at 15:44.
July 7, 2025 15:22: Infrastructure change applied to increase the number of EC2 instances and update the writer service task definition.
July 7, 2025 20:16: Instance pool refresh started, doubling the number of instances in the pool.
July 7, 2025 21:32: Instance pool doubling completed; started reducing the number of instances in the pool.
July 8, 2025 8:50: First case of a writer task being terminated without a corresponding termination of its EC2 instance, causing queries to the readers on that instance to return stale results.