Buildkite Status

Buildkite is currently operational with all systems functioning normally.

Last checked Jun 13, 2026 1:44 AM UTC from Buildkite's official status page

Incident History

Showing incidents from the last 15 days

Report: "Increased latency on REST and GraphQL APIs"

Last update
resolved

The mitigation applied before the last update had the intended effect, and we have seen recovery in REST API latency.

monitoring

We've isolated the issue to elevated load on our REST API service and have mitigated it. The Agent and Stacks APIs aren’t affected.

monitoring

We've isolated the issue to elevated load on our REST API service and are working to mitigate it. The Agent and Stacks APIs aren’t affected.

investigating

We're observing increased latency for all our customers. We're currently investigating and will provide status updates as they become available.

Report: "Increased latency and error rates for Agent API"

Last update
resolved

Between 00:05 and 00:34 UTC, a subset of customers experienced increased latency and timeout errors on the Agent API. This impacted job assignment. At peak impact, we saw an error rate of 1.3% of requests and job acceptance latency of up to 53s.

investigating

We're observing increased latency and error rates for a subset of our customers on the Agent API. We're currently investigating and will provide status updates as they become available.

Report: "Delayed notifications"

Last update
postmortem

## Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

## Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. This happened because the Prometheus service used by our EKS autoscaling path ran out of storage, which meant some EKS-based workers could not autoscale correctly while queues were growing.

We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the rollback path did not fully handle this scenario.

### Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS.

This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

### Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers. After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic.

When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers. We resolved it by completing the rollback and scaling the remaining affected ECS services.

### Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

## Changes we're making

We have made the following immediate changes:

* Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
* Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
* Moved affected notification-processing workloads back to known-good ECS capacity.
* Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.

We are also making the following reliability improvements:

* Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
* Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
* Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
* Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
* Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.

## Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, notification delivery latency can affect customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

* Updating the status page earlier when notification latency is likely to affect customer workflows
* Making status page updates clearer about the customer-visible impact, not just the affected internal service
* Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
* Using customer-level notification latency monitoring to help identify affected customers sooner
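For readers who want a concrete picture of the ordering problem in the second impact window (source workloads scaled down before the destination was fully scaled up), the following is a minimal illustrative sketch in Python. It is not Buildkite's tooling: the `WorkerPool` type, the `safe_to_scale_down` function, and the worker counts are all hypothetical, and a real check would read measured capacity and traffic rather than hand-written values.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Hypothetical snapshot of one worker pool (not a real Buildkite or AWS type)."""
    name: str
    running_workers: int
    serving_traffic: bool  # should come from measurement, not assumption


def safe_to_scale_down(source: WorkerPool, destination: WorkerPool,
                       required_workers: int) -> bool:
    """Allow scaling the source down only if nothing would be left underprovisioned."""
    if not source.serving_traffic:
        # Nothing depends on the source pool, so removing it cannot reduce capacity.
        return True
    # The source is still doing work: the destination must already be taking
    # traffic and must have enough workers to carry the full load by itself.
    return (destination.serving_traffic
            and destination.running_workers >= required_workers)


# Hypothetical numbers: the source pool is still serving, the destination is not ready.
eks = WorkerPool("eks-notifications", running_workers=40, serving_traffic=True)
ecs = WorkerPool("ecs-notifications", running_workers=10, serving_traffic=True)
print(safe_to_scale_down(eks, ecs, required_workers=40))
```

In this sketch the decision depends on whether the source pool is observed to be serving traffic, which is the assumption the postmortem says was wrong during the second impact window.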

resolved

This incident has been resolved.

identified

We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.

investigating

We are investigating delays to build and job notifications for a subset of customers.

Report: "Email deliveries are delayed"

Last update
resolved

We received reports that email deliveries were not working, affecting signup and invite emails as well as build notification emails. This issue has now been resolved.