Periodic network interruption communicating with multiple Azure regions

lowDatadogAPMAug 28, 2025 19:30Duration: 47h 29m
network
Network / Routing

Summary

Our monitoring has shown Azure’s fix to be stable since our last update. This incident has been resolved.

Impact

minor

Timeline

Aug 28, 2025 19:30

[investigating] Degraded network capacity in an Azure datacenter is causing network packet loss and increased latency when communicating with AWS in eastern US regions. Customers may experience communications failures trying to submit data from agents running in AWS and may experience delayed data from AWS integrations. Microsoft has identified the root cause and is working on mitigations.

via statuspage
+1h 7m
Aug 28, 2025 20:37

[monitoring] The root cause of the issue has been identified and Microsoft has implemented mitigations, we are monitoring network traffic to confirm.

via statuspage
+24h 11m
Aug 29, 2025 20:47

[monitoring] Azure has temporarily mitigated the network capacity issues which have caused episodic packet loss for customers who are hosted in Azure data centers and are using Datadog’s US1 region (accessible via https://app.datadoghq.com). Azure engineers are continuing to work to fully resolve this issue. Until they fully resolve the issue, customers with Datadog agents running in Azure data centers may see brief periods of delayed ingestion of data from agents and from Azure integrations. We don’t expect a noticeable impact thanks to agent buffering but cannot exclude the possibility of spurious alerts due to temporarily delayed data. We are continuing to monitor the situation in conjunction with Azure, and will do so throughout the weekend. We will post status page updates as soon as the situation improves, and at least every 24 hours. We thank you for your patience throughout this incident.

via statuspage
+6h 24m
Aug 30, 2025 03:12

[monitoring] Azure has implemented a permanent fix to the network issue. Both Azure and Datadog engineers are continuing to monitor overnight and will provide an update tomorrow.

via statuspage
+15h 47m
Aug 30, 2025 18:58

[resolved] Our monitoring has shown Azure’s fix to be stable since our last update. This incident has been resolved.

via statuspage

Lessons Learned

Datadog has experienced 33 incidents in the past year. This frequency suggests systemic reliability challenges that may warrant additional monitoring.

📊Incidents related to network have occurred 48 times across all providers in the past year. This is one of the most common failure categories in cloud infrastructure.

💡This incident is categorized as: Network / Routing. Consider implementing preventive measures specific to this failure category.