Service Degradation For Some Services

Incident Report for Deque Services Health and Status

Postmortem

On June 13, 2023, Deque’s cloud hosting provider (AWS) encountered a problem with its Lambda service, which AWS described as increased error rates and latencies. The problem impacted many other AWS services that depend on Lambda. As a result, two of Deque’s services, Axe Mobile and Axe Developer Hub, experienced intermittent periods of latency and unavailability. During this time, customers may have seen error messages while interacting with these services. No intervention was required from Deque; performance and availability began to improve as AWS resolved the problem.

The following is the incident timeline and description, taken from the AWS service status page at https://health.aws.amazon.com/

Event data

Service: Multiple services
Start time: June 13, 2023 at 3:08:00 PM UTC-4
Severity: Resolved
Status: Closed
End time: June 13, 2023 at 6:42:39 PM UTC-4
Region / Availability Zone: us-east-1

Description

[RESOLVED] Increased Error Rates and Latencies [03:42 PM PDT]

Between 11:49 AM PDT and 3:37 PM PDT, we experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use of other AWS services. Additionally, customers may have experienced authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also have experienced issues when attempting to initiate a Call or Chat to AWS Support. As of 2:47 PM PDT, the issue initiating calls and chats to AWS Support was resolved. By 1:41 PM PDT, the underlying issue with the subsystem responsible for AWS Lambda was resolved. At that time, we began processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services. As of 3:37 PM PDT, the backlog was fully processed. The issue has been resolved and all AWS Services are operating normally.

[02:49 PM PDT] We are working to accelerate the rate at which Lambda asynchronous invocations are processed, and now estimate that the queue will be fully processed over the next hour. We expect that all queued invocations will be executed.

[02:29 PM PDT] Lambda synchronous invocation APIs have recovered. We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). Lambda is working to process these messages during the next few hours and during this time, we expect to see continued delays in the execution of asynchronous invocations.

[02:00 PM PDT] Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services.

[01:48 PM PDT] Beginning at 11:49 AM PDT, customers began experiencing errors and latencies with multiple AWS services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use by other AWS services. We have associated other services that are impacted by this issue to this post on the Health Dashboard. Additionally, customers may experience authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also experience intermittent issues when attempting to call or initiate a chat to AWS Support. We are now observing sustained recovery of the Lambda invoke error rates, and recovery of other affected AWS services. We are continuing to monitor closely as we work towards full recovery across all services.

[01:38 PM PDT] We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery.

[01:14 PM PDT] We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors.

[12:36 PM PDT] We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.

[12:26 PM PDT] We have identified the root cause of the elevated errors invoking AWS Lambda functions, and are actively working to resolve this issue.

[12:19 PM PDT] AWS Lambda function invocation is experiencing elevated error rates. We are working to identify the root cause of this issue.

[12:08 PM PDT] We are investigating increased error rates and latencies in the US-EAST-1 Region.
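The event data gives its start and end times in UTC-4, the AWS updates are stamped in PDT, and Deque's own posts use EDT. The short Python sketch below is one way to line the three up. It assumes fixed offsets (UTC-4 for both "UTC-4" and EDT, UTC-7 for PDT), which hold for June 2023, and the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed valid for June 2023 (daylight time in effect).
UTC_MINUS_4 = timezone(timedelta(hours=-4), "UTC-4")  # same offset as EDT
PDT = timezone(timedelta(hours=-7), "PDT")

# Start and end times as listed in the AWS event data.
event_start = datetime(2023, 6, 13, 15, 8, 0, tzinfo=UTC_MINUS_4)
event_end = datetime(2023, 6, 13, 18, 42, 39, tzinfo=UTC_MINUS_4)

# Impact window quoted in the [RESOLVED] summary (11:49 AM to 3:37 PM PDT).
impact_start = datetime(2023, 6, 13, 11, 49, tzinfo=PDT)
impact_end = datetime(2023, 6, 13, 15, 37, tzinfo=PDT)


def as_pdt(ts: datetime) -> str:
    """Format a timezone-aware timestamp the way the AWS updates are stamped."""
    return ts.astimezone(PDT).strftime("%I:%M %p %Z")


def as_utc_minus_4(ts: datetime) -> str:
    """Format a timezone-aware timestamp in the event-data (UTC-4) offset."""
    return ts.astimezone(UTC_MINUS_4).strftime("%I:%M %p %Z")


if __name__ == "__main__":
    print("Event opened:  ", as_pdt(event_start))          # 12:08 PM PDT
    print("Event closed:  ", as_pdt(event_end))            # 03:42 PM PDT
    print("Event duration:", event_end - event_start)      # 3:34:39
    print("Impact began:  ", as_utc_minus_4(impact_start)) # 02:49 PM UTC-4
    print("Impact ended:  ", as_utc_minus_4(impact_end))   # 06:37 PM UTC-4
    print("Impact length: ", impact_end - impact_start)    # 3:48:00
```

Under these assumptions, the event's start and end times correspond to the first ([12:08 PM PDT]) and last ([03:42 PM PDT]) AWS updates, while the impact window AWS reports begins about 19 minutes before the first update was posted.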

Affected AWS services

The following AWS services have been affected by this issue.

Resolved (104 services)

AWS Account Management

AWS Amplify

AWS Amplify Admin

AWS AppSync

AWS Batch

AWS Certificate Manager

AWS Cloud9

AWS CloudFormation

AWS CodeCommit

AWS CodePipeline

AWS CodeStar

AWS Config

AWS Control Tower

AWS Data Exchange

AWS DataSync

AWS Directory Service

AWS Elemental

AWS Fargate

AWS Fault Injection Simulator

AWS Global Accelerator

AWS Glue

AWS Ground Station

AWS Identity and Access Management

AWS IoT Device Management

AWS IoT FleetWise

AWS IoT Greengrass

AWS IoT SiteWise

AWS Lake Formation

AWS Lambda

AWS License Manager

AWS Management Console

AWS Marketplace

AWS Migration Hub Strategy Recommendations

AWS Organizations

AWS Outposts

AWS Private Certificate Authority

AWS QuickSight

AWS Resource Explorer

AWS Resource Groups

AWS Secrets Manager

AWS Service Catalog

AWS Single Sign-On

AWS Support Center

AWS Transfer Family

AWS VPCE PrivateLink

AWS Well-Architected Tool

Amazon API Gateway

Amazon AppStream 2.0

Amazon Athena

Amazon Augmented AI

Amazon Braket

Amazon Chime

Amazon CloudFront

Amazon CloudWatch

Amazon CloudWatch Synthetics

Amazon CodeCatalyst

Amazon CodeGuru Profiler

Amazon CodeGuru Reviewer

Amazon Cognito

Amazon Comprehend

Amazon Connect

Amazon DevOps Guru

Amazon DocumentDB

Amazon EMR Serverless

Amazon ElastiCache

Amazon Elastic Container Registry

Amazon Elastic Container Service

Amazon Elastic File System

Amazon Elastic Kubernetes Service

Amazon Elastic Load Balancing

Amazon Elastic MapReduce

Amazon EventBridge

Amazon FSx

Amazon FreeRTOS

Amazon GameLift

Amazon GuardDuty

Amazon Inspector

Amazon Interactive Video Service

Amazon Kendra

Amazon Kinesis Firehose

Amazon Kinesis Video Streams

Amazon Lightsail

Amazon Location Service

Amazon MQ

Amazon Managed Grafana

Amazon Managed Service for Prometheus

Amazon Managed Streaming for Apache Kafka

Amazon Managed Workflows for Apache Airflow

Amazon MemoryDB for Redis

Amazon OpenSearch Service

Amazon Pinpoint

Amazon Quantum Ledger Database

Amazon Redshift

Amazon Relational Database Service

Amazon Route 53

Amazon SageMaker

Amazon Simple Email Service

Amazon Simple Queue Service

Amazon Transcribe

Amazon Translate

Amazon VPC Lattice

Amazon WorkMail

Amazon WorkSpaces

EC2 Image Builder

Posted Jun 14, 2023 - 11:41 EDT

Resolved

Our cloud provider has stabilized all services. Axe Mobile and Developer Hub have remained stable for some time now, and we are considering this incident resolved. We will continue to monitor.

Posted Jun 13, 2023 - 17:37 EDT

Investigating

Our cloud provider is experiencing performance degradation for several services in the us-east-1 region. This is currently intermittently impacting the availability of http://axe-mobile.deque.com/, as well as Axe Developer Hub.

We are verifying the status of all other Deque services. Note that this is being actively investigated by our cloud provider, and the list of services impacted may continue to change.

AWS status page: https://health.aws.amazon.com/health/status

Posted Jun 13, 2023 - 16:20 EDT

This incident affected: Axe Devtools Mobile.