
Why AWS S3 Went Down: An Engineer's Typo Removed the Wrong Servers and Caused a Four-Hour Outage!

2017-03-03 云头条

AWS has explained how the S3 storage service in its sprawling US-EAST-1 region came to be disrupted, and what it is doing to prevent the same thing from happening again.

 


On Thursday, AWS said that a mistyped command caused the hours-long Amazon Web Services (AWS) outage that knocked well-known websites offline on Tuesday and caused problems for several others.


The cloud infrastructure provider offered the following explanation:


The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an authorized S3 team member, following an established playbook, executed a command intended to remove a small number of servers for one of the S3 subsystems used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly, and a far larger set of servers was removed than intended.


The mistake inadvertently took out servers supporting two subsystems that are critical to every S3 object in the US-EAST-1 region, a massive data-center region that also happens to be Amazon's oldest. Both subsystems required a full restart. Amazon noted that this process, along with running the necessary safety checks, "took longer than expected."


While the restart was under way, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.


Amazon noted that the index subsystem was fully recovered by 1:18 PM PST, and the placement subsystem returned to normal at 1:54 PM PST. At that point, S3 was operating normally.


AWS said that, as a result of the event, it is "making several changes," including measures to prevent an incorrect input from triggering this kind of problem in the future.


The official blog post explained: "While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
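AWS has not published its internal tooling, but the safeguard it describes is straightforward to picture. The following Python sketch is a hypothetical illustration only: the subsystem names, minimum-capacity figures, batch size, and pacing are assumptions, not AWS values. Removal happens in small batches, and the whole request is refused up front if it would push any subsystem below its minimum required capacity.

```python
# Hypothetical sketch only -- assumed subsystem names, minimums, and pacing.
import time
from collections import Counter

MIN_CAPACITY = {"billing": 6, "index": 270, "placement": 135}  # assumed floors
BATCH_SIZE = 2               # remove only a few hosts per batch
PAUSE_BETWEEN_BATCHES = 30   # seconds between batches (assumed pacing)

def subsystem_of(host: str) -> str:
    """'s3-index-42' -> 'index' (assumed naming convention)."""
    return host.split("-")[1]

def remove_capacity(fleet: list[str], to_remove: list[str]) -> None:
    remaining = Counter(subsystem_of(h) for h in fleet)
    requested = Counter(subsystem_of(h) for h in to_remove)

    # Safeguard: reject the whole request if any subsystem would end up
    # below its minimum required capacity.
    for subsystem, count in requested.items():
        if remaining[subsystem] - count < MIN_CAPACITY[subsystem]:
            raise RuntimeError(
                f"refusing removal: {subsystem} would drop to "
                f"{remaining[subsystem] - count}, below its minimum of "
                f"{MIN_CAPACITY[subsystem]}"
            )

    # Slow removal: act in small batches, pausing between them so an
    # operator or an alarm can intervene before much capacity is gone.
    for i in range(0, len(to_remove), BATCH_SIZE):
        batch = to_remove[i:i + BATCH_SIZE]
        print(f"decommissioning {batch}")  # real tooling would act here
        time.sleep(PAUSE_BETWEEN_BATCHES)
```

The key design point in this sketch is that the minimum-capacity check runs against the request as a whole before anything is decommissioned, so a bad input fails fast instead of partially executing.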


Other notable steps AWS has already taken: it has begun work on partitioning parts of the index subsystem into smaller cells. The company has also changed the administration console of the AWS Service Health Dashboard so that the dashboard can run across multiple AWS regions. Ironically, Tuesday's typo knocked the dashboard itself offline, forcing AWS to fall back on Twitter to keep customers updated on the problem.


Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region


We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
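AWS has not disclosed the actual command or which input was mistyped. Purely as an illustration of how a small input error can balloon into a large removal, here is a hypothetical Python sketch; the host names, the shell-style pattern matching, and the selection tool are invented for this example.

```python
# Hypothetical illustration only -- invented host names and tooling.
import fnmatch

FLEET = (
    [f"s3-billing-{i}" for i in range(1, 9)]        # 8 billing hosts
    + [f"s3-index-{i}" for i in range(1, 301)]      # 300 index hosts
    + [f"s3-placement-{i}" for i in range(1, 151)]  # 150 placement hosts
)

def select_hosts(pattern: str) -> list[str]:
    """Return every host whose name matches a shell-style pattern."""
    return [h for h in FLEET if fnmatch.fnmatch(h, pattern)]

print(len(select_hosts("s3-b*")))  # intended selection: 8 billing hosts
print(len(select_hosts("s3-*")))   # one dropped character: all 458 hosts
```

In this made-up fleet, dropping a single character from the selection pattern grows the match from the 8 billing hosts to all 458 hosts, including the index and placement fleets.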


S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally.  The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.


We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
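The cell idea can be sketched in a few lines. The following Python fragment is a simplified, hypothetical illustration of cell-based partitioning; the cell count, hashing scheme, and IndexCell class are assumptions, not S3's actual index architecture. Each cell owns a slice of the keyspace, so restarting or losing one cell affects only that slice.

```python
# Simplified, hypothetical sketch of cell-based partitioning -- not S3's
# actual index architecture; cell count and hashing scheme are assumed.
import hashlib

NUM_CELLS = 16  # assumed number of cells

def cell_for_object(bucket: str, key: str) -> int:
    """Map an object to one cell by hashing its bucket and key."""
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return digest[0] % NUM_CELLS

class IndexCell:
    """One independently restartable slice of the index keyspace."""

    def __init__(self) -> None:
        self.metadata: dict[str, dict] = {}

    def put(self, object_name: str, meta: dict) -> None:
        self.metadata[object_name] = meta

cells = [IndexCell() for _ in range(NUM_CELLS)]

# Each write touches exactly one cell, so restarting or losing a single
# cell affects only the keys that hash to it; the other cells keep serving.
for key in ("cat.jpg", "dog.jpg", "logs/2017-02-28.txt"):
    cells[cell_for_object("demo-bucket", key)].put(key, {"size_bytes": 123})
```

Smaller cells shrink the blast radius of any single failure and make it practical to rehearse a full restart one cell at a time rather than across an entire region.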


From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD.  We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
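As a rough illustration of what removing that single-region dependency could look like, here is a hypothetical Python sketch; the endpoints are made up and this is not how the Service Health Dashboard is actually implemented. The status document is assumed to be replicated to several regions, and the reader simply falls through to the next copy when one region is unreachable.

```python
# Hypothetical sketch -- these endpoints are invented and this is not how
# the AWS Service Health Dashboard is implemented.
import urllib.request

# Assumed per-region replicas of the same status document.
STATUS_ENDPOINTS = [
    "https://status.us-east-1.example.com/health.json",
    "https://status.us-west-2.example.com/health.json",
    "https://status.eu-west-1.example.com/health.json",
]

def fetch_status(timeout: float = 3.0) -> bytes:
    """Try each regional replica in turn so one region's outage is not fatal."""
    last_error = None
    for url in STATUS_ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:  # DNS failure, refused connection, timeout, ...
            last_error = error
    raise RuntimeError(f"all status endpoints unreachable: {last_error}")
```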


Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.


Translated and compiled by 云头条 (Yuntoutiao) | Reproduction without authorization is prohibited


Related reading:

AWS S3 Cloud Storage Mysteriously Vanishes: Major Websites and Docker All Hit!

Cloudflare Programmer Wrote == Where >= Was Intended, Causing a Memory Leak That Left Half the Internet Reeling

