How Google SRE and Developers Collaborate
Google’s Site Reliability Engineering (SRE) team is a specialist engineering organization focused on designing, building, and maintaining large-scale production services. SREs can be software engineers or systems engineers but usually bring a blend of both skill sets.
Google SRE’s mission is to:
(1) Ensure that Google’s products and infrastructure meet their availability targets.
(2) Subject to (1), maximize long-term feature velocity.
(3) Use software rather than human toil to accomplish (1) and (2).
(4) Engage only when (1) through (3) are accomplished more efficiently by SRE than by developers.
Reliability and velocity are not mutually exclusive. Often, velocity can benefit from improved reliability and vice versa. However, when a tradeoff between reliability and velocity is necessary, SRE prioritizes reliability over velocity, but only until the product or service in question reaches the desired SLO. When the SLO is not met, working on reliability is more important for user satisfaction than feature velocity. When the product is within SLO, additional reliability at the expense of feature velocity is counterproductive. Instead of using a brute-force approach to fulfill its mission, SRE applies engineering and automation rather than repetitive human work (“toil”) to optimize operations.
As a specialist organization, Google SRE is in high demand by product development (hereafter shortened as “Dev”) teams; the opportunities for SRE to provide additional value are plentiful. SRE can be a force multiplier in many situations, but when a problem can be solved just as well by an engineer in the Dev organization, hiring a Dev instead of an SRE is a more flexible approach that creates less cross-organizational overhead. The goal is to staff just enough SREs to maximize the ratio of impact to overhead.
Google SRE should not be taken as a blueprint for implementing SRE elsewhere but rather as a case study. Every organization is unique; their needs and goals are unlikely to be exactly the same as Google’s. But almost twenty years of practical experience have provided many lessons that can help others to fast track their individual SRE journey.
SRE at Google is not static—it is constantly evolving, and different parts of Google apply the model differently according to their needs. Unlike similar functions at many other organizations, SRE is a centralized group at Google. From its humble beginnings, the SRE group has grown to a few thousand engineers. As shown in Figure 1, teams of SREs are dedicated to a specific Product Area (PA) and work closely with their Dev counterparts in that PA. SRE PAs vary in size and can consist of up to a few hundred SREs. SRE PAs are funded by the Dev partner organization and collaborate with them on every organizational level. An SRE PA is typically an order of magnitude smaller than the Dev partner organization, but the ratios can vary heavily. Most SRE teams are dual-homed in two locations, with six to eight SREs each and with a time zone difference of five to nine hours to enable a follow-the-sun on-call rotation.
Figure 1: Google SRE organizational structure
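The dual-homed layout described above (two sites whose clocks differ by five to nine hours) can be checked with a few lines of arithmetic. The following is only a rough sketch: the function names and the use of fixed UTC offsets are assumptions made for illustration, and daylight saving time is ignored.

```python
def offset_gap_hours(utc_offset_a: float, utc_offset_b: float) -> float:
    """Smallest clock difference between two UTC offsets, wrapping around 24 h."""
    diff = abs(utc_offset_a - utc_offset_b) % 24
    return min(diff, 24 - diff)


def supports_follow_the_sun(
    utc_offset_a: float,
    utc_offset_b: float,
    min_gap: float = 5.0,  # lower bound taken from the text
    max_gap: float = 9.0,  # upper bound taken from the text
) -> bool:
    """True if two sites are far enough apart (but not too far) to hand
    the pager back and forth during each site's working hours."""
    return min_gap <= offset_gap_hours(utc_offset_a, utc_offset_b) <= max_gap


# Hypothetical site pairs, given as fixed UTC offsets.
print(supports_follow_the_sun(1, -8))   # UTC+1 and UTC-8: 9 hours apart
print(supports_follow_the_sun(10, -8))  # UTC+10 and UTC-8: 6 hours apart (wraps)
print(supports_follow_the_sun(0, 1))    # UTC+0 and UTC+1: only 1 hour apart
```

The wrap-around matters for pairs on opposite sides of the date line: a nominal 18-hour offset difference is really a 6-hour gap on the clock.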
Engagement Principles
SRE work is based on engagements with their Dev counterparts. An engagement is a collaboration between both sides, typically around a specific service or product. Most often, an engagement is a partnership between SRE and Dev to improve the reliability, infrastructure, and operations of a specific production system. Other engagements might be focused on the end-to-end user experience of a product or a horizontal infrastructure topic; either can span numerous production systems. A typical SRE team maintains a set of engagements with the Dev teams that develop the systems in scope.
The Engagement Model is one of Google SRE’s foundational concepts. It describes the principles for engagements and a set of best practices to facilitate efficient allocation of resources, communication, coordination, and cooperation between SRE and Dev. While not a strict rules-based model, it is aimed at providing clarity and setting mutual expectations for the involved parties and to allow easy identification of outliers or degradations of engagement.
This section describes the principles of the Engagement Model. We’ll then discuss the categories of engagements (“engagement types”) and how to apply them in practice.
1. Aligned with SRE’s Mission
SRE’s mission as indicated earlier is to improve the reliability, efficiency, and velocity of Google’s products, as well as maintain high team health. This mission should be at the core of every engagement, and each engagement should have a measurable positive impact on these goals.
2. Advocate for the User
SRE is an advocate for the user and for the user’s experience—whether that user is external or internal. The fact that SRE’s engagements may be enumerated by systems (or groups of systems) should not diminish SRE’s focus on how the user perceives reliability (or lack thereof). This focus is reflected in an emphasis upon end-to-end, or customer-centric, SLOs, as well as SRE’s responsibility to highlight reliability gaps and risks to Dev partners even when these are outside the immediate areas of responsibility of the SRE team. It may also suggest aligning first at the product level, then focusing SRE teams on particular critical user journeys (CUJs) or end-to-end experiences, even if their particular area of immediate responsibility is delineated by a (possibly wide) group of services.
3. Clear Value Proposition
SRE should only take on work that SRE can perform significantly more efficiently than anyone else. Adding a specialized team to partner with the Dev team introduces additional organizational complexity and increases the risk of silos. If the work can be done with similar quality and efficiency inside of the Dev team, that solution is preferred—it is not only simpler but also allows teams to shift work more flexibly when requirements change.
SREs are skilled, specialized engineers who are highly sought-after talent and paid comparably to their Dev counterparts. In order to justify adding SRE headcount, an engagement should involve substantial reliability engineering work of enduring value, rather than mostly on-call work. Otherwise, adding Dev headcount makes more sense. A certain amount of exposure to on-call work is valuable in order to provide insight into which engineering streams provide the highest value, but providing mostly on-call work to a team of highly trained engineers is likely to lead to dissatisfaction within the SRE team. The fact that a Dev team is too small to provide its own on-call coverage, or is based in a single location, is not prima facie a sufficient reason to justify an SRE engagement.
4. Clear Scope
SRE teams should be scoped to a set of services (or a set of CUJs) with clear correlation and boundaries. SRE does not have an obligation to take accountability for a specific service, but typically provides a base level of support to all products within the Dev team’s scope. Dev and SRE leadership regularly negotiate engagement scope.
5. Funded by Dev
SRE PAs receive headcount grants from their respective Dev orgs. SRE does not receive headcount through its own management chain or carry its own unallocated headcount. While SRE teams are funded by Dev, once headcount is transferred, SRE has responsibility for that headcount. The SRE PA lead has an obligation to use that headcount efficiently and effectively in consultation with the funding Dev partner. Headcount should be returned to the funding Dev org if it cannot be used to deliver substantially more value via enduring SRE work than the funding Dev partner could deliver.
Funding should be long term (but not permanent). It takes a long time to both hire SREs and to onboard SREs to a service. For that reason, Google SRE plans for headcount funding on a time horizon of two or more years and does not tie funding to short-term, time-bounded activities. Swings in headcount level will create inefficiencies and won’t allow the SRE team to engage deeply with the product.
The level of funding on an engagement (or a group of engagements) should be regularly reviewed by SRE and Dev leadership, e.g., annually. The review should consider whether the engagement type is correct and whether to reduce or increase funding—either via a grant or return of headcount or by reallocation within SRE. Decisions should be made by consensus. However, SRE leadership ultimately owns reallocation of project priorities within existing headcount limits. Otherwise, factors like adequate staffing for an engagement might not be fully accounted for.
6. Strategic Partnership
Production excellence is a long-term investment. Engagements are not considered in isolation but at the SRE PA level. The SRE PA as a whole should have a strategic vision that is aligned with and complementary to that of the Dev org. Merely executing a series of unconnected engagements is an anti-pattern. The SRE PA lead owns the SRE PA vision and the task of priority negotiation with the Dev org lead.
Each individual engagement is built according to a multi-year planning horizon. Service engagements are expected to yield a shared road map between Dev and SRE. Work should move in both directions between SRE and Dev. SRE is not simply a repository for work handed to it by Dev.
Expectations should be set before there are issues in the arrangement; under duress it is more difficult to form a written agreement. Systems and their components change, merge, and diverge. SRE needs to carefully move with the product and can’t pivot instantly to support a new system without sufficient ramp-up time.
7. Dev Ownership
Irrespective of the type of engagement, the service itself and its reliability is ultimately owned by the Dev team, even if day-to-day production authority rests with SRE under some forms of engagement. This means responsibility for having a reliable service is not off-loaded onto the SRE team; rather, the SRE team members are specialists in reliability engineering who can help the Dev team attain their reliability objectives by working in partnership under one of the engagement types (which in turn set out SRE’s responsibility to Dev).
An active, robust Dev engagement is part of a healthy service. Since SRE doesn’t control headcount allocation, it cannot be solely responsible for a service, and historical cases where Dev engagement ended while the service was still live and SRE supported (“abandoned services”) have ended poorly. Accordingly, Dev teams intending to sunset their staffing of a service need to also plan to sunset the service itself and migrate remaining users to other services. SRE’s engagement with a service will cease once it no longer has Dev support, and any assigned headcount will be returned by the time Dev engagement ends.
8. Joint Partnership
Starting and continuing with an SRE engagement is a joint decision for Dev and SRE. SRE cannot be forced to take an engagement, Dev cannot be forced to fund one, and either Dev or SRE can end one.
Corollary: If either Dev or SRE wishes to end an engagement, it should end, and the headcount position should be reexamined (either redeployed or returned) in a manner compliant with the funding principles discussed above. Ending an engagement by a means other than consensus is something that both parties should seek to avoid.
9. Shared Endeavor
SRE and Dev bring different expertise: SRE focuses on reliability principles, system architecture, and best practices for production, while the Dev org is typically more experienced in its business domain. The success of a service is a shared endeavor. Despite being separate teams and having different roles, both sides work toward a common goal. This includes joint OKRs (objectives and key results) where appropriate and adhering to an error budget policy (for example, freezing feature releases when a service/CUJ is out of SLO). Dev and SRE have a shared interest in operating a service within SLO in the most cost-efficient manner possible, so SLO violations are a critical issue for Dev and SRE to address together.
SLOs and error budgets promote a common understanding of reliability goals and an objective tool to measure success. This allows SRE and Dev to jointly make informed decisions about whether the balance between reliability and velocity needs adjustment. Freeze policies provide a simple way to adjust that balance toward reliability when customer/user trust is in danger of being broken.
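To make the relationship between an SLO, its error budget, and a freeze policy concrete, here is a minimal sketch. It assumes a request-based availability SLO measured over a single window; the class name, field names, and the rule "freeze once the service is out of SLO" are illustrative assumptions, not a description of Google's actual tooling or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def availability(self) -> float:
        """Fraction of requests that succeeded in this window."""
        if self.total_requests == 0:
            return 1.0  # no traffic means nothing failed
        good = self.total_requests - self.failed_requests
        return good / self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_failures = (1.0 - self.slo_target) * self.total_requests
        if allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / allowed_failures

    def should_freeze_features(self) -> bool:
        """Assumed policy: freeze feature releases while the service is out of SLO."""
        return self.availability() < self.slo_target


healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
burning = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

for window in (healthy, burning):
    print(
        f"availability={window.availability():.4%} "
        f"budget_remaining={window.budget_remaining():.0%} "
        f"freeze={window.should_freeze_features()}"
    )
```

The remaining budget gives Dev and SRE a shared number to discuss before the SLO is breached, while the freeze check is the blunt instrument that shifts the balance toward reliability once it is.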
Operational and on-call responsibilities are also a shared endeavor, and as a service becomes more mature, the bulk (but not 100%) of operational responsibilities are often carried by SRE.
10. SRE Is Not an “Ops Team”
SRE’s mission is not to handle operations but to improve the inherent reliability of systems through engineering. Being on call is a means to an end for SRE; it often provides valuable insights that wouldn’t be available otherwise. However, on-call work has no long-term value in and of itself. On-call coverage is not at the core of SRE work, and it alone does not justify the formation of an SRE team. SRE has strict limits on ops work; toilsome work (interrupts, production clean-up, etc.) should not exceed 50% of the SRE team’s time. If toil exceeds this threshold, Dev must handle the excess ops work. This mechanism guarantees that SRE has enough time to work on projects that reduce the ops workload.
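The 50% toil cap reduces to simple arithmetic. The sketch below assumes toil is tracked in hours per planning period; the function name and the hour-based accounting are hypothetical, chosen only to illustrate how the overflow that falls to Dev could be computed.

```python
def excess_toil_hours(
    toil_hours: float,
    total_hours: float,
    cap: float = 0.5,  # the 50% limit stated in the text
) -> float:
    """Hours of toil above the cap, which under this principle go back to Dev."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= toil_hours <= total_hours:
        raise ValueError("toil_hours must be between 0 and total_hours")
    return max(0.0, toil_hours - cap * total_hours)


# A hypothetical team with 200 engineering hours in the period.
print(excess_toil_hours(toil_hours=130, total_hours=200))  # over the cap
print(excess_toil_hours(toil_hours=80, total_hours=200))   # under the cap
```

In the first call the team spent 130 of 200 hours on toil, so 30 hours exceed the 100-hour cap; in the second call nothing overflows.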
It is expected that Dev always carries at least some of the operational responsibilities. Typical examples include a secondary on-call rotation for escalations, ownership of non-production environments, and/or handling noncritical ops work. The exposure to ops is essential to maintain and foster production knowledge in the Dev team. The split of responsibilities should be tracked in writing to avoid misunderstandings.
11. Ops Is Not a Zero-Sum Game
Instead of simply moving operational responsibilities from one place to another, an SRE engagement should focus on reducing the overall ops workload. A successful engagement reduces the ops load to a point where which team holds the pager is no longer critical. Independent of who is officially holding the pager, Dev is generally expected to maintain a 24/7 on-call escalation path.
12. Teach to Fish
SRE should not serve as a human abstraction layer for production. This approach is not scalable, reinforces silos, undermines critical feedback loops, and turns production complexity into an existential need to justify SRE’s existence. Instead, SRE helps Dev gain a deeper understanding of the production aspects of the service.
13. Promote Production Standardization
SRE should promote the use of common production platforms and standardized infrastructure. Such platforms have several advantages:
- Provide a consistent service management infrastructure, which reduces the cost of implementing cross-service requirements (“horizontals”) in production.
- Reduce the ongoing cost of operating individual services in production (e.g., onboarding time, engineer training time, toil).
- Reduce the cost of supporting all services in production in aggregate; by making skills portable, it is simpler for engineers to work on disparate services.
- Reduce the cost of moving services between Dev and SRE as well as between different SRE teams.
- Improve the mobility of engineers between teams.
- Simplify and reduce risk in production.
- Improve engineering velocity.
- Improve resource efficiency of production services in aggregate.
SRE should promulgate standards for production platforms at an SRE PA level—a principle that’s applicable to services irrespective of the level of SRE support.
14. Meaningful Work
Quality of work must be a priority. SREs at Google have the same opportunities around mobility as Dev and therefore require a novel, challenging, interesting environment to allow personal development. SRE aligns closely with Dev on OKR planning but ultimately owns its own OKRs.
15. Success Must Be Tracked
SRE engagements are a significant investment and require structured planning and success tracking. SRE and Dev maintain a shared road map and track progress toward goals. They regularly review service health, criticality, business justification, and priority. This can be facilitated through business reviews, quarterly reports, and production health reviews.
16. Shift Left
SRE engagements are possible at any phase of the service life cycle—not only after the production launch. Often, they are most impactful and efficient when they happen early in the life cycle (are “shifted left”)—for example, during design and implementation. Fundamental architecture and infrastructure decisions can be changed easily during the design phase but are often extremely hard or prohibitively expensive to revise for a fully productionized system. An early engagement with SRE can prevent significant headaches later.