
Does Kubernetes HPA Always Reduce Resource Usage? Our HPA Observability Practice!

毕海成 Qunar技术沙龙 2023-04-08

About the author:

毕海成 (Bi Haicheng), senior development engineer at Qunar (去哪儿旅行), mainly responsible for the network platform, the hardware platform, and Kubernetes-related development and operations.


I. Background

HPA (Horizontal Pod Autoscaler) is an important Kubernetes feature: it automatically scales workloads horizontally based on CPU, memory, and other metrics, which has many obvious advantages over manual scaling. Because our Kubernetes deployment spans multiple clusters, we inevitably need a multi-cluster HPA. A user only configures the CPU, memory, and custom-metric thresholds for their appcode, plus the overall minimum and maximum replica counts the appcode needs; the multi-cluster HPA then automatically distributes per-cluster minimum and maximum replica counts according to each cluster's weight.
Cluster weights can be set in two ways: manually or automatically. Manually configured weights have the advantage of being user-controlled; for example, during data-center maintenance or certain component upgrades, an operator can set a cluster's weight to 0 so that applications are not rolled out to that cluster. Their drawback is that adjustments are not real-time: each cluster's resource usage changes dynamically, so weights should track the current resource situation in real time. This led to the second approach, automatic weight adjustment, which computes each cluster's currently remaining resources and adjusts the weights accordingly; if multiple appcodes scale up concurrently, a resource lock is also required. Our current scheme takes the best of both approaches: it supports both manual and automatic adjustment, and manual adjustments take priority over automatic ones.
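To make the weight-to-replica mapping concrete, here is a minimal sketch of weight-based allocation; the function and argument names are illustrative assumptions, not the production implementation:

# Sketch: split an appcode's global min/max replicas across clusters by weight.
# Names are illustrative only, not the production code.
def allocate_replicas(min_replicas, max_replicas, weights):
    total = sum(weights.values())
    alloc = {}
    for cluster, weight in weights.items():
        ratio = weight / total if total else 0
        alloc[cluster] = {'min': round(min_replicas * ratio),
                          'max': round(max_replicas * ratio)}
    return alloc

# A manual weight of 0 (e.g. during data-center maintenance) removes the
# cluster from new rollouts; manual weights override automatic ones.
print(allocate_replicas(4, 20, {'cluster-a': 2, 'cluster-b': 1, 'cluster-c': 0}))
# -> {'cluster-a': {'min': 3, 'max': 13}, 'cluster-b': {'min': 1, 'max': 7},
#     'cluster-c': {'min': 0, 'max': 0}}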
Given this understanding of HPA, many people may assume that HPA is guaranteed to reduce resource usage, since its most visible behavior is scaling down at off-peak times, scaling up at peak times, and allocating resources on demand. But is that actually true? Are the configured minimum/maximum replica counts reasonable? If the minimum replica count is set too high, can scale-down even happen?
What benefits can HPA bring?
  • Higher resource utilization

  • Lower labor costs (no manual adjustment of server counts)

  • A more concrete and intuitive view of HPA behavior and of each application's resource needs, which supports operations decisions
How can these benefits be demonstrated?
The measurements include the following:
  • Scale-up count

  • Scale-down count

  • Count of times pinned at the max-replica ceiling

  • Count of times pinned at the min-replica floor

  • HPA thresholds (CPU, memory, custom)

  • Minimum replica count

  • Maximum replica count

  • Peak-hours replica count and average CPU usage (if the workload sits at max replicas throughout peak hours, the HPA upper bound can be raised appropriately)

  • Off-peak replica count and average CPU usage (if the workload sits at min replicas throughout off-peak hours, the HPA lower bound can be lowered appropriately)

II. HPA Statistics

Data Collection

Of the metrics above, the scale-up count (uc), scale-down count (dc), max-replica-ceiling count (maxuc), and min-replica-floor count (mindc) need to be collected.
Create a new table tbxxx (the HPA metrics table) to collect (uc, dc, maxuc, mindc):
create table tb_hpa_xxxx (
    id          SERIAL PRIMARY KEY,
    appcode     varchar(256),
    uc          int DEFAULT 0,
    dc          int DEFAULT 0,
    maxuc       int DEFAULT 0,
    mindc       int DEFAULT 0,
    create_time timestamptz NOT NULL DEFAULT now(),
    update_time timestamptz NOT NULL DEFAULT now()
);
COMMENT ON TABLE tb_hpa_xxxx IS 'HPA metrics';
COMMENT ON COLUMN tb_hpa_xxxx.id IS 'auto-increment ID';
COMMENT ON COLUMN tb_hpa_xxxx.appcode IS 'appcode';
COMMENT ON COLUMN tb_hpa_xxxx.uc IS 'scale-up count';
COMMENT ON COLUMN tb_hpa_xxxx.dc IS 'scale-down count';
COMMENT ON COLUMN tb_hpa_xxxx.maxuc IS 'times at max replicas';
COMMENT ON COLUMN tb_hpa_xxxx.mindc IS 'times at min replicas';
COMMENT ON COLUMN tb_hpa_xxxx.create_time IS 'creation time';
COMMENT ON COLUMN tb_hpa_xxxx.update_time IS 'update time';
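How do uc, dc, maxuc, and mindc get bumped? The collection job itself is not shown here, so below is a minimal sketch, assuming a periodic task that compares each appcode's current replica count against the previous observation; get_current_replicas, get_hpa_bounds, and db are hypothetical placeholders:

# Minimal sketch of the counter collection, not the production job.
# get_current_replicas / get_hpa_bounds / db are hypothetical helpers.
def update_counters(appcode, prev_replicas, db):
    cur = get_current_replicas(appcode)      # live replica count of the workload
    min_r, max_r = get_hpa_bounds(appcode)   # configured min/max replicas
    if cur > prev_replicas:                  # scaled up since last check
        db.execute("update tb_hpa_xxxx set uc = uc + 1 where appcode = %s", (appcode,))
    elif cur < prev_replicas:                # scaled down since last check
        db.execute("update tb_hpa_xxxx set dc = dc + 1 where appcode = %s", (appcode,))
    if cur >= max_r:                         # pinned at the ceiling
        db.execute("update tb_hpa_xxxx set maxuc = maxuc + 1 where appcode = %s", (appcode,))
    elif cur <= min_r:                       # pinned at the floor
        db.execute("update tb_hpa_xxxx set mindc = mindc + 1 where appcode = %s", (appcode,))
    return cur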

Scale-Up/Down Count Statistics

Using the collected data together with the HPA configuration, compute per appcode and env: uc (scale-up count), dc (scale-down count), maxuc (times at the max-replica ceiling), mindc (times at the min-replica floor), the HPA thresholds (CPU, memory, and custom-metric thresholds from the HPA configuration), and the cluster-wide minimum and maximum replica counts.
select G.*, N.min_replicas, N.max_replicas
from (
    select A.deployment_base as env, A.appcode, A.annotations as hpa,
           coalesce(M.uc, 0) as uc, coalesce(M.dc, 0) as dc,
           coalesce(M.maxuc, 0) as maxuc, coalesce(M.mindc, 0) as mindc
    from (
        select appcode, deployment_base,
               detail->'metadata'->'annotations' as annotations
        from tb_k8s_hpaxxx
        where dep_status = 0 and status = 0
        group by appcode, deployment_base, detail->'metadata'->'annotations'
    ) A
    left join (
        select appcode, env_name, sum(uc) as uc, sum(dc) as dc,
               sum(maxuc) as maxuc, sum(mindc) as mindc
        from tb_hpa_metrics
        where create_time >= '2022-06-10' and create_time < '2022-06-11'
        group by appcode, env_name
    ) M on M.appcode = A.appcode and M.env_name = A.deployment_base
) G
left join tb_k8s_appcode_hpa N
    on G.appcode = N.appcode and G.env = N.deployment_base;

III. Container CPU Usage Statistics

When computing container CPU usage in a Kubernetes environment with Prometheus, you will frequently see CPU usage exceed 100%. The cAdvisor metrics involved are explained below (a worked sketch of the arithmetic follows the list):
  1. container_spec_cpu_period

    The CFS scheduling window applied when the container's CPU is limited, also called the container's CPU clock period; it is usually 100,000 microseconds.
  2. container_spec_cpu_quota

    The total CPU time the container may use per period. If the quota is set to 700,000, the container can use 7 × 100,000 microseconds of CPU time per period; this usually corresponds to resource.cpu.limits in Kubernetes.
  3. container_spec_cpu_shares

    The container's relative share of the host's CPU. For example, with a request of 500m the container asks the node for 0.5 CPU when it starts, i.e. 50,000 microseconds per period; this usually corresponds to resource.cpu.requests in Kubernetes.
  4. container_cpu_usage_seconds_total

    The cumulative CPU time the container has consumed, in seconds; note that it is accumulated across all of the container's cores.
  5. container_cpu_system_seconds_total

    The cumulative CPU time the container has consumed in kernel (system) mode, in seconds.
  6. container_cpu_user_seconds_total

    The cumulative CPU time the container has consumed in user mode, in seconds.
    (Reference: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md)
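Readings above 100% follow directly from these definitions: container_cpu_usage_seconds_total accumulates across every core, so its per-second rate is "number of cores in use", not a fraction of a single core. A small worked sketch with made-up numbers:

# Why container CPU usage can exceed 100% (illustrative numbers only).
period = 100_000   # container_spec_cpu_period, in microseconds
quota = 700_000    # container_spec_cpu_quota -> a 7-core limit (resource.cpu.limits)

# Suppose container_cpu_usage_seconds_total grew by 3.5s over a 1s window,
# i.e. the container burned 3.5 cores' worth of CPU time in that second.
cores_used = 3.5

vs_one_core = cores_used / 1.0            # 3.5 -> rendered as 350%
vs_limit = cores_used / (quota / period)  # 3.5 / 7 = 0.5 -> 50% of the limit
print(f"{vs_one_core:.0%} of one core, {vs_limit:.0%} of the CPU limit")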

Querying the mean of each cluster's P50/P90/P99 (the average of the per-cluster P50, P90, P99):

select appcode,
       avg(p50) as p50,
       avg(p90) as p90,
       avg(p99) as p99,
       avg(mean) as mean
from tb_cpu_usage_statx
where sampling_point = 'day'
  and stat_start >= '2022-06-08'
  and stat_end < '2022-06-09'
group by appcode;

Querying the P50/P90/P99 of the per-cluster P50/P90/P99:

select appcode,
       percentile_cont(0.5) within group (order by p50) as p50,
       percentile_cont(0.9) within group (order by p90) as p90,
       percentile_cont(0.99) within group (order by p99) as p99,
       avg(mean) as mean
from tb_cpu_usage_stat
where sampling_point = 'day'
  and stat_start >= '2022-06-08'
  and stat_end < '2022-06-09'
group by appcode;

The P90 of the per-cluster P90s is not the same as a P90 computed across all clusters together. An example follows; the same applies to P99 and P50.
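Since percentiles do not compose, here is a tiny synthetic demonstration (made-up numbers) of the gap:

import numpy as np

rng = np.random.default_rng(0)
# Three clusters with different sizes and load profiles.
clusters = [rng.normal(30, 5, 1000),   # large, lightly loaded cluster
            rng.normal(60, 10, 100),   # medium cluster
            rng.normal(90, 5, 10)]     # small, hot cluster

per_cluster_p90 = [np.percentile(c, 90) for c in clusters]
p90_of_p90s = np.percentile(per_cluster_p90, 90)
global_p90 = np.percentile(np.concatenate(clusters), 90)

# The per-cluster aggregate over-weights the small hot cluster:
print(f"P90 of per-cluster P90s: {p90_of_p90s:.1f}")
print(f"P90 over all raw samples: {global_p90:.1f}")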

So using the pre-aggregated data directly would be inaccurate; the peak and off-peak CPU usage must be recomputed from the raw data.

Single-Day Peak-Hours CPU Usage

(2022-06-08 08:00 - 2022-06-08 23:00)

select appcode,
       percentile_cont(0.5) within group (order by cpu_usage) as p50,
       percentile_cont(0.9) within group (order by cpu_usage) as p90,
       percentile_cont(0.99) within group (order by cpu_usage) as p99,
       avg(cpu_usage)
from tb_container_cpu_usage_seconds_total
where collect_time >= '2022-06-08 08:00:00'
  and collect_time <= '2022-06-08 22:59:59'
group by appcode;

The results are shown below:

Single-Day Off-Peak CPU Usage

(2022-06-08 23:00 - 2022-06-08 23:59:59 and 2022-06-08 00:00 - 2022-06-08 07:59:59)

select appcode,
       percentile_cont(0.5) within group (order by cpu_usage) as p50,
       percentile_cont(0.9) within group (order by cpu_usage) as p90,
       percentile_cont(0.99) within group (order by cpu_usage) as p99,
       avg(cpu_usage)
from tb_container_cpu_xxx
where (collect_time >= '2022-06-08 23:00:00' and collect_time <= '2022-06-08 23:59:59')
   or (collect_time >= '2022-06-08 00:00:00' and collect_time <= '2022-06-08 07:59:59')
group by appcode;

As shown below:

Off-Peak Pod Count Statistics

select appcode,
       -- sum of the hourly records divided by the 9 off-peak hours
       -- (23:00-24:00 plus 00:00-08:00), assuming one record per hour
       round(sum(pod_replicas_avail) / 9.0, 2) as pods
from tb_k8s_resource
where ((record_time >= '2022-06-09 23:00:00' and record_time <= '2022-06-09 23:59:59')
    or (record_time >= '2022-06-09 00:00:00' and record_time <= '2022-06-09 07:59:59'))
group by appcode;
The results are as follows:

Peak-Hours Pod Count Statistics

select appcode,
       -- sum of the hourly records divided by the 15 peak hours (08:00-23:00),
       -- assuming one record per hour
       round(sum(pod_replicas_avail) / 15.0, 2) as pods
from tb_k8s_resource
where record_time >= '2022-06-09 08:00:00'
  and record_time <= '2022-06-09 22:59:59'
group by appcode;

The execution results are as follows:

IV. Report Data

Write the daily statistics computed above into a table to keep a history; this makes it easy to aggregate weekly and monthly statistics later.

create table tb_hpa_report_xxx (
    id           SERIAL PRIMARY KEY,
    appcode      varchar(256),
    env_name     varchar(256),
    uc           int DEFAULT 0,
    dc           int DEFAULT 0,
    maxuc        int DEFAULT 0,
    mindc        int DEFAULT 0,
    cpu          int DEFAULT 0,
    mem          int DEFAULT 0,
    cname        VARCHAR(512) DEFAULT '',
    cval         int DEFAULT 0,
    min_replicas int DEFAULT 0,
    max_replicas int DEFAULT 0,
    hpods        numeric(10,2) DEFAULT 0,
    lpods        numeric(10,2) DEFAULT 0,
    hcpu_p50     numeric(10,4) DEFAULT 0,
    hcpu_p90     numeric(10,4) DEFAULT 0,
    hcpu_p99     numeric(10,4) DEFAULT 0,
    hcpu_mean    numeric(10,4) DEFAULT 0,
    lcpu_p50     numeric(10,4) DEFAULT 0,
    lcpu_p90     numeric(10,4) DEFAULT 0,
    lcpu_p99     numeric(10,4) DEFAULT 0,
    lcpu_mean    numeric(10,4) DEFAULT 0,
    record_time  timestamptz,
    create_time  timestamptz NOT NULL DEFAULT now(),
    update_time  timestamptz NOT NULL DEFAULT now()
);
COMMENT ON TABLE tb_hpa_report_xxx IS 'HPA report data';
COMMENT ON COLUMN tb_hpa_report_xxx.id IS 'auto-increment ID';
COMMENT ON COLUMN tb_hpa_report_xxx.appcode IS 'appcode';
COMMENT ON COLUMN tb_hpa_report_xxx.env_name IS 'env_name';
COMMENT ON COLUMN tb_hpa_report_xxx.uc IS 'scale-up count';
COMMENT ON COLUMN tb_hpa_report_xxx.dc IS 'scale-down count';
COMMENT ON COLUMN tb_hpa_report_xxx.maxuc IS 'times at max replicas';
COMMENT ON COLUMN tb_hpa_report_xxx.mindc IS 'times at min replicas';
COMMENT ON COLUMN tb_hpa_report_xxx.cpu IS 'cpu threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.mem IS 'memory threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.cname IS 'custom metric name';
COMMENT ON COLUMN tb_hpa_report_xxx.cval IS 'custom metric threshold';
COMMENT ON COLUMN tb_hpa_report_xxx.min_replicas IS 'min replicas';
COMMENT ON COLUMN tb_hpa_report_xxx.max_replicas IS 'max replicas';
COMMENT ON COLUMN tb_hpa_report_xxx.hpods IS 'average pod count during peak hours';
COMMENT ON COLUMN tb_hpa_report_xxx.lpods IS 'average pod count during off-peak hours';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p50 IS 'peak cpu p50 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p90 IS 'peak cpu p90 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_p99 IS 'peak cpu p99 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.hcpu_mean IS 'peak cpu mean usage';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p50 IS 'off-peak cpu p50 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p90 IS 'off-peak cpu p90 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_p99 IS 'off-peak cpu p99 usage';
COMMENT ON COLUMN tb_hpa_report_xxx.lcpu_mean IS 'off-peak cpu mean usage';
COMMENT ON COLUMN tb_hpa_report_xxx.record_time IS 'statistics date';
COMMENT ON COLUMN tb_hpa_report_xxx.create_time IS 'creation time';
COMMENT ON COLUMN tb_hpa_report_xxx.update_time IS 'update time';

V. Code Implementation

  1. A scheduled job computes the daily HPA and CPU-usage statistics and writes them into the report table (a sketch of one possible scheduling wiring follows the class):

"""HPA统计相关"""from server.db.base import Basefrom server.conf.conf import CONFimport sentry_sdkimport datetimefrom sqlalchemy import textfrom server.db.model.meta import commit_on_success, db, try_catch_db_exceptionfrom server.db.model.model import HpaReportModelfrom server.db.hpa import HPAfrom server.libs.mail import SendMailfrom server.libs.qtalk import SendQtalkMsgfrom server.libs.decorators import statsd_indexfrom server.libs.error import Errorimport logging

LOG = logging.getLogger('gunicorn.access')
class HpaReport(Base):
def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs)
@try_catch_db_exception @commit_on_success def stats_hpa_updown(self, start_time, end_time): rows = db.session.execute(text( """ select G.*, N.min_replicas, N.max_replicas from ( select A.deployment_base as env, A.appcode, A.annotations as hpa, coalesce(M.uc, 0) as uc, coalesce(M.dc, 0) as dc, coalesce(M.maxuc, 0) as maxuc, coalesce(M.mindc, 0) as mindc from ( select appcode, deployment_base, detail->'metadata'->'annotations' as annotations from tb_k8s_hpa_rec where dep_status = 0 and status = 0 group by appcode, deployment_base, detail->'metadata'->'annotations' ) A left join ( select appcode, env_name, sum(uc) as uc, sum(dc) as dc, sum(maxuc) as maxuc, sum(mindc) as mindc from tb_hpa_metrics where create_time >= :start_time and create_time <= :end_time group by appcode, env_name ) M on M.appcode = A.appcode and M.env_name = A.deployment_base ) G left join tb_k8s_appcode_hpa N on G.appcode = N.appcode and G.env = N.deployment_base; """ ), {"start_time": start_time, "end_time": end_time}) LOG.info(f'stats_hpa_updown: {rows.rowcount}') return self.rows_as_dicts(rows.cursor)
@try_catch_db_exception @commit_on_success def stats_high_time_cpu(self, start_time, end_time): LOG.info(f'stats_high_time_cpu: {start_time}, {end_time}') rows = db.session.execute(text( """ select appcode, percentile_cont(0.5) within group ( order by cpu_usage ) as p50, percentile_cont(0.9) within group ( order by cpu_usage ) as p90, percentile_cont(0.99) within group ( order by cpu_usage ) as p99, avg(cpu_usage) from tb_container_cpu_usage_seconds_total where collect_time >= :start_time and collect_time <= :end_time group by appcode """ ), {"start_time": start_time, "end_time": end_time}) LOG.info(f'stats_high_time_cpu: {rows.rowcount}') return self.rows_as_dicts(rows.cursor)
@try_catch_db_exception @commit_on_success def stats_high_time_pods(self, start_time, end_time): LOG.info(f'stats_high_time_pods: {start_time}, {end_time}') rows = db.session.execute(text( """ select appcode, round(sum(pod_replicas_avail) / 15.0, 2) as pods from tb_k8s_resource where record_time >= :start_time and record_time <= :end_time group by appcode """ ), {"start_time": start_time, "end_time": end_time}) LOG.info(f'stats_high_time_pods: {rows.rowcount}') return self.rows_as_dicts(rows.cursor)
@try_catch_db_exception @commit_on_success def stats_low_time_pods(self, s1, e1, s2, e2): LOG.info(f'stats_low_time_pods: {s1}, {e1}, {s2}, {e2}') """低峰期分两段(2022-06-08 23:00 - 2022-06-08 23:59:59, 2022-06-08 00:00 - 2022-06-08 07:59:59)
@param s1 start_time1 低峰时段1开始时间 @param e1 end_time1 低峰时段1结束时间 @param s2 start_time2 低峰时段2开始时间 @param e2 end_time2 低峰时段2结束时间 """ rows = db.session.execute(text( """ select appcode, round(sum(pod_replicas_avail) / 9.0, 2) as pods from tb_k8s_resource where ( ( record_time >= :s1 and record_time <= :e1 ) or ( record_time >= :s2 and record_time <= :e2 ) ) group by appcode """ ), {"s1": s1, "e1": e1, "s2": s2, "e2": e2}) LOG.info(f'stats_low_time_pods: {rows.rowcount}') return self.rows_as_dicts(rows.cursor)
@staticmethod def rows_as_dicts(cursor): """convert tuple result to dict with cursor""" col_names = [i[0] for i in cursor.description] return [dict(zip(col_names, row)) for row in cursor]
@try_catch_db_exception @commit_on_success def stats_low_time_cpu(self, s1, e1, s2, e2): """低峰期分两段(2022-06-08 23:00 - 2022-06-08 23:59:59, 2022-06-08 00:00 - 2022-06-08 07:59:59)
@param s1 start_time1 低峰时段1开始时间 @param e1 end_time1 低峰时段1结束时间 @param s2 start_time2 低峰时段2开始时间 @param e2 end_time2 低峰时段2结束时间 """ LOG.info(f'stats_low_time_cpu: {s1}, {e1}, {s2}, {e2}') rows = db.session.execute(text( """ select appcode, percentile_cont(0.5) within group ( order by cpu_usage ) as p50, percentile_cont(0.9) within group ( order by cpu_usage ) as p90, percentile_cont(0.99) within group ( order by cpu_usage ) as p99, avg(cpu_usage) from tb_container_cpu_usage_seconds_total where collect_time >= :s1 and collect_time <= :e1 or collect_time >= :s2 and collect_time <= :e2 group by appcode """ ), {"s1": s1, "e1": e1, "s2": s2, "e2": e2}) LOG.info(f'stats_low_time_cpu: {rows.rowcount}') return self.rows_as_dicts(rows.cursor)
@statsd_index('hpa_report.sendmail') @commit_on_success def send_report_form(self, day): try: start = datetime.datetime.combine(day, datetime.time(0,0,0)) end = datetime.datetime.combine(day, datetime.time(23,59,59)) q = HpaReportModel.query.filter( HpaReportModel.record_time >= start, HpaReportModel.record_time <= end ).order_by( HpaReportModel.uc.desc(), HpaReportModel.dc.desc(), HpaReportModel.maxuc.desc(), HpaReportModel.mindc.desc() ) count = q.count() day_data = q.all() cell = "" if count > 0: for stat in day_data: cell += f""" <tr> <td>{stat.appcode}</td> <td>{stat.env_name}</td> <td>{stat.min_replicas}</td> <td>{stat.max_replicas}</td> <td>{stat.cpu}</td> <td>{stat.mem}</td> <td>{stat.cname}:{stat.cval}</td> <td>{stat.uc}</td> <td>{stat.dc}</td> <td>{stat.maxuc}</td> <td>{stat.mindc}</td> <td>{round(stat.hpods, 2)}</td> <td>{round(stat.hcpu_mean, 2)}%</td> <td>{round(stat.lpods, 2)}</td> <td>{round(stat.lcpu_mean, 2)}%</td> </tr>""" content = f""" <div> <h2>{day} 00:00:00至23:59:59</h2> <h3>高峰(08:00-23:00), 低锋(23:00-08:00)</h3> <table border='1' cellpadding='1' cellspacing='0'> <tr> <th>Appcode</th> <th>环境</th> <th>最小副本数</th> <th>最大副本数</th> <th>CPU扩容阈值</th> <th>内存扩容阈值</th> <th>自定义扩容阈值</th> <th>扩容次数</th> <th>缩容次数</th> <th>最大副本数次数</th> <th>最小副本数次数</th> <th>高峰副本数</th> <th>高峰CPU平均使用率</th> <th>低锋副本数</th> <th>低锋CPU平均使用率</th> </tr> {cell} </table> </div><br><br>""" SendMail.send_mail( CONF.notice_user.users.split(','), "HPA阔缩容次数及CPU使用率相关统计", content) SendQtalkMsg.send_msg(CONF.notice_user.users.split(','), 'HPA阔缩容次数及CPU使用率相关统计错误报表发送完成') except Exception as ex: sentry_sdk.capture_exception() SendQtalkMsg.send_msg(['haicheng.bi'], f'HPA阔缩容次数及CPU使用率相关统计错误: {ex}')
@try_catch_db_exception @commit_on_success @statsd_index('hpa_report.save_stats_result') def save_stats_result(self, day): """保存HPA和cpu统计的结果
:param day date 统计日期 """ if not isinstance(day, datetime.date): raise Error(f"param day is invalid type, we need datetime.date type.") LOG.info(f'save_stats_result: {day}') hpa_start = datetime.datetime.combine(day, datetime.time(0,0,0)) hpa_end = datetime.datetime.combine(day, datetime.time(23,59,59)) hpa_stats_rows = self.stats_hpa_updown(hpa_start, hpa_end) # 08-23 h_start = datetime.datetime.combine(day, datetime.time(8,0,0)) h_end = datetime.datetime.combine(day, datetime.time(22,59,59)) # 23-00, 00-08 l_s1 = datetime.datetime.combine(day, datetime.time(23,0,0)) l_e1 = datetime.datetime.combine(day, datetime.time(23,59,59)) l_s2 = datetime.datetime.combine(day, datetime.time(0,0,0)) l_e2 = datetime.datetime.combine(day, datetime.time(7,59,59)) hcpu_stats_rows = self.stats_high_time_cpu(h_start, h_end) lcpu_stats_rows = self.stats_low_time_cpu(l_s1,l_e1, l_s2, l_e2) hpods_stats_rows = self.stats_high_time_pods(h_start, h_end) lpods_stats_rows = self.stats_low_time_pods(l_s1,l_e1, l_s2, l_e2) cpus_rows = {} pods_rows = {} report_rows = {} for row in hpods_stats_rows: appcode = row.get('appcode', '') pods_rows[appcode] = { 'appcode': appcode, 'hpods': row.get('pods', 0) } for row in lpods_stats_rows: appcode = row.get('appcode', '') pods_rows[appcode].update({ 'appcode': appcode, 'lpods': row.get('pods', 0) }) for row in hcpu_stats_rows: appcode = row.get('appcode', '') cpus_rows[appcode] = { 'appcode': appcode, 'hcpu_p50': row.get('p50', 0), 'hcpu_p90': row.get('p90', 0), 'hcpu_p99': row.get('p99', 0), 'hcpu_mean': row.get('avg', 0), } for row in lcpu_stats_rows: appcode = row.get('appcode', '') cpus_rows[appcode].update({ 'lcpu_p50': row.get('p50', 0), 'lcpu_p90': row.get('p90', 0), 'lcpu_p99': row.get('p99', 0), 'lcpu_mean': row.get('avg', 0), }) for row in hpa_stats_rows: appcode = row.get('appcode', '') env_name = row.get('env', '') hpa = row.get('hpa') if not hpa: continue report_rows[f'{appcode}-{env_name}'] = { 'appcode': appcode, 'env_name': env_name, 'uc': row.get('uc', 0), 'dc': row.get('dc', 0), 'maxuc': row.get('maxuc', 0), 'mindc': row.get('mindc', 0), 'cpu': int(row.get('hpa', {}).get('cpuTargetUtilization', 0)), 'mem': int(row.get('hpa', {}).get('memoryTargetValue', 0)), 'cname': row.get('hpa', {}).get('customName', ''), 'cval': int(row.get('hpa', {}).get('customTargetValue', 0)), 'min_replicas': row.get('min_replicas', 0), 'max_replicas': row.get('max_replicas', 0), } report_rows[f'{appcode}-{env_name}'].update(cpus_rows.get(appcode, {})) report_rows[f'{appcode}-{env_name}'].update(pods_rows.get(appcode, {})) HpaReportModel.query.filter( HpaReportModel.record_time == day ).delete() for value in report_rows.values(): model = HpaReportModel(record_time=day, **value) db.session.add(model)
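The scheduling itself is not shown above; one plausible wiring, assuming APScheduler is available (illustrative only, not part of the original code):

# Hypothetical wiring of the daily job; assumes APScheduler.
import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

def run_daily_report():
    # Compute and mail yesterday's statistics.
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    report = HpaReport()
    report.save_stats_result(yesterday)
    report.send_report_form(yesterday)

scheduler = BlockingScheduler()
scheduler.add_job(run_daily_report, 'cron', hour=1)  # run every day at 01:00
scheduler.start()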

  2. The results are shown below:

  3. Once the statistics are complete, the report is sent out by email

VI. Result Verification

  • Data is complete (it covers every application that has HPA enabled)

  • Scaling counts are accurate (scale-up and scale-down counts match what actually happened)

  • CPU usage is accurate (peak and off-peak)

  • Pod counts are accurate (peak and off-peak)

Taking t_where_go as an example, the checks are as follows:
1. Confirm the replica counts

2. Confirm the CPU usage and pod counts

3. Confirm data completeness

    a. Check that the appcode/env of every application with HPA enabled matches the appcode/env in the HPA configuration table:
    select A.appcode, A.deployment_base, M.appcode, M.deployment_base
    from tb_k8s_appcode_hpa A
    left join (
        select deployment_base, appcode
        from tb_k8s_hpa_rec
        where dep_status != 1
        group by appcode, deployment_base
    ) M on A.appcode = M.appcode and A.deployment_base = M.deployment_base;

    b. Check that the number of rows in the scaling report matches the number of records that have HPA enabled and not temporarily disabled:

    select count(*)
    from (
        select appcode, deployment_base
        from tb_k8s_hpa_xxx
        where status = 0 and dep_status = 0
        group by appcode, deployment_base
    ) M;

    select count(*) from tb_hpa_report_form where record_time = '2022-06-10';

VII. Summary

Statistics work places high demands on data accuracy. How to collect the data, how to clean it, and how to assemble it all need careful thought; getting the raw data accurate and concise is extremely important. Ideally, you should decide before development which data to collect and how it will be used later; only then can you build better software.
With the data above we can answer the question posed at the beginning: HPA does not necessarily reduce resource usage. It must be configured sensibly; if it is misconfigured, for example with a minimum replica count higher than the number of replicas actually needed, no scaling will happen at all.
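With the report table in place, the tuning rules from Section I can be applied mechanically. Below is a minimal sketch over one daily report row; the dict keys follow tb_hpa_report_xxx, and the 10% off-peak CPU cutoff is an assumed threshold:

# Sketch of the tuning heuristics; field names follow tb_hpa_report_xxx,
# and the 10% off-peak CPU cutoff is a made-up threshold.
def suggest_hpa_tuning(row):
    suggestions = []
    if round(row['lpods']) <= row['min_replicas'] and row['lcpu_mean'] < 10:
        # Pinned at the floor off-peak with little CPU used.
        suggestions.append('consider lowering min_replicas')
    if round(row['hpods']) >= row['max_replicas']:
        # Pinned at the ceiling during peak hours.
        suggestions.append('consider raising max_replicas')
    return suggestions or ['HPA bounds look reasonable for this day']

print(suggest_hpa_tuning({'lpods': 3.0, 'min_replicas': 3, 'lcpu_mean': 4.2,
                          'hpods': 11.5, 'max_replicas': 20}))
# -> ['consider lowering min_replicas']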

