Istio 运维实战系列(3):让人头大的『无头服务』-下
失败的 Eureka 心跳通知
备注:Eureka Server 采用心跳机制来判定服务的健康状态。服务提供者在启动后,周期性(默认30秒)向Eureka Server发送心跳,以证明当前服务是可用状态。Eureka Server在一定的时间(默认90秒)未收到客户端的心跳,则认为服务宕机,注销该实例。
2020-09-24 13:32:46.533 ERROR 1 --- [tbeatExecutor-0] com.netflix.discovery.DiscoveryClient : DiscoveryClient_EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client - was unable to send heartbeat!
com.netflix.discovery.shared.transport.TransportException: Cannot execute request on any known server
at com.netflix.discovery.shared.transport.decorator.RetryableEurekaHttpClient.execute(RetryableEurekaHttpClient.java:112) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.sendHeartBeat(EurekaHttpClientDecorator.java:89) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator$3.execute(EurekaHttpClientDecorator.java:92) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.shared.transport.decorator.SessionedEurekaHttpClient.execute(SessionedEurekaHttpClient.java:77) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.sendHeartBeat(EurekaHttpClientDecorator.java:89) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.DiscoveryClient.renew(DiscoveryClient.java:864) ~[eureka-client-1.9.13.jar!/:1.9.13]
at com.netflix.discovery.DiscoveryClient$HeartbeatThread.run(DiscoveryClient.java:1423) ~[eureka-client-1.9.13.jar!/:1.9.13]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:832) ~[na:na]
过期的 IP 地址
k logs -f eureka-client-66f748f84f-vvvmz -c eureka-client -n eureka
[2020-09-24T13:31:37.980Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 503 UF,URX "-" "-" 0 91 3037 - "-" "Java-EurekaClient/v1.9.13" "1cd54507-3f93-4ff3-a93e-35ead11da70f" "eureka-server:8761" "172.16.0.198:8761" outbound|8761||eureka-server.eureka.svc.cluster.local - 172.16.0.198:8761 172.16.0.169:53890 - default
HUABINGZHAO-MB0:eureka-istio-test huabingzhao$ k get svc -n eureka -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
eureka-server ClusterIP None <none> 8761/TCP 17m app=eureka-server
HUABINGZHAO-MB0:eureka-istio-test huabingzhao$ k get pod -n eureka -o wide | grep eureka-server
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
eureka-server-0 1/1 Running 0 6h55m 172.16.0.59 10.0.0.15 <none> <none>
eureka-server-1 1/1 Running 0 6m1s 172.16.0.200 10.0.0.7 <none> <none>
eureka-server-2 1/1 Running 0 6h56m 172.16.1.3 10.0.0.14 <none> <none>
{
"version_info": "2020-09-23T03:57:03Z/27",
"cluster": {
"@type": "type.googleapis.com/envoy.api.v2.Cluster",
"name": "outbound|8761||eureka-server.eureka.svc.cluster.local",
"type": "ORIGINAL_DST", # 该选项表明 Enovy 在转发请求时会直接采用 downstream 原始请求中的地址。
"connect_timeout": "1s",
"lb_policy": "CLUSTER_PROVIDED",
...
}
In these cases requests routed to an original destination cluster are forwarded to upstream hosts as addressed by the redirection metadata, without any explicit host configuration or upstream host discovery.
Client 查询 DNS 得到 eureka-server 的三个IP地址。
Client 选择 Server-1 的 IP 172.16.0.198 发起连接请求,请求被 iptables rules 拦截并重定向到了客户端 Pod 中 Envoy 的 VirtualInbound 端口 15001。
在收到 Client 的连接请求后,根据 Cluster 的配置,Envoy 采用请求中的原始目的地址 172.16.0.198 连接 Server-1,此时该 IP 对应的 Pod 是存在的,因此 Envoy 到 Server-1 的链接创建成功,Client 和 Envoy 之间的链接也会建立成功。Client 在创建链接时采用了 HTTP Keep Alive 选项,因此 Client 会一直保持该链接,并通过该链接以 30 秒间隔持续发送 HTTP PUT 服务心跳通知。
由于某些原因,该 Server-1 Pod 被 K8s 重建为 Server-1ꞌ,IP 发生了变化。
当 Server-1 的 IP 变化后,Envoy 并不会立即主动断开和 Client 端的链接。此时从 Client 的角度来看,到 172.16.0.198 的 TCP 链接依然是正常的,因此 Client 会继续使用该链接发送 HTTP 请求。同时由于 Cluster 类型为 “ORIGINAL_DST” ,Envoy 会继续尝试连接 Client 请求中的原始目的地址 172.16.0.198,如图中蓝色箭头所示。但是由于该 IP 上的 Pod 已经被销毁,Envoy 会连接失败,并在失败后向 Client 端返回一个这样的错误信息:“upstream connect error or disconnect/reset before headers. reset reason: connection failure HTTP/1.1 503” 。如果 Client 在收到该错误后不立即断开并重建链接,那么直到该链接超时之前,Client 都不会重新查询 DNS 获取到 Pod 重建后的正确地址。
为 Headless Service 启用 EDS
{
"version_info": "2020-09-23T03:02:01Z/2",
"cluster": {
"@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
"name": "outbound|8080||http-server.default.svc.cluster.local",
"type": "EDS", # 普通服务采用 EDS 服务发现,根据 LB 算法从 EDS 下发的 endpoint 中选择一个进行连接
"eds_cluster_config": {
"eds_config": {
"ads": {},
"resource_api_version": "V3"
},
"service_name": "outbound|8080||http-server.default.svc.cluster.local"
},
...
}
func convertResolution(proxy *model.Proxy, service *model.Service) cluster.Cluster_DiscoveryType {
switch service.Resolution {
case model.ClientSideLB:
return cluster.Cluster_EDS
case model.DNSLB:
return cluster.Cluster_STRICT_DNS
case model.Passthrough: // Headless Service 的取值为 model.Passthrough
if proxy.Type == model.SidecarProxy {
// 对于 Sidecar Proxy,如果 PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES 的值设为 True,则启用 EDS,否则采用 ORIGINAL_DST
if service.Attributes.ServiceRegistry == string(serviceregistry.Kubernetes) && features.EnableEDSForHeadless {
return cluster.Cluster_EDS
}
return cluster.Cluster_ORIGINAL_DST
}
return cluster.Cluster_EDS
default:
return cluster.Cluster_EDS
}
}
[2020-09-24T13:35:28.790Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" "-" 0 0 4 4 "-" "Java-EurekaClient/v1.9.13" "d98fd3ab-778d-42d4-a361-d27c2491eff0" "eureka-server:8761" "172.16.1.3:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:39934 172.16.0.198:8761 172.16.0.169:53890 - default
[2020-09-24T13:35:58.797Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" "-" 0 0 1 1 "-" "Java-EurekaClient/v1.9.13" "7799a9a0-06a6-44bc-99f1-a928d8576b7c" "eureka-server:8761" "172.16.0.59:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:45582 172.16.0.198:8761 172.16.0.169:53890 - default
[2020-09-24T13:36:28.801Z] "PUT /eureka/apps/EUREKA-TEST-CLIENT/eureka-client-544b94f967-gcx2f:eureka-test-client?status=UP&lastDirtyTimestamp=1600954114925 HTTP/1.1" 200 - "-" "-" 0 0 2 1 "-" "Java-EurekaClient/v1.9.13" "aefb383f-a86d-4c96-845c-99d6927c722e" "eureka-server:8761" "172.16.0.200:8761" outbound|8761||eureka-server.eureka.svc.cluster.local 172.16.0.169:60794 172.16.0.198:8761 172.16.0.169:53890 - default
神秘消失的服务
2020-09-24 14:07:35.511 WARN 6 --- [eureka-server-3] c.netflix.eureka.cluster.PeerEurekaNode : EUREKA-SERVER-2/eureka-server-2.eureka-server.eureka.svc.cluster.local:eureka-server-2:8761:Heartbeat@eureka-server-0.eureka-server: missing entry.
2020-09-24 14:07:35.511 WARN 6 --- [eureka-server-3] c.netflix.eureka.cluster.PeerEurekaNode : EUREKA-SERVER-2/eureka-server-2.eureka-server.eureka.svc.cluster.local:eureka-server-2:8761:Heartbeat@eureka-server-0.eureka-server: cannot find instance
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: eureka-server
spec:
selector:
matchLabels:
app: eureka-server
serviceName: "eureka-server"
replicas: 3
template:
metadata:
labels:
app: eureka-server
annotations:
sidecar.istio.io/inject: "false" # 不为 eureka-server pod 注入 Envoy Siedecar
spec:
containers:
- name: eureka-server
image: zhaohuabing/eureka-test-service:latest
ports:
- containerPort: 8761
name: http
反思
参考文档
All about ISTIO-PROXY 5xx Issues:https://medium.com/expedia-group-tech/all-about-istio-proxy-5xx-issues-e0221b29e692
Service Discovery: Eureka Server:https://cloud.spring.io/spring-cloud-netflix/multi/multi_spring-cloud-eureka-server.html
Istio 运维实战系列(2):让人头大的『无头服务』-上:https://mp.weixin.qq.com/s/67snR00h4oJCo0XVnTE4nQ
Eureka 心跳通知问题测试源码:https://github.com/zhaohuabing/eureka-istio-test