Background
Currently, the company's services are deployed with the Spring Cloud framework on k8s. Although services are upgraded with rolling updates, there is still a vacuum of 30 seconds to 1 minute while a new instance registers with Eureka, during which the online service is briefly inaccessible. The goal of this post is therefore to make service upgrades smooth enough that users do not notice them.
Cause Analysis
In a Spring Cloud setup, users generally access the gateway (Gateway or Zuul), which forwards requests to the internal services. Reaching an internal service through the gateway involves a process that generally looks like this: after a service starts, it first registers (reports) its registration information (service name -> ip:port) to the registry (Eureka); other services then poll the registry periodically (the default fetch interval is 30s) to obtain the latest service registration list from Eureka.
Then, if services are updated by k8s in a rolling-update fashion, the following situation may occur:
At time T, serverA_1 (the old instance) is down and serverA_2 (the new instance) has started and registered with Eureka, but the registration information of serverA_1 still exists in the registration list cached by the gateway. When a user accesses serverA, an exception is thrown because the container hosting serverA_1 has already been stopped.
Solution
1. Eureka parameter optimization
Client side
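A minimal sketch of the client-side tuning, using standard Spring Cloud Netflix Eureka/Ribbon properties; the concrete values are assumptions to adapt to your environment:

```yaml
eureka:
  client:
    # poll the registry more often than the 30s default
    registry-fetch-interval-seconds: 5
  instance:
    # send heartbeats more often than the 30s default
    lease-renewal-interval-in-seconds: 5
    # expire an instance sooner than the 90s default once heartbeats stop
    lease-expiration-duration-in-seconds: 15
ribbon:
  # refresh Ribbon's cached server list more often than the 30s default (in ms)
  ServerListRefreshInterval: 5000
```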
Server side
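And a corresponding server-side sketch, again with assumed values; self-preservation is commonly disabled in this scenario so that stale instances are actually evicted:

```yaml
eureka:
  server:
    # without self-preservation, instances that stop renewing are evicted promptly
    enable-self-preservation: false
    # run the eviction task more often than the 60s default (in ms)
    eviction-interval-timer-in-ms: 5000
    # refresh the registry response cache faster than the 30s default (in ms)
    response-cache-update-interval-ms: 5000
```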
The two optimizations above mainly shorten the time it takes for a service to go online and offline, by refreshing the cached service registration list on both the Eureka client side and the server side as quickly as possible.
2. Enable the retry mechanism on the gateway
Since we use the Zuul gateway, enable its retry mechanism to prevent requests from being forwarded to nodes that have been taken offline during a rolling update: a failed Zuul request will automatically be retried once against another available node.
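A sketch of the corresponding Zuul/Ribbon configuration (Spring Cloud Netflix Zuul also needs `spring-retry` on the classpath for retries to take effect):

```yaml
zuul:
  # enable retries for routes served through Ribbon
  retryable: true
ribbon:
  # no retries against the same node
  MaxAutoRetries: 0
  # retry once against a different node
  MaxAutoRetriesNextServer: 1
  # false (the default): only GET requests are retried
  OkToRetryOnAllOperations: false
```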
About the `OkToRetryOnAllOperations` property: its default value is false, in which case only GET requests are retried. If set to true, all request methods (GET, POST, PUT, DELETE, etc.) are retried, so the server side must guarantee the idempotence of its interfaces. If, say, a read timeout occurs on a non-idempotent interface, the retry may produce dirty data. This is a point that needs attention!
3. Actively remove services that are going down from the registry
Use the k8s PreStop container lifecycle hook to proactively remove a service from the registry before its container is stopped and terminated. Two types of callback handlers are available for containers (sketches of both follow the list):
- Exec - Executes a specific command in the container's cgroups and namespaces; the resources consumed by the command count towards the container's resource consumption. Also specify the grace period for k8s graceful termination, `terminationGracePeriodSeconds: 90`, and add a sleep in the command configuration as buffer time for the service to stop, so that requests already received are not cut off before processing completes. Here we use the Eureka client's own forced-offline interface. Note that this approach requires the service to include the `spring-boot-starter-actuator` component, to whitelist the `/actuator/service-registry` endpoint, and the base image to have the `curl` command installed.
- HTTP - Performs an HTTP request against a specific endpoint on the container.
With the HTTP approach, we need to actively remove the current service from the registry at the code level inside each service (see the sketch after this list).
Note that if the service has a black/white list, remember to add `/eureka/stop/client` to the whitelist, and if a service has a `context-path` set, the path needs that prefix, otherwise the call will be blocked and have no effect.
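For the Exec variant, a hedged sketch of the Deployment fragment: the management port 8081, the sleep length, and the container name are assumptions, and the `/actuator/service-registry` call follows the Spring Cloud Commons write operation (POST with a JSON status body):

```yaml
spec:
  terminationGracePeriodSeconds: 90
  containers:
    - name: server-a
      # ...
      lifecycle:
        preStop:
          exec:
            command:
              - sh
              - -c
              # deregister from Eureka, then sleep as a buffer for in-flight requests
              - >-
                curl -s -X POST http://localhost:8081/actuator/service-registry
                -H "Content-Type: application/json" -d '{"status":"DOWN"}';
                sleep 30
```

For the HTTP variant, a minimal sketch of the in-service endpoint; the `/eureka/stop/client` path comes from the text above, while the controller itself is a hypothetical illustration (k8s `httpGet` hooks issue GET requests, hence `@GetMapping`):

```java
import com.netflix.discovery.EurekaClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class EurekaOfflineController {

    @Autowired
    private EurekaClient eurekaClient;

    // called by the k8s preStop httpGet hook before the container is stopped
    @GetMapping("/eureka/stop/client")
    public String offline() {
        // shuts the Eureka client down, which deregisters this instance
        eurekaClient.shutdown();
        return "instance deregistered from eureka";
    }
}
```

with the matching hook in the Deployment (the application port is an assumption):

```yaml
lifecycle:
  preStop:
    httpGet:
      path: /eureka/stop/client
      port: 8080
```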
4. Delay the first probe time of the readiness probe
Add a readinessProbe and a livenessProbe to the service's k8s Deployment configuration file; a sketch follows. But what is the difference between these two probes?
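A sketch of the probe section of such a Deployment, reconstructed from the values discussed below (port `8081`, `/actuator/health`, `periodSeconds: 10`, `initialDelaySeconds: 30`); the surrounding fields and the liveness delay are assumptions:

```yaml
containers:
  - name: server-a
    # ...
    readinessProbe:
      httpGet:
        path: /actuator/health
        port: 8081
      initialDelaySeconds: 30   # delay before the first readiness check
      periodSeconds: 10         # probe every 10s
    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8081
      initialDelaySeconds: 60   # should exceed the readiness delay (see below)
      periodSeconds: 10
```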
- LivenessProbe: checks, in the specified way, whether the application inside the container is running properly. If the check fails, the container is considered unhealthy, and `kubelet` decides whether the Pod should be restarted based on the `restartPolicy` set in the Pod. If no `livenessProbe` is configured for the container, `kubelet` assumes the liveness check always succeeds. The container started in the Pod above is a Spring Boot application that includes the `Actuator` component, which provides the `/actuator/health` health-check endpoint, so the liveness probe can use `httpGet` to request the `/actuator/health` path on port `8081` to make the liveness determination.
- ReadinessProbe: determines whether the application in the container has finished starting; only after the probe succeeds does the Pod accept network traffic from outside. On success the container's `Ready` state is set to `true`, on failure to `false`. For Pods managed by a Service, the association between the `Service` and the Pod's `Endpoint` is maintained based on whether the Pod is `Ready`: if a Pod leaves the `Ready` state, it is automatically removed from the `Endpoint` list associated with the `Service`, and when it becomes `Ready` again it is added back. This mechanism prevents traffic from being forwarded to an unavailable Pod. The `periodSeconds` parameter indicates how often the probe runs (set to 10s here), and `initialDelaySeconds` is the delay before the first probe; 30 here means the first check happens 30 seconds after the Pod starts. As with the liveness probe, `httpGet` is used to send a request to the health-check path, and a successful response means the service is ready. Configured this way, the new instance only starts receiving traffic once it is ready; k8s then brings down the old instance after about 30 seconds, by which time, with the Eureka configuration optimized, essentially all services have already fetched the new instance's registration information from Eureka.
In practice, the `initialDelaySeconds` of the `livenessProbe` should be greater than that of the `readinessProbe`; otherwise the Pod will never start properly: the Pod is not ready yet, so if the liveness probe fires first it will certainly fail, k8s will conclude the Pod is no longer alive, and will destroy and rebuild it.
5. Graceful shutdown to ensure in-progress business operations are not affected
First of all, let's clarify how the old Pod is taken offline: on a Linux system, `kill -15` (SIGTERM) is executed by default to notify the web application to stop, and finally the Pod is deleted. So what is meant by graceful shutdown, and what does it do? Simply put, after the stop signal is sent to the application process, it ensures that in-progress business operations are not affected: the application should stop accepting new requests, wait until already-received requests are processed and returned successfully, and only then actually stop. Spring Boot 2.3 supports graceful shutdown out of the box: when enabled with `server.shutdown=graceful`, the web server stops accepting new requests during shutdown and waits for a buffer period for active requests to complete. However, our company uses Spring Boot `2.1.5.RELEASE`, so we need some extra code to achieve a graceful shutdown; depending on the web container, there are `tomcat` and `undertow` solutions.
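For reference, on Spring Boot 2.3+ the built-in behavior is enabled purely by configuration (a minimal sketch; the timeout value is an assumption):

```yaml
server:
  shutdown: graceful
spring:
  lifecycle:
    # how long to wait for active requests before shutting down anyway
    timeout-per-shutdown-phase: 30s
```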
tomcat
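The original snippet is not preserved here, so below is a minimal sketch of the commonly used approach for Tomcat on Spring Boot 2.1: pause the connector on `ContextClosedEvent`, then wait for the worker pool to drain (the 30-second timeout is an assumption):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatConnectorCustomizer;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.servlet.server.ConfigurableServletWebServerFactory;
import org.springframework.context.ApplicationListener;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.ContextClosedEvent;

@Configuration
public class TomcatGracefulShutdownConfig {

    @Bean
    public GracefulShutdown gracefulShutdown() {
        return new GracefulShutdown();
    }

    @Bean
    public ConfigurableServletWebServerFactory webServerFactory(GracefulShutdown gracefulShutdown) {
        TomcatServletWebServerFactory factory = new TomcatServletWebServerFactory();
        factory.addConnectorCustomizers(gracefulShutdown);
        return factory;
    }

    static class GracefulShutdown implements TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {

        private volatile Connector connector;

        @Override
        public void customize(Connector connector) {
            this.connector = connector;
        }

        @Override
        public void onApplicationEvent(ContextClosedEvent event) {
            // stop accepting new requests
            this.connector.pause();
            Executor executor = this.connector.getProtocolHandler().getExecutor();
            if (executor instanceof ThreadPoolExecutor) {
                ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;
                threadPoolExecutor.shutdown();
                try {
                    // wait for in-flight requests to finish before the JVM exits
                    threadPoolExecutor.awaitTermination(30, TimeUnit.SECONDS);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}
```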
undertow
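Likewise for Undertow, a sketch based on Undertow's built-in `GracefulShutdownHandler`: wrap the handler chain, and on `ContextClosedEvent` stop accepting requests and await the in-flight ones (the 30-second wait is an assumption):

```java
import io.undertow.server.HandlerWrapper;
import io.undertow.server.HttpHandler;
import io.undertow.server.handlers.GracefulShutdownHandler;
import org.springframework.boot.web.embedded.undertow.UndertowServletWebServerFactory;
import org.springframework.context.ApplicationListener;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.ContextClosedEvent;

@Configuration
public class UndertowGracefulShutdownConfig {

    private volatile GracefulShutdownHandler shutdownHandler;

    @Bean
    public UndertowServletWebServerFactory undertowFactory() {
        UndertowServletWebServerFactory factory = new UndertowServletWebServerFactory();
        factory.addDeploymentInfoCustomizers(deploymentInfo ->
                // wrap the outermost handler so every request passes through it
                deploymentInfo.addOuterHandlerChainWrapper(new HandlerWrapper() {
                    @Override
                    public HttpHandler wrap(HttpHandler handler) {
                        shutdownHandler = new GracefulShutdownHandler(handler);
                        return shutdownHandler;
                    }
                }));
        return factory;
    }

    @Bean
    public ApplicationListener<ContextClosedEvent> undertowShutdownListener() {
        return event -> {
            if (shutdownHandler != null) {
                // reject new requests and wait for active ones to complete
                shutdownHandler.shutdown();
                try {
                    shutdownHandler.awaitShutdown(30_000);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }
        };
    }
}
```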
OK, after the above optimizations, rolling updates can basically be performed without users perceiving them.
Reference: https://blog.leeyom.top/#/posts/27