EOSIO API Healthcheck for HAProxy

Introduction

If you run any public infrastructure, you will quickly reach a point where you need a load balancer to handle increased load and distribute traffic across multiple backend machines. HAProxy is a powerful, reliable and commonly used solution. Putting multiple backend machines behind the same endpoint can increase capacity and reliability as well as improve the user experience. This is because the traffic is spread across the backend machines, which also allows you to take one down for maintenance without any major disturbance for the end users.

We use HAProxy for both our internal and external infrastructure. This is true for P2P, SHIP and API nodes, where we also split the traffic by type and send it to different nodes that are each optimized for a single task. If our API traffic increases, we can easily spin up another node in the background, and we can apply different rate limits to different types of requests.

However, one potential problem with running both EOSIO nodes and history solutions is that HAProxy needs to know whether the EOSIO node is synced or lagging. This is a must for clients to get reliable data. For this, we have built a healthcheck that marks nodes as UP or DOWN depending on their current state.

When you set up HAProxy, you will likely need a more detailed and advanced config, but for this article, and for you to get started, we will provide a simple configuration that balances HTTP requests to showcase the healthcheck.

Basic config and guide

To install our healthcheck, you can use our APT repo or check out the source code on GitHub.
Both options have a short guide on how to do that.

frontend myfrontend
mode http
bind 127.0.0.1:80
default_backend myservers

backend myservers
server server1 10.8.8.2:8000 check
server server2 10.8.8.3:8000 check
server server3 10.8.8.4:8000 check

By providing check on the server lines, HAProxy performs an active health check to decide whether each server is healthy, so that no requests are routed to unhealthy servers. This means that HAProxy tries to open a TCP connection at regular intervals; if that connection fails, the server is marked as unhealthy.

However, this could be a problem when working with EOSIO nodes and history solutions. This is because even if a node responds, it might be lagging and unable to respond with the latest information. This means that even if the node is responsive, HAProxy should treat it as unhealthy.

One great thing about HAProxy is that it lets you create custom health checks for those cases. This is described in the next section.

HAProxy Custom Agent Check

HAProxy gives the user the ability to communicate with an external agent via TCP. In this case, the agent is a program that checks the EOSIO node(s). The custom agent check opens a TCP connection to the agent and inspects the response. This of course requires the agent software to accept TCP connections.

The program in question can do whatever checks are necessary (CPU load, disk space, etc.) and report back to HAProxy what action should be taken. You can find the HAProxy documentation here.

The response is just one line with basic ASCII text:

Response      Description
up\n          The server is healthy
down\n        The server is unhealthy
maint\n       The server is put into maintenance mode
ready\n       The server is taken out of maintenance mode
50%\n         The server’s weight is halved
maxconn:10\n  The server’s maximum number of connections is set to 10

HAProxy can also be configured to send a TCP message of its own when it connects to the agent program, using the agent-send parameter. This feature makes it possible to configure the agent program on a per-server basis.
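To make the exchange concrete, here is a minimal sketch of an agent in Go. It is not our actual healthcheck: the decide function is a hypothetical placeholder for the real health logic, and roundTrip is a helper that only simulates HAProxy over an in-memory connection. It reads the one line sent via agent-send and answers with a single status line.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// decide is a hypothetical placeholder for the real health logic.
// It reports "down" for an empty request and "up" otherwise.
func decide(request string) string {
	if request == "" {
		return "down"
	}
	return "up"
}

// handle reads the single line sent via agent-send and answers with
// one status line terminated by "\n", then closes the connection.
func handle(conn net.Conn) {
	defer conn.Close()
	// Do not let a silent client hold the connection open forever.
	conn.SetDeadline(time.Now().Add(2 * time.Second))
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return
	}
	fmt.Fprintf(conn, "%s\n", decide(strings.TrimSpace(line)))
}

// listenAndServe accepts agent connections on addr, e.g. "127.0.0.1:1337".
func listenAndServe(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go handle(conn)
	}
}

// roundTrip plays the role of HAProxy over an in-memory connection:
// it sends a newline-terminated request and returns the raw reply.
func roundTrip(request string) (string, error) {
	client, server := net.Pipe()
	defer client.Close()
	go handle(server)
	if _, err := client.Write([]byte(request)); err != nil {
		return "", err
	}
	return bufio.NewReader(client).ReadString('\n')
}

func main() {
	reply, err := roundTrip("v1|http://10.8.8.4:8081|15\n")
	fmt.Printf("%q %v\n", reply, err)
}
```

In a real deployment you would call listenAndServe with the address that agent-addr and agent-port point to, and replace decide with an actual check.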

The following parameters are supported in a request and are ordered from first to last below:

#  Name        Required               Description
1  api         Yes                    Type of API to check against: v1 = standard, v2 = Hyperion, contract = eosio-contract-api
2  url         Yes (port default 80)  HTTP URL of the API: http(s)://<ip-or-domain>(:<port>)
3  num_blocks  No (default 10)        Number of blocks the API can drift before it is reported down
4  host        No (default from url)  Value to send in the HTTP Host header to the API
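As an illustration of how such a request line could be split into its parameters, here is a small Go sketch. The request type and parseRequest function are hypothetical names, not taken from our source code, and the sketch assumes that the default host is the host part of the URL.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// request holds the parameters of one agent-send line (hypothetical type).
type request struct {
	API       string // "v1", "v2" or "contract"
	URL       string // base URL of the API
	NumBlocks int    // allowed drift in blocks, default 10
	Host      string // HTTP Host header, default taken from URL (assumption)
}

// parseRequest splits a line such as "v1|http://10.8.8.4:8081|15\n"
// into its fields and fills in the documented defaults.
func parseRequest(line string) (request, error) {
	fields := strings.Split(strings.TrimSpace(line), "|")
	if len(fields) < 2 || len(fields) > 4 {
		return request{}, fmt.Errorf("expected 2 to 4 fields, got %d", len(fields))
	}
	req := request{API: fields[0], URL: fields[1], NumBlocks: 10}
	switch req.API {
	case "v1", "v2", "contract":
	default:
		return request{}, fmt.Errorf("unknown api type %q", req.API)
	}
	u, err := url.Parse(req.URL)
	if err != nil || u.Host == "" {
		return request{}, fmt.Errorf("invalid url %q", req.URL)
	}
	req.Host = u.Host
	if len(fields) > 2 && fields[2] != "" {
		n, err := strconv.Atoi(fields[2])
		if err != nil {
			return request{}, fmt.Errorf("invalid num_blocks %q", fields[2])
		}
		req.NumBlocks = n
	}
	if len(fields) > 3 && fields[3] != "" {
		req.Host = fields[3]
	}
	return req, nil
}

func main() {
	req, err := parseRequest("v1|http://10.8.8.4:8081|15\n")
	fmt.Printf("%+v %v\n", req, err)
}
```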

EOSIO API Agent Check Program

This is where the fun begins. Using HAProxy's ability to run custom agent checks, we have created an agent check program that talks to an EOSIO API and, based on the response, signals the state of the API back to HAProxy.

Our EOSIO healthcheck is written in Go, and it does more than just check whether the API responds to HTTP requests. It also checks whether the API is lagging by performing an API call to get the state.

The source code can be found on GitHub, and you can install it through our APT repo.
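To give an idea of what such an API call can look like, here is a rough Go sketch that reads the head block time from the standard nodeos /v1/chain/get_info endpoint. The function names are hypothetical and this is not the code of our healthcheck. It assumes that head_block_time is a UTC timestamp without a timezone suffix, such as 2021-06-01T12:00:00.500.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// chainInfo holds the two get_info fields this sketch cares about.
type chainInfo struct {
	HeadBlockNum  int64  `json:"head_block_num"`
	HeadBlockTime string `json:"head_block_time"`
}

// parseHeadBlockTime extracts the head block time from a get_info body.
// time.Parse returns UTC when the layout has no zone, and it accepts the
// fractional seconds even though the layout does not list them.
func parseHeadBlockTime(body []byte) (time.Time, error) {
	var info chainInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return time.Time{}, err
	}
	return time.Parse("2006-01-02T15:04:05", info.HeadBlockTime)
}

// fetchHeadBlockTime asks the API at baseURL for its current state.
func fetchHeadBlockTime(baseURL string) (time.Time, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/v1/chain/get_info")
	if err != nil {
		return time.Time{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return time.Time{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return time.Time{}, err
	}
	return parseHeadBlockTime(body)
}

func main() {
	// Parse a canned response so the example runs without a node.
	sample := []byte(`{"head_block_num":42,"head_block_time":"2021-06-01T12:00:00.500"}`)
	head, err := parseHeadBlockTime(sample)
	fmt.Println(head, err)
}
```

Against a live node you would call fetchHeadBlockTime with the URL from the agent-send string, for example http://10.8.8.4:8081.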

Example

Given this example configuration:

server 10.8.8.4 ... check agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v1|http://10.8.8.4:8081|15\n"

The communication between HAProxy, the agent check program and the EOSIO node is displayed in the diagram below:

The most interesting bit in the above diagram is the algorithm that the agent program uses to figure out whether the API should be marked as down or up. The third parameter in agent-send "v1|http://10.8.8.4:8081|15\n" is the threshold num_blocks, which is used to decide if the node is lagging or not. To figure this out, we calculate the time difference between the current time (on the healthcheck server) and the nodeos head block time. The argument is given as a number of blocks, but since a block should be produced every 0.5 seconds, it is easy to convert to seconds: S = num_blocks / 2

In this example, num_blocks is 15, so the threshold is 15 / 2 = 7.5 seconds. If the lag is greater than 7.5 seconds, the node is seen as lagging and we report to HAProxy that it should be considered down. If the time difference between the current time and the head block is less than -7.5 seconds, we also report it as down. This could occur if something strange is happening and the node is in the future.

If the time difference is between -7.5 and 7.5 seconds, the node is considered in sync and we report "up" to HAProxy.

The code below is pseudocode that serves as an example, and only the important parts are included. The real-life code can be found here.

func check_api(url string, block_time float64) {
    // Get info from the API.
    info := eosapi.GetInfo(url)
    // Calculate time difference from head block.
    now := time.Now().In(time.UTC)
    diff := now.Sub(info.HeadBlockTime).Seconds()

    if diff > block_time {
        // Taking offline because head block is behind.
        haproxy.send("DOWN")
    } else if diff < -block_time {
        // Taking offline because head block is in the future.
        haproxy.send("DOWN")
    } else {
        // Node is in sync.
        haproxy.send("UP")
    }
}
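For readers who want something they can run, the decision logic of the pseudocode can be isolated into a pure function. This is a sketch with hypothetical names (blocksToSeconds, status), not the real implementation. It assumes one block every 0.5 seconds, as described above, and returns the status word instead of sending it to HAProxy.

```go
package main

import (
	"fmt"
	"time"
)

// blocksToSeconds converts a num_blocks threshold to seconds,
// assuming one block is produced every 0.5 seconds.
func blocksToSeconds(numBlocks int) float64 {
	return float64(numBlocks) / 2
}

// status returns the word to report for a node whose head block is
// lagSeconds behind the current time. A negative lag means that the
// head block time lies in the future.
func status(lagSeconds float64, numBlocks int) string {
	limit := blocksToSeconds(numBlocks)
	if lagSeconds > limit || lagSeconds < -limit {
		return "down"
	}
	return "up"
}

func main() {
	head := time.Date(2021, 6, 1, 12, 0, 0, 0, time.UTC)
	now := head.Add(9 * time.Second)
	// 9 seconds is more than 15 / 2 = 7.5 seconds.
	fmt.Println(status(now.Sub(head).Seconds(), 15)) // prints "down"
}
```

Keeping the comparison free of network and clock access makes the thresholds easy to unit test.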

Hyperion

We host Hyperion history (v2), and for this the algorithm is slightly different. This is because the Hyperion API reports its information in a different JSON format and also reports different information. Here we have the last block indexed in Elasticsearch and the nodeos head block, both as block numbers, so we don't have to convert the num_blocks argument to seconds. Otherwise the algorithms are the same.
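A minimal Go sketch of this block-based comparison could look as follows. The function name hyperionStatus is hypothetical, and the two block numbers are assumed to have been read from the Hyperion API already.

```go
package main

import "fmt"

// hyperionStatus compares the nodeos head block with the last block
// indexed in Elasticsearch. Both values are block numbers, so the
// num_blocks threshold is used directly without converting to seconds.
func hyperionStatus(headBlock, lastIndexed, numBlocks int64) string {
	diff := headBlock - lastIndexed
	if diff > numBlocks || diff < -numBlocks {
		return "down"
	}
	return "up"
}

func main() {
	// The index is 20 blocks behind the head block with a threshold of 15.
	fmt.Println(hyperionStatus(1000, 980, 15)) // prints "down"
}
```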

Note that the version tag in the agent-send string is v2. The website will likely wrap the lines, especially if you read this on a small screen, but each server setting is a single line.

frontend myfrontend_v2
mode http
bind 127.0.0.1:80
default_backend myservers_v2

backend myservers_v2
server server1 10.8.8.2:9000 check agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v2|http://10.8.8.2:9000|15\n"
server server2 10.8.8.5:9000 check agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v2|http://10.8.8.5:9000|15\n"
server server3 10.8.8.4:9000 check agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v2|http://10.8.8.4:9000|15\n"

P2P

For P2P it's slightly different, since P2P traffic is forwarded to the P2P port of the EOSIO node, while our health check requests need to go through another port to be able to make API calls. For P2P we use the on-marked-down shutdown-sessions option, which closes existing connections to a node when it is marked down. This ensures that clients always connect to synced nodes. With HAProxy you can also set a custom maximum number of connections per server and/or backend, which is a good way to limit how many connections each backend server can receive.

frontend myfrontend_p2p
mode tcp
bind 127.0.0.1:9876
default_backend myservers_p2p

backend myservers_p2p
server server1 10.8.8.7:9876 check on-marked-down shutdown-sessions agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v1|http://10.8.8.7:9000|15\n"
server server2 10.8.8.8:9876 check on-marked-down shutdown-sessions agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v1|http://10.8.8.8:9000|15\n"
server server3 10.8.8.9:9876 check on-marked-down shutdown-sessions agent-check agent-addr 127.0.0.1 agent-port 1337 agent-send "v1|http://10.8.8.9:9000|15\n"

Summary

The EOSIO API Healthcheck for HAProxy is a great tool for anyone who wants to check the health of nodes on EOSIO-based blockchains. It lets you check how far a node is behind the head block, and if the lag is more than a configurable amount of time, the node is marked as down. If it later comes back in sync, it is automatically marked as up and available again.
