In this post I want to talk about how reliable messaging can be achieved. With reliable messaging I mean transporting the message safely from one service to another as well as executing the assigned logic in a safe way. This message handling should not corrupt any state and be consisted even in case of system and network failures.
First we have to choose a network protocol that ensures our messages are correctly transported over the network. Nowadays this is a real no-brainer, everyone is using TCP/IP for that, with a good reason.

As can be seen in the picture above every packet sent to the server is answered with a special ACK response. This indicates that the packet has been successfully received with all data intact. Additionally TCP implements features like automatic retry in case there is packet loss on the network, it ensures the correct packet order and so forth. No wonder it became a first choice when working on a widely distributed network like the Internet.
However, when used in a messaging system there is a little drawback. The ACK response does not carry any other information besides the successful transport indication. Unfortunately a great many times we need to return additional data back to client. The bare minimum is whether the unit of work as been executed successfully or not. This is not possible with a single TCP ACK representing only a successful message transport.
To overcome this problem often a protocol on top of TCP is used. One of the most famous request protocols in this regard is HTTP. The basic messaging pattern is shown below. The client sends a packet to server which has at least a HTTP header. If received successfully the packet is answered with the already known TCP ACK.

Next the server processes the request. This may result in reads or writes on a file system or a database. Once the server has finished processing the message a separate response is returned which contains a HTTP status code like 200 OK or 400 Bad Request. Together with this code additional data may be transported like a HTML, JSON or XML document. This pattern and processing is enough for a great many messaging applications out there.
Lets think about possible errors here and analyze whether this is reliable in case the network fails or a system crash. If the server crashes after successfully receiving the message, but before any work is done, no response will be sent back. The client will probably have some sort of timeout mechanism and can either wait or try again on the same or a different server. This is secure, we know the server has crashed and no harm is done, no state has changed. Now consider another scenario. The server receives the message, finishes a transaction on a data store and at the next moment the problem occurs right before a success can be sent back to the client. In most cases this is not a problem either. For example if it was just a query and no state has changed on the server side, it will not corrupt anything. The client retries the operation and if it succeeds the second time around everything is fine.
The problem starts if the work done on the server involves write operations or sending messages to other services. Depending on the message and the operation certain side effects may occur. Consider something like drawing money from an ATM. If the operation succeeds on the server and the client is not informed about the success it may rightfully assume a failure and try again resulting in 2 transactions. As a result the money will be booked twice from the account while the ATM hands it out only once. This is bad, we need a solution for this. Continue reading…






