Server configuration
I did some research on how to configure our web servers.
Goals
I identified three properties we should try to maximize:
- Availability -- Maximize uptime and ability to accept requests.
- Responsiveness -- Minimize waiting time for clients.
- Correctness -- Minimize error responses and timeouts.
Properties such as scalability, resource efficiency, etc. are secondary to these three.
The scope of this ticket is the server-level configuration, so things like hardware failover, backups, network connectivity, etc. are not considered here.
Production server hardware
- mutalyzer.nl: 2 cores, 4G memory
- test.mutalyzer.nl: 2 cores, 2G memory
Server architecture
See the contents of this repository, in summary:
- nginx as reverse proxy for Gunicorn and for serving static files.
- Supervisor for process control.
- Gunicorn as WSGI HTTP server.
- PostgreSQL for the database.
- Redis for stat counters and other in-memory stores.
Three applications are served by nginx, each reverse-proxied to its own Gunicorn server:
- Website
- HTTP/RPC+JSON webservice
- SOAP webservice
In addition to these there is the batch processor.
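For illustration, the nginx side of this setup roughly looks like the sketch below. The socket paths, server names and URL prefixes are made up for this example and don't necessarily match the actual repository contents.

```nginx
# Each application runs in its own Gunicorn instance behind a Unix socket.
upstream website      { server unix:/run/gunicorn/website.sock; }
upstream json_service { server unix:/run/gunicorn/json.sock; }
upstream soap_service { server unix:/run/gunicorn/soap.sock; }

server {
    listen 80;
    server_name mutalyzer.nl;

    # Static files are served directly by nginx.
    location /static/ {
        alias /opt/mutalyzer/static/;
    }

    # Everything else is reverse-proxied to the Gunicorn servers.
    location /          { proxy_pass http://website; }
    location /json/     { proxy_pass http://json_service; }
    location /services/ { proxy_pass http://soap_service; }
}
```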
Current production server configuration
nginx
- Configuration
- `proxy_read_timeout` is 10 minutes for all services.
Gunicorn
- Configuration
- `timeout` is 10 minutes for all services.
- `workers` is 2 (website), 1 (JSON webservice), 1 (SOAP webservice).
- `worker_class` is `sync` (default).
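To make that concrete, the website's Gunicorn settings currently come down to something like this (the exact file layout in the repository may differ, but the values are the ones listed above):

```python
# Gunicorn configuration (Python config file) for the website service.
workers = 2            # 1 for the JSON and SOAP webservices
worker_class = 'sync'  # the Gunicorn default
timeout = 600          # 10 minutes, matching the nginx proxy_read_timeout
```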
Considerations
- My wish that requests should never take more than 0.5 seconds is actually not realistic, since many requests depend on external services (NCBI) and can take several seconds. This is idle time, though; a long-running extractor job would be 100% CPU time.
- Gunicorn docs recommend not setting `timeout` noticeably higher than 30 seconds (the default).
- The nginx `proxy_read_timeout` default is 60 seconds.
- NCBI communication is the main reason our timeouts are set to 10 minutes. However, I think this is way too high, especially for the Gunicorn `timeout`. If we don't have a response in, say, 30 seconds, do we really want to wait longer? Unfortunately I can't find what exact problem led me to set it to 10 minutes at the time (the server config wasn't in Git yet).
- Gunicorn docs recommend using the default `sync` workers for CPU-bound applications and `eventlet` or `gevent` workers for applications making calls to external webservices.
Discussion
Ideally Mutalyzer would be mainly CPU bound and all requests could be handled on the order of microseconds. This would fit the Gunicorn `sync` workers quite well.
Another advantage of the `sync` workers is that the timeout is per request, and it will just kill the entire worker process (and start a new one). Just the protection we need to kill long-running jobs. I don't think this will work similarly with async workers.
Unfortunately, some of our requests call external webservices and can easily take multiple seconds, blocking the `sync` worker (and we only have one or two). When all workers are busy, new requests stack up in nginx, which is why we need a high `proxy_read_timeout` to prevent nginx from dropping these requests before they get handled. But even if they are not dropped, all clients have to wait if there are only a few of these long blocking jobs in front of them.
So our main problem is that communication with the NCBI works really badly with the `sync` workers (and, by the same synchronous design, so does the batch processor).
I see two solutions to this problem:
- Strive to minimize request processing time. Don't do anything that takes more than ~ 0.1 seconds. This means NCBI communication cannot happen in the HTTP request processing and should be done asynchronously. Same for extraction jobs.
- Switch to async workers.
Option 1 requires a lot of work. I could imagine the name checker doing its job normally when the reference sequence is in the cache, but spawning an asynchronous download job if it isn't and returning immediately to some polling page. For the webservices this would require additional client logic.
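A very rough sketch of that idea (all names here are hypothetical, nothing like this exists in Mutalyzer yet): the request either does the check directly, or only schedules the NCBI download and returns something the client can poll.

```python
# Hypothetical sketch of option 1: fast path when the reference sequence is
# cached, otherwise start a background download and return immediately.
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}   # job id -> Future for a running download
cache = {}  # accession -> reference sequence


def download_reference(accession):
    """Placeholder for the slow NCBI fetch (can take several seconds)."""
    return ''  # would return the downloaded reference sequence


def check_name(description, accession):
    """Handle a name check request without ever blocking on the NCBI."""
    if accession in cache:
        # Fast path: pure CPU work, fine for the sync workers.
        return {'status': 'done', 'result': 'normal name check output'}
    # Slow path: schedule the download and return right away.
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(download_reference, accession)
    return {'status': 'pending', 'job': job_id}


def poll(job_id):
    """Polling endpoint: is the download finished yet?"""
    return {'status': 'done' if jobs[job_id].done() else 'pending'}
```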
I don't know how well option 2 works if not all jobs are long blocking (and the majority of ours probably aren't). Although the Gunicorn docs mention that most applications will work without change on the async workers, I'm still slightly worried about the monkey patching done by Gunicorn in this mode.
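For reference, the patching the gevent worker relies on amounts to roughly the following; the open question is whether all of our code (and the C extensions) behaves correctly under it.

```python
# Roughly what Gunicorn's gevent worker does at startup: monkey patch the
# standard library so blocking I/O yields to other greenlets instead of
# blocking the whole worker process. Requires gevent to be installed.
from gevent import monkey
monkey.patch_all()

import socket  # from here on this is the cooperative, patched version

# Our NCBI calls would still look synchronous, but while one request waits on
# the network, other requests in the same worker can make progress.
```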
Suggestions
In any case, we can safely set the Gunicorn `timeout` to 30 seconds (the default).
Option 1:
- Increase the number of Gunicorn workers.
- Accept somewhat longer sequences for the description extractor (because we have more workers).
- Decrease the nginx `proxy_read_timeout` a bit.
- Long term: try to work towards solution 1 mentioned above.
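For the website this could come down to something like the following Gunicorn settings (the numbers are suggestions, not measured values); the `proxy_read_timeout` change itself lives in the nginx config, e.g. lowering it to a minute or two.

```python
# Option 1 sketch: stay on sync workers, just more of them, with the timeout
# back at the Gunicorn default. Numbers are illustrative.
workers = 4            # up from 2 for the website
worker_class = 'sync'
timeout = 30
```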
Option 2:
- Switch to one of the Gunicorn async workers (`eventlet` or `gevent`).
- Increase the number of Gunicorn workers.
- Accept somewhat longer sequences for the description extractor (they still block the worker because they are in C++ land, but we have more workers).
- Reset the nginx `proxy_read_timeout` to the 60 seconds default (because long blocking requests don't block the worker).
- Medium term: implement a task queue for the extractor and accept longer sequences.
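Sketch of the corresponding Gunicorn settings (again, the numbers are illustrative, not tested):

```python
# Option 2 sketch: async workers, so requests waiting on the NCBI no longer
# block a worker. Requires gevent (or eventlet) to be installed.
workers = 2
worker_class = 'gevent'   # or 'eventlet'
worker_connections = 100  # max simultaneous clients per worker (Gunicorn default: 1000)
timeout = 30
```

In nginx, `proxy_read_timeout` could then simply be dropped from our config, since 60 seconds is its default.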
I suggest we have a trial period for option 2, but we should think about how to monitor server performance so that we can make an informed decision.
Server monitoring
Today I already changed the nginx access log format to include the total request processing time. Based on that we can at least get an idea of the processing time distribution.
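Concretely, this comes down to including nginx's `$request_time` (and optionally `$upstream_response_time`) variable in the log format; the sketch below is an example, not necessarily the exact format string now in our config.

```nginx
# Access log format with the total request processing time ($request_time)
# and the time spent waiting on the upstream Gunicorn server.
log_format timed '$remote_addr [$time_local] "$request" $status '
                 '$body_bytes_sent $request_time $upstream_response_time';

access_log /var/log/nginx/access.log timed;
```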
Ideally I would like to set up something like an ELK stack (Elasticsearch, Logstash, Kibana) so we can have live monitoring of all kinds of metrics like request processing time, error rates, etc.
nginx rate limiting
Apparently nginx has rate limiting capabilities built in. I would really like to explore this, since it would apply really well to our webservices. And I think it's much better to implement this on the webserver level than on the application level (and therefore in the server config rather than in the Mutalyzer code).
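The built-in mechanism is the `limit_req` module; applied to, say, a webservice location it could look like the sketch below (zone name, rate, burst, and the location/upstream names are placeholders we would have to tune).

```nginx
# Shared zone keyed on client IP, allowing on average 5 requests per second.
limit_req_zone $binary_remote_addr zone=webservice:10m rate=5r/s;

server {
    # ... existing server configuration ...

    location /services/ {
        # Allow short bursts; requests beyond that are rejected (503 by
        # default, configurable with limit_req_status).
        limit_req zone=webservice burst=10 nodelay;
        proxy_pass http://soap_service;
    }
}
```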
@j.f.j.laros Do you have any thoughts on this?