Server configuration
I did some research on how to configure our web servers.
Goals
I identified three properties we should try to maximize:
- Availability -- Maximize uptime and ability to accept requests.
- Responsiveness -- Minimize waiting time for clients.
- Correctness -- Minimize error responses and timeouts.
Properties such as scalability, resource efficiency, etc. are secondary to these three.
The scope of this ticket is the server-level configuration, so things like hardware failover, backups, network connectivity, etc. are not considered here.
Production server hardware
- mutalyzer.nl: 2 cores, 4G memory
- test.mutalyzer.nl: 2 cores, 2G memory
Server architecture
See the contents of this repository, in summary:
- nginx as reverse proxy for Gunicorn and for serving static files.
- Supervisor for process control.
- Gunicorn as WSGI HTTP server.
- PostgreSQL for the database.
- Redis for stat counters and other in-memory stores.
Three applications are served by nginx, each reverse-proxied to its own Gunicorn server:
- Website
- HTTP/RPC+JSON webservice
- SOAP webservice
In addition to these there is the batch processor.
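For illustration, the nginx side of this setup roughly looks like the sketch below. The socket paths, server names and URL prefixes are made up for this example and don't necessarily match the actual repository contents.

```nginx
# Each application runs in its own Gunicorn instance behind a Unix socket.
upstream website      { server unix:/run/gunicorn/website.sock; }
upstream json_service { server unix:/run/gunicorn/json.sock; }
upstream soap_service { server unix:/run/gunicorn/soap.sock; }

server {
    listen 80;
    server_name mutalyzer.nl;

    # Static files are served directly by nginx.
    location /static/ {
        alias /opt/mutalyzer/static/;
    }

    # Everything else is reverse-proxied to the Gunicorn servers.
    location /          { proxy_pass http://website; }
    location /json/     { proxy_pass http://json_service; }
    location /services/ { proxy_pass http://soap_service; }
}
```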
Current production server configuration
nginx
- Configuration
- `proxy_read_timeout` is 10 minutes for all services.
Gunicorn
- Configuration
- `timeout` is 10 minutes for all services.
- `workers` is 2 (website), 1 (JSON webservice), 1 (SOAP webservice).
- `worker_class` is `sync` (default).
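To make that concrete, the website's Gunicorn settings currently come down to something like this (the exact file layout in the repository may differ, but the values are the ones listed above):

```python
# Gunicorn configuration (Python config file) for the website service.
workers = 2            # 1 for the JSON and SOAP webservices
worker_class = 'sync'  # the Gunicorn default
timeout = 600          # 10 minutes, matching the nginx proxy_read_timeout
```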
Considerations
- My wish that requests should never take more than 0.5 seconds is actually not realistic, since many requests depend on external services (NCBI) and can take several seconds. This is idle time, though; a long-running extractor job would be 100% CPU time.
- Gunicorn docs recommend not setting `timeout` noticeably higher than 30 seconds (the default).
- The nginx `proxy_read_timeout` default is 60 seconds.
- NCBI communication is the main reason our timeouts are set to 10 minutes. However, I think this is way too high, especially for the Gunicorn `timeout`. If we don't have a response in, say, 30 seconds, do we really want to wait longer? Unfortunately I can't find what exact problem led me to set it to 10 minutes at the time (the server config wasn't in Git yet).
- Gunicorn docs recommend using the default `sync` workers for CPU-bound applications and `eventlet` or `gevent` workers for applications making calls to external webservices.
Discussion
Ideally Mutalyzer would be mainly CPU bound and all requests could be handled on the order of microseconds. This would fit the Gunicorn `sync` workers quite well.
Another advantage of the `sync` workers is that the timeout is per request, and it will just kill the entire worker process (and start a new one). Just the protection we need to kill long-running jobs. I don't think this will work similarly with async workers.
Unfortunately, some of our requests call external webservices and can easily take multiple seconds, blocking the `sync` worker (and we only have one or two). When all workers are busy, new requests stack up in nginx, which is why we need a high `proxy_read_timeout` to prevent nginx from dropping these requests before they get handled. But even if they are not dropped, all clients have to wait if there are only a few of these long blocking jobs in front of them.
So our main problem is that communication with the NCBI works really badly with the `sync` workers (and, by the same synchronous design, so does the batch processor).
I see two solutions to this problem:
- Strive to minimize request processing time. Don't do anything that takes more than ~ 0.1 seconds. This means NCBI communication cannot happen in the HTTP request processing and should be done asynchronously. Same for extraction jobs.
- Switch to async workers.
Option 1 requires a lot of work. I could imagine the name checker doing its job normally when the reference sequence is in the cache, but spawning an asynchronous download job if it isn't and returning immediately to some polling page. For the webservices this would require additional client logic.
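A very rough sketch of that idea (all names here are hypothetical, nothing like this exists in Mutalyzer yet): the request either does the check directly, or only schedules the NCBI download and returns something the client can poll.

```python
# Hypothetical sketch of option 1: fast path when the reference sequence is
# cached, otherwise start a background download and return immediately.
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}   # job id -> Future for a running download
cache = {}  # accession -> reference sequence


def download_reference(accession):
    """Placeholder for the slow NCBI fetch (can take several seconds)."""
    return ''  # would return the downloaded reference sequence


def check_name(description, accession):
    """Handle a name check request without ever blocking on the NCBI."""
    if accession in cache:
        # Fast path: pure CPU work, fine for the sync workers.
        return {'status': 'done', 'result': 'normal name check output'}
    # Slow path: schedule the download and return right away.
    job_id = str(uuid.uuid4())
    jobs[job_id] = executor.submit(download_reference, accession)
    return {'status': 'pending', 'job': job_id}


def poll(job_id):
    """Polling endpoint: is the download finished yet?"""
    return {'status': 'done' if jobs[job_id].done() else 'pending'}
```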
I don't know how well option 2 works if not all jobs are long blocking (and the majority of ours probably aren't). Although the Gunicorn docs mention that most applications will work without change on the async workers, I'm still slightly worried about the monkey patching done by Gunicorn in this mode.
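For reference, the patching the gevent worker relies on amounts to roughly the following; the open question is whether all of our code (and the C extensions) behaves correctly under it.

```python
# Roughly what Gunicorn's gevent worker does at startup: monkey patch the
# standard library so blocking I/O yields to other greenlets instead of
# blocking the whole worker process. Requires gevent to be installed.
from gevent import monkey
monkey.patch_all()

import socket  # from here on this is the cooperative, patched version

# Our NCBI calls would still look synchronous, but while one request waits on
# the network, other requests in the same worker can make progress.
```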
Suggestions
In any case, we can safely set the Gunicorn `timeout` to 30 seconds (the default).
Option 1:
- Increase the number of Gunicorn workers.
- Accept somewhat longer sequences for the description extractor (because we have more workers).
- Decrease the nginx `proxy_read_timeout` a bit.
- Long term: try to work towards solution 1 mentioned above.
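For the website this could come down to something like the following Gunicorn settings (the numbers are suggestions, not measured values); the `proxy_read_timeout` change itself lives in the nginx config, e.g. lowering it to a minute or two.

```python
# Option 1 sketch: stay on sync workers, just more of them, with the timeout
# back at the Gunicorn default. Numbers are illustrative.
workers = 4            # up from 2 for the website
worker_class = 'sync'
timeout = 30
```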
Option 2:
- Switch to one of the Gunicorn async workers (`eventlet` or `gevent`).
- Increase the number of Gunicorn workers.
- Accept somewhat longer sequences for the description extractor (they still block the worker because they are in C++ land, but we have more workers).
- Reset the nginx `proxy_read_timeout` to the 60 seconds default (because long blocking requests don't block the worker).
- Medium term: implement a task queue for the extractor and accept longer sequences.
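Sketch of the corresponding Gunicorn settings (again, the numbers are illustrative, not tested):

```python
# Option 2 sketch: async workers, so requests waiting on the NCBI no longer
# block a worker. Requires gevent (or eventlet) to be installed.
workers = 2
worker_class = 'gevent'   # or 'eventlet'
worker_connections = 100  # max simultaneous clients per worker (Gunicorn default: 1000)
timeout = 30
```

In nginx, `proxy_read_timeout` could then simply be dropped from our config, since 60 seconds is its default.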
I suggest we have a trial period for option 2, but we should think about how to monitor server performance so that we can make an informed decision.
Server monitoring
Today I already changed the nginx access log format to include the total request processing time. Based on that we can at least get an idea of the processing time distribution.
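Concretely, this comes down to including nginx's `$request_time` (and optionally `$upstream_response_time`) variable in the log format; the sketch below is an example, not necessarily the exact format string now in our config.

```nginx
# Access log format with the total request processing time ($request_time)
# and the time spent waiting on the upstream Gunicorn server.
log_format timed '$remote_addr [$time_local] "$request" $status '
                 '$body_bytes_sent $request_time $upstream_response_time';

access_log /var/log/nginx/access.log timed;
```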
Ideally I would like to set up something like an ELK stack (Elasticsearch, Logstash, Kibana) so we can have live monitoring of all kinds of metrics like request processing time, error rates, etc.
nginx rate limiting
Apparently nginx has rate limiting capabilities built in. I would really like to explore this, since it would apply really well to our webservices. And I think it's much better to implement this on the webserver level than on the application level (and therefore in the server config rather than in the Mutalyzer code).
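The built-in mechanism is the `limit_req` module; applied to, say, a webservice location it could look like the sketch below (zone name, rate, burst, and the location/upstream names are placeholders we would have to tune).

```nginx
# Shared zone keyed on client IP, allowing on average 5 requests per second.
limit_req_zone $binary_remote_addr zone=webservice:10m rate=5r/s;

server {
    # ... existing server configuration ...

    location /services/ {
        # Allow short bursts; requests beyond that are rejected (503 by
        # default, configurable with limit_req_status).
        limit_req zone=webservice burst=10 nodelay;
        proxy_pass http://soap_service;
    }
}
```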
@j.f.j.laros Do you have any thoughts on this?