Webserver Concurrency Models in Python

CocCoc Techblog
Coccoc Engineering Blog
10 min read · Sep 30, 2021

In the very beginning, I was wondering how a webserver works, how mighty Apache, which utterly dominated the market at the time, struggled with high-traffic environments, how Nginx's event-driven architecture beautifully solved the C10K problem, and, long after, how NodeJS earned the community hype. At the heart of all these technologies are concurrency models, which play a crucial role in modern webservers, powering most of the Internet traffic worldwide today.

In this blogpost, we'll go through the development of this tech tree, from a very primitive model to the more advanced ones that are prevalent today. We'll even take a look at Python's infamous GIL in action, and finally find out which models are more appropriate in which situations.

Why Python? Because it's much easier to implement the examples, even for very low-level interfaces such as sockets, multiplexing IO, threading… And we can spend most of our cognitive energy on understanding the mechanisms under the hood rather than the language itself. Let's have some fun with Python!

Before going further, brush up on these topics, just a bit:

  • Processing / Threading
  • Socket
  • Multiplexing IO

The Naive server

Let's implement a simple client-server architecture, using the primitive socket() system call.

Simple server using socket

In short, the service opens port 40404 locally, accept()s connections, and returns the 35th element of the Fibonacci sequence.

Why the 35th? With it, the test machine fully utilizes 1 CPU core (out of 4) and takes approximately 4s to complete the recursive function, which is suitable for demonstration.
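A minimal sketch of such a naive server could look like this; the helper names fib(), log() and process(), and the exact wire format (read once, reply with the number as bytes), are illustrative assumptions rather than the exact original code:

import datetime
import os
import socket

def fib(n):
    # naive recursive Fibonacci; fib(35) keeps one CPU core busy for ~4s
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def log(msg):
    print(f"{datetime.datetime.now()}: [{os.getpid()}] {msg}")

def process(sock):
    # blocking, serialized loop: accept one connection, compute, reply, repeat
    while True:
        conn, addr = sock.accept()
        log(f"Accepting {addr}")
        conn.recv(1024)
        conn.sendall(str(fib(35)).encode())
        conn.close()
        log(f"Processed {addr}")

if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 40404))
    server.listen(10)
    process(server)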

For the sake of simplicity, I didn't paste the client code here, but it's as simple as calling socket's connect() and recv(). Notably, 10 client processes are forked simultaneously to connect to the server, in order to check each model's concurrency. All the code is included at the end.
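For reference, a rough sketch of that client, assuming os.fork() on Linux and a dummy request payload:

import datetime
import os
import socket

def request():
    # one client: connect, send a dummy payload, wait for the reply, print it
    print(f"{datetime.datetime.now()}: [{os.getpid()}] Connecting 127.0.0.1:40404")
    with socket.create_connection(("127.0.0.1", 40404)) as conn:
        conn.sendall(b"fib")
        print(conn.recv(1024))

if __name__ == "__main__":
    # fork 10 client processes at once to probe the server's concurrency
    children = []
    for _ in range(10):
        pid = os.fork()
        if pid == 0:          # child process
            request()
            os._exit(0)
        children.append(pid)  # parent keeps forking
    for pid in children:
        os.waitpid(pid, 0)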

Here comes the result

# time python3 client.py
2021-09-29 22:20:06.716157: [22090] Connecting 127.0.0.1:40404
...8 more times
2021-09-29 22:20:06.732502: [22099] Connecting 127.0.0.1:40404
b'9227465'
...8 more times
b'9227465'
real 0m39.847s

# python3 naive.py
2021-09-30 17:00:37.553617: [12857] Accepting ('127.0.0.1', 21464)
2021-09-30 17:00:41.563199: [12857] Processed ('127.0.0.1', 21464)
2021-09-30 17:00:41.563701: [12857] Accepting ('127.0.0.1', 21466)
2021-09-30 17:00:45.579824: [12857] Processed ('127.0.0.1', 21466)
2021-09-30 17:00:45.580463: [12857] Accepting ('127.0.0.1', 21468)
...

~40s for 10 iterations of the Fibonacci function. Remember this number, because we'll get back to it very soon. 10 requests come in at once, but only 1 can be served at a time; the others have to wait.

It can be clearly seen that the server uses a blocking, serialized model, the simplest of them all. One process can only serve one request at a time. Is it the slowest server?

The MultiProcessing model

The naive implementation raises a question: how do we scale the webserver so it can serve more concurrent requests? As one process can only serve one request at a time, the simple approach is to increase the number of processes so the service can utilize more of the available resources in the system (in our case, CPU for calculating Fibonacci). Python provides the multiprocessing library for this purpose.

4 separate processes

The code is pretty much the same as the previous implementation; we just call the process() function 4 times in 4 different processes using the multiprocessing lib. Now we have 4 processes which can fully utilize all 4 CPU cores (as can be observed with top).
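A sketch of this variant might look like the following, assuming Linux's fork start method so that the listening socket is simply inherited by the workers:

import datetime
import multiprocessing
import os
import socket

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def process(sock):
    # same blocking serving loop as the naive server
    while True:
        conn, addr = sock.accept()
        print(f"{datetime.datetime.now()}: [{os.getpid()}] Accepting {addr}")
        conn.recv(1024)
        conn.sendall(str(fib(35)).encode())
        conn.close()
        print(f"{datetime.datetime.now()}: [{os.getpid()}] Processed {addr}")

if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 40404))
    server.listen(10)
    # 4 worker processes all block on accept() of the same listening socket;
    # the kernel hands each incoming connection to exactly one of them
    workers = [multiprocessing.Process(target=process, args=(server,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()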

# time python3 client.py
2021-09-29 22:42:17.309141: [22696] Connecting 127.0.0.1:40404
...
b'9227465'
real 0m12.063s

# python3 proc.py
2021-09-30 17:01:14.770677: [12938] Accepting ('127.0.0.1', 21488)
2021-09-30 17:01:14.773363: [12939] Accepting ('127.0.0.1', 21490)
2021-09-30 17:01:14.775600: [12940] Accepting ('127.0.0.1', 21492)
2021-09-30 17:01:14.775894: [12941] Accepting ('127.0.0.1', 21494)
2021-09-30 17:01:18.991079: [12939] Processed ('127.0.0.1', 21490)
2021-09-30 17:01:18.991468: [12939] Accepting ('127.0.0.1', 21496)
2021-09-30 17:01:19.114944: [12940] Processed ('127.0.0.1', 21492)
2021-09-30 17:01:19.115302: [12940] Accepting ('127.0.0.1', 21498)
2021-09-30 17:01:19.162694: [12941] Processed ('127.0.0.1', 21494)
2021-09-30 17:01:19.162910: [12941] Accepting ('127.0.0.1', 21500)
2021-09-30 17:01:19.215741: [12938] Processed ('127.0.0.1', 21488)
...

As expected, it's nearly 4 times faster: 4 requests were served concurrently, and the last 2 requests were handled in the final round, which is why the total time is 3 × 4 = 12s.

The multiprocessing model is simple. It appeared in the very early days of the webserver era, yet it is still prevalent in the market today, with well-known examples such as php-fpm, wsgi modules… and even modern Apache with the prefork MPM. There are reasons for that.

The Threading model

As we all know the disadvantages of spawning too many processes, threading comes into play. Threads are lightweight, consume less memory, and communicate more easily with each other through shared memory, but this comes at the cost of complexity and synchronization hell. Nearly everyone runs into a hard-to-debug race condition at least once in their life.

Let's try it with our demonstration. The threading library has a similar interface to the multiprocessing one, so switching is very simple.

4 threads inside the main process
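The switch itself could be as small as this sketch, reusing fib() and the process() serving loop from the multiprocessing sketch above:

import socket
import threading

# fib() and process() are the same as in the multiprocessing sketch

if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 40404))
    server.listen(10)
    # 4 threads inside one process, all blocked on accept() of the same socket,
    # and all sharing the same GIL
    workers = [threading.Thread(target=process, args=(server,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()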

The result, guess what?

# time python3 client.py
2021-09-30 16:33:46.530893: [12243] Connecting 127.0.0.1:40404
...
b'9227465'
real 0m43.169s

# python3 thread.py
2021-09-30 17:02:42.883735: [13052] Accepting ('127.0.0.1', 21512)
2021-09-30 17:02:42.890233: [13052] Accepting ('127.0.0.1', 21514)
2021-09-30 17:02:42.890349: [13052] Accepting ('127.0.0.1', 21518)
2021-09-30 17:02:42.890456: [13052] Accepting ('127.0.0.1', 21516)
2021-09-30 17:02:59.937182: [13052] Processed ('127.0.0.1', 21514)
2021-09-30 17:03:00.020148: [13052] Accepting ('127.0.0.1', 21520)
2021-09-30 17:03:00.333162: [13052] Processed ('127.0.0.1', 21512)
2021-09-30 17:03:00.343931: [13052] Accepting ('127.0.0.1', 21522)
2021-09-30 17:03:00.660173: [13052] Processed ('127.0.0.1', 21516)
2021-09-30 17:03:00.681185: [13052] Accepting ('127.0.0.1', 21524)
2021-09-30 17:03:00.810254: [13052] Processed ('127.0.0.1', 21518)
2021-09-30 17:03:00.815903: [13052] Accepting ('127.0.0.1', 21526)
...

4 requests were served at a time, yet each response was about 4x slower (~16s each), and the total time was roughly 4x that of the multiprocessing model.

That is a big disappointment. Based on common knowledge, this result doesn't make sense. Threading performance should be at least equal to multiprocessing, if not better, yet in this case it performed even worse than the single-process, single-thread model in the first demonstration. If you're thinking of some magical unknown thing inside the Python interpreter, then hooray, you're correct. It's the so-called GIL that people usually talk about.

In short, the GIL is a global, per-process lock that keeps only one thread running in a process at any moment (we won't cover the reasons and details in this post). In our case 4 threads were running concurrently, but only one could hold the lock and utilize the CPU at a time, and each thread got a timeslice to run before being preemptively switched to another thread (sounds like the OS's CPU scheduler, doesn't it?).
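You can see this effect without any sockets at all: on CPython, two CPU-bound threads take about as long as running the same work back to back (a small, illustrative check):

import threading
import time

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

start = time.time()
threads = [threading.Thread(target=fib, args=(32,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"2 threads : {time.time() - start:.1f}s")

start = time.time()
fib(32)
fib(32)
# on CPython both figures come out roughly equal:
# the two threads never actually run in parallel
print(f"sequential: {time.time() - start:.1f}s")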

So what is threading useful for? Let's run one more test.

threading with IO-bound and CPU-bound

Usually there are 2 types of applications: IO-bound and CPU-bound. Fibonacci uses pure CPU, so the function is CPU-bound. As we now know, one Python process cannot utilize more than 1 CPU core because of the GIL, so threading will not be beneficial for CPU-bound applications. How about the IO-bound counterpart?

Let's check the result with the sleep() function (simulating IO-bound work such as waiting for network or disk IO).
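On the server side, the only change is the work done per request: something along these lines, where sleep(4) stands in for waiting on the network or disk (a sketch, assuming the same accept loop and main block as before):

import time

def process(sock):
    # IO-bound variant of the serving loop: sleep(4) replaces fib(35)
    while True:
        conn, addr = sock.accept()
        conn.recv(1024)
        result = time.sleep(4)                # a sleeping thread releases the GIL
        conn.sendall(str(result).encode())    # hence the b'None' seen by the client
        conn.close()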

# time python3 client.py
2021-09-30 16:58:39.342335: [12779] Connecting 127.0.0.1:40404
...
b'None'
real 0m12.072s

# python3 thread.py
2021-09-30 17:05:43.228304: [13177] Accepting ('127.0.0.1', 21544)
2021-09-30 17:05:43.229373: [13177] Accepting ('127.0.0.1', 21546)
2021-09-30 17:05:43.230724: [13177] Accepting ('127.0.0.1', 21548)
2021-09-30 17:05:43.231274: [13177] Accepting ('127.0.0.1', 21550)
2021-09-30 17:05:47.232686: [13177] Processed ('127.0.0.1', 21544)
2021-09-30 17:05:47.232909: [13177] Accepting ('127.0.0.1', 21552)
2021-09-30 17:05:47.233211: [13177] Processed ('127.0.0.1', 21546)
2021-09-30 17:05:47.233294: [13177] Accepting ('127.0.0.1', 21554)
2021-09-30 17:05:47.234927: [13177] Processed ('127.0.0.1', 21548)
2021-09-30 17:05:47.235022: [13177] Accepting ('127.0.0.1', 21556)
2021-09-30 17:05:47.235451: [13177] Processed ('127.0.0.1', 21550)
...

Now things look normal. Threads waiting for IO aren't affected by the GIL. So now you know why Python's threading is recommended for IO-bound applications only. Another reason not to use threading is the thread-safety concern around synchronization, which we won't cover here. For Python webservers, multiprocessing is still somewhat preferable.

Non-blocking?

In the processing / threading models above, the socket is blocking. That means while the thread / process is serving a request, it blocks other requests from being processed and makes them wait until the socket is free again. In non-blocking mode, the socket does not block; instead it immediately returns the error EAGAIN: Resource temporarily unavailable, so the client doesn't wait. The client can retry later and do other useful work rather than sitting idle.

Time to build a non-blocking client. It's quite troublesome in the beginning, when we need to get familiar with something new, but trust me, things get easier as time goes by.

Simple non-blocking, polling algorithm

In the single-thread, single-process model, the client blocks for 4s waiting for a result from the server. With non-blocking mode, we'll try to use those 4s to do something more meaningful while waiting for the result. In this demonstration only 1 request is sent to the server.
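A possible version of such a non-blocking client, where a BlockingIOError (EAGAIN underneath) simply means "no data yet, go do something useful":

import datetime
import socket

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(f"{datetime.datetime.now()}: Connecting 127.0.0.1:40404")
conn = socket.create_connection(("127.0.0.1", 40404))
conn.sendall(b"fib")
conn.setblocking(False)   # recv() now returns immediately instead of waiting

while True:
    try:
        data = conn.recv(1024)
        print(f"{datetime.datetime.now()}: Data returned")
        print(f"Returned from Server: {data}")
        break
    except BlockingIOError:
        # nothing to read yet (EAGAIN); spend the wait doing useful work
        print(f"{datetime.datetime.now()}: Still waiting, doing Fibonacci 31")
        print(fib(31))
conn.close()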

# time python3 nonblocking-client.py
2021-09-30 21:16:46.464402: Connecting 127.0.0.1:40404
2021-09-30 21:16:46.464687: Still waiting, doing Fibonacci 31
1346269
2021-09-30 21:16:48.037938: Still waiting, doing Fibonacci 31
1346269
2021-09-30 21:16:49.613336: Still waiting, doing Fibonacci 31
1346269
2021-09-30 21:16:50.359646: Data returned
Returned from Server: b'9227465'
real 0m3.945s

# python3 naive.py
2021-09-30 21:16:46.464813: [20196] Accepting ('127.0.0.1', 22836)
2021-09-30 21:16:50.359540: [20196] Processed ('127.0.0.1', 22836)

As can be seen, while waiting for Fibonacci(35), the client takes the time to calculate Fibonacci(31) three times (~1.5s each).

Non-blocking mode is very useful for IO-bound tasks, as the process can use the waiting time to do other work, without the need for threading or subprocesses. Operating systems support polling mechanisms called multiplexing IO through the syscalls select(), poll(), epoll(), kqueue()… which have helped non-blocking programming become more popular.
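For instance, instead of retrying recv() in a tight loop, the client could hand the waiting over to the OS with select() and a timeout (a small sketch, not part of the original code):

import select
import socket

conn = socket.create_connection(("127.0.0.1", 40404))
conn.sendall(b"fib")
conn.setblocking(False)

while True:
    # let the OS watch the socket for us, with a 1-second timeout,
    # instead of calling recv() in a tight retry loop
    readable, _, _ = select.select([conn], [], [], 1.0)
    if readable:
        print(conn.recv(1024))
        break
    print("socket not ready yet, free to do other work here")
conn.close()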

The asynchronous, non-blocking Event-driven architecture

The non-blocking demonstration above is a bridge to the last model, which is the heart of the most effective and prevalent webservers these days, such as Nginx, NodeJS, Lighttpd… The technology is called event-driven architecture, and the core of the architecture is an event loop. Simply put, an event loop is a loop inside a process that triggers callbacks based on events.

Think of the state machine like the rules for chess. Each HTTP transaction is a chess game. On one side of the chessboard is the web server — a grandmaster who can make decisions very quickly. On the other side is the remote client — the web browser that is accessing the site or application over a relatively slow network.
— Inside NGINX: How We Designed for Performance & Scale

The application does not block on events, which is why it's non-blocking and able to accept a lot of requests simultaneously. Requests are put into the run loop and then processed asynchronously until completion. An event loop can be very fast, but due to its single-threaded nature, one single blocking call will block the whole process, causing all accepted requests to be blocked as well. We won't go into too much detail here, because it could take days to cover event-driven architecture properly. Time to write some code!

Python supports the event loop with a high-level interface called the asyncio library. The syntax itself is quite difficult for those not already familiar with event-driven programming, but let's just focus on the ideas, not the language itself.

simple event-driven webserver

Similar to the threading model, we will test the server with both CPU-bound and IO-bound functions.
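A minimal sketch of such an asyncio server is below; the coroutine name handle() matches the debug output further down, but the rest of the details are assumptions. The single line doing the per-request work is what we swap between the IO-bound and CPU-bound tests:

import asyncio
import datetime
import os

def log(msg):
    print(f"{datetime.datetime.now()}: [{os.getpid()}] {msg}")

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

async def handle(reader, writer):
    addr = writer.get_extra_info("peername")
    log(f"Accepting {addr}")
    await reader.read(1024)
    # IO-bound: await asyncio.sleep(4) yields control back to the loop,
    # so all pending requests wait concurrently.
    # CPU-bound: replace it with result = fib(35), which never awaits
    # and therefore blocks the whole loop.
    result = await asyncio.sleep(4)
    writer.write(str(result).encode())
    await writer.drain()
    writer.close()
    log(f"Processed {addr}")

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 40404)
    async with server:
        await server.serve_forever()

asyncio.run(main())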

IO-bound first, where event-driven shines the most. As usual, 10 requests hit the server at once.

# time python3 client.py
2021-09-30 22:05:08.030330: [21375] Connecting 127.0.0.1:40404
...
b'None'
real 0m4.068s

# python3 eventloop.py
2021-09-30 22:05:08.032579: [21373] Accepting ('127.0.0.1', 23074)
2021-09-30 22:05:08.032932: [21373] Accepting ('127.0.0.1', 23076)
2021-09-30 22:05:08.034144: [21373] Accepting ('127.0.0.1', 23078)
2021-09-30 22:05:08.036164: [21373] Accepting ('127.0.0.1', 23080)
...
2021-09-30 22:05:12.045846: [21373] Processed ('127.0.0.1', 23092)

All requests were accepted at once, and they finished at the same time. The total time is 4s, which means the concurrency performed quite nicely.

Next, CPU-bound. Expectedly, the server uses only 1 process, so CPU utilization cannot go higher than 1 core, and the responses will be slow. How slow, exactly?

# time python3 client.py
2021-09-30 22:09:12.402979: [21493] Connecting 127.0.0.1:40404
...
2021-09-30 22:09:12.414549: [21502] Connecting 127.0.0.1:40404
b'9227465'
...
b'9227465'
real 3m57.475s

# python3 eventloop.py
2021-09-30 22:09:12.404250: [21491] Accepting ('127.0.0.1', 23110)
2021-09-30 22:09:35.934708: [21491] Processed ('127.0.0.1', 23110)
Executing <Task finished coro=<handle() done, defined at eventloop.py:24> result=None created at eventloop.py:42> took 23.531 seconds
2021-09-30 22:09:35.939693: [21491] Accepting ('127.0.0.1', 23112)
2021-09-30 22:09:59.947985: [21491] Processed ('127.0.0.1', 23112)
Executing <Task finished coro=<handle() done, defined at eventloop.py:24> result=None created at eventloop.py:42> took 24.009 seconds
...
Executing <Task finished coro=<handle() done, defined at eventloop.py:24> result=None created at eventloop.py:42> took 23.801 seconds

24s for one request, and the requests were even handled sequentially. Simply put, fib() was blocking the loop, stalling the whole process. The event loop model is unbelievably slow when handling CPU-bound tasks; most interestingly, it's even much slower than the single-process, single-thread model. This result proves that the event loop is not a silver bullet for everything, and we need to use it with care. Remember back in the days when Nginx had a lot of mysteries related to blocking issues?

We've gone through a lot of code and models: we've seen how the GIL hindered threading performance, and we've seen the power of the well-known event-driven architecture as well as its disadvantages. Concurrency is difficult, but fun to play with. As for the applicability of each mechanism, I hope you can figure it out for yourself after these long examples, and write better code. Have fun with concurrency!

Further readings:
