Sockets, servers and scalability
Scaling a server to handle many requests
Everyone can talk; listening is where the actual difficulty lies.
Caution: If the word socket does not ring any bells in your mind, this article will be difficult for you to follow.
Bare bones server
At the time of writing, I work as a fullstack software engineer, and much of my work involves writing backend microservices that have to scale with traffic. I primarily use Go to write these backend APIs. I often wonder how backend servers are implemented internally, so I researched the topic, and that research is the subject of this article. If you have ever written backend APIs, web servers or any kind of network programming, this article is for you.
Let us start our discussion with first-principles thinking, which I like to apply to any new problem thrown at me. So, what is a web server, actually? Having done my networks course in C, I would have answered this question with one word: socket.
Socket
Sockets are abstractions of connections. For simplicity, you can think of a socket as a 5-tuple containing the following information:
(src IP, src port, dst IP, dst port, protocol)
where src IP = source IP address, src port = source port, dst IP = destination IP address, dst port = destination port and protocol is either TCP or UDP.
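For example, an HTTPS connection from a laptop to a web server might be identified by a tuple like this (the addresses and ports here are made up purely for illustration):
(192.168.1.10, 52344, 203.0.113.7, 443, TCP)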
So a socket, in essence, identifies a connection by address, port and protocol. This leads us to an important conclusion.
Because the protocol is part of the tuple, you can open two sockets, one TCP and one UDP, on the same port (StackOverflow answer here).
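A quick way to convince yourself of this is a minimal sketch (port 9000 is an arbitrary choice) that binds a TCP socket and a UDP socket to the same port; both bind() calls succeed because the protocol is part of the tuple:
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0); // TCP
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);  // UDP

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(9000); // same port for both sockets

    // Both binds return 0: (9000, TCP) and (9000, UDP) are distinct sockets
    printf("TCP bind: %d\n", bind(tcp_fd, (struct sockaddr*)&addr, sizeof(addr)));
    printf("UDP bind: %d\n", bind(udp_fd, (struct sockaddr*)&addr, sizeof(addr)));
    return 0;
}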
How to open a socket
Let’s look at all the steps involved while opening a socket for serving an incoming request.
#include <stdio.h>      // headers used by the snippets in this article
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int server_fd;
struct sockaddr_in addr;
server_fd = socket(AF_INET, SOCK_STREAM, 0); // SOCK_STREAM = TCP
if (server_fd < 0) { perror("socket"); exit(1); }
memset(&addr, 0, sizeof(addr)); // zero the struct before filling it in
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = INADDR_ANY; // accept connections on any local interface
addr.sin_port = htons(9000);       // port 9000, in network byte order
if (bind(server_fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
    perror("bind"); exit(1);
}
// up to 5 pending connections are queued before further ones are turned away
if (listen(server_fd, 5) < 0) { perror("listen"); exit(1); }
printf("Listening on port 9000...\n");
This opens a listening TCP socket on port 9000. But it does not serve anything yet; for that, we need to accept connections and handle whatever the client sends.
char buf[1024];
while (1) {
    // accept a connection
    int client_fd = accept(server_fd, NULL, NULL);
    if (client_fd < 0) { perror("accept"); continue; }
    printf("Client connected.\n");
    while (1) {
        // read whatever client writes to the socket
        // this is a blocking call
        ssize_t n = read(client_fd, buf, sizeof(buf));
        if (n <= 0) break; // client closed or error
        // echo back whatever client writes
        write(client_fd, buf, n);
    }
    printf("Client disconnected.\n");
    close(client_fd);
}
close(server_fd);
I want readers to pay attention to the read() system call: it is a blocking call, which means the program will not proceed until it has read whatever the client wants to send. And this is only natural: you have to wait and listen to what the client wants to say.
This leads us to an important conclusion: this code can handle only one client at a time, which you can verify using netcat. If a new request arrives while the server is busy with an existing connection, it has to wait.
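To see this for yourself, run the server and connect from two terminals (port 9000, as in the code above); the second client connects, but nothing it types is echoed back until the first client disconnects:
# terminal 1
nc localhost 9000
# terminal 2, while terminal 1 is still connected
nc localhost 9000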
This is not very good, let us improve this design.
Forked server
To serve more than one client at a time, we could replicate the whole process using the fork() system call, right?
Yes, we can. Let us try that approach. Update the main loop as follows:
signal(SIGCHLD, SIG_IGN); // auto-reap exited children so they don't linger as zombies (needs <signal.h>)
char buf[1024];
while (1) {
    int client_fd = accept(server_fd, NULL, NULL);
    if (client_fd < 0) { perror("accept"); continue; }
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        close(client_fd);
        continue;
    }
    if (pid == 0) {
        // Child process
        close(server_fd); // child doesn't need the listening socket
        printf("[Child %d] Client connected.\n", getpid());
        while (1) {
            ssize_t n = read(client_fd, buf, sizeof(buf));
            if (n <= 0) break; // client closed or error
            write(client_fd, buf, n); // echo back
        }
        printf("[Child %d] Client disconnected.\n", getpid());
        close(client_fd);
        exit(0);
    } else {
        // Parent process
        close(client_fd); // parent doesn’t handle this connection
    }
}
Now the process makes a copy of itself for each incoming connection, which lets the server handle many clients concurrently. But how many?
Performance
Compile the forked server and run it
gcc -o forked_server forked_server.c
./forked_server
In another terminal, connect to it using netcat:
nc localhost 9000
Do this two more times in separate terminals, so that three clients are connected at once.
Measure the memory usage using ps
ps -o pid,rss,cmd -C forked_server
This will give output something similar to this:
  PID   RSS CMD
49006  1568 ./forked_server
49244  1276 ./forked_server
49513  1276 ./forked_server
So each forked_server process here has a resident set size of roughly 1.2–1.5 MB (1568 KB, 1276 KB and 1276 KB).
This approach works fine for small loads but hits limits fast as you scale up: replicating a process carries memory overhead, and the OS has only a limited number of process IDs.
| Limit | Explanation | Symptom | 
|---|---|---|
| Memory per process | Each process has its own address space (~1–2 MB minimum) | Memory spikes with 1000+ clients | 
| Context switch | Kernel must constantly switch between thousands of processes | High CPU, slow response | 
| PID | Linux has a finite number of process IDs | fork: Resource temporarily unavailable | 
| File descriptor | Each process inherits its own FD table | “Too many open files” errors | 
The conclusion that we can draw from here is: this approach scales linearly until OS overhead takes over, then collapses.
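If you want to check some of these ceilings on your own Linux machine (the exact values vary by distribution and configuration):
cat /proc/sys/kernel/pid_max   # upper bound on process IDs
ulimit -u                      # max processes per user
ulimit -n                      # open file descriptors per process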
If processes are expensive, what about threads? Aren't those lightweight processes?
Threads
Surely we can do better by handling each request in its own thread? Update the main loop like so (this needs #include <pthread.h>):
while (1) {
    int *client_fd = malloc(sizeof(int));
    if (!client_fd) continue;
    *client_fd = accept(server_fd, NULL, NULL);
    if (*client_fd < 0) { perror("accept"); free(client_fd); continue; }
    pthread_t tid;
    if (pthread_create(&tid, NULL, handle_client, client_fd) != 0) {
        perror("pthread_create");
        close(*client_fd);
        free(client_fd);
        continue;
    }
    pthread_detach(tid); // don't need to join later
}
close(server_fd);
And define the handler function for the thread like so:
void *handle_client(void *arg) {
    int client_fd = *(int*)arg;
    free(arg);
    char buf[1024];
    printf("[Thread %ld] Client connected.\n", pthread_self());
    while (1) {
        ssize_t n = read(client_fd, buf, sizeof(buf));
        if (n <= 0) break;
        write(client_fd, buf, n);
    }
    printf("[Thread %ld] Client disconnected.\n", pthread_self());
    close(client_fd);
    return NULL;
}
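Compile it with pthread support and run it the same way as before (the file name is just an assumption mirroring the earlier example):
gcc -o threaded_server threaded_server.c -pthread
./threaded_server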
Performance
Threads are better than forked processes, but they too fail to serve much beyond a few thousand concurrent connections. Each thread consumes its own stack (commonly around 1 MB) plus scheduler state.
| Limitation | Explanation | Symptom | 
|---|---|---|
| Memory per thread | 10k threads × 1MB ≈ 10GB just for stacks | Out of memory | 
| Context switch | Thousands of runnable threads cause kernel thrashing | CPU usage spikes to 100% even when idle | 
| File descriptor | Each socket = 1 FD; OS default often 1024 | accept: Too many open files | 
| Synchronization | If you share data between threads | Lock contention, latency | 
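One knob for the memory row is shrinking each thread's stack with pthread_attr_setstacksize(). Here is a sketch of how the pthread_create() call from above could be changed; the 64 KB figure is an arbitrary choice, and it does nothing for the scheduling, FD or synchronization limits:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 64 * 1024); // 64 KB per thread instead of the much larger default
if (pthread_create(&tid, &attr, handle_client, client_fd) != 0) {
    perror("pthread_create");
}
pthread_attr_destroy(&attr);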
What is the next step? How can we do better, and how can we scale this simple echo server to 10k concurrent requests?
Epoll
The problem is the read() system call, which is blocking in nature: we have to wait for the client to write to the file descriptor, and only then can we move ahead.
So the problem boils down to the synchronous nature of listening: you wait and listen to what the talking party has to say. But what about when the talking party is not saying anything at the moment? Surely you could use that time to listen to another party who does want to say something, right?
This means we need an event-driven architecture, something that brings asynchrony into the model. How about an event queue: when a client writes to a socket, an event is pushed and our program is told to read. This way we read only when the client actually writes.
On Linux, this is exactly what the epoll family of system calls provides.
Imagine a queue of I/O readiness events. You tell the kernel: Please watch these sockets and wake me when they have something to do.
Then the kernel keeps track of all those sockets for you.
There are three main system calls:
| Call | Purpose | Analogy | 
|---|---|---|
| epoll_create1() | Create a new epoll instance (kernel event manager) | Create an empty watch list | 
| epoll_ctl() | Add / remove / modify which FDs you want to watch | Add items to your interest list | 
| epoll_wait() | Block until one or more watched FDs become ready | Wait for events to appear in the ready queue | 
- Create an epoll instance
int epfd = epoll_create1(0);
This gives you a handle to an internal epoll object in the kernel.
- Register sockets to watch
struct epoll_event ev;
ev.events = EPOLLIN;    // interested in read
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
This translates to a request to the kernel:
When data arrives on socket sock, add an event for it to my internal ready list.
- Wait for events
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
This call sleeps until any socket you registered becomes ready.
When something happens, epoll_wait() returns with up to MAX_EVENTS ready sockets.
You then loop through:
for (int i = 0; i < n; i++) {
    int fd = events[i].data.fd;
    // read or write as needed
}
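Putting the three calls together, here is a minimal sketch of our echo server rewritten on top of epoll. It is Linux-only, error handling is kept terse, and MAX_EVENTS plus the buffer size are arbitrary choices:
// epoll_echo_server.c — a minimal sketch of the echo server on top of epoll
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

static void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main(void) {
    // 1. Open a listening TCP socket on port 9000, exactly as before
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(9000);
    if (bind(server_fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
    if (listen(server_fd, 128) < 0) { perror("listen"); exit(1); }
    set_nonblocking(server_fd);

    // 2. Create the epoll instance and watch the listening socket for reads
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); exit(1); }

    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;
    ev.data.fd = server_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);

    char buf[1024];
    printf("Listening on port 9000...\n");

    // 3. One loop: sleep until some watched socket becomes ready
    while (1) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == server_fd) {
                // listening socket is readable: a new client is waiting
                int client_fd = accept(server_fd, NULL, NULL);
                if (client_fd < 0) continue;
                set_nonblocking(client_fd);
                ev.events = EPOLLIN;
                ev.data.fd = client_fd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
            } else {
                // a client socket is readable: echo back whatever it sent
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) continue;
                if (r <= 0) {
                    // client closed (or hard error): stop watching and close
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, r);
                }
            }
        }
    }
}
One process, one thread, one loop: the server sleeps in epoll_wait() and only touches the sockets that actually have data, which is why thousands of idle connections cost almost nothing.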
epoll is the better alternative to the older select() and poll() system calls: epoll's cost is roughly O(1) per ready FD, while select()/poll() scan every watched FD, which is O(N) per call.
Why it scales
- One kernel data structure holds all your FDs.
- Wake-ups happen only when necessary.
- No linear scanning through all sockets like in select() or poll().
That’s why epoll is the engine behind:
- nginx
- Redis
- Node.js
- Go runtime’s netpoller
The following table summarises the whole discussion:
| Model | Memory per conn | Kernel objects | Max connections | Typical use | 
|---|---|---|---|---|
| fork() | ~1–2 MB | process + fd | 500–1k | legacy UNIX daemons | 
| pthread | ~1 MB | thread + fd | 1k–5k | simple chat/web servers | 
| epoll() | ~few KB | fd only | 100k+ | modern high-performance servers (nginx, Go, Rust, etc.) |