22 Sep: sendfile()

sendfile() is a system call that copies data between two file descriptors entirely inside the kernel. It is usually used to copy from a file to a network socket, but since kernel 2.6.33 the target can also be a regular file[1]. You can call this "zero-copy".

If you don't use sendfile() something like this needs to happen instead:

  1. Allocate user space data buffer
  2. read() data from source file to buffer (behind the scenes copies from disk to OS cache, then OS cache to user space)
  3. write() contents of buffer to target socket (or file)

A context switch happens for both the read() and write() system calls, and the data is copied twice: once from kernel space to the buffer in user space, and then again from that buffer back to kernel space. Those context switches between user and kernel space, where state is saved and execution passed across, are relatively expensive.
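
The three steps above can be sketched as a small C helper. This is only an illustration: the name copy_with_buffer and the 64 KiB buffer size are my own choices, not taken from any particular program.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * user space buffer. Returns the number of bytes copied, or -1 on error. */
ssize_t copy_with_buffer(int in_fd, int out_fd) {
  char buf[65536];                  /* step 1: the user space buffer */
  ssize_t total = 0;

  for (;;) {
    /* step 2: kernel copies from its cache into buf */
    ssize_t n = read(in_fd, buf, sizeof buf);
    if (n == 0)
      break;                        /* end of file */
    if (n < 0) {
      if (errno == EINTR)
        continue;                   /* interrupted, try again */
      return -1;
    }

    /* step 3: kernel copies buf back into kernel space.
     * write() may accept fewer bytes than asked, so loop. */
    ssize_t done = 0;
    while (done < n) {
      ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
      if (w < 0) {
        if (errno == EINTR)
          continue;
        return -1;
      }
      done += w;
    }
    total += n;
  }
  return total;
}
```

Every pass through the outer loop costs at least two system calls and two copies of the same bytes.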

With sendfile() this becomes:

  1. sendfile() reads data from the source and writes it to the target

A single context switch occurs between the calling application and the sendfile() system call. No buffer is needed in userspace.

A minimal program to try out sendfile from source file to target file:

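#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int in_fd, out_fd;
  struct stat stat_buf;
  off_t offset = 0;

  if (argc != 3) {
    fprintf(stderr, "usage: %s <source> <target>\n", argv[0]);
    return 1;
  }

  /* Open the source and fetch its size and permission bits. */
  in_fd = open(argv[1], O_RDONLY);
  if (in_fd < 0 || fstat(in_fd, &stat_buf) < 0) { perror(argv[1]); return 1; }

  /* Create the target with the same mode as the source. */
  out_fd = open(argv[2], O_WRONLY | O_CREAT, stat_buf.st_mode);
  if (out_fd < 0) { perror(argv[2]); return 1; }

  /* Copy inside the kernel; sendfile() advances offset as it goes. */
  if (sendfile(out_fd, in_fd, &offset, stat_buf.st_size) < 0) { perror("sendfile"); return 1; }

  close(in_fd);
  close(out_fd);
  return 0;
}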
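
One caveat with a single call: sendfile() returns the number of bytes it actually transferred, and that can be fewer than requested. A rough sketch of a helper that keeps calling until everything has been sent, where the name sendfile_all is hypothetical and the copy is assumed to start at offset 0 of the source:

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Hypothetical helper: send count bytes from the start of in_fd to out_fd.
 * Returns 0 on success, -1 on error or if the source ends early. */
int sendfile_all(int out_fd, int in_fd, off_t count) {
  off_t offset = 0;                 /* updated by sendfile() after each call */

  while (offset < count) {
    ssize_t n = sendfile(out_fd, in_fd, &offset, (size_t)(count - offset));
    if (n < 0) {
      if (errno == EINTR)
        continue;                   /* interrupted, try again */
      return -1;
    }
    if (n == 0)
      return -1;                    /* source shorter than expected */
  }
  return 0;
}
```

In the program above this would stand in for the bare sendfile() call, e.g. sendfile_all(out_fd, in_fd, stat_buf.st_size).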

Moving to the practical matter of where you can use this, there's a lot of documentation out there that talks about problems with sendfile() and network filesystems. Apache has a directive to control whether sendfile() is used - EnableSendfile. The default is On in Apache 2.2, but Off in Apache 2.4. The documentation[2] calls out:
"With a network-mounted DocumentRoot (e.g., NFS, SMB, CIFS, FUSE), the kernel may be unable to serve the network file through its own cache."
Nginx has an equivalent "sendfile" option, which also defaults to off[3]. It's unclear to me what the above statement in the Apache docs means. Testing with the C program above, I can see that sendfile() works fine when the source is on NFS, and that it uses the page cache: the first run is slow, subsequent runs are fast, and after drop_caches it is slow again, i.e. the data is re-read from the source.
© 2017