Our lessons on Linux writeback: do “dirty” jobs the right way

Tungdam
8 min read · Apr 2, 2021

TL;DR

  • Writeback is innocent. It ensures data consistency and frees up memory for other tasks. Understand it and it will pay you back.
  • Unfortunately, writeback can sometimes trigger other things that affect your very latency-sensitive application: TLB shootdowns, high context-switch rates, stable-page writes.
  • Writeback itself rarely has a direct impact on your application. It does not block your writes; the dirty-page throttling mechanism does.
  • Writeback can be an indicator of performance issues, for example memory saturation or heavy disk IO due to over-syncing data. Tracing the reasons for writeback can help reveal the root cause in those cases.

1. What is writeback?

Writeback is the process of writing dirty pages in memory back to permanent storage.

I won’t explain much about dirty pages, write buffers, or deferred/asynchronous writes in this blog. Read this in case you don’t know what they are.

I’ll use write buffer and pagecache interchangeably in this blog, though strictly speaking, dirty pages all live in the pagecache.

This section is just to distinguish Linux dirty-page writeback from a hard disk’s cache writeback.

2. Why should we care about writeback?

Because write buffers and dirty pages are actually cool. They can speed up our writes significantly in many cases, when they’re allowed to. We then need a writeback mechanism to eventually save the volatile data in dirty pages to a persistent storage medium. A good understanding of writeback can help us to:

  • Balance between data safety and performance
  • Avoid extreme cases that can cause very big latency outliers in our applications, and reveal the root cause of such outliers more easily.

2.1 A balanced setting for data safety vs performance

Wise men say that understanding our applications’ behavior helps us get the most performance out of limited resources. That’s true for dirty pages and writeback too.

Look at these settings:

vm.dirty_background_bytes = 204800000
vm.dirty_bytes = 819200000
vm.dirty_writeback_centisecs = 3000

It looks weird (much, much higher than usual) but actually works great on our special-purpose server, where:

  • The main part is an nginx proxy, which generates 200+ GB of uncompressed logs per day.
  • The most important data there are those logs, and they are shipped on the fly to Kafka for later processing.
  • The logs are stored locally for debugging purposes only.
  • Any other application that has critical data syncs it by itself.
  • The box has plenty of free memory (around 20 GB out of 32 GB), though I strongly believe half of that amount would be more than enough.
  • Its 2 x 7200 RPM HDDs can’t keep up with such an IO workload.

In the worst case, when that small box somehow dies unexpectedly, we’ll lose roughly the last 30 seconds of logs locally, thanks to the dirty_writeback_centisecs setting (3000 centiseconds = 30 seconds). Everything’s fine. Cheap boxes are used wisely.
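
For reference, a minimal sketch of how settings like these can be applied, both at runtime and persistently (the drop-in file name below is just an example):

# apply at runtime (lost after reboot)
sysctl -w vm.dirty_background_bytes=204800000
sysctl -w vm.dirty_bytes=819200000
sysctl -w vm.dirty_writeback_centisecs=3000

# persist across reboots (the file name is arbitrary)
cat > /etc/sysctl.d/90-writeback.conf <<'EOF'
vm.dirty_background_bytes = 204800000
vm.dirty_bytes = 819200000
vm.dirty_writeback_centisecs = 3000
EOF
sysctl --system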

But of course, there’s no “one size fits all”.

2.2 Hard lessons: writeback can cause a crazy latency outlier

Even though the write buffer is well known for its performance benefits, its close friend writeback can be a source of latency outliers as well.

For example, the p99 latency of one of our critical apps sometimes skyrocketed (to 10 seconds!) when another process was writing data to disk.

A bit more info for our case:

  • Our Java app’s main task is to process around 80 GB of data which is locked in memory (mmap/mlock). Its disk IO is trivial, only logging, and the GC log is on tmpfs.
  • There was still plenty of free CPU and memory headroom. The disks themselves looked comfortable too (up to 40% utilization, with no matching spike pattern in queue length or IOPS).

The reason, after several interesting tracing sessions, turned out to be an accidental misconfiguration of our dirty settings: vm.dirty_background_bytes was quite high, 200 MB (inherited from the server described above). To get rid of the latency spikes, there were 2 choices:

  • Modify the process that does the IO in the background to use direct IO.
  • Lower our dirty_*_bytes settings drastically, to mitigate the impact of whatever writeback does.

We chose the second to avoid extra work, and because, obviously, we hadn’t set those values on this server on purpose. It was more like reverting.
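
The post doesn’t record the exact values we reverted to; purely as an illustration, going back to the kernel’s ratio-based thresholds could look like the sketch below (10 and 20 are common upstream defaults, not necessarily what we used):

# writing a *_ratio knob switches back to ratio-based thresholds;
# the kernel resets the *_bytes counterpart
sysctl -w vm.dirty_background_ratio=10
sysctl -w vm.dirty_ratio=20

# the byte-based knobs should now read back as 0
sysctl vm.dirty_background_bytes vm.dirty_bytes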

Taking some more time to research this, I found that this writeback-vs-Java-long-pause problem was also reported long ago, in great detail, in this article. We’re not alone in this case, which is good.

Let’s see what could be the reason.

2.2.1 Writeback as a root cause

Searching around, I found another great article, “latency implication of virtual memory”, where writeback is mentioned several times as a factor that can cause trouble with latency outliers. It may or may not have been the root cause of our issue, but it’s still worth mentioning here:

  • It incurs TLB shootdowns, which require a CPU core to pause whatever it is doing to inform (interrupt) other cores about updated information on the memory pages it holds. This process is normally super quick, but it can make our application’s latency (especially for a memory-intensive one) unstable, because the application has to wait for its turn to be “on-CPU”. It couldn’t be as big as in our case though; much, much smaller.
  • Another source of context switches: see how writeback is woken up as a background process:
static void wb_wakeup(struct bdi_writeback *wb)
{
        spin_lock_bh(&wb->work_lock);
        if (test_bit(WB_registered, &wb->state))
                mod_delayed_work(bdi_wq, &wb->dwork, 0);
        spin_unlock_bh(&wb->work_lock);
}

wb_wakeup calls mod_delayed_work with a delay parameter of 0, which means the work is scheduled immediately on the current CPU. Thus too much writeback can cause a high number of context switches, and this can contribute some extra latency in our case: with too much data to push, a lower-priority app can be left waiting longer to be scheduled.

Fortunately, we can verify the CPU scheduling issue easily with bpftrace, using runqlen.bt and runqlat.bt. TLB shootdowns can also be tracked by tracing the tlb_finish_mmu kprobe.
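
For example, a rough sketch of such a session, assuming bpftrace and its bundled tools are available (the map names are arbitrary, and tlb_finish_mmu must exist as a probe-able symbol on your kernel):

# count TLB-shootdown-related flushes per process
bpftrace -e 'kprobe:tlb_finish_mmu { @tlb[comm] = count(); }'

# count context switches per incoming task, to spot noisy flusher workers
bpftrace -e 'tracepoint:sched:sched_switch { @cswitch[args->next_comm] = count(); }'

# run queue latency histogram, from the bpftrace tools directory
./runqlat.bt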

  • While a dirty memory page is being written to disk, an application writing to that same page will be blocked waiting until the writeback finishes. This is an extreme case and the latency is usually very small (except in really extreme situations, such as when that page is being written to a broken USB stick :D).

2.2.2 Writeback as an effect

Of other events that put much more load on your server or block your application directly.

a. Page reclaim causes a zoo of chaos

Theoretically, page reclaim can unleash a zoo of chaos creatures in your system.

It can go like this when your application shares a box with a busy database server :D

  • A heavy SELECT query misses all the caches and grabs lots of data from disk, causing huge disk IO.
  • Linux adds as much of the data it has read as it can to the pagecache, so it needs more memory for the pagecache.
  • When there’s not enough free memory for the pagecache, page reclaim kicks in and tries to flush dirty data to disk. How? By calling writeback. If you are lucky and your SSD handles all this sh!t quickly enough, everything’s fine.
  • If you’re unlucky, your application may write to a page which is being flushed back to disk (the stable pages issue explained in 2.2.1, third bullet), and is thus blocked waiting.
  • If you’re very unlucky, your application needs some memory at the same time that Linux grabs all the free memory for the pagecache, so you are, again, blocked waiting for direct reclaim. This can be traced with my idol’s vmscan tool, or with the tracepoint sketch shown after the next paragraph.

In the last 2 points, you may be surprised that your application is affected by disk IO activity even though it works entirely with memory.
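
Here is one minimal way to see who gets stuck in direct reclaim and for how long, using the vmscan tracepoints (a sketch only, not the exact tool mentioned above; the map names are arbitrary):

bpftrace -e '
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @start[tid] = nsecs; }
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@start[tid]/ {
    // per-process histogram of time stalled in direct reclaim, in microseconds
    @direct_reclaim_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'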

b. Your write is throttled

By writeback, you may think.

No, it’s not writeback.

For virtually every write (except direct IO ones), the writer has to go through several checks, and one of them is the balance_dirty_pages function.

The kernel stack looks like this:

balance_dirty_pages+1
balance_dirty_pages_ratelimited+716
generic_perform_write+362
__generic_file_write_iter+254
ext4_file_write_iter+198
new_sync_write+251
vfs_write+165
ksys_write+87
do_syscall_64+83
entry_SYSCALL_64_after_hwframe+68

Here is what it does:

/*
 * balance_dirty_pages() must be called by processes which are generating dirty
 * data. It looks at the number of dirty pages in the machine and will force
 * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
 * If we’re over `background_thresh’ then the writeback threads are woken to
 * perform some writeout.
 */

And here is the code, in my view ( :D ):

static void balance_dirty_pages(struct bdi_writeback *wb,
                                unsigned long pages_dirtied)
{
        ...
        // declare variables
        for (;;) {
                ...
                // some magic to ensure the condition in the description above
                ...
                trace_balance_dirty_pages(wb,
                                          sdtc->thresh,
                                          sdtc->bg_thresh,
                                          sdtc->dirty,
                                          sdtc->wb_thresh,
                                          sdtc->wb_dirty,
                                          dirty_ratelimit,
                                          task_ratelimit,
                                          pages_dirtied,
                                          period,
                                          pause,
                                          start_time);
                __set_current_state(TASK_KILLABLE);
                wb->dirty_sleep = now;
                io_schedule_timeout(pause);
                ...
        }
        .....
}

So this function is what puts your process to sleep waiting for IO, by calling io_schedule_timeout(pause), not writeback. Writeback is meant to be a background process, and it shouldn’t block an application directly; otherwise it would be a foreground process.

As you can see above, we have a tracepoint called balance_dirty_pages to track when this happens and which process is blocked. To use ftrace, run:

/sys/kernel/debug/tracing/events]# echo 1 > writeback/balance_dirty_pages/enable

Then check the pause value and you’ll be able to see how long your application is blocked.
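
A rough session could look like the sketch below; the field names come from the tracepoint’s format file, and the pause value is, as far as I know, reported in milliseconds, so double-check on your kernel:

cd /sys/kernel/debug/tracing
echo 1 > events/writeback/balance_dirty_pages/enable
cat events/writeback/balance_dirty_pages/format   # list the available fields
grep -o ' pause=[0-9-]*' trace_pipe               # how long each writer sleeps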

Alternatively, you can trace io_schedule_timeout directly, look at its stack to make sure it’s called by balance_dirty_pages, and extract the pause value from there.
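
A hedged bpftrace sketch of that approach (the first argument of io_schedule_timeout is the sleep duration in jiffies, and the map names are arbitrary):

bpftrace -e 'kprobe:io_schedule_timeout
{
    // histogram of requested sleep durations, in jiffies
    @sleep_jiffies[comm] = hist(arg0);
    // kernel stacks, to confirm balance_dirty_pages is the caller
    @stacks[kstack] = count();
}'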

Writeback’s role in this case is very simple: it reduces the number of dirty pages until it is ≤ (background_thresh + dirty_thresh) / 2. It was invoked by balance_dirty_pages, so it is only an effect in the chain of events.

The quicker writeback can do its job, the fewer times a writing process is throttled.

Setting a small value for dirty_*_bytes can help speed up the writeback process, but it comes at the cost of being throttled more often (than with a bigger value).

Once again, there’s no “one size fits all”. You have to choose between being throttled more often but for short periods, or rarely blocked but for much longer durations.

3. Conclusion

Thank you for coming this far. I’ve just summarized everything I know about how writeback can affect our application performance.

It has pros and cons, but IMHO it plays a very important role when troubleshooting your application’s latency outliers, which is always fun. (Every engineer wants to reveal the secret of an edge case.)

Hopefully it will help you someday. If it does, please leave me a comment here; I would be very happy.

Thank you.

Small note: As of today (9th November 2023), I’m looking for a new job, ideally a Senior+ SRE / DevOps Engineer role, but I can contribute to anything related to infrastructure engineering. If you’re hiring or know somebody who is, please kindly ping me via my Twitter or LinkedIn for further info. Of course, feel free to just connect and say hi. I would love to know more about you. Thank you a ton!


Tungdam

Sysadmin. Amateur Linux tracer. Performance enthusiast.