Josef Bacik 977849e8ac btrfs: adjust subpage bit start based on sectorsize
[ Upstream commit e08e49d986 ]

When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production.  This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.

When writing out a subpage EB we scan the subpage bitmap for a dirty
range.  If the range isn't dirty we do

	bit_start++;

to move onto the next bit.  The problem is the bitmap is based on the
number of sectors that an EB has.  So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize.  This means our bitmap is 4
bits for every node.  With a 64k page size we end up with 4 nodes per
page.

To make this easier, this is how everything looks:

[0         16k       32k       48k     ] logical address
[0         4         8         12      ] radix tree offset
[               64k page               ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | |  | | | |   | | | |   | | | | ] bitmap

Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.

When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.

However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.

In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again.  This time it is dirty, and we go to find that
start using the following equation

	start = folio_start + bit_start * fs_info->sectorsize;

so in the case above, eb->start 0 is now dirty, and we calculate start
as

	0 + 1 * fs_info->sectorsize = 4096
	4096 >> 12 = 1

Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5.  If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.

The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change.  Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.

Fixes: c4aec299fa ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ Adjust context ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09 18:56:30 +02:00
arch x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() 2025-09-09 18:56:26 +02:00
block block: reject invalid operation in submit_bio_noacct 2025-08-28 16:28:40 +02:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2025-04-25 10:45:58 +02:00
crypto crypto: jitter - fix intermediary handling 2025-08-28 16:28:26 +02:00
Documentation netlink: add variable-length / auto integers 2025-09-09 18:56:22 +02:00
drivers PCI/MSI: Add an option to write MSIX ENTRY_DATA before any reads 2025-09-09 18:56:29 +02:00
fs btrfs: adjust subpage bit start based on sectorsize 2025-09-09 18:56:30 +02:00
include PCI/MSI: Add an option to write MSIX ENTRY_DATA before any reads 2025-09-09 18:56:29 +02:00
init sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP 2025-05-02 07:50:57 +02:00
io_uring io_uring/net: commit partial buffers on retry 2025-08-28 16:28:11 +02:00
ipc ipc: fix to protect IPCS lookups using RCU 2025-06-27 11:08:49 +01:00
kernel sched: Fix sched_numa_find_nth_cpu() if mask offline 2025-09-09 18:56:26 +02:00
lib netlink: add variable-length / auto integers 2025-09-09 18:56:22 +02:00
LICENSES
mm mm/slub: avoid accessing metadata when pointer is invalid in object_err() 2025-09-09 18:56:29 +02:00
net batman-adv: fix OOB read/write in network-coding decode 2025-09-09 18:56:27 +02:00
rust rust: module: place cleanup_module() in .exit.text section 2025-07-06 11:00:06 +02:00
samples samples: mei: Fix building on musl libc 2025-08-15 12:08:43 +02:00
scripts kconfig: lxdialog: fix 'space' to (de)select options 2025-08-28 16:28:29 +02:00
security apparmor: use the condition in AA_BUG_FMT even with debug disabled 2025-08-28 16:28:28 +02:00
sound ALSA: usb-audio: Add mute TLV for playback volumes on some devices 2025-09-09 18:56:25 +02:00
tools selftest: net: Fix weird setsockopt() in bind_bhash.c. 2025-09-09 18:56:25 +02:00
usr
virt
.clang-format
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore
.mailmap
.rustfmt.toml
COPYING
CREDITS
Kbuild
Kconfig
MAINTAINERS sign-file,extract-cert: move common SSL helper functions to a header 2025-04-25 10:45:57 +02:00
Makefile Linux 6.6.104 2025-09-04 15:30:29 +02:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.