From a8f6c2e54ddc1f1bedf3dfc92407e2b6678dd109 Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:45 -0700
Subject: xfs: document the motivation for online fsck design

Start the first chapter of the online fsck design documentation.
This covers the motivations for creating this in the first place.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 Documentation/filesystems/index.rst                |   1 +
 .../filesystems/xfs-online-fsck-design.rst         | 212 +++++++++++++++++++++
 2 files changed, 213 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index bee63d42e5ec..fbb2b5ada95b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -123,4 +123,5 @@ Documentation for filesystem implementations.
    vfat
    xfs-delayed-logging-design
    xfs-self-describing-metadata
+   xfs-online-fsck-design
    zonefs

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
new file mode 100644
index 000000000000..07c7b4cde18f
--- /dev/null
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_online_fsck_design:
+
+..
+   Mapping of heading styles within this document:
+   Heading 1 uses "====" above and below
+   Heading 2 uses "===="
+   Heading 3 uses "----"
+   Heading 4 uses "````"
+   Heading 5 uses "^^^^"
+   Heading 6 uses "~~~~"
+   Heading 7 uses "...."
+
+   Sections are manually numbered because apparently that's what everyone
+   does in the kernel.
+
+======================
+XFS Online Fsck Design
+======================
+
+This document captures the design of the online filesystem check feature for
+XFS.
+The purpose of this document is threefold:
+
+- To help kernel distributors understand exactly what the XFS online fsck
+  feature is, and the issues about which they should be aware.
+
+- To help people reading the code to familiarize themselves with the relevant
+  concepts and design points before they start digging into the code.
+
+- To help developers who maintain the system by capturing the reasons
+  supporting higher level decision making.
+
+As the online fsck code is merged, the links in this document to topic branches
+will be replaced with links to code.
+
+This document is licensed under the terms of the GNU Public License, v2.
+The primary author is Darrick J. Wong.
+
+This design document is split into seven parts.
+Part 1 defines what fsck tools are and the motivations for writing a new one.
+Parts 2 and 3 present a high level overview of how the online fsck process
+works and how it is tested to ensure correct functionality.
+Part 4 discusses the user interface and the intended usage modes of the new
+program.
+Parts 5 and 6 show off the high level components and how they fit together, and
+then present case studies of how each repair function actually works.
+Part 7 sums up what has been discussed so far and speculates about what else
+might be built atop online fsck.
+
+.. contents:: Table of Contents
+   :local:
+
+1. What is a Filesystem Check?
+==============================
+
+A Unix filesystem has four main responsibilities:
+
+- Provide a hierarchy of names through which application programs can associate
+  arbitrary blobs of data for any length of time,
+
+- Virtualize physical storage media across those names,
+
+- Retrieve the named data blobs at any time, and
+
+- Examine resource usage.
+
+Metadata directly supporting these functions (e.g. files, directories, space
+mappings) are sometimes called primary metadata.
+Secondary metadata (e.g. reverse mapping and directory parent pointers) support
+operations internal to the filesystem, such as internal consistency checking
+and reorganization.
+Summary metadata, as the name implies, condense information contained in
+primary metadata for performance reasons.
+
+The filesystem check (fsck) tool examines all the metadata in a filesystem
+to look for errors.
+In addition to looking for obvious metadata corruptions, fsck also
+cross-references different types of metadata records with each other to look
+for inconsistencies.
+People do not like losing data, so most fsck tools also contain some ability
+to correct any problems found.
+As a word of caution -- the primary goal of most Linux fsck tools is to restore
+the filesystem metadata to a consistent state, not to maximize the data
+recovered.
+That precedent will not be challenged here.
+
+Filesystems of the 20th century generally lacked any redundancy in the ondisk
+format, which means that fsck can only respond to errors by erasing files until
+errors are no longer detected.
+More recent filesystem designs contain enough redundancy in their metadata that
+it is now possible to regenerate data structures when non-catastrophic errors
+occur; this capability aids both strategies, since restoring metadata
+consistency no longer has to come at the cost of the data stored in the
+damaged structures.
+
++--------------------------------------------------------------------------+
+| **Note**:                                                                |
++--------------------------------------------------------------------------+
+| System administrators avoid data loss by increasing the number of        |
+| separate storage systems through the creation of backups; and they avoid |
+| downtime by increasing the redundancy of each storage system through the |
+| creation of RAID arrays.                                                 |
+| fsck tools address only the first problem.                               |
++--------------------------------------------------------------------------+
+
+TLDR; Show Me the Code!
+-----------------------
+
+Code is posted to the kernel.org git trees as follows:
+`kernel changes `_,
+`userspace changes `_, and
+`QA test changes `_.
+Each kernel patchset adding an online repair function will use the same branch
+name across the kernel, xfsprogs, and fstests git repos.
+
+Existing Tools
+--------------
+
+The online fsck tool described here will be the third tool in the history of
+XFS (on Linux) to check and repair filesystems.
+Two programs precede it:
+
+The first program, ``xfs_check``, was created as part of the XFS debugger
+(``xfs_db``) and can only be used with unmounted filesystems.
+It walks all metadata in the filesystem looking for inconsistencies, though it
+lacks any ability to repair what it finds.
+Due to its high memory requirements and inability to repair things, this
+program is now deprecated and will not be discussed further.
+
+The second program, ``xfs_repair``, was created to be faster and more robust
+than the first program.
+Like its predecessor, it can only be used with unmounted filesystems.
+It uses extent-based in-memory data structures to reduce memory consumption,
+and tries to schedule readahead IO appropriately to reduce IO waiting time
+while it scans the metadata of the entire filesystem.
+The most important feature of this tool is its ability to respond to
+inconsistencies in file metadata and the directory tree by erasing things as
+needed to eliminate problems.
+Space usage metadata are rebuilt from the observed file metadata.
+
+Problem Statement
+-----------------
+
+The current XFS tools leave several problems unsolved:
+
+1. **User programs** suddenly **lose access** to the filesystem when unexpected
+   shutdowns occur as a result of silent corruptions in the metadata.
+   These occur **unpredictably** and often without warning.
+
+2. **Users** experience a **total loss of service** during the recovery period
+   after an **unexpected shutdown** occurs.
+
+3. **Users** experience a **total loss of service** if the filesystem is taken
+   offline to **look for problems** proactively.
+
+4. **Data owners** cannot **check the integrity** of their stored data without
+   reading all of it.
+   This may expose them to substantial billing costs when a linear media scan
+   performed by the storage system administrator might suffice.
+
+5. **System administrators** cannot **schedule** a maintenance window to deal
+   with corruptions if they **lack the means** to assess filesystem health
+   while the filesystem is online.
+
+6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
+   health when doing so requires **manual intervention** and downtime.
+
+7. **Users** can be tricked into **doing things they do not desire** when
+   malicious actors **exploit quirks of Unicode** to place misleading names
+   in directories.
+
+Given this definition of the problems to be solved and the actors who would
+benefit, the proposed solution is a third fsck tool that acts on a running
+filesystem.
+
+This new third program has three components: an in-kernel facility to check
+metadata, an in-kernel facility to repair metadata, and a userspace driver
+program to drive fsck activity on a live filesystem.
+``xfs_scrub`` is the name of the driver program.
+The rest of this document presents the goals and use cases of the new fsck
+tool, describes its major design points in connection to those goals, and
+discusses the similarities and differences with existing tools.
+
++--------------------------------------------------------------------------+
+| **Note**:                                                                |
++--------------------------------------------------------------------------+
+| Throughout this document, the existing offline fsck tool can also be     |
+| referred to by its current name "``xfs_repair``".                        |
+| The userspace driver program for the new online fsck tool can be         |
+| referred to as "``xfs_scrub``".                                          |
+| The kernel portion of online fsck that validates metadata is called      |
+| "online scrub", and the portion of the kernel that fixes metadata is     |
+| called "online repair".                                                  |
++--------------------------------------------------------------------------+
+
+The naming hierarchy is broken up into objects known as directories and files,
+and the physical space is split into pieces known as allocation groups.
+Sharding enables better performance on highly parallel systems and helps to
+contain the damage when corruptions occur.
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks and
+repairs on a subset of the filesystem.
+
+While this is going on in one part of the filesystem, the other parts continue
+processing IO requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.
--
cgit v1.2.3

From 88757e04c985271b593fe298855a048ca0e9907f Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:45 -0700
Subject: xfs: document the general theory underlying online fsck design

Start the second chapter of the online fsck design documentation.
This covers the general theory underlying how online fsck works.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 404 +++++++++++++++++++++
 1 file changed, 404 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 07c7b4cde18f..0846935325b2 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -210,3 +210,407 @@ metadata to enable targeted checking and repair operations while the system
 is running.
 This capability will be coupled to automatic system management so that
 autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata objects,
+online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third are in the kernel, which implements functions to check
+and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**:                                                        |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work |
+| item" to "scrub item".                                           |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything that
+the offline fsck program can handle.
+However, online fsck cannot be running 100% of the time, which means that
+latent errors may creep in after a scrub completes.
+If these errors cause the next mount to fail, offline fsck is the only
+solution.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck is not a complete replacement for offline fsck, and
+a complete run of online fsck may take longer than a run of offline fsck.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+   discover the online fsck capabilities of the kernel, and open the
+   underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all quota
+   files.
+   Each metadata structure is scheduled as a separate scrub item.
+   If corruption is found in the inode header or inode btree and ``xfs_scrub``
+   is permitted to perform repairs, then those scrub items are repaired to
+   prepare for phase 3.
+   Repairs are implemented by using the information in the scrub item to
+   resubmit the kernel scrub call with the repair flag enabled; this is
+   discussed in the next section.
+   Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+   Each metadata structure is also scheduled as a separate scrub item.
+   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
+   and there were no problems detected during phase 2, then those scrub items
+   are repaired immediately.
+   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
+   phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+   phase, if the caller permits them.
+   Before starting repairs, the summary counters are checked and any necessary
+   repairs are performed so that subsequent repairs will not fail the resource
+   reservation step due to wildly incorrect summary counters.
+   Unsuccessful repairs are requeued as long as forward progress on repairs is
+   made somewhere in the filesystem.
+   Free space in the filesystem is trimmed at the end of phase 4 if the
+   filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+   must be correct.
+   Summary counters such as the free space counts and quota resource counts
+   are checked and corrected.
+   Directory entry names and extended attribute names are checked for
+   suspicious entries such as control characters or confusing Unicode sequences
+   appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+   file extents in the filesystem.
+   The ability to use hardware-assisted data file integrity checking is new
+   to online fsck; neither of the previous tools has this capability.
+   If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+   space usage and file counts.
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item:
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+   optimization; and for values that are directly controlled by the system
+   administrator but look suspicious.
+   If the item is not corrupt or does not need optimization, resources are
+   released and the positive scan results are returned to userspace.
+   If the item is corrupt or could be optimized but the caller does not permit
+   this, resources are released and the negative scan results are returned to
+   userspace.
+   Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+   Repair functions generally choose to rebuild a structure from other metadata
+   rather than try to salvage the existing structure.
+   If the repair fails, the scan results from the first step are returned to
+   userspace.
+   Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+   item to assess the efficacy of the repairs.
+   The results of the reassessment are returned to userspace.
+
+Classification of Metadata
+--------------------------
+
+Each type of metadata object (and therefore each type of scrub item) is
+classified as follows:
+
+Primary Metadata
+````````````````
+
+Metadata structures in this category should be most familiar to filesystem
+users either because they are directly created by the user or they index
+objects created by the user.
+Most filesystem objects fall into this class:
+
+- Free space and reference count information
+
+- Inode records and indexes
+
+- Storage mapping information for file data
+
+- Directories
+
+- Extended attributes
+
+- Symbolic links
+
+- Quota limits
+
+Scrub obeys the same rules as regular filesystem accesses for resource and lock
+acquisition.
+
+Primary metadata objects are the simplest for scrub to process.
+The principal filesystem object (either an allocation group or an inode) that
+owns the item being scrubbed is locked to guard against concurrent updates.
+The check function examines every record associated with the type for obvious
+errors and cross-references healthy records against other metadata to look for
+inconsistencies.
+Repairs for this class of scrub item are simple, since the repair function
+starts by holding all the resources acquired in the previous step.
+The repair function scans available metadata as needed to record all the
+observations needed to complete the structure.
+Next, it stages the observations in a new ondisk structure and commits it
+atomically to complete the repair.
+Finally, the storage from the old data structure is carefully reaped.
+
+Because ``xfs_scrub`` locks a primary object for the duration of the repair,
+this is effectively an offline repair operation performed on a subset of the
+filesystem.
+This minimizes the complexity of the repair code because it is not necessary to
+handle concurrent updates from other threads, nor is it necessary to access
+any other part of the filesystem.
+As a result, indexed structures can be rebuilt very quickly, and programs
+trying to access the damaged structure will be blocked until repairs complete.
+The only infrastructure needed by the repair code is the staging area for
+observations and a means to write new structures to disk.
+Despite these limitations, the advantage that online repair holds is clear:
+targeted work on individual shards of the filesystem avoids total loss of
+service.
+
+This mechanism is described in section 2.1 ("Off-Line Algorithm") of
+V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
+Algorithms" `_,
+*Extending Database Technology*, pp. 293-309, 1992.
+
+Most primary metadata repair functions stage their intermediate results in an
+in-memory array prior to formatting the new ondisk structure, which is very
+similar to the list-based algorithm discussed in section 2.3 ("List-Based
+Algorithms") of Srinivasan.
+However, any data structure builder that maintains a resource lock for the
+duration of the repair is *always* an offline algorithm.
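+
+To make the shape of this pattern concrete, here is a minimal userspace
+sketch of the list-based, offline-style algorithm described above.
+It is not the kernel implementation; every name in it is invented for this
+document, and the scan, commit, and reap steps are stubbed out:
+
+.. code-block:: c
+
+    #include <stdint.h>
+    #include <stdlib.h>
+
+    /* One staged observation, e.g. one record of the index being rebuilt. */
+    struct stage_rec {
+        uint64_t key;
+        uint64_t value;
+    };
+
+    static int stage_cmp(const void *a, const void *b)
+    {
+        const struct stage_rec *l = a, *r = b;
+
+        if (l->key != r->key)
+            return l->key < r->key ? -1 : 1;
+        return 0;
+    }
+
+    /* The caller holds the owner lock for the entire repair. */
+    static int rebuild_primary_index(struct stage_rec *staging, size_t nr)
+    {
+        /*
+         * 1. The caller's scan loop has already recorded every
+         *    observation in the in-memory staging array.
+         *
+         * 2. Sort the staged observations into index order.
+         */
+        qsort(staging, nr, sizeof(*staging), stage_cmp);
+
+        /*
+         * 3. Format a new ondisk structure from the sorted records,
+         *    commit it atomically, and reap the old blocks.
+         *    (Stubbed out in this sketch.)
+         */
+        return 0;
+    }
+
+Because the owner lock is held throughout, no concurrent updates ever have to
+be merged into the staging area, which is exactly what keeps this case simple.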
+
+Secondary Metadata
+``````````````````
+
+Metadata structures in this category reflect records found in primary metadata,
+but are only needed for online fsck or for reorganization of the filesystem.
+
+Secondary metadata include:
+
+- Reverse mapping information
+
+- Directory parent pointers
+
+This class of metadata is difficult for scrub to process because scrub attaches
+to the secondary object but needs to check primary metadata, which runs counter
+to the usual order of resource acquisition.
+Frequently, this means that full filesystem scans are necessary to rebuild the
+metadata.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index will either have the same format as the ondisk structure or a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+While the filesystem scan is in progress, the repair function hooks the
+filesystem so that it can apply pending filesystem updates to the staging
+information.
+Once the scan is done, the owning object is re-locked, the live data is used to
+write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage from the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems, but
+comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
+
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data to
+build the new structure as quickly as possible; and an auxiliary structure that
+captures all updates that would be committed to the index by other threads were
+the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index
+while repair is running.
+The complexity of such an approach would be very high and perhaps more
+appropriate to building *new* indices.
+
+**Future Work Question**: Can the full scan and live update code used to
+facilitate a repair also be used to implement a comprehensive check?
+
+*Answer*: In theory, yes. Check would be much stronger if each scrub function
+employed these live scans to build a shadow copy of the metadata and then
+compared the shadow records to the ondisk records.
+However, doing that is a fair amount more work than what the checking functions
+do now, and it would increase the runtime of those scrub functions.
+The live scans and hooks were developed much later than the checking
+functions.
+
+Summary Information
+```````````````````
+
+Metadata structures in this last category summarize the contents of primary
+metadata records.
+These are often used to speed up resource usage queries, and are many times
+smaller than the primary metadata which they represent.
+
+Examples of summary information include:
+
+- Summary counts of free space and inodes
+
+- File link counts from directories
+
+- Quota resource usage counts
+
+Check and repair require full filesystem scans, but resource and lock
+acquisition follow the same paths as regular filesystem accesses.
+
+The superblock summary counters have special requirements due to the underlying
+implementation of the incore counters, and will be treated separately.
+Check and repair of the other types of summary counters (quota resource counts
+and file link counts) employ the same filesystem scanning and hooking
+techniques as outlined above, but because the underlying data are sets of
+integer counters, the staging data need not be a fully functional mirror of the
+ondisk structure.
+
+Inspiration for quota and file link count repair strategies was drawn from
+sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
+Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
+and Their Indexes"
+`_, 2011.
+
+Since quotas are non-negative integer counts of resource usage, online
+quotacheck can use the incremental view deltas described in section 2.14 to
+track pending changes to the block and inode usage counts in each transaction,
+and commit those changes to a dquot side file when the transaction commits.
+Delta tracking is necessary for dquots because the index builder scans inodes,
+whereas the data structure being rebuilt is an index of dquots.
+Link count checking combines the view deltas and commit step into one because
+it sets attributes of the objects being scanned instead of writing them to a
+separate data structure.
+Each online fsck function will be discussed as a case study later in this
+document.
+
+Risk Management
+---------------
+
+During the development of online fsck, several risk factors were identified
+that may make the feature unsuitable for certain distributors and users.
+Steps can be taken to mitigate or eliminate those risks, though at a cost to
+functionality.
+
+- **Decreased performance**: Adding metadata indices to the filesystem
+  increases the time cost of persisting changes to disk, and the reverse space
+  mapping and directory parent pointers are no exception.
+  System administrators who require the maximum performance can disable the
+  reverse mapping features at format time, though this choice dramatically
+  reduces the ability of online fsck to find inconsistencies and repair them.
+
+- **Incorrect repairs**: As with all software, there might be defects in the
+  software that result in incorrect repairs being written to the filesystem.
+  Systematic fuzz testing (detailed in the next section) is employed by the
+  authors to find bugs early, but it might not catch everything.
+  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
+  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
+  accept this risk.
+  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
+  disables building of the ``xfs_scrub`` binary, though this is not a risk
+  mitigation if the kernel functionality remains enabled.
+
+- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
+  repairable.
+  If the keyspaces of several metadata indices overlap in some manner but a
+  coherent narrative cannot be formed from records collected, then the repair
+  fails.
+  To reduce the chance that a repair will fail with a dirty transaction and
+  render the filesystem unusable, the online repair functions have been
+  designed to stage and validate all new records before committing the new
+  structure.
+
+- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
+  devices, opening files by handle, ignoring Unix discretionary access control,
+  and the ability to perform administrative changes.
+  Running this automatically in the background scares people, so the systemd
+  background service is configured to run with only the privileges required.
+  Obviously, this cannot address certain problems like the kernel crashing or
+  deadlocking, but it should be sufficient to prevent the scrub process from
+  escaping and reconfiguring the system.
+  The cron job does not have this protection.
+
+- **Fuzz Kiddiez**: There are many people now who seem to think that running
+  automated fuzz testing of ondisk artifacts to find mischievous behavior and
+  spraying exploit code onto the public mailing list for instant zero-day
+  disclosure is somehow of some social benefit.
+  In the view of this author, the benefit is realized only when the fuzz
+  operators help to **fix** the flaws, but this opinion apparently is not
+  widely shared among security "researchers".
+  The XFS maintainers' continuing ability to manage these events presents an
+  ongoing risk to the stability of the development process.
+  Automated testing should front-load some of the risk while the feature is
+  considered EXPERIMENTAL.
+
+Many of these risks are inherent to software programming.
+Despite this, it is hoped that this new functionality will prove useful in
+reducing unexpected downtime.
--
cgit v1.2.3

From 9a30b5b52180bc37c7b3cf0efb57ba1297178856 Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:46 -0700
Subject: xfs: document the testing plan for online fsck

Start the third chapter of the online fsck design documentation.
This covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things worse.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Allison Henderson 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 186 +++++++++++++++++++++
 1 file changed, 186 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 0846935325b2..ed9b83c4dbf7 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -614,3 +614,189 @@ functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests `_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair with that of
+the existing offline repair tools.
+ +General Fuzz Testing of Metadata Blocks +--------------------------------------- + +XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. + +Before development of online fsck even began, a set of fstests were created +to test the rather common fault that entire metadata blocks get corrupted. +This required the creation of fstests library code that can create a filesystem +containing every possible type of metadata object. +Next, individual test cases were created to create a test filesystem, identify +a single block of a specific type of metadata object, trash it with the +existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a +particular metadata validation strategy. + +This earlier test suite enabled XFS developers to test the ability of the +in-kernel validation functions and the ability of the offline fsck tool to +detect and eliminate the inconsistent metadata. +This part of the test suite was extended to cover online fsck in exactly the +same manner. + +In other words, for a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem: + + * Write garbage to it + + * Test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline repair (``xfs_repair``) to detect and fix + 3. Online repair (``xfs_scrub``) to detect and fix + +Targeted Fuzz Testing of Metadata Records +----------------------------------------- + +The testing plan for online fsck includes extending the existing fs testing +infrastructure to provide a much more powerful facility: targeted fuzz testing +of every metadata field of every metadata object in the filesystem. +``xfs_db`` can modify every field of every metadata structure in every +block in the filesystem to simulate the effects of memory corruption and +software bugs. +Given that fstests already contains the ability to create a filesystem +containing every metadata format known to the filesystem, ``xfs_db`` can be +used to perform exhaustive fuzz testing! + +For a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem... + + * For each record inside that metadata object... + + * For each field inside that record... + + * For each conceivable type of transformation that can be applied to a bit field... + + 1. Clear all bits + 2. Set all bits + 3. Toggle the most significant bit + 4. Toggle the middle bit + 5. Toggle the least significant bit + 6. Add a small quantity + 7. Subtract a small quantity + 8. Randomize the contents + + * ...test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline checking (``xfs_repair -n``) + 3. Offline repair (``xfs_repair``) + 4. Online checking (``xfs_scrub -n``) + 5. Online repair (``xfs_scrub``) + 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) + +This is quite the combinatoric explosion! + +Fortunately, having this much test coverage makes it easy for XFS developers to +check the responses of XFS' fsck tools. +Since the introduction of the fuzz testing framework, these tests have been +used to discover incorrect repair code and missing functionality for entire +classes of metadata objects in ``xfs_repair``. +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by +confirming that ``xfs_repair`` could detect at least as many corruptions as +the older tool. 
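+
+The eight transformations map directly onto simple bit operations.
+As an illustrative sketch (this is not the fstests or ``xfs_db`` source; the
+enum and helper below are invented for this document), each fuzz verb applied
+to an *n*-bit field can be expressed like this:
+
+.. code-block:: c
+
+    #include <stdint.h>
+    #include <stdlib.h>
+
+    /* The eight transformations, in the order listed above. */
+    enum fuzzverb {
+        FZ_ZEROES,          /* clear all bits */
+        FZ_ONES,            /* set all bits */
+        FZ_FIRSTBIT,        /* toggle the most significant bit */
+        FZ_MIDDLEBIT,       /* toggle the middle bit */
+        FZ_LASTBIT,         /* toggle the least significant bit */
+        FZ_ADD,             /* add a small quantity */
+        FZ_SUB,             /* subtract a small quantity */
+        FZ_RANDOM,          /* randomize the contents */
+    };
+
+    /* Apply @verb to a field that is @nbits wide (1 <= @nbits <= 64). */
+    static uint64_t fuzz_field(uint64_t val, unsigned int nbits,
+                               enum fuzzverb verb)
+    {
+        uint64_t mask = nbits >= 64 ? ~0ULL : (1ULL << nbits) - 1;
+
+        switch (verb) {
+        case FZ_ZEROES:     val = 0; break;
+        case FZ_ONES:       val = mask; break;
+        case FZ_FIRSTBIT:   val ^= 1ULL << (nbits - 1); break;
+        case FZ_MIDDLEBIT:  val ^= 1ULL << (nbits / 2); break;
+        case FZ_LASTBIT:    val ^= 1; break;
+        case FZ_ADD:        val += 3; break;
+        case FZ_SUB:        val -= 3; break;
+        case FZ_RANDOM:     val = ((uint64_t)rand() << 32) ^ rand(); break;
+        }
+        return val & mask;
+    }
+
+These verbs correspond to the actions understood by the ``fuzz`` command in
+recent versions of ``xfs_db``, which is what the test suite drives to corrupt
+one field at a time.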
+ +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they +allow the online fsck developers to compare online fsck against offline fsck, +and they enable XFS developers to find deficiencies in the code base. + +Proposed patchsets include +`general fuzzer improvements +`_, +`fuzzing baselines +`_, +and `improvements in fuzz testing comprehensiveness +`_. + +Stress Testing +-------------- + +A unique requirement to online fsck is the ability to operate on a filesystem +concurrently with regular workloads. +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable +impact on the running system, the online repair code should never introduce +inconsistencies into the filesystem metadata, and regular workloads should +never notice resource starvation. +To verify that these conditions are being met, fstests has been enhanced in +the following ways: + +* For each scrub item type, create a test to exercise checking that item type + while running ``fsstress``. +* For each scrub item type, create a test to exercise repairing that item type + while running ``fsstress``. +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole + filesystem doesn't cause problems. +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that + force-repairing the whole filesystem doesn't cause problems. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + freezing and thawing the filesystem. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + remounting the filesystem read-only and read-write. +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) + +Success is defined by the ability to run all of these tests without observing +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang +check warnings, or any other sort of mischief. + +Proposed patchsets include `general stress testing +`_ +and the `evolution of existing per-function stress testing +`_. -- cgit v1.2.3 From 4f7f64697025cd738685a3497e7d5252f3adf825 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:47 -0700 Subject: xfs: document the user interface for online fsck Start the fourth chapter of the online fsck design documentation, which discusses the user interface and the background scrubbing service. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 113 +++++++++++++++++++++ 1 file changed, 113 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index ed9b83c4dbf7..1411c09b9677 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -800,3 +800,116 @@ Proposed patchsets include `general stress testing `_ and the `evolution of existing per-function stress testing `_. + +4. User Interface +================= + +The primary user of online fsck is the system administrator, just like offline +repair. +Online fsck presents two modes of operation to administrators: +A foreground CLI process for online fsck on demand, and a background service +that performs autonomous checking and repair. + +Checking on Demand +------------------ + +For administrators who want the absolute freshest information about the +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on +a command line. 
+The program checks every piece of metadata in the filesystem while the +administrator waits for the results to be reported, just like the existing +``xfs_repair`` tool. +Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` +option to increase the verbosity of the information reported. + +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error +correction capabilities of the hardware to check data file contents. +The media scan is not enabled by default because it may dramatically increase +program runtime and consume a lot of bandwidth on older storage hardware. + +The output of a foreground invocation is captured in the system log. + +The ``xfs_scrub_all`` program walks the list of mounted filesystems and +initiates ``xfs_scrub`` for each of them in parallel. +It serializes scans for any filesystems that resolve to the same top level +kernel block device to prevent resource overconsumption. + +Background Service +------------------ + +To reduce the workload of system administrators, the ``xfs_scrub`` package +provides a suite of `systemd `_ timers and services that +run online fsck automatically on weekends by default. +The background service configures scrub to run with as little privilege as +possible, the lowest CPU and IO priority, and in a CPU-constrained single +threaded mode. +This can be tuned by the systemd administrator at any time to suit the latency +and throughput requirements of customer workloads. + +The output of the background service is also captured in the system log. +If desired, reports of failures (either due to inconsistencies or mere runtime +errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment +variable in the following service files: + +* ``xfs_scrub_fail@.service`` +* ``xfs_scrub_media_fail@.service`` +* ``xfs_scrub_all_fail.service`` + +The decision to enable the background scan is left to the system administrator. +This can be done by enabling either of the following services: + +* ``xfs_scrub_all.timer`` on systemd systems +* ``xfs_scrub_all.cron`` on non-systemd systems + +This automatic weekly scan is configured out of the box to perform an +additional media scan of all file data once per month. +This is less foolproof than, say, storing file data block checksums, but much +more performant if application software provides its own integrity checking, +redundancy can be provided elsewhere above the filesystem, or the storage +device's integrity guarantees are deemed sufficient. + +The systemd unit file definitions have been subjected to a security audit +(as of systemd 249) to ensure that the xfs_scrub processes have as little +access to the rest of the system as possible. +This was performed via ``systemd-analyze security``, after which privileges +were restricted to the minimum required, sandboxing was set up to the maximal +extent possible with sandboxing and system call filtering; and access to the +filesystem tree was restricted to the minimum needed to start the program and +access the filesystem being scanned. +The service definition files restrict CPU usage to 80% of one CPU core, and +apply as nice of a priority to IO and CPU scheduling as possible. +This measure was taken to minimize delays in the rest of the filesystem. +No such hardening has been performed for the cron job. + +Proposed patchset: +`Enabling the xfs_scrub background service +`_. + +Health Reporting +---------------- + +XFS caches a summary of each filesystem's health status in memory. 
+The information is updated whenever ``xfs_scrub`` is run, or whenever +inconsistencies are detected in the filesystem metadata during regular +operations. +System administrators should use the ``health`` command of ``xfs_spaceman`` to +download this information into a human-readable format. +If problems have been observed, the administrator can schedule a reduced +service window to run the online repair tool to correct the problem. +Failing that, the administrator can decide to schedule a maintenance window to +run the traditional offline repair tool to correct the problem. + +**Future Work Question**: Should the health reporting integrate with the new +inotify fs error notification system? +Would it be helpful for sysadmins to have a daemon to listen for corruption +notifications and initiate a repair? + +*Answer*: These questions remain unanswered, but should be a part of the +conversation with early adopters and potential downstream users of XFS. + +Proposed patchsets include +`wiring up health reports to correction returns +`_ +and +`preservation of sickness info during memory reclaim +`_. -- cgit v1.2.3 From e5edad526262d4ce389f7259396184975a9bc14e Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:47 -0700 Subject: xfs: document the filesystem metadata checking strategy Begin the fifth chapter of the online fsck design documentation, where we discuss the details of the data structures and algorithms used by the kernel to examine filesystem metadata and cross-reference it around the filesystem. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 587 +++++++++++++++++++++ .../filesystems/xfs-self-describing-metadata.rst | 1 + 2 files changed, 588 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index 1411c09b9677..4a19c70434aa 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -913,3 +913,590 @@ Proposed patchsets include and `preservation of sickness info during memory reclaim `_. + +5. Kernel Algorithms and Data Structures +======================================== + +This section discusses the key algorithms and data structures of the kernel +code that provide the ability to check and repair metadata while the system +is running. +The first chapters in this section reveal the pieces that provide the +foundation for checking metadata. +The remainder of this section presents the mechanisms through which XFS +regenerates itself. + +Self Describing Metadata +------------------------ + +Starting with XFS version 5 in 2012, XFS updated the format of nearly every +ondisk block header to record a magic number, a checksum, a universally +"unique" identifier (UUID), an owner code, the ondisk address of the block, +and a log sequence number. +When loading a block buffer from disk, the magic number, UUID, owner, and +ondisk address confirm that the retrieved block matches the specific owner of +the current filesystem, and that the information contained in the block is +supposed to be found at the ondisk address. +The first three components enable checking tools to disregard alleged metadata +that doesn't belong to the filesystem, and the fourth component enables the +filesystem to detect lost writes. + +Whenever a file system operation modifies a block, the change is submitted +to the log as part of a transaction. 
+The log then processes these transactions, marking them done once they are
+safely persisted to storage.
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other discrepancies that can
+be introduced between the computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
+
+For more information, please see
+Documentation/filesystems/xfs-self-describing-metadata.rst
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or more space-efficient encoding techniques.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remain unchanged.
+Like any system that improves redundancy, the reverse-mapping feature increases
+overhead costs for space mapping activities.
+However, it has two critical advantages: first, the reverse index is key to
+enabling online fsck and other requested functionality such as free space
+defragmentation, better media failure reporting, and filesystem shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication because the filesystem requires real
+redundancy.
+
++--------------------------------------------------------------------------+
+| **Sidebar**:                                                             |
++--------------------------------------------------------------------------+
+| A criticism of adding the secondary index is that it does nothing to     |
+| improve the robustness of user data storage itself.                      |
+| This is a valid point, but adding a new index for file data block        |
+| checksums increases write amplification by turning data overwrites into  |
+| copy-writes, which age the filesystem prematurely.                       |
+| In keeping with thirty years of precedent, users who want file data      |
+| integrity can supply as powerful a solution as they require.             |
+| As for metadata, the complexity of adding a new secondary index of space |
+| usage is much less than adding volume management and storage device      |
+| mirroring to XFS itself.                                                 |
+| Perfection of RAID and volume management is best left to existing        |
+| layers in the kernel.                                                    |
++--------------------------------------------------------------------------+
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+    struct xfs_rmap_irec {
+        xfs_agblock_t    rm_startblock;   /* extent start block */
+        xfs_extlen_t     rm_blockcount;   /* extent length */
+        uint64_t         rm_owner;        /* extent owner */
+        uint64_t         rm_offset;       /* offset within the owner */
+        unsigned int     rm_flags;        /* state flags */
+    };
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent?  A file mapping btree extent?  Or an
+unwritten data extent?
+
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+  marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
+
+There are several observations to make about reverse mapping indices:
+
+1. Reverse mappings can provide a positive affirmation of correctness if any of
+   the above primary metadata are in doubt.
+   The checking code for most primary metadata follows a path similar to the
+   one outlined above.
+
+2. Proving the consistency of secondary metadata with the primary metadata is
+   difficult because that requires a full scan of all primary space metadata,
+   which is very time intensive.
+   For example, checking a reverse mapping record for a file extent mapping
+   btree block requires locking the file and searching the entire btree to
+   confirm the block.
+   Instead, scrub relies on rigorous cross-referencing during the primary space
+   mapping structure checks.
+
+3. Consistency scans must use non-blocking lock acquisition primitives if the
+   required locking order is not the same order used by regular filesystem
+   operations.
+   For example, if the filesystem normally takes a file ILOCK before taking
+   the AGF buffer lock but scrub wants to take a file ILOCK while holding
+   an AGF buffer lock, scrub cannot block on that second acquisition.
+   This means that forward progress during this part of a scan of the reverse
+   mapping data cannot be guaranteed if system load is heavy.
+
+In summary, reverse mappings play a key role in reconstruction of primary
+metadata.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+decisions about the health of a metadata structure:
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
+
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS consists of the metadata
+verifiers built into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+  This assumes that metadata blocks only have one owner, which is always true
+  in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+  scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
+
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata force the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
+
+Internal Consistency Checks
+```````````````````````````
+
+After the buffer cache, the next level of metadata protection is the internal
+record verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+These higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+  the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this nature, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+  labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+After internal block checks, the next higher level of checking is
+cross-referencing records between metadata structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the
+  type of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+
+   - If there's a parent node block, do the keys listed for this block match
+     the keyspace of this block?
+
+   - Do the sibling pointers point to valid blocks?  Of the same level?
+
+   - Do the child pointers point to valid blocks?  Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Does the reverse mapping index list no owners of this space?
+
+   - Is this space not claimed by the inode index for inodes?
+
+   - Is it not mentioned by the reference count index?
+
+   - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+   - Do cleared bits in the holemask correspond with inode clusters?
+
+   - Do set bits in the freemask correspond with inode records with zero link
+     count?
+
+5. Inode records are cross-referenced as follows:
+
+   - Everything in class 1.
+
+   - Do all the fields that summarize information about the file forks
+     actually match those forks?
+
+   - Does each inode with zero link count correspond to a record in the free
+     inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is this space not mentioned by the inode btrees?
+
+   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+     reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Within the space subkeyspace of the rmap btree (that is to say, all
+     records mapped to a particular space extent and ignoring the owner info),
+     are there the same number of reverse mapping records for each block as
+     the reference count record claims?
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+`_,
+`inode btree
+`_, and
+`rmap btree
+`_ records;
+to find
+`mergeable records
+`_;
+and to
+`improve cross referencing with rmap
+`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins,
+security contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but
+otherwise each of the three types of blocks can be found at any offset in the
+attr fork.
+Leaf blocks contain attribute key records that point to the name and the
+value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with
+files constituting the nodes, and directory entries (dirents) constituting
+the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one directory pointing to the file.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories point to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition
+(which exists as post-EOF extents) is populated with a block containing free
+space information and an index that maps hashes of the dirent names to
+directory data blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point
+   to dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows:
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure
+   that the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
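+As a rough sketch, the per-dirent checks in step 2 above might be structured
+as follows.
+Every type, function, and field name here is illustrative pseudocode, not
+the actual scrub code:
+
+.. code-block:: c
+
+	/* Illustrative pseudocode for the per-dirent checks in step 2. */
+	struct dirent_info {
+		const char	*name;
+		size_t		namelen;
+		uint64_t	inumber;
+		uint8_t		ftype;	/* optional file type hint */
+	};
+
+	static int
+	check_one_dirent(struct scrub_ctx *sc, const struct dirent_info *de)
+	{
+		struct inode_info	child;
+		int			error;
+
+		/* 2a: the name must not contain invalid characters. */
+		if (!name_is_valid(de->name, de->namelen))
+			return mark_corrupt(sc);
+
+		/* 2b: the inumber must point to an allocated inode. */
+		error = lookup_inode(sc, de->inumber, &child);
+		if (error)
+			return mark_corrupt(sc);
+
+		/* 2c: the child inode must have a nonzero link count. */
+		if (child.link_count == 0)
+			mark_corrupt(sc);
+
+		/* 2d: a file type hint must match the inode itself. */
+		if (de->ftype != DT_UNKNOWN && de->ftype != child.ftype)
+			mark_corrupt(sc);
+
+		release_inode(sc, &child);
+		return 0;
+	}
+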
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to
+dabtree leaf blocks, and dabtree leaf records pointing to non-dabtree blocks
+elsewhere in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for
+  directory or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling the system administrator to monitor the
+status of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
index b79dbf36dc94..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -1,4 +1,5 @@
 .. SPDX-License-Identifier: GPL-2.0
+..
 _xfs_self_describing_metadata:
 
 ============================
 XFS Self Describing Metadata
--
cgit v1.2.3

From bae43864c08884bacde35c5aba36d0335f659868 Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong"
Date: Tue, 11 Apr 2023 18:59:48 -0700
Subject: xfs: document how online fsck deals with eventual consistency

Writes to an XFS filesystem employ an eventual consistency update model
to break up complex multistep metadata updates into small chained
transactions.  This is generally good for performance and scalability
because XFS doesn't need to prepare for enormous transactions, but it
also means that online fsck must be careful not to attempt a fsck
action unless it can be shown that there are no other threads
processing a transaction chain.  This part of the design documentation
covers the thinking behind the consistency model and how scrub deals
with it.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
---
 .../filesystems/xfs-online-fsck-design.rst         | 332 +++++++++++++++++++++
 1 file changed, 332 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 4a19c70434aa..9232b2e442c8 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -1500,3 +1500,335 @@ This step is critical for enabling the system administrator to monitor the
 status of the filesystem and the progress of any repairs.
 For developers, it is a useful means to judge the efficacy of error detection
 and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a
+chain, online checking must coordinate with chained operations that are in
+progress to avoid incorrectly detecting inconsistencies due to pending chains.
+Furthermore, online repair must not run when operations are pending because
+the metadata are temporarily inconsistent with each other, and rebuilding is
+not possible.
+
+Only online fsck has this requirement of total consistency of AG metadata,
+and such checks should be relatively rare as compared to filesystem change
+operations.
+Online fsck coordinates with transaction chains as follows:
+
+* For each AG, maintain a count of intent items targeting that AG.
+  The count should be bumped whenever a new item is added to the chain.
+  The count should be dropped when the filesystem has locked the AG header
+  buffers and finished the work.
+
+* When online fsck wants to examine an AG, it should lock the AG header
+  buffers to quiesce all transaction chains that want to modify that AG.
+  If the count is zero, proceed with the checking operation.
+  If it is nonzero, cycle the buffer locks to allow the chain to make forward
+  progress.
+
+This may lead to online fsck taking a long time to complete, but regular
+filesystem updates take precedence over background checking activity.
+Details about the discovery of this situation are presented in the
+:ref:`next section <chain_coordination>`, and details about the solution
+are presented :ref:`after that <intent_drains>`.
+
+..
 _chain_coordination:
+
+Discovery of the Problem
+````````````````````````
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction
+chains created by other writer threads that resulted in false reports of
+metadata inconsistency.
+The root cause of these reports is the eventual consistency model introduced
+by the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree
+block in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI)
+log items to commit to freeing some space in one transaction while deferring
+the actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block
+   mapping structures to remove the mapping from the btree blocks.
+   It then attaches to the in-memory transaction an action item to schedule
+   deferred freeing of space.
+   Concretely, each transaction maintains a list of ``struct
+   xfs_defer_pending`` objects, each of which maintains a list of ``struct
+   xfs_extent_free_item`` objects.
+   Returning to the example above, the action item tracks the freeing of both
+   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+   AG 3.
+   Deferred frees recorded in this manner are committed in the log by
+   creating an EFI log item from the ``struct xfs_extent_free_item`` object
+   and attaching the log item to the transaction.
+   When the log is persisted to disk, the EFI item is written into the ondisk
+   transaction record.
+   EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to
+   the free space btrees of AG 7 to release the unmapped file space.
+   Observe that the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the
+filesystem but before #2 is committed, a scan of the filesystem metadata
+would show inconsistent filesystem metadata because there would not appear
+to be any owner of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery
+finds an intent log item but does not find a corresponding intent done item,
+it will reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the
+recovered EFI to complete the recovery phase.
+
+There are subtleties to XFS' transaction chaining strategy to consider:
+
+* Log items must be added to a transaction in the correct order to prevent
+  conflicts with principal objects that are not held by the transaction.
+ In other words, all per-AG metadata updates for an unmapped block must be + completed before the last update to free the extent, and extents should not + be reallocated until that last update commits to the log. + +* AG header buffers are released between each transaction in a chain. + This means that other threads can observe an AG in an intermediate state, + but as long as the first subtlety is handled, this should not affect the + correctness of filesystem operations. + +* Unmounting the filesystem flushes all pending work to disk, which means that + offline fsck never sees the temporary inconsistencies caused by deferred + work item processing. + +In this manner, XFS employs a form of eventual consistency to avoid deadlocks +and increase parallelism. + +During the design phase of the reverse mapping and reflink features, it was +decided that it was impractical to cram all the reverse mapping updates for a +single filesystem change into a single transaction because a single file +mapping operation can explode into many small updates: + +* The block mapping update itself +* A reverse mapping update for the block mapping update +* Fixing the freelist +* A reverse mapping update for the freelist fix + +* A shape change to the block mapping btree +* A reverse mapping update for the btree update +* Fixing the freelist (again) +* A reverse mapping update for the freelist fix + +* An update to the reference counting information +* A reverse mapping update for the refcount update +* Fixing the freelist (a third time) +* A reverse mapping update for the freelist fix + +* Freeing any space that was unmapped and not owned by any other file +* Fixing the freelist (a fourth time) +* A reverse mapping update for the freelist fix + +* Freeing the space used by the block mapping btree +* Fixing the freelist (a fifth time) +* A reverse mapping update for the freelist fix + +Free list fixups are not usually needed more than once per AG per transaction +chain, but it is theoretically possible if space is very tight. +For copy-on-write updates this is even worse, because this must be done once to +remove the space from a staging area and again to map it into the file! + +To deal with this explosion in a calm manner, XFS expands its use of deferred +work items to cover most reverse mapping updates and all refcount updates. +This reduces the worst case size of transaction reservations by breaking the +work into a long chain of small updates, which increases the degree of eventual +consistency in the system. +Again, this generally isn't a problem because XFS orders its deferred work +items carefully to avoid resource reuse conflicts between unsuspecting threads. + +However, online fsck changes the rules -- remember that although physical +updates to per-AG structures are coordinated by locking the buffers for AG +headers, buffer locks are dropped between transactions. +Once scrub acquires resources and takes locks for a data structure, it must do +all the validation work without releasing the lock. +If the main lock for a space btree is an AG header buffer lock, scrub may have +interrupted another thread that is midway through finishing a chain. +For example, if a thread performing a copy-on-write has completed a reverse +mapping update but not the corresponding refcount update, the two AG btrees +will appear inconsistent to scrub and an observation of corruption will be +recorded. This observation will not be correct. +If a repair is attempted in this state, the results will be catastrophic! 
+ +Several other solutions to this problem were evaluated upon discovery of this +flaw and rejected: + +1. Add a higher level lock to allocation groups and require writer threads to + acquire the higher level lock in AG order before making any changes. + This would be very difficult to implement in practice because it is + difficult to determine which locks need to be obtained, and in what order, + without simulating the entire operation. + Performing a dry run of a file operation to discover necessary locks would + make the filesystem very slow. + +2. Make the deferred work coordinator code aware of consecutive intent items + targeting the same AG and have it hold the AG header buffers locked across + the transaction roll between updates. + This would introduce a lot of complexity into the coordinator since it is + only loosely coupled with the actual deferred work items. + It would also fail to solve the problem because deferred work items can + generate new deferred subtasks, but all subtasks must be complete before + work can start on a new sibling task. + +3. Teach online fsck to walk all transactions waiting for whichever lock(s) + protect the data structure being scrubbed to look for pending operations. + The checking and repair operations must factor these pending operations into + the evaluations being performed. + This solution is a nonstarter because it is *extremely* invasive to the main + filesystem. + +.. _intent_drains: + +Intent Drains +````````````` + +Online fsck uses an atomic intent item counter and lock cycling to coordinate +with transaction chains. +There are two key properties to the drain mechanism. +First, the counter is incremented when a deferred work item is *queued* to a +transaction, and it is decremented after the associated intent done log item is +*committed* to another transaction. +The second property is that deferred work can be added to a transaction without +holding an AG header lock, but per-AG work items cannot be marked done without +locking that AG header buffer to log the physical updates and the intent done +log item. +The first property enables scrub to yield to running transaction chains, which +is an explicit deprioritization of online fsck to benefit file operations. +The second property of the drain is key to the correct coordination of scrub, +since scrub will always be able to decide if a conflict is possible. + +For regular filesystem code, the drain works as follows: + +1. Call the appropriate subsystem function to add a deferred work item to a + transaction. + +2. The function calls ``xfs_defer_drain_bump`` to increase the counter. + +3. When the deferred item manager wants to finish the deferred work item, it + calls ``->finish_item`` to complete it. + +4. The ``->finish_item`` implementation logs some changes and calls + ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads + waiting on the drain. + +5. The subtransaction commits, which unlocks the resource associated with the + intent item. + +For scrub, the drain works as follows: + +1. Lock the resource(s) associated with the metadata being scrubbed. + For example, a scan of the refcount btree would lock the AGI and AGF header + buffers. + +2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no + chains in progress and the operation may proceed. + +3. Otherwise, release the resources grabbed in step 1. + +4. 
Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then
+   go back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads
+to be woken up whenever the intent count drops to zero.
+
+The proposed patchset is the
+`scrub intent drain series
+`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure
+only to find that no further action is necessary is expensive -- on the
+author's computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck
+isn't running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful
+not to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows:
+
+- The hooked part of XFS should declare a static-scoped static key that
+  defaults to false.
+  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+  The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+  filesystem should call the ``static_branch_unlikely`` predicate to avoid
+  the scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+  static key.
+  Wrapper functions make it easy to compile out the relevant code if the
+  kernel distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+  ``xchk_fsgates_enable`` from the setup function to enable a specific
+  hook.
+  This must be done before obtaining any resources that are used by memory
+  reclaim.
+ Callers had better be sure they really need the functionality gated by the + static key; the ``TRY_HARDER`` flag is useful here. + +Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to +handle locking AGI and AGF buffers for all scrubber functions. +If it detects a conflict between scrub and the running transactions, it will +try to wait for intents to complete. +If the caller of the helper has not enabled the static key, the helper will +return -EDEADLOCK, which should result in the scrub being restarted with the +``TRY_HARDER`` flag set. +The scrub setup function should detect that flag, enable the static key, and +try the scrub again. +Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. + +For more information, please see the kernel documentation of +Documentation/staging/static-keys.rst. -- cgit v1.2.3 From 5f658dad233b2c92afddc011565d7e337d527a37 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:48 -0700 Subject: xfs: document pageable kernel memory Add a discussion of pageable kernel memory, since online fsck needs quite a bit more memory than most other parts of the filesystem to stage records and other information. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 500 +++++++++++++++++++++ 1 file changed, 500 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index 9232b2e442c8..39c394530958 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -413,6 +413,8 @@ Algorithms") of Srinivasan. However, any data structure builder that maintains a resource lock for the duration of the repair is *always* an offline algorithm. +.. _secondary_metadata: + Secondary Metadata `````````````````` @@ -1832,3 +1834,501 @@ Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. For more information, please see the kernel documentation of Documentation/staging/static-keys.rst. + +.. _xfile: + +Pageable Kernel Memory +---------------------- + +Some online checking functions work by scanning the filesystem to build a +shadow copy of an ondisk metadata structure in memory and comparing the two +copies. +For online repair to rebuild a metadata structure, it must compute the record +set that will be stored in the new structure before it can persist that new +structure to disk. +Ideally, repairs complete with a single atomic commit that introduces +a new data structure. +To meet these goals, the kernel needs to collect a large amount of information +in a place that doesn't require the correct operation of the filesystem. + +Kernel memory isn't suitable because: + +* Allocating a contiguous region of memory to create a C array is very + difficult, especially on 32-bit systems. + +* Linked lists of records introduce double pointer overhead which is very high + and eliminate the possibility of indexed lookups. + +* Kernel memory is pinned, which can drive the system into OOM conditions. + +* The system might not have sufficient memory to stage all the information. + +At any given time, online fsck does not need to keep the entire record set in +memory, which means that individual records can be paged out if necessary. +Continued development of online fsck demonstrated that the ability to perform +indexed data storage would also be very useful. 
+Fortunately, the Linux kernel already has a facility for byte-addressable and
+pageable storage: tmpfs.
+In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**:                                                  |
++--------------------------------------------------------------------------+
+| The first edition of online repair inserted records into a new btree    |
+| as it found them, which failed because the filesystem could shut down   |
+| with a partially built data structure, which would be live after        |
+| recovery finished.                                                      |
+|                                                                          |
+| The second edition solved the half-rebuilt structure problem by         |
+| storing everything in memory, but frequently ran the system out of      |
+| memory.                                                                  |
+|                                                                          |
+| The third edition solved the OOM problem by using linked lists, but     |
+| the memory overhead of the list pointers was extreme.                   |
++--------------------------------------------------------------------------+
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+   extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+   attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the
+xfile to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>`
+case study.
+
+The most general storage interface supported by the xfile enables the reading
+and writing of arbitrary quantities of data at arbitrary offsets in the
+xfile.
+This capability is provided by ``xfile_pread`` and ``xfile_pwrite``
+functions, which behave similarly to their userspace counterparts.
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
+functions are provided to read and persist objects into an xfile.
+They are internally the same as pread and pwrite, except that they treat any
+error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to
+userspace.
+All five xfile use cases can be serviced by these four functions.
+
+However, no discussion of file access idioms is complete without answering
+the question, "But what about mmap?"
+It is convenient to access storage directly with pointers, just like
+userspace code does with regular memory.
+Online fsck must not drive the system into OOM conditions, which means that
+xfiles must be responsive to memory reclamation.
+tmpfs can only push a pagecache folio to the swap cache if the folio is
+neither pinned nor locked, which means the xfile must not pin too many
+folios.
+
+Short term direct access to xfile contents is done by locking the pagecache
+folio and mapping it into kernel address space.
+Programmatic access (e.g.
 pread and pwrite) uses this mechanism.
+Folio locks are not supposed to be held for long periods of time, so long
+term direct access to xfile contents is done by bumping the folio refcount,
+mapping it into kernel address space, and dropping the folio lock.
+These long term users *must* be responsive to memory reclaim by hooking into
+the shrinker infrastructure to know when to release folios.
+
+The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+retrieve the (locked) folio that backs part of an xfile and to release it.
+The only users of these folio lease functions are the xfarray
+:ref:`sorting <xfarray_sort>` algorithms and the
+:ref:`in-memory btrees <xfbtree>`.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security
+system, must never be mapped into process file descriptor tables, and their
+pages must never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs
+file are performed by manipulating the page cache directly.
+xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into
+the page, and release the pages.
+xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
+before copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
+tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
+
+If an xfile is shared between threads to stage repairs, the caller must
+provide its own locks to coordinate access.
+For example, if a scrub function stores scan results in an xfile and needs
+other threads to provide updates to the scanned data, the scrub function must
+provide a lock for all threads to share.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of
+fixed-size records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step
+and retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers for there to be a
+higher level abstraction to take care of computing array offsets, to provide
+iterator functions, and to deal with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records
+atop the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
In other words, these callers want a sparse linearly addressed table file.
+The typical use cases are quota records and file link count records.
+Access to array elements is performed programmatically via ``xfarray_load``
+and ``xfarray_store`` functions, which wrap the similarly-named xfile
+functions to provide loading and storing of array elements at arbitrary
+array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
+
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
+records; this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag, and the ``xfarray_unset`` function removes a
+record from the bag.
+
+The proposed patchset is the
+`big in-memory array
+`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored
+in the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+	xfarray_idx_t i;
+	foreach_xfarray_idx(array, i) {
+	    xfarray_load(array, i, &rec);
+
+	    /* do something with rec */
+	}
+
+All users of this idiom must be prepared to handle null records or must
+already know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip
+areas of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+	    /* do something with rec */
+	}
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer
+remarked that for performance reasons, online repair ought to load batches of
+records into btree record blocks instead of inserting records into a new
+btree one at a time.
+The btree insertion code in XFS is responsible for maintaining correct
+ordering of the records, so naturally the xfarray must also support sorting
+the record set prior to bulk loading.
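+Putting these pieces together, a repair function might drive the xfarray
+through an append, sort, and iterate pipeline as sketched below.
+The scan and loader helpers, and the exact ``xfarray_sort`` calling
+convention, are assumptions made for this example:
+
+.. code-block:: c
+
+	struct xfs_rmap_irec	rec;
+	xfarray_idx_t		idx = XFARRAY_CURSOR_INIT;
+	int			error;
+
+	/* Stage each record observed by the scan; order does not matter. */
+	while (scan_next_record(scan, &rec)) {
+		error = xfarray_append(array, &rec);
+		if (error)
+			return error;
+	}
+
+	/* Sort the staged records before handing them to the bulk loader. */
+	error = xfarray_sort(array, rmap_record_cmp);
+	if (error)
+		return error;
+
+	/* Walk the sorted records, skipping unpopulated parts of the array. */
+	while ((error = xfarray_iter(array, &idx, &rec)) == 1)
+		stage_record_for_bulk_load(loader, &rec);
+	return error;
+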
+
+Case Study: Sorting xfarrays
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sorting algorithm used in the xfarray is actually a combination of
+adaptive quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick `_ and
+`pdqsort `_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also
+uses heapsort to hedge against performance collapse if the chosen quicksort
+pivots are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide
+performance gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of
+heapsort.
+It only operates on regular C arrays, which limits the scope of its
+usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+  of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion
+of quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and
+conquer behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort
+each of the triads, and then sort the middle value of each triad to determine
+the ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions
+to Survey Sampling and Applied Statistics*, edited by H. David, (Academic
+Press, 1978), pp. 251–257.
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of
+quicksort keeps examined xfile pages mapped in the kernel for as long as
+possible to reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
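+The interfaces themselves are described in the next section; as a rough
+sketch of the intended usage, a repair function might stash a name and then
+recall it later with the magic cookie.
+The exact signatures below are assumptions made for this example:
+
+.. code-block:: c
+
+	xfblob_cookie		cookie;
+	int			error;
+
+	/* Persist the name; the cookie is the only way to find it again. */
+	error = xfblob_store(blobs, &cookie, name, namelen);
+	if (error)
+		return error;
+
+	/* ...later, recall the name by providing the cookie... */
+	error = xfblob_load(blobs, cookie, name_buf, namelen);
+	if (error)
+		return error;
+
+	/* ...and release it once the name has been committed elsewhere. */
+	error = xfblob_free(blobs, cookie);
+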
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to
+retrieve and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the
+``xfblob_truncate`` function frees them all because compaction is not needed.
+
+The details of repairing directories and extended attributes will be
+discussed in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata <secondary_metadata>` mentioned
+that checking and repairing of secondary metadata commonly requires
+coordination between a live metadata scan of the filesystem and writer
+threads that are updating that metadata.
+Keeping the scan data up to date requires the ability to propagate metadata
+updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file
+and applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a
+lower maximum memory requirement.
+In both cases, the data structure holding the scan results must support
+indexed access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies,
+online fsck employs the second strategy of committing live updates directly
+into scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered
+reverse mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as
+a regular file, which means that the kernel can create byte or block
+addressable virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read"
+cached pages from the xfile and "write" cached pages back to the xfile.
+Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage
+since they do not share pages, but this property could some day enable
+transactional updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one
+memory page in size.
+These blocks use the same header format as an on-disk btree, but the
+in-memory block verifiers ignore the checksums, assuming that xfile memory
+is no more corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree contains a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target
+   structure pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+   ``xfbtree_create`` to write an initial tree header and root block to the
+   xfile.
+   Each btree type should define a wrapper that passes necessary arguments
+   to the creation function.
+   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+   all the necessary details for callers.
+   A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+   btree type.
+   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+   for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries
+   against and to update the in-memory btree.
+   For example, a btree cursor for an rmap xfbtree can be passed to the
+   ``xfs_rmap_*`` functions just like any other btree cursor.
+   See the :ref:`next section <xfbtree_commit>` for information on dealing
+   with xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free
+   the buffer target, and then destroy the xfile to release all resources.
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the
+staging structure, the ephemeral nature of the in-memory btree block storage
+presents some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers
+backed by an xfile because the log format does not understand updates for
+devices other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must +remove the buffer log items from the transaction and write the updates into the +backing xfile before committing or cancelling the transaction. + +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement +this functionality as follows: + +1. Find each buffer log item whose buffer targets the xfile. + +2. Record the dirty/ordered status of the log item. + +3. Detach the log item from the buffer. + +4. Queue the buffer to a special delwri list. + +5. Clear the transaction dirty flag if the only dirty log items were the ones + that were detached in step 3. + +6. Submit the delwri list to commit the changes to the xfile, if the updates + are being committed. + +After removing xfile logged buffers from the transaction in this manner, the +transaction can be committed or cancelled. -- cgit v1.2.3 From 7fb8ccffd36946a055cc89b8f0acf49cb59403b1 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:49 -0700 Subject: xfs: document btree bulk loading Add a discussion of the btree bulk loading code, which makes it easy to take an in-memory recordset and write it out to disk in an efficient manner. This also enables atomic switchover from the old to the new structure with minimal potential for leaking the old blocks. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 665 +++++++++++++++++++++ 1 file changed, 665 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index 39c394530958..e57d01924515 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -2332,3 +2332,668 @@ this functionality as follows: After removing xfile logged buffers from the transaction in this manner, the transaction can be committed or cancelled. + +Bulk Loading of Ondisk B+Trees +------------------------------ + +As mentioned previously, early iterations of online repair built new btree +structures by creating a new btree and adding observations individually. +Loading a btree one record at a time had a slight advantage of not requiring +the incore records to be sorted prior to commit, but was very slow and leaked +blocks if the system went down during a repair. +Loading records one at a time also meant that repair could not control the +loading factor of the blocks in the new btree. + +Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for +rebuilding a btree index from a collection of records -- bulk btree loading. +This was implemented rather inefficiently code-wise, since ``xfs_repair`` +had separate copy-pasted implementations for each btree type. + +To prepare for online fsck, each of the four bulk loaders were studied, notes +were taken, and the four were refactored into a single generic btree bulk +loading mechanism. +Those notes in turn have been refreshed and are presented below. + +Geometry Computation +```````````````````` + +The zeroth step of bulk loading is to assemble the entire record set that will +be stored in the new btree, and sort the records. +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the +btree from the record set, the type of btree, and any load factor preferences. +This information is required for resource reservation. 
+
+First, the geometry computation computes the minimum and maximum records that
+will fit in a leaf block from the size of a btree block and the size of the
+block header.
+Roughly speaking, the maximum number of records is::
+
+    maxrecs = (block_size - header_size) / record_size
+
+The XFS design specifies that btree blocks should be merged when possible,
+which means the minimum number of records is half of maxrecs::
+
+    minrecs = maxrecs / 2
+
+The next variable to determine is the desired loading factor.
+This must be at least minrecs and no more than maxrecs.
+Choosing minrecs is undesirable because it wastes half the block.
+Choosing maxrecs is also undesirable because adding a single record to each
+newly rebuilt leaf block will cause a tree split, which causes a noticeable
+drop in performance immediately afterwards.
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties::
+
+    default_load_factor = (maxrecs + minrecs) / 2
+
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+    leaf_load_factor = enough space ? default_load_factor : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+    maxrecs = (block_size - header_size) / (key_size + ptr_size)
+    minrecs = maxrecs / 2
+    node_load_factor = enough space ? default_load_factor : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+    leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+    n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+    node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+  tree is ``level + 1`` and the space needed is the summation of the number of
+  blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+  inode fork area, the height is ``level + 2``, the space needed is the
+  summation of the number of blocks on each level, and the inode fork points to
+  the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+  the inode fork area, then the root block can be stored in the inode, the
+  height is ``level + 1``, and the space needed is one less than the summation
+  of the number of blocks on each level.
+  This only becomes relevant when non-bmap btrees gain the ability to root in
+  an inode, which is a future patchset and only included here for completeness.
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
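+
+To make the bookkeeping concrete, each reservation can be imagined as a small
+record like the sketch below.
+The structure and field names are hypothetical; the real reservation tracking
+lives in the btree bulk loading patches and differs in detail::
+
+    /* Hypothetical sketch of per-extent reservation state. */
+    struct xrep_newbt_resv {
+            struct list_head                list;   /* reservation list */
+
+            xfs_fsblock_t                   fsbno;  /* start of extent */
+            xfs_extlen_t                    len;    /* length of extent */
+
+            /* blocks claimed by the btree builder so far */
+            xfs_extlen_t                    used;
+
+            /* in-memory EFI state backing this reservation */
+            struct xfs_extent_free_item     *efi;
+    };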
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and a new EFI at the head.
+This enables the log to release the old EFI to keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+`_
+and the
+`preparation for bulk loading btrees
+`_.
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks::
+
+    ┌────┐
+    │leaf│
+    │RRR │
+    └────┘
+
+Sibling pointers are set every time a new block is added to the level::
+
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node::
+
+        ┌────┐       ┌────┐
+        │node│──────→│node│
+        │PP  │←──────│PP  │
+        └────┘       └────┘
+        ↙   ↘        ↙   ↘
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+When it reaches the root level, it is ready to commit the new btree!::
+
+            ┌─────────┐
+            │  root   │
+            │   PP    │
+            └─────────┘
+            ↙         ↘
+        ┌────┐       ┌────┐
+        │node│──────→│node│
+        │PP  │←──────│PP  │
+        └────┘       └────┘
+        ↙   ↘        ↙   ↘
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL list before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+      by the btree builder.  The new EFDs must point to the EFIs attached to
+      the reservation to prevent log recovery from freeing the new blocks.
+
+   b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+      transaction chain.
+
+   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+      reservation of the committing transaction.
+      If the btree loading code suspects this might be about to happen, it must
+      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+      fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+   the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, which each can accommodate many thousands of block freeing
+instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section` after a few case studies of bulk loading.
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+   records from the inode chunk information and a bitmap of the old inode btree
+   blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the inode btree.
+   If the free space inode btree is enabled, call it again to estimate the
+   geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+   If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the file is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate +the inode btree with all inode chunk records, and a second time to populate the +free inode btree with records for chunks that have free non-sparse inodes. +The number of records for the inode btree is the number of xfarray records, +but the record count for the free inode btree has to be computed as inode chunk +records are stored in the xfarray. + +The proposed patchset is the +`AG btree repair +`_ +series. + +Case Study: Rebuilding the Space Reference Counts +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Reverse mapping records are used to rebuild the reference count information. +Reference counts are required for correct operation of copy on write for shared +file data. +Imagine the reverse mapping entries as rectangles representing extents of +physical blocks, and that the rectangles can be laid down to allow them to +overlap each other. +From the diagram below, it is apparent that a reference count record must start +or end wherever the height of the stack changes. +In other words, the record emission stimulus is level-triggered:: + + █ ███ + ██ █████ ████ ███ ██████ + ██ ████ ███████████ ████ █████████ + ████████████████████████████████ ███████████ + ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + 2 1 23 21 3 43 234 2123 1 01 2 3 0 + +The ondisk reference count btree does not store the refcount == 0 cases because +the free space btree already records which blocks are free. +Extents being used to stage copy-on-write operations should be the only records +with refcount == 1. +Single-owner file blocks aren't recorded in either the free space or the +reference count btrees. + +The high level process to rebuild the reference count btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` + records for any space having more than one reverse mapping and add them to + the xfarray. + Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray + because these are extents allocated to stage a copy on write operation and + are tracked in the refcount btree. + + Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old + refcount btree blocks. + +2. Sort the records in physical extent order, putting the CoW staging extents + at the end of the xfarray. + This matches the sorting order of records in the refcount btree. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +6. Commit the location of new btree root block to the AGF. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows; the same algorithm is used by ``xfs_repair`` to +generate refcount information from reverse mapping records. + +- Until the reverse mapping btree runs out of records: + + - Retrieve the next record from the btree and put it in a bag. + + - Collect all records with the same starting block from the btree and put + them in the bag. + + - While the bag isn't empty: + + - Among the mappings in the bag, compute the lowest block number where the + reference count changes. + This position will be either the starting block number of the next + unprocessed reverse mapping or the next block after the shortest mapping + in the bag. + + - Remove all mappings from the bag that end at this position. 
+ + - Collect all reverse mappings that start at this position from the btree + and put them in the bag. + + - If the size of the bag changed and is greater than one, create a new + refcount record associating the block number range that we just walked to + the size of the bag. + +The bag-like structure in this case is a type 2 xfarray as discussed in the +:ref:`xfarray access patterns` section. +Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and +removed via ``xfarray_unset``. +Bag members are examined through ``xfarray_iter`` loops. + +The proposed patchset is the +`AG btree repair +`_ +series. + +Case Study: Rebuilding File Fork Mapping Indices +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild a data/attr fork mapping btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` + records from the reverse mapping records for that inode and fork. + Append these records to an xfarray. + Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` + records. + +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +3. Sort the records in file offset order. + +4. If the extent records would fit in the inode fork immediate area, commit the + records to that immediate area and skip to step 8. + +5. Allocate the number of blocks computed in the previous step. + +6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +7. Commit the new btree root block to the inode fork immediate area. + +8. Reap the old btree blocks using the bitmap created in step 1. + +There are some complications here: +First, it's possible to move the fork offset to adjust the sizes of the +immediate areas if the data and attr forks are not both in BMBT format. +Second, if there are sufficiently few fork mappings, it may be possible to use +EXTENTS format instead of BMBT, which may require a conversion. +Third, the incore extent map must be reloaded carefully to avoid disturbing +any delayed allocation extents. + +The proposed patchset is the +`file mapping repair +`_ +series. + +.. _reaping: + +Reaping Old Metadata Blocks +--------------------------- + +Whenever online fsck builds a new data structure to replace one that is +suspect, there is a question of how to find and dispose of the blocks that +belonged to the old structure. +The laziest method of course is not to deal with them at all, but this slowly +leads to service degradations as space leaks out of the filesystem. +Hopefully, someone will schedule a rebuild of the free space information to +plug all those leaks. +Offline repair rebuilds all space metadata after recording the usage of +the files and directories that it decides not to clear, hence it can build new +structures in the discovered free space and avoid the question of reaping. + +As part of a repair, online fsck relies heavily on the reverse mapping records +to find space that is owned by the corresponding rmap owner yet truly free. +Cross referencing rmap records with other rmap records is necessary because +there may be other data structures that also think they own some of those +blocks (e.g. crosslinked trees). +Permitting the block allocator to hand them out again will not push the system +towards consistency. + +For space metadata, the process of finding extents to dispose of generally +follows this format: + +1. 
Create a bitmap of space used by data structures that must be preserved. + The space reservations used to create the new metadata can be used here if + the same rmap owner code is used to denote all of the objects being rebuilt. + +2. Survey the reverse mapping data to create a bitmap of space owned by the + same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. + +3. Use the bitmap disunion operator to subtract (1) from (2). + The remaining set bits represent candidate extents that could be freed. + The process moves on to step 4 below. + +Repairs for file-based metadata such as extended attributes, directories, +symbolic links, quota files and realtime bitmaps are performed by building a +new structure attached to a temporary file and swapping the forks. +Afterward, the mappings in the old file fork are the candidate blocks for +disposal. + +The process for disposing of old extents is as follows: + +4. For each candidate extent, count the number of reverse mapping records for + the first block in that extent that do not have the same rmap owner for the + data structure being repaired. + + - If zero, the block has a single owner and can be freed. + + - If not, the block is part of a crosslinked structure and must not be + freed. + +5. Starting with the next block in the extent, figure out how many more blocks + have the same zero/nonzero other owner status as that first block. + +6. If the region is crosslinked, delete the reverse mapping entry for the + structure being repaired and move on to the next region. + +7. If the region is to be freed, mark any corresponding buffers in the buffer + cache as stale to prevent log writeback. + +8. Free the region and move on. + +However, there is one complication to this procedure. +Transactions are of finite size, so the reaping process must be careful to roll +the transactions to avoid overruns. +Overruns come from two sources: + +a. EFIs logged on behalf of space that is no longer occupied + +b. Log items for buffer invalidations + +This is also a window in which a crash during the reaping process can leak +blocks. +As stated earlier, online repair functions use very large transactions to +minimize the chances of this occurring. + +The proposed patchset is the +`preparation for bulk loading btrees +`_ +series. + +Case Study: Reaping After a Regular Btree Repair +```````````````````````````````````````````````` + +Old reference count and inode btrees are the easiest to reap because they have +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. +Creating a list of extents to reap the old btree blocks is quite simple, +conceptually: + +1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. + +2. For each reverse mapping record with an rmap owner corresponding to the + metadata structure being rebuilt, set the corresponding range in a bitmap. + +3. Walk the current data structures that have the same rmap owner. + For each block visited, clear that range in the above bitmap. + +4. Each set bit in the bitmap represents a block that could be a block from the + old data structures and hence is a candidate for reaping. + In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` + are the blocks that might be freeable. 
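+
+Expressed in terms of the scrub bitmap helpers, steps 2 through 4 reduce to
+the sketch below.
+The helper names mirror the online fsck bitmap code, but the exact names and
+signatures are assumptions made for illustration::
+
+    struct xagb_bitmap      candidates;
+    int                     error;
+
+    xagb_bitmap_init(&candidates);
+
+    /* Step 2: set bits for each rmap record with the target rmap owner. */
+    error = xagb_bitmap_set(&candidates, rec->rm_startblock,
+                    rec->rm_blockcount);
+
+    /* Step 3: clear bits for each block visited by the structure walk. */
+    error = xagb_bitmap_clear(&candidates, agbno, len);
+
+    /* Step 4: bits still set cover blocks that are candidates for reaping. */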
+
+If it is possible to maintain the AGF lock throughout the repair (which is the
+common case), then step 2 can be performed at the same time as the reverse
+mapping record walk that creates the records for the new btree.
+
+Case Study: Rebuilding the Free Space Indices
+`````````````````````````````````````````````
+
+The high level process to rebuild the free space indices is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
+   records from the gaps in the reverse mapping btree.
+
+2. Append the records to an xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+   space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks for the free space by length index.
+   Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+   reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+is described in the case study of
+:ref:`live rebuilds of rmap data `, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` is reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study`.
+
+The proposed patchset is the
+`AG btree repair
+`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+   owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+   other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any leftover blocks.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c `_ for more details.
-- 
cgit v1.2.3

From d697887193ef25df3d6e6a77d4108e7924eba2ca Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:49 -0700
Subject: xfs: document online file metadata repair code

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 154 +++++++++++++++++++++
 1 file changed, 154 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index e57d01924515..a047fc772a62 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2997,3 +2997,157 @@ The allocation group free block list (AGFL) is repaired as follows:
 5. The next operation to fix the freelist will right-size the list.
 
 See `fs/xfs/scrub/agheader_repair.c `_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use +specialized resource acquisition functions that return either the in-memory +representation *or* a lock on whichever object is necessary to prevent any +update to the ondisk location. + +The only repairs that should be made to the ondisk inode buffers are whatever +is necessary to get the in-core structure loaded. +This means fixing whatever is caught by the inode cluster buffer and inode fork +verifiers, and retrying the ``iget`` operation. +If the second ``iget`` fails, the repair has failed. + +Once the in-memory representation is loaded, repair can lock the inode and can +subject it to comprehensive checks, repairs, and optimizations. +Most inode attributes are easy to check and constrain, or are user-controlled +arbitrary bit patterns; these are both easy to fix. +Dealing with the data and attr fork extent counts and the file block counts is +more complicated, because computing the correct value requires traversing the +forks, or if that fails, leaving the fields invalid and waiting for the fork +fsck functions to run. + +The proposed patchset is the +`inode +`_ +repair series. + +Quota Record Repairs +-------------------- + +Similar to inodes, quota records ("dquots") also have both ondisk records and +an in-memory representation, and hence are subject to the same cache coherency +issues. +Somewhat confusingly, both are known as dquots in the XFS codebase. + +The only repairs that should be made to the ondisk quota record buffers are +whatever is necessary to get the in-core structure loaded. +Once the in-memory representation is loaded, the only attributes needing +checking are obviously bad limits and timer values. + +Quota usage counters are checked, repaired, and discussed separately in the +section about :ref:`live quotacheck `. + +The proposed patchset is the +`quota +`_ +repair series. + +.. _fscounters: + +Freezing to Fix Summary Counters +-------------------------------- + +Filesystem summary counters track availability of filesystem resources such +as free blocks, free inodes, and allocated inodes. +This information could be compiled by walking the free space and inode indexes, +but this is a slow process, so XFS maintains a copy in the ondisk superblock +that should reflect the ondisk metadata, at least when the filesystem has been +unmounted cleanly. +For performance reasons, XFS also maintains incore copies of those counters, +which are key to enabling resource reservations for active transactions. +Writer threads reserve the worst-case quantities of resources from the +incore counter and give back whatever they don't use at commit time. +It is therefore only necessary to serialize on the superblock when the +superblock is being committed to disk. + +The lazy superblock counter feature introduced in XFS v5 took this even further +by training log recovery to recompute the summary counters from the AG headers, +which eliminated the need for most transactions even to touch the superblock. +The only time XFS commits the summary counters is at filesystem unmount. +To reduce contention even further, the incore counter is implemented as a +percpu counter, which means that each CPU is allocated a batch of blocks from a +global incore counter and can satisfy small allocations from the local batch. + +The high-performance nature of the summary counters makes it difficult for +online fsck to check them, since there is no way to quiesce a percpu counter +while the system is running. 
+Although online fsck can read the filesystem metadata to compute the correct +values of the summary counters, there's no way to hold the value of a percpu +counter stable, so it's quite possible that the counter will be out of date by +the time the walk is complete. +Earlier versions of online scrub would return to userspace with an incomplete +scan flag, but this is not a satisfying outcome for a system administrator. +For repairs, the in-memory counters must be stabilized while walking the +filesystem metadata to get an accurate reading and install it in the percpu +counter. + +To satisfy this requirement, online fsck must prevent other programs in the +system from initiating new writes to the filesystem, it must disable background +garbage collection threads, and it must wait for existing writer programs to +exit the kernel. +Once that has been established, scrub can walk the AG free space indexes, the +inode btrees, and the realtime bitmap to compute the correct value of all +four summary counters. +This is very similar to a filesystem freeze, though not all of the pieces are +necessary: + +- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to + prevent other threads from thawing the filesystem, or other scrub threads + from initiating another fscounters freeze. + +- It does not quiesce the log. + +With this code in place, it is now possible to pause the filesystem for just +long enough to check and correct the summary counters. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The initial implementation used the actual VFS filesystem freeze | +| mechanism to quiesce filesystem activity. | +| With the filesystem frozen, it is possible to resolve the counter values | +| with exact precision, but there are many problems with calling the VFS | +| methods directly: | +| | +| - Other programs can unfreeze the filesystem without our knowledge. | +| This leads to incorrect scan results and incorrect repairs. | +| | +| - Adding an extra lock to prevent others from thawing the filesystem | +| required the addition of a ``->freeze_super`` function to wrap | +| ``freeze_fs()``. | +| This in turn caused other subtle problems because it turns out that | +| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | +| last reference to the VFS superblock, and any subsequent access | +| becomes a UAF bug! | +| This can happen if the filesystem is unmounted while the underlying | +| block device has frozen the filesystem. | +| This problem could be solved by grabbing extra references to the | +| superblock, but it felt suboptimal given the other inadequacies of | +| this approach. | +| | +| - The log need not be quiesced to check the summary counters, but a VFS | +| freeze initiates one anyway. | +| This adds unnecessary runtime to live fscounter fsck operations. | +| | +| - Quiescing the log means that XFS flushes the (possibly incorrect) | +| counters to disk as part of cleaning the log. | +| | +| - A bug in the VFS meant that freeze could complete even when | +| sync_filesystem fails to flush the filesystem and returns an error. | +| This bug was fixed in Linux 5.17. | ++--------------------------------------------------------------------------+ + +The proposed patchset is the +`summary counter cleanup +`_ +series. -- cgit v1.2.3 From a0d856eede5a3ccfc7f0a4e02ee2e91ffab7f784 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 18:59:50 -0700 Subject: xfs: document full filesystem scans for online fsck Certain parts of the online fsck code need to scan every file in the entire filesystem. It is not acceptable to block the entire filesystem while this happens, which means that we need to be clever in allowing scans to coordinate with ongoing filesystem updates. We also need to hook the filesystem so that regular updates propagate to the staging records. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 681 +++++++++++++++++++++ 1 file changed, 681 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index a047fc772a62..a768dfbbc4a5 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -3151,3 +3151,684 @@ The proposed patchset is the `summary counter cleanup `_ series. + +Full Filesystem Scans +--------------------- + +Certain types of metadata can only be checked by walking every file in the +entire filesystem to record observations and comparing the observations against +what's recorded on disk. +Like every other type of online repair, repairs are made by writing those +observations to disk in a replacement structure and committing it atomically. +However, it is not practical to shut down the entire filesystem to examine +hundreds of billions of files because the downtime would be excessive. +Therefore, online fsck must build the infrastructure to manage a live scan of +all the files in the filesystem. +There are two questions that need to be solved to perform a live walk: + +- How does scrub manage the scan while it is collecting data? + +- How does the scan keep abreast of changes being made to the system by other + threads? + +.. _iscan: + +Coordinated Inode Scans +``````````````````````` + +In the original Unix filesystems of the 1970s, each directory entry contained +an index number (*inumber*) which was used as an index into on ondisk array +(*itable*) of fixed-size records (*inodes*) describing a file's attributes and +its data block mapping. +This system is described by J. Lions, `"inode (5659)" +`_ in *Lions' Commentary on +UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, +`"Implementation of the File System" +`_, from *The UNIX +Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. +1913-4. + +XFS retains most of this design, except now inumbers are search keys over all +the space in the data section filesystem. +They form a continuous keyspace that can be expressed as a 64-bit integer, +though the inodes themselves are sparsely distributed within the keyspace. +Scans proceed in a linear fashion across the inumber keyspace, starting from +``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. +Naturally, a scan through a keyspace requires a scan cursor object to track the +scan progress. +Because this keyspace is sparse, this cursor contains two parts. +The first part of this scan cursor object tracks the inode that will be +examined next; call this the examination cursor. +Somewhat less obviously, the scan cursor object must also track which parts of +the keyspace have already been visited, which is critical for deciding if a +concurrent filesystem update needs to be incorporated into the scan data. +Call this the visited inode cursor. 
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+   inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+   advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+   was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+   a. Move the examination cursor to the point of the inumber keyspace that
+      corresponds to the start of the next AG.
+
+   b. Adjust the visited inode cursor to indicate that it has "visited" the
+      last possible inode in the current AG's inode keyspace.
+      XFS inumbers are segmented, so the cursor needs to be marked as having
+      visited the entire keyspace up to just before the start of the next AG's
+      inode keyspace.
+
+   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+      filesystem.
+
+   d. If there are no more AGs to examine, set both cursors to the end of the
+      inumber keyspace.
+      The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+   a. Move the examination cursor ahead to the next inode marked as allocated
+      by the inode btree.
+
+   b. Adjust the visited inode cursor to point to the inode just prior to where
+      the examination cursor is now.
+      Because the scanner holds the AGI buffer lock, no inodes could have been
+      created in the part of the inode keyspace that the visited inode cursor
+      just advanced through.
+
+5. Get the incore inode for the inumber of the examination cursor.
+   By maintaining the AGI buffer lock until this point, the scanner knows that
+   it was safe to advance the examination cursor across the entire keyspace,
+   and that it has stabilized this next inode so that it cannot disappear from
+   the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+   If one is provided:
+
+   a. Lock the inode to prevent updates during the scan.
+
+   b. Scan the inode.
+
+   c. While still holding the inode lock, adjust the visited inode cursor
+      (``xchk_iscan_mark_visited``) to point to this inode.
+
+   d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_teardown`` to complete the scan.
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+Obviously, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
+
+The proposed patches are the
+`inode scanner
+`_
+series.
+The first user of the new functionality is the
+`online quotacheck
+`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as +part of file creation must be performed in transaction context because the +filesystem must ensure the atomicity of the ondisk inode btree index updates +and the initialization of the actual ondisk inode. + +References to incore inodes are always released (``xfs_irele``) outside of +transaction context because there are a handful of activities that might +require ondisk updates: + +- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode + release. + +- Speculative preallocations need to be unreserved. + +- An unlinked file may have lost its last reference, in which case the entire + file must be inactivated, which involves releasing all of its resources in + the ondisk metadata and freeing the inode. + +These activities are collectively called inode inactivation. +Inactivation has two parts -- the VFS part, which initiates writeback on all +dirty file pages, and the XFS part, which cleans up XFS-specific information +and frees the inode if it was unlinked. +If the inode is unlinked (or unconnected after a file handle operation), the +kernel drops the inode into the inactivation machinery immediately. + +During normal operation, resource acquisition for an update follows this order +to avoid deadlocks: + +1. Inode reference (``iget``). + +2. Filesystem freeze protection, if repairing (``mnt_want_write_file``). + +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO. + +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that + can update page cache mappings. + +5. Log feature enablement. + +6. Transaction log space grant. + +7. Space on the data and realtime devices for the transaction. + +8. Incore dquot references, if a file is being repaired. + Note that they are not locked, merely acquired. + +9. Inode ``ILOCK`` for file metadata updates. + +10. AG header buffer locks / Realtime metadata inode ILOCK. + +11. Realtime metadata buffer locks, if applicable. + +12. Extent mapping btree blocks, if applicable. + +Resources are often released in the reverse order, though this is not required. +However, online fsck differs from regular XFS operations because it may examine +an object that normally is acquired in a later stage of the locking order, and +then decide to cross-reference the object with an object that is acquired +earlier in the order. +The next few sections detail the specific ways in which online fsck takes care +to avoid deadlocks. + +iget and irele During a Scrub +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An inode scan performed on behalf of a scrub operation runs in transaction +context, and possibly with resources already locked and bound to it. +This isn't much of a problem for ``iget`` since it can operate in the context +of an existing transaction, as long as all of the bound resources are acquired +before the inode reference in the regular filesystem. + +When the VFS ``iput`` function is given a linked inode with no other +references, it normally puts the inode on an LRU list in the hope that it can +save time if another process re-opens the file before the system runs out +of memory and frees it. +Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` +flag on the inode to cause the kernel to try to drop the inode into the +inactivation machinery immediately. 
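+
+As the next paragraph describes, scrub wraps this release logic in a helper.
+A simplified sketch of what such a helper might do follows; it is modeled on
+the proposed ``xchk_irele`` but is illustrative, not the actual
+implementation::
+
+    void
+    xchk_irele(
+            struct xfs_scrub        *sc,
+            struct xfs_inode        *ip)
+    {
+            if (sc->tp) {
+                    /*
+                     * A transaction is bound to the scrub context, so
+                     * inactivation must not run now; keep the inode around
+                     * for someone else to inactivate.
+                     */
+                    spin_lock(&VFS_I(ip)->i_lock);
+                    VFS_I(ip)->i_state &= ~I_DONTCACHE;
+                    spin_unlock(&VFS_I(ip)->i_lock);
+            } else {
+                    /* No transaction; drop the inode immediately. */
+                    d_mark_dontcache(VFS_I(ip));
+            }
+            xfs_irele(ip);
+    }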
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+`_ and
+`dir iget usage
+`_.
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+in numerical order of the addresses of their ``struct inode`` object otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in numerical order of
+the addresses of their ``struct address_space`` objects.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions and avoid becoming trapped in an
+ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+`_
+series.
+
+.. 
_fshooks:
+
+Filesystem Hooks
+````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact to highly threaded workloads.
+Regular blocking notifier chains use a rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, blocking chains combined with static keys may turn out to be the more
+performant combination; more study is needed here.
+
+The following pieces are necessary to hook a certain point in the filesystem:
+
+- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
+  a well-known incore filesystem object.
+
+- Each hook must define an action code and a structure containing more context
+  about the action.
+
+- Hook providers should provide appropriate wrapper functions and structs
+  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
+  checking to ensure correct usage.
+
+- A callsite in the regular filesystem code must be chosen to call
+  ``xfs_hooks_call`` with the action code and data structure.
+  This place should be adjacent to (and not earlier than) the place where
+  the filesystem update is committed to the transaction.
+  In general, when the filesystem calls a hook chain, it should be able to
+  handle sleeping and should not be vulnerable to memory reclaim or locking
+  recursion.
+  However, the exact requirements are very dependent on the context of the hook
+  caller and the callee.
+
+- The online fsck function should define a structure to hold scan data, a lock
+  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
+  The scanner function and the regular filesystem code must acquire resources
+  in the same order; see the next section for details.
+
+- The online fsck code must contain a C function to catch the hook action code
+  and data structure.
+  If the object being updated has already been visited by the scan, then the
+  hook information must be applied to the scan data.
+
+- Prior to unlocking inodes to start the scan, online fsck must call
+  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
+  ``xfs_hooks_add`` to enable the hook.
+
+- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
+  complete.
+
+The number of hooks should be kept to a minimum to reduce complexity.
+Static keys are used to reduce the overhead of filesystem hooks to nearly
+zero when online fsck is not running.
+
+.. 
_liveupdate:
+
+Live Updates During a Scan
+``````````````````````````
+
+The code paths of the online fsck scanning code and the :ref:`hooked`
+filesystem code look like this::
+
+            other program
+                  ↓
+            inode lock ←────────────────────┐
+                  ↓                         │
+            AG header lock                  │
+                  ↓                         │
+            filesystem function             │
+                  ↓                         │
+            notifier call chain             │    same
+                  ↓                         ├─── inode
+            scrub hook function             │    lock
+                  ↓                         │
+            scan data mutex ←──┐    same    │
+                  ↓            ├─── scan    │
+            update scan data   │    lock    │
+                  ↑            │            │
+            scan data mutex ←──┘            │
+                  ↑                         │
+            inode lock ←────────────────────┘
+                  ↑
+            scrub function
+                  ↑
+            inode scanner
+                  ↑
+            xfs_scrub
+
+These rules must be followed to ensure correct interactions between the
+checking code and the code making an update to the filesystem:
+
+- Prior to invoking the notifier call chain, the filesystem function being
+  hooked must acquire the same lock that the scrub scanning function acquires
+  to scan the inode.
+
+- The scanning function and the scrub hook function must coordinate access to
+  the scan data by acquiring a lock on the scan data.
+
+- Scrub hook functions must not add the live update information to the scan
+  observations unless the inode being updated has already been scanned.
+  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
+  for this.
+
+- Scrub hook functions must not change the caller's state, including the
+  transaction that it is running.
+  They must not acquire any resources that might conflict with the filesystem
+  function being hooked.
+
+- The hook function can abort the inode scan to avoid breaking the other rules.
+
+The inode scan APIs are pretty simple:
+
+- ``xchk_iscan_start`` starts a scan
+
+- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
+  returns zero if there is nothing left to scan
+
+- ``xchk_iscan_want_live_update`` to decide if an inode has already been
+  visited in the scan.
+  This is critical for hook functions to decide if they need to update the
+  in-memory scan information.
+
+- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
+  scan
+
+- ``xchk_iscan_teardown`` to finish the scan
+
+This functionality is also a part of the
+`inode scanner
+`_
+series.
+
+.. _quotacheck:
+
+Case Study: Quota Counter Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is useful to compare the mount time quotacheck code to the online repair
+quotacheck code.
+Mount time quotacheck does not have to contend with concurrent operations, so
+it does the following:
+
+1. Make sure the ondisk dquots are in good enough shape that all the incore
+   dquots will actually load, and zero the resource usage counters in the
+   ondisk buffer.
+
+2. Walk every inode in the filesystem.
+   Add each file's resource usage to the incore dquot.
+
+3. Walk each incore dquot.
+   If the incore dquot is not being flushed, add the ondisk buffer backing the
+   incore dquot to a delayed write (delwri) list.
+
+4. Write the buffer list to disk.
+
+Like most online fsck functions, online quotacheck can't write to regular
+filesystem objects until the newly collected metadata reflect all filesystem
+state.
+Therefore, online quotacheck records file resource usage to a shadow dquot
+index implemented with a sparse ``xfarray``, and only writes to the real dquots
+once the scan is complete.
+Handling transactional updates is tricky because quota resource usage updates
+are handled in phases to minimize contention on dquots:
+
+1. The inodes involved are joined and locked to a transaction.
+
+2. For each dquot attached to the file:
+
+   a. 
The dquot is locked. + + b. A quota reservation is added to the dquot's resource usage. + The reservation is recorded in the transaction. + + c. The dquot is unlocked. + +3. Changes in actual quota usage are tracked in the transaction. + +4. At transaction commit time, each dquot is examined again: + + a. The dquot is locked again. + + b. Quota usage changes are logged and unused reservation is given back to + the dquot. + + c. The dquot is unlocked. + +For online quotacheck, hooks are placed in steps 2 and 4. +The step 2 hook creates a shadow version of the transaction dquot context +(``dqtrx``) that operates in a similar manner to the regular code. +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. +Notice that both hooks are called with the inode locked, which is how the +live update coordinates with the inode scanner. + +The quotacheck scan looks like this: + +1. Set up a coordinated inode scan. + +2. For each inode returned by the inode scan iterator: + + a. Grab and lock the inode. + + b. Determine that inode's resource usage (data blocks, inode counts, + realtime blocks) and add that to the shadow dquots for the user, group, + and project ids associated with the inode. + + c. Unlock and release the inode. + +3. For each dquot in the system: + + a. Grab and lock the dquot. + + b. Check the dquot against the shadow dquots created by the scan and updated + by the live hooks. + +Live updates are key to being able to walk every quota record without +needing to hold any locks for a long duration. +If repairs are desired, the real and shadow dquots are locked and their +resource counts are set to the values in the shadow dquot. + +The proposed patchset is the +`online quotacheck +`_ +series. + +.. _nlinks: + +Case Study: File Link Count Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +File link count checking also uses live update hooks. +The coordinated inode scanner is used to visit all directories on the +filesystem, and per-file link count records are stored in a sparse ``xfarray`` +indexed by inumber. +During the scanning phase, each entry in a directory generates observation +data as follows: + +1. If the entry is a dotdot (``'..'``) entry of the root directory, the + directory's parent link count is bumped because the root directory's dotdot + entry is self referential. + +2. If the entry is a dotdot entry of a subdirectory, the parent's backref + count is bumped. + +3. If the entry is neither a dot nor a dotdot entry, the target file's parent + count is bumped. + +4. If the target is a subdirectory, the parent's child link count is bumped. + +A crucial point to understand about how the link count inode scanner interacts +with the live update hooks is that the scan cursor tracks which *parent* +directories have been scanned. +In other words, the live updates ignore any update about ``A → B`` when A has +not been scanned, even if B has been scanned. +Furthermore, a subdirectory A with a dotdot entry pointing back to B is +accounted as a backref counter in the shadow data for A, since child dotdot +entries affect the parent's link count. +Live update hooks are carefully placed in all parts of the filesystem that +create, change, or remove directory entries, since those operations involve +bumplink and droplink. + +For any file, the correct link count is the number of parents plus the number +of child subdirectories. +Non-directories never have children of any kind. 
+The backref information is used to detect inconsistencies in the number of +links pointing to child subdirectories and the number of dotdot entries +pointing back. + +After the scan completes, the link count of each file can be checked by locking +both the inode and the shadow data, and comparing the link counts. +A second coordinated inode scan cursor is used for comparisons. +Live updates are key to being able to walk every inode without needing to hold +any locks between inodes. +If repairs are desired, the inode's link count is set to the value in the +shadow information. +If no parents are found, the file must be :ref:`reparented ` to the +orphanage to prevent the file from being lost forever. + +The proposed patchset is the +`file link count repair +`_ +series. + +.. _rmap_repair: + +Case Study: Rebuilding Reverse Mapping Records +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most repair functions follow the same pattern: lock filesystem resources, +walk the surviving ondisk metadata looking for replacement metadata records, +and use an :ref:`in-memory array ` to store the gathered observations. +The primary advantage of this approach is the simplicity and modularity of the +repair code -- code and data are entirely contained within the scrub module, +do not require hooks in the main filesystem, and are usually the most efficient +in memory use. +A secondary advantage of this repair approach is atomicity -- once the kernel +decides a structure is corrupt, no other threads can access the metadata until +the kernel finishes repairing and revalidating the metadata. + +For repairs going on within a shard of the filesystem, these advantages +outweigh the delays inherent in locking the shard while repairing parts of the +shard. +Unfortunately, repairs to the reverse mapping btree cannot use the "standard" +btree repair strategy because it must scan every space mapping of every fork of +every file in the filesystem, and the filesystem cannot stop. +Therefore, rmap repair foregoes atomicity between scrub and repair. +It combines a :ref:`coordinated inode scanner `, :ref:`live update hooks +`, and an :ref:`in-memory rmap btree ` to complete the +scan for reverse mapping records. + +1. Set up an xfbtree to stage rmap records. + +2. While holding the locks on the AGI and AGF buffers acquired during the + scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW + staging extents, and the internal log. + +3. Set up an inode scanner. + +4. Hook into rmap updates for the AG being repaired so that the live scan data + can receive updates to the rmap btree from the rest of the filesystem during + the file scan. + +5. For each space mapping found in either fork of each file scanned, + decide if the mapping matches the AG of interest. + If so: + + a. Create a btree cursor for the in-memory btree. + + b. Use the rmap code to add the record to the in-memory btree. + + c. Use the :ref:`special commit function ` to write the + xfbtree changes to the xfile. + +6. For each live update received via the hook, decide if the owner has already + been scanned. + If so, apply the live update into the scan data: + + a. Create a btree cursor for the in-memory btree. + + b. Replay the operation into the in-memory btree. + + c. Use the :ref:`special commit function ` to write the + xfbtree changes to the xfile. + This is performed with an empty transaction to avoid changing the + caller's state. + +7. When the inode scan finishes, create a new scrub transaction and relock the + two AG headers. 
+
+8. Compute the new btree geometry using the number of rmap records in the
+   shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+    btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+    to :ref:`reap after rmap btree repair `.
+
+12. Free the xfbtree now that it is not needed.
+
+The proposed patchset is the
+`rmap repair
+`_
+series.
-- 
cgit v1.2.3


From 2f754f7fb9fdd4e051cb273010c4244f95095835 Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:50 -0700
Subject: xfs: document metadata file repair

File-based metadata (such as xattrs and directories) can be extremely
large.  To reduce the memory requirements and maximize code reuse, it is
very convenient to create a temporary file, use the regular dir/attr
code to store salvaged information, and then atomically swap the extents
between the file being repaired and the temporary file.  Record the high
level concepts behind how temporary files and atomic content swapping
should work, and then present some case studies of what the actual
repair functions do.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 536 +++++++++++++++++++++
 1 file changed, 536 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a768dfbbc4a5..facfde07e569 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -3389,6 +3389,8 @@ Proposed patchsets include fixing
 `dir iget usage
 `_.
 
+.. _ilocking:
+
 Locking Inodes
 ^^^^^^^^^^^^^^
 
@@ -3832,3 +3834,537 @@ The proposed patchset is the
 `rmap repair
 `_
 series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+Because file forks can consume as much space as the entire filesystem, repairs
+cannot be staged in memory, even when a paging scheme is available.
+Therefore, online repair of file-based metadata creates a temporary file in
+the XFS filesystem, writes a new structure at the correct offsets into the
+temporary file, and atomically swaps the fork mappings (and hence the fork
+contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
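+
+The sequence described above might be outlined as follows.
+This sketch borrows helper names that appear later in this chapter
+(``xrep_tempfile_create``, ``xrep_tempswap_trans_alloc``, and
+``xrep_tempswap_contents``); the signatures shown here and the two helpers
+marked "hypothetical" are assumptions for illustration, not actual kernel
+prototypes:
+
+.. code-block:: c
+
+    /* Outline of a file-based metadata repair staged in a temporary file. */
+    static int
+    xrep_filemeta_outline(
+            struct xfs_scrub        *sc)
+    {
+            struct xfs_swapext_req  req;
+            int                     error;
+
+            /* Create a hidden temporary file and attach it to this scrub. */
+            error = xrep_tempfile_create(sc, S_IFREG);
+            if (error)
+                    return error;
+
+            /* Write the new structure into the temporary file (hypothetical). */
+            error = xrep_filemeta_build_new(sc);
+            if (error)
+                    return error;
+
+            /* Reserve resources and fill out the swap request. */
+            error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
+            if (error)
+                    return error;
+
+            /* Atomically exchange the fork mappings to commit the repair. */
+            error = xrep_tempswap_contents(sc, &req);
+            if (error)
+                    return error;
+
+            /* Reap the old blocks now owned by the temporary file (hypothetical). */
+            return xrep_filemeta_reap_old(sc);
+    }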
+ +Swapping metadata extents with a temporary file requires the owner field of the +block headers to match the file being repaired and not the temporary file. The +directory, extended attribute, and symbolic link functions were all modified to +allow callers to specify owner numbers explicitly. + +There is a downside to the reaping process -- if the system crashes during the +reap phase and the fork extents are crosslinked, the iunlink processing will +fail because freeing space will find the extra reverse mappings and abort. + +Temporary files created for repair are similar to ``O_TMPFILE`` files created +by userspace. +They are not linked into a directory and the entire file will be reaped when +the last reference to the file is lost. +The key differences are that these files must have no access permission outside +the kernel at all, they must be specially marked to prevent them from being +opened by handle, and they must never be linked into the directory tree. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| In the initial iteration of file metadata repair, the damaged metadata | +| blocks would be scanned for salvageable data; the extents in the file | +| fork would be reaped; and then a new structure would be built in its | +| place. | +| This strategy did not survive the introduction of the atomic repair | +| requirement expressed earlier in this document. | +| | +| The second iteration explored building a second structure at a high | +| offset in the fork from the salvage data, reaping the old extents, and | +| using a ``COLLAPSE_RANGE`` operation to slide the new extents into | +| place. | +| | +| This had many drawbacks: | +| | +| - Array structures are linearly addressed, and the regular filesystem | +| codebase does not have the concept of a linear offset that could be | +| applied to the record offset computation to build an alternate copy. | +| | +| - Extended attributes are allowed to use the entire attr fork offset | +| address space. | +| | +| - Even if repair could build an alternate copy of a data structure in a | +| different part of the fork address space, the atomic repair commit | +| requirement means that online repair would have to be able to perform | +| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | +| structure was completely replaced. | +| | +| - A crash after construction of the secondary tree but before the range | +| collapse would leave unreachable blocks in the file fork. | +| This would likely confuse things further. | +| | +| - Reaping blocks after a repair is not a simple operation, and | +| initiating a reap operation from a restarted range collapse operation | +| during log recovery is daunting. | +| | +| - Directory entry blocks and quota records record the file fork offset | +| in the header area of each block. | +| An atomic range collapse operation would have to rewrite this part of | +| each block header. | +| Rewriting a single field in block headers is not a huge problem, but | +| it's something to be aware of. | +| | +| - Each block in a directory or extended attributes btree index contains | +| sibling and child block pointers. | +| Were the atomic commit to use a range collapse operation, each block | +| would have to be rewritten very carefully to preserve the graph | +| structure. 
|
+|   Doing this as part of a range collapse means rewriting a large number  |
+|   of blocks repeatedly, which is not conducive to quick repairs.         |
+|                                                                          |
+| This led to the introduction of temporary file staging.                  |
++--------------------------------------------------------------------------+
+
+Using a Temporary File
+``````````````````````
+
+Online repair code should use the ``xrep_tempfile_create`` function to create a
+temporary file inside the filesystem.
+This allocates an inode, marks the in-core inode private, and attaches it to
+the scrub context.
+These files are hidden from userspace, may not be added to the directory tree,
+and must be kept private.
+
+Temporary files only use two inode locks: the IOLOCK and the ILOCK.
+The MMAPLOCK is not needed here, because there must not be page faults from
+userspace for data fork blocks.
+The usage patterns of these two locks are the same as for any other XFS file --
+access to file data are controlled via the IOLOCK, and access to file metadata
+are controlled via the ILOCK.
+Locking helpers are provided so that the temporary file and its lock state can
+be cleaned up by the scrub context.
+To comply with the nested locking strategy laid out in the :ref:`inode
+locking` section, it is recommended that scrub functions use the
+``xrep_tempfile_ilock*_nowait`` lock helpers.
+
+Data can be written to a temporary file by two means:
+
+1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
+   temporary file from an xfile.
+
+2. The regular directory, symbolic link, and extended attribute functions can
+   be used to write to the temporary file.
+
+Once a good copy of a data file has been constructed in a temporary file, it
+must be conveyed to the file being repaired, which is the topic of the next
+section.
+
+The proposed patches are in the
+`repair temporary files
+`_
+series.
+
+Atomic Extent Swapping
+----------------------
+
+Once repair builds a temporary file with a new data structure written into
+it, it must commit the new changes into the existing file.
+It is not possible to swap the inumbers of two files, so instead the new
+metadata must replace the old.
+This suggests the need for the ability to swap extents, but the existing extent
+swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
+for online repair because:
+
+a. When the reverse-mapping btree is enabled, the swap code must keep the
+   reverse mapping information up to date with every exchange of mappings.
+   Therefore, it can only exchange one mapping per transaction, and each
+   transaction is independent.
+
+b. Reverse-mapping is critical for the operation of online fsck, so the old
+   defragmentation code (which swapped entire extent forks in a single
+   operation) is not useful here.
+
+c. Defragmentation is assumed to occur between two files with identical
+   contents.
+   For this use case, an incomplete exchange will not result in a user-visible
+   change in file contents, even if the operation is interrupted.
+
+d. Online repair needs to swap the contents of two files that are by definition
+   *not* identical.
+   For directory and xattr repairs, the user-visible contents might be the
+   same, but the contents of individual blocks may be very different.
+
+e. Old blocks in the file may be cross-linked with another structure and must
+   not reappear if the system goes down mid-repair.
+
+These problems are overcome by creating a new deferred operation and a new type
+of log intent item to track the progress of an operation to exchange two file
+ranges.
+The new deferred operation type chains together the same transactions used by
+the reverse-mapping extent swap code.
+The new log item records the progress of the exchange to ensure that once an
+exchange begins, it will always run to completion, even if there are
+interruptions.
+The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+in the superblock protects these new log item records from being replayed on
+old kernels.
+
+The proposed patchset is the
+`atomic extent swap
+`_
+series.
+
++--------------------------------------------------------------------------+
+| **Sidebar: Using Log-Incompatible Feature Flags**                        |
++--------------------------------------------------------------------------+
+| Starting with XFS v5, the superblock contains a                          |
+| ``sb_features_log_incompat`` field to indicate that the log contains     |
+| records that might not be readable by all kernels that could mount this  |
+| filesystem.                                                              |
+| In short, log incompat features protect the log contents against kernels |
+| that will not understand the contents.                                   |
+| Unlike the other superblock feature bits, log incompat bits are          |
+| ephemeral because an empty (clean) log does not need protection.         |
+| The log cleans itself after its contents have been committed into the    |
+| filesystem, either as part of an unmount or because the system is        |
+| otherwise idle.                                                          |
+| Because upper level code can be working on a transaction at the same     |
+| time that the log cleans itself, it is necessary for upper level code to |
+| communicate to the log when it is going to use a log incompatible        |
+| feature.                                                                 |
+|                                                                          |
+| The log coordinates access to incompatible features through the use of   |
+| one ``struct rw_semaphore`` for each feature.                            |
+| The log cleaning code tries to take this rwsem in exclusive mode to      |
+| clear the bit; if the lock attempt fails, the feature bit remains set.   |
+| Filesystem code signals its intention to use a log incompat feature in a |
+| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
+| in shared mode.                                                          |
+| The code supporting a log incompat feature should create wrapper         |
+| functions to obtain the log feature and call                             |
+| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
+| superblock.                                                              |
+| The superblock update is performed transactionally, so the wrapper to    |
+| obtain log assistance must be called just prior to the creation of the   |
+| transaction that uses the functionality.                                 |
+| For a file operation, this step must happen after taking the IOLOCK      |
+| and the MMAPLOCK, but before allocating the transaction.                 |
+| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
+| function is called to release the feature.                               |
+| The feature bit will not be cleared from the superblock until the log    |
+| becomes clean.                                                           |
+|                                                                          |
+| Log-assisted extended attribute updates and atomic extent swaps both use |
+| log incompat features and provide convenience wrappers around the        |
+| functionality.                                                           |
++--------------------------------------------------------------------------+
+
+Mechanics of an Atomic Extent Swap
+``````````````````````````````````
+
+Swapping entire file forks is a complex task.
+The goal is to exchange all file fork mappings between two file fork offset
+ranges.
+There are likely to be many extent mappings in each fork, and the edges of
+the mappings aren't necessarily aligned.
+Furthermore, there may be other updates that need to happen after the swap,
+such as exchanging file sizes, inode flags, or conversion of fork data to local
+format.
+This is roughly the format of the new deferred extent swap work item:
+
+.. code-block:: c
+
+    struct xfs_swapext_intent {
+            /* Inodes participating in the operation. */
+            struct xfs_inode        *sxi_ip1;
+            struct xfs_inode        *sxi_ip2;
+
+            /* File offset range information. */
+            xfs_fileoff_t           sxi_startoff1;
+            xfs_fileoff_t           sxi_startoff2;
+            xfs_filblks_t           sxi_blockcount;
+
+            /* Set these file sizes after the operation, unless negative. */
+            xfs_fsize_t             sxi_isize1;
+            xfs_fsize_t             sxi_isize2;
+
+            /* XFS_SWAP_EXT_* log operation flags */
+            uint64_t                sxi_flags;
+    };
+
+The new log intent item contains enough information to track two logical fork
+offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
+blockcount)``.
+Each step of a swap operation exchanges the largest file range mapping possible
+from one file to the other.
+After each step in the swap operation, the two startoff fields are incremented
+and the blockcount field is decremented to reflect the progress made.
+The flags field captures behavioral parameters such as swapping the attr fork
+instead of the data fork and other work to be done after the extent swap.
+The two isize fields are used to swap the file size at the end of the operation
+if the file data fork is the target of the swap operation.
+
+When the extent swap is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the extent swap.
+   At the start, it should contain the entirety of the file ranges to be
+   swapped.
+
+2. Call ``xfs_defer_finish`` to process the exchange.
+   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This will log an extent swap intent item to the transaction for the deferred
+   extent swap work item.
+
+3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+
+   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
+      ``sxi_startoff2``, respectively, and compute the longest extent that can
+      be swapped in a single step.
+      This is the minimum of the two ``br_blockcount`` values in the mappings.
+      Keep advancing through the file forks until at least one of the mappings
+      contains written blocks.
+      Mutual holes, unwritten extents, and extent mappings to the same physical
+      space are not exchanged.
+
+      For the next few steps, this document will refer to the mapping that came
+      from file 1 as "map1", and the mapping that came from file 2 as "map2".
+
+   b. Create a deferred block mapping update to unmap map1 from file 1.
+
+   c. Create a deferred block mapping update to unmap map2 from file 2.
+
+   d. Create a deferred block mapping update to map map1 into file 2.
+
+   e. Create a deferred block mapping update to map map2 into file 1.
+
+   f. Log the block, quota, and extent count updates for both files.
+
+   g. Extend the ondisk size of either file if necessary.
+
+   h. Log an extent swap done log item for the extent swap intent log item
+      that was read at the start of step 3.
+
+   i. Compute the amount of file range that has just been covered.
+      This quantity is ``(map1.br_startoff + map1.br_blockcount -
+      sxi_startoff1)``, because step 3a could have skipped holes.
+
+   j. 
Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+      by the number of blocks computed in the previous step, and decrease
+      ``sxi_blockcount`` by the same quantity.
+      This advances the cursor.
+
+   k. Log a new extent swap intent log item reflecting the advanced state of
+      the work item.
+
+   l. Return the proper error code (EAGAIN) to the deferred operation manager
+      to inform it that there is more work to be done.
+      The operation manager completes the deferred work in steps 3b-3e before
+      moving back to the start of step 3.
+
+4. Perform any post-processing.
+   This will be discussed in more detail in subsequent sections.
+
+If the filesystem goes down in the middle of an operation, log recovery will
+find the most recent unfinished extent swap log intent item and restart from
+there.
+This is how extent swapping guarantees that an outside observer will either see
+the old broken structure or the new one, and never a mishmash of both.
+
+Preparation for Extent Swapping
+```````````````````````````````
+
+There are a few things that need to be taken care of before initiating an
+atomic extent swap operation.
+First, regular files require the page cache to be flushed to disk before the
+operation begins, and directio writes to be quiesced.
+Like any filesystem operation, extent swapping must determine the maximum
+amount of disk space and quota that can be consumed on behalf of both files in
+the operation, and reserve that quantity of resources to avoid an unrecoverable
+out of space failure once it starts dirtying metadata.
+The preparation step scans the ranges of both files to estimate:
+
+- Data device blocks needed to handle the repeated updates to the fork
+  mappings.
+- Change in data and realtime block counts for both files.
+- Increase in quota usage for both files, if the two files do not share the
+  same set of quota ids.
+- The number of extent mappings that will be added to each file.
+- Whether or not there are partially written realtime extents.
+  User programs must never be able to access a realtime file extent that maps
+  to different extents on the realtime volume, which could happen if the
+  operation fails to run to completion.
+
+The need for precise estimation increases the run time of the swap operation,
+but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the extent
+swap ever add more extent mappings to a fork than it can support.
+Regular users are required to abide by the quota limits, though metadata
+repairs may exceed quota to resolve inconsistent metadata elsewhere.
+
+Special Features for Swapping Metadata File Extents
+```````````````````````````````````````````````````
+
+Extended attributes, symbolic links, and directories can set the fork format to
+"local" and treat the fork as a literal area for data storage.
+Metadata repairs must take extra steps to support these cases:
+
+- If both forks are in local format and the fork areas are large enough, the
+  swap is performed by copying the incore fork contents, logging both forks,
+  and committing.
+  The atomic extent swap mechanism is not necessary, since this can be done
+  with a single transaction.
+
+- If both forks map blocks, then the regular atomic extent swap is used.
+
+- Otherwise, only one fork is in local format.
+  The contents of the local format fork are converted to a block to perform the
+  swap.
+  The conversion to block format must be done in the same transaction that
+  logs the initial extent swap intent log item.
+  The regular atomic extent swap is used to exchange the mappings.
+  Special flags are set on the swap operation so that the transaction can be
+  rolled one more time to convert the second file's fork back to local format
+  so that the second file will be ready to go as soon as the ILOCK is dropped.
+
+Extended attributes and directories stamp the owning inode into every block,
+but the buffer verifiers do not actually check the inode number!
+Although there is no verification, it is still important to maintain
+referential integrity, so prior to performing the extent swap, online repair
+builds every block in the new data structure with the owner field of the file
+being repaired.
+
+After a successful swap operation, the repair operation must reap the old fork
+blocks by processing each fork mapping through the standard :ref:`file extent
+reaping ` mechanism that is done post-repair.
+If the filesystem should go down during the reap part of the repair, the
+iunlink processing at the end of recovery will free both the temporary file and
+whatever blocks were not reaped.
+However, this iunlink processing omits the cross-link detection of online
+repair, and is not completely foolproof.
+
+Swapping Temporary File Extents
+```````````````````````````````
+
+To repair a metadata file, online repair proceeds as follows:
+
+1. Create a temporary repair file.
+
+2. Use the staging data to write out new contents into the temporary repair
+   file.
+   The same fork must be written to as is being repaired.
+
+3. Commit the scrub transaction, since the swap estimation step must be
+   completed before transaction reservations are made.
+
+4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+   the appropriate resource reservations and locks, and fill out a ``struct
+   xfs_swapext_req`` with the details of the swap operation.
+
+5. Call ``xrep_tempswap_contents`` to swap the contents.
+
+6. Commit the transaction to complete the repair.
+
+.. _rtsummary:
+
+Case Study: Repairing the Realtime Summary File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the "realtime" section of an XFS filesystem, free space is tracked via a
+bitmap, similar to Unix FFS.
+Each bit in the bitmap represents one realtime extent, which is a multiple of
+the filesystem block size between 4KiB and 1GiB in size.
+The realtime summary file indexes the number of free extents of a given size to
+the offset of the block within the realtime free space bitmap where those free
+extents begin.
+In other words, the summary file helps the allocator find free extents by
+length, similar to what the free space by count (cntbt) btree does for the data
+section.
+
+The summary file itself is a flat file (with no block headers or checksums!)
+partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
+counters to match the number of blocks in the rt bitmap.
+Each counter records the number of free extents that start in that bitmap block
+and can satisfy a power-of-two allocation request.
+
+To check the summary file against the bitmap:
+
+1. Take the ILOCK of both the realtime bitmap and summary files.
+
+2. For each free space extent recorded in the bitmap:
+
+   a. Compute the position in the summary file that contains a counter that
+      represents this free extent.
+
+   b. Read the counter from the xfile.
+
+   c. Increment it, and write it back to the xfile.
+
+3. 
Compare the contents of the xfile against the ondisk file.
+
+To repair the summary file, write the xfile contents into the temporary file
+and use atomic extent swap to commit the new contents.
+The temporary file is then reaped.
+
+The proposed patchset is the
+`realtime summary repair
+`_
+series.
+
+Case Study: Salvaging Extended Attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In XFS, extended attributes are implemented as a namespaced name-value store.
+Values are limited in size to 64KiB, but there is no limit on the number of
+names.
+The attribute fork is unpartitioned, which means that the root of the attribute
+structure is always in logical block zero, but attribute leaf blocks, dabtree
+index blocks, and remote value blocks are intermixed.
+Attribute leaf blocks contain variable-sized records that associate
+user-provided names with the user-provided values.
+Values larger than a block are allocated separate extents and written there.
+If the leaf information expands beyond a single block, a directory/attribute
+btree (``dabtree``) is created to map hashes of attribute names to entries
+for fast lookup.
+
+Salvaging extended attributes is done as follows:
+
+1. Walk the attr fork mappings of the file being repaired to find the attribute
+   leaf blocks.
+   When one is found,
+
+   a. Walk the attr leaf block to find candidate keys.
+      When one is found,
+
+      1. Check the name for problems, and ignore the name if there are any.
+
+      2. Retrieve the value.
+         If that succeeds, add the name and value to the staging xfarray and
+         xfblob.
+
+2. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more attr fork blocks to examine, unlock the file and
+   add the staged extended attributes to the temporary file.
+
+3. Use atomic extent swapping to exchange the new and old extended attribute
+   structures.
+   The old attribute blocks are now attached to the temporary file.
+
+4. Reap the temporary file.
+
+The proposed patchset is the
+`extended attribute repair
+`_
+series.
-- 
cgit v1.2.3


From a26aa25247bc91fefa08e4127a16046135f30a5f Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:51 -0700
Subject: xfs: document directory tree repairs

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 419 +++++++++++++++++++++
 1 file changed, 419 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index facfde07e569..6be097aa623b 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2150,6 +2150,8 @@ reduce map/unmap cycles.
 Surprisingly, this reduces overall sort runtime by nearly half again after
 accounting for the application of heapsort directly onto xfile pages.
 
+.. _xfblob:
+
 Blob Storage
 ````````````
 
@@ -4368,3 +4370,420 @@ The proposed patchset is the
 `extended attribute repair
 `_
 series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features,
+since directory entries are not redundant.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The :ref:`file link count fsck ` code takes care of fixing link counts
+and moving orphans to the ``/lost+found`` directory.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged parent
+   has a child entry pointing back to the directory being repaired.
+   Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+   When one is found,
+
+   a. Walk the directory data block to find candidate entries.
+      When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+      ii. Retrieve the inumber and grab the inode.
+          If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more directory data blocks to examine, unlock the
+   directory and add the staged dirents into the temporary directory.
+   Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+   The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
+
+**Future Work Question**: Should repair revalidate the dentry cache when
+rebuilding a directory?
+
+*Answer*: Yes, it should.
+
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following applies:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+   directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+   purged.
+   This is the problem case.
+
+Unfortunately, the current dentry cache design doesn't provide a means to walk
+every child dentry of a specific directory, which makes this a hard problem.
+There is no known solution.
+
+The proposed patchset is the
+`directory repair
+`_
+series.
+
+Parent Pointers
+```````````````
+
+A parent pointer is a piece of file metadata that enables a user to locate the
+file's parent directory without having to traverse the directory tree from the
+root.
+Without them, reconstruction of directory trees is hindered in much the same
+way that the historic lack of reverse space mapping information once hindered
+reconstruction of filesystem space metadata.
+The parent pointer feature, however, makes total directory reconstruction
+possible.
+
+XFS parent pointers include the dirent name and location of the entry within
+the parent directory.
+In other words, child files use extended attributes to store pointers to +parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. +The directory checking process can be strengthened to ensure that the target of +each dirent also contains a parent pointer pointing back to the dirent. +Likewise, each parent pointer can be checked by ensuring that the target of +each parent pointer is a directory and that it contains a dirent matching +the parent pointer. +Both online and offline repair can use this strategy. + +**Note**: The ondisk format of parent pointers is not yet finalized. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| Directory parent pointers were first proposed as an XFS feature more | +| than a decade ago by SGI. | +| Each link from a parent directory to a child file is mirrored with an | +| extended attribute in the child that could be used to identify the | +| parent directory. | +| Unfortunately, this early implementation had major shortcomings and was | +| never merged into Linux XFS: | +| | +| 1. The XFS codebase of the late 2000s did not have the infrastructure to | +| enforce strong referential integrity in the directory tree. | +| It did not guarantee that a change in a forward link would always be | +| followed up with the corresponding change to the reverse links. | +| | +| 2. Referential integrity was not integrated into offline repair. | +| Checking and repairs were performed on mounted filesystems without | +| taking any kernel or inode locks to coordinate access. | +| It is not clear how this actually worked properly. | +| | +| 3. The extended attribute did not record the name of the directory entry | +| in the parent, so the SGI parent pointer implementation cannot be | +| used to reconnect the directory tree. | +| | +| 4. Extended attribute forks only support 65,536 extents, which means | +| that parent pointer attribute creation is likely to fail at some | +| point before the maximum file link count is achieved. | +| | +| The original parent pointer design was too unstable for something like | +| a file system repair to depend on. | +| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | +| second implementation that solves all shortcomings of the first. | +| During 2022, Allison introduced log intent items to track physical | +| manipulations of the extended attribute structures. | +| This solves the referential integrity problem by making it possible to | +| commit a dirent update and a parent pointer update in the same | +| transaction. | +| Chandan increased the maximum extent counts of both data and attribute | +| forks, thereby ensuring that the extended attribute structure can grow | +| to handle the maximum hardlink count of any file. | ++--------------------------------------------------------------------------+ + +Case Study: Repairing Directories with Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Directory rebuilding uses a :ref:`coordinated inode scan ` and +a :ref:`directory entry live update hook ` as follows: + +1. Set up a temporary directory for generating the new directory structure, + an xfblob for storing entry names, and an xfarray for stashing directory + updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. 
For each parent pointer found in each file scanned, decide if the parent
+   pointer references the directory of interest.
+   If so:
+
+   a. Stash an addname entry for this dirent in the xfarray for later.
+
+   b. When finished scanning that file, flush the stashed updates to the
+      temporary directory.
+
+4. For each live directory update received via the hook, decide if the child
+   has already been scanned.
+   If so:
+
+   a. Stash an addname or removename entry for this dirent update in the
+      xfarray for later.
+      We cannot write directly to the temporary directory because hook
+      functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed updates to the temporary directory.
+
+5. When the scan is complete, atomically swap the contents of the temporary
+   directory and the directory being repaired.
+   The temporary directory now contains the damaged directory structure.
+
+6. Reap the temporary directory.
+
+7. Update the dirent position field of parent pointers as necessary.
+   This may require the queuing of a substantial number of xattr log intent
+   items.
+
+The proposed patchset is the
+`parent pointers directory repair
+`_
+series.
+
+**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
+match in the reconstructed directory?
+
+*Answer*: There are a few ways to solve this problem:
+
+1. The field could be designated advisory, since the other three values are
+   sufficient to find the entry in the parent.
+   However, this makes indexed key lookup impossible while repairs are ongoing.
+
+2. We could allow creating directory entries at specified offsets, which solves
+   the referential integrity problem but runs the risk that dirent creation
+   will fail due to conflicts with the free space in the directory.
+
+   These conflicts could be resolved by appending the directory entry and
+   amending the xattr code to support updating an xattr key and reindexing the
+   dabtree, though this would have to be performed with the parent directory
+   still locked.
+
+3. Same as above, but remove the old parent pointer entry and add a new one
+   atomically.
+
+4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
+   which would provide the attr name uniqueness that we require, without
+   forcing repair code to update the dirent position.
+   Unfortunately, this requires changes to the xattr code to support attr
+   names as long as 263 bytes.
+
+5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
+   (name, parent_gen)``.
+   If the hash is sufficiently resistant to collisions (e.g. sha256) then
+   this should provide the attr name uniqueness that we require.
+   Names shorter than 247 bytes could be stored directly.
+
+Discussion is ongoing under the `parent pointers patch deluge
+`_.
+
+Case Study: Repairing Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Online reconstruction of a file's parent pointer information works similarly to
+directory reconstruction:
+
+1. Set up a temporary file for generating a new extended attribute structure,
+   an ``xfblob`` for storing parent pointer names, and an xfarray for
+   stashing parent pointer updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+   updates on directory operations.
+
+3. For each directory entry found in each directory scanned, decide if the
+   dirent references the file of interest.
+   If so:
+
+   a. 
Stash an addpptr entry for this parent pointer in the xfblob and xfarray
+      for later.
+
+   b. When finished scanning the directory, flush the stashed updates to the
+      temporary directory.
+
+4. For each live directory update received via the hook, decide if the parent
+   has already been scanned.
+   If so:
+
+   a. Stash an addpptr or removepptr entry for this dirent update in the
+      xfarray for later.
+      We cannot write parent pointers directly to the temporary file because
+      hook functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed parent pointer updates to the temporary file.
+
+5. Copy all non-parent pointer extended attributes to the temporary file.
+
+6. When the scan is complete, atomically swap the attribute fork of the
+   temporary file and the file being repaired.
+   The temporary file now contains the damaged extended attribute structure.
+
+7. Reap the temporary file.
+
+The proposed patchset is the
+`parent pointers repair
+`_
+series.
+
+Digression: Offline Checking of Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (i.e. phase 6),
+   walk the surviving directories of each AG in the filesystem.
+   This is already performed as part of the connectivity checks.
+
+2. For each directory entry found, record the name in an xfblob, and store
+   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
+   per-AG in-memory slab.
+
+3. For each AG in the filesystem,
+
+   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
+      dirent_pos.
+
+   b. For each inode in the AG,
+
+      1. Scan the inode for parent pointers.
+         Record the names in a per-file xfblob, and store ``(parent_inum,
+         parent_gen, dirent_pos)`` tuples in a per-file slab.
+
+      2. Sort the per-file tuples in order of parent_inum and dirent_pos.
+
+      3. Position one slab cursor at the start of the inode's records in the
+         per-AG tuple slab.
+         This should be trivial since the per-AG tuples are in child inumber
+         order.
+
+      4. Position a second slab cursor at the start of the per-file tuple slab.
+
+      5. Iterate the two cursors in lockstep, comparing the parent_inum and
+         dirent_pos fields of the records under each cursor.
+
+         a. Tuples in the per-AG list but not the per-file list are missing and
+            need to be written to the inode.
+
+         b. Tuples in the per-file list but not the per-AG list are dangling
+            and need to be removed from the inode.
+
+         c. For tuples in both lists, update the parent_gen and name components
+            of the parent pointer if necessary.
+
+4. Move on to examining link counts, as we do today.
+
+The proposed patchset is the
+`offline parent pointers repair
+`_
+series.
+
+Rebuilding directories from parent pointers in offline repair is very
+challenging because it currently uses a single-pass scan of the filesystem
+during phase 3 to decide which files are corrupt enough to be zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+   much as it does now.
+   Corrupt directories are noted but not zapped.
+
+2. 
The next pass records parent pointers pointing to the directories noted
+   as being corrupt in the first pass.
+   This second pass may have to happen after the phase 4 scan for duplicate
+   blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+   Free space metadata has not been ensured yet, so repair cannot yet use the
+   directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+   Use the parent pointer information recorded during step 2 to reconstruct
+   the dirents and add them to the now-empty directories.
+
+This code has not yet been constructed.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+
+Without parent pointers, the directory parent pointer online scrub code can
+detect a dotdot entry pointing to a parent directory that doesn't have a link
+back to the child directory and the file link count checker can detect a file
+that isn't pointed to by any directory in the filesystem.
+If such a file has a positive link count, the file is an orphan.
+
+With parent pointers, directories can be rebuilt by scanning parent pointers
+and parent pointers can be rebuilt by scanning directories.
+This should reduce the incidence of files ending up in ``/lost+found``.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
+
+The proposed patches are in the
+`orphanage adoption
+`_
+series.
-- 
cgit v1.2.3


From af051dfb81111610eed89a1098874932d59e7d8f Mon Sep 17 00:00:00 2001
From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 18:59:51 -0700 Subject: xfs: document the userspace fsck driver program Add the sixth chapter of the online fsck design documentation, where we discuss the details of the data structures and algorithms used by the driver program xfs_scrub. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- .../filesystems/xfs-online-fsck-design.rst | 316 +++++++++++++++++++++ 1 file changed, 316 insertions(+) diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst index 6be097aa623b..6dcdf70626f0 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -315,6 +315,9 @@ The seven phases are as follows: 7. Re-check the summary counters and presents the caller with a summary of space usage and file counts. +This allocation of responsibilities will be :ref:`revisited ` +later in this document. + Steps for Each Scrub Item ------------------------- @@ -4787,3 +4790,316 @@ The proposed patches are in the `orphanage adoption `_ series. + +6. Userspace Algorithms and Data Structures +=========================================== + +This section discusses the key algorithms and data structures of the userspace +program, ``xfs_scrub``, that provide the ability to drive metadata checks and +repairs in the kernel, verify file data, and look for other potential problems. + +.. _scrubcheck: + +Checking Metadata +----------------- + +Recall the :ref:`phases of fsck work` outlined earlier. +That structure follows naturally from the data dependencies designed into the +filesystem from its beginnings in 1993. +In XFS, there are several groups of metadata dependencies: + +a. Filesystem summary counts depend on consistency within the inode indices, + the allocation group space btrees, and the realtime volume space + information. + +b. Quota resource counts depend on consistency within the quota file data + forks, inode indices, inode records, and the forks of every file on the + system. + +c. The naming hierarchy depends on consistency within the directory and + extended attribute structures. + This includes file link counts. + +d. Directories, extended attributes, and file data depend on consistency within + the file forks that map directory and extended attribute data to physical + storage media. + +e. The file forks depends on consistency within inode records and the space + metadata indices of the allocation groups and the realtime volume. + This includes quota and realtime metadata files. + +f. Inode records depends on consistency within the inode metadata indices. + +g. Realtime space metadata depend on the inode records and data forks of the + realtime metadata inodes. + +h. The allocation group metadata indices (free space, inodes, reference count, + and reverse mapping btrees) depend on consistency within the AG headers and + between all the AG metadata btrees. + +i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support + for online fsck functionality. + +Therefore, a metadata dependency graph is a convenient way to schedule checking +operations in the ``xfs_scrub`` program: + +- Phase 1 checks that the provided path maps to an XFS filesystem and detect + the kernel's scrubbing abilities, which validates group (i). + +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. + +- Phase 3 scans inodes in parallel. + For each inode, groups (f), (e), and (d) are checked, in that order. 
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+  may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+  to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+  to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
+
+The proposed patchsets are the scrub
+`performance tweaks
+`_
+and the
+`inode scan rebalance
+`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems reported in any other space metadata are deferred to phase 4.
+Optimization opportunities are always deferred to phase 4, no matter their
+origin.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata are repaired immediately if all space metadata were validated
+during phase 2.
+Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
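+
+A minimal sketch of this scheduling policy, using hypothetical names rather
+than the actual ``xfs_scrub`` data structures, might look like this:
+
+.. code-block:: c
+
+    /* Illustrative repair dispositions; xfs_scrub's real types differ. */
+    enum repair_disposition {
+            REPAIR_NOW,     /* attempt the repair immediately */
+            REPAIR_DEFER,   /* queue the repair item for phase 4 */
+    };
+
+    static enum repair_disposition
+    phase3_dispose(
+            bool            space_metadata_clean,
+            bool            is_optimization)
+    {
+            /* Optimizations always wait for phase 4, no matter their origin. */
+            if (is_optimization)
+                    return REPAIR_DEFER;
+
+            /*
+             * File metadata can be repaired on the spot only if phase 2
+             * validated all of the space metadata; otherwise the repair
+             * joins the phase 4 queue.
+             */
+            if (space_metadata_clean)
+                    return REPAIR_NOW;
+            return REPAIR_DEFER;
+    }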
+ +In the original design of ``xfs_scrub``, it was thought that repairs would be +so infrequent that the ``struct xfs_scrub_metadata`` objects used to +communicate with the kernel could also be used as the primary object to +schedule repairs. +With recent increases in the number of optimizations possible for a given +filesystem object, it became much more memory-efficient to track all eligible +repairs for a given filesystem object with a single repair item. +Each repair item represents a single lockable object -- AGs, metadata files, +individual inodes, or a class of summary information. + +Phase 4 is responsible for scheduling a lot of repair work in as quick a +manner as is practical. +The :ref:`data dependencies ` outlined earlier still apply, which +means that ``xfs_scrub`` must try to complete the repair work scheduled by +phase 2 before trying repair work scheduled by phase 3. +The repair process is as follows: + +1. Start a round of repair with a workqueue and enough workers to keep the CPUs + as busy as the user desires. + + a. For each repair item queued by phase 2, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + b. If any repairs were made, jump back to 1a to retry all the phase 2 items. + + c. For each repair item queued by phase 3, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + d. If any repairs were made, jump back to 1c to retry all the phase 3 items. + +2. If step 1 made any repair progress of any kind, jump back to step 1 to start + another round of repair. + +3. If there are items left to repair, run them all serially one more time. + Complain if the repairs were not successful, since this is the last chance + to repair anything. + +Corruptions and inconsistencies encountered during phases 5 and 7 are repaired +immediately. +Corrupt file data blocks reported by phase 6 cannot be recovered by the +filesystem. + +The proposed patchsets are the +`repair warning improvements +`_, +refactoring of the +`repair data dependency +`_ +and +`object tracking +`_, +and the +`repair scheduling +`_ +improvement series. + +Checking Names for Confusable Unicode Sequences +----------------------------------------------- + +If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of +phase 4, it moves on to phase 5, which checks for suspicious looking names in +the filesystem. +These names consist of the filesystem label, names in directory entries, and +the names of extended attributes. +Like most Unix filesystems, XFS imposes the sparest of constraints on the +contents of a name: + +- Slashes and null bytes are not allowed in directory entries. + +- Null bytes are not allowed in userspace-visible extended attributes. + +- Null bytes are not allowed in the filesystem label. 
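+
+A minimal sketch of these byte-level checks, written for illustration rather
+than taken from ``xfs_scrub``, might look like this:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <string.h>
+
+    /* Names are length-delimited ondisk, so scan the entire buffer. */
+    static bool
+    dirent_name_is_legal(
+            const char      *name,
+            size_t          namelen)
+    {
+            /* Directory entries may not contain slashes or null bytes. */
+            return memchr(name, '/', namelen) == NULL &&
+                   memchr(name, '\0', namelen) == NULL;
+    }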
+
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by them.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms `_
+document.
+When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu `_
+to identify names within a directory or within a file's extended attributes
+that could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs after validation of all filesystem metadata (except for the
+summary counters) as phase 6.
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
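+
+The space map walk that underlies this scan can be sketched in userspace C;
+this illustrates the stock ``FS_IOC_GETFSMAP`` interface and is not
+``xfs_scrub``'s actual scanner:
+
+.. code-block:: c
+
+    #include <stdlib.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsmap.h>
+
+    /* Pass every space mapping in the filesystem to a callback. */
+    static int
+    walk_spacemap(int fd, void (*fn)(const struct fsmap *rec))
+    {
+        struct fsmap_head *head;
+        struct fsmap *last;
+        unsigned int i;
+
+        head = calloc(1, fsmap_sizeof(128));
+        if (!head)
+            return -1;
+        head->fmh_count = 128;
+        /* The low key is all zeroes; the high key covers everything. */
+        head->fmh_keys[1].fmr_device = ~0U;
+        head->fmh_keys[1].fmr_flags = ~0U;
+        head->fmh_keys[1].fmr_physical = ~0ULL;
+        head->fmh_keys[1].fmr_owner = ~0ULL;
+        head->fmh_keys[1].fmr_offset = ~0ULL;
+
+        while (ioctl(fd, FS_IOC_GETFSMAP, head) == 0 &&
+               head->fmh_entries > 0) {
+            for (i = 0; i < head->fmh_entries; i++)
+                fn(&head->fmh_recs[i]);
+            last = &head->fmh_recs[head->fmh_entries - 1];
+            if (last->fmr_flags & FMR_OF_LAST)
+                break;
+            fsmap_advance(head);    /* resume after the last record */
+        }
+        free(head);
+        return 0;
+    }
+
+A real scanner would filter the returned records by owner and flags to pick
+out data fork extents before coalescing them as described above.
+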
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow down the failure to the specific region of the media, which is then
+recorded.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, parent pointers can be used to
+construct file paths from inode numbers for user-friendly reporting.
-- 
cgit v1.2.3


From 03786f0afb2ed5705a0478e14fea50a7f1a44f7e Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:52 -0700
Subject: xfs: document future directions of online fsck

Add the seventh and final chapter of the online fsck documentation,
where we talk about future functionality that can tie in with the
functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 .../filesystems/xfs-online-fsck-design.rst         | 210 +++++++++++++++++++++
 1 file changed, 210 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 6dcdf70626f0..791ab264b77e 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -5103,3 +5103,213 @@ mapping ioctl to map the recorded media errors back to metadata structures
 and report what has been lost.
 For media errors in blocks owned by files, parent pointers can be used to
 construct file paths from inode numbers for user-friendly reporting.
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this document
+and now has some familiarity with how XFS performs online rebuilding of its
+metadata indices, and how filesystem users can interact with that
+functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Extent Swapping with Regular User Files
+```````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain +the consistency of the fork mappings with the reverse mapping index was to +develop an iterative mechanism that used deferred bmap and rmap operations to +swap mappings one at a time. +This mechanism is identical to steps 2-3 from the procedure above except for +the new tracking items, because the atomic extent swap mechanism is an +iteration of an existing mechanism and not something totally novel. +For the narrow case of file defragmentation, the file contents must be +identical, so the recovery guarantees are not much of a gain. + +Atomic extent swapping is much more flexible than the existing swapext +implementations because it can guarantee that the caller never sees a mix of +old and new contents even after a crash, and it can operate on two arbitrary +file fork ranges. +The extra flexibility enables several new use cases: + +- **Atomic commit of file writes**: A userspace process opens a file that it + wants to update. + Next, it opens a temporary file and calls the file clone operation to reflink + the first file's contents into the temporary file. + Writes to the original file should instead be written to the temporary file. + Finally, the process calls the atomic extent swap system call + (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all + of the updates to the original file, or none of them. + +.. _swapext_if_unchanged: + +- **Transactional file updates**: The same mechanism as above, but the caller + only wants the commit to occur if the original file's contents have not + changed. + To make this happen, the calling process snapshots the file modification and + change timestamps of the original file before reflinking its data to the + temporary file. + When the program is ready to commit the changes, it passes the timestamps + into the kernel as arguments to the atomic extent swap system call. + The kernel only commits the changes if the provided timestamps match the + original file. + +- **Emulation of atomic block device writes**: Export a block device with a + logical sector size matching the filesystem block size to force all writes + to be aligned to the filesystem block size. + Stage all writes to a temporary file, and when that is complete, call the + atomic extent swap system call with a flag to indicate that holes in the + temporary file should be ignored. + This emulates an atomic device write in software, and can support arbitrary + scattered writes. + +Vectorized Scrub +---------------- + +As it turns out, the :ref:`refactoring ` of repair items mentioned +earlier was a catalyst for enabling a vectorized scrub system call. +Since 2018, the cost of making a kernel call has increased considerably on some +systems to mitigate the effects of speculative execution attacks. +This incentivizes program authors to make as few system calls as possible to +reduce the number of times an execution path crosses a security boundary. + +With vectorized scrub, userspace pushes to the kernel the identity of a +filesystem object, a list of scrub types to run against that object, and a +simple representation of the data dependencies between the selected scrub +types. +The kernel executes as much of the caller's plan as it can until it hits a +dependency that cannot be satisfied due to a corruption, and tells userspace +how much was accomplished. 
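+
+To give a feel for the shape of such an interface, here is a hypothetical
+sketch; none of these structure or field names come from the proposal itself:
+
+.. code-block:: c
+
+    #include <linux/types.h>
+
+    /* One scrub request; dependencies are expressed by array order here. */
+    struct scrubv_item {
+        __u32 sv_type;   /* XFS_SCRUB_TYPE_* */
+        __u32 sv_flags;  /* out: corruption/repair state */
+    };
+
+    /* Identity of one filesystem object plus the scrubs to run against it. */
+    struct scrubv_head {
+        __u64 svh_ino;   /* inode number, if applicable */
+        __u32 svh_gen;   /* inode generation */
+        __u32 svh_agno;  /* AG number, if applicable */
+        __u32 svh_nr;    /* number of items */
+        __u32 svh_done;  /* out: how many items were attempted */
+        struct scrubv_item svh_items[];
+    };
+
+In this sketch the kernel would walk ``svh_items`` in order, stop at the first
+item whose dependencies cannot be satisfied due to a corruption, and report
+through ``svh_done`` how much of the plan was accomplished.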
+
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectored scrub system
+call to XFS.
+
+The relevant patchsets are the
+`kernel vectorized scrub
+`_
+and
+`userspace vectorized scrub
+`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process, which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds its budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly; past that point, a
+QoS timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
+
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Future Work Question**: How might the filesystem move inode chunks?
+
+*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
+that creates a new file with the old contents and then locklessly runs around
+the filesystem updating directory entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Future Work Question**: Can static keys be used to minimize the cost of
+supporting ``revoke()`` on XFS files?
+
+*Answer*: Yes.
+Until the first revocation, the bailout code need not be in the call path at
+all.
+
+The relevant patchsets are the
+`kernel freespace defrag
+`_
+and
+`userspace freespace defrag
+`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which
+is a use of free space defragmentation!
-- 
cgit v1.2.3


From 774a99b47b588bf0bd9f65d3b241d5bba0b2fcb0 Mon Sep 17 00:00:00 2001
From: "Darrick J. Wong" 
Date: Tue, 11 Apr 2023 18:59:53 -0700
Subject: xfs: give xfs_bmap_intent its own perag reference

Give the xfs_bmap_intent a passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
passive references to know if an AG is quiesced or not.

The reason why we take a passive ref for a file mapping operation is
simple: we're committing to some sort of action involving space in an
AG, so we want to indicate our interest in that AG.  The space is
already allocated, so we need to be able to operate on AGs that are
offline or being shrunk.

Signed-off-by: Darrick J.
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_bmap.c | 1 + fs/xfs/libxfs/xfs_bmap.h | 4 ++++ fs/xfs/xfs_bmap_item.c | 29 ++++++++++++++++++++++++++++- 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 34de6e6898c4..0b4fb3979bdc 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -6075,6 +6075,7 @@ __xfs_bmap_add( bi->bi_whichfork = whichfork; bi->bi_bmap = *bmap; + xfs_bmap_update_get_group(tp->t_mountp, bi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list); return 0; } diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index dd08361ca5a6..e5a492027aea 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -238,9 +238,13 @@ struct xfs_bmap_intent { enum xfs_bmap_intent_type bi_type; int bi_whichfork; struct xfs_inode *bi_owner; + struct xfs_perag *bi_pag; struct xfs_bmbt_irec bi_bmap; }; +void xfs_bmap_update_get_group(struct xfs_mount *mp, + struct xfs_bmap_intent *bi); + int xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_bmap_intent *bi); void xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip, struct xfs_bmbt_irec *imap); diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 6e2f0013380a..8f0f33d07d2c 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -24,6 +24,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_bui_cache; struct kmem_cache *xfs_bud_cache; @@ -363,6 +364,26 @@ xfs_bmap_update_create_done( return &xfs_trans_get_bud(tp, BUI_ITEM(intent))->bud_item; } +/* Take a passive ref to the AG containing the space we're mapping. */ +void +xfs_bmap_update_get_group( + struct xfs_mount *mp, + struct xfs_bmap_intent *bi) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock); + bi->bi_pag = xfs_perag_get(mp, agno); +} + +/* Release a passive AG ref after finishing mapping work. */ +static inline void +xfs_bmap_update_put_group( + struct xfs_bmap_intent *bi) +{ + xfs_perag_put(bi->bi_pag); +} + /* Process a deferred rmap update. */ STATIC int xfs_bmap_update_finish_item( @@ -381,6 +402,8 @@ xfs_bmap_update_finish_item( ASSERT(bi->bi_type == XFS_BMAP_UNMAP); return -EAGAIN; } + + xfs_bmap_update_put_group(bi); kmem_cache_free(xfs_bmap_intent_cache, bi); return error; } @@ -393,7 +416,7 @@ xfs_bmap_update_abort_intent( xfs_bui_release(BUI_ITEM(intent)); } -/* Cancel a deferred rmap update. */ +/* Cancel a deferred bmap update. */ STATIC void xfs_bmap_update_cancel_item( struct list_head *item) @@ -401,6 +424,8 @@ xfs_bmap_update_cancel_item( struct xfs_bmap_intent *bi; bi = container_of(item, struct xfs_bmap_intent, bi_list); + + xfs_bmap_update_put_group(bi); kmem_cache_free(xfs_bmap_intent_cache, bi); } @@ -509,10 +534,12 @@ xfs_bui_item_recover( fake.bi_bmap.br_state = (map->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ? XFS_EXT_UNWRITTEN : XFS_EXT_NORM; + xfs_bmap_update_get_group(mp, &fake); error = xfs_trans_log_finish_bmap_update(tp, budp, &fake); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map, sizeof(*map)); + xfs_bmap_update_put_group(&fake); if (error) goto err_cancel; -- cgit v1.2.3 From b2ccab3199aa7cea9154d80ea2585312c5f6eba0 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:53 -0700 Subject: xfs: pass per-ag references to xfs_free_extent Pass a reference to the per-AG structure to xfs_free_extent. 
Most callers already have one, so we can eliminate unnecessary lookups. The one exception to this is the EFI code, which the next patch will fix. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ag.c | 6 ++---- fs/xfs/libxfs/xfs_alloc.c | 15 +++++---------- fs/xfs/libxfs/xfs_alloc.h | 8 +++++--- fs/xfs/libxfs/xfs_ialloc_btree.c | 7 +++++-- fs/xfs/libxfs/xfs_refcount_btree.c | 5 +++-- fs/xfs/scrub/repair.c | 3 ++- fs/xfs/xfs_extfree_item.c | 8 ++++++-- 7 files changed, 28 insertions(+), 24 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index 86696a1c6891..ae45f546ed86 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -1043,10 +1043,8 @@ xfs_ag_extend_space( if (error) return error; - error = xfs_free_extent(tp, XFS_AGB_TO_FSB(pag->pag_mount, pag->pag_agno, - be32_to_cpu(agf->agf_length) - len), - len, &XFS_RMAP_OINFO_SKIP_UPDATE, - XFS_AG_RESV_NONE); + error = xfs_free_extent(tp, pag, be32_to_cpu(agf->agf_length) - len, + len, &XFS_RMAP_OINFO_SKIP_UPDATE, XFS_AG_RESV_NONE); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 203f16c48c19..ea9ac2ad9d36 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -3596,7 +3596,8 @@ xfs_free_extent_fix_freelist( int __xfs_free_extent( struct xfs_trans *tp, - xfs_fsblock_t bno, + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type, @@ -3604,12 +3605,9 @@ __xfs_free_extent( { struct xfs_mount *mp = tp->t_mountp; struct xfs_buf *agbp; - xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, bno); - xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, bno); struct xfs_agf *agf; int error; unsigned int busy_flags = 0; - struct xfs_perag *pag; ASSERT(len != 0); ASSERT(type != XFS_AG_RESV_AGFL); @@ -3618,10 +3616,9 @@ __xfs_free_extent( XFS_ERRTAG_FREE_EXTENT)) return -EIO; - pag = xfs_perag_get(mp, agno); error = xfs_free_extent_fix_freelist(tp, pag, &agbp); if (error) - goto err; + return error; agf = agbp->b_addr; if (XFS_IS_CORRUPT(mp, agbno >= mp->m_sb.sb_agblocks)) { @@ -3635,20 +3632,18 @@ __xfs_free_extent( goto err_release; } - error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, type); + error = xfs_free_ag_extent(tp, agbp, pag->pag_agno, agbno, len, oinfo, + type); if (error) goto err_release; if (skip_discard) busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD; xfs_extent_busy_insert(tp, pag, agbno, len, busy_flags); - xfs_perag_put(pag); return 0; err_release: xfs_trans_brelse(tp, agbp); -err: - xfs_perag_put(pag); return error; } diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 2b246d74c189..e12d86e3aeec 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -141,7 +141,8 @@ int xfs_alloc_vextent_first_ag(struct xfs_alloc_arg *args, int /* error */ __xfs_free_extent( struct xfs_trans *tp, /* transaction pointer */ - xfs_fsblock_t bno, /* starting block number of extent */ + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, /* length of extent */ const struct xfs_owner_info *oinfo, /* extent owner */ enum xfs_ag_resv_type type, /* block reservation type */ @@ -150,12 +151,13 @@ __xfs_free_extent( static inline int xfs_free_extent( struct xfs_trans *tp, - xfs_fsblock_t bno, + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type) { - return __xfs_free_extent(tp, bno, len, oinfo, type, false); + return __xfs_free_extent(tp, pag, 
agbno, len, oinfo, type, false); } int /* error */ diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 9b28211d5a4c..1d2af50ac95b 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -156,9 +156,12 @@ __xfs_inobt_free_block( struct xfs_buf *bp, enum xfs_ag_resv_type resv) { + xfs_fsblock_t fsbno; + xfs_inobt_mod_blockcount(cur, -1); - return xfs_free_extent(cur->bc_tp, - XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)), 1, + fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); + return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, + XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, &XFS_RMAP_OINFO_INOBT, resv); } diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index f3b860970b26..749e837de98d 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -112,8 +112,9 @@ xfs_refcountbt_free_block( XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1); be32_add_cpu(&agf->agf_refcount_blocks, -1); xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS); - error = xfs_free_extent(cur->bc_tp, fsbno, 1, &XFS_RMAP_OINFO_REFC, - XFS_AG_RESV_METADATA); + error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, + XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); if (error) return error; diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 1b71174ec0d6..e12058a5f22e 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -598,7 +598,8 @@ xrep_reap_block( else if (resv == XFS_AG_RESV_AGFL) error = xrep_put_freelist(sc, agbno); else - error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv); + error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, oinfo, + resv); if (agf_bp != sc->sa.agf_bp) xfs_trans_brelse(sc->tp, agf_bp); if (error) diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 011b50469301..c1aae07467c9 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -350,6 +350,7 @@ xfs_trans_free_extent( struct xfs_owner_info oinfo = { }; struct xfs_mount *mp = tp->t_mountp; struct xfs_extent *extp; + struct xfs_perag *pag; uint next_extent; xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); @@ -366,9 +367,12 @@ xfs_trans_free_extent( trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, xefi->xefi_blockcount); - error = __xfs_free_extent(tp, xefi->xefi_startblock, - xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, + pag = xfs_perag_get(mp, agno); + error = __xfs_free_extent(tp, pag, agbno, xefi->xefi_blockcount, + &oinfo, XFS_AG_RESV_NONE, xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); + xfs_perag_put(pag); + /* * Mark the transaction dirty, even on error. This ensures the * transaction is aborted, which: -- cgit v1.2.3 From f6b384631e1e3482c24e35b53adbd3da50e47e8f Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:54 -0700 Subject: xfs: give xfs_extfree_intent its own perag reference Give the xfs_extfree_intent an passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space being freed must already be allocated, so we need to able to run even if the AG is being offlined or shrunk. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_alloc.c | 7 ++++-- fs/xfs/libxfs/xfs_alloc.h | 4 ++++ fs/xfs/xfs_extfree_item.c | 58 +++++++++++++++++++++++++++++++---------------- 3 files changed, 47 insertions(+), 22 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index ea9ac2ad9d36..d72483013b7d 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2405,6 +2405,7 @@ xfs_defer_agfl_block( trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1); + xfs_extent_free_get_group(mp, xefi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &xefi->xefi_list); } @@ -2421,8 +2422,8 @@ __xfs_free_extent_later( bool skip_discard) { struct xfs_extent_free_item *xefi; -#ifdef DEBUG struct xfs_mount *mp = tp->t_mountp; +#ifdef DEBUG xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -2456,9 +2457,11 @@ __xfs_free_extent_later( } else { xefi->xefi_owner = XFS_RMAP_OWN_NULL; } - trace_xfs_bmap_free_defer(tp->t_mountp, + trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0, XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len); + + xfs_extent_free_get_group(mp, xefi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); } diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index e12d86e3aeec..5569cb2ede0d 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -237,9 +237,13 @@ struct xfs_extent_free_item { uint64_t xefi_owner; xfs_fsblock_t xefi_startblock;/* starting fs block number */ xfs_extlen_t xefi_blockcount;/* number of blocks in extent */ + struct xfs_perag *xefi_pag; unsigned int xefi_flags; }; +void xfs_extent_free_get_group(struct xfs_mount *mp, + struct xfs_extent_free_item *xefi); + #define XFS_EFI_SKIP_DISCARD (1U << 0) /* don't issue discard */ #define XFS_EFI_ATTR_FORK (1U << 1) /* freeing attr fork block */ #define XFS_EFI_BMBT_BLOCK (1U << 2) /* freeing bmap btree block */ diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index c1aae07467c9..38b66fcfddc8 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -350,10 +350,7 @@ xfs_trans_free_extent( struct xfs_owner_info oinfo = { }; struct xfs_mount *mp = tp->t_mountp; struct xfs_extent *extp; - struct xfs_perag *pag; uint next_extent; - xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, - xefi->xefi_startblock); xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock); int error; @@ -364,14 +361,12 @@ xfs_trans_free_extent( if (xefi->xefi_flags & XFS_EFI_BMBT_BLOCK) oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK; - trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, - xefi->xefi_blockcount); + trace_xfs_bmap_free_deferred(tp->t_mountp, xefi->xefi_pag->pag_agno, 0, + agbno, xefi->xefi_blockcount); - pag = xfs_perag_get(mp, agno); - error = __xfs_free_extent(tp, pag, agbno, xefi->xefi_blockcount, - &oinfo, XFS_AG_RESV_NONE, + error = __xfs_free_extent(tp, xefi->xefi_pag, agbno, + xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); - xfs_perag_put(pag); /* * Mark the transaction dirty, even on error. 
This ensures the @@ -400,14 +395,13 @@ xfs_extent_free_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_extent_free_item *ra; struct xfs_extent_free_item *rb; ra = container_of(a, struct xfs_extent_free_item, xefi_list); rb = container_of(b, struct xfs_extent_free_item, xefi_list); - return XFS_FSB_TO_AGNO(mp, ra->xefi_startblock) - - XFS_FSB_TO_AGNO(mp, rb->xefi_startblock); + + return ra->xefi_pag->pag_agno - rb->xefi_pag->pag_agno; } /* Log a free extent to the intent item. */ @@ -466,6 +460,26 @@ xfs_extent_free_create_done( return &xfs_trans_get_efd(tp, EFI_ITEM(intent), count)->efd_item; } +/* Take a passive ref to the AG containing the space we're freeing. */ +void +xfs_extent_free_get_group( + struct xfs_mount *mp, + struct xfs_extent_free_item *xefi) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); + xefi->xefi_pag = xfs_perag_get(mp, agno); +} + +/* Release a passive AG ref after some freeing work. */ +static inline void +xfs_extent_free_put_group( + struct xfs_extent_free_item *xefi) +{ + xfs_perag_put(xefi->xefi_pag); +} + /* Process a free extent. */ STATIC int xfs_extent_free_finish_item( @@ -480,6 +494,8 @@ xfs_extent_free_finish_item( xefi = container_of(item, struct xfs_extent_free_item, xefi_list); error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); + + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); return error; } @@ -500,6 +516,8 @@ xfs_extent_free_cancel_item( struct xfs_extent_free_item *xefi; xefi = container_of(item, struct xfs_extent_free_item, xefi_list); + + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); } @@ -530,24 +548,21 @@ xfs_agfl_free_finish_item( struct xfs_extent *extp; struct xfs_buf *agbp; int error; - xfs_agnumber_t agno; xfs_agblock_t agbno; uint next_extent; - struct xfs_perag *pag; xefi = container_of(item, struct xfs_extent_free_item, xefi_list); ASSERT(xefi->xefi_blockcount == 1); - agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock); oinfo.oi_owner = xefi->xefi_owner; - trace_xfs_agfl_free_deferred(mp, agno, 0, agbno, xefi->xefi_blockcount); + trace_xfs_agfl_free_deferred(mp, xefi->xefi_pag->pag_agno, 0, agbno, + xefi->xefi_blockcount); - pag = xfs_perag_get(mp, agno); - error = xfs_alloc_read_agf(pag, tp, 0, &agbp); + error = xfs_alloc_read_agf(xefi->xefi_pag, tp, 0, &agbp); if (!error) - error = xfs_free_agfl_block(tp, agno, agbno, agbp, &oinfo); - xfs_perag_put(pag); + error = xfs_free_agfl_block(tp, xefi->xefi_pag->pag_agno, + agbno, agbp, &oinfo); /* * Mark the transaction dirty, even on error. This ensures the @@ -566,6 +581,7 @@ xfs_agfl_free_finish_item( extp->ext_len = xefi->xefi_blockcount; efdp->efd_next_extent++; + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); return error; } @@ -636,7 +652,9 @@ xfs_efi_item_recover( fake.xefi_startblock = extp->ext_start; fake.xefi_blockcount = extp->ext_len; + xfs_extent_free_get_group(mp, &fake); error = xfs_trans_free_extent(tp, efdp, &fake); + xfs_extent_free_put_group(&fake); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, extp, sizeof(*extp)); -- cgit v1.2.3 From c13418e8eb375872ad297aeec5fa26277febc155 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 18:59:54 -0700 Subject: xfs: give xfs_rmap_intent its own perag reference Give the xfs_rmap_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space we're (reverse) mapping is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 29 +++++++++++------------------ fs/xfs/libxfs/xfs_rmap.h | 4 ++++ fs/xfs/xfs_rmap_item.c | 32 +++++++++++++++++++++++++++++--- 3 files changed, 44 insertions(+), 21 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index df720041cd3d..c2624d11f041 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -2394,7 +2394,6 @@ xfs_rmap_finish_one( struct xfs_btree_cur **pcur) { struct xfs_mount *mp = tp->t_mountp; - struct xfs_perag *pag; struct xfs_btree_cur *rcur; struct xfs_buf *agbp = NULL; int error = 0; @@ -2402,26 +2401,22 @@ xfs_rmap_finish_one( xfs_agblock_t bno; bool unwritten; - pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock)); bno = XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock); - trace_xfs_rmap_deferred(mp, pag->pag_agno, ri->ri_type, bno, + trace_xfs_rmap_deferred(mp, ri->ri_pag->pag_agno, ri->ri_type, bno, ri->ri_owner, ri->ri_whichfork, ri->ri_bmap.br_startoff, ri->ri_bmap.br_blockcount, ri->ri_bmap.br_state); - if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) { - error = -EIO; - goto out_drop; - } - + if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) + return -EIO; /* * If we haven't gotten a cursor or the cursor AG doesn't match * the startblock, get one now. */ rcur = *pcur; - if (rcur != NULL && rcur->bc_ag.pag != pag) { + if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) { xfs_rmap_finish_one_cleanup(tp, rcur, 0); rcur = NULL; *pcur = NULL; @@ -2432,15 +2427,13 @@ xfs_rmap_finish_one( * rmapbt, because a shape change could cause us to * allocate blocks. 
*/ - error = xfs_free_extent_fix_freelist(tp, pag, &agbp); + error = xfs_free_extent_fix_freelist(tp, ri->ri_pag, &agbp); if (error) - goto out_drop; - if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) { - error = -EFSCORRUPTED; - goto out_drop; - } + return error; + if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) + return -EFSCORRUPTED; - rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag); + rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, ri->ri_pag); } *pcur = rcur; @@ -2480,8 +2473,7 @@ xfs_rmap_finish_one( ASSERT(0); error = -EFSCORRUPTED; } -out_drop: - xfs_perag_put(pag); + return error; } @@ -2526,6 +2518,7 @@ __xfs_rmap_add( ri->ri_whichfork = whichfork; ri->ri_bmap = *bmap; + xfs_rmap_update_get_group(tp->t_mountp, ri); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_RMAP, &ri->ri_list); } diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 2dac88cea28d..1472ae570a8a 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -162,8 +162,12 @@ struct xfs_rmap_intent { int ri_whichfork; uint64_t ri_owner; struct xfs_bmbt_irec ri_bmap; + struct xfs_perag *ri_pag; }; +void xfs_rmap_update_get_group(struct xfs_mount *mp, + struct xfs_rmap_intent *ri); + /* functions for updating the rmapbt based on bmbt map/unmap operations */ void xfs_rmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork, struct xfs_bmbt_irec *imap); diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index a1619d67015f..739ddbd04a17 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -20,6 +20,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_rui_cache; struct kmem_cache *xfs_rud_cache; @@ -320,14 +321,13 @@ xfs_rmap_update_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_rmap_intent *ra; struct xfs_rmap_intent *rb; ra = container_of(a, struct xfs_rmap_intent, ri_list); rb = container_of(b, struct xfs_rmap_intent, ri_list); - return XFS_FSB_TO_AGNO(mp, ra->ri_bmap.br_startblock) - - XFS_FSB_TO_AGNO(mp, rb->ri_bmap.br_startblock); + + return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno; } /* Log rmap updates in the intent item. */ @@ -390,6 +390,26 @@ xfs_rmap_update_create_done( return &xfs_trans_get_rud(tp, RUI_ITEM(intent))->rud_item; } +/* Take a passive ref to the AG containing the space we're rmapping. */ +void +xfs_rmap_update_get_group( + struct xfs_mount *mp, + struct xfs_rmap_intent *ri) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock); + ri->ri_pag = xfs_perag_get(mp, agno); +} + +/* Release a passive AG ref after finishing rmapping work. */ +static inline void +xfs_rmap_update_put_group( + struct xfs_rmap_intent *ri) +{ + xfs_perag_put(ri->ri_pag); +} + /* Process a deferred rmap update. */ STATIC int xfs_rmap_update_finish_item( @@ -405,6 +425,8 @@ xfs_rmap_update_finish_item( error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), ri, state); + + xfs_rmap_update_put_group(ri); kmem_cache_free(xfs_rmap_intent_cache, ri); return error; } @@ -425,6 +447,8 @@ xfs_rmap_update_cancel_item( struct xfs_rmap_intent *ri; ri = container_of(item, struct xfs_rmap_intent, ri_list); + + xfs_rmap_update_put_group(ri); kmem_cache_free(xfs_rmap_intent_cache, ri); } @@ -559,11 +583,13 @@ xfs_rui_item_recover( fake.ri_bmap.br_state = (map->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ? 
XFS_EXT_UNWRITTEN : XFS_EXT_NORM; + xfs_rmap_update_get_group(mp, &fake); error = xfs_trans_log_finish_rmap_update(tp, rudp, &fake, &rcur); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map, sizeof(*map)); + xfs_rmap_update_put_group(&fake); if (error) goto abort_error; -- cgit v1.2.3 From 00e7b3bac1dc8961bd5aa9d39e79131c6bd81181 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:55 -0700 Subject: xfs: give xfs_refcount_intent its own perag reference Give the xfs_refcount_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Any space being modified by a refcount intent is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_refcount.c | 33 ++++++++++++++------------------- fs/xfs/libxfs/xfs_refcount.h | 4 ++++ fs/xfs/xfs_refcount_item.c | 36 ++++++++++++++++++++++++++++++++---- 3 files changed, 50 insertions(+), 23 deletions(-) diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index bcf46aa0d08b..6dc968618e66 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -1332,26 +1332,22 @@ xfs_refcount_finish_one( xfs_agblock_t bno; unsigned long nr_ops = 0; int shape_changes = 0; - struct xfs_perag *pag; - pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock)); bno = XFS_FSB_TO_AGBNO(mp, ri->ri_startblock); trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock), ri->ri_type, XFS_FSB_TO_AGBNO(mp, ri->ri_startblock), ri->ri_blockcount); - if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) { - error = -EIO; - goto out_drop; - } + if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) + return -EIO; /* * If we haven't gotten a cursor or the cursor AG doesn't match * the startblock, get one now. 
*/ rcur = *pcur; - if (rcur != NULL && rcur->bc_ag.pag != pag) { + if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) { nr_ops = rcur->bc_ag.refc.nr_ops; shape_changes = rcur->bc_ag.refc.shape_changes; xfs_refcount_finish_one_cleanup(tp, rcur, 0); @@ -1359,12 +1355,12 @@ xfs_refcount_finish_one( *pcur = NULL; } if (rcur == NULL) { - error = xfs_alloc_read_agf(pag, tp, XFS_ALLOC_FLAG_FREEING, - &agbp); + error = xfs_alloc_read_agf(ri->ri_pag, tp, + XFS_ALLOC_FLAG_FREEING, &agbp); if (error) - goto out_drop; + return error; - rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, pag); + rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, ri->ri_pag); rcur->bc_ag.refc.nr_ops = nr_ops; rcur->bc_ag.refc.shape_changes = shape_changes; } @@ -1375,7 +1371,7 @@ xfs_refcount_finish_one( error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount, XFS_REFCOUNT_ADJUST_INCREASE); if (error) - goto out_drop; + return error; if (ri->ri_blockcount > 0) error = xfs_refcount_continue_op(rcur, ri, bno); break; @@ -1383,31 +1379,29 @@ xfs_refcount_finish_one( error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount, XFS_REFCOUNT_ADJUST_DECREASE); if (error) - goto out_drop; + return error; if (ri->ri_blockcount > 0) error = xfs_refcount_continue_op(rcur, ri, bno); break; case XFS_REFCOUNT_ALLOC_COW: error = __xfs_refcount_cow_alloc(rcur, bno, ri->ri_blockcount); if (error) - goto out_drop; + return error; ri->ri_blockcount = 0; break; case XFS_REFCOUNT_FREE_COW: error = __xfs_refcount_cow_free(rcur, bno, ri->ri_blockcount); if (error) - goto out_drop; + return error; ri->ri_blockcount = 0; break; default: ASSERT(0); - error = -EFSCORRUPTED; + return -EFSCORRUPTED; } if (!error && ri->ri_blockcount > 0) - trace_xfs_refcount_finish_one_leftover(mp, pag->pag_agno, + trace_xfs_refcount_finish_one_leftover(mp, ri->ri_pag->pag_agno, ri->ri_type, bno, ri->ri_blockcount); -out_drop: - xfs_perag_put(pag); return error; } @@ -1435,6 +1429,7 @@ __xfs_refcount_add( ri->ri_startblock = startblock; ri->ri_blockcount = blockcount; + xfs_refcount_update_get_group(tp->t_mountp, ri); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_REFCOUNT, &ri->ri_list); } diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index c633477ce3ce..c89f0fcd1ee3 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -50,6 +50,7 @@ enum xfs_refcount_intent_type { struct xfs_refcount_intent { struct list_head ri_list; + struct xfs_perag *ri_pag; enum xfs_refcount_intent_type ri_type; xfs_extlen_t ri_blockcount; xfs_fsblock_t ri_startblock; @@ -67,6 +68,9 @@ xfs_refcount_check_domain( return true; } +void xfs_refcount_update_get_group(struct xfs_mount *mp, + struct xfs_refcount_intent *ri); + void xfs_refcount_increase_extent(struct xfs_trans *tp, struct xfs_bmbt_irec *irec); void xfs_refcount_decrease_extent(struct xfs_trans *tp, diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 48d771a76add..7edee9590ed6 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -20,6 +20,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_cui_cache; struct kmem_cache *xfs_cud_cache; @@ -279,14 +280,13 @@ xfs_refcount_update_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_refcount_intent *ra; struct xfs_refcount_intent *rb; ra = container_of(a, struct xfs_refcount_intent, ri_list); rb = container_of(b, struct xfs_refcount_intent, ri_list); - return 
XFS_FSB_TO_AGNO(mp, ra->ri_startblock) - - XFS_FSB_TO_AGNO(mp, rb->ri_startblock); + + return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno; } /* Set the phys extent flags for this reverse mapping. */ @@ -365,6 +365,26 @@ xfs_refcount_update_create_done( return &xfs_trans_get_cud(tp, CUI_ITEM(intent))->cud_item; } +/* Take a passive ref to the AG containing the space we're refcounting. */ +void +xfs_refcount_update_get_group( + struct xfs_mount *mp, + struct xfs_refcount_intent *ri) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, ri->ri_startblock); + ri->ri_pag = xfs_perag_get(mp, agno); +} + +/* Release a passive AG ref after finishing refcounting work. */ +static inline void +xfs_refcount_update_put_group( + struct xfs_refcount_intent *ri) +{ + xfs_perag_put(ri->ri_pag); +} + /* Process a deferred refcount update. */ STATIC int xfs_refcount_update_finish_item( @@ -386,6 +406,8 @@ xfs_refcount_update_finish_item( ri->ri_type == XFS_REFCOUNT_DECREASE); return -EAGAIN; } + + xfs_refcount_update_put_group(ri); kmem_cache_free(xfs_refcount_intent_cache, ri); return error; } @@ -406,6 +428,8 @@ xfs_refcount_update_cancel_item( struct xfs_refcount_intent *ri; ri = container_of(item, struct xfs_refcount_intent, ri_list); + + xfs_refcount_update_put_group(ri); kmem_cache_free(xfs_refcount_intent_cache, ri); } @@ -520,9 +544,13 @@ xfs_cui_item_recover( fake.ri_startblock = pmap->pe_startblock; fake.ri_blockcount = pmap->pe_len; - if (!requeue_only) + + if (!requeue_only) { + xfs_refcount_update_get_group(mp, &fake); error = xfs_trans_log_finish_refcount_update(tp, cudp, &fake, &rcur); + xfs_refcount_update_put_group(&fake); + } if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, &cuip->cui_format, -- cgit v1.2.3 From 9b2e5a234c89f097ec36f922763dfa1465dc06f8 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:55 -0700 Subject: xfs: create traced helper to get extra perag references There are a few places in the XFS codebase where a caller has either an active or a passive reference to a perag structure and wants to give a passive reference to some other piece of code. Btree cursor creation and inode walks are good examples of this. Replace the open-coded logic with a helper to do this. The new function adds a few safeguards -- it checks that there's at least one reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug perag refcounting problems. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ag.c | 13 +++++++++++++ fs/xfs/libxfs/xfs_ag.h | 1 + fs/xfs/libxfs/xfs_alloc_btree.c | 4 +--- fs/xfs/libxfs/xfs_ialloc_btree.c | 4 +--- fs/xfs/libxfs/xfs_refcount_btree.c | 5 +---- fs/xfs/libxfs/xfs_rmap_btree.c | 5 +---- fs/xfs/xfs_iunlink_item.c | 4 +--- fs/xfs/xfs_iwalk.c | 5 ++--- fs/xfs/xfs_trace.h | 1 + 9 files changed, 22 insertions(+), 20 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index ae45f546ed86..2d8910046ed9 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -81,6 +81,19 @@ xfs_perag_get_tag( return pag; } +/* Get a passive reference to the given perag. 
*/ +struct xfs_perag * +xfs_perag_hold( + struct xfs_perag *pag) +{ + ASSERT(atomic_read(&pag->pag_ref) > 0 || + atomic_read(&pag->pag_active_ref) > 0); + + trace_xfs_perag_hold(pag, _RET_IP_); + atomic_inc(&pag->pag_ref); + return pag; +} + void xfs_perag_put( struct xfs_perag *pag) diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 5e18536dfdce..8092eaba977d 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -134,6 +134,7 @@ void xfs_free_perag(struct xfs_mount *mp); struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno); struct xfs_perag *xfs_perag_get_tag(struct xfs_mount *mp, xfs_agnumber_t agno, unsigned int tag); +struct xfs_perag *xfs_perag_hold(struct xfs_perag *pag); void xfs_perag_put(struct xfs_perag *pag); /* Active AG references */ diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index 0f29c7b1b39f..8e8416c14cec 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -492,9 +492,7 @@ xfs_allocbt_init_common( cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtb_2); } - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; + cur->bc_ag.pag = xfs_perag_hold(pag); if (xfs_has_crc(mp)) cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 1d2af50ac95b..ad6c521f05eb 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -450,9 +450,7 @@ xfs_inobt_init_common( if (xfs_has_crc(mp)) cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; + cur->bc_ag.pag = xfs_perag_hold(pag); return cur; } diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 749e837de98d..03d2b01487a1 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -340,10 +340,7 @@ xfs_refcountbt_init_common( cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; - + cur->bc_ag.pag = xfs_perag_hold(pag); cur->bc_ag.refc.nr_ops = 0; cur->bc_ag.refc.shape_changes = 0; cur->bc_ops = &xfs_refcountbt_ops; diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index d3285684bb5e..56d074b42660 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -460,10 +460,7 @@ xfs_rmapbt_init_common( cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2); cur->bc_ops = &xfs_rmapbt_ops; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; - + cur->bc_ag.pag = xfs_perag_hold(pag); return cur; } diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c index 43005ce8bd48..2ddccb172fa0 100644 --- a/fs/xfs/xfs_iunlink_item.c +++ b/fs/xfs/xfs_iunlink_item.c @@ -168,9 +168,7 @@ xfs_iunlink_log_inode( iup->ip = ip; iup->next_agino = next_agino; iup->old_agino = ip->i_next_unlinked; - - atomic_inc(&pag->pag_ref); - iup->pag = pag; + iup->pag = xfs_perag_hold(pag); xfs_trans_add_item(tp, &iup->item); tp->t_flags |= XFS_TRANS_DIRTY; diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c index 21be93bf006d..b3275e8d47b6 100644 --- a/fs/xfs/xfs_iwalk.c +++ b/fs/xfs/xfs_iwalk.c @@ -667,11 +667,10 @@ xfs_iwalk_threaded( iwag->mp = mp; /* - * perag is being handed off to async work, so take another + * perag is being handed off to async work, so take a passive * reference for the async work to 
release. */ - atomic_inc(&pag->pag_ref); - iwag->pag = pag; + iwag->pag = xfs_perag_hold(pag); iwag->iwalk_fn = iwalk_fn; iwag->data = data; iwag->startino = startino; diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 9c0006c55fec..db09bb771765 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -190,6 +190,7 @@ DEFINE_EVENT(xfs_perag_class, name, \ TP_ARGS(pag, caller_ip)) DEFINE_PERAG_REF_EVENT(xfs_perag_get); DEFINE_PERAG_REF_EVENT(xfs_perag_get_tag); +DEFINE_PERAG_REF_EVENT(xfs_perag_hold); DEFINE_PERAG_REF_EVENT(xfs_perag_put); DEFINE_PERAG_REF_EVENT(xfs_perag_grab); DEFINE_PERAG_REF_EVENT(xfs_perag_grab_tag); -- cgit v1.2.3 From 739a2fe0428f24c11fe652252c2f19ef7a697209 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:56 -0700 Subject: xfs: fix author and spdx headers on scrub/ files Fix the spdx tags to match current practice, and update the author contact information. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/agheader.c | 4 ++-- fs/xfs/scrub/agheader_repair.c | 4 ++-- fs/xfs/scrub/alloc.c | 4 ++-- fs/xfs/scrub/attr.c | 4 ++-- fs/xfs/scrub/attr.h | 2 +- fs/xfs/scrub/bitmap.c | 4 ++-- fs/xfs/scrub/bitmap.h | 4 ++-- fs/xfs/scrub/bmap.c | 4 ++-- fs/xfs/scrub/btree.c | 4 ++-- fs/xfs/scrub/btree.h | 4 ++-- fs/xfs/scrub/common.c | 4 ++-- fs/xfs/scrub/common.h | 4 ++-- fs/xfs/scrub/dabtree.c | 4 ++-- fs/xfs/scrub/dabtree.h | 4 ++-- fs/xfs/scrub/dir.c | 4 ++-- fs/xfs/scrub/fscounters.c | 2 +- fs/xfs/scrub/health.c | 4 ++-- fs/xfs/scrub/health.h | 4 ++-- fs/xfs/scrub/ialloc.c | 4 ++-- fs/xfs/scrub/inode.c | 4 ++-- fs/xfs/scrub/parent.c | 4 ++-- fs/xfs/scrub/quota.c | 4 ++-- fs/xfs/scrub/refcount.c | 4 ++-- fs/xfs/scrub/repair.c | 4 ++-- fs/xfs/scrub/repair.h | 4 ++-- fs/xfs/scrub/rmap.c | 4 ++-- fs/xfs/scrub/rtbitmap.c | 4 ++-- fs/xfs/scrub/scrub.c | 4 ++-- fs/xfs/scrub/scrub.h | 4 ++-- fs/xfs/scrub/symlink.c | 4 ++-- fs/xfs/scrub/trace.c | 4 ++-- fs/xfs/scrub/trace.h | 4 ++-- fs/xfs/scrub/xfs_scrub.h | 4 ++-- 33 files changed, 64 insertions(+), 64 deletions(-) diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index 4dd52b15f09c..ad8c592e11cf 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index c37e6d72760b..703e27cec327 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 3b38f4e2a537..16c4d57992b9 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. 
Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 31529b9bf389..e0ed2123b5e2 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 3590e10e3e62..f9680cb02a30 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -1,7 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_ATTR_H__ #define __XFS_SCRUB_ATTR_H__ diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index a255f09e9f0a..55b2af10aae9 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 900646b72de1..c5db165d397e 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_BITMAP_H__ #define __XFS_SCRUB_BITMAP_H__ diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index dbbc7037074c..d7a941936ceb 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 0fd36d5b4646..1a32b1fb75a2 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h index da61a53a0b61..1e3093d340c1 100644 --- a/fs/xfs/scrub/btree.h +++ b/fs/xfs/scrub/btree.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_BTREE_H__ #define __XFS_SCRUB_BTREE_H__ diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 848a8e32e56f..6eec71f92310 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. 
Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index b73648d81d23..5fc0faeef18b 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_COMMON_H__ #define __XFS_SCRUB_COMMON_H__ diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index d17cee177085..245971a7bb1d 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h index 1f3515c6d5a8..a7ac9bf16db9 100644 --- a/fs/xfs/scrub/dabtree.h +++ b/fs/xfs/scrub/dabtree.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_DABTREE_H__ #define __XFS_SCRUB_DABTREE_H__ diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index d1b0f23c2c59..24c6a967c67d 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index f0c7f41897b9..df57e0314b46 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0+ /* * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index aa65ec88a0c0..d416b8701c9a 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/health.h b/fs/xfs/scrub/health.h index d0b938d3d028..2ef83db88a72 100644 --- a/fs/xfs/scrub/health.h +++ b/fs/xfs/scrub/health.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_HEALTH_H__ #define __XFS_SCRUB_HEALTH_H__ diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index e312be7cd375..9d5a51b4af4b 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. 
Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 7a2f38e5202c..0d8d697ca265 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index d8dff3fd8053..c641b7d55a1d 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 9eeac8565394..a79e4c2cbd61 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index d9c1b3cea4a5..2db52a56c38e 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index e12058a5f22e..0b740f533959 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 840f74ec431c..b86cdfe506d8 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_REPAIR_H__ #define __XFS_SCRUB_REPAIR_H__ diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 229826b2e1c0..2b16c9192447 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index 0a3bde64c675..924a45778a0f 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. 
Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 07a7a75f987f..67dcc5efcbb1 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index b4d391b4c938..d72f2ccda091 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_SCRUB_H__ #define __XFS_SCRUB_SCRUB_H__ diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c index c1c99ffe7408..624f5e864c6f 100644 --- a/fs/xfs/scrub/symlink.c +++ b/fs/xfs/scrub/symlink.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index b5f94676c37c..315f872e1c91 100644 --- a/fs/xfs/scrub/trace.c +++ b/fs/xfs/scrub/trace.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 93ece6df02e3..9679ef7c3f01 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong * * NOTE: none of these tracepoints shall be considered a stable kernel ABI * as they can change at any time. See xfs_trace.h for documentation of diff --git a/fs/xfs/scrub/xfs_scrub.h b/fs/xfs/scrub/xfs_scrub.h index 2ceae614ade8..76c209c74fff 100644 --- a/fs/xfs/scrub/xfs_scrub.h +++ b/fs/xfs/scrub/xfs_scrub.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong + * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_H__ #define __XFS_SCRUB_H__ -- cgit v1.2.3 From ecc73f8a58c7844b04186726f8699ba97cec2ef9 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:57 -0700 Subject: xfs: update copyright years for scrub/ files Update the copyright years in the scrub/ source code files. This isn't required, but it's helpful to remind myself just how long it's taken to develop this feature. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/agheader.c | 2 +- fs/xfs/scrub/agheader_repair.c | 2 +- fs/xfs/scrub/alloc.c | 2 +- fs/xfs/scrub/attr.c | 2 +- fs/xfs/scrub/attr.h | 2 +- fs/xfs/scrub/bitmap.c | 2 +- fs/xfs/scrub/bitmap.h | 2 +- fs/xfs/scrub/bmap.c | 2 +- fs/xfs/scrub/btree.c | 2 +- fs/xfs/scrub/btree.h | 2 +- fs/xfs/scrub/common.c | 2 +- fs/xfs/scrub/common.h | 2 +- fs/xfs/scrub/dabtree.c | 2 +- fs/xfs/scrub/dabtree.h | 2 +- fs/xfs/scrub/dir.c | 2 +- fs/xfs/scrub/fscounters.c | 2 +- fs/xfs/scrub/health.c | 2 +- fs/xfs/scrub/health.h | 2 +- fs/xfs/scrub/ialloc.c | 2 +- fs/xfs/scrub/inode.c | 2 +- fs/xfs/scrub/parent.c | 2 +- fs/xfs/scrub/quota.c | 2 +- fs/xfs/scrub/refcount.c | 2 +- fs/xfs/scrub/repair.c | 2 +- fs/xfs/scrub/repair.h | 2 +- fs/xfs/scrub/rmap.c | 2 +- fs/xfs/scrub/rtbitmap.c | 2 +- fs/xfs/scrub/scrub.c | 2 +- fs/xfs/scrub/scrub.h | 2 +- fs/xfs/scrub/symlink.c | 2 +- fs/xfs/scrub/trace.c | 2 +- fs/xfs/scrub/trace.h | 2 +- fs/xfs/scrub/xfs_scrub.h | 2 +- 33 files changed, 33 insertions(+), 33 deletions(-) diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index ad8c592e11cf..c91819da1f5f 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 703e27cec327..edfb1dfb80a9 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 16c4d57992b9..39e79b9536bc 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index e0ed2123b5e2..5573be3a3dfe 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index f9680cb02a30..bc6321552251 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ /* - * Copyright (C) 2019 Oracle. All Rights Reserved. + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_ATTR_H__ diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index 55b2af10aae9..ce8b17d76c0b 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. * Author: Darrick J. 
Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index c5db165d397e..85ec0e2792c5 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_BITMAP_H__ diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index d7a941936ceb..f6d8cb938a02 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 1a32b1fb75a2..e54c1cfe64bf 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h index 1e3093d340c1..70461885c6c7 100644 --- a/fs/xfs/scrub/btree.h +++ b/fs/xfs/scrub/btree.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_BTREE_H__ diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 6eec71f92310..597e6aca8628 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 5fc0faeef18b..273a4331da05 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_COMMON_H__ diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index 245971a7bb1d..c392c0765e5c 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h index a7ac9bf16db9..4f8c2138a1ec 100644 --- a/fs/xfs/scrub/dabtree.h +++ b/fs/xfs/scrub/dabtree.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_DABTREE_H__ diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 24c6a967c67d..b6081a3e1b91 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. 
Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index df57e0314b46..a38006c71bff 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0+ /* - * Copyright (C) 2019 Oracle. All Rights Reserved. + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index d416b8701c9a..66e99b0f6049 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2019 Oracle. All Rights Reserved. + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/health.h b/fs/xfs/scrub/health.h index 2ef83db88a72..66a273f8585b 100644 --- a/fs/xfs/scrub/health.h +++ b/fs/xfs/scrub/health.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2019 Oracle. All Rights Reserved. + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_HEALTH_H__ diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 9d5a51b4af4b..b14270bd1c62 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 0d8d697ca265..dc66a1465f1b 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index c641b7d55a1d..d1db18250ee3 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index a79e4c2cbd61..b019c70c065a 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 2db52a56c38e..a5005b1d010d 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 0b740f533959..b800341aae69 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. * Author: Darrick J. 
Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index b86cdfe506d8..4fbb52228c48 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_REPAIR_H__ diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 2b16c9192447..4dc79e1a675d 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index 924a45778a0f..e7dace7b4be8 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 67dcc5efcbb1..e8e2bee001e5 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index d72f2ccda091..72a5a8a64a87 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #ifndef __XFS_SCRUB_SCRUB_H__ diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c index 624f5e864c6f..38708fb9a5d7 100644 --- a/fs/xfs/scrub/symlink.c +++ b/fs/xfs/scrub/symlink.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index 315f872e1c91..0a975439d2b6 100644 --- a/fs/xfs/scrub/trace.c +++ b/fs/xfs/scrub/trace.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong */ #include "xfs.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 9679ef7c3f01..81f7c3051a1a 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. Wong * * NOTE: none of these tracepoints shall be considered a stable kernel ABI diff --git a/fs/xfs/scrub/xfs_scrub.h b/fs/xfs/scrub/xfs_scrub.h index 76c209c74fff..a39befa743ce 100644 --- a/fs/xfs/scrub/xfs_scrub.h +++ b/fs/xfs/scrub/xfs_scrub.h @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. * Author: Darrick J. 
Wong */ #ifndef __XFS_SCRUB_H__ -- cgit v1.2.3 From 901489030441f30970f507b071a2f7ac3b194499 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:58 -0700 Subject: xfs: add a tracepoint to report incorrect extent refcounts Add a new tracepoint so that I can see exactly what and where we failed the refcount check. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/refcount.c | 5 ++++- fs/xfs/scrub/trace.h | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index a5005b1d010d..aaad13b1871f 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -13,6 +13,7 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" +#include "scrub/trace.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_ag.h" @@ -300,8 +301,10 @@ xchk_refcountbt_xref_rmap( goto out_free; xchk_refcountbt_process_rmap_fragments(&refchk); - if (irec->rc_refcount != refchk.seen) + if (irec->rc_refcount != refchk.seen) { + trace_xchk_refcount_incorrect(sc->sa.pag, irec, refchk.seen); xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + } out_free: list_for_each_entry_safe(frag, n, &refchk.fragments, list) { diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 81f7c3051a1a..ad25ae88fce1 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -30,6 +30,9 @@ TRACE_DEFINE_ENUM(XFS_BTNUM_FINOi); TRACE_DEFINE_ENUM(XFS_BTNUM_RMAPi); TRACE_DEFINE_ENUM(XFS_BTNUM_REFCi); +TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED); +TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW); + TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PROBE); TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_SB); TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_AGF); @@ -657,6 +660,38 @@ TRACE_EVENT(xchk_fscounters_within_range, __entry->old_value) ) +TRACE_EVENT(xchk_refcount_incorrect, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *irec, + xfs_nlink_t seen), + TP_ARGS(pag, irec, seen), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) + __field(xfs_agblock_t, startblock) + __field(xfs_extlen_t, blockcount) + __field(xfs_nlink_t, refcount) + __field(xfs_nlink_t, seen) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = irec->rc_domain; + __entry->startblock = irec->rc_startblock; + __entry->blockcount = irec->rc_blockcount; + __entry->refcount = irec->rc_refcount; + __entry->seen = seen; + ), + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u seen %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), + __entry->startblock, + __entry->blockcount, + __entry->refcount, + __entry->seen) +) + /* repair tracepoints */ #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) -- cgit v1.2.3 From d5c88131dbf01a30a222ad82d58e0c21a15f0d8e Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:58 -0700 Subject: xfs: allow queued AG intents to drain before scrubbing When a writer thread executes a chain of log intent items, the AG header buffer locks will cycle during a transaction roll to get from one intent item to the next in a chain. 
Although scrub takes all AG header buffer locks, this isn't sufficient to guard against scrub checking an AG while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding allocation groups. When there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic. Fix this by adding a count of active intents to the perag structure and making scrub wait until it holds both AG header buffer locks and the intent counter has reached zero. One quirk of the drain code is that deferred bmap updates also bump and drop the intent counter. A fundamental decision made during the design phase of the reverse mapping feature is that updates to the rmapbt records are always made by the same code that updates the primary metadata. In other words, callers of bmapi functions expect that the bmapi functions will queue deferred rmap updates. Some parts of the reflink code queue deferred refcount (CUI) and bmap (BUI) updates in the same head transaction, but the deferred work manager completely finishes the CUI before the BUI work is started. As a result, the CUI drops the intent count long before the deferred rmap (RUI) update even has a chance to bump the intent count. The only way to keep the intent count elevated between the CUI and RUI is for the BUI to bump the counter and hold it until the RUI has been created. A second quirk of the intent drain code is that deferred work items must increment the intent counter as soon as the work item is added to the transaction. When a BUI completes and queues an RUI, the RUI must increment the counter before the BUI decrements it. The only way to accomplish this is to require that the counter be bumped as soon as the deferred work item is created in memory. In the next patches we'll improve on this facility, but this patch provides the basic functionality. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/Kconfig | 4 ++ fs/xfs/Makefile | 2 + fs/xfs/libxfs/xfs_ag.c | 4 ++ fs/xfs/libxfs/xfs_ag.h | 8 +++ fs/xfs/libxfs/xfs_defer.c | 6 +- fs/xfs/scrub/common.c | 111 +++++++++++++++++++++++++++++++---- fs/xfs/scrub/health.c | 2 + fs/xfs/scrub/refcount.c | 2 + fs/xfs/xfs_bmap_item.c | 12 +++- fs/xfs/xfs_drain.c | 140 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_drain.h | 84 +++++++++++++++++++++++++++ fs/xfs/xfs_extfree_item.c | 4 +- fs/xfs/xfs_linux.h | 1 + fs/xfs/xfs_refcount_item.c | 4 +- fs/xfs/xfs_rmap_item.c | 4 +- fs/xfs/xfs_trace.h | 71 +++++++++++++++++++++++ 16 files changed, 438 insertions(+), 21 deletions(-) create mode 100644 fs/xfs/xfs_drain.c create mode 100644 fs/xfs/xfs_drain.h diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 9fac5ea8d0e4..ab24e683b440 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -93,10 +93,14 @@ config XFS_RT If unsure, say N. +config XFS_DRAIN_INTENTS + bool + config XFS_ONLINE_SCRUB bool "XFS online metadata check support" default n depends on XFS_FS + select XFS_DRAIN_INTENTS help If you say Y here you will be able to check metadata on a mounted XFS filesystem. This feature is intended to reduce
This feature is intended to reduce diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 92d88dc3c9f7..3bdbc838c4d1 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -136,6 +136,8 @@ ifeq ($(CONFIG_MEMORY_FAILURE),y) xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o endif +xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o + # online scrub/repair ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index 2d8910046ed9..1b078bbbf225 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -260,6 +260,7 @@ xfs_free_perag( spin_unlock(&mp->m_perag_lock); ASSERT(pag); XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0); + xfs_defer_drain_free(&pag->pag_intents_drain); cancel_delayed_work_sync(&pag->pag_blockgc_work); xfs_buf_hash_destroy(pag); @@ -385,6 +386,7 @@ xfs_initialize_perag( spin_lock_init(&pag->pag_state_lock); INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker); INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); + xfs_defer_drain_init(&pag->pag_intents_drain); init_waitqueue_head(&pag->pagb_wait); init_waitqueue_head(&pag->pag_active_wq); pag->pagb_count = 0; @@ -421,6 +423,7 @@ xfs_initialize_perag( return 0; out_remove_pag: + xfs_defer_drain_free(&pag->pag_intents_drain); radix_tree_delete(&mp->m_perag_tree, index); out_free_pag: kmem_free(pag); @@ -431,6 +434,7 @@ out_unwind_new_pags: if (!pag) break; xfs_buf_hash_destroy(pag); + xfs_defer_drain_free(&pag->pag_intents_drain); kmem_free(pag); } return error; diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 8092eaba977d..2e0aef87d633 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -101,6 +101,14 @@ struct xfs_perag { /* background prealloc block trimming */ struct delayed_work pag_blockgc_work; + /* + * We use xfs_drain to track the number of deferred log intent items + * that have been queued (but not yet processed) so that waiters (e.g. + * scrub) will not lock resources when other threads are in the middle + * of processing a chain of intent items only to find momentary + * inconsistencies. 
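+ * + * For example, a deferred bmap (BUI) work item may queue deferred rmap + * (RUI) work items when it is finished; the counter stays elevated from + * the creation of the BUI until the last RUI completes, so a waiter never + * observes the inconsistent in-between state.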
+ */ + struct xfs_defer_drain pag_intents_drain; #endif /* __KERNEL__ */ }; diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 5a321b783398..bcfb6a4203cd 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -397,6 +397,7 @@ xfs_defer_cancel_list( list_for_each_safe(pwi, n, &dfp->dfp_work) { list_del(pwi); dfp->dfp_count--; + trace_xfs_defer_cancel_item(mp, dfp, pwi); ops->cancel_item(pwi); } ASSERT(dfp->dfp_count == 0); @@ -476,6 +477,7 @@ xfs_defer_finish_one( list_for_each_safe(li, n, &dfp->dfp_work) { list_del(li); dfp->dfp_count--; + trace_xfs_defer_finish_item(tp->t_mountp, dfp, li); error = ops->finish_item(tp, dfp->dfp_done, li, &state); if (error == -EAGAIN) { int ret; @@ -623,7 +625,7 @@ xfs_defer_add( struct list_head *li) { struct xfs_defer_pending *dfp = NULL; - const struct xfs_defer_op_type *ops; + const struct xfs_defer_op_type *ops = defer_op_types[type]; ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); @@ -636,7 +638,6 @@ xfs_defer_add( if (!list_empty(&tp->t_dfops)) { dfp = list_last_entry(&tp->t_dfops, struct xfs_defer_pending, dfp_list); - ops = defer_op_types[dfp->dfp_type]; if (dfp->dfp_type != type || (ops->max_items && dfp->dfp_count >= ops->max_items)) dfp = NULL; @@ -653,6 +654,7 @@ xfs_defer_add( } list_add_tail(li, &dfp->dfp_work); + trace_xfs_defer_add_item(tp->t_mountp, dfp, li); dfp->dfp_count++; } diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 597e6aca8628..2a496d1699a3 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -396,26 +396,19 @@ want_ag_read_header_failure( } /* - * Grab the perag structure and all the headers for an AG. + * Grab the AG header buffers for the attached perag structure. * * The headers should be released by xchk_ag_free, but as a fail safe we attach * all the buffers we grab to the scrub transaction so they'll all be freed - * when we cancel it. Returns ENOENT if we can't grab the perag structure. + * when we cancel it. */ -int -xchk_ag_read_headers( +static inline int +xchk_perag_read_headers( struct xfs_scrub *sc, - xfs_agnumber_t agno, struct xchk_ag *sa) { - struct xfs_mount *mp = sc->mp; int error; - ASSERT(!sa->pag); - sa->pag = xfs_perag_get(mp, agno); - if (!sa->pag) - return -ENOENT; - error = xfs_ialloc_read_agi(sa->pag, sc->tp, &sa->agi_bp); if (error && want_ag_read_header_failure(sc, XFS_SCRUB_TYPE_AGI)) return error; @@ -427,6 +420,102 @@ xchk_ag_read_headers( return 0; } +/* + * Grab the AG headers for the attached perag structure and wait for pending + * intents to drain. + */ +static int +xchk_perag_drain_and_lock( + struct xfs_scrub *sc) +{ + struct xchk_ag *sa = &sc->sa; + int error = 0; + + ASSERT(sa->pag != NULL); + ASSERT(sa->agi_bp == NULL); + ASSERT(sa->agf_bp == NULL); + + do { + if (xchk_should_terminate(sc, &error)) + return error; + + error = xchk_perag_read_headers(sc, sa); + if (error) + return error; + + /* + * If we've grabbed an inode for scrubbing then we assume that + * holding its ILOCK will suffice to coordinate with any intent + * chains involving this inode. + */ + if (sc->ip) + return 0; + + /* + * Decide if this AG is quiet enough for all metadata to be + * consistent with each other. XFS allows the AG header buffer + * locks to cycle across transaction rolls while processing + * chains of deferred ops, which means that there could be + * other threads in the middle of processing a chain of + * deferred ops. 
For regular operations we are careful about + * ordering operations to prevent collisions between threads + * (which is why we don't need a per-AG lock), but scrub and + * repair have to serialize against chained operations. + * + * We just locked all the AG header buffers; now take a look + * to see if there are any intents in progress. If there are, + * drop the AG headers and wait for the intents to drain. + * Since we hold all the AG header locks for the duration of + * the scrub, this is the only time we have to sample the + * intents counter; any threads increasing it after this point + * can't possibly be in the middle of a chain of AG metadata + * updates. + * + * Obviously, this should be slanted against scrub and in favor + * of runtime threads. + */ + if (!xfs_perag_intent_busy(sa->pag)) + return 0; + + if (sa->agf_bp) { + xfs_trans_brelse(sc->tp, sa->agf_bp); + sa->agf_bp = NULL; + } + + if (sa->agi_bp) { + xfs_trans_brelse(sc->tp, sa->agi_bp); + sa->agi_bp = NULL; + } + + error = xfs_perag_intent_drain(sa->pag); + if (error == -ERESTARTSYS) + error = -EINTR; + } while (!error); + + return error; +} + +/* + * Grab the per-AG structure, grab all AG header buffers, and wait until there + * aren't any pending intents. Returns -ENOENT if we can't grab the perag + * structure. + */ +int +xchk_ag_read_headers( + struct xfs_scrub *sc, + xfs_agnumber_t agno, + struct xchk_ag *sa) +{ + struct xfs_mount *mp = sc->mp; + + ASSERT(!sa->pag); + sa->pag = xfs_perag_get(mp, agno); + if (!sa->pag) + return -ENOENT; + + return xchk_perag_drain_and_lock(sc); +} + /* Release all the AG btree cursors. */ void xchk_ag_btcur_free( diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index 66e99b0f6049..d2b2a1cb6533 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -7,6 +7,8 @@ #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" #include "xfs_btree.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index aaad13b1871f..756066f3dea2 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -7,6 +7,8 @@ #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 8f0f33d07d2c..7551c3ec4ea5 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -373,7 +373,15 @@ xfs_bmap_update_get_group( xfs_agnumber_t agno; agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock); - bi->bi_pag = xfs_perag_get(mp, agno); + + /* + * Bump the intent count on behalf of the deferred rmap and refcount + * intent items that we can queue when we finish this bmap work. + * This new intent item will bump the intent count before the bmap + * intent drops the intent count, ensuring that the intent count + * remains nonzero across the transaction roll. + */ + bi->bi_pag = xfs_perag_intent_get(mp, agno); } /* Release a passive AG ref after finishing mapping work. */ static inline void xfs_bmap_update_put_group( struct xfs_bmap_intent *bi) { - xfs_perag_put(bi->bi_pag); + xfs_perag_intent_put(bi->bi_pag); } /* Process a deferred rmap update.
*/ diff --git a/fs/xfs/xfs_drain.c b/fs/xfs/xfs_drain.c new file mode 100644 index 000000000000..b431abdf0af1 --- /dev/null +++ b/fs/xfs/xfs_drain.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_ag.h" +#include "xfs_trace.h" + +void +xfs_defer_drain_init( + struct xfs_defer_drain *dr) +{ + atomic_set(&dr->dr_count, 0); + init_waitqueue_head(&dr->dr_waiters); +} + +void +xfs_defer_drain_free(struct xfs_defer_drain *dr) +{ + ASSERT(atomic_read(&dr->dr_count) == 0); +} + +/* Increase the pending intent count. */ +static inline void xfs_defer_drain_grab(struct xfs_defer_drain *dr) +{ + atomic_inc(&dr->dr_count); +} + +static inline bool has_waiters(struct wait_queue_head *wq_head) +{ + /* + * This memory barrier is paired with the one in set_current_state on + * the waiting side. + */ + smp_mb__after_atomic(); + return waitqueue_active(wq_head); +} + +/* Decrease the pending intent count, and wake any waiters, if appropriate. */ +static inline void xfs_defer_drain_rele(struct xfs_defer_drain *dr) +{ + if (atomic_dec_and_test(&dr->dr_count) && + has_waiters(&dr->dr_waiters)) + wake_up(&dr->dr_waiters); +} + +/* Are there intents pending? */ +static inline bool xfs_defer_drain_busy(struct xfs_defer_drain *dr) +{ + return atomic_read(&dr->dr_count) > 0; +} + +/* + * Wait for the pending intent count for a drain to hit zero. + * + * Callers must not hold any locks that would prevent intents from being + * finished. + */ +static inline int xfs_defer_drain_wait(struct xfs_defer_drain *dr) +{ + return wait_event_killable(dr->dr_waiters, !xfs_defer_drain_busy(dr)); +} + +/* + * Get a passive reference to an AG and declare an intent to update its + * metadata. + */ +struct xfs_perag * +xfs_perag_intent_get( + struct xfs_mount *mp, + xfs_agnumber_t agno) +{ + struct xfs_perag *pag; + + pag = xfs_perag_get(mp, agno); + if (!pag) + return NULL; + + xfs_perag_intent_hold(pag); + return pag; +} + +/* + * Release our intent to update this AG's metadata, and then release our + * passive ref to the AG. + */ +void +xfs_perag_intent_put( + struct xfs_perag *pag) +{ + xfs_perag_intent_rele(pag); + xfs_perag_put(pag); +} + +/* + * Declare an intent to update AG metadata. Other threads that need exclusive + * access can decide to back off if they see declared intentions. + */ +void +xfs_perag_intent_hold( + struct xfs_perag *pag) +{ + trace_xfs_perag_intent_hold(pag, __return_address); + xfs_defer_drain_grab(&pag->pag_intents_drain); +} + +/* Release our intent to update this AG's metadata. */ +void +xfs_perag_intent_rele( + struct xfs_perag *pag) +{ + trace_xfs_perag_intent_rele(pag, __return_address); + xfs_defer_drain_rele(&pag->pag_intents_drain); +} + +/* + * Wait for the intent update count for this AG to hit zero. + * Callers must not hold any AG header buffers. + */ +int +xfs_perag_intent_drain( + struct xfs_perag *pag) +{ + trace_xfs_perag_wait_intents(pag, __return_address); + return xfs_defer_drain_wait(&pag->pag_intents_drain); +} + +/* Has anyone declared an intent to update this AG? 
*/ +bool +xfs_perag_intent_busy( + struct xfs_perag *pag) +{ + return xfs_defer_drain_busy(&pag->pag_intents_drain); +} diff --git a/fs/xfs/xfs_drain.h b/fs/xfs/xfs_drain.h new file mode 100644 index 000000000000..9b16df3cc7dc --- /dev/null +++ b/fs/xfs/xfs_drain.h @@ -0,0 +1,84 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong + */ +#ifndef XFS_DRAIN_H_ +#define XFS_DRAIN_H_ + +struct xfs_perag; + +#ifdef CONFIG_XFS_DRAIN_INTENTS +/* + * Passive drain mechanism. This data structure tracks a count of some items + * and contains a waitqueue for callers who would like to wake up when the + * count hits zero. + */ +struct xfs_defer_drain { + /* Number of items pending in some part of the filesystem. */ + atomic_t dr_count; + + /* Queue to wait for dr_count to go to zero */ + struct wait_queue_head dr_waiters; +}; + +void xfs_defer_drain_init(struct xfs_defer_drain *dr); +void xfs_defer_drain_free(struct xfs_defer_drain *dr); + +/* + * Deferred Work Intent Drains + * =========================== + * + * When a writer thread executes a chain of log intent items, the AG header + * buffer locks will cycle during a transaction roll to get from one intent + * item to the next in a chain. Although scrub takes all AG header buffer + * locks, this isn't sufficient to guard against scrub checking an AG while + * that writer thread is in the middle of finishing a chain because there's no + * higher level locking primitive guarding allocation groups. + * + * When there's a collision, cross-referencing between data structures (e.g. + * rmapbt and refcountbt) yields false corruption events; if repair is running, + * this results in incorrect repairs, which is catastrophic. + * + * The solution is to add to the perag structure a count of active intents and + * make scrub wait until it holds both AG header buffer locks and the intent + * counter has reached zero. It is therefore critical that deferred work threads hold the + * AGI or AGF buffers when decrementing the intent counter. + * + * Given a list of deferred work items, the deferred work manager will complete + * a work item and all the sub-items that the parent item creates before moving + * on to the next work item in the list. This is also true for all levels of + * sub-items. Writer threads are permitted to queue multiple work items + * targeting the same AG, so a deferred work item (such as a BUI) that creates + * sub-items (such as RUIs) must bump the intent counter and maintain it until + * the sub-items can themselves bump the intent counter. + * + * Therefore, the intent count tracks entire lifetimes of deferred work items. + * All functions that create work items must increment the intent counter as + * soon as the item is added to the transaction and cannot drop the counter + * until the item is finished or cancelled.
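+ * + * An illustrative timeline for a BUI whose completion queues an RUI: the + * BUI bumps the counter 0 -> 1 as soon as it is created; when the BUI is + * finished, it creates the RUI, which bumps the counter 1 -> 2 before the + * BUI drops it 2 -> 1; when the RUI is finished, the counter drops 1 -> 0 + * and any drain waiters are woken.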
+ */ +struct xfs_perag *xfs_perag_intent_get(struct xfs_mount *mp, + xfs_agnumber_t agno); +void xfs_perag_intent_put(struct xfs_perag *pag); + +void xfs_perag_intent_hold(struct xfs_perag *pag); +void xfs_perag_intent_rele(struct xfs_perag *pag); + +int xfs_perag_intent_drain(struct xfs_perag *pag); +bool xfs_perag_intent_busy(struct xfs_perag *pag); +#else +struct xfs_defer_drain { /* empty */ }; + +#define xfs_defer_drain_free(dr) ((void)0) +#define xfs_defer_drain_init(dr) ((void)0) + +#define xfs_perag_intent_get(mp, agno) xfs_perag_get((mp), (agno)) +#define xfs_perag_intent_put(pag) xfs_perag_put(pag) + +static inline void xfs_perag_intent_hold(struct xfs_perag *pag) { } +static inline void xfs_perag_intent_rele(struct xfs_perag *pag) { } + +#endif /* CONFIG_XFS_DRAIN_INTENTS */ + +#endif /* XFS_DRAIN_H_ */ diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 38b66fcfddc8..f9e36b810663 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -469,7 +469,7 @@ xfs_extent_free_get_group( xfs_agnumber_t agno; agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); - xefi->xefi_pag = xfs_perag_get(mp, agno); + xefi->xefi_pag = xfs_perag_intent_get(mp, agno); } /* Release a passive AG ref after some freeing work. */ @@ -477,7 +477,7 @@ static inline void xfs_extent_free_put_group( struct xfs_extent_free_item *xefi) { - xfs_perag_put(xefi->xefi_pag); + xfs_perag_intent_put(xefi->xefi_pag); } /* Process a free extent. */ diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index e88f18f85e4b..74dcb05069e8 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -80,6 +80,7 @@ typedef __u32 xfs_nlink_t; #include "xfs_cksum.h" #include "xfs_buf.h" #include "xfs_message.h" +#include "xfs_drain.h" #ifdef __BIG_ENDIAN #define XFS_NATIVE_HOST 1 diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 7edee9590ed6..edd8587658d5 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -374,7 +374,7 @@ xfs_refcount_update_get_group( xfs_agnumber_t agno; agno = XFS_FSB_TO_AGNO(mp, ri->ri_startblock); - ri->ri_pag = xfs_perag_get(mp, agno); + ri->ri_pag = xfs_perag_intent_get(mp, agno); } /* Release a passive AG ref after finishing refcounting work. */ @@ -382,7 +382,7 @@ static inline void xfs_refcount_update_put_group( struct xfs_refcount_intent *ri) { - xfs_perag_put(ri->ri_pag); + xfs_perag_intent_put(ri->ri_pag); } /* Process a deferred refcount update. */ diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index 739ddbd04a17..520c7ebdfed8 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -399,7 +399,7 @@ xfs_rmap_update_get_group( xfs_agnumber_t agno; agno = XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock); - ri->ri_pag = xfs_perag_get(mp, agno); + ri->ri_pag = xfs_perag_intent_get(mp, agno); } /* Release a passive AG ref after finishing rmapping work. */ @@ -407,7 +407,7 @@ static inline void xfs_rmap_update_put_group( struct xfs_rmap_intent *ri) { - xfs_perag_put(ri->ri_pag); + xfs_perag_intent_put(ri->ri_pag); } /* Process a deferred rmap update. 
*/ diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index db09bb771765..cd4ca5b1fcb0 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -2687,6 +2687,44 @@ DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred); DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_defer); DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_deferred); +DECLARE_EVENT_CLASS(xfs_defer_pending_item_class, + TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp, + void *item), + TP_ARGS(mp, dfp, item), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(int, type) + __field(void *, intent) + __field(void *, item) + __field(char, committed) + __field(int, nr) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->type = dfp->dfp_type; + __entry->intent = dfp->dfp_intent; + __entry->item = item; + __entry->committed = dfp->dfp_done != NULL; + __entry->nr = dfp->dfp_count; + ), + TP_printk("dev %d:%d optype %d intent %p item %p committed %d nr %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->type, + __entry->intent, + __entry->item, + __entry->committed, + __entry->nr) +) +#define DEFINE_DEFER_PENDING_ITEM_EVENT(name) \ +DEFINE_EVENT(xfs_defer_pending_item_class, name, \ + TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp, \ + void *item), \ + TP_ARGS(mp, dfp, item)) + +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_add_item); +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_cancel_item); +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_finish_item); + /* rmap tracepoints */ DECLARE_EVENT_CLASS(xfs_rmap_class, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, @@ -4326,6 +4364,39 @@ TRACE_EVENT(xfs_force_shutdown, __entry->line_num) ); +#ifdef CONFIG_XFS_DRAIN_INTENTS +DECLARE_EVENT_CLASS(xfs_perag_intents_class, + TP_PROTO(struct xfs_perag *pag, void *caller_ip), + TP_ARGS(pag, caller_ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(long, nr_intents) + __field(void *, caller_ip) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->nr_intents = atomic_read(&pag->pag_intents_drain.dr_count); + __entry->caller_ip = caller_ip; + ), + TP_printk("dev %d:%d agno 0x%x intents %ld caller %pS", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->nr_intents, + __entry->caller_ip) +); + +#define DEFINE_PERAG_INTENTS_EVENT(name) \ +DEFINE_EVENT(xfs_perag_intents_class, name, \ + TP_PROTO(struct xfs_perag *pag, void *caller_ip), \ + TP_ARGS(pag, caller_ip)) +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_intent_hold); +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_intent_rele); +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_wait_intents); + +#endif /* CONFIG_XFS_DRAIN_INTENTS */ + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH -- cgit v1.2.3 From 3f64c718d06eae168208faaadb522007e0048e7b Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:59 -0700 Subject: xfs: clean up scrub context if scrub setup returns -EDEADLOCK It has been a longstanding convention that online scrub and repair functions can return -EDEADLOCK to signal that they weren't able to obtain some necessary resource. When this happens, the scrub framework is supposed to release all resources attached to the scrub context, set the TRY_HARDER flag in the scrub context flags, and try again. In this context, individual scrub functions are supposed to take all the resources they (incorrectly) speculated were not necessary. 
We're about to make it so that the functions that lock and wait for a filesystem AG can also return EDEADLOCK to signal that we need to try again with the drain waiters enabled. Therefore, refactor xfs_scrub_metadata to support this behavior for ->setup() functions. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/scrub.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index e8e2bee001e5..9364fe7d07b4 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -491,23 +491,16 @@ retry_op: /* Set up for the operation. */ error = sc->ops->setup(sc); + if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) + goto try_harder; if (error) goto out_teardown; /* Scrub for errors. */ error = sc->ops->scrub(sc); - if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) { - /* - * Scrubbers return -EDEADLOCK to mean 'try harder'. - * Tear down everything we hold, then set up again with - * preparation for worst-case scenarios. - */ - error = xchk_teardown(sc, 0); - if (error) - goto out_sc; - sc->flags |= XCHK_TRY_HARDER; - goto retry_op; - } else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)) + if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) + goto try_harder; + if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)) goto out_teardown; xchk_update_health(sc); @@ -565,4 +558,15 @@ out: error = 0; } return error; +try_harder: + /* + * Scrubbers return -EDEADLOCK to mean 'try harder'. Tear down + * everything we hold, then set up again with preparation for + * worst-case scenarios. + */ + error = xchk_teardown(sc, 0); + if (error) + goto out_sc; + sc->flags |= XCHK_TRY_HARDER; + goto retry_op; } -- cgit v1.2.3 From 466c525d6d35e69115852c004f405f0711b8f91a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 18:59:59 -0700 Subject: xfs: minimize overhead of drain wakeups by using jump labels To reduce the runtime overhead even further when online fsck isn't running, use a static branch key to decide if we call wake_up on the drain. For compilers that support jump labels, the call to wake_up is replaced by a nop sled when nobody is waiting for intents to drain. From my initial microbenchmarking, every transition of the static key between the on and off states takes about 22000ns to complete; this is paid entirely by the xfs_scrub process. When the static key is off (which it should be when fsck isn't running), the nop sled adds an overhead of approximately 0.36ns to runtime code. The post-atomic lockless waiter check adds about 0.03ns, which is basically free. For the few compilers that don't support jump labels, runtime code pays the cost of calling wake_up on an empty waitqueue, which was observed to be about 30ns. However, most architectures that have sufficient memory and CPU capacity to run XFS also support jump labels, so this is not much of a worry. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/Kconfig | 1 + fs/xfs/scrub/agheader.c | 9 +++++++++ fs/xfs/scrub/alloc.c | 3 +++ fs/xfs/scrub/bmap.c | 3 +++ fs/xfs/scrub/common.c | 24 ++++++++++++++++++++++++ fs/xfs/scrub/common.h | 15 +++++++++++++++ fs/xfs/scrub/fscounters.c | 7 +++++++ fs/xfs/scrub/ialloc.c | 2 ++ fs/xfs/scrub/inode.c | 3 +++ fs/xfs/scrub/quota.c | 3 +++ fs/xfs/scrub/refcount.c | 2 ++ fs/xfs/scrub/rmap.c | 3 +++ fs/xfs/scrub/scrub.c | 25 +++++++++++++++++++++---- fs/xfs/scrub/scrub.h | 11 ++++++++++- fs/xfs/scrub/trace.h | 33 +++++++++++++++++++++++++++++++++ fs/xfs/xfs_drain.c | 26 ++++++++++++++++++++++++++ fs/xfs/xfs_drain.h | 3 +++ 17 files changed, 168 insertions(+), 5 deletions(-) diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index ab24e683b440..05bc865142b8 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -95,6 +95,7 @@ config XFS_RT config XFS_DRAIN_INTENTS bool + select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL config XFS_ONLINE_SCRUB bool "XFS online metadata check support" diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index c91819da1f5f..87cb13a6e84a 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -18,6 +18,15 @@ #include "scrub/scrub.h" #include "scrub/common.h" +int +xchk_setup_agheader( + struct xfs_scrub *sc) +{ + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_fs(sc); +} + /* Superblock */ /* Cross-reference with the other btrees. */ diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 39e79b9536bc..de313df2b15b 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -24,6 +24,9 @@ int xchk_setup_ag_allocbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_ag_btree(sc, false); } diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index f6d8cb938a02..a5078d63808f 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -31,6 +31,9 @@ xchk_setup_inode_bmap( { int error; + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + error = xchk_get_inode(sc); if (error) goto out; diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 2a496d1699a3..87649facbbde 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -487,6 +487,8 @@ xchk_perag_drain_and_lock( sa->agi_bp = NULL; } + if (!(sc->flags & XCHK_FSGATES_DRAIN)) + return -EDEADLOCK; error = xfs_perag_intent_drain(sa->pag); if (error == -ERESTARTSYS) error = -EINTR; @@ -1005,3 +1007,25 @@ xchk_start_reaping( } sc->flags &= ~XCHK_REAPING_DISABLED; } + +/* + * Enable filesystem hooks (i.e. runtime code patching) before starting a scrub + * operation. Callers must not hold any locks that intersect with the CPU + * hotplug lock (e.g. writeback locks) because code patching must halt the CPUs + * to change kernel code. 
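+ * + * Note that static key transitions are expensive -- on the order of tens + * of microseconds each -- so the gates are enabled once during scrub setup + * and are not torn down until the scrub completes.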
+ */ +void +xchk_fsgates_enable( + struct xfs_scrub *sc, + unsigned int scrub_fsgates) +{ + ASSERT(!(scrub_fsgates & ~XCHK_FSGATES_ALL)); + ASSERT(!(sc->flags & scrub_fsgates)); + + trace_xchk_fsgates_enable(sc, scrub_fsgates); + + if (scrub_fsgates & XCHK_FSGATES_DRAIN) + xfs_drain_wait_enable(); + + sc->flags |= scrub_fsgates; +} diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 273a4331da05..4714e8a43094 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -72,6 +72,7 @@ bool xchk_should_check_xref(struct xfs_scrub *sc, int *error, struct xfs_btree_cur **curpp); /* Setup functions */ +int xchk_setup_agheader(struct xfs_scrub *sc); int xchk_setup_fs(struct xfs_scrub *sc); int xchk_setup_ag_allocbt(struct xfs_scrub *sc); int xchk_setup_ag_iallocbt(struct xfs_scrub *sc); @@ -151,4 +152,18 @@ int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode); void xchk_stop_reaping(struct xfs_scrub *sc); void xchk_start_reaping(struct xfs_scrub *sc); +/* + * Setting up a hook to wait for intents to drain is costly -- we have to take + * the CPU hotplug lock and force an i-cache flush on all CPUs once to set it + * up, and again to tear it down. These costs add up quickly, so we only want + * to enable the drain waiter if the drain actually detected a conflict with + * running intent chains. + */ +static inline bool xchk_need_intent_drain(struct xfs_scrub *sc) +{ + return sc->flags & XCHK_TRY_HARDER; +} + +void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks); + #endif /* __XFS_SCRUB_COMMON_H__ */ diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index a38006c71bff..faa315be7978 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -130,6 +130,13 @@ xchk_setup_fscounters( struct xchk_fscounters *fsc; int error; + /* + * If the AGF doesn't track btreeblks, we have to lock the AGF to count + * btree block usage by walking the actual btrees. + */ + if (!xfs_has_lazysbcount(sc->mp)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + sc->buf = kzalloc(sizeof(struct xchk_fscounters), XCHK_GFP_FLAGS); if (!sc->buf) return -ENOMEM; diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index b14270bd1c62..9563769a8881 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -32,6 +32,8 @@ int xchk_setup_ag_iallocbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); return xchk_setup_ag_btree(sc, sc->flags & XCHK_TRY_HARDER); } diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index dc66a1465f1b..bbf9432c02c2 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -32,6 +32,9 @@ xchk_setup_inode( { int error; + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + /* * Try to get the inode. If the verifiers fail, we try again * in raw mode. 
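Every setup function in this patch opts in the same way. Condensed into one hypothetical function (xchk_setup_example is a stand-in name, not part of the series), the shape is:

/* Hypothetical setup function showing the opt-in shape used above. */
int
xchk_setup_example(
	struct xfs_scrub	*sc)
{
	/* Only pay the code-patching cost if a prior pass hit a conflict. */
	if (xchk_need_intent_drain(sc))
		xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN);

	return xchk_setup_fs(sc);
}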
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index b019c70c065a..e6caa358cbda 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -53,6 +53,9 @@ xchk_setup_quota( if (!xfs_this_quota_on(sc->mp, dqtype)) return -ENOENT; + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + error = xchk_setup_fs(sc); if (error) return error; diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 756066f3dea2..6f649cc01310 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -27,6 +27,8 @@ int xchk_setup_ag_refcountbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); return xchk_setup_ag_btree(sc, false); } diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 4dc79e1a675d..c6e47ef4c79b 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -24,6 +24,9 @@ int xchk_setup_ag_rmapbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_ag_btree(sc, false); } diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 9364fe7d07b4..bd5d4357cd64 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -145,6 +145,21 @@ xchk_probe( /* Scrub setup and teardown */ +static inline void +xchk_fsgates_disable( + struct xfs_scrub *sc) +{ + if (!(sc->flags & XCHK_FSGATES_ALL)) + return; + + trace_xchk_fsgates_disable(sc, sc->flags & XCHK_FSGATES_ALL); + + if (sc->flags & XCHK_FSGATES_DRAIN) + xfs_drain_wait_disable(); + + sc->flags &= ~XCHK_FSGATES_ALL; +} + /* Free all the resources and finish the transactions. */ STATIC int xchk_teardown( @@ -177,6 +192,8 @@ xchk_teardown( kvfree(sc->buf); sc->buf = NULL; } + + xchk_fsgates_disable(sc); return error; } @@ -191,25 +208,25 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { }, [XFS_SCRUB_TYPE_SB] = { /* superblock */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_superblock, .repair = xrep_superblock, }, [XFS_SCRUB_TYPE_AGF] = { /* agf */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agf, .repair = xrep_agf, }, [XFS_SCRUB_TYPE_AGFL]= { /* agfl */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agfl, .repair = xrep_agfl, }, [XFS_SCRUB_TYPE_AGI] = { /* agi */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agi, .repair = xrep_agi, }, diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 72a5a8a64a87..4fdb6017f820 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -96,9 +96,18 @@ struct xfs_scrub { /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */ #define XCHK_TRY_HARDER (1 << 0) /* can't get resources, try again */ -#define XCHK_REAPING_DISABLED (1 << 2) /* background block reaping paused */ +#define XCHK_REAPING_DISABLED (1 << 1) /* background block reaping paused */ +#define XCHK_FSGATES_DRAIN (1 << 2) /* defer ops draining enabled */ #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ +/* + * The XCHK_FSGATES* flags reflect functionality in the main filesystem that + * are only enabled for this particular online fsck. When not in use, the + * features are gated off via dynamic code patching, which is why the state + * must be enabled during scrub setup and can only be torn down afterwards. 
+ */ +#define XCHK_FSGATES_ALL (XCHK_FSGATES_DRAIN) + /* Metadata scrubbers */ int xchk_tester(struct xfs_scrub *sc); int xchk_superblock(struct xfs_scrub *sc); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index ad25ae88fce1..304c55192c90 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -96,6 +96,12 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS); { XFS_SCRUB_OFLAG_WARNING, "warning" }, \ { XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED, "norepair" } +#define XFS_SCRUB_STATE_STRINGS \ + { XCHK_TRY_HARDER, "try_harder" }, \ + { XCHK_REAPING_DISABLED, "reaping_disabled" }, \ + { XCHK_FSGATES_DRAIN, "fsgates_drain" }, \ + { XREP_ALREADY_FIXED, "already_fixed" } + DECLARE_EVENT_CLASS(xchk_class, TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm, int error), @@ -142,6 +148,33 @@ DEFINE_SCRUB_EVENT(xchk_deadlock_retry); DEFINE_SCRUB_EVENT(xrep_attempt); DEFINE_SCRUB_EVENT(xrep_done); +DECLARE_EVENT_CLASS(xchk_fsgate_class, + TP_PROTO(struct xfs_scrub *sc, unsigned int fsgate_flags), + TP_ARGS(sc, fsgate_flags), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(unsigned int, type) + __field(unsigned int, fsgate_flags) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->type = sc->sm->sm_type; + __entry->fsgate_flags = fsgate_flags; + ), + TP_printk("dev %d:%d type %s fsgates '%s'", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_symbolic(__entry->type, XFS_SCRUB_TYPE_STRINGS), + __print_flags(__entry->fsgate_flags, "|", XFS_SCRUB_STATE_STRINGS)) +) + +#define DEFINE_SCRUB_FSHOOK_EVENT(name) \ +DEFINE_EVENT(xchk_fsgate_class, name, \ + TP_PROTO(struct xfs_scrub *sc, unsigned int fsgates_flags), \ + TP_ARGS(sc, fsgates_flags)) + +DEFINE_SCRUB_FSHOOK_EVENT(xchk_fsgates_enable); +DEFINE_SCRUB_FSHOOK_EVENT(xchk_fsgates_disable); + TRACE_EVENT(xchk_op_error, TP_PROTO(struct xfs_scrub *sc, xfs_agnumber_t agno, xfs_agblock_t bno, int error, void *ret_ip), diff --git a/fs/xfs/xfs_drain.c b/fs/xfs/xfs_drain.c index b431abdf0af1..005a66be44a2 100644 --- a/fs/xfs/xfs_drain.c +++ b/fs/xfs/xfs_drain.c @@ -12,6 +12,31 @@ #include "xfs_ag.h" #include "xfs_trace.h" +/* + * Use a static key here to reduce the overhead of xfs_drain_rele. If the + * compiler supports jump labels, the static branch will be replaced by a nop + * sled when there are no xfs_drain_wait callers. Online fsck is currently + * the only caller, so this is a reasonable tradeoff. + * + * Note: Patching the kernel code requires taking the cpu hotplug lock. Other + * parts of the kernel allocate memory with that lock held, which means that + * XFS callers cannot hold any locks that might be used by memory reclaim or + * writeback when calling the static_branch_{inc,dec} functions. 
+ */ +static DEFINE_STATIC_KEY_FALSE(xfs_drain_waiter_gate); + +void +xfs_drain_wait_disable(void) +{ + static_branch_dec(&xfs_drain_waiter_gate); +} + +void +xfs_drain_wait_enable(void) +{ + static_branch_inc(&xfs_drain_waiter_gate); +} + void xfs_defer_drain_init( struct xfs_defer_drain *dr) @@ -46,6 +71,7 @@ static inline bool has_waiters(struct wait_queue_head *wq_head) static inline void xfs_defer_drain_rele(struct xfs_defer_drain *dr) { if (atomic_dec_and_test(&dr->dr_count) && + static_branch_unlikely(&xfs_drain_waiter_gate) && has_waiters(&dr->dr_waiters)) wake_up(&dr->dr_waiters); } diff --git a/fs/xfs/xfs_drain.h b/fs/xfs/xfs_drain.h index 9b16df3cc7dc..50a5772a8296 100644 --- a/fs/xfs/xfs_drain.h +++ b/fs/xfs/xfs_drain.h @@ -25,6 +25,9 @@ struct xfs_defer_drain { void xfs_defer_drain_init(struct xfs_defer_drain *dr); void xfs_defer_drain_free(struct xfs_defer_drain *dr); +void xfs_drain_wait_disable(void); +void xfs_drain_wait_enable(void); + /* * Deferred Work Intent Drains * =========================== -- cgit v1.2.3 From 88accf17226733088923635b580779a3c86b6f23 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:00 -0700 Subject: xfs: scrub should use ECHRNG to signal that the drain is needed In the previous patch, we added jump labels to the intent drain code so that regular filesystem operations need not pay the price of checking for someone (scrub) waiting on intents to drain from some part of the filesystem when that someone isn't running. However, I observed that xfs/285 now spends a lot more time pushing the AIL from the inode btree scrubber than it used to. This is because the inobt scrubber will try push the AIL to try to get logged inode cores written to the filesystem when it sees a weird discrepancy between the ondisk inode and the inobt records. This AIL push is triggered when the setup function sees TRY_HARDER is set; and the requisite EDEADLOCK return is initiated when the discrepancy is seen. The solution to this performance slow down is to use a different result code (ECHRNG) for scrub code to signal that it needs to wait for deferred intent work items to drain out of some part of the filesystem. When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that setup functions will activate the jump label. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/btree.c | 1 + fs/xfs/scrub/common.c | 4 +++- fs/xfs/scrub/common.h | 2 +- fs/xfs/scrub/dabtree.c | 1 + fs/xfs/scrub/repair.c | 3 +++ fs/xfs/scrub/scrub.c | 10 ++++++++++ fs/xfs/scrub/scrub.h | 1 + fs/xfs/scrub/trace.h | 1 + 8 files changed, 21 insertions(+), 2 deletions(-) diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index e54c1cfe64bf..626282dbe2e3 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -36,6 +36,7 @@ __xchk_btree_process_error( switch (*error) { case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 87649facbbde..dcfe66044d4a 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -75,6 +75,7 @@ __xchk_process_error( case 0: return true; case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry( sc->ip ? sc->ip : XFS_I(file_inode(sc->file)), @@ -130,6 +131,7 @@ __xchk_fblock_process_error( case 0: return true; case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. 
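Condensed into a sketch, the retry policy that this patch creates looks roughly like the following. example_setup_and_scrub() is a hypothetical stand-in for the ->setup/->scrub calls; the real code in xfs_scrub_metadata() also calls xchk_teardown() before each retry.

/* Rough sketch of the two-level retry policy (hypothetical helper). */
static int
example_scrub_retry_policy(
	struct xfs_scrub	*sc)
{
	int			error;

retry:
	error = example_setup_and_scrub(sc);
	if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN)) {
		/* Cheap retry: only enable the intent drain gate. */
		sc->flags |= XCHK_NEED_DRAIN;
		goto retry;
	}
	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) {
		/* Expensive retry: grab all resources up front. */
		sc->flags |= XCHK_TRY_HARDER;
		goto retry;
	}
	return error;
}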
*/ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; @@ -488,7 +490,7 @@ xchk_perag_drain_and_lock( } if (!(sc->flags & XCHK_FSGATES_DRAIN)) - return -EDEADLOCK; + return -ECHRNG; error = xfs_perag_intent_drain(sa->pag); if (error == -ERESTARTSYS) error = -EINTR; diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 4714e8a43094..83b1a392930a 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -161,7 +161,7 @@ void xchk_start_reaping(struct xfs_scrub *sc); */ static inline bool xchk_need_intent_drain(struct xfs_scrub *sc) { - return sc->flags & XCHK_TRY_HARDER; + return sc->flags & XCHK_NEED_DRAIN; } void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks); diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index c392c0765e5c..82b150d3b8b7 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -39,6 +39,7 @@ xchk_da_process_error( switch (*error) { case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index b800341aae69..ab0758308f57 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -60,6 +60,9 @@ xrep_attempt( sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; sc->flags |= XREP_ALREADY_FIXED; return -EAGAIN; + case -ECHRNG: + sc->flags |= XCHK_NEED_DRAIN; + return -EAGAIN; case -EDEADLOCK: /* Tell the caller to try again having grabbed all the locks. */ if (!(sc->flags & XCHK_TRY_HARDER)) { diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index bd5d4357cd64..787a9096ddef 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -510,6 +510,8 @@ retry_op: error = sc->ops->setup(sc); if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) goto try_harder; + if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN)) + goto need_drain; if (error) goto out_teardown; @@ -517,6 +519,8 @@ retry_op: error = sc->ops->scrub(sc); if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) goto try_harder; + if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN)) + goto need_drain; if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)) goto out_teardown; @@ -575,6 +579,12 @@ out: error = 0; } return error; +need_drain: + error = xchk_teardown(sc, 0); + if (error) + goto out_sc; + sc->flags |= XCHK_NEED_DRAIN; + goto retry_op; try_harder: /* * Scrubbers return -EDEADLOCK to mean 'try harder'. 
Tear down diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 4fdb6017f820..d85c3b883b4c 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -98,6 +98,7 @@ struct xfs_scrub { #define XCHK_TRY_HARDER (1 << 0) /* can't get resources, try again */ #define XCHK_REAPING_DISABLED (1 << 1) /* background block reaping paused */ #define XCHK_FSGATES_DRAIN (1 << 2) /* defer ops draining enabled */ +#define XCHK_NEED_DRAIN (1 << 3) /* scrub needs to drain defer ops */ #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ /* diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 304c55192c90..68efd6fda61c 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -100,6 +100,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS); { XCHK_TRY_HARDER, "try_harder" }, \ { XCHK_REAPING_DISABLED, "reaping_disabled" }, \ { XCHK_FSGATES_DRAIN, "fsgates_drain" }, \ + { XCHK_NEED_DRAIN, "need_drain" }, \ { XREP_ALREADY_FIXED, "already_fixed" } DECLARE_EVENT_CLASS(xchk_class, -- cgit v1.2.3 From 35e3b9a11740b53387e7af151768c13700f80696 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:01 -0700 Subject: xfs: standardize ondisk to incore conversion for free space btrees Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to an incore record, and a xfs_alloc_check_irec function to detect corruption. Replace all the open-coded logic with calls to the new helpers and bubble up corruption reports. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_alloc.c | 56 ++++++++++++++++++++++++++++++++++++----------- fs/xfs/libxfs/xfs_alloc.h | 6 +++++ fs/xfs/scrub/alloc.c | 24 ++++++++++---------- 3 files changed, 61 insertions(+), 25 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index d72483013b7d..89c935cbcc4d 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -233,6 +233,34 @@ xfs_alloc_update( return xfs_btree_update(cur, &rec); } +/* Convert the ondisk btree record to its incore representation. */ +void +xfs_alloc_btrec_to_irec( + const union xfs_btree_rec *rec, + struct xfs_alloc_rec_incore *irec) +{ + irec->ar_startblock = be32_to_cpu(rec->alloc.ar_startblock); + irec->ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount); +} + +/* Simple checks for free space records. */ +xfs_failaddr_t +xfs_alloc_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *irec) +{ + struct xfs_perag *pag = cur->bc_ag.pag; + + if (irec->ar_blockcount == 0) + return __this_address; + + /* check for valid extent range, including overflow */ + if (!xfs_verify_agbext(pag, irec->ar_startblock, irec->ar_blockcount)) + return __this_address; + + return NULL; +} + /* * Get the data from the pointed-to record. 
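A few invented record values show what the new helper rejects. Assume an AG whose usable space is 100 blocks; note that xfs_verify_agbext() also rejects extents overlapping the static metadata at the start of the AG, which these examples avoid.

/* Hypothetical inputs to xfs_alloc_check_irec() in a 100-block AG: */
irec = (struct xfs_alloc_rec_incore){ .ar_startblock = 95, .ar_blockcount = 0 };
					/* rejected: zero-length extent */
irec = (struct xfs_alloc_rec_incore){ .ar_startblock = 95, .ar_blockcount = 10 };
					/* rejected: runs past the end of the AG */
irec = (struct xfs_alloc_rec_incore){ .ar_startblock = 95, .ar_blockcount = 5 };
					/* accepted: blocks 95-99 fit */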
*/ @@ -243,34 +271,34 @@ xfs_alloc_get_rec( xfs_extlen_t *len, /* output: length of extent */ int *stat) /* output: success/failure */ { + struct xfs_alloc_rec_incore irec; struct xfs_mount *mp = cur->bc_mp; struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; + xfs_failaddr_t fa; int error; error = xfs_btree_get_rec(cur, &rec, stat); if (error || !(*stat)) return error; - *bno = be32_to_cpu(rec->alloc.ar_startblock); - *len = be32_to_cpu(rec->alloc.ar_blockcount); - - if (*len == 0) - goto out_bad_rec; - - /* check for valid extent range, including overflow */ - if (!xfs_verify_agbext(pag, *bno, *len)) + xfs_alloc_btrec_to_irec(rec, &irec); + fa = xfs_alloc_check_irec(cur, &irec); + if (fa) goto out_bad_rec; + *bno = irec.ar_startblock; + *len = irec.ar_blockcount; return 0; out_bad_rec: xfs_warn(mp, - "%s Freespace BTree record corruption in AG %d detected!", + "%s Freespace BTree record corruption in AG %d detected at %pS!", cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size", - pag->pag_agno); + pag->pag_agno, fa); xfs_warn(mp, - "start block 0x%x block count 0x%x", *bno, *len); + "start block 0x%x block count 0x%x", irec.ar_startblock, + irec.ar_blockcount); return -EFSCORRUPTED; } @@ -3665,8 +3693,10 @@ xfs_alloc_query_range_helper( struct xfs_alloc_query_range_info *query = priv; struct xfs_alloc_rec_incore irec; - irec.ar_startblock = be32_to_cpu(rec->alloc.ar_startblock); - irec.ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount); + xfs_alloc_btrec_to_irec(rec, &irec); + if (xfs_alloc_check_irec(cur, &irec) != NULL) + return -EFSCORRUPTED; + return query->fn(cur, &irec, query->priv); } diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 5569cb2ede0d..56bd05900b35 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -181,6 +181,12 @@ xfs_alloc_get_rec( xfs_extlen_t *len, /* output: length of extent */ int *stat); /* output: success/failure */ +union xfs_btree_rec; +void xfs_alloc_btrec_to_irec(const union xfs_btree_rec *rec, + struct xfs_alloc_rec_incore *irec); +xfs_failaddr_t xfs_alloc_check_irec(struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *irec); + int xfs_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags, struct xfs_buf **agfbpp); int xfs_alloc_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags, diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index de313df2b15b..53de04c6027c 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -78,9 +78,11 @@ xchk_allocbt_xref_other( STATIC void xchk_allocbt_xref( struct xfs_scrub *sc, - xfs_agblock_t agbno, - xfs_extlen_t len) + const struct xfs_alloc_rec_incore *irec) { + xfs_agblock_t agbno = irec->ar_startblock; + xfs_extlen_t len = irec->ar_blockcount; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) return; @@ -93,20 +95,18 @@ xchk_allocbt_xref( /* Scrub a bnobt/cntbt record. 
*/ STATIC int xchk_allocbt_rec( - struct xchk_btree *bs, - const union xfs_btree_rec *rec) + struct xchk_btree *bs, + const union xfs_btree_rec *rec) { - struct xfs_perag *pag = bs->cur->bc_ag.pag; - xfs_agblock_t bno; - xfs_extlen_t len; + struct xfs_alloc_rec_incore irec; - bno = be32_to_cpu(rec->alloc.ar_startblock); - len = be32_to_cpu(rec->alloc.ar_blockcount); - - if (!xfs_verify_agbext(pag, bno, len)) + xfs_alloc_btrec_to_irec(rec, &irec); + if (xfs_alloc_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } - xchk_allocbt_xref(bs->sc, bno, len); + xchk_allocbt_xref(bs->sc, &irec); return 0; } -- cgit v1.2.3 From 366a0b8d49c3a7edcb5331f254af195716ba4bdf Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:01 -0700 Subject: xfs: standardize ondisk to incore conversion for inode btrees Create a xfs_inobt_check_irec function to detect corruption in btree records. Fix all xfs_inobt_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ialloc.c | 53 ++++++++++++++++++++++++++-------------- fs/xfs/libxfs/xfs_ialloc.h | 2 ++ fs/xfs/libxfs/xfs_ialloc_btree.c | 2 +- fs/xfs/libxfs/xfs_ialloc_btree.h | 2 +- fs/xfs/scrub/ialloc.c | 24 +++--------------- 5 files changed, 43 insertions(+), 40 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 7ee292aecbeb..32af8326ad76 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -95,6 +95,33 @@ xfs_inobt_btrec_to_irec( irec->ir_free = be64_to_cpu(rec->inobt.ir_free); } +/* Simple checks for inode records. */ +xfs_failaddr_t +xfs_inobt_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_inobt_rec_incore *irec) +{ + uint64_t realfree; + + if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino)) + return __this_address; + if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT || + irec->ir_count > XFS_INODES_PER_CHUNK) + return __this_address; + if (irec->ir_freecount > XFS_INODES_PER_CHUNK) + return __this_address; + + /* if there are no holes, return the first available offset */ + if (!xfs_inobt_issparse(irec->ir_holemask)) + realfree = irec->ir_free; + else + realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); + if (hweight64(realfree) != irec->ir_freecount) + return __this_address; + + return NULL; +} + /* * Get the data from the pointed-to record. */ @@ -106,38 +133,25 @@ xfs_inobt_get_rec( { struct xfs_mount *mp = cur->bc_mp; union xfs_btree_rec *rec; + xfs_failaddr_t fa; int error; - uint64_t realfree; error = xfs_btree_get_rec(cur, &rec, stat); if (error || *stat == 0) return error; xfs_inobt_btrec_to_irec(mp, rec, irec); - - if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino)) - goto out_bad_rec; - if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT || - irec->ir_count > XFS_INODES_PER_CHUNK) - goto out_bad_rec; - if (irec->ir_freecount > XFS_INODES_PER_CHUNK) - goto out_bad_rec; - - /* if there are no holes, return the first available offset */ - if (!xfs_inobt_issparse(irec->ir_holemask)) - realfree = irec->ir_free; - else - realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); - if (hweight64(realfree) != irec->ir_freecount) + fa = xfs_inobt_check_irec(cur, irec); + if (fa) goto out_bad_rec; return 0; out_bad_rec: xfs_warn(mp, - "%s Inode BTree record corruption in AG %d detected!", + "%s Inode BTree record corruption in AG %d detected at %pS!", cur->bc_btnum == XFS_BTNUM_INO ? 
"Used" : "Free", - cur->bc_ag.pag->pag_agno); + cur->bc_ag.pag->pag_agno, fa); xfs_warn(mp, "start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x", irec->ir_startino, irec->ir_count, irec->ir_freecount, @@ -2690,6 +2704,9 @@ xfs_ialloc_count_inodes_rec( struct xfs_ialloc_count_inodes *ci = priv; xfs_inobt_btrec_to_irec(cur->bc_mp, rec, &irec); + if (xfs_inobt_check_irec(cur, &irec) != NULL) + return -EFSCORRUPTED; + ci->count += irec.ir_count; ci->freecount += irec.ir_freecount; diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h index ab8c30b4ec22..90b0e5079338 100644 --- a/fs/xfs/libxfs/xfs_ialloc.h +++ b/fs/xfs/libxfs/xfs_ialloc.h @@ -93,6 +93,8 @@ union xfs_btree_rec; void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, const union xfs_btree_rec *rec, struct xfs_inobt_rec_incore *irec); +xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur, + const struct xfs_inobt_rec_incore *irec); int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, bool *exists); int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low, diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index ad6c521f05eb..f900c056b82c 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -608,7 +608,7 @@ xfs_iallocbt_maxlevels_ondisk(void) */ uint64_t xfs_inobt_irec_to_allocmask( - struct xfs_inobt_rec_incore *rec) + const struct xfs_inobt_rec_incore *rec) { uint64_t bitmap = 0; uint64_t inodespbit; diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h index e859a6e05230..3262c3fe5ebe 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.h +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h @@ -53,7 +53,7 @@ struct xfs_btree_cur *xfs_inobt_stage_cursor(struct xfs_perag *pag, extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int); /* ir_holemask to inode allocation bitmap conversion */ -uint64_t xfs_inobt_irec_to_allocmask(struct xfs_inobt_rec_incore *); +uint64_t xfs_inobt_irec_to_allocmask(const struct xfs_inobt_rec_incore *irec); #if defined(DEBUG) || defined(XFS_WARN) int xfs_inobt_rec_check_count(struct xfs_mount *, diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 9563769a8881..11afb4c5a161 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -119,15 +119,6 @@ xchk_iallocbt_chunk( return true; } -/* Count the number of free inodes. */ -static unsigned int -xchk_iallocbt_freecount( - xfs_inofree_t freemask) -{ - BUILD_BUG_ON(sizeof(freemask) != sizeof(__u64)); - return hweight64(freemask); -} - /* * Check that an inode's allocation status matches ir_free in the inobt * record. First we try querying the in-core inode state, and if the inode @@ -431,24 +422,17 @@ xchk_iallocbt_rec( int holecount; int i; int error = 0; - unsigned int real_freecount; uint16_t holemask; xfs_inobt_btrec_to_irec(mp, rec, &irec); - - if (irec.ir_count > XFS_INODES_PER_CHUNK || - irec.ir_freecount > XFS_INODES_PER_CHUNK) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - real_freecount = irec.ir_freecount + - (XFS_INODES_PER_CHUNK - irec.ir_count); - if (real_freecount != xchk_iallocbt_freecount(irec.ir_free)) + if (xfs_inobt_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } agino = irec.ir_startino; /* Record has to be properly aligned within the AG. 
*/ - if (!xfs_verify_agino(pag, agino) || - !xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) { + if (!xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); goto out; } -- cgit v1.2.3 From 2b30cc0bf0589d1ea0506c019b9b81de77535c87 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:02 -0700 Subject: xfs: standardize ondisk to incore conversion for refcount btrees Create a xfs_refcount_check_irec function to detect corruption in btree records. Fix all xfs_refcount_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_refcount.c | 45 ++++++++++++++++++++++++++++++-------------- fs/xfs/libxfs/xfs_refcount.h | 2 ++ fs/xfs/scrub/refcount.c | 14 +++----------- 3 files changed, 36 insertions(+), 25 deletions(-) diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 6dc968618e66..b77dea10c8bd 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,6 +120,30 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } +/* Simple checks for refcount records. */ +xfs_failaddr_t +xfs_refcount_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec) +{ + struct xfs_perag *pag = cur->bc_ag.pag; + + if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) + return __this_address; + + if (!xfs_refcount_check_domain(irec)) + return __this_address; + + /* check for valid extent range, including overflow */ + if (!xfs_verify_agbext(pag, irec->rc_startblock, irec->rc_blockcount)) + return __this_address; + + if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT) + return __this_address; + + return NULL; +} + /* * Get the data from the pointed-to record. 
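As a hypothetical illustration of the domain check: this assumes the rule enforced by xfs_refcount_check_domain(), which is not shown in this patch, namely that CoW staging extents always carry a refcount of exactly 1 while shared-domain records need at least 2.

/* Invented records; neither value comes from the patch itself. */
struct xfs_refcount_irec cow_ok = {
	.rc_domain	= XFS_REFC_DOMAIN_COW,
	.rc_startblock	= 10,
	.rc_blockcount	= 4,
	.rc_refcount	= 1,	/* ok: staging extents are always 1 */
};

struct xfs_refcount_irec shared_bad = {
	.rc_domain	= XFS_REFC_DOMAIN_SHARED,
	.rc_startblock	= 20,
	.rc_blockcount	= 4,
	.rc_refcount	= 1,	/* rejected: shared domain needs >= 2 */
};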
*/ @@ -132,6 +156,7 @@ xfs_refcount_get_rec( struct xfs_mount *mp = cur->bc_mp; struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; + xfs_failaddr_t fa; int error; error = xfs_btree_get_rec(cur, &rec, stat); @@ -139,17 +164,8 @@ xfs_refcount_get_rec( return error; xfs_refcount_btrec_to_irec(rec, irec); - if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) - goto out_bad_rec; - - if (!xfs_refcount_check_domain(irec)) - goto out_bad_rec; - - /* check for valid extent range, including overflow */ - if (!xfs_verify_agbext(pag, irec->rc_startblock, irec->rc_blockcount)) - goto out_bad_rec; - - if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT) + fa = xfs_refcount_check_irec(cur, irec); + if (fa) goto out_bad_rec; trace_xfs_refcount_get(cur->bc_mp, pag->pag_agno, irec); @@ -157,8 +173,8 @@ xfs_refcount_get_rec( out_bad_rec: xfs_warn(mp, - "Refcount BTree record corruption in AG %d detected!", - pag->pag_agno); + "Refcount BTree record corruption in AG %d detected at %pS!", + pag->pag_agno, fa); xfs_warn(mp, "Start block 0x%x, block count 0x%x, references 0x%x", irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount); @@ -1871,7 +1887,8 @@ xfs_refcount_recover_extent( INIT_LIST_HEAD(&rr->rr_list); xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec); - if (XFS_IS_CORRUPT(cur->bc_mp, + if (xfs_refcount_check_irec(cur, &rr->rr_rrec) != NULL || + XFS_IS_CORRUPT(cur->bc_mp, rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) { kfree(rr); return -EFSCORRUPTED; diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index c89f0fcd1ee3..fc0b58d4c379 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -117,6 +117,8 @@ extern int xfs_refcount_has_record(struct xfs_btree_cur *cur, union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, struct xfs_refcount_irec *irec, int *stat); diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 6f649cc01310..4d77049dfce2 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -340,24 +340,16 @@ xchk_refcountbt_rec( { struct xfs_refcount_irec irec; xfs_agblock_t *cow_blocks = bs->private; - struct xfs_perag *pag = bs->cur->bc_ag.pag; xfs_refcount_btrec_to_irec(rec, &irec); - - /* Check the domain and refcount are not incompatible. */ - if (!xfs_refcount_check_domain(&irec)) + if (xfs_refcount_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } if (irec.rc_domain == XFS_REFC_DOMAIN_COW) (*cow_blocks) += irec.rc_blockcount; - /* Check the extent. */ - if (!xfs_verify_agbext(pag, irec.rc_startblock, irec.rc_blockcount)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - if (irec.rc_refcount == 0) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - xchk_refcountbt_xref(bs->sc, &irec); return 0; -- cgit v1.2.3 From 39ab26d59f039c6190bbaa8118a8f0ffed84492a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:02 -0700 Subject: xfs: return a failure address from xfs_rmap_irec_offset_unpack Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED. Change this function to return the code address of a failed conversion in preparation for the next patch, which standardizes localized record checking and reporting code. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 9 ++++----- fs/xfs/libxfs/xfs_rmap.h | 9 +++++---- fs/xfs/scrub/rmap.c | 11 +++++------ 3 files changed, 14 insertions(+), 15 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index c2624d11f041..830b38337cd5 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -193,7 +193,7 @@ done: } /* Convert an internal btree record to an rmap record. */ -int +xfs_failaddr_t xfs_rmap_btrec_to_irec( const union xfs_btree_rec *rec, struct xfs_rmap_irec *irec) @@ -2320,11 +2320,10 @@ xfs_rmap_query_range_helper( { struct xfs_rmap_query_range_info *query = priv; struct xfs_rmap_irec irec; - int error; - error = xfs_rmap_btrec_to_irec(rec, &irec); - if (error) - return error; + if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) + return -EFSCORRUPTED; + return query->fn(cur, &irec, query->priv); } diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 1472ae570a8a..6a08c403e8b7 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -62,13 +62,14 @@ xfs_rmap_irec_offset_pack( return x; } -static inline int +static inline xfs_failaddr_t xfs_rmap_irec_offset_unpack( __u64 offset, struct xfs_rmap_irec *irec) { if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS)) - return -EFSCORRUPTED; + return __this_address; + irec->rm_offset = XFS_RMAP_OFF(offset); irec->rm_flags = 0; if (offset & XFS_RMAP_OFF_ATTR_FORK) @@ -77,7 +78,7 @@ xfs_rmap_irec_offset_unpack( irec->rm_flags |= XFS_RMAP_BMBT_BLOCK; if (offset & XFS_RMAP_OFF_UNWRITTEN) irec->rm_flags |= XFS_RMAP_UNWRITTEN; - return 0; + return NULL; } static inline void @@ -192,7 +193,7 @@ int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno, int xfs_rmap_compare(const struct xfs_rmap_irec *a, const struct xfs_rmap_irec *b); union xfs_btree_rec; -int xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, +xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_rmap_irec *irec); int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, bool *exists); diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index c6e47ef4c79b..9f661c487beb 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -100,11 +100,11 @@ xchk_rmapbt_rec( bool is_unwritten; bool is_bmbt; bool is_attr; - int error; - error = xfs_rmap_btrec_to_irec(rec, &irec); - if (!xchk_btree_process_error(bs->sc, bs->cur, 0, &error)) - goto out; + if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) { + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } /* Check extent. */ if (irec.rm_startblock + irec.rm_blockcount <= irec.rm_startblock) @@ -159,8 +159,7 @@ xchk_rmapbt_rec( } xchk_rmapbt_xref(bs->sc, &irec); -out: - return error; + return 0; } /* Scrub the rmap btree for some AG. */ -- cgit v1.2.3 From c4e34172da26cb57f56c471728853d3a428ec832 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:03 -0700 Subject: xfs: standardize ondisk to incore conversion for rmap btrees Create a xfs_rmap_check_irec function to detect corruption in btree records. Fix all xfs_rmap_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 62 ++++++++++++++++++++++++++++++------------------ fs/xfs/libxfs/xfs_rmap.h | 3 +++ fs/xfs/scrub/rmap.c | 39 ++---------------------------- 3 files changed, 44 insertions(+), 60 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index 830b38337cd5..5c7b081cef87 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -205,6 +205,36 @@ xfs_rmap_btrec_to_irec( irec); } +/* Simple checks for rmap records. */ +xfs_failaddr_t +xfs_rmap_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + if (irec->rm_blockcount == 0) + return __this_address; + if (irec->rm_startblock <= XFS_AGFL_BLOCK(mp)) { + if (irec->rm_owner != XFS_RMAP_OWN_FS) + return __this_address; + if (irec->rm_blockcount != XFS_AGFL_BLOCK(mp) + 1) + return __this_address; + } else { + /* check for valid extent range, including overflow */ + if (!xfs_verify_agbext(cur->bc_ag.pag, irec->rm_startblock, + irec->rm_blockcount)) + return __this_address; + } + + if (!(xfs_verify_ino(mp, irec->rm_owner) || + (irec->rm_owner <= XFS_RMAP_OWN_FS && + irec->rm_owner >= XFS_RMAP_OWN_MIN))) + return __this_address; + + return NULL; +} + /* * Get the data from the pointed-to record. */ @@ -217,39 +247,24 @@ xfs_rmap_get_rec( struct xfs_mount *mp = cur->bc_mp; struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; + xfs_failaddr_t fa; int error; error = xfs_btree_get_rec(cur, &rec, stat); if (error || !*stat) return error; - if (xfs_rmap_btrec_to_irec(rec, irec)) - goto out_bad_rec; - - if (irec->rm_blockcount == 0) - goto out_bad_rec; - if (irec->rm_startblock <= XFS_AGFL_BLOCK(mp)) { - if (irec->rm_owner != XFS_RMAP_OWN_FS) - goto out_bad_rec; - if (irec->rm_blockcount != XFS_AGFL_BLOCK(mp) + 1) - goto out_bad_rec; - } else { - /* check for valid extent range, including overflow */ - if (!xfs_verify_agbext(pag, irec->rm_startblock, - irec->rm_blockcount)) - goto out_bad_rec; - } - - if (!(xfs_verify_ino(mp, irec->rm_owner) || - (irec->rm_owner <= XFS_RMAP_OWN_FS && - irec->rm_owner >= XFS_RMAP_OWN_MIN))) + fa = xfs_rmap_btrec_to_irec(rec, irec); + if (!fa) + fa = xfs_rmap_check_irec(cur, irec); + if (fa) goto out_bad_rec; return 0; out_bad_rec: xfs_warn(mp, - "Reverse Mapping BTree record corruption in AG %d detected!", - pag->pag_agno); + "Reverse Mapping BTree record corruption in AG %d detected at %pS!", + pag->pag_agno, fa); xfs_warn(mp, "Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x", irec->rm_owner, irec->rm_flags, irec->rm_startblock, @@ -2321,7 +2336,8 @@ xfs_rmap_query_range_helper( struct xfs_rmap_query_range_info *query = priv; struct xfs_rmap_irec irec; - if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) + if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || + xfs_rmap_check_irec(cur, &irec) != NULL) return -EFSCORRUPTED; return query->fn(cur, &irec, query->priv); diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 6a08c403e8b7..7fb298bcc15f 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -195,6 +195,9 @@ int xfs_rmap_compare(const struct xfs_rmap_irec *a, union xfs_btree_rec; xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_rmap_irec *irec); +xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *irec); + int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, bool *exists); int 
xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno, diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 9f661c487beb..353cf9d90027 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -93,43 +93,18 @@ xchk_rmapbt_rec( struct xchk_btree *bs, const union xfs_btree_rec *rec) { - struct xfs_mount *mp = bs->cur->bc_mp; struct xfs_rmap_irec irec; - struct xfs_perag *pag = bs->cur->bc_ag.pag; bool non_inode; bool is_unwritten; bool is_bmbt; bool is_attr; - if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) { + if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || + xfs_rmap_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); return 0; } - /* Check extent. */ - if (irec.rm_startblock + irec.rm_blockcount <= irec.rm_startblock) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - if (irec.rm_owner == XFS_RMAP_OWN_FS) { - /* - * xfs_verify_agbno returns false for static fs metadata. - * Since that only exists at the start of the AG, validate - * that by hand. - */ - if (irec.rm_startblock != 0 || - irec.rm_blockcount != XFS_AGFL_BLOCK(mp) + 1) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - } else { - /* - * Otherwise we must point somewhere past the static metadata - * but before the end of the FS. Run the regular check. - */ - if (!xfs_verify_agbno(pag, irec.rm_startblock) || - !xfs_verify_agbno(pag, irec.rm_startblock + - irec.rm_blockcount - 1)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - } - /* Check flags. */ non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner); is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK; @@ -148,16 +123,6 @@ xchk_rmapbt_rec( if (non_inode && (is_bmbt || is_unwritten || is_attr)) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - if (!non_inode) { - if (!xfs_verify_ino(mp, irec.rm_owner)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - } else { - /* Non-inode owner within the magic values? */ - if (irec.rm_owner <= XFS_RMAP_OWN_MIN || - irec.rm_owner > XFS_RMAP_OWN_FS) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - } - xchk_rmapbt_xref(bs->sc, &irec); return 0; } -- cgit v1.2.3 From 69010fe3ac1fe9932a64268c32b67964fe5c06a8 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:04 -0700 Subject: xfs: standardize ondisk to incore conversion for bmap btrees Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent and bubble up corruption reports. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index a5078d63808f..6188eba672e5 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -471,6 +471,12 @@ xchk_bmapbt_rec( return 0; xfs_bmbt_disk_get_all(&rec->bmbt, &irec); + if (xfs_bmap_validate_extent(ip, info->whichfork, &irec) != NULL) { + xchk_fblock_set_corrupt(bs->sc, info->whichfork, + irec.br_startoff); + return 0; + } + if (!xfs_iext_lookup_extent(ip, ifp, irec.br_startoff, &icur, &iext_irec) || irec.br_startoff != iext_irec.br_startoff || -- cgit v1.2.3 From ee12eaaa435a7be17152ac50943ee77249de624a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:04 -0700 Subject: xfs: complain about bad records in query_range helpers For every btree type except for the bmbt, refactor the code that complains about bad records into a helper and make the ->query_range helpers call it so that corruptions found via that avenue are logged. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_alloc.c | 38 +++++++++++++++++++++++--------------- fs/xfs/libxfs/xfs_ialloc.c | 38 ++++++++++++++++++++++++-------------- fs/xfs/libxfs/xfs_refcount.c | 32 +++++++++++++++++++------------- fs/xfs/libxfs/xfs_rmap.c | 40 +++++++++++++++++++++++++--------------- 4 files changed, 91 insertions(+), 57 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 89c935cbcc4d..23f0acfc2a64 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -261,6 +261,24 @@ xfs_alloc_check_irec( return NULL; } +static inline int +xfs_alloc_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_alloc_rec_incore *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + xfs_warn(mp, + "%s Freespace BTree record corruption in AG %d detected at %pS!", + cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size", + cur->bc_ag.pag->pag_agno, fa); + xfs_warn(mp, + "start block 0x%x block count 0x%x", irec->ar_startblock, + irec->ar_blockcount); + return -EFSCORRUPTED; +} + /* * Get the data from the pointed-to record. */ @@ -272,8 +290,6 @@ xfs_alloc_get_rec( int *stat) /* output: success/failure */ { struct xfs_alloc_rec_incore irec; - struct xfs_mount *mp = cur->bc_mp; - struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; xfs_failaddr_t fa; int error; @@ -285,21 +301,11 @@ xfs_alloc_get_rec( xfs_alloc_btrec_to_irec(rec, &irec); fa = xfs_alloc_check_irec(cur, &irec); if (fa) - goto out_bad_rec; + return xfs_alloc_complain_bad_rec(cur, fa, &irec); *bno = irec.ar_startblock; *len = irec.ar_blockcount; return 0; - -out_bad_rec: - xfs_warn(mp, - "%s Freespace BTree record corruption in AG %d detected at %pS!", - cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size", - pag->pag_agno, fa); - xfs_warn(mp, - "start block 0x%x block count 0x%x", irec.ar_startblock, - irec.ar_blockcount); - return -EFSCORRUPTED; } /* @@ -3692,10 +3698,12 @@ xfs_alloc_query_range_helper( { struct xfs_alloc_query_range_info *query = priv; struct xfs_alloc_rec_incore irec; + xfs_failaddr_t fa; xfs_alloc_btrec_to_irec(rec, &irec); - if (xfs_alloc_check_irec(cur, &irec) != NULL) - return -EFSCORRUPTED; + fa = xfs_alloc_check_irec(cur, &irec); + if (fa) + return xfs_alloc_complain_bad_rec(cur, fa, &irec); return query->fn(cur, &irec, query->priv); } diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 32af8326ad76..b7dc8b81a133 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -122,6 +122,25 @@ xfs_inobt_check_irec( return NULL; } +static inline int +xfs_inobt_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_inobt_rec_incore *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + xfs_warn(mp, + "%s Inode BTree record corruption in AG %d detected at %pS!", + cur->bc_btnum == XFS_BTNUM_INO ? "Used" : "Free", + cur->bc_ag.pag->pag_agno, fa); + xfs_warn(mp, +"start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x", + irec->ir_startino, irec->ir_count, irec->ir_freecount, + irec->ir_free, irec->ir_holemask); + return -EFSCORRUPTED; +} + /* * Get the data from the pointed-to record. */ @@ -143,20 +162,9 @@ xfs_inobt_get_rec( xfs_inobt_btrec_to_irec(mp, rec, irec); fa = xfs_inobt_check_irec(cur, irec); if (fa) - goto out_bad_rec; + return xfs_inobt_complain_bad_rec(cur, fa, irec); return 0; - -out_bad_rec: - xfs_warn(mp, - "%s Inode BTree record corruption in AG %d detected at %pS!", - cur->bc_btnum == XFS_BTNUM_INO ? 
"Used" : "Free", - cur->bc_ag.pag->pag_agno, fa); - xfs_warn(mp, -"start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x", - irec->ir_startino, irec->ir_count, irec->ir_freecount, - irec->ir_free, irec->ir_holemask); - return -EFSCORRUPTED; } /* @@ -2702,10 +2710,12 @@ xfs_ialloc_count_inodes_rec( { struct xfs_inobt_rec_incore irec; struct xfs_ialloc_count_inodes *ci = priv; + xfs_failaddr_t fa; xfs_inobt_btrec_to_irec(cur->bc_mp, rec, &irec); - if (xfs_inobt_check_irec(cur, &irec) != NULL) - return -EFSCORRUPTED; + fa = xfs_inobt_check_irec(cur, &irec); + if (fa) + return xfs_inobt_complain_bad_rec(cur, fa, &irec); ci->count += irec.ir_count; ci->freecount += irec.ir_freecount; diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index b77dea10c8bd..335f84bef81c 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -144,6 +144,23 @@ xfs_refcount_check_irec( return NULL; } +static inline int +xfs_refcount_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_refcount_irec *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + xfs_warn(mp, + "Refcount BTree record corruption in AG %d detected at %pS!", + cur->bc_ag.pag->pag_agno, fa); + xfs_warn(mp, + "Start block 0x%x, block count 0x%x, references 0x%x", + irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount); + return -EFSCORRUPTED; +} + /* * Get the data from the pointed-to record. */ @@ -153,8 +170,6 @@ xfs_refcount_get_rec( struct xfs_refcount_irec *irec, int *stat) { - struct xfs_mount *mp = cur->bc_mp; - struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; xfs_failaddr_t fa; int error; @@ -166,19 +181,10 @@ xfs_refcount_get_rec( xfs_refcount_btrec_to_irec(rec, irec); fa = xfs_refcount_check_irec(cur, irec); if (fa) - goto out_bad_rec; + return xfs_refcount_complain_bad_rec(cur, fa, irec); - trace_xfs_refcount_get(cur->bc_mp, pag->pag_agno, irec); + trace_xfs_refcount_get(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec); return 0; - -out_bad_rec: - xfs_warn(mp, - "Refcount BTree record corruption in AG %d detected at %pS!", - pag->pag_agno, fa); - xfs_warn(mp, - "Start block 0x%x, block count 0x%x, references 0x%x", - irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount); - return -EFSCORRUPTED; } /* diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index 5c7b081cef87..641114a023f2 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -235,6 +235,24 @@ xfs_rmap_check_irec( return NULL; } +static inline int +xfs_rmap_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_rmap_irec *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + xfs_warn(mp, + "Reverse Mapping BTree record corruption in AG %d detected at %pS!", + cur->bc_ag.pag->pag_agno, fa); + xfs_warn(mp, + "Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x", + irec->rm_owner, irec->rm_flags, irec->rm_startblock, + irec->rm_blockcount); + return -EFSCORRUPTED; +} + /* * Get the data from the pointed-to record. 
*/ @@ -244,8 +262,6 @@ xfs_rmap_get_rec( struct xfs_rmap_irec *irec, int *stat) { - struct xfs_mount *mp = cur->bc_mp; - struct xfs_perag *pag = cur->bc_ag.pag; union xfs_btree_rec *rec; xfs_failaddr_t fa; int error; @@ -258,18 +274,9 @@ xfs_rmap_get_rec( if (!fa) fa = xfs_rmap_check_irec(cur, irec); if (fa) - goto out_bad_rec; + return xfs_rmap_complain_bad_rec(cur, fa, irec); return 0; -out_bad_rec: - xfs_warn(mp, - "Reverse Mapping BTree record corruption in AG %d detected at %pS!", - pag->pag_agno, fa); - xfs_warn(mp, - "Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x", - irec->rm_owner, irec->rm_flags, irec->rm_startblock, - irec->rm_blockcount); - return -EFSCORRUPTED; } struct xfs_find_left_neighbor_info { @@ -2335,10 +2342,13 @@ xfs_rmap_query_range_helper( { struct xfs_rmap_query_range_info *query = priv; struct xfs_rmap_irec irec; + xfs_failaddr_t fa; - if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || - xfs_rmap_check_irec(cur, &irec) != NULL) - return -EFSCORRUPTED; + fa = xfs_rmap_btrec_to_irec(rec, &irec); + if (!fa) + fa = xfs_rmap_check_irec(cur, &irec); + if (fa) + return xfs_rmap_complain_bad_rec(cur, fa, &irec); return query->fn(cur, &irec, query->priv); } -- cgit v1.2.3 From 7d7d6d2fd0444904f12e70d9c930556c4eb44337 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:05 -0700 Subject: xfs: hoist rmap record flag checks from scrub Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 22 ++++++++++++++++++++++ fs/xfs/scrub/rmap.c | 22 ---------------------- 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index 641114a023f2..e66ecd794a84 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -212,6 +212,10 @@ xfs_rmap_check_irec( const struct xfs_rmap_irec *irec) { struct xfs_mount *mp = cur->bc_mp; + bool is_inode; + bool is_unwritten; + bool is_bmbt; + bool is_attr; if (irec->rm_blockcount == 0) return __this_address; @@ -232,6 +236,24 @@ xfs_rmap_check_irec( irec->rm_owner >= XFS_RMAP_OWN_MIN))) return __this_address; + /* Check flags. */ + is_inode = !XFS_RMAP_NON_INODE_OWNER(irec->rm_owner); + is_bmbt = irec->rm_flags & XFS_RMAP_BMBT_BLOCK; + is_attr = irec->rm_flags & XFS_RMAP_ATTR_FORK; + is_unwritten = irec->rm_flags & XFS_RMAP_UNWRITTEN; + + if (is_bmbt && irec->rm_offset != 0) + return __this_address; + + if (!is_inode && irec->rm_offset != 0) + return __this_address; + + if (is_unwritten && (is_bmbt || !is_inode || is_attr)) + return __this_address; + + if (!is_inode && (is_bmbt || is_unwritten || is_attr)) + return __this_address; + return NULL; } diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 353cf9d90027..ef6e4b8546a6 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -94,10 +94,6 @@ xchk_rmapbt_rec( const union xfs_btree_rec *rec) { struct xfs_rmap_irec irec; - bool non_inode; - bool is_unwritten; - bool is_bmbt; - bool is_attr; if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || xfs_rmap_check_irec(bs->cur, &irec) != NULL) { @@ -105,24 +101,6 @@ xchk_rmapbt_rec( return 0; } - /* Check flags. 
*/ - non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner); - is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK; - is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK; - is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN; - - if (is_bmbt && irec.rm_offset != 0) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - if (non_inode && irec.rm_offset != 0) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - if (is_unwritten && (is_bmbt || non_inode || is_attr)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - if (non_inode && (is_bmbt || is_unwritten || is_attr)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - xchk_rmapbt_xref(bs->sc, &irec); return 0; } -- cgit v1.2.3 From 6a3bd8fcf9afb47c703cb268f30f60aa2e7af86a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:05 -0700 Subject: xfs: complain about bad file mapping records in the ondisk bmbt Similar to what we've just done for the other btrees, create a function to log corrupt bmbt records and call it whenever we encounter a bad record in the ondisk btree. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_bmap.c | 31 ++++++++++++++++++++++++++++++- fs/xfs/libxfs/xfs_bmap.h | 2 ++ fs/xfs/libxfs/xfs_inode_fork.c | 3 ++- 3 files changed, 34 insertions(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 0b4fb3979bdc..5e66807737c5 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1083,6 +1083,34 @@ struct xfs_iread_state { xfs_extnum_t loaded; }; +int +xfs_bmap_complain_bad_rec( + struct xfs_inode *ip, + int whichfork, + xfs_failaddr_t fa, + const struct xfs_bmbt_irec *irec) +{ + struct xfs_mount *mp = ip->i_mount; + const char *forkname; + + switch (whichfork) { + case XFS_DATA_FORK: forkname = "data"; break; + case XFS_ATTR_FORK: forkname = "attr"; break; + case XFS_COW_FORK: forkname = "CoW"; break; + default: forkname = "???"; break; + } + + xfs_warn(mp, + "Bmap BTree record corruption in inode 0x%llx %s fork detected at %pS!", + ip->i_ino, forkname, fa); + xfs_warn(mp, + "Offset 0x%llx, start block 0x%llx, block count 0x%llx state 0x%x", + irec->br_startoff, irec->br_startblock, irec->br_blockcount, + irec->br_state); + + return -EFSCORRUPTED; +} + /* Stuff every bmbt record from this block into the incore extent map. 
*/ static int xfs_iread_bmbt_block( @@ -1125,7 +1153,8 @@ xfs_iread_bmbt_block( xfs_inode_verifier_error(ip, -EFSCORRUPTED, "xfs_iread_extents(2)", frp, sizeof(*frp), fa); - return -EFSCORRUPTED; + return xfs_bmap_complain_bad_rec(ip, whichfork, fa, + &new); } xfs_iext_insert(ip, &ir->icur, &new, xfs_bmap_fork_to_state(whichfork)); diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index e5a492027aea..9ff030d12981 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -265,6 +265,8 @@ static inline uint32_t xfs_bmap_fork_to_state(int whichfork) xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork, struct xfs_bmbt_irec *irec); +int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork, + xfs_failaddr_t fa, const struct xfs_bmbt_irec *irec); int xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip, xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock, diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index 6b21760184d9..ff37eecec4b0 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -140,7 +140,8 @@ xfs_iformat_extents( xfs_inode_verifier_error(ip, -EFSCORRUPTED, "xfs_iformat_extents(2)", dp, sizeof(*dp), fa); - return -EFSCORRUPTED; + return xfs_bmap_complain_bad_rec(ip, whichfork, + fa, &new); } xfs_iext_insert(ip, &icur, &new, state); -- cgit v1.2.3 From e774b2ea0bb130d00e3cb1c29cd91028d0c0c83d Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:06 -0700 Subject: xfs: hoist rmap record flag checks from scrub Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index e66ecd794a84..da008d317f83 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -254,6 +254,11 @@ xfs_rmap_check_irec( if (!is_inode && (is_bmbt || is_unwritten || is_attr)) return __this_address; + /* Check for a valid fork offset, if applicable. */ + if (is_inode && !is_bmbt && + !xfs_verify_fileext(mp, irec->rm_offset, irec->rm_blockcount)) + return __this_address; + return NULL; } -- cgit v1.2.3 From de1a9ce225e93b22d189f8ffbce20074bc803121 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:06 -0700 Subject: xfs: hoist inode record alignment checks from scrub Move the inobt record alignment checks from xchk_iallocbt_rec into xfs_inobt_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ialloc.c | 4 ++++ fs/xfs/scrub/ialloc.c | 6 ------ 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index b7dc8b81a133..0d2980accd3c 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -103,8 +103,12 @@ xfs_inobt_check_irec( { uint64_t realfree; + /* Record has to be properly aligned within the AG. 
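Invented numbers illustrate what the hoisted second check catches. With XFS_INODES_PER_CHUNK = 64 inodes per record, suppose the last valid agino in this AG is 0x1ff:

/* Hypothetical ir_startino values against an AG ending at agino 0x1ff: */
irec.ir_startino = 0x1c0;	/* ok: 0x1c0 + 63 = 0x1ff is still valid */
irec.ir_startino = 0x1e0;	/* rejected: 0x1e0 + 63 = 0x21f is past the AG */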
*/ if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino)) return __this_address; + if (!xfs_verify_agino(cur->bc_ag.pag, + irec->ir_startino + XFS_INODES_PER_CHUNK - 1)) + return __this_address; if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT || irec->ir_count > XFS_INODES_PER_CHUNK) return __this_address; diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 11afb4c5a161..ca5a7e0f5451 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -413,7 +413,6 @@ xchk_iallocbt_rec( const union xfs_btree_rec *rec) { struct xfs_mount *mp = bs->cur->bc_mp; - struct xfs_perag *pag = bs->cur->bc_ag.pag; struct xchk_iallocbt *iabt = bs->private; struct xfs_inobt_rec_incore irec; uint64_t holes; @@ -431,11 +430,6 @@ xchk_iallocbt_rec( } agino = irec.ir_startino; - /* Record has to be properly aligned within the AG. */ - if (!xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) { - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - goto out; - } xchk_iallocbt_rec_alignment(bs, &irec); if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) -- cgit v1.2.3 From 08c987deca56687c0930f308f5148ef1af38dc14 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:07 -0700 Subject: xfs: fix rm_offset flag handling in rmap keys Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, offset) This provides users the ability to look up a reverse mapping from a file block mapping record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. Unfortunately, the code that creates rmap lookup keys from rmap records forgot to mask off the record attribute flags, leading to ondisk keys that look like this: (physical block, owner, fork, is_btree, unwritten state, offset) Fortunately, this has all worked ok for the past six years because the key comparison functions incorrectly ignore the fork/bmbt/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) Queries can (theoretically) return incorrect results because of this omission. On consistent filesystems this isn't an issue because xattr and bmbt blocks cannot be shared and hence the comparisons succeed purely on the contents of the rm_startblock field. For the one case where we support sharing (written data fork blocks) all flag bits are zero, so the omission in the comparison has no ill effects. Unfortunately, this bug prevents scrub from detecting incorrect fork and bmbt flag bits in the rmap btree, so we really do need to fix the compare code. Old filesystems with the unwritten bit erroneously set in the rmap key struct will work fine on new kernels since we still ignore the unwritten bit. New filesystems on older kernels will work fine since the old kernels never paid attention to the unwritten bit. A previous version of this patch forgot to keep the (un)written state flag masked during the comparison and caused a major regression in 5.9.x since unwritten extent conversion can update an rmap record without requiring key updates. Note that blocks cannot go directly from data fork to attr fork without being deallocated and reallocated, nor can they be added to or removed from a bmbt without a free/alloc cycle, so this should not cause any regressions. Found by fuzzing keys[1].attrfork = ones on xfs/371. 
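To make the masking rule concrete, here is a minimal userspace C sketch of the idea. This is not the kernel code: the OFF_* constants and key_offset() helper below are stand-ins for the kernel's XFS_RMAP_OFF_ATTR_FORK / XFS_RMAP_OFF_BMBT_BLOCK / XFS_RMAP_OFF_UNWRITTEN bits and key conversion, and the real implementation operates on big-endian __be64 key fields via cpu_to_be64().

	#include <assert.h>
	#include <stdint.h>

	/* Stand-ins for the kernel's XFS_RMAP_OFF_* flag bits. */
	#define OFF_ATTR_FORK	(1ULL << 63)
	#define OFF_BMBT_BLOCK	(1ULL << 62)
	#define OFF_UNWRITTEN	(1ULL << 61)

	/*
	 * Derive the key form of rm_offset: strip only the unwritten bit,
	 * because written state is a record attribute; the fork and bmbt
	 * bits remain part of the key.
	 */
	static uint64_t key_offset(uint64_t rm_offset)
	{
		return rm_offset & ~OFF_UNWRITTEN;
	}

	int main(void)
	{
		uint64_t w = 100;			/* written mapping at offset 100 */
		uint64_t u = 100 | OFF_UNWRITTEN;	/* same mapping, unwritten */
		uint64_t a = 100 | OFF_ATTR_FORK;	/* attr fork mapping at 100 */
		uint64_t b = 100 | OFF_BMBT_BLOCK;	/* bmbt block for that fork */

		/* Unwritten extent conversion must not change the lookup key... */
		assert(key_offset(w) == key_offset(u));
		/* ...but attr fork and bmbt mappings still sort as distinct keys. */
		assert(key_offset(w) != key_offset(a));
		assert(key_offset(w) != key_offset(b));
		return 0;
	}

This is the invariant that the ondisk_rec_offset_to_key() and offset_keymask() helpers in the patch below establish for the ondisk rmapbt.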
Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap_btree.c | 40 ++++++++++++++++++++++++++++++---------- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index 56d074b42660..84e2b692f034 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -156,6 +156,16 @@ xfs_rmapbt_get_maxrecs( return cur->bc_mp->m_rmap_mxr[level != 0]; } +/* + * Convert the ondisk record's offset field into the ondisk key's offset field. + * Fork and bmbt are significant parts of the rmap record key, but written + * status is merely a record attribute. + */ +static inline __be64 ondisk_rec_offset_to_key(const union xfs_btree_rec *rec) +{ + return rec->rmap.rm_offset & ~cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN); +} + STATIC void xfs_rmapbt_init_key_from_rec( union xfs_btree_key *key, @@ -163,7 +173,7 @@ xfs_rmapbt_init_key_from_rec( { key->rmap.rm_startblock = rec->rmap.rm_startblock; key->rmap.rm_owner = rec->rmap.rm_owner; - key->rmap.rm_offset = rec->rmap.rm_offset; + key->rmap.rm_offset = ondisk_rec_offset_to_key(rec); } /* @@ -186,7 +196,7 @@ xfs_rmapbt_init_high_key_from_rec( key->rmap.rm_startblock = rec->rmap.rm_startblock; be32_add_cpu(&key->rmap.rm_startblock, adj); key->rmap.rm_owner = rec->rmap.rm_owner; - key->rmap.rm_offset = rec->rmap.rm_offset; + key->rmap.rm_offset = ondisk_rec_offset_to_key(rec); if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) || XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset))) return; @@ -219,6 +229,16 @@ xfs_rmapbt_init_ptr_from_cur( ptr->s = agf->agf_roots[cur->bc_btnum]; } +/* + * Mask the appropriate parts of the ondisk key field for a key comparison. + * Fork and bmbt are significant parts of the rmap record key, but written + * status is merely a record attribute. + */ +static inline uint64_t offset_keymask(uint64_t offset) +{ + return offset & ~XFS_RMAP_OFF_UNWRITTEN; +} + STATIC int64_t xfs_rmapbt_key_diff( struct xfs_btree_cur *cur, @@ -240,8 +260,8 @@ xfs_rmapbt_key_diff( else if (y > x) return -1; - x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset)); - y = rec->rm_offset; + x = offset_keymask(be64_to_cpu(kp->rm_offset)); + y = offset_keymask(xfs_rmap_irec_offset_pack(rec)); if (x > y) return 1; else if (y > x) @@ -272,8 +292,8 @@ xfs_rmapbt_diff_two_keys( else if (y > x) return -1; - x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset)); - y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset)); + x = offset_keymask(be64_to_cpu(kp1->rm_offset)); + y = offset_keymask(be64_to_cpu(kp2->rm_offset)); if (x > y) return 1; else if (y > x) @@ -387,8 +407,8 @@ xfs_rmapbt_keys_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)); + a = offset_keymask(be64_to_cpu(k1->rmap.rm_offset)); + b = offset_keymask(be64_to_cpu(k2->rmap.rm_offset)); if (a <= b) return 1; return 0; @@ -417,8 +437,8 @@ xfs_rmapbt_recs_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)); + a = offset_keymask(be64_to_cpu(r1->rmap.rm_offset)); + b = offset_keymask(be64_to_cpu(r2->rmap.rm_offset)); if (a <= b) return 1; return 0; -- cgit v1.2.3 From 38384569a2a8a721623d80c5ae3bcf80614ab792 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 19:00:07 -0700 Subject: xfs: detect unwritten bit set in rmapbt node block keys In the last patch, we changed the rmapbt code to remove the UNWRITTEN bit when creating an rmapbt key from an rmapbt record, and we changed the rmapbt key comparison code to start considering the ATTR and BMBT flags during lookup. This brought the behavior of the rmapbt implementation in line with its specification. However, there may exist filesystems that have the unwritten bit still set in the rmapbt keys. We should detect these situations and flag the rmapbt as one that would benefit from optimization. Eventually, online repair will be able to do something in response to this. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/btree.c | 10 ++++++++++ fs/xfs/scrub/btree.h | 2 ++ fs/xfs/scrub/rmap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 65 insertions(+) diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 626282dbe2e3..de4b29ae0839 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -119,6 +119,16 @@ xchk_btree_xref_set_corrupt( __return_address); } +void +xchk_btree_set_preen( + struct xfs_scrub *sc, + struct xfs_btree_cur *cur, + int level) +{ + __xchk_btree_set_corrupt(sc, cur, level, XFS_SCRUB_OFLAG_PREEN, + __return_address); +} + /* * Make sure this record is in order and doesn't stray outside of the parent * keys. diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h index 70461885c6c7..639e44e7544f 100644 --- a/fs/xfs/scrub/btree.h +++ b/fs/xfs/scrub/btree.h @@ -19,6 +19,8 @@ bool xchk_btree_xref_process_error(struct xfs_scrub *sc, /* Check for btree corruption. */ void xchk_btree_set_corrupt(struct xfs_scrub *sc, struct xfs_btree_cur *cur, int level); +void xchk_btree_set_preen(struct xfs_scrub *sc, struct xfs_btree_cur *cur, + int level); /* Check for btree xref discrepancies. */ void xchk_btree_xref_set_corrupt(struct xfs_scrub *sc, diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index ef6e4b8546a6..8e78e1bc9eef 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -87,6 +87,58 @@ xchk_rmapbt_xref( xchk_rmapbt_xref_refc(sc, irec); } +/* + * Check for bogus UNWRITTEN flags in the rmapbt node block keys. + * + * In reverse mapping records, the file mapping extent state + * (XFS_RMAP_OFF_UNWRITTEN) is a record attribute, not a key field. It is not + * involved in lookups in any way. In older kernels, the functions that + * convert rmapbt records to keys forgot to filter out the extent state bit, + * even though the key comparison functions have filtered the flag correctly. + * If we spot an rmap key with the unwritten bit set in rm_offset, we should + * mark the btree as needing optimization to rebuild the btree without those + * flags. + */ +STATIC void +xchk_rmapbt_check_unwritten_in_keyflags( + struct xchk_btree *bs) +{ + struct xfs_scrub *sc = bs->sc; + struct xfs_btree_cur *cur = bs->cur; + struct xfs_btree_block *keyblock; + union xfs_btree_key *lkey, *hkey; + __be64 badflag = cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN); + unsigned int level; + + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_PREEN) + return; + + for (level = 1; level < cur->bc_nlevels; level++) { + struct xfs_buf *bp; + unsigned int ptr; + + /* Only check the first time we've seen this node block. 
*/ + if (cur->bc_levels[level].ptr > 1) + continue; + + keyblock = xfs_btree_get_block(cur, level, &bp); + for (ptr = 1; ptr <= be16_to_cpu(keyblock->bb_numrecs); ptr++) { + lkey = xfs_btree_key_addr(cur, ptr, keyblock); + + if (lkey->rmap.rm_offset & badflag) { + xchk_btree_set_preen(sc, cur, level); + break; + } + + hkey = xfs_btree_high_key_addr(cur, ptr, keyblock); + if (hkey->rmap.rm_offset & badflag) { + xchk_btree_set_preen(sc, cur, level); + break; + } + } + } +} + /* Scrub an rmapbt record. */ STATIC int xchk_rmapbt_rec( @@ -101,6 +153,7 @@ xchk_rmapbt_rec( return 0; } + xchk_rmapbt_check_unwritten_in_keyflags(bs); xchk_rmapbt_xref(bs->sc, &irec); return 0; } -- cgit v1.2.3 From c99f99fa3eafc824ea6859590f5d2e4c6a7f4359 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:08 -0700 Subject: xfs: check btree keys reflect the child block When scrub is checking a non-root btree block, it should make sure that the keys in the parent btree block accurately capture the keyspace that the child block stores. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/btree.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index de4b29ae0839..d9ab280eb6c4 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -529,6 +529,48 @@ xchk_btree_check_minrecs( xchk_btree_set_corrupt(bs->sc, cur, level); } +/* + * If this btree block has a parent, make sure that the parent's keys capture + * the keyspace contained in this block. + */ +STATIC void +xchk_btree_block_check_keys( + struct xchk_btree *bs, + int level, + struct xfs_btree_block *block) +{ + union xfs_btree_key block_key; + union xfs_btree_key *block_high_key; + union xfs_btree_key *parent_low_key, *parent_high_key; + struct xfs_btree_cur *cur = bs->cur; + struct xfs_btree_block *parent_block; + struct xfs_buf *bp; + + if (level == cur->bc_nlevels - 1) + return; + + xfs_btree_get_keys(cur, block, &block_key); + + /* Make sure the low key of this block matches the parent. */ + parent_block = xfs_btree_get_block(cur, level + 1, &bp); + parent_low_key = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, + parent_block); + if (cur->bc_ops->diff_two_keys(cur, &block_key, parent_low_key)) { + xchk_btree_set_corrupt(bs->sc, bs->cur, level); + return; + } + + if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) + return; + + /* Make sure the high key of this block matches the parent. */ + parent_high_key = xfs_btree_high_key_addr(cur, + cur->bc_levels[level + 1].ptr, parent_block); + block_high_key = xfs_btree_high_key_from_key(cur, &block_key); + if (cur->bc_ops->diff_two_keys(cur, block_high_key, parent_high_key)) + xchk_btree_set_corrupt(bs->sc, bs->cur, level); +} + /* * Grab and scrub a btree block given a btree pointer. Returns block * and buffer pointers (if applicable) if they're ok to use. @@ -580,7 +622,12 @@ xchk_btree_get_block( * Check the block's siblings; this function absorbs error codes * for us. */ - return xchk_btree_block_check_siblings(bs, *pblock); + error = xchk_btree_block_check_siblings(bs, *pblock); + if (error) + return error; + + xchk_btree_block_check_keys(bs, level, *pblock); + return 0; } /* -- cgit v1.2.3 From ee5fe8ff6d19b35e7547af789cba877dbf04517b Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 19:00:09 -0700 Subject: xfs: refactor converting btree irec to btree key We keep doing these conversions to support btree queries, so refactor this into a helper. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_btree.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c4649cc624e1..5ea1c27e2be6 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -4937,6 +4937,19 @@ out: return error; } +static inline void +xfs_btree_key_from_irec( + struct xfs_btree_cur *cur, + union xfs_btree_key *key, + const union xfs_btree_irec *irec) +{ + union xfs_btree_rec rec; + + cur->bc_rec = *irec; + cur->bc_ops->init_rec_from_cur(cur, &rec); + cur->bc_ops->init_key_from_rec(key, &rec); +} + /* * Query a btree for all records overlapping a given interval of keys. The * supplied function will be called with each record found; return one of the @@ -4951,18 +4964,12 @@ xfs_btree_query_range( xfs_btree_query_range_fn fn, void *priv) { - union xfs_btree_rec rec; union xfs_btree_key low_key; union xfs_btree_key high_key; /* Find the keys of both ends of the interval. */ - cur->bc_rec = *high_rec; - cur->bc_ops->init_rec_from_cur(cur, &rec); - cur->bc_ops->init_key_from_rec(&high_key, &rec); - - cur->bc_rec = *low_rec; - cur->bc_ops->init_rec_from_cur(cur, &rec); - cur->bc_ops->init_key_from_rec(&low_key, &rec); + xfs_btree_key_from_irec(cur, &high_key, high_rec); + xfs_btree_key_from_irec(cur, &low_key, low_rec); /* Enforce low key < high key. */ if (cur->bc_ops->diff_two_keys(cur, &low_key, &high_key) > 0) -- cgit v1.2.3 From 2bea8df0a52b05bc0dddd54e950ae37c83533b03 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:09 -0700 Subject: xfs: always scrub record/key order of interior records In commit d47fef9342d0, we removed the firstrec and firstkey fields of struct xchk_btree because Christoph thought they were unnecessary, reasoning that we could use the record index in the btree cursor instead. This is incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor position within a specific btree block, not within the entire level. The end result is that scrub no longer detects situations where the rightmost record of a block is identical to the leftmost record of that block's right sibling. Fix this regression by reintroducing record validity booleans so that order checking skips *only* the leftmost record/key in each level. Fixes: d47fef9342d0 ("xfs: don't track firstrec/firstkey separately in xchk_btree") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/btree.c | 14 ++++++++------ fs/xfs/scrub/btree.h | 8 +++++++- 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index d9ab280eb6c4..4ec3b1cab018 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -151,11 +151,12 @@ xchk_btree_rec( trace_xchk_btree_rec(bs->sc, cur, 0); - /* If this isn't the first record, are they in order? */ + /* Are all records across all record blocks in order? 
*/ + if (bs->lastrec_valid && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec)) xchk_btree_set_corrupt(bs->sc, cur, 0); memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len); + bs->lastrec_valid = true; if (cur->bc_nlevels == 1) return; @@ -198,11 +199,12 @@ xchk_btree_key( trace_xchk_btree_key(bs->sc, cur, level); - /* If this isn't the first key, are they in order? */ - if (cur->bc_levels[level].ptr > 1 && - !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key)) + /* Are all low keys across all node blocks in order? */ + if (bs->lastkey[level - 1].valid && + !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1].key, key)) xchk_btree_set_corrupt(bs->sc, cur, level); - memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len); + memcpy(&bs->lastkey[level - 1].key, key, cur->bc_ops->key_len); + bs->lastkey[level - 1].valid = true; if (level + 1 >= cur->bc_nlevels) return; diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h index 639e44e7544f..9d7b9ee8bef4 100644 --- a/fs/xfs/scrub/btree.h +++ b/fs/xfs/scrub/btree.h @@ -31,6 +31,11 @@ typedef int (*xchk_btree_rec_fn)( struct xchk_btree *bs, const union xfs_btree_rec *rec); +struct xchk_btree_key { + union xfs_btree_key key; + bool valid; +}; + struct xchk_btree { /* caller-provided scrub state */ struct xfs_scrub *sc; @@ -40,11 +45,12 @@ struct xchk_btree { void *private; /* internal scrub state */ + bool lastrec_valid; union xfs_btree_rec lastrec; struct list_head to_check; /* this element must come last! */ - union xfs_btree_key lastkey[]; + struct xchk_btree_key lastkey[]; }; /* -- cgit v1.2.3 From bd7e795108ccd8d0f3dc34e16957cbba7e89f342 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:10 -0700 Subject: xfs: refactor ->diff_two_keys callsites Create wrapper functions around ->diff_two_keys so that we don't have to remember what the return values mean, and adjust some of the code comments to reflect the longtime code behavior. We're going to introduce more uses of ->diff_two_keys in the next patch, so reduce the cognitive load for readers by doing this refactoring now. Suggested-by: Dave Chinner Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_btree.c | 57 ++++++++++++++++++++--------------------------- fs/xfs/libxfs/xfs_btree.h | 55 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/btree.c | 24 ++++++++++---------- 3 files changed, 91 insertions(+), 45 deletions(-) diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 5ea1c27e2be6..92c610850fac 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -2067,8 +2067,7 @@ xfs_btree_get_leaf_keys( for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { rec = xfs_btree_rec_addr(cur, n, block); cur->bc_ops->init_high_key_from_rec(&hkey, rec); - if (cur->bc_ops->diff_two_keys(cur, &hkey, &max_hkey) - > 0) + if (xfs_btree_keycmp_gt(cur, &hkey, &max_hkey)) max_hkey = hkey; } @@ -2096,7 +2095,7 @@ xfs_btree_get_node_keys( max_hkey = xfs_btree_high_key_addr(cur, 1, block); for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { hkey = xfs_btree_high_key_addr(cur, n, block); - if (cur->bc_ops->diff_two_keys(cur, hkey, max_hkey) > 0) + if (xfs_btree_keycmp_gt(cur, hkey, max_hkey)) max_hkey = hkey; } @@ -2183,8 +2182,8 @@ __xfs_btree_updkeys( nlkey = xfs_btree_key_addr(cur, ptr, block); nhkey = xfs_btree_high_key_addr(cur, ptr, block); if (!force_all && - !(cur->bc_ops->diff_two_keys(cur, nlkey, lkey) != 0 || - cur->bc_ops->diff_two_keys(cur, nhkey, hkey) != 0)) + xfs_btree_keycmp_eq(cur, nlkey, lkey) && + xfs_btree_keycmp_eq(cur, nhkey, hkey)) break; xfs_btree_copy_keys(cur, nlkey, lkey, 1); xfs_btree_log_keys(cur, bp, ptr, ptr); @@ -4716,7 +4715,6 @@ xfs_btree_simple_query_range( { union xfs_btree_rec *recp; union xfs_btree_key rec_key; - int64_t diff; int stat; bool firstrec = true; int error; @@ -4746,20 +4744,17 @@ xfs_btree_simple_query_range( if (error || !stat) break; - /* Skip if high_key(rec) < low_key. */ + /* Skip if low_key > high_key(rec). */ if (firstrec) { cur->bc_ops->init_high_key_from_rec(&rec_key, recp); firstrec = false; - diff = cur->bc_ops->diff_two_keys(cur, low_key, - &rec_key); - if (diff > 0) + if (xfs_btree_keycmp_gt(cur, low_key, &rec_key)) goto advloop; } - /* Stop if high_key < low_key(rec). */ + /* Stop if low_key(rec) > high_key. */ cur->bc_ops->init_key_from_rec(&rec_key, recp); - diff = cur->bc_ops->diff_two_keys(cur, &rec_key, high_key); - if (diff > 0) + if (xfs_btree_keycmp_gt(cur, &rec_key, high_key)) break; /* Callback */ @@ -4813,8 +4808,6 @@ xfs_btree_overlapped_query_range( union xfs_btree_key *hkp; union xfs_btree_rec *recp; struct xfs_btree_block *block; - int64_t ldiff; - int64_t hdiff; int level; struct xfs_buf *bp; int i; @@ -4854,25 +4847,23 @@ pop_up: block); cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp); - ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey, - low_key); - cur->bc_ops->init_key_from_rec(&rec_key, recp); - hdiff = cur->bc_ops->diff_two_keys(cur, high_key, - &rec_key); /* + * If (query's high key < record's low key), then there + * are no more interesting records in this block. Pop + * up to the leaf level to find more record blocks. + * * If (record's high key >= query's low key) and * (query's high key >= record's low key), then * this record overlaps the query range; callback. */ - if (ldiff >= 0 && hdiff >= 0) { + if (xfs_btree_keycmp_lt(cur, high_key, &rec_key)) + goto pop_up; + if (xfs_btree_keycmp_ge(cur, &rec_hkey, low_key)) { error = fn(cur, recp, priv); if (error) break; - } else if (hdiff < 0) { - /* Record is larger than high key; pop. 
*/ - goto pop_up; } cur->bc_levels[level].ptr++; continue; @@ -4884,15 +4875,18 @@ pop_up: block); pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block); - ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key); - hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp); - /* + * If (query's high key < pointer's low key), then there are no + * more interesting keys in this block. Pop up one leaf level + * to continue looking for records. + * * If (pointer's high key >= query's low key) and * (query's high key >= pointer's low key), then * this record overlaps the query range; follow pointer. */ - if (ldiff >= 0 && hdiff >= 0) { + if (xfs_btree_keycmp_lt(cur, high_key, lkp)) + goto pop_up; + if (xfs_btree_keycmp_ge(cur, hkp, low_key)) { level--; error = xfs_btree_lookup_get_block(cur, level, pp, &block); @@ -4907,9 +4901,6 @@ pop_up: #endif cur->bc_levels[level].ptr = 1; continue; - } else if (hdiff < 0) { - /* The low key is larger than the upper range; pop. */ - goto pop_up; } cur->bc_levels[level].ptr++; } @@ -4971,8 +4962,8 @@ xfs_btree_query_range( xfs_btree_key_from_irec(cur, &high_key, high_rec); xfs_btree_key_from_irec(cur, &low_key, low_rec); - /* Enforce low key < high key. */ - if (cur->bc_ops->diff_two_keys(cur, &low_key, &high_key) > 0) + /* Enforce low key <= high key. */ + if (!xfs_btree_keycmp_le(cur, &low_key, &high_key)) return -EINVAL; if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 29c4b4ccb909..f5aa4b893ee7 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -546,6 +546,61 @@ int xfs_btree_has_record(struct xfs_btree_cur *cur, bool xfs_btree_has_more_records(struct xfs_btree_cur *cur); struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur); +/* Key comparison helpers */ +static inline bool +xfs_btree_keycmp_lt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2) < 0; +} + +static inline bool +xfs_btree_keycmp_gt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2) > 0; +} + +static inline bool +xfs_btree_keycmp_eq( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2) == 0; +} + +static inline bool +xfs_btree_keycmp_le( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_gt(cur, key1, key2); +} + +static inline bool +xfs_btree_keycmp_ge( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_lt(cur, key1, key2); +} + +static inline bool +xfs_btree_keycmp_ne( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_eq(cur, key1, key2); +} + /* Does this cursor point to the last block in the given level? */ static inline bool xfs_btree_islastblock( diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 4ec3b1cab018..1165dc05a179 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -161,20 +161,20 @@ xchk_btree_rec( if (cur->bc_nlevels == 1) return; - /* Is this at least as large as the parent low key? */ + /* Is low_key(rec) at least as large as the parent low key? 
*/ cur->bc_ops->init_key_from_rec(&key, rec); keyblock = xfs_btree_get_block(cur, 1, &bp); keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0) + if (xfs_btree_keycmp_lt(cur, &key, keyp)) xchk_btree_set_corrupt(bs->sc, cur, 1); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) return; - /* Is this no larger than the parent high key? */ + /* Is high_key(rec) no larger than the parent high key? */ cur->bc_ops->init_high_key_from_rec(&hkey, rec); keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0) + if (xfs_btree_keycmp_lt(cur, keyp, &hkey)) xchk_btree_set_corrupt(bs->sc, cur, 1); } @@ -209,20 +209,20 @@ xchk_btree_key( if (level + 1 >= cur->bc_nlevels) return; - /* Is this at least as large as the parent low key? */ + /* Is this block's low key at least as large as the parent low key? */ keyblock = xfs_btree_get_block(cur, level + 1, &bp); keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0) + if (xfs_btree_keycmp_lt(cur, key, keyp)) xchk_btree_set_corrupt(bs->sc, cur, level); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) return; - /* Is this no larger than the parent high key? */ + /* Is this block's high key no larger than the parent high key? */ key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block); keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0) + if (xfs_btree_keycmp_lt(cur, keyp, key)) xchk_btree_set_corrupt(bs->sc, cur, level); } @@ -557,7 +557,7 @@ xchk_btree_block_check_keys( parent_block = xfs_btree_get_block(cur, level + 1, &bp); parent_low_key = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); - if (cur->bc_ops->diff_two_keys(cur, &block_key, parent_low_key)) { + if (xfs_btree_keycmp_ne(cur, &block_key, parent_low_key)) { xchk_btree_set_corrupt(bs->sc, bs->cur, level); return; } @@ -569,7 +569,7 @@ xchk_btree_block_check_keys( parent_high_key = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); block_high_key = xfs_btree_high_key_from_key(cur, &block_key); - if (cur->bc_ops->diff_two_keys(cur, block_high_key, parent_high_key)) + if (xfs_btree_keycmp_ne(cur, block_high_key, parent_high_key)) xchk_btree_set_corrupt(bs->sc, bs->cur, level); } @@ -661,7 +661,7 @@ xchk_btree_block_keys( parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); - if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0) + if (xfs_btree_keycmp_ne(cur, &block_keys, parent_keys)) xchk_btree_set_corrupt(bs->sc, cur, 1); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) @@ -672,7 +672,7 @@ xchk_btree_block_keys( high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); - if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0) + if (xfs_btree_keycmp_ne(cur, high_bk, high_pk)) xchk_btree_set_corrupt(bs->sc, cur, 1); } -- cgit v1.2.3 From 6abc7aef85b1f42cb39a3149f4ab64ca255e41e6 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:10 -0700 Subject: xfs: replace xfs_btree_has_record with a general keyspace scanner The current implementation of xfs_btree_has_record returns true if it finds /any/ record within the given range. Unfortunately, that's not sufficient for scrub. 
We want to be able to tell if a range of keyspace for a btree is devoid of records, is totally mapped to records, or is somewhere in between. By forcing this to be a boolean, we conflated sparseness and fullness, which caused scrub to return incorrect results. Fix the API so that we can tell the caller which of those three is the current state. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_alloc.c | 11 ++-- fs/xfs/libxfs/xfs_alloc.h | 4 +- fs/xfs/libxfs/xfs_alloc_btree.c | 12 +++++ fs/xfs/libxfs/xfs_bmap_btree.c | 11 ++++ fs/xfs/libxfs/xfs_btree.c | 108 ++++++++++++++++++++++++++++++++----- fs/xfs/libxfs/xfs_btree.h | 44 ++++++++++++++- fs/xfs/libxfs/xfs_ialloc_btree.c | 12 +++++ fs/xfs/libxfs/xfs_refcount.c | 11 ++-- fs/xfs/libxfs/xfs_refcount.h | 4 +- fs/xfs/libxfs/xfs_refcount_btree.c | 11 ++++ fs/xfs/libxfs/xfs_rmap.c | 12 +++-- fs/xfs/libxfs/xfs_rmap.h | 4 +- fs/xfs/libxfs/xfs_rmap_btree.c | 16 ++++++ fs/xfs/libxfs/xfs_types.h | 12 +++++ fs/xfs/scrub/alloc.c | 6 +-- fs/xfs/scrub/refcount.c | 8 +-- fs/xfs/scrub/rmap.c | 6 +-- 17 files changed, 249 insertions(+), 43 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 23f0acfc2a64..34c8501d86d0 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -3745,13 +3745,16 @@ xfs_alloc_query_all( return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query); } -/* Is there a record covering a given extent? */ +/* + * Scan part of the keyspace of the free space and tell us if the area has no + * records, is fully mapped by records, or is partially filled. + */ int -xfs_alloc_has_record( +xfs_alloc_has_records( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { union xfs_btree_irec low; union xfs_btree_irec high; @@ -3761,7 +3764,7 @@ xfs_alloc_has_record( memset(&high, 0xFF, sizeof(high)); high.a.ar_startblock = bno + len - 1; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, outcome); } /* diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 56bd05900b35..5dbb25546d0b 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -213,8 +213,8 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur, int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn, void *priv); -int xfs_alloc_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, - xfs_extlen_t len, bool *exist); +int xfs_alloc_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno, + xfs_extlen_t len, enum xbtree_recpacking *outcome); typedef int (*xfs_agfl_walk_fn)(struct xfs_mount *mp, xfs_agblock_t bno, void *priv); diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index 8e8416c14cec..be80c57aeddc 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -423,6 +423,16 @@ xfs_cntbt_recs_inorder( be32_to_cpu(r2->alloc.ar_startblock)); } +STATIC enum xbtree_key_contig +xfs_allocbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return xbtree_key_contig(be32_to_cpu(key1->alloc.ar_startblock), + be32_to_cpu(key2->alloc.ar_startblock)); +} + static const struct xfs_btree_ops xfs_bnobt_ops = { .rec_len = sizeof(xfs_alloc_rec_t), .key_len = sizeof(xfs_alloc_key_t), @@ -443,6 +453,7 @@ static const struct xfs_btree_ops xfs_bnobt_ops = { .diff_two_keys = xfs_bnobt_diff_two_keys, .keys_inorder = 
xfs_bnobt_keys_inorder, .recs_inorder = xfs_bnobt_recs_inorder, + .keys_contiguous = xfs_allocbt_keys_contiguous, }; static const struct xfs_btree_ops xfs_cntbt_ops = { @@ -465,6 +476,7 @@ static const struct xfs_btree_ops xfs_cntbt_ops = { .diff_two_keys = xfs_cntbt_diff_two_keys, .keys_inorder = xfs_cntbt_keys_inorder, .recs_inorder = xfs_cntbt_recs_inorder, + .keys_contiguous = NULL, /* not needed right now */ }; /* Allocate most of a new allocation btree cursor. */ diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index b8ad95050c9b..3edf314a55e2 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -500,6 +500,16 @@ xfs_bmbt_recs_inorder( xfs_bmbt_disk_get_startoff(&r2->bmbt); } +STATIC enum xbtree_key_contig +xfs_bmbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return xbtree_key_contig(be64_to_cpu(key1->bmbt.br_startoff), + be64_to_cpu(key2->bmbt.br_startoff)); +} + static const struct xfs_btree_ops xfs_bmbt_ops = { .rec_len = sizeof(xfs_bmbt_rec_t), .key_len = sizeof(xfs_bmbt_key_t), @@ -520,6 +530,7 @@ static const struct xfs_btree_ops xfs_bmbt_ops = { .buf_ops = &xfs_bmbt_buf_ops, .keys_inorder = xfs_bmbt_keys_inorder, .recs_inorder = xfs_bmbt_recs_inorder, + .keys_contiguous = xfs_bmbt_keys_contiguous, }; /* diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 92c610850fac..afbd3bcdf567 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5025,34 +5025,116 @@ xfs_btree_diff_two_ptrs( return (int64_t)be32_to_cpu(a->s) - be32_to_cpu(b->s); } -/* If there's an extent, we're done. */ +struct xfs_btree_has_records { + /* Keys for the start and end of the range we want to know about. */ + union xfs_btree_key start_key; + union xfs_btree_key end_key; + + /* Highest record key we've seen so far. */ + union xfs_btree_key high_key; + + enum xbtree_recpacking outcome; +}; + STATIC int -xfs_btree_has_record_helper( +xfs_btree_has_records_helper( struct xfs_btree_cur *cur, const union xfs_btree_rec *rec, void *priv) { - return -ECANCELED; + union xfs_btree_key rec_key; + union xfs_btree_key rec_high_key; + struct xfs_btree_has_records *info = priv; + enum xbtree_key_contig key_contig; + + cur->bc_ops->init_key_from_rec(&rec_key, rec); + + if (info->outcome == XBTREE_RECPACKING_EMPTY) { + info->outcome = XBTREE_RECPACKING_SPARSE; + + /* + * If the first record we find does not overlap the start key, + * then there is a hole at the start of the search range. + * Classify this as sparse and stop immediately. + */ + if (xfs_btree_keycmp_lt(cur, &info->start_key, &rec_key)) + return -ECANCELED; + } else { + /* + * If a subsequent record does not overlap with the any record + * we've seen so far, there is a hole in the middle of the + * search range. Classify this as sparse and stop. + * If the keys overlap and this btree does not allow overlap, + * signal corruption. + */ + key_contig = cur->bc_ops->keys_contiguous(cur, &info->high_key, + &rec_key); + if (key_contig == XBTREE_KEY_OVERLAP && + !(cur->bc_flags & XFS_BTREE_OVERLAPPING)) + return -EFSCORRUPTED; + if (key_contig == XBTREE_KEY_GAP) + return -ECANCELED; + } + + /* + * If high_key(rec) is larger than any other high key we've seen, + * remember it for later. 
+ */ + cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec); + if (xfs_btree_keycmp_gt(cur, &rec_high_key, &info->high_key)) + info->high_key = rec_high_key; /* struct copy */ + + return 0; } -/* Is there a record covering a given range of keys? */ +/* + * Scan part of the keyspace of a btree and tell us if that keyspace does not + * map to any records; is fully mapped to records; or is partially mapped to + * records. This is the btree record equivalent to determining if a file is + * sparse. + */ int -xfs_btree_has_record( +xfs_btree_has_records( struct xfs_btree_cur *cur, const union xfs_btree_irec *low, const union xfs_btree_irec *high, - bool *exists) + enum xbtree_recpacking *outcome) { + struct xfs_btree_has_records info = { + .outcome = XBTREE_RECPACKING_EMPTY, + }; int error; - error = xfs_btree_query_range(cur, low, high, - &xfs_btree_has_record_helper, NULL); - if (error == -ECANCELED) { - *exists = true; - return 0; + /* Not all btrees support this operation. */ + if (!cur->bc_ops->keys_contiguous) { + ASSERT(0); + return -EOPNOTSUPP; } - *exists = false; - return error; + + xfs_btree_key_from_irec(cur, &info.start_key, low); + xfs_btree_key_from_irec(cur, &info.end_key, high); + + error = xfs_btree_query_range(cur, low, high, + xfs_btree_has_records_helper, &info); + if (error == -ECANCELED) + goto out; + if (error) + return error; + + if (info.outcome == XBTREE_RECPACKING_EMPTY) + goto out; + + /* + * If the largest high_key(rec) we saw during the walk is greater than + * the end of the search range, classify this as full. Otherwise, + * there is a hole at the end of the search range. + */ + if (xfs_btree_keycmp_ge(cur, &info.high_key, &info.end_key)) + info.outcome = XBTREE_RECPACKING_FULL; + +out: + *outcome = info.outcome; + return 0; } /* Are there more records in this btree? */ diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index f5aa4b893ee7..66431f351bb2 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -90,6 +90,27 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum); #define XFS_BTREE_STATS_ADD(cur, stat, val) \ XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val) +enum xbtree_key_contig { + XBTREE_KEY_GAP = 0, + XBTREE_KEY_CONTIGUOUS, + XBTREE_KEY_OVERLAP, +}; + +/* + * Decide if these two numeric btree key fields are contiguous, overlapping, + * or if there's a gap between them. @x should be the field from the high + * key and @y should be the field from the low key. + */ +static inline enum xbtree_key_contig xbtree_key_contig(uint64_t x, uint64_t y) +{ + x++; + if (x < y) + return XBTREE_KEY_GAP; + if (x == y) + return XBTREE_KEY_CONTIGUOUS; + return XBTREE_KEY_OVERLAP; +} + struct xfs_btree_ops { /* size of the key and record structures */ size_t key_len; @@ -157,6 +178,19 @@ struct xfs_btree_ops { int (*recs_inorder)(struct xfs_btree_cur *cur, const union xfs_btree_rec *r1, const union xfs_btree_rec *r2); + + /* + * Are these two btree keys immediately adjacent? + * + * Given two btree keys @key1 and @key2, decide if it is impossible for + * there to be a third btree key K satisfying the relationship + * @key1 < K < @key2. To determine if two btree records are + * immediately adjacent, @key1 should be the high key of the first + * record and @key2 should be the low key of the second record. 
+ */ + enum xbtree_key_contig (*keys_contiguous)(struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2); }; /* @@ -540,9 +574,15 @@ void xfs_btree_get_keys(struct xfs_btree_cur *cur, struct xfs_btree_block *block, union xfs_btree_key *key); union xfs_btree_key *xfs_btree_high_key_from_key(struct xfs_btree_cur *cur, union xfs_btree_key *key); -int xfs_btree_has_record(struct xfs_btree_cur *cur, +typedef bool (*xfs_btree_key_gap_fn)(struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2); + +int xfs_btree_has_records(struct xfs_btree_cur *cur, const union xfs_btree_irec *low, - const union xfs_btree_irec *high, bool *exists); + const union xfs_btree_irec *high, + enum xbtree_recpacking *outcome); + bool xfs_btree_has_more_records(struct xfs_btree_cur *cur); struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur); diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index f900c056b82c..dd1fad8c3304 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -383,6 +383,16 @@ xfs_inobt_recs_inorder( be32_to_cpu(r2->inobt.ir_startino); } +STATIC enum xbtree_key_contig +xfs_inobt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return xbtree_key_contig(be32_to_cpu(key1->inobt.ir_startino), + be32_to_cpu(key2->inobt.ir_startino)); +} + static const struct xfs_btree_ops xfs_inobt_ops = { .rec_len = sizeof(xfs_inobt_rec_t), .key_len = sizeof(xfs_inobt_key_t), @@ -402,6 +412,7 @@ static const struct xfs_btree_ops xfs_inobt_ops = { .diff_two_keys = xfs_inobt_diff_two_keys, .keys_inorder = xfs_inobt_keys_inorder, .recs_inorder = xfs_inobt_recs_inorder, + .keys_contiguous = xfs_inobt_keys_contiguous, }; static const struct xfs_btree_ops xfs_finobt_ops = { @@ -423,6 +434,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = { .diff_two_keys = xfs_inobt_diff_two_keys, .keys_inorder = xfs_inobt_keys_inorder, .recs_inorder = xfs_inobt_recs_inorder, + .keys_contiguous = xfs_inobt_keys_contiguous, }; /* diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 335f84bef81c..94377b59ba44 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -1998,14 +1998,17 @@ out_free: return error; } -/* Is there a record covering a given extent? */ +/* + * Scan part of the keyspace of the refcount records and tell us if the area + * has no records, is fully mapped by records, or is partially filled. 
+ */ int -xfs_refcount_has_record( +xfs_refcount_has_records( struct xfs_btree_cur *cur, enum xfs_refc_domain domain, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { union xfs_btree_irec low; union xfs_btree_irec high; @@ -2016,7 +2019,7 @@ xfs_refcount_has_record( high.rc.rc_startblock = bno + len - 1; low.rc.rc_domain = high.rc.rc_domain = domain; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, outcome); } int __init diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index fc0b58d4c379..783cd89ca195 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -111,9 +111,9 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp, */ #define XFS_REFCOUNT_ITEM_OVERHEAD 32 -extern int xfs_refcount_has_record(struct xfs_btree_cur *cur, +extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, enum xfs_refc_domain domain, xfs_agblock_t bno, - xfs_extlen_t len, bool *exists); + xfs_extlen_t len, enum xbtree_recpacking *outcome); union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 03d2b01487a1..1628eecb53fd 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -300,6 +300,16 @@ xfs_refcountbt_recs_inorder( be32_to_cpu(r2->refc.rc_startblock); } +STATIC enum xbtree_key_contig +xfs_refcountbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return xbtree_key_contig(be32_to_cpu(key1->refc.rc_startblock), + be32_to_cpu(key2->refc.rc_startblock)); +} + static const struct xfs_btree_ops xfs_refcountbt_ops = { .rec_len = sizeof(struct xfs_refcount_rec), .key_len = sizeof(struct xfs_refcount_key), @@ -319,6 +329,7 @@ static const struct xfs_btree_ops xfs_refcountbt_ops = { .diff_two_keys = xfs_refcountbt_diff_two_keys, .keys_inorder = xfs_refcountbt_keys_inorder, .recs_inorder = xfs_refcountbt_recs_inorder, + .keys_contiguous = xfs_refcountbt_keys_contiguous, }; /* diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index da008d317f83..e616b964f11c 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -2709,13 +2709,17 @@ xfs_rmap_compare( return 0; } -/* Is there a record covering a given extent? */ +/* + * Scan the physical storage part of the keyspace of the reverse mapping index + * and tell us if the area has no records, is fully mapped by records, or is + * partially filled. 
+ */ int -xfs_rmap_has_record( +xfs_rmap_has_records( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { union xfs_btree_irec low; union xfs_btree_irec high; @@ -2725,7 +2729,7 @@ xfs_rmap_has_record( memset(&high, 0xFF, sizeof(high)); high.r.rm_startblock = bno + len - 1; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, outcome); } /* diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 7fb298bcc15f..4cbe50cf522e 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -198,8 +198,8 @@ xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur, const struct xfs_rmap_irec *irec); -int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, - xfs_extlen_t len, bool *exists); +int xfs_rmap_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno, + xfs_extlen_t len, enum xbtree_recpacking *outcome); int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, bool *has_rmap); diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index 84e2b692f034..66beb87caf1a 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -444,6 +444,21 @@ xfs_rmapbt_recs_inorder( return 0; } +STATIC enum xbtree_key_contig +xfs_rmapbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + /* + * We only support checking contiguity of the physical space component. + * If any callers ever need more specificity than that, they'll have to + * implement it here. + */ + return xbtree_key_contig(be32_to_cpu(key1->rmap.rm_startblock), + be32_to_cpu(key2->rmap.rm_startblock)); +} + static const struct xfs_btree_ops xfs_rmapbt_ops = { .rec_len = sizeof(struct xfs_rmap_rec), .key_len = 2 * sizeof(struct xfs_rmap_key), @@ -463,6 +478,7 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = { .diff_two_keys = xfs_rmapbt_diff_two_keys, .keys_inorder = xfs_rmapbt_keys_inorder, .recs_inorder = xfs_rmapbt_recs_inorder, + .keys_contiguous = xfs_rmapbt_keys_contiguous, }; static struct xfs_btree_cur * diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h index 5ebdda7e1078..851220021484 100644 --- a/fs/xfs/libxfs/xfs_types.h +++ b/fs/xfs/libxfs/xfs_types.h @@ -204,6 +204,18 @@ enum xfs_ag_resv_type { XFS_AG_RESV_RMAPBT, }; +/* Results of scanning a btree keyspace to check occupancy. */ +enum xbtree_recpacking { + /* None of the keyspace maps to records. */ + XBTREE_RECPACKING_EMPTY = 0, + + /* Some, but not all, of the keyspace maps to records. */ + XBTREE_RECPACKING_SPARSE, + + /* The entire keyspace maps to records. 
*/ + XBTREE_RECPACKING_FULL, +}; + /* * Type verifier functions */ diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 53de04c6027c..5920fe051543 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -144,15 +144,15 @@ xchk_xref_is_used_space( xfs_agblock_t agbno, xfs_extlen_t len) { - bool is_freesp; + enum xbtree_recpacking outcome; int error; if (!sc->sa.bno_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp); + error = xfs_alloc_has_records(sc->sa.bno_cur, agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.bno_cur)) return; - if (is_freesp) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.bno_cur, 0); } diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 4d77049dfce2..ed47c570c658 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -457,16 +457,16 @@ xchk_xref_is_not_shared( xfs_agblock_t agbno, xfs_extlen_t len) { - bool shared; + enum xbtree_recpacking outcome; int error; if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_refcount_has_record(sc->sa.refc_cur, XFS_REFC_DOMAIN_SHARED, - agbno, len, &shared); + error = xfs_refcount_has_records(sc->sa.refc_cur, + XFS_REFC_DOMAIN_SHARED, agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur)) return; - if (shared) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); } diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 8e78e1bc9eef..2f9e4f77db6b 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -219,15 +219,15 @@ xchk_xref_has_no_owner( xfs_agblock_t bno, xfs_extlen_t len) { - bool has_rmap; + enum xbtree_recpacking outcome; int error; if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap); + error = xfs_rmap_has_records(sc->sa.rmap_cur, bno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) return; - if (has_rmap) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); } -- cgit v1.2.3 From 4a200a0978288f919aba3f015f374f6ed279e658 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:11 -0700 Subject: xfs: implement masked btree key comparisons for _has_records scans For keyspace fullness scans, we want to be able to mask off the parts of the key that we don't care about. For most btree types we /do/ want the full keyspace, but for checking that a given space usage also has a full complement of rmapbt records (even if different/multiple owners) we need this masking so that we only track sparseness of rm_startblock, not the whole keyspace (which is extremely sparse). Augment the ->diff_two_keys and ->keys_contiguous helpers to take a third union xfs_btree_key argument, and wire up xfs_rmap_has_records to pass this through. This third "mask" argument should contain a nonzero value in each structure field that should be used in the key comparisons done during the scan. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_alloc.c | 2 +- fs/xfs/libxfs/xfs_alloc_btree.c | 18 +++++++++++--- fs/xfs/libxfs/xfs_bmap_btree.c | 10 ++++++-- fs/xfs/libxfs/xfs_btree.c | 24 +++++++++++++++--- fs/xfs/libxfs/xfs_btree.h | 50 +++++++++++++++++++++++++++++++++----- fs/xfs/libxfs/xfs_ialloc_btree.c | 12 ++++++--- fs/xfs/libxfs/xfs_refcount.c | 2 +- fs/xfs/libxfs/xfs_refcount_btree.c | 12 ++++++--- fs/xfs/libxfs/xfs_rmap.c | 5 +++- fs/xfs/libxfs/xfs_rmap_btree.c | 47 +++++++++++++++++++++++------------ 10 files changed, 142 insertions(+), 40 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 34c8501d86d0..fdfa08cbf4db 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -3764,7 +3764,7 @@ xfs_alloc_has_records( memset(&high, 0xFF, sizeof(high)); high.a.ar_startblock = bno + len - 1; - return xfs_btree_has_records(cur, &low, &high, outcome); + return xfs_btree_has_records(cur, &low, &high, NULL, outcome); } /* diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index be80c57aeddc..c65228efed4a 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -260,20 +260,27 @@ STATIC int64_t xfs_bnobt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->alloc.ar_startblock); + return (int64_t)be32_to_cpu(k1->alloc.ar_startblock) - - be32_to_cpu(k2->alloc.ar_startblock); + be32_to_cpu(k2->alloc.ar_startblock); } STATIC int64_t xfs_cntbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { int64_t diff; + ASSERT(!mask || (mask->alloc.ar_blockcount && + mask->alloc.ar_startblock)); + diff = be32_to_cpu(k1->alloc.ar_blockcount) - be32_to_cpu(k2->alloc.ar_blockcount); if (diff) @@ -427,8 +434,11 @@ STATIC enum xbtree_key_contig xfs_allocbt_keys_contiguous( struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2) + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->alloc.ar_startblock); + return xbtree_key_contig(be32_to_cpu(key1->alloc.ar_startblock), be32_to_cpu(key2->alloc.ar_startblock)); } diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index 3edf314a55e2..1b40e5f8b1ec 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -382,11 +382,14 @@ STATIC int64_t xfs_bmbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { uint64_t a = be64_to_cpu(k1->bmbt.br_startoff); uint64_t b = be64_to_cpu(k2->bmbt.br_startoff); + ASSERT(!mask || mask->bmbt.br_startoff); + /* * Note: This routine previously casted a and b to int64 and subtracted * them to generate a result. 
This lead to problems if b was the @@ -504,8 +507,11 @@ STATIC enum xbtree_key_contig xfs_bmbt_keys_contiguous( struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2) + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->bmbt.br_startoff); + return xbtree_key_contig(be64_to_cpu(key1->bmbt.br_startoff), be64_to_cpu(key2->bmbt.br_startoff)); } diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index afbd3bcdf567..6a6503ab0cd7 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5030,6 +5030,9 @@ struct xfs_btree_has_records { union xfs_btree_key start_key; union xfs_btree_key end_key; + /* Mask for key comparisons, if desired. */ + const union xfs_btree_key *key_mask; + /* Highest record key we've seen so far. */ union xfs_btree_key high_key; @@ -5057,7 +5060,8 @@ xfs_btree_has_records_helper( * then there is a hole at the start of the search range. * Classify this as sparse and stop immediately. */ - if (xfs_btree_keycmp_lt(cur, &info->start_key, &rec_key)) + if (xfs_btree_masked_keycmp_lt(cur, &info->start_key, &rec_key, + info->key_mask)) return -ECANCELED; } else { /* @@ -5068,7 +5072,7 @@ xfs_btree_has_records_helper( * signal corruption. */ key_contig = cur->bc_ops->keys_contiguous(cur, &info->high_key, - &rec_key); + &rec_key, info->key_mask); if (key_contig == XBTREE_KEY_OVERLAP && !(cur->bc_flags & XFS_BTREE_OVERLAPPING)) return -EFSCORRUPTED; @@ -5081,7 +5085,8 @@ xfs_btree_has_records_helper( * remember it for later. */ cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec); - if (xfs_btree_keycmp_gt(cur, &rec_high_key, &info->high_key)) + if (xfs_btree_masked_keycmp_gt(cur, &rec_high_key, &info->high_key, + info->key_mask)) info->high_key = rec_high_key; /* struct copy */ return 0; @@ -5092,16 +5097,26 @@ xfs_btree_has_records_helper( * map to any records; is fully mapped to records; or is partially mapped to * records. This is the btree record equivalent to determining if a file is * sparse. + * + * For most btree types, the record scan should use all available btree key + * fields to compare the keys encountered. These callers should pass NULL for + * @mask. However, some callers (e.g. scanning physical space in the rmapbt) + * want to ignore some part of the btree record keyspace when performing the + * comparison. These callers should pass in a union xfs_btree_key object with + * the fields that *should* be a part of the comparison set to any nonzero + * value, and the rest zeroed. */ int xfs_btree_has_records( struct xfs_btree_cur *cur, const union xfs_btree_irec *low, const union xfs_btree_irec *high, + const union xfs_btree_key *mask, enum xbtree_recpacking *outcome) { struct xfs_btree_has_records info = { .outcome = XBTREE_RECPACKING_EMPTY, + .key_mask = mask, }; int error; @@ -5129,7 +5144,8 @@ xfs_btree_has_records( * the end of the search range, classify this as full. Otherwise, * there is a hole at the end of the search range. */ - if (xfs_btree_keycmp_ge(cur, &info.high_key, &info.end_key)) + if (xfs_btree_masked_keycmp_ge(cur, &info.high_key, &info.end_key, + mask)) info.outcome = XBTREE_RECPACKING_FULL; out: diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 66431f351bb2..a2aa36b23e25 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -161,11 +161,14 @@ struct xfs_btree_ops { /* * Difference between key2 and key1 -- positive if key1 > key2, - * negative if key1 < key2, and zero if equal. 
+ * negative if key1 < key2, and zero if equal. If the @mask parameter + * is non NULL, each key field to be used in the comparison must + * contain a nonzero value. */ int64_t (*diff_two_keys)(struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2); + const union xfs_btree_key *key2, + const union xfs_btree_key *mask); const struct xfs_buf_ops *buf_ops; @@ -187,10 +190,13 @@ struct xfs_btree_ops { * @key1 < K < @key2. To determine if two btree records are * immediately adjacent, @key1 should be the high key of the first * record and @key2 should be the low key of the second record. + * If the @mask parameter is non NULL, each key field to be used in the + * comparison must contain a nonzero value. */ enum xbtree_key_contig (*keys_contiguous)(struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2); + const union xfs_btree_key *key2, + const union xfs_btree_key *mask); }; /* @@ -581,6 +587,7 @@ typedef bool (*xfs_btree_key_gap_fn)(struct xfs_btree_cur *cur, int xfs_btree_has_records(struct xfs_btree_cur *cur, const union xfs_btree_irec *low, const union xfs_btree_irec *high, + const union xfs_btree_key *mask, enum xbtree_recpacking *outcome); bool xfs_btree_has_more_records(struct xfs_btree_cur *cur); @@ -593,7 +600,7 @@ xfs_btree_keycmp_lt( const union xfs_btree_key *key1, const union xfs_btree_key *key2) { - return cur->bc_ops->diff_two_keys(cur, key1, key2) < 0; + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) < 0; } static inline bool @@ -602,7 +609,7 @@ xfs_btree_keycmp_gt( const union xfs_btree_key *key1, const union xfs_btree_key *key2) { - return cur->bc_ops->diff_two_keys(cur, key1, key2) > 0; + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) > 0; } static inline bool @@ -611,7 +618,7 @@ xfs_btree_keycmp_eq( const union xfs_btree_key *key1, const union xfs_btree_key *key2) { - return cur->bc_ops->diff_two_keys(cur, key1, key2) == 0; + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) == 0; } static inline bool @@ -641,6 +648,37 @@ xfs_btree_keycmp_ne( return !xfs_btree_keycmp_eq(cur, key1, key2); } +/* Masked key comparison helpers */ +static inline bool +xfs_btree_masked_keycmp_lt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) < 0; +} + +static inline bool +xfs_btree_masked_keycmp_gt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) > 0; +} + +static inline bool +xfs_btree_masked_keycmp_ge( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return !xfs_btree_masked_keycmp_lt(cur, key1, key2, mask); +} + /* Does this cursor point to the last block in the given level? 
*/ static inline bool xfs_btree_islastblock( diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index dd1fad8c3304..5a945ae21b5d 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -269,10 +269,13 @@ STATIC int64_t xfs_inobt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->inobt.ir_startino); + return (int64_t)be32_to_cpu(k1->inobt.ir_startino) - - be32_to_cpu(k2->inobt.ir_startino); + be32_to_cpu(k2->inobt.ir_startino); } static xfs_failaddr_t @@ -387,8 +390,11 @@ STATIC enum xbtree_key_contig xfs_inobt_keys_contiguous( struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2) + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->inobt.ir_startino); + return xbtree_key_contig(be32_to_cpu(key1->inobt.ir_startino), be32_to_cpu(key2->inobt.ir_startino)); } diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index 94377b59ba44..c1c65774dcc2 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -2019,7 +2019,7 @@ xfs_refcount_has_records( high.rc.rc_startblock = bno + len - 1; low.rc.rc_domain = high.rc.rc_domain = domain; - return xfs_btree_has_records(cur, &low, &high, outcome); + return xfs_btree_has_records(cur, &low, &high, NULL, outcome); } int __init diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index 1628eecb53fd..d4afc5f4e6a5 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -202,10 +202,13 @@ STATIC int64_t xfs_refcountbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->refc.rc_startblock); + return (int64_t)be32_to_cpu(k1->refc.rc_startblock) - - be32_to_cpu(k2->refc.rc_startblock); + be32_to_cpu(k2->refc.rc_startblock); } STATIC xfs_failaddr_t @@ -304,8 +307,11 @@ STATIC enum xbtree_key_contig xfs_refcountbt_keys_contiguous( struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2) + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->refc.rc_startblock); + return xbtree_key_contig(be32_to_cpu(key1->refc.rc_startblock), be32_to_cpu(key2->refc.rc_startblock)); } diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index e616b964f11c..308b81f321eb 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -2721,6 +2721,9 @@ xfs_rmap_has_records( xfs_extlen_t len, enum xbtree_recpacking *outcome) { + union xfs_btree_key mask = { + .rmap.rm_startblock = cpu_to_be32(-1U), + }; union xfs_btree_irec low; union xfs_btree_irec high; @@ -2729,7 +2732,7 @@ xfs_rmap_has_records( memset(&high, 0xFF, sizeof(high)); high.r.rm_startblock = bno + len - 1; - return xfs_btree_has_records(cur, &low, &high, outcome); + return xfs_btree_has_records(cur, &low, &high, &mask, outcome); } /* diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index 66beb87caf1a..6c81b20e97d2 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -273,31 +273,43 @@ STATIC int64_t xfs_rmapbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const 
union xfs_btree_key *k2, + const union xfs_btree_key *mask) { const struct xfs_rmap_key *kp1 = &k1->rmap; const struct xfs_rmap_key *kp2 = &k2->rmap; int64_t d; __u64 x, y; + /* Doesn't make sense to mask off the physical space part */ + ASSERT(!mask || mask->rmap.rm_startblock); + d = (int64_t)be32_to_cpu(kp1->rm_startblock) - - be32_to_cpu(kp2->rm_startblock); + be32_to_cpu(kp2->rm_startblock); if (d) return d; - x = be64_to_cpu(kp1->rm_owner); - y = be64_to_cpu(kp2->rm_owner); - if (x > y) - return 1; - else if (y > x) - return -1; + if (!mask || mask->rmap.rm_owner) { + x = be64_to_cpu(kp1->rm_owner); + y = be64_to_cpu(kp2->rm_owner); + if (x > y) + return 1; + else if (y > x) + return -1; + } + + if (!mask || mask->rmap.rm_offset) { + /* Doesn't make sense to allow offset but not owner */ + ASSERT(!mask || mask->rmap.rm_owner); + + x = offset_keymask(be64_to_cpu(kp1->rm_offset)); + y = offset_keymask(be64_to_cpu(kp2->rm_offset)); + if (x > y) + return 1; + else if (y > x) + return -1; + } - x = offset_keymask(be64_to_cpu(kp1->rm_offset)); - y = offset_keymask(be64_to_cpu(kp2->rm_offset)); - if (x > y) - return 1; - else if (y > x) - return -1; return 0; } @@ -448,13 +460,18 @@ STATIC enum xbtree_key_contig xfs_rmapbt_keys_contiguous( struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2) + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->rmap.rm_startblock); + /* * We only support checking contiguity of the physical space component. * If any callers ever need more specificity than that, they'll have to * implement it here. */ + ASSERT(!mask || (!mask->rmap.rm_owner && !mask->rmap.rm_offset)); + return xbtree_key_contig(be32_to_cpu(key1->rmap.rm_startblock), be32_to_cpu(key2->rmap.rm_startblock)); } -- cgit v1.2.3 From 7ad9ea6398feae3ae4ce79fe08457f93b79a9a43 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:12 -0700 Subject: xfs: check the reference counts of gaps in the refcount btree Gaps in the reference count btree are also significant -- for these regions, there must not be any overlapping reverse mappings. We don't currently check this, so make the refcount scrubber more complete. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Reviewed-by: Dave Chinner --- fs/xfs/scrub/refcount.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 90 insertions(+), 5 deletions(-) diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index ed47c570c658..771a591a4aeb 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -332,6 +332,64 @@ xchk_refcountbt_xref( xchk_refcountbt_xref_rmap(sc, irec); } +struct xchk_refcbt_records { + /* The next AG block where we aren't expecting shared extents. */ + xfs_agblock_t next_unshared_agbno; + + /* Number of CoW blocks we expect. */ + xfs_agblock_t cow_blocks; + + /* Was the last record a shared or CoW staging extent? */ + enum xfs_refc_domain prev_domain; +}; + +STATIC int +xchk_refcountbt_rmap_check_gap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + xfs_agblock_t *next_bno = priv; + + if (*next_bno != NULLAGBLOCK && rec->rm_startblock < *next_bno) + return -ECANCELED; + + *next_bno = rec->rm_startblock + rec->rm_blockcount; + return 0; +} + +/* + * Make sure that a gap in the reference count records does not correspond to + * overlapping records (i.e. shared extents) in the reverse mappings. 
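+ *
+ * The helper above visits the rmap records within the gap in ascending
+ * startblock order, so each record must start at or after the end of the
+ * previous one (rec->rm_startblock >= *next_bno); otherwise two reverse
+ * mappings overlap, which means the "gap" covers shared extents after all.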
+ */ +static inline void +xchk_refcountbt_xref_gaps( + struct xfs_scrub *sc, + struct xchk_refcbt_records *rrc, + xfs_agblock_t bno) +{ + struct xfs_rmap_irec low; + struct xfs_rmap_irec high; + xfs_agblock_t next_bno = NULLAGBLOCK; + int error; + + if (bno <= rrc->next_unshared_agbno || !sc->sa.rmap_cur || + xchk_skip_xref(sc->sm)) + return; + + memset(&low, 0, sizeof(low)); + low.rm_startblock = rrc->next_unshared_agbno; + memset(&high, 0xFF, sizeof(high)); + high.rm_startblock = bno - 1; + + error = xfs_rmap_query_range(sc->sa.rmap_cur, &low, &high, + xchk_refcountbt_rmap_check_gap, &next_bno); + if (error == -ECANCELED) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + else + xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur); +} + /* Scrub a refcountbt record. */ STATIC int xchk_refcountbt_rec( @@ -339,7 +397,7 @@ xchk_refcountbt_rec( const union xfs_btree_rec *rec) { struct xfs_refcount_irec irec; - xfs_agblock_t *cow_blocks = bs->private; + struct xchk_refcbt_records *rrc = bs->private; xfs_refcount_btrec_to_irec(rec, &irec); if (xfs_refcount_check_irec(bs->cur, &irec) != NULL) { @@ -348,10 +406,27 @@ xchk_refcountbt_rec( } if (irec.rc_domain == XFS_REFC_DOMAIN_COW) - (*cow_blocks) += irec.rc_blockcount; + rrc->cow_blocks += irec.rc_blockcount; + + /* Shared records always come before CoW records. */ + if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED && + rrc->prev_domain == XFS_REFC_DOMAIN_COW) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + rrc->prev_domain = irec.rc_domain; xchk_refcountbt_xref(bs->sc, &irec); + /* + * If this is a record for a shared extent, check that all blocks + * between the previous record and this one have at most one reverse + * mapping. + */ + if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED) { + xchk_refcountbt_xref_gaps(bs->sc, rrc, irec.rc_startblock); + rrc->next_unshared_agbno = irec.rc_startblock + + irec.rc_blockcount; + } + return 0; } @@ -393,15 +468,25 @@ int xchk_refcountbt( struct xfs_scrub *sc) { - xfs_agblock_t cow_blocks = 0; + struct xchk_refcbt_records rrc = { + .cow_blocks = 0, + .next_unshared_agbno = 0, + .prev_domain = XFS_REFC_DOMAIN_SHARED, + }; int error; error = xchk_btree(sc, sc->sa.refc_cur, xchk_refcountbt_rec, - &XFS_RMAP_OINFO_REFC, &cow_blocks); + &XFS_RMAP_OINFO_REFC, &rrc); if (error) return error; - xchk_refcount_xref_rmap(sc, cow_blocks); + /* + * Check that all blocks between the last refcount > 1 record and the + * end of the AG have at most one reverse mapping. + */ + xchk_refcountbt_xref_gaps(sc, &rrc, sc->mp->m_sb.sb_agblocks); + + xchk_refcount_xref_rmap(sc, rrc.cow_blocks); return 0; } -- cgit v1.2.3 From 7ac14fa2bd22e99a06ae16382b394f697cfe2b8a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:12 -0700 Subject: xfs: ensure that all metadata and data blocks are not cow staging extents Make sure that all filesystem metadata blocks and file data blocks are not also marked as CoW staging extents. The extra checking added here was inspired by an actual VM host filesystem corruption incident due to bugs in the CoW handling of 4.x kernels. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner Reviewed-by: Dave Chinner --- fs/xfs/scrub/agheader.c | 5 +++++ fs/xfs/scrub/alloc.c | 1 + fs/xfs/scrub/bmap.c | 11 ++++++++--- fs/xfs/scrub/ialloc.c | 2 +- fs/xfs/scrub/inode.c | 1 + fs/xfs/scrub/refcount.c | 21 +++++++++++++++++++++ fs/xfs/scrub/scrub.h | 2 ++ 7 files changed, 39 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index 87cb13a6e84a..1a84153afa91 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -53,6 +53,7 @@ xchk_superblock_xref( xchk_xref_is_not_inode_chunk(sc, agbno, 1); xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); /* scrub teardown will take care of sc->sa for us */ } @@ -517,6 +518,7 @@ xchk_agf_xref( xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_agf_xref_btreeblks(sc); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_agf_xref_refcblks(sc); /* scrub teardown will take care of sc->sa for us */ @@ -644,6 +646,7 @@ xchk_agfl_block_xref( xchk_xref_is_not_inode_chunk(sc, agbno, 1); xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); } /* Scrub an AGFL block. */ @@ -700,6 +703,7 @@ xchk_agfl_xref( xchk_xref_is_not_inode_chunk(sc, agbno, 1); xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); /* * Scrub teardown will take care of sc->sa for us. Leave sc->sa @@ -855,6 +859,7 @@ xchk_agi_xref( xchk_agi_xref_icounts(sc); xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_agi_xref_fiblocks(sc); /* scrub teardown will take care of sc->sa for us */ diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 5920fe051543..12dd55ac2a4f 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -90,6 +90,7 @@ xchk_allocbt_xref( xchk_xref_is_not_inode_chunk(sc, agbno, len); xchk_xref_has_no_owner(sc, agbno, len); xchk_xref_is_not_shared(sc, agbno, len); + xchk_xref_is_not_cow_staging(sc, agbno, len); } /* Scrub a bnobt/cntbt record. 
*/ diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 6188eba672e5..be2c4da2808b 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -328,12 +328,17 @@ xchk_bmap_iextent_xref( xchk_bmap_xref_rmap(info, irec, agbno); switch (info->whichfork) { case XFS_DATA_FORK: - if (xfs_is_reflink_inode(info->sc->ip)) - break; - fallthrough; + if (!xfs_is_reflink_inode(info->sc->ip)) + xchk_xref_is_not_shared(info->sc, agbno, + irec->br_blockcount); + xchk_xref_is_not_cow_staging(info->sc, agbno, + irec->br_blockcount); + break; case XFS_ATTR_FORK: xchk_xref_is_not_shared(info->sc, agbno, irec->br_blockcount); + xchk_xref_is_not_cow_staging(info->sc, agbno, + irec->br_blockcount); break; case XFS_COW_FORK: xchk_xref_is_cow_staging(info->sc, agbno, diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index ca5a7e0f5451..6d08613db32f 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -115,7 +115,7 @@ xchk_iallocbt_chunk( xchk_btree_set_corrupt(bs->sc, bs->cur, 0); xchk_iallocbt_chunk_xref(bs->sc, irec, agino, bno, len); - + xchk_xref_is_not_cow_staging(bs->sc, bno, len); return true; } diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index bbf9432c02c2..50ebd72f6d95 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -558,6 +558,7 @@ xchk_inode_xref( xchk_inode_xref_finobt(sc, ino); xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_inode_xref_bmap(sc, dip); out_free: diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 771a591a4aeb..db9e46a4f8d4 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -555,3 +555,24 @@ xchk_xref_is_not_shared( if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); } + +/* xref check that the extent is not being used for CoW staging. */ +void +xchk_xref_is_not_cow_staging( + struct xfs_scrub *sc, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + enum xbtree_recpacking outcome; + int error; + + if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm)) + return; + + error = xfs_refcount_has_records(sc->sa.refc_cur, XFS_REFC_DOMAIN_COW, + agbno, len, &outcome); + if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur)) + return; + if (outcome != XBTREE_RECPACKING_EMPTY) + xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); +} diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index d85c3b883b4c..b6f452eb9645 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -172,6 +172,8 @@ void xchk_xref_is_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len); void xchk_xref_is_not_shared(struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len); +void xchk_xref_is_not_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno, + xfs_extlen_t len); #ifdef CONFIG_XFS_RT void xchk_xref_is_used_rt_space(struct xfs_scrub *sc, xfs_rtblock_t rtbno, xfs_extlen_t len); -- cgit v1.2.3 From cc1207662d1a08e253520654e956f5e699826caa Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:13 -0700 Subject: xfs: remove pointless shadow variable from xfs_difree_inobt In xfs_difree_inobt, the pag passed in was previously used to look up the AGI buffer. There's no need to extract it again, so remove the shadow variable and shut up -Wshadow. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ialloc.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 0d2980accd3c..be952bd37a61 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -1978,8 +1978,6 @@ xfs_difree_inobt( */ if (!xfs_has_ikeep(mp) && rec.ir_free == XFS_INOBT_ALL_FREE && mp->m_sb.sb_inopblock <= XFS_INODES_PER_CHUNK) { - struct xfs_perag *pag = agbp->b_pag; - xic->deleted = true; xic->first_ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, rec.ir_startino); -- cgit v1.2.3 From c01868b60e8c19888572f90fd3426c0652c0e2a9 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:13 -0700 Subject: xfs: clean up broken early-exit code in the inode btree scrubber Corrupt inode chunks should cause us to exit early after setting the CORRUPT flag on the scrub state. While we're at it, collapse trivial helpers. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/ialloc.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 6d08613db32f..3f0994166173 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -79,25 +79,7 @@ xchk_iallocbt_chunk_xref_other( xchk_btree_xref_set_corrupt(sc, *pcur, 0); } -/* Cross-reference with the other btrees. */ -STATIC void -xchk_iallocbt_chunk_xref( - struct xfs_scrub *sc, - struct xfs_inobt_rec_incore *irec, - xfs_agino_t agino, - xfs_agblock_t agbno, - xfs_extlen_t len) -{ - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return; - - xchk_xref_is_used_space(sc, agbno, len); - xchk_iallocbt_chunk_xref_other(sc, irec, agino); - xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); - xchk_xref_is_not_shared(sc, agbno, len); -} - -/* Is this chunk worth checking? */ +/* Is this chunk worth checking and cross-referencing? */ STATIC bool xchk_iallocbt_chunk( struct xchk_btree *bs, @@ -105,17 +87,24 @@ xchk_iallocbt_chunk( xfs_agino_t agino, xfs_extlen_t len) { + struct xfs_scrub *sc = bs->sc; struct xfs_mount *mp = bs->cur->bc_mp; struct xfs_perag *pag = bs->cur->bc_ag.pag; - xfs_agblock_t bno; + xfs_agblock_t agbno; - bno = XFS_AGINO_TO_AGBNO(mp, agino); + agbno = XFS_AGINO_TO_AGBNO(mp, agino); - if (!xfs_verify_agbext(pag, bno, len)) + if (!xfs_verify_agbext(pag, agbno, len)) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - xchk_iallocbt_chunk_xref(bs->sc, irec, agino, bno, len); - xchk_xref_is_not_cow_staging(bs->sc, bno, len); + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return false; + + xchk_xref_is_used_space(sc, agbno, len); + xchk_iallocbt_chunk_xref_other(sc, irec, agino); + xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); + xchk_xref_is_not_shared(sc, agbno, len); + xchk_xref_is_not_cow_staging(sc, agbno, len); return true; } @@ -463,7 +452,7 @@ xchk_iallocbt_rec( if (holemask & 1) holecount += XFS_INODES_PER_HOLEMASK_BIT; else if (!xchk_iallocbt_chunk(bs, &irec, agino, len)) - break; + goto out; holemask >>= 1; agino += XFS_INODES_PER_HOLEMASK_BIT; } @@ -473,6 +462,9 @@ xchk_iallocbt_rec( xchk_btree_set_corrupt(bs->sc, bs->cur, 0); check_clusters: + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + goto out; + error = xchk_iallocbt_check_clusters(bs, &irec); if (error) goto out; -- cgit v1.2.3 From bc0f3b55467e1b5833bebae011a07e72a35afc2e Mon Sep 17 00:00:00 2001 From: "Darrick J.
Wong" Date: Tue, 11 Apr 2023 19:00:14 -0700 Subject: xfs: directly cross-reference the inode btrees with each other Improve the cross-referencing of the two inode btrees by directly checking the free and hole state of each inode with the other btree. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/ialloc.c | 225 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 198 insertions(+), 27 deletions(-) diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 3f0994166173..3a37c3dc0fae 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -51,32 +51,201 @@ struct xchk_iallocbt { }; /* - * If we're checking the finobt, cross-reference with the inobt. - * Otherwise we're checking the inobt; if there is an finobt, make sure - * we have a record or not depending on freecount. + * Does the finobt have a record for this inode with the same hole/free state? + * This is a bit complicated because of the following: + * + * - The finobt need not have a record if all inodes in the inobt record are + * allocated. + * - The finobt need not have a record if all inodes in the inobt record are + * free. + * - The finobt need not have a record if the inobt record says this is a hole. + * This likely doesn't happen in practice. */ -static inline void -xchk_iallocbt_chunk_xref_other( +STATIC int +xchk_inobt_xref_finobt( + struct xfs_scrub *sc, + struct xfs_inobt_rec_incore *irec, + xfs_agino_t agino, + bool free, + bool hole) +{ + struct xfs_inobt_rec_incore frec; + struct xfs_btree_cur *cur = sc->sa.fino_cur; + bool ffree, fhole; + unsigned int frec_idx, fhole_idx; + int has_record; + int error; + + ASSERT(cur->bc_btnum == XFS_BTNUM_FINO); + + error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record); + if (error) + return error; + if (!has_record) + goto no_record; + + error = xfs_inobt_get_rec(cur, &frec, &has_record); + if (!has_record) + return -EFSCORRUPTED; + + if (frec.ir_startino + XFS_INODES_PER_CHUNK <= agino) + goto no_record; + + /* There's a finobt record; free and hole status must match. */ + frec_idx = agino - frec.ir_startino; + ffree = frec.ir_free & (1ULL << frec_idx); + fhole_idx = frec_idx / XFS_INODES_PER_HOLEMASK_BIT; + fhole = frec.ir_holemask & (1U << fhole_idx); + + if (ffree != free) + xchk_btree_xref_set_corrupt(sc, cur, 0); + if (fhole != hole) + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; + +no_record: + /* inobt record is fully allocated */ + if (irec->ir_free == 0) + return 0; + + /* inobt record is totally unallocated */ + if (irec->ir_free == XFS_INOBT_ALL_FREE) + return 0; + + /* inobt record says this is a hole */ + if (hole) + return 0; + + /* finobt doesn't care about allocated inodes */ + if (!free) + return 0; + + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; +} + +/* + * Make sure that each inode of this part of an inobt record has the same + * sparse and free status as the finobt. 
+ */ +STATIC void +xchk_inobt_chunk_xref_finobt( struct xfs_scrub *sc, struct xfs_inobt_rec_incore *irec, - xfs_agino_t agino) + xfs_agino_t agino, + unsigned int nr_inodes) { - struct xfs_btree_cur **pcur; - bool has_irec; + xfs_agino_t i; + unsigned int rec_idx; int error; - if (sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT) - pcur = &sc->sa.ino_cur; - else - pcur = &sc->sa.fino_cur; - if (!(*pcur)) + ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT); + + if (!sc->sa.fino_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_ialloc_has_inode_record(*pcur, agino, agino, &has_irec); - if (!xchk_should_check_xref(sc, &error, pcur)) + + for (i = agino, rec_idx = agino - irec->ir_startino; + i < agino + nr_inodes; + i++, rec_idx++) { + bool free, hole; + unsigned int hole_idx; + + free = irec->ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + hole = irec->ir_holemask & (1U << hole_idx); + + error = xchk_inobt_xref_finobt(sc, irec, i, free, hole); + if (!xchk_should_check_xref(sc, &error, &sc->sa.fino_cur)) + return; + } +} + +/* + * Does the inobt have a record for this inode with the same hole/free state? + * The inobt must always have a record if there's a finobt record. + */ +STATIC int +xchk_finobt_xref_inobt( + struct xfs_scrub *sc, + struct xfs_inobt_rec_incore *frec, + xfs_agino_t agino, + bool ffree, + bool fhole) +{ + struct xfs_inobt_rec_incore irec; + struct xfs_btree_cur *cur = sc->sa.ino_cur; + bool free, hole; + unsigned int rec_idx, hole_idx; + int has_record; + int error; + + ASSERT(cur->bc_btnum == XFS_BTNUM_INO); + + error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record); + if (error) + return error; + if (!has_record) + goto no_record; + + error = xfs_inobt_get_rec(cur, &irec, &has_record); + if (!has_record) + return -EFSCORRUPTED; + + if (irec.ir_startino + XFS_INODES_PER_CHUNK <= agino) + goto no_record; + + /* There's an inobt record; free and hole status must match. */ + rec_idx = agino - irec.ir_startino; + free = irec.ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + hole = irec.ir_holemask & (1U << hole_idx); + + if (ffree != free) + xchk_btree_xref_set_corrupt(sc, cur, 0); + if (fhole != hole) + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; + +no_record: + /* finobt should never have a record for which the inobt does not */ + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; +} + +/* + * Make sure that each inode of this part of an finobt record has the same + * sparse and free status as the inobt. + */ +STATIC void +xchk_finobt_chunk_xref_inobt( + struct xfs_scrub *sc, + struct xfs_inobt_rec_incore *frec, + xfs_agino_t agino, + unsigned int nr_inodes) +{ + xfs_agino_t i; + unsigned int rec_idx; + int error; + + ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT); + + if (!sc->sa.ino_cur || xchk_skip_xref(sc->sm)) return; - if (((irec->ir_freecount > 0 && !has_irec) || - (irec->ir_freecount == 0 && has_irec))) - xchk_btree_xref_set_corrupt(sc, *pcur, 0); + + for (i = agino, rec_idx = agino - frec->ir_startino; + i < agino + nr_inodes; + i++, rec_idx++) { + bool ffree, fhole; + unsigned int hole_idx; + + ffree = frec->ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + fhole = frec->ir_holemask & (1U << hole_idx); + + error = xchk_finobt_xref_inobt(sc, frec, i, ffree, fhole); + if (!xchk_should_check_xref(sc, &error, &sc->sa.ino_cur)) + return; + } } /* Is this chunk worth checking and cross-referencing? 
*/ @@ -85,14 +254,16 @@ xchk_iallocbt_chunk( struct xchk_btree *bs, struct xfs_inobt_rec_incore *irec, xfs_agino_t agino, - xfs_extlen_t len) + unsigned int nr_inodes) { struct xfs_scrub *sc = bs->sc; struct xfs_mount *mp = bs->cur->bc_mp; struct xfs_perag *pag = bs->cur->bc_ag.pag; xfs_agblock_t agbno; + xfs_extlen_t len; agbno = XFS_AGINO_TO_AGBNO(mp, agino); + len = XFS_B_TO_FSB(mp, nr_inodes * mp->m_sb.sb_inodesize); if (!xfs_verify_agbext(pag, agbno, len)) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); @@ -101,7 +272,10 @@ xchk_iallocbt_chunk( return false; xchk_xref_is_used_space(sc, agbno, len); - xchk_iallocbt_chunk_xref_other(sc, irec, agino); + if (sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT) + xchk_inobt_chunk_xref_finobt(sc, irec, agino, nr_inodes); + else + xchk_finobt_chunk_xref_inobt(sc, irec, agino, nr_inodes); xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); xchk_xref_is_not_shared(sc, agbno, len); xchk_xref_is_not_cow_staging(sc, agbno, len); @@ -406,7 +580,6 @@ xchk_iallocbt_rec( struct xfs_inobt_rec_incore irec; uint64_t holes; xfs_agino_t agino; - xfs_extlen_t len; int holecount; int i; int error = 0; @@ -428,12 +601,11 @@ xchk_iallocbt_rec( /* Handle non-sparse inodes */ if (!xfs_inobt_issparse(irec.ir_holemask)) { - len = XFS_B_TO_FSB(mp, - XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize); if (irec.ir_count != XFS_INODES_PER_CHUNK) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - if (!xchk_iallocbt_chunk(bs, &irec, agino, len)) + if (!xchk_iallocbt_chunk(bs, &irec, agino, + XFS_INODES_PER_CHUNK)) goto out; goto check_clusters; } @@ -441,8 +613,6 @@ xchk_iallocbt_rec( /* Check each chunk of a sparse inode cluster. */ holemask = irec.ir_holemask; holecount = 0; - len = XFS_B_TO_FSB(mp, - XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize); holes = ~xfs_inobt_irec_to_allocmask(&irec); if ((holes & irec.ir_free) != holes || irec.ir_freecount > irec.ir_count) @@ -451,7 +621,8 @@ xchk_iallocbt_rec( for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; i++) { if (holemask & 1) holecount += XFS_INODES_PER_HOLEMASK_BIT; - else if (!xchk_iallocbt_chunk(bs, &irec, agino, len)) + else if (!xchk_iallocbt_chunk(bs, &irec, agino, + XFS_INODES_PER_HOLEMASK_BIT)) goto out; holemask >>= 1; agino += XFS_INODES_PER_HOLEMASK_BIT; -- cgit v1.2.3 From 69115f775f6e8e972a40aa6aa1523bcb0b252b1c Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:15 -0700 Subject: xfs: teach scrub to check for sole ownership of metadata objects Strengthen online scrub's checking even further by enabling us to check that a range of blocks are owned solely by a given owner. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_rmap.c | 192 ++++++++++++++++++++++++++++++++--------------- fs/xfs/libxfs/xfs_rmap.h | 18 ++++- fs/xfs/scrub/agheader.c | 10 +-- fs/xfs/scrub/btree.c | 2 +- fs/xfs/scrub/ialloc.c | 4 +- fs/xfs/scrub/inode.c | 2 +- fs/xfs/scrub/rmap.c | 45 +++++------ fs/xfs/scrub/scrub.h | 2 +- 8 files changed, 182 insertions(+), 93 deletions(-) diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index 308b81f321eb..f4dc23b3b837 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -2735,65 +2735,141 @@ xfs_rmap_has_records( return xfs_btree_has_records(cur, &low, &high, &mask, outcome); } -/* - * Is there a record for this owner completely covering a given physical - * extent? If so, *has_rmap will be set to true. If there is no record - * or the record only covers part of the range, we set *has_rmap to false. 
- * This function doesn't perform range lookups or offset checks, so it is - * not suitable for checking data fork blocks. - */ -int -xfs_rmap_record_exists( - struct xfs_btree_cur *cur, +struct xfs_rmap_ownercount { + /* Owner that we're looking for. */ + struct xfs_rmap_irec good; + + /* rmap search keys */ + struct xfs_rmap_irec low; + struct xfs_rmap_irec high; + + struct xfs_rmap_matches *results; + + /* Stop early if we find a nonmatch? */ + bool stop_on_nonmatch; +}; + +/* Does this rmap represent space that can have multiple owners? */ +static inline bool +xfs_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + if (!xfs_has_reflink(mp)) + return false; + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | + XFS_RMAP_BMBT_BLOCK)) + return false; + return true; +} + +static inline void +xfs_rmap_ownercount_init( + struct xfs_rmap_ownercount *roc, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap) + struct xfs_rmap_matches *results) { - uint64_t owner; - uint64_t offset; - unsigned int flags; - int has_record; - struct xfs_rmap_irec irec; - int error; + memset(roc, 0, sizeof(*roc)); + roc->results = results; + + roc->low.rm_startblock = bno; + memset(&roc->high, 0xFF, sizeof(roc->high)); + roc->high.rm_startblock = bno + len - 1; + + memset(results, 0, sizeof(*results)); + roc->good.rm_startblock = bno; + roc->good.rm_blockcount = len; + roc->good.rm_owner = oinfo->oi_owner; + roc->good.rm_offset = oinfo->oi_offset; + if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK) + roc->good.rm_flags |= XFS_RMAP_ATTR_FORK; + if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK) + roc->good.rm_flags |= XFS_RMAP_BMBT_BLOCK; +} - xfs_owner_info_unpack(oinfo, &owner, &offset, &flags); - ASSERT(XFS_RMAP_NON_INODE_OWNER(owner) || - (flags & XFS_RMAP_BMBT_BLOCK)); +/* Figure out if this is a match for the owner. */ +STATIC int +xfs_rmap_count_owners_helper( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xfs_rmap_ownercount *roc = priv; + struct xfs_rmap_irec check = *rec; + unsigned int keyflags; + bool filedata; + int64_t delta; + + filedata = !XFS_RMAP_NON_INODE_OWNER(check.rm_owner) && + !(check.rm_flags & XFS_RMAP_BMBT_BLOCK); + + /* Trim the part of check that comes before the comparison range. */ + delta = (int64_t)roc->good.rm_startblock - check.rm_startblock; + if (delta > 0) { + check.rm_startblock += delta; + check.rm_blockcount -= delta; + if (filedata) + check.rm_offset += delta; + } - error = xfs_rmap_lookup_le(cur, bno, owner, offset, flags, &irec, - &has_record); - if (error) - return error; - if (!has_record) { - *has_rmap = false; - return 0; + /* Trim the part of check that comes after the comparison range. */ + delta = (check.rm_startblock + check.rm_blockcount) - + (roc->good.rm_startblock + roc->good.rm_blockcount); + if (delta > 0) + check.rm_blockcount -= delta; + + /* Don't care about unwritten status for establishing ownership. 
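+	 * XFS_RMAP_UNWRITTEN is deliberately left out of the keyflags mask
+	 * below, so a written and an unwritten extent from the same owner
+	 * compare as equal.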
*/ + keyflags = check.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK); + + if (check.rm_startblock == roc->good.rm_startblock && + check.rm_blockcount == roc->good.rm_blockcount && + check.rm_owner == roc->good.rm_owner && + check.rm_offset == roc->good.rm_offset && + keyflags == roc->good.rm_flags) { + roc->results->matches++; + } else { + roc->results->non_owner_matches++; + if (xfs_rmap_shareable(cur->bc_mp, &roc->good) ^ + xfs_rmap_shareable(cur->bc_mp, &check)) + roc->results->bad_non_owner_matches++; } - *has_rmap = (irec.rm_owner == owner && irec.rm_startblock <= bno && - irec.rm_startblock + irec.rm_blockcount >= bno + len); + if (roc->results->non_owner_matches && roc->stop_on_nonmatch) + return -ECANCELED; + return 0; } -struct xfs_rmap_key_state { - uint64_t owner; - uint64_t offset; - unsigned int flags; -}; - -/* For each rmap given, figure out if it doesn't match the key we want. */ -STATIC int -xfs_rmap_has_other_keys_helper( +/* Count the number of owners and non-owners of this range of blocks. */ +int +xfs_rmap_count_owners( struct xfs_btree_cur *cur, - const struct xfs_rmap_irec *rec, - void *priv) + xfs_agblock_t bno, + xfs_extlen_t len, + const struct xfs_owner_info *oinfo, + struct xfs_rmap_matches *results) { - struct xfs_rmap_key_state *rks = priv; + struct xfs_rmap_ownercount roc; + int error; - if (rks->owner == rec->rm_owner && rks->offset == rec->rm_offset && - ((rks->flags & rec->rm_flags) & XFS_RMAP_KEY_FLAGS) == rks->flags) - return 0; - return -ECANCELED; + xfs_rmap_ownercount_init(&roc, bno, len, oinfo, results); + error = xfs_rmap_query_range(cur, &roc.low, &roc.high, + xfs_rmap_count_owners_helper, &roc); + if (error) + return error; + + /* + * There can't be any non-owner rmaps that conflict with the given + * owner if we didn't find any rmaps matching the owner. + */ + if (!results->matches) + results->bad_non_owner_matches = 0; + + return 0; } /* @@ -2806,28 +2882,26 @@ xfs_rmap_has_other_keys( xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap) + bool *has_other) { - struct xfs_rmap_irec low = {0}; - struct xfs_rmap_irec high; - struct xfs_rmap_key_state rks; + struct xfs_rmap_matches res; + struct xfs_rmap_ownercount roc; int error; - xfs_owner_info_unpack(oinfo, &rks.owner, &rks.offset, &rks.flags); - *has_rmap = false; - - low.rm_startblock = bno; - memset(&high, 0xFF, sizeof(high)); - high.rm_startblock = bno + len - 1; + xfs_rmap_ownercount_init(&roc, bno, len, oinfo, &res); + roc.stop_on_nonmatch = true; - error = xfs_rmap_query_range(cur, &low, &high, - xfs_rmap_has_other_keys_helper, &rks); + error = xfs_rmap_query_range(cur, &roc.low, &roc.high, + xfs_rmap_count_owners_helper, &roc); if (error == -ECANCELED) { - *has_rmap = true; + *has_other = true; return 0; } + if (error) + return error; - return error; + *has_other = false; + return 0; } const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE = { diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 4cbe50cf522e..3c98d9d50afb 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -200,12 +200,24 @@ xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur, int xfs_rmap_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, enum xbtree_recpacking *outcome); -int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno, + +struct xfs_rmap_matches { + /* Number of owner matches. */ + unsigned long long matches; + + /* Number of non-owner matches. 
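+	 * (i.e. rmap records overlapping the range that do not exactly
+	 * match the owner record we were asked to count)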
*/ + unsigned long long non_owner_matches; + + /* Number of non-owner matches that conflict with the owner matches. */ + unsigned long long bad_non_owner_matches; +}; + +int xfs_rmap_count_owners(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap); + struct xfs_rmap_matches *rmatch); int xfs_rmap_has_other_keys(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap); + bool *has_other); int xfs_rmap_map_raw(struct xfs_btree_cur *cur, struct xfs_rmap_irec *rmap); extern const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE; diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index 1a84153afa91..6c6e5eba42c8 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -51,7 +51,7 @@ xchk_superblock_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); @@ -515,7 +515,7 @@ xchk_agf_xref( xchk_agf_xref_freeblks(sc); xchk_agf_xref_cntbt(sc); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_agf_xref_btreeblks(sc); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); @@ -644,7 +644,7 @@ xchk_agfl_block_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); } @@ -701,7 +701,7 @@ xchk_agfl_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); @@ -857,7 +857,7 @@ xchk_agi_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); xchk_agi_xref_icounts(sc); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_agi_xref_fiblocks(sc); diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 1165dc05a179..1935b9ce1885 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -402,7 +402,7 @@ xchk_btree_check_block_owner( if (!bs->sc->sa.bno_cur && btnum == XFS_BTNUM_BNO) bs->cur = NULL; - xchk_xref_is_owned_by(bs->sc, agbno, 1, bs->oinfo); + xchk_xref_is_only_owned_by(bs->sc, agbno, 1, bs->oinfo); if (!bs->sc->sa.rmap_cur && btnum == XFS_BTNUM_RMAP) bs->cur = NULL; diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index fda96b536730..575f22a02ebe 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -276,7 +276,7 @@ xchk_iallocbt_chunk( xchk_inobt_chunk_xref_finobt(sc, irec, agino, nr_inodes); else xchk_finobt_chunk_xref_inobt(sc, irec, agino, nr_inodes); - xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); + xchk_xref_is_only_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); xchk_xref_is_not_shared(sc, agbno, len); xchk_xref_is_not_cow_staging(sc, agbno, len); return 
true; @@ -428,7 +428,7 @@ xchk_iallocbt_check_cluster( return 0; } - xchk_xref_is_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster, + xchk_xref_is_only_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster, &XFS_RMAP_OINFO_INODES); /* Grab the inode cluster buffer. */ diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 50ebd72f6d95..2db96c8a71dc 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -556,7 +556,7 @@ xchk_inode_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_inode_xref_finobt(sc, ino); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES); xchk_xref_is_not_shared(sc, agbno, 1); xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_inode_xref_bmap(sc, dip); diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 2f9e4f77db6b..18b64287158e 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -167,38 +167,29 @@ xchk_rmapbt( &XFS_RMAP_OINFO_AG, NULL); } -/* xref check that the extent is owned by a given owner */ -static inline void -xchk_xref_check_owner( +/* xref check that the extent is owned only by a given owner */ +void +xchk_xref_is_only_owned_by( struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len, - const struct xfs_owner_info *oinfo, - bool should_have_rmap) + const struct xfs_owner_info *oinfo) { - bool has_rmap; + struct xfs_rmap_matches res; int error; if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo, - &has_rmap); + error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res); if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) return; - if (has_rmap != should_have_rmap) + if (res.matches != 1) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.bad_non_owner_matches) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.non_owner_matches) xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); -} - -/* xref check that the extent is owned by a given owner */ -void -xchk_xref_is_owned_by( - struct xfs_scrub *sc, - xfs_agblock_t bno, - xfs_extlen_t len, - const struct xfs_owner_info *oinfo) -{ - xchk_xref_check_owner(sc, bno, len, oinfo, true); } /* xref check that the extent is not owned by a given owner */ @@ -209,7 +200,19 @@ xchk_xref_is_not_owned_by( xfs_extlen_t len, const struct xfs_owner_info *oinfo) { - xchk_xref_check_owner(sc, bno, len, oinfo, false); + struct xfs_rmap_matches res; + int error; + + if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) + return; + + error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res); + if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) + return; + if (res.matches != 0) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.bad_non_owner_matches) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); } /* xref check that the extent has no reverse mapping at all */ diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index b6f452eb9645..c519927355fe 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -162,7 +162,7 @@ void xchk_xref_is_not_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len); void xchk_xref_is_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len); -void xchk_xref_is_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno, +void xchk_xref_is_only_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo); void xchk_xref_is_not_owned_by(struct xfs_scrub *sc, 
xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo); -- cgit v1.2.3 From efc0845f5d3e253f7f46a60b66a94c3164d76ee3 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:15 -0700 Subject: xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill scan results because for a given range of inode numbers, we might have no indexed inodes at all; the entire region might be allocated ondisk inodes; or there might be a mix of the two. Unfortunately, sparse inodes adds to the complexity, because each inode record can have holes, which means that we cannot use the generic btree _scan_keyfill function because we must look for holes in individual records to decide the result. On the plus side, online fsck can now detect sub-chunk discrepancies in the inobt. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_ialloc.c | 82 ++++++++++++++++++++++++++++------------------ fs/xfs/libxfs/xfs_ialloc.h | 5 ++- fs/xfs/scrub/ialloc.c | 17 ++++++---- 3 files changed, 62 insertions(+), 42 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index be952bd37a61..a16d5de16933 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -2641,44 +2641,50 @@ xfs_ialloc_read_agi( return 0; } -/* Is there an inode record covering a given range of inode numbers? */ -int -xfs_ialloc_has_inode_record( - struct xfs_btree_cur *cur, - xfs_agino_t low, - xfs_agino_t high, - bool *exists) +/* How many inodes are backed by inode clusters ondisk? */ +STATIC int +xfs_ialloc_count_ondisk( + struct xfs_btree_cur *cur, + xfs_agino_t low, + xfs_agino_t high, + unsigned int *allocated) { struct xfs_inobt_rec_incore irec; - xfs_agino_t agino; - uint16_t holemask; - int has_record; - int i; - int error; + unsigned int ret = 0; + int has_record; + int error; - *exists = false; error = xfs_inobt_lookup(cur, low, XFS_LOOKUP_LE, &has_record); - while (error == 0 && has_record) { + if (error) + return error; + + while (has_record) { + unsigned int i, hole_idx; + error = xfs_inobt_get_rec(cur, &irec, &has_record); - if (error || irec.ir_startino > high) + if (error) + return error; + if (irec.ir_startino > high) break; - agino = irec.ir_startino; - holemask = irec.ir_holemask; - for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1, - i++, agino += XFS_INODES_PER_HOLEMASK_BIT) { - if (holemask & 1) + for (i = 0; i < XFS_INODES_PER_CHUNK; i++) { + if (irec.ir_startino + i < low) continue; - if (agino + XFS_INODES_PER_HOLEMASK_BIT > low && - agino <= high) { - *exists = true; - return 0; - } + if (irec.ir_startino + i > high) + break; + + hole_idx = i / XFS_INODES_PER_HOLEMASK_BIT; + if (!(irec.ir_holemask & (1U << hole_idx))) + ret++; } error = xfs_btree_increment(cur, 0, &has_record); + if (error) + return error; } - return error; + + *allocated = ret; + return 0; } /* Is there an inode record covering a given extent? 
*/ @@ -2687,15 +2693,27 @@ xfs_ialloc_has_inodes_at_extent( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { - xfs_agino_t low; - xfs_agino_t high; + xfs_agino_t agino; + xfs_agino_t last_agino; + unsigned int allocated; + int error; - low = XFS_AGB_TO_AGINO(cur->bc_mp, bno); - high = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1; + agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno); + last_agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1; - return xfs_ialloc_has_inode_record(cur, low, high, exists); + error = xfs_ialloc_count_ondisk(cur, agino, last_agino, &allocated); + if (error) + return error; + + if (allocated == 0) + *outcome = XBTREE_RECPACKING_EMPTY; + else if (allocated == last_agino - agino + 1) + *outcome = XBTREE_RECPACKING_FULL; + else + *outcome = XBTREE_RECPACKING_SPARSE; + return 0; } struct xfs_ialloc_count_inodes { diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h index 90b0e5079338..fe824bb04a09 100644 --- a/fs/xfs/libxfs/xfs_ialloc.h +++ b/fs/xfs/libxfs/xfs_ialloc.h @@ -96,9 +96,8 @@ void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur, const struct xfs_inobt_rec_incore *irec); int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur, - xfs_agblock_t bno, xfs_extlen_t len, bool *exists); -int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low, - xfs_agino_t high, bool *exists); + xfs_agblock_t bno, xfs_extlen_t len, + enum xbtree_recpacking *outcome); int xfs_ialloc_count_inodes(struct xfs_btree_cur *cur, xfs_agino_t *count, xfs_agino_t *freecount); int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, uint16_t holemask, diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index 3a37c3dc0fae..fda96b536730 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -765,18 +765,18 @@ xchk_xref_inode_check( xfs_agblock_t agbno, xfs_extlen_t len, struct xfs_btree_cur **icur, - bool should_have_inodes) + enum xbtree_recpacking expected) { - bool has_inodes; + enum xbtree_recpacking outcome; int error; if (!(*icur) || xchk_skip_xref(sc->sm)) return; - error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes); + error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, icur)) return; - if (has_inodes != should_have_inodes) + if (outcome != expected) xchk_btree_xref_set_corrupt(sc, *icur, 0); } @@ -787,8 +787,10 @@ xchk_xref_is_not_inode_chunk( xfs_agblock_t agbno, xfs_extlen_t len) { - xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, false); - xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur, false); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, + XBTREE_RECPACKING_EMPTY); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur, + XBTREE_RECPACKING_EMPTY); } /* xref check that the extent is covered by inodes */ @@ -798,5 +800,6 @@ xchk_xref_is_inode_chunk( xfs_agblock_t agbno, xfs_extlen_t len) { - xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, true); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, + XBTREE_RECPACKING_FULL); } -- cgit v1.2.3 From 9dceccc5822f2ecea12a89f24d7cad1f3e5eab7c Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:16 -0700 Subject: xfs: use the directory name hash function for dir scrubbing The directory code has a directory-specific hash computation function that includes a modified hash function for case-insensitive lookups. 
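For reference, the dispatch looks roughly like this (a sketch of the
xfs_dir2_hashname helper in xfs_dir2_priv.h from the tree this series
applies to, not something this patch changes):

	static inline xfs_dahash_t
	xfs_dir2_hashname(
		struct xfs_mount	*mp,
		const struct xfs_name	*name)
	{
		if (unlikely(xfs_has_asciici(mp)))
			return xfs_ascii_ci_hashname(name);
		return xfs_da_hashname(name->name, name->len);
	}

On a filesystem with the ASCII case-insensitive feature bit set, the hash
stored in the dabtree index is computed from the case-folded name, so the
raw hash of the dirent name need not match the index.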
Hence we must use that function (and not the raw da_hashname) when checking the dabtree structure. Found by accidentally breaking xfs/188 to create an abnormally huge case-insensitive directory and watching scrub break. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Reviewed-by: Dave Chinner --- fs/xfs/scrub/dir.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index b6081a3e1b91..1b1830576dcd 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -201,6 +201,7 @@ xchk_dir_rec( struct xchk_da_btree *ds, int level) { + struct xfs_name dname = { }; struct xfs_da_state_blk *blk = &ds->state->path.blk[level]; struct xfs_mount *mp = ds->state->mp; struct xfs_inode *dp = ds->dargs.dp; @@ -297,7 +298,11 @@ xchk_dir_rec( xchk_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno); goto out_relse; } - calc_hash = xfs_da_hashname(dent->name, dent->namelen); + + /* Does the directory hash match? */ + dname.name = dent->name; + dname.len = dent->namelen; + calc_hash = xfs_dir2_hashname(mp, &dname); if (calc_hash != hash) xchk_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno); -- cgit v1.2.3 From 30f8ee5e7e0ccce396dff209c6cbce49d0d7e167 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:16 -0700 Subject: xfs: ensure that single-owner file blocks are not owned by others For any file fork mapping that can only have a single owner, make sure that there are no other rmap owners for that mapping. This patch requires the more detailed checking provided by xfs_rmap_count_owners so that we can know how many rmap records for a given range of space had a matching owner, how many had a non-matching owner, and how many conflicted with the records that have a matching owner. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index be2c4da2808b..2412dcf0fa9a 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -308,6 +308,7 @@ xchk_bmap_iextent_xref( struct xchk_bmap_info *info, struct xfs_bmbt_irec *irec) { + struct xfs_owner_info oinfo; struct xfs_mount *mp = info->sc->mp; xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -328,19 +329,30 @@ xchk_bmap_iextent_xref( xchk_bmap_xref_rmap(info, irec, agbno); switch (info->whichfork) { case XFS_DATA_FORK: - if (!xfs_is_reflink_inode(info->sc->ip)) + if (!xfs_is_reflink_inode(info->sc->ip)) { + xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, + info->whichfork, irec->br_startoff); + xchk_xref_is_only_owned_by(info->sc, agbno, + irec->br_blockcount, &oinfo); xchk_xref_is_not_shared(info->sc, agbno, irec->br_blockcount); + } xchk_xref_is_not_cow_staging(info->sc, agbno, irec->br_blockcount); break; case XFS_ATTR_FORK: + xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, + info->whichfork, irec->br_startoff); + xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, + &oinfo); xchk_xref_is_not_shared(info->sc, agbno, irec->br_blockcount); xchk_xref_is_not_cow_staging(info->sc, agbno, irec->br_blockcount); break; case XFS_COW_FORK: + xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, + &XFS_RMAP_OINFO_COW); xchk_xref_is_cow_staging(info->sc, agbno, irec->br_blockcount); xchk_xref_is_not_shared(info->sc, agbno, -- cgit v1.2.3 From 4c233b5c4f29dff11eeb64b2b1cc0831b9904a4f Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 19:00:17 -0700 Subject: xfs: streamline the directory iteration code for scrub Currently, online scrub reuses the xfs_readdir code to walk every entry in a directory. This isn't awesome for performance, since we end up cycling the directory ILOCK needlessly and coding around the particular quirks of the VFS dir_context interface. Create a streamlined version of readdir that keeps the ILOCK (since the walk function isn't going to copy stuff to userspace), skips a whole lot of directory walk cursor checks (since we start at 0 and walk to the end) and has a sane way to return error codes. Note: Porting the dotdot checking code is left for a subsequent patch. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/Makefile | 1 + fs/xfs/scrub/dir.c | 188 ++++++++----------------- fs/xfs/scrub/parent.c | 73 +++------- fs/xfs/scrub/readdir.c | 375 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/readdir.h | 19 +++ 5 files changed, 473 insertions(+), 183 deletions(-) create mode 100644 fs/xfs/scrub/readdir.c create mode 100644 fs/xfs/scrub/readdir.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 3bdbc838c4d1..ac9d03cd2623 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -158,6 +158,7 @@ xfs-y += $(addprefix scrub/, \ ialloc.o \ inode.o \ parent.o \ + readdir.o \ refcount.o \ rmap.o \ scrub.o \ diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 1b1830576dcd..f1cbe7b22688 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -18,6 +18,7 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/dabtree.h" +#include "scrub/readdir.h" /* Set us up to scrub directories. */ int @@ -31,30 +32,21 @@ xchk_setup_directory( /* Scrub a directory entry. */ -struct xchk_dir_ctx { - /* VFS fill-directory iterator */ - struct dir_context dir_iter; - - struct xfs_scrub *sc; -}; - -/* Check that an inode's mode matches a given DT_ type. */ +/* Check that an inode's mode matches a given XFS_DIR3_FT_* type. */ STATIC int xchk_dir_check_ftype( - struct xchk_dir_ctx *sdc, + struct xfs_scrub *sc, xfs_fileoff_t offset, xfs_ino_t inum, - int dtype) + int ftype) { - struct xfs_mount *mp = sdc->sc->mp; + struct xfs_mount *mp = sc->mp; struct xfs_inode *ip; - int ino_dtype; int error = 0; if (!xfs_has_ftype(mp)) { - if (dtype != DT_UNKNOWN && dtype != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); + if (ftype != XFS_DIR3_FT_UNKNOWN && ftype != XFS_DIR3_FT_DIR) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); goto out; } @@ -71,21 +63,17 @@ xchk_dir_check_ftype( * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sdc->sc->tp, inum, 0, 0, &ip); + error = xfs_iget(mp, sc->tp, inum, 0, 0, &ip); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; - xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error); + xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); goto out; } - if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset, - &error)) + if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, offset, &error)) goto out; - /* Convert mode to the DT_* values that dir_emit uses. 
*/ - ino_dtype = xfs_dir3_get_dtype(mp, - xfs_mode_to_ftype(VFS_I(ip)->i_mode)); - if (ino_dtype != dtype) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); + if (xfs_mode_to_ftype(VFS_I(ip)->i_mode) != ftype) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); xfs_irele(ip); out: return error; @@ -94,105 +82,85 @@ out: /* * Scrub a single directory entry. * - * We use the VFS directory iterator (i.e. readdir) to call this - * function for every directory entry in a directory. Once we're here, - * we check the inode number to make sure it's sane, then we check that - * we can look up this filename. Finally, we check the ftype. + * Check the inode number to make sure it's sane, then we check that we can + * look up this filename. Finally, we check the ftype. */ -STATIC bool +STATIC int xchk_dir_actor( - struct dir_context *dir_iter, - const char *name, - int namelen, - loff_t pos, - u64 ino, - unsigned type) + struct xfs_scrub *sc, + struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, + const struct xfs_name *name, + xfs_ino_t ino, + void *priv) { - struct xfs_mount *mp; - struct xfs_inode *ip; - struct xchk_dir_ctx *sdc; - struct xfs_name xname; + struct xfs_mount *mp = dp->i_mount; xfs_ino_t lookup_ino; xfs_dablk_t offset; bool checked_ftype = false; int error = 0; - sdc = container_of(dir_iter, struct xchk_dir_ctx, dir_iter); - ip = sdc->sc->ip; - mp = ip->i_mount; offset = xfs_dir2_db_to_da(mp->m_dir_geo, - xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos)); + xfs_dir2_dataptr_to_db(mp->m_dir_geo, dapos)); - if (xchk_should_terminate(sdc->sc, &error)) - return !error; + if (xchk_should_terminate(sc, &error)) + return error; /* Does this inode number make sense? */ if (!xfs_verify_dir_ino(mp, ino)) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } /* Does this name make sense? */ - if (!xfs_dir2_namecheck(name, namelen)) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + if (!xfs_dir2_namecheck(name->name, name->len)) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } - if (!strncmp(".", name, namelen)) { + if (!strncmp(".", name->name, name->len)) { /* If this is "." then check that the inum matches the dir. */ - if (xfs_has_ftype(mp) && type != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); + if (xfs_has_ftype(mp) && name->type != XFS_DIR3_FT_DIR) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); checked_ftype = true; - if (ino != ip->i_ino) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); - } else if (!strncmp("..", name, namelen)) { + if (ino != dp->i_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + } else if (!strncmp("..", name->name, name->len)) { /* * If this is ".." in the root inode, check that the inum * matches this dir. */ - if (xfs_has_ftype(mp) && type != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); + if (xfs_has_ftype(mp) && name->type != XFS_DIR3_FT_DIR) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); checked_ftype = true; - if (ip->i_ino == mp->m_sb.sb_rootino && ino != ip->i_ino) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); + if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); } /* Verify that we can look up this name by hash. 
*/ - xname.name = name; - xname.len = namelen; - xname.type = XFS_DIR3_FT_UNKNOWN; - - error = xfs_dir_lookup(sdc->sc->tp, ip, &xname, &lookup_ino, NULL); + error = xchk_dir_lookup(sc, dp, name, &lookup_ino); /* ENOENT means the hash lookup failed and the dir is corrupt */ if (error == -ENOENT) error = -EFSCORRUPTED; - if (!xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, offset, - &error)) + if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, offset, &error)) goto out; if (lookup_ino != ino) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } /* Verify the file type. This function absorbs error codes. */ if (!checked_ftype) { - error = xchk_dir_check_ftype(sdc, offset, lookup_ino, type); + error = xchk_dir_check_ftype(sc, offset, lookup_ino, + name->type); if (error) goto out; } + out: - /* - * A negative error code returned here is supposed to cause the - * dir_emit caller (xfs_readdir) to abort the directory iteration - * and return zero to xchk_directory. - */ - if (error == 0 && sdc->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return false; - return !error; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return -ECANCELED; + return error; } /* Scrub a directory btree record. */ @@ -808,14 +776,7 @@ int xchk_directory( struct xfs_scrub *sc) { - struct xchk_dir_ctx sdc = { - .dir_iter.actor = xchk_dir_actor, - .dir_iter.pos = 0, - .sc = sc, - }; - size_t bufsize; - loff_t oldpos; - int error = 0; + int error; if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) return -ENOENT; @@ -823,7 +784,7 @@ xchk_directory( /* Plausible size? */ if (sc->ip->i_disk_size < xfs_dir2_sf_hdr_size(0)) { xchk_ino_set_corrupt(sc, sc->ip->i_ino); - goto out; + return 0; } /* Check directory tree structure */ @@ -832,7 +793,7 @@ xchk_directory( return error; if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return error; + return 0; /* Check the freespace. */ error = xchk_directory_blocks(sc); @@ -840,44 +801,11 @@ xchk_directory( return error; if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return error; - - /* - * Check that every dirent we see can also be looked up by hash. - * Userspace usually asks for a 32k buffer, so we will too. - */ - bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE, - sc->ip->i_disk_size); - - /* - * Look up every name in this directory by hash. - * - * Use the xfs_readdir function to call xchk_dir_actor on - * every directory entry in this directory. In _actor, we check - * the name, inode number, and ftype (if applicable) of the - * entry. xfs_readdir uses the VFS filldir functions to provide - * iteration context. - * - * The VFS grabs a read or write lock via i_rwsem before it reads - * or writes to a directory. If we've gotten this far we've - * already obtained IOLOCK_EXCL, which (since 4.10) is the same as - * getting a write lock on i_rwsem. Therefore, it is safe for us - * to drop the ILOCK here in order to reuse the _readdir and - * _dir_lookup routines, which do their own ILOCK locking. - */ - oldpos = 0; - sc->ilock_flags &= ~XFS_ILOCK_EXCL; - xfs_iunlock(sc->ip, XFS_ILOCK_EXCL); - while (true) { - error = xfs_readdir(sc->tp, sc->ip, &sdc.dir_iter, bufsize); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, - &error)) - goto out; - if (oldpos == sdc.dir_iter.pos) - break; - oldpos = sdc.dir_iter.pos; - } + return 0; -out: + /* Look up every name in this directory by hash. 
*/ + error = xchk_dir_walk(sc, sc->ip, xchk_dir_actor, NULL); + if (error == -ECANCELED) + error = 0; return error; } diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index d1db18250ee3..af351c4ee6ec 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -16,6 +16,7 @@ #include "xfs_dir2_priv.h" #include "scrub/scrub.h" #include "scrub/common.h" +#include "scrub/readdir.h" /* Set us up to scrub parents. */ int @@ -30,39 +31,36 @@ xchk_setup_parent( /* Look for an entry in a parent pointing to this inode. */ struct xchk_parent_ctx { - struct dir_context dc; struct xfs_scrub *sc; - xfs_ino_t ino; xfs_nlink_t nlink; - bool cancelled; }; /* Look for a single entry in a directory pointing to an inode. */ -STATIC bool +STATIC int xchk_parent_actor( - struct dir_context *dc, - const char *name, - int namelen, - loff_t pos, - u64 ino, - unsigned type) + struct xfs_scrub *sc, + struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, + const struct xfs_name *name, + xfs_ino_t ino, + void *priv) { - struct xchk_parent_ctx *spc; + struct xchk_parent_ctx *spc = priv; int error = 0; - spc = container_of(dc, struct xchk_parent_ctx, dc); - if (spc->ino == ino) + /* Does this name make sense? */ + if (!xfs_dir2_namecheck(name->name, name->len)) + error = -EFSCORRUPTED; + if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) + return error; + + if (sc->ip->i_ino == ino) spc->nlink++; - /* - * If we're facing a fatal signal, bail out. Store the cancellation - * status separately because the VFS readdir code squashes error codes - * into short directory reads. - */ if (xchk_should_terminate(spc->sc, &error)) - spc->cancelled = true; + return error; - return !error; + return 0; } /* Count the number of dentries in the parent dir that point to this inode. */ @@ -73,50 +71,19 @@ xchk_parent_count_parent_dentries( xfs_nlink_t *nlink) { struct xchk_parent_ctx spc = { - .dc.actor = xchk_parent_actor, - .ino = sc->ip->i_ino, .sc = sc, + .nlink = 0, }; - size_t bufsize; - loff_t oldpos; uint lock_mode; int error = 0; - /* - * If there are any blocks, read-ahead block 0 as we're almost - * certain to have the next operation be a read there. This is - * how we guarantee that the parent's extent map has been loaded, - * if there is one. - */ lock_mode = xfs_ilock_data_map_shared(parent); - if (parent->i_df.if_nextents > 0) - error = xfs_dir3_data_readahead(parent, 0, 0); + error = xchk_dir_walk(sc, parent, xchk_parent_actor, &spc); xfs_iunlock(parent, lock_mode); if (error) return error; - /* - * Iterate the parent dir to confirm that there is - * exactly one entry pointing back to the inode being - * scanned. - */ - bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE, - parent->i_disk_size); - oldpos = 0; - while (true) { - error = xfs_readdir(sc->tp, parent, &spc.dc, bufsize); - if (error) - goto out; - if (spc.cancelled) { - error = -EAGAIN; - goto out; - } - if (oldpos == spc.dc.pos) - break; - oldpos = spc.dc.pos; - } *nlink = spc.nlink; -out: return error; } diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c new file mode 100644 index 000000000000..e51c1544be63 --- /dev/null +++ b/fs/xfs/scrub/readdir.c @@ -0,0 +1,375 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_inode.h" +#include "xfs_dir2.h" +#include "xfs_dir2_priv.h" +#include "xfs_trace.h" +#include "xfs_bmap.h" +#include "xfs_trans.h" +#include "xfs_error.h" +#include "scrub/scrub.h" +#include "scrub/readdir.h" + +/* Call a function for every entry in a shortform directory. */ +STATIC int +xchk_dir_walk_sf( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_name name = { + .name = ".", + .len = 1, + .type = XFS_DIR3_FT_DIR, + }; + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_dir2_sf_entry *sfep; + struct xfs_dir2_sf_hdr *sfp; + xfs_ino_t ino; + xfs_dir2_dataptr_t dapos; + unsigned int i; + int error; + + ASSERT(dp->i_df.if_bytes == dp->i_disk_size); + ASSERT(dp->i_df.if_u1.if_data != NULL); + + sfp = (struct xfs_dir2_sf_hdr *)dp->i_df.if_u1.if_data; + + /* dot entry */ + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + geo->data_entry_offset); + + error = dirent_fn(sc, dp, dapos, &name, dp->i_ino, priv); + if (error) + return error; + + /* dotdot entry */ + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + geo->data_entry_offset + + xfs_dir2_data_entsize(mp, sizeof(".") - 1)); + ino = xfs_dir2_sf_get_parent_ino(sfp); + name.name = ".."; + name.len = 2; + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + return error; + + /* iterate everything else */ + sfep = xfs_dir2_sf_firstentry(sfp); + for (i = 0; i < sfp->count; i++) { + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + xfs_dir2_sf_get_offset(sfep)); + ino = xfs_dir2_sf_get_ino(mp, sfp, sfep); + name.name = sfep->name; + name.len = sfep->namelen; + name.type = xfs_dir2_sf_get_ftype(mp, sfep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + return error; + + sfep = xfs_dir2_sf_nextentry(mp, sfp, sfep); + } + + return 0; +} + +/* Call a function for every entry in a block directory. */ +STATIC int +xchk_dir_walk_block( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_buf *bp; + unsigned int off, next_off, end; + int error; + + error = xfs_dir3_block_read(sc->tp, dp, &bp); + if (error) + return error; + + /* Walk each directory entry. */ + end = xfs_dir3_data_end_offset(geo, bp->b_addr); + for (off = geo->data_entry_offset; off < end; off = next_off) { + struct xfs_name name = { }; + struct xfs_dir2_data_unused *dup = bp->b_addr + off; + struct xfs_dir2_data_entry *dep = bp->b_addr + off; + xfs_ino_t ino; + xfs_dir2_dataptr_t dapos; + + /* Skip an empty entry. */ + if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { + next_off = off + be16_to_cpu(dup->length); + continue; + } + + /* Otherwise, find the next entry and report it. */ + next_off = off + xfs_dir2_data_entsize(mp, dep->namelen); + if (next_off > end) + break; + + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, off); + ino = be64_to_cpu(dep->inumber); + name.name = dep->name; + name.len = dep->namelen; + name.type = xfs_dir2_data_get_ftype(mp, dep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + break; + } + + xfs_trans_brelse(sc->tp, bp); + return error; +} + +/* Read a leaf-format directory buffer. 
*/ +STATIC int +xchk_read_leaf_dir_buf( + struct xfs_trans *tp, + struct xfs_inode *dp, + struct xfs_da_geometry *geo, + xfs_dir2_off_t *curoff, + struct xfs_buf **bpp) +{ + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec map; + struct xfs_ifork *ifp = xfs_ifork_ptr(dp, XFS_DATA_FORK); + xfs_dablk_t last_da; + xfs_dablk_t map_off; + xfs_dir2_off_t new_off; + + *bpp = NULL; + + /* + * Look for mapped directory blocks at or above the current offset. + * Truncate down to the nearest directory block to start the scanning + * operation. + */ + last_da = xfs_dir2_byte_to_da(geo, XFS_DIR2_LEAF_OFFSET); + map_off = xfs_dir2_db_to_da(geo, xfs_dir2_byte_to_db(geo, *curoff)); + + if (!xfs_iext_lookup_extent(dp, ifp, map_off, &icur, &map)) + return 0; + if (map.br_startoff >= last_da) + return 0; + xfs_trim_extent(&map, map_off, last_da - map_off); + + /* Read the directory block of that first mapping. */ + new_off = xfs_dir2_da_to_byte(geo, map.br_startoff); + if (new_off > *curoff) + *curoff = new_off; + + return xfs_dir3_data_read(tp, dp, map.br_startoff, 0, bpp); +} + +/* Call a function for every entry in a leaf directory. */ +STATIC int +xchk_dir_walk_leaf( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_buf *bp = NULL; + xfs_dir2_off_t curoff = 0; + unsigned int offset = 0; + int error; + + /* Iterate every directory offset in this directory. */ + while (curoff < XFS_DIR2_LEAF_OFFSET) { + struct xfs_name name = { }; + struct xfs_dir2_data_unused *dup; + struct xfs_dir2_data_entry *dep; + xfs_ino_t ino; + unsigned int length; + xfs_dir2_dataptr_t dapos; + + /* + * If we have no buffer, or we're off the end of the + * current buffer, need to get another one. + */ + if (!bp || offset >= geo->blksize) { + if (bp) { + xfs_trans_brelse(sc->tp, bp); + bp = NULL; + } + + error = xchk_read_leaf_dir_buf(sc->tp, dp, geo, &curoff, + &bp); + if (error || !bp) + break; + + /* + * Find our position in the block. + */ + offset = geo->data_entry_offset; + curoff += geo->data_entry_offset; + } + + /* Skip an empty entry. */ + dup = bp->b_addr + offset; + if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { + length = be16_to_cpu(dup->length); + offset += length; + curoff += length; + continue; + } + + /* Otherwise, find the next entry and report it. */ + dep = bp->b_addr + offset; + length = xfs_dir2_data_entsize(mp, dep->namelen); + + dapos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff; + ino = be64_to_cpu(dep->inumber); + name.name = dep->name; + name.len = dep->namelen; + name.type = xfs_dir2_data_get_ftype(mp, dep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + break; + + /* Advance to the next entry. */ + offset += length; + curoff += length; + } + + if (bp) + xfs_trans_brelse(sc->tp, bp); + return error; +} + +/* + * Call a function for every entry in a directory. + * + * Callers must hold the ILOCK. File types are XFS_DIR3_FT_*. 
+ */ +int +xchk_dir_walk( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_da_args args = { + .dp = dp, + .geo = dp->i_mount->m_dir_geo, + .trans = sc->tp, + }; + bool isblock; + int error; + + if (xfs_is_shutdown(dp->i_mount)) + return -EIO; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) + return xchk_dir_walk_sf(sc, dp, dirent_fn, priv); + + /* dir2 functions require that the data fork is loaded */ + error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK); + if (error) + return error; + + error = xfs_dir2_isblock(&args, &isblock); + if (error) + return error; + + if (isblock) + return xchk_dir_walk_block(sc, dp, dirent_fn, priv); + + return xchk_dir_walk_leaf(sc, dp, dirent_fn, priv); +} + +/* + * Look up the inode number for an exact name in a directory. + * + * Callers must hold the ILOCK. File types are XFS_DIR3_FT_*. Names are not + * checked for correctness. + */ +int +xchk_dir_lookup( + struct xfs_scrub *sc, + struct xfs_inode *dp, + const struct xfs_name *name, + xfs_ino_t *ino) +{ + struct xfs_da_args args = { + .dp = dp, + .geo = dp->i_mount->m_dir_geo, + .trans = sc->tp, + .name = name->name, + .namelen = name->len, + .filetype = name->type, + .hashval = xfs_dir2_hashname(dp->i_mount, name), + .whichfork = XFS_DATA_FORK, + .op_flags = XFS_DA_OP_OKNOENT, + }; + bool isblock, isleaf; + int error; + + if (xfs_is_shutdown(dp->i_mount)) + return -EIO; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { + error = xfs_dir2_sf_lookup(&args); + goto out_check_rval; + } + + /* dir2 functions require that the data fork is loaded */ + error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK); + if (error) + return error; + + error = xfs_dir2_isblock(&args, &isblock); + if (error) + return error; + + if (isblock) { + error = xfs_dir2_block_lookup(&args); + goto out_check_rval; + } + + error = xfs_dir2_isleaf(&args, &isleaf); + if (error) + return error; + + if (isleaf) { + error = xfs_dir2_leaf_lookup(&args); + goto out_check_rval; + } + + error = xfs_dir2_node_lookup(&args); + +out_check_rval: + if (error == -EEXIST) + error = 0; + if (!error) + *ino = args.inumber; + return error; +} diff --git a/fs/xfs/scrub/readdir.h b/fs/xfs/scrub/readdir.h new file mode 100644 index 000000000000..55787f4df123 --- /dev/null +++ b/fs/xfs/scrub/readdir.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong + */ +#ifndef __XFS_SCRUB_READDIR_H__ +#define __XFS_SCRUB_READDIR_H__ + +typedef int (*xchk_dirent_fn)(struct xfs_scrub *sc, struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, const struct xfs_name *name, + xfs_ino_t ino, void *priv); + +int xchk_dir_walk(struct xfs_scrub *sc, struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, void *priv); + +int xchk_dir_lookup(struct xfs_scrub *sc, struct xfs_inode *dp, + const struct xfs_name *name, xfs_ino_t *ino); + +#endif /* __XFS_SCRUB_READDIR_H__ */ -- cgit v1.2.3 From d9a94480f978d5fbf1235a12a476f9f39a029ea5 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
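To show how the interface declared in readdir.h composes, here is a hypothetical caller -- a sketch only, not part of any patch -- that counts directory entries pointing at a given inode. Per the walk convention above, a nonzero return from the callback stops the iteration, and -ECANCELED conventionally means "stop, the corruption has already been recorded", which callers squash back to zero.

/* Hypothetical walk callback: count dirents pointing at a target inode. */
struct count_ino_ctx {
	xfs_ino_t	target;
	unsigned int	count;
};

STATIC int
xchk_count_ino_actor(
	struct xfs_scrub	*sc,
	struct xfs_inode	*dp,
	xfs_dir2_dataptr_t	dapos,
	const struct xfs_name	*name,
	xfs_ino_t		ino,
	void			*priv)
{
	struct count_ino_ctx	*ctx = priv;

	if (ino == ctx->target)
		ctx->count++;
	return 0;	/* returning nonzero here would stop the walk */
}

	/* Caller holds the directory ILOCK, as the xchk_dir_walk contract requires. */
	struct count_ino_ctx	ctx = { .target = target_ino };	/* target_ino: assumed */

	error = xchk_dir_walk(sc, dp, xchk_count_ino_actor, &ctx);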
Wong" Date: Tue, 11 Apr 2023 19:00:17 -0700 Subject: xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED In commit 4b80ac64450f, we tried to strengthen the directory scrubber by using the iget call to detect directory entries that point to unallocated inodes. Unfortunately, that commit neglected to pass XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first. If the inode number points to something that isn't even an inode cluster, iget will throw corruption errors and return -EFSCORRUPTED, which means that we fail to mark the directory corrupt. Fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/dir.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index f1cbe7b22688..41f10e1c580c 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -57,13 +57,15 @@ xchk_dir_check_ftype( * eofblocks cleanup (which allocates what would be a nested * transaction), we can't use DONTCACHE here because DONTCACHE * inodes can trigger immediate inactive cleanup of the inode. + * Use UNTRUSTED here to check the allocation status of the inode in + * the inode btrees. * * If _iget returns -EINVAL or -ENOENT then the child inode number is * garbage and the directory is corrupt. If the _iget returns * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, inum, 0, 0, &ip); + error = xfs_iget(mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, &ip); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); -- cgit v1.2.3 From 6bb9209ceebb07fd07cec25af04eed1809c654de Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:18 -0700 Subject: xfs: always check the existence of a dirent's child inode When we're scrubbing directory entries, we always need to iget the child inode to make sure that the inode pointer points to a valid inode. The original directory scrub code (commit a5c4) only set us up to do this for ftype=1 filesystems, which is not sufficient; and then commit 4b80 made it worse by exempting the dot and dotdot entries. Sorta-fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Sorta-fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/dir.c | 73 +++++++++++++++++++++--------------------------------- 1 file changed, 28 insertions(+), 45 deletions(-) diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 41f10e1c580c..6404201d3d36 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -33,52 +33,23 @@ xchk_setup_directory( /* Scrub a directory entry. */ /* Check that an inode's mode matches a given XFS_DIR3_FT_* type. */ -STATIC int +STATIC void xchk_dir_check_ftype( struct xfs_scrub *sc, xfs_fileoff_t offset, - xfs_ino_t inum, + struct xfs_inode *ip, int ftype) { struct xfs_mount *mp = sc->mp; - struct xfs_inode *ip; - int error = 0; if (!xfs_has_ftype(mp)) { if (ftype != XFS_DIR3_FT_UNKNOWN && ftype != XFS_DIR3_FT_DIR) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); - goto out; - } - - /* - * Grab the inode pointed to by the dirent. We release the - * inode before we cancel the scrub transaction. 
Since we're - * don't know a priori that releasing the inode won't trigger - * eofblocks cleanup (which allocates what would be a nested - * transaction), we can't use DONTCACHE here because DONTCACHE - * inodes can trigger immediate inactive cleanup of the inode. - * Use UNTRUSTED here to check the allocation status of the inode in - * the inode btrees. - * - * If _iget returns -EINVAL or -ENOENT then the child inode number is - * garbage and the directory is corrupt. If the _iget returns - * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a - * cross referencing error. Any other error is an operational error. - */ - error = xfs_iget(mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, &ip); - if (error == -EINVAL || error == -ENOENT) { - error = -EFSCORRUPTED; - xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); - goto out; + return; } - if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, offset, &error)) - goto out; if (xfs_mode_to_ftype(VFS_I(ip)->i_mode) != ftype) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); - xfs_irele(ip); -out: - return error; } /* @@ -97,9 +68,9 @@ xchk_dir_actor( void *priv) { struct xfs_mount *mp = dp->i_mount; + struct xfs_inode *ip; xfs_ino_t lookup_ino; xfs_dablk_t offset; - bool checked_ftype = false; int error = 0; offset = xfs_dir2_db_to_da(mp->m_dir_geo, @@ -122,9 +93,6 @@ xchk_dir_actor( if (!strncmp(".", name->name, name->len)) { /* If this is "." then check that the inum matches the dir. */ - if (xfs_has_ftype(mp) && name->type != XFS_DIR3_FT_DIR) - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); - checked_ftype = true; if (ino != dp->i_ino) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); } else if (!strncmp("..", name->name, name->len)) { @@ -132,9 +100,6 @@ xchk_dir_actor( * If this is ".." in the root inode, check that the inum * matches this dir. */ - if (xfs_has_ftype(mp) && name->type != XFS_DIR3_FT_DIR) - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); - checked_ftype = true; if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); } @@ -151,14 +116,32 @@ xchk_dir_actor( return -ECANCELED; } - /* Verify the file type. This function absorbs error codes. */ - if (!checked_ftype) { - error = xchk_dir_check_ftype(sc, offset, lookup_ino, - name->type); - if (error) - goto out; + /* + * Grab the inode pointed to by the dirent. We release the + * inode before we cancel the scrub transaction. Since we're + * don't know a priori that releasing the inode won't trigger + * eofblocks cleanup (which allocates what would be a nested + * transaction), we can't use DONTCACHE here because DONTCACHE + * inodes can trigger immediate inactive cleanup of the inode. + * Use UNTRUSTED here to check the allocation status of the inode in + * the inode btrees. + * + * If _iget returns -EINVAL or -ENOENT then the child inode number is + * garbage and the directory is corrupt. If the _iget returns + * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a + * cross referencing error. Any other error is an operational error. 
+ */ + error = xfs_iget(mp, sc->tp, ino, XFS_IGET_UNTRUSTED, 0, &ip); + if (error == -EINVAL || error == -ENOENT) { + error = -EFSCORRUPTED; + xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); + goto out; } + if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, offset, &error)) + goto out; + xchk_dir_check_ftype(sc, offset, ip, name->type); + xfs_irele(ip); out: if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) return -ECANCELED; -- cgit v1.2.3 From cbab28f4c0719c956fa7b83613a3591c361126c7 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:19 -0700 Subject: xfs: remove xchk_parent_count_parent_dentries This helper is now trivial, so get rid of it. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/parent.c | 42 +++++++++++++----------------------------- 1 file changed, 13 insertions(+), 29 deletions(-) diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index af351c4ee6ec..bbf6492c8e8e 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -63,30 +63,6 @@ xchk_parent_actor( return 0; } -/* Count the number of dentries in the parent dir that point to this inode. */ -STATIC int -xchk_parent_count_parent_dentries( - struct xfs_scrub *sc, - struct xfs_inode *parent, - xfs_nlink_t *nlink) -{ - struct xchk_parent_ctx spc = { - .sc = sc, - .nlink = 0, - }; - uint lock_mode; - int error = 0; - - lock_mode = xfs_ilock_data_map_shared(parent); - error = xchk_dir_walk(sc, parent, xchk_parent_actor, &spc); - xfs_iunlock(parent, lock_mode); - if (error) - return error; - - *nlink = spc.nlink; - return error; -} - /* * Given the inode number of the alleged parent of the inode being * scrubbed, try to validate that the parent has exactly one directory @@ -98,10 +74,14 @@ xchk_parent_validate( xfs_ino_t dnum, bool *try_again) { + struct xchk_parent_ctx spc = { + .sc = sc, + .nlink = 0, + }; struct xfs_mount *mp = sc->mp; struct xfs_inode *dp = NULL; xfs_nlink_t expected_nlink; - xfs_nlink_t nlink; + uint lock_mode; int error = 0; *try_again = false; @@ -156,11 +136,13 @@ xchk_parent_validate( * the child inodes. */ if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { - error = xchk_parent_count_parent_dentries(sc, dp, &nlink); + lock_mode = xfs_ilock_data_map_shared(dp); + error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); + xfs_iunlock(dp, lock_mode); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; - if (nlink != expected_nlink) + if (spc.nlink != expected_nlink) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out_unlock; } @@ -178,7 +160,9 @@ xchk_parent_validate( goto out_rele; /* Go looking for our dentry. */ - error = xchk_parent_count_parent_dentries(sc, dp, &nlink); + lock_mode = xfs_ilock_data_map_shared(dp); + error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); + xfs_iunlock(dp, lock_mode); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; @@ -213,7 +197,7 @@ xchk_parent_validate( * '..' didn't change, so check that there was only one entry * for us in the parent. */ - if (nlink != expected_nlink) + if (spc.nlink != expected_nlink) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); return error; -- cgit v1.2.3 From b049962c0f6eb6fb17e8294721f948285a44a672 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:19 -0700 Subject: xfs: simplify xchk_parent_validate This function is unnecessarily long because it contains code to revalidate a dotdot entry after cycling locks to try to confirm a subdirectory parent pointer. 
Shorten the codebase by making the parent's lookup call do double duty as the revalidation code. This weakeans the efficacy of this scrub function temporarily, but the next patch will resolve this as part of fixing an unhandled race that is the result of the VFS rename locking model not working the way Darrick thought it did. Rename this stupid 'dnum' variable too. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/parent.c | 100 ++++++++++++-------------------------------------- 1 file changed, 23 insertions(+), 77 deletions(-) diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index bbf6492c8e8e..50dc423041ee 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -71,7 +71,7 @@ xchk_parent_actor( STATIC int xchk_parent_validate( struct xfs_scrub *sc, - xfs_ino_t dnum, + xfs_ino_t parent_ino, bool *try_again) { struct xchk_parent_ctx spc = { @@ -86,11 +86,16 @@ xchk_parent_validate( *try_again = false; - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + /* Is this the root dir? Then '..' must point to itself. */ + if (sc->ip == mp->m_rootip) { + if (sc->ip->i_ino != mp->m_sb.sb_rootino || + sc->ip->i_ino != parent_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out; + } /* '..' must not point to ourselves. */ - if (sc->ip->i_ino == dnum) { + if (sc->ip->i_ino == parent_ino) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out; } @@ -115,7 +120,7 @@ xchk_parent_validate( * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp); + error = xfs_iget(mp, sc->tp, parent_ino, XFS_IGET_UNTRUSTED, 0, &dp); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); @@ -135,71 +140,19 @@ xchk_parent_validate( * use nowait here to avoid an ABBA deadlock on the parent and * the child inodes. */ - if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { - lock_mode = xfs_ilock_data_map_shared(dp); - error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); - xfs_iunlock(dp, lock_mode); - if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, - &error)) - goto out_unlock; - if (spc.nlink != expected_nlink) - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out_unlock; - } - - /* - * The game changes if we get here. We failed to lock the parent, - * so we're going to try to verify both pointers while only holding - * one lock so as to avoid deadlocking with something that's actually - * trying to traverse down the directory tree. - */ - xfs_iunlock(sc->ip, sc->ilock_flags); - sc->ilock_flags = 0; - error = xchk_ilock_inverted(dp, XFS_IOLOCK_SHARED); - if (error) + if (!xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { + *try_again = true; goto out_rele; + } - /* Go looking for our dentry. */ lock_mode = xfs_ilock_data_map_shared(dp); error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); xfs_iunlock(dp, lock_mode); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; - /* Drop the parent lock, relock this inode. */ - xfs_iunlock(dp, XFS_IOLOCK_SHARED); - error = xchk_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL); - if (error) - goto out_rele; - sc->ilock_flags = XFS_IOLOCK_EXCL; - - /* - * If we're an unlinked directory, the parent /won't/ have a link - * to us. Otherwise, it should have one link. We have to re-set - * it here because we dropped the lock on sc->ip. - */ - expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 
Shorten the codebase by making the parent's lookup call do double duty as the revalidation code. This weakens the efficacy of this scrub function temporarily, but the next patch will resolve this as part of fixing an unhandled race that is the result of the VFS rename locking model not working the way Darrick thought it did. Rename this stupid 'dnum' variable too. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/parent.c | 100 ++++++++++++-------------------------------- 1 file changed, 23 insertions(+), 77 deletions(-) diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index bbf6492c8e8e..50dc423041ee 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -71,7 +71,7 @@ xchk_parent_actor( STATIC int xchk_parent_validate( struct xfs_scrub *sc, - xfs_ino_t dnum, + xfs_ino_t parent_ino, bool *try_again) { struct xchk_parent_ctx spc = { @@ -86,11 +86,16 @@ xchk_parent_validate( *try_again = false; - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + /* Is this the root dir? Then '..' must point to itself. */ + if (sc->ip == mp->m_rootip) { + if (sc->ip->i_ino != mp->m_sb.sb_rootino || + sc->ip->i_ino != parent_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out; + } /* '..' must not point to ourselves. */ - if (sc->ip->i_ino == dnum) { + if (sc->ip->i_ino == parent_ino) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out; } @@ -115,7 +120,7 @@ xchk_parent_validate( * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp); + error = xfs_iget(mp, sc->tp, parent_ino, XFS_IGET_UNTRUSTED, 0, &dp); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); @@ -135,71 +140,19 @@ xchk_parent_validate( * use nowait here to avoid an ABBA deadlock on the parent and * the child inodes. */ - if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { - lock_mode = xfs_ilock_data_map_shared(dp); - error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); - xfs_iunlock(dp, lock_mode); - if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, - &error)) - goto out_unlock; - if (spc.nlink != expected_nlink) - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out_unlock; - } - - /* - * The game changes if we get here. We failed to lock the parent, - * so we're going to try to verify both pointers while only holding - * one lock so as to avoid deadlocking with something that's actually - * trying to traverse down the directory tree. - */ - xfs_iunlock(sc->ip, sc->ilock_flags); - sc->ilock_flags = 0; - error = xchk_ilock_inverted(dp, XFS_IOLOCK_SHARED); - if (error) + if (!xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { + *try_again = true; goto out_rele; + } - /* Go looking for our dentry. */ lock_mode = xfs_ilock_data_map_shared(dp); error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); xfs_iunlock(dp, lock_mode); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; - /* Drop the parent lock, relock this inode. */ - xfs_iunlock(dp, XFS_IOLOCK_SHARED); - error = xchk_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL); - if (error) - goto out_rele; - sc->ilock_flags = XFS_IOLOCK_EXCL; - - /* - * If we're an unlinked directory, the parent /won't/ have a link - * to us. Otherwise, it should have one link. We have to re-set - * it here because we dropped the lock on sc->ip. - */ - expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ?
0 : 1; - - /* Look up '..' to see if the inode changed. */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out_rele; - - /* Drat, parent changed. Try again! */ - if (dnum != dp->i_ino) { - xfs_irele(dp); - *try_again = true; - return 0; - } - xfs_irele(dp); - - /* - * '..' didn't change, so check that there was only one entry - * for us in the parent. - */ if (spc.nlink != expected_nlink) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - return error; out_unlock: xfs_iunlock(dp, XFS_IOLOCK_SHARED); @@ -215,7 +168,7 @@ xchk_parent( struct xfs_scrub *sc) { struct xfs_mount *mp = sc->mp; - xfs_ino_t dnum; + xfs_ino_t parent_ino; bool try_again; int tries = 0; int error = 0; @@ -243,25 +196,18 @@ xchk_parent( sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - /* Look up '..' */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; - if (!xfs_verify_dir_ino(mp, dnum)) { - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; - } - - /* Is this the root dir? Then '..' must point to itself. */ - if (sc->ip == mp->m_rootip) { - if (sc->ip->i_ino != mp->m_sb.sb_rootino || - sc->ip->i_ino != dnum) + do { + /* Look up '..' */ + error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, + &parent_ino, NULL); + if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) + goto out; + if (!xfs_verify_dir_ino(mp, parent_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; - } + goto out; + } - do { - error = xchk_parent_validate(sc, dnum, &try_again); + error = xchk_parent_validate(sc, parent_ino, &try_again); if (error) goto out; } while (try_again && ++tries < 20); -- cgit v1.2.3 From a03297a0ca9f21800c9b88028a3722715b2eb5ba Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:20 -0700 Subject: xfs: manage inode DONTCACHE status at irele time Right now, there are statements scattered all over the online fsck codebase about how we can't use XFS_IGET_DONTCACHE because of concerns about scrub's unusual practice of releasing inodes with transactions held. However, iget is the wrong place to handle this -- the DONTCACHE state doesn't matter at all until we try to *release* the inode, and here we get things wrong in multiple ways: First, if we /do/ have a transaction, we must NOT drop the inode, because the inode could have dirty pages, dropping the inode will trigger writeback, and writeback can trigger a nested transaction. Second, if the inode already had an active reference and the DONTCACHE flag set, the icache hit when scrub grabs another ref will not clear DONTCACHE. This is sort of by design, since DONTCACHE is now used to initiate cache drops so that sysadmins can change a file's access mode between pagecache and DAX. Third, if we do actually have the last active reference to the inode, we can set DONTCACHE to avoid polluting the cache. This is the /one/ case where we actually want that flag. Create an xchk_irele helper to encode all that logic and switch the online fsck code to use it. Since this now means that nearly all scrubbers use the same xfs_iget flags, we can wrap them too. Signed-off-by: Darrick J. 
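In caller terms, the convention this patch establishes looks like the following minimal sketch (illustrative only; it assumes sc->tp is the live scrub transaction):

	error = xchk_iget(sc, ino, &ip);	/* wraps the usual scrub iget flags */
	if (error)
		return error;
	/* ... examine the child inode ... */
	xchk_irele(sc, ip);	/* never xfs_irele() directly here: dropping
				 * the inode while sc->tp is held must not
				 * recurse into writeback */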
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/common.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++---- fs/xfs/scrub/common.h | 3 +++ fs/xfs/scrub/dir.c | 14 ++++---------- fs/xfs/scrub/parent.c | 13 ++++--------- fs/xfs/scrub/scrub.c | 2 +- 5 files changed, 60 insertions(+), 24 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 813ded91661b..9af653a1d351 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -718,6 +718,16 @@ xchk_checkpoint_log( return 0; } +/* Verify that an inode is allocated ondisk, then return its cached inode. */ +int +xchk_iget( + struct xfs_scrub *sc, + xfs_ino_t inum, + struct xfs_inode **ipp) +{ + return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp); +} + /* * Given an inode and the scrub control structure, grab either the * inode referenced in the control structure or the inode passed in. @@ -743,8 +753,7 @@ xchk_get_inode( /* Look up the inode, see if the generation number matches. */ if (xfs_internal_inum(mp, sc->sm->sm_ino)) return -ENOENT; - error = xfs_iget(mp, NULL, sc->sm->sm_ino, - XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &ip); + error = xchk_iget(sc, sc->sm->sm_ino, &ip); switch (error) { case -ENOENT: /* Inode doesn't exist, just bail out. */ @@ -768,7 +777,7 @@ xchk_get_inode( pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); if (pag) { error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, - XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE); + XFS_IGET_UNTRUSTED); xfs_perag_put(pag); if (error) return -ENOENT; @@ -783,7 +792,7 @@ xchk_get_inode( return error; } if (VFS_I(ip)->i_generation != sc->sm->sm_gen) { - xfs_irele(ip); + xchk_irele(sc, ip); return -ENOENT; } @@ -791,6 +800,41 @@ xchk_get_inode( return 0; } +/* Release an inode, possibly dropping it in the process. */ +void +xchk_irele( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + if (current->journal_info != NULL) { + ASSERT(current->journal_info == sc->tp); + + /* + * If we are in a transaction, we /cannot/ drop the inode + * ourselves, because the VFS will trigger writeback, which + * can require a transaction. Clear DONTCACHE to force the + * inode to the LRU, where someone else can take care of + * dropping it. + * + * Note that when we grabbed our reference to the inode, it + * could have had an active ref and DONTCACHE set if a sysadmin + * is trying to coerce a change in file access mode. icache + * hits do not clear DONTCACHE, so we must do it here. + */ + spin_lock(&VFS_I(ip)->i_lock); + VFS_I(ip)->i_state &= ~I_DONTCACHE; + spin_unlock(&VFS_I(ip)->i_lock); + } else if (atomic_read(&VFS_I(ip)->i_count) == 1) { + /* + * If this is the last reference to the inode and the caller + * permits it, set DONTCACHE to avoid thrashing. + */ + d_mark_dontcache(VFS_I(ip)); + } + + xfs_irele(ip); +} + /* Set us up to scrub a file's contents. */ int xchk_setup_inode_contents( diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 544f86ff8d1d..7e9e8b7b6cb0 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -137,6 +137,9 @@ int xchk_get_inode(struct xfs_scrub *sc); int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks); void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp); +int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp); +void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip); + /* * Don't bother cross-referencing if we already found corruption or cross * referencing discrepancies. 
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 6404201d3d36..0b491784b759 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -117,21 +117,15 @@ xchk_dir_actor( } /* - * Grab the inode pointed to by the dirent. We release the - * inode before we cancel the scrub transaction. Since we're - * don't know a priori that releasing the inode won't trigger - * eofblocks cleanup (which allocates what would be a nested - * transaction), we can't use DONTCACHE here because DONTCACHE - * inodes can trigger immediate inactive cleanup of the inode. - * Use UNTRUSTED here to check the allocation status of the inode in - * the inode btrees. + * Grab the inode pointed to by the dirent. We release the inode + * before we cancel the scrub transaction. * * If _iget returns -EINVAL or -ENOENT then the child inode number is * garbage and the directory is corrupt. If the _iget returns * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, ino, XFS_IGET_UNTRUSTED, 0, &ip); + error = xchk_iget(sc, ino, &ip); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); @@ -141,7 +135,7 @@ xchk_dir_actor( goto out; xchk_dir_check_ftype(sc, offset, ip, name->type); - xfs_irele(ip); + xchk_irele(sc, ip); out: if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) return -ECANCELED; diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index b6c8f6dccc8f..58d5dfb7ea21 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -127,20 +127,15 @@ xchk_parent_validate( expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1; /* - * Grab this parent inode. We release the inode before we - * cancel the scrub transaction. Since we're don't know a - * priori that releasing the inode won't trigger eofblocks - * cleanup (which allocates what would be a nested transaction) - * if the parent pointer erroneously points to a file, we - * can't use DONTCACHE here because DONTCACHE inodes can trigger - * immediate inactive cleanup of the inode. + * Grab the parent directory inode. This must be released before we + * cancel the scrub transaction. * * If _iget returns -EINVAL or -ENOENT then the parent inode number is * garbage and the directory is corrupt. If the _iget returns * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, parent_ino, XFS_IGET_UNTRUSTED, 0, &dp); + error = xchk_iget(sc, parent_ino, &dp); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); @@ -176,7 +171,7 @@ xchk_parent_validate( out_unlock: xfs_iunlock(dp, lock_mode); out_rele: - xfs_irele(dp); + xchk_irele(sc, dp); return error; } diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 787a9096ddef..03ec455318f4 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -181,7 +181,7 @@ xchk_teardown( xfs_iunlock(sc->ip, sc->ilock_flags); if (sc->ip != ip_in && !xfs_internal_inum(sc->mp, sc->ip->i_ino)) - xfs_irele(sc->ip); + xchk_irele(sc, sc->ip); sc->ip = NULL; } if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) -- cgit v1.2.3 From 0916056eba4fd816f8042a3960597c316ea10256 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
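The following patch turns on a subtle VFS locking fact, sketched here for orientation. This is a rough simplification of lock_rename(), under the assumption that parent_a and parent_b are the dentries of the two parent directories; it is not the actual VFS code.

	/* Roughly what the VFS locks for a cross-directory rename: */
	inode_lock(d_inode(parent_a));				/* source parent */
	inode_lock_nested(d_inode(parent_b), I_MUTEX_PARENT2);	/* target parent */
	/*
	 * The subdirectory being moved is NOT locked here, even though the
	 * move rewrites its '..' entry.  A scrubber holding only the
	 * subdirectory's IOLOCK therefore cannot stabilize its dotdot entry;
	 * it must hold the ILOCK of both the child and the alleged parent,
	 * which is what the patch below implements.
	 */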
Wong" Date: Tue, 11 Apr 2023 19:00:20 -0700 Subject: xfs: fix parent pointer scrub racing with subdirectory reparenting Jan Kara pointed out that rename() doesn't lock a subdirectory that is being moved from one parent to another, even though the move requires an update to the subdirectory's dotdot entry. This means that it's *not* sufficient to hold a directory's IOLOCK to stabilize the dotdot entry. We must hold the ILOCK of both the child and the alleged parent, and there's no use in holding the parent's IOLOCK. With that in mind, we can get rid of all the messy code that tries to grab the parent's IOLOCK, which means we don't need to let go of the ILOCK of the directory whose parent we are checking. We still have to use nonblocking mode to take the ILOCK of the alleged parent, so the revalidation loop has to stay. However, we can remove the retry counter, since threads aren't supposed to hold the ILOCK for long periods of time. Remove the inverted ilock helper from the common code since nobody uses it. Remove the entire source of -EDEADLOCK-based "retry harder" scrub executions. Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/ Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/common.c | 22 ---------- fs/xfs/scrub/common.h | 1 - fs/xfs/scrub/parent.c | 118 ++++++++++++++++++++++++-------------------------- 3 files changed, 57 insertions(+), 84 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index dcfe66044d4a..813ded91661b 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -962,28 +962,6 @@ xchk_metadata_inode_forks( return 0; } -/* - * Try to lock an inode in violation of the usual locking order rules. For - * example, trying to get the IOLOCK while in transaction context, or just - * plain breaking AG-order or inode-order inode locking rules. Either way, - * the only way to avoid an ABBA deadlock is to use trylock and back off if - * we can't. - */ -int -xchk_ilock_inverted( - struct xfs_inode *ip, - uint lock_mode) -{ - int i; - - for (i = 0; i < 20; i++) { - if (xfs_ilock_nowait(ip, lock_mode)) - return 0; - delay(1); - } - return -EDEADLOCK; -} - /* Pause background reaping of resources. */ void xchk_stop_reaping( diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 83b1a392930a..544f86ff8d1d 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -148,7 +148,6 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm) } int xchk_metadata_inode_forks(struct xfs_scrub *sc); -int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode); void xchk_stop_reaping(struct xfs_scrub *sc); void xchk_start_reaping(struct xfs_scrub *sc); diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index 50dc423041ee..b6c8f6dccc8f 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -64,15 +64,37 @@ xchk_parent_actor( } /* - * Given the inode number of the alleged parent of the inode being - * scrubbed, try to validate that the parent has exactly one directory - * entry pointing back to the inode being scrubbed. + * Try to lock a parent directory for checking dirents. Returns the inode + * flags for the locks we now hold, or zero if we failed. 
+ */ +STATIC unsigned int +xchk_parent_ilock_dir( + struct xfs_inode *dp) +{ + if (!xfs_ilock_nowait(dp, XFS_ILOCK_SHARED)) + return 0; + + if (!xfs_need_iread_extents(&dp->i_df)) + return XFS_ILOCK_SHARED; + + xfs_iunlock(dp, XFS_ILOCK_SHARED); + + if (!xfs_ilock_nowait(dp, XFS_ILOCK_EXCL)) + return 0; + + return XFS_ILOCK_EXCL; +} + +/* + * Given the inode number of the alleged parent of the inode being scrubbed, + * try to validate that the parent has exactly one directory entry pointing + * back to the inode being scrubbed. Returns -EAGAIN if we need to revalidate + * the dotdot entry. */ STATIC int xchk_parent_validate( struct xfs_scrub *sc, - xfs_ino_t parent_ino, - bool *try_again) + xfs_ino_t parent_ino) { struct xchk_parent_ctx spc = { .sc = sc, @@ -81,23 +103,21 @@ xchk_parent_validate( struct xfs_mount *mp = sc->mp; struct xfs_inode *dp = NULL; xfs_nlink_t expected_nlink; - uint lock_mode; + unsigned int lock_mode; int error = 0; - *try_again = false; - /* Is this the root dir? Then '..' must point to itself. */ if (sc->ip == mp->m_rootip) { if (sc->ip->i_ino != mp->m_sb.sb_rootino || sc->ip->i_ino != parent_ino) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } /* '..' must not point to ourselves. */ if (sc->ip->i_ino == parent_ino) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } /* @@ -124,41 +144,39 @@ xchk_parent_validate( if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); - goto out; + return error; } if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; + return error; if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out_rele; } - /* - * We prefer to keep the inode locked while we lock and search - * its alleged parent for a forward reference. If we can grab - * the iolock, validate the pointers and we're done. We must - * use nowait here to avoid an ABBA deadlock on the parent and - * the child inodes. - */ - if (!xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { - *try_again = true; + lock_mode = xchk_parent_ilock_dir(dp); + if (!lock_mode) { + xfs_iunlock(sc->ip, XFS_ILOCK_EXCL); + xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + error = -EAGAIN; goto out_rele; } - lock_mode = xfs_ilock_data_map_shared(dp); + /* Look for a directory entry in the parent pointing to the child. */ error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); - xfs_iunlock(dp, lock_mode); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; + /* + * Ensure that the parent has as many links to the child as the child + * thinks it has to the parent. + */ if (spc.nlink != expected_nlink) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); out_unlock: - xfs_iunlock(dp, XFS_IOLOCK_SHARED); + xfs_iunlock(dp, lock_mode); out_rele: xfs_irele(dp); -out: return error; } @@ -169,8 +187,6 @@ xchk_parent( { struct xfs_mount *mp = sc->mp; xfs_ino_t parent_ino; - bool try_again; - int tries = 0; int error = 0; /* @@ -183,49 +199,29 @@ xchk_parent( /* We're not a special inode, are we? */ if (!xfs_verify_dir_ino(mp, sc->ip->i_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } - /* - * The VFS grabs a read or write lock via i_rwsem before it reads - * or writes to a directory. If we've gotten this far we've - * already obtained IOLOCK_EXCL, which (since 4.10) is the same as - * getting a write lock on i_rwsem. 
Therefore, it is safe for us - * to drop the ILOCK here in order to do directory lookups. - */ - sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - do { + if (xchk_should_terminate(sc, &error)) + break; + /* Look up '..' */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, - &parent_ino, NULL); + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, + &parent_ino); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; + return error; if (!xfs_verify_dir_ino(mp, parent_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } - error = xchk_parent_validate(sc, parent_ino, &try_again); - if (error) - goto out; - } while (try_again && ++tries < 20); + /* + * Check that the dotdot entry points to a parent directory + * containing a dirent pointing to this subdirectory. + */ + error = xchk_parent_validate(sc, parent_ino); + } while (error == -EAGAIN); - /* - * We gave it our best shot but failed, so mark this scrub - * incomplete. Userspace can decide if it wants to try again. - */ - if (try_again && tries == 20) - xchk_set_incomplete(sc); -out: - /* - * If we failed to lock the parent inode even after a retry, just mark - * this scrub incomplete and return. - */ - if ((sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) { - error = 0; - xchk_set_incomplete(sc); - } return error; } -- cgit v1.2.3 From 302436c27c3fc61c1dab83f4c995dec12eb43161 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:21 -0700 Subject: xfs: fix an inode lookup race in xchk_get_inode In commit d658e, we tried to improve the robustnes of xchk_get_inode in the face of EINVAL returns from iget by calling xfs_imap to see if the inobt itself thinks that the inode is allocated. Unfortunately, that commit didn't consider the possibility that the inode gets allocated after iget but before imap. In this case, the imap call will succeed, but we turn that into a corruption error and tell userspace the inode is corrupt. Avoid this false corruption report by grabbing the AGI header and retrying the iget before calling imap. If the iget succeeds, we can proceed with the usual scrub-by-handle code. Fix all the incorrect comments too, since unreadable/corrupt inodes no longer result in EINVAL returns. Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/common.c | 233 +++++++++++++++++++++++++++++++++++++++++--------- fs/xfs/scrub/common.h | 4 + fs/xfs/xfs_icache.c | 3 +- fs/xfs/xfs_icache.h | 11 ++- 4 files changed, 205 insertions(+), 46 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 9af653a1d351..3fd437430cad 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -643,6 +643,14 @@ xchk_ag_init( /* Per-scrubber setup functions */ +void +xchk_trans_cancel( + struct xfs_scrub *sc) +{ + xfs_trans_cancel(sc->tp); + sc->tp = NULL; +} + /* * Grab an empty transaction so that we can re-grab locked buffers if * one of our btrees turns out to be cyclic. @@ -728,6 +736,101 @@ xchk_iget( return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp); } +/* + * Try to grab an inode in a manner that avoids races with physical inode + * allocation. If we can't, return the locked AGI buffer so that the caller + * can single-step the loading process to see where things went wrong. + * Callers must have a valid scrub transaction. 
Therefore, it is safe for us - * to drop the ILOCK here in order to do directory lookups. - */ - sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - do { + if (xchk_should_terminate(sc, &error)) + break; + /* Look up '..' */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, - &parent_ino, NULL); + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, + &parent_ino); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; + return error; if (!xfs_verify_dir_ino(mp, parent_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } - error = xchk_parent_validate(sc, parent_ino, &try_again); - if (error) - goto out; - } while (try_again && ++tries < 20); + /* + * Check that the dotdot entry points to a parent directory + * containing a dirent pointing to this subdirectory. + */ + error = xchk_parent_validate(sc, parent_ino); + } while (error == -EAGAIN); - /* - * We gave it our best shot but failed, so mark this scrub - * incomplete. Userspace can decide if it wants to try again. - */ - if (try_again && tries == 20) - xchk_set_incomplete(sc); -out: - /* - * If we failed to lock the parent inode even after a retry, just mark - * this scrub incomplete and return. - */ - if ((sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) { - error = 0; - xchk_set_incomplete(sc); - } return error; } -- cgit v1.2.3 From 302436c27c3fc61c1dab83f4c995dec12eb43161 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:21 -0700 Subject: xfs: fix an inode lookup race in xchk_get_inode In commit d658e, we tried to improve the robustness of xchk_get_inode in the face of EINVAL returns from iget by calling xfs_imap to see if the inobt itself thinks that the inode is allocated. Unfortunately, that commit didn't consider the possibility that the inode gets allocated after iget but before imap. In this case, the imap call will succeed, but we turn that into a corruption error and tell userspace the inode is corrupt. Avoid this false corruption report by grabbing the AGI header and retrying the iget before calling imap. If the iget succeeds, we can proceed with the usual scrub-by-handle code. Fix all the incorrect comments too, since unreadable/corrupt inodes no longer result in EINVAL returns. Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode") Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/common.c | 233 +++++++++++++++++++++++++++++++++++++++++--------- fs/xfs/scrub/common.h | 4 + fs/xfs/xfs_icache.c | 3 +- fs/xfs/xfs_icache.h | 11 ++- 4 files changed, 205 insertions(+), 46 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 9af653a1d351..3fd437430cad 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -643,6 +643,14 @@ xchk_ag_init( /* Per-scrubber setup functions */ +void +xchk_trans_cancel( + struct xfs_scrub *sc) +{ + xfs_trans_cancel(sc->tp); + sc->tp = NULL; +} + /* * Grab an empty transaction so that we can re-grab locked buffers if * one of our btrees turns out to be cyclic. @@ -728,6 +736,101 @@ xchk_iget( return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp); } +/* + * Try to grab an inode in a manner that avoids races with physical inode + * allocation. If we can't, return the locked AGI buffer so that the caller + * can single-step the loading process to see where things went wrong. + * Callers must have a valid scrub transaction.
+ * + * If the iget succeeds, return 0, a NULL AGI, and the inode. + * + * If the iget fails, return the error, the locked AGI, and a NULL inode. This + * can include -EINVAL and -ENOENT for invalid inode numbers or inodes that are + * no longer allocated; or any other corruption or runtime error. + * + * If the AGI read fails, return the error, a NULL AGI, and NULL inode. + * + * If a fatal signal is pending, return -EINTR, a NULL AGI, and a NULL inode. + */ +int +xchk_iget_agi( + struct xfs_scrub *sc, + xfs_ino_t inum, + struct xfs_buf **agi_bpp, + struct xfs_inode **ipp) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = sc->tp; + struct xfs_perag *pag; + int error; + + ASSERT(sc->tp != NULL); + +again: + *agi_bpp = NULL; + *ipp = NULL; + error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * Attach the AGI buffer to the scrub transaction to avoid deadlocks + * in the iget cache miss path. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum)); + error = xfs_ialloc_read_agi(pag, tp, agi_bpp); + xfs_perag_put(pag); + if (error) + return error; + + error = xfs_iget(mp, tp, inum, + XFS_IGET_NORETRY | XFS_IGET_UNTRUSTED, 0, ipp); + if (error == -EAGAIN) { + /* + * The inode may be in core but temporarily unavailable and may + * require the AGI buffer before it can be returned. Drop the + * AGI buffer and retry the lookup. + * + * Incore lookup will fail with EAGAIN on a cache hit if the + * inode is queued to the inactivation list. The inactivation + * worker may remove the inode from the unlinked list and hence + * needs the AGI. + * + * Hence xchk_iget_agi() needs to drop the AGI lock on EAGAIN + * to allow inodegc to make progress and move the inode to + * IRECLAIMABLE state where xfs_iget will be able to return it + * again if it can lock the inode. + */ + xfs_trans_brelse(tp, *agi_bpp); + delay(1); + goto again; + } + if (error) + return error; + + /* We got the inode, so we can release the AGI. */ + ASSERT(*ipp != NULL); + xfs_trans_brelse(tp, *agi_bpp); + *agi_bpp = NULL; + return 0; +} + +/* Install an inode that we opened by handle for scrubbing. */ +static int +xchk_install_handle_inode( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + if (VFS_I(ip)->i_generation != sc->sm->sm_gen) { + xchk_irele(sc, ip); + return -ENOENT; + } + + sc->ip = ip; + return 0; +} + /* * Given an inode and the scrub control structure, grab either the * inode referenced in the control structure or the inode passed in. @@ -740,64 +843,112 @@ xchk_get_inode( struct xfs_imap imap; struct xfs_mount *mp = sc->mp; struct xfs_perag *pag; + struct xfs_buf *agi_bp; struct xfs_inode *ip_in = XFS_I(file_inode(sc->file)); struct xfs_inode *ip = NULL; + xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino); int error; + ASSERT(sc->tp == NULL); + /* We want to scan the inode we already had opened. */ if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) { sc->ip = ip_in; return 0; } - /* Look up the inode, see if the generation number matches. */ + /* Reject internal metadata files and obviously bad inode numbers. */ if (xfs_internal_inum(mp, sc->sm->sm_ino)) return -ENOENT; + if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) + return -ENOENT; + + /* Try a regular untrusted iget. */ error = xchk_iget(sc, sc->sm->sm_ino, &ip); - switch (error) { - case -ENOENT: - /* Inode doesn't exist, just bail out. */ - return error; - case 0: - /* Got an inode, continue. 
*/ - break; - case -EINVAL: - /* - * -EINVAL with IGET_UNTRUSTED could mean one of several - * things: userspace gave us an inode number that doesn't - * correspond to fs space, or doesn't have an inobt entry; - * or it could simply mean that the inode buffer failed the - * read verifiers. - * - * Try just the inode mapping lookup -- if it succeeds, then - * the inode buffer verifier failed and something needs fixing. - * Otherwise, we really couldn't find it so tell userspace - * that it no longer exists. - */ - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); - if (pag) { - error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, - XFS_IGET_UNTRUSTED); - xfs_perag_put(pag); - if (error) - return -ENOENT; - } - error = -EFSCORRUPTED; - fallthrough; - default: - trace_xchk_op_error(sc, - XFS_INO_TO_AGNO(mp, sc->sm->sm_ino), - XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), - error, __return_address); + if (!error) + return xchk_install_handle_inode(sc, ip); + if (error == -ENOENT) return error; + if (error != -EINVAL) + goto out_error; + + /* + * EINVAL with IGET_UNTRUSTED probably means one of several things: + * userspace gave us an inode number that doesn't correspond to fs + * space; the inode btree lacks a record for this inode; or there is a + * record, and it says this inode is free. + * + * We want to look up this inode in the inobt to distinguish two + * scenarios: (1) the inobt says the inode is free, in which case + * there's nothing to do; and (2) the inobt says the inode is + * allocated, but loading it failed due to corruption. + * + * Allocate a transaction and grab the AGI to prevent inobt activity + * in this AG. Retry the iget in case someone allocated a new inode + * after the first iget failed. + */ + error = xchk_trans_alloc(sc, 0); + if (error) + goto out_error; + + error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip); + if (error == 0) { + /* Actually got the inode, so install it. */ + xchk_trans_cancel(sc); + return xchk_install_handle_inode(sc, ip); } - if (VFS_I(ip)->i_generation != sc->sm->sm_gen) { - xchk_irele(sc, ip); - return -ENOENT; + if (error == -ENOENT) + goto out_gone; + if (error != -EINVAL) + goto out_cancel; + + /* Ensure that we have protected against inode allocation/freeing. */ + if (agi_bp == NULL) { + ASSERT(agi_bp != NULL); + error = -ECANCELED; + goto out_cancel; } - sc->ip = ip; - return 0; + /* + * Untrusted iget failed a second time. Let's try an inobt lookup. + * If the inobt says the inode cannot exist inside the filesystem or + * is not allocated, return ENOENT to signal that the check can be + * skipped. + * + * If the lookup returns corruption, we'll mark this inode corrupt and + * exit to userspace. There's little chance of fixing anything until + * the inobt is straightened out, but there's nothing we can do here. + * + * If the lookup encounters any other error, exit to userspace. + * + * If the lookup succeeds, something else must be very wrong in the fs + * such that setting up the incore inode failed in some strange way. + * Treat those as corruptions.
+ */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); + if (!pag) { + error = -EFSCORRUPTED; + goto out_cancel; + } + + error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, + XFS_IGET_UNTRUSTED); + xfs_perag_put(pag); + if (error == -EINVAL || error == -ENOENT) + goto out_gone; + if (!error) + error = -EFSCORRUPTED; + +out_cancel: + xchk_trans_cancel(sc); +out_error: + trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), + error, __return_address); + return error; +out_gone: + /* The file is gone, so there's nothing to check. */ + xchk_trans_cancel(sc); + return -ENOENT; } /* Release an inode, possibly dropping it in the process. */ diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 7e9e8b7b6cb0..5c76614c2c04 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -32,6 +32,8 @@ xchk_should_terminate( } int xchk_trans_alloc(struct xfs_scrub *sc, uint resblks); +void xchk_trans_cancel(struct xfs_scrub *sc); + bool xchk_process_error(struct xfs_scrub *sc, xfs_agnumber_t agno, xfs_agblock_t bno, int *error); bool xchk_fblock_process_error(struct xfs_scrub *sc, int whichfork, @@ -138,6 +140,8 @@ int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks); void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp); int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp); +int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum, + struct xfs_buf **agi_bpp, struct xfs_inode **ipp); void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip); /* diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index c9a7e270a428..351849fc18ff 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -767,7 +767,8 @@ again: return 0; out_error_or_again: - if (!(flags & XFS_IGET_INCORE) && error == -EAGAIN) { + if (!(flags & (XFS_IGET_INCORE | XFS_IGET_NORETRY)) && + error == -EAGAIN) { delay(1); goto again; } diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 6cd180721659..87910191a9dd 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -34,10 +34,13 @@ struct xfs_icwalk { /* * Flags for xfs_iget() */ -#define XFS_IGET_CREATE 0x1 -#define XFS_IGET_UNTRUSTED 0x2 -#define XFS_IGET_DONTCACHE 0x4 -#define XFS_IGET_INCORE 0x8 /* don't read from disk or reinit */ +#define XFS_IGET_CREATE (1U << 0) +#define XFS_IGET_UNTRUSTED (1U << 1) +#define XFS_IGET_DONTCACHE (1U << 2) +/* don't read from disk or reinit */ +#define XFS_IGET_INCORE (1U << 3) +/* Return -EAGAIN immediately if the inode is unavailable. */ +#define XFS_IGET_NORETRY (1U << 4) int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, uint flags, uint lock_flags, xfs_inode_t **ipp); -- cgit v1.2.3 From 46e0dd89659923dd02cfa45080675fc4f0926528 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:21 -0700 Subject: xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing Dave Chinner suggested renaming this function to make more obvious what it does. The function returns an incore inode to callers that want to scrub a metadata structure that hangs off an inode. If the iget fails with EINVAL, it will single-step the loading process to distinguish between actually free inodes or impossible inumbers (ENOENT); discrepancies between the inobt freemask and the free status in the inode record (EFSCORRUPTED). Any other negative errno is returned unchanged. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 2 +- fs/xfs/scrub/common.c | 12 +++++++----- fs/xfs/scrub/common.h | 2 +- fs/xfs/scrub/inode.c | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 2412dcf0fa9a..9cf66a5c2376 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -34,7 +34,7 @@ xchk_setup_inode_bmap( if (xchk_need_intent_drain(sc)) xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); - error = xchk_get_inode(sc); + error = xchk_iget_for_scrubbing(sc); if (error) goto out; diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 3fd437430cad..8dd9ab31ebc6 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -832,12 +832,14 @@ xchk_install_handle_inode( } /* - * Given an inode and the scrub control structure, grab either the - * inode referenced in the control structure or the inode passed in. - * The inode is not locked. + * In preparation to scrub metadata structures that hang off of an inode, + * grab either the inode referenced in the scrub control structure or the + * inode passed in. If the inumber does not reference an allocated inode + * record, the function returns ENOENT to end the scrub early. The inode + * is not locked. */ int -xchk_get_inode( +xchk_iget_for_scrubbing( struct xfs_scrub *sc) { struct xfs_imap imap; @@ -994,7 +996,7 @@ xchk_setup_inode_contents( { int error; - error = xchk_get_inode(sc); + error = xchk_iget_for_scrubbing(sc); if (error) return error; diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index 5c76614c2c04..bfe4deb2c33d 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -135,7 +135,7 @@ int xchk_count_rmap_ownedby_ag(struct xfs_scrub *sc, struct xfs_btree_cur *cur, const struct xfs_owner_info *oinfo, xfs_filblks_t *blocks); int xchk_setup_ag_btree(struct xfs_scrub *sc, bool force_log); -int xchk_get_inode(struct xfs_scrub *sc); +int xchk_iget_for_scrubbing(struct xfs_scrub *sc); int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks); void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp); diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 2db96c8a71dc..424a35766732 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -39,7 +39,7 @@ xchk_setup_inode( * Try to get the inode. If the verifiers fail, we try again * in raw mode. */ - error = xchk_get_inode(sc); + error = xchk_iget_for_scrubbing(sc); switch (error) { case 0: break; -- cgit v1.2.3 From 38bb13108479f5cac955bb291ea6aa6d24268f4f Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:22 -0700 Subject: xfs: retain the AGI when we can't iget an inode to scrub the core xchk_get_inode is not quite the right function to be calling from the inode scrubber setup function. The common get_inode function either gets an inode and installs it in the scrub context, or it returns an error code explaining what happened. This is acceptable for most file scrubbers because it is not in their scope to fix corruptions in the inode core and fork areas that cause iget to fail. Dealing with these problems is within the scope of the inode scrubber, however. If iget fails with EFSCORRUPTED, we need to xchk_inode to flag that as corruption. Since we can't get our hands on an incore inode, we need to hold the AGI to prevent inode allocation activity so that nothing changes in the inode metadata. 
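For reference, the shape of that "pin the AGI, then retry the iget" pattern can be condensed into the following sketch. This is an illustration only, not part of the patch: the function name is invented, error unwinding is abbreviated, and the real logic lives in xchk_iget_agi and the setup functions touched below. /* * Illustrative sketch: freeze inode allocation in this AG by holding the * AGI, then retry an untrusted iget. Condensed from the helpers added * earlier in this series; not actual kernel code. */ static int xchk_example_iget_pinned( struct xfs_scrub *sc, xfs_ino_t ino) { struct xfs_buf *agi_bp; struct xfs_inode *ip; int error; /* Empty scrub transaction so that the AGI can be attached to it. */ error = xchk_trans_alloc(sc, 0); if (error) return error; /* Grab the AGI and retry the iget under its protection. */ error = xchk_iget_agi(sc, ino, &agi_bp, &ip); if (error == 0) { /* The retry worked; drop the AGI and install the inode. */ xchk_trans_cancel(sc); return xchk_install_handle_inode(sc, ip); } /* * On failure, the locked AGI (if it was read in) stays attached to * sc->tp, so nobody can allocate or free inodes while the caller * works out what the failure means. Unwind with xchk_trans_cancel. */ return error; }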
Looking ahead to the inode core repair patches, we will also need to hold the AGI buffer into xrep_inode so that we can make modifications to the xfs_dinode structure without any other thread swooping in to allocate or free the inode. Adapt xchk_get_inode into xchk_setup_inode, since this is a one-off use case where the error codes we check for are a little different, and the return state is much different from the common function. xchk_setup_inode prepares to check or repair an inode record, so it must continue the scrub operation even if the inode/inobt verifiers cause xfs_iget to return EFSCORRUPTED. This is done by attaching the locked AGI buffer to the scrub transaction and returning 0 to move on to the actual scrub. (Later, the online inode repair code will also want the xfs_imap structure so that it can reset the ondisk xfs_dinode structure.) xchk_get_inode retrieves an inode on behalf of a scrubber that operates on an incore inode -- data/attr/cow forks, directories, xattrs, symlinks, parent pointers, etc. If the inode/inobt verifiers fail and xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the caller should fix the inode first) and drop everything we acquired along the way. A behavior common to both functions is that it's possible that xfs_scrub asked for a scrub-by-handle concurrent with the inode being freed or the passed-in inumber is invalid. In this case, we call xfs_imap to see if the inobt index thinks the inode is allocated, and return ENOENT ("nothing to check here") to userspace if this is not the case. The imap lookup is why both functions call xchk_iget_agi. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/common.c | 2 +- fs/xfs/scrub/common.h | 1 + fs/xfs/scrub/inode.c | 177 +++++++++++++++++++++++++++++++++++++++++++------- 3 files changed, 156 insertions(+), 24 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 8dd9ab31ebc6..b3ba87c4bc79 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -817,7 +817,7 @@ again: } /* Install an inode that we opened by handle for scrubbing. */ -static int +int xchk_install_handle_inode( struct xfs_scrub *sc, struct xfs_inode *ip) diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index bfe4deb2c33d..18b5f2b62f13 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -143,6 +143,7 @@ int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp); int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_buf **agi_bpp, struct xfs_inode **ipp); void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip); +int xchk_install_handle_inode(struct xfs_scrub *sc, struct xfs_inode *ip); /* * Don't bother cross-referencing if we already found corruption or cross diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 424a35766732..74ded772cb8f 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -11,8 +11,11 @@ #include "xfs_mount.h" #include "xfs_btree.h" #include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_ag.h" #include "xfs_inode.h" #include "xfs_ialloc.h" +#include "xfs_icache.h" #include "xfs_da_format.h" #include "xfs_reflink.h" #include "xfs_rmap.h" @@ -20,48 +23,176 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" +#include "scrub/trace.h" + +/* Prepare the attached inode for scrubbing.
*/ +static inline int +xchk_prepare_iscrub( + struct xfs_scrub *sc) +{ + int error; + + sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + xfs_ilock(sc->ip, sc->ilock_flags); + + error = xchk_trans_alloc(sc, 0); + if (error) + return error; + + sc->ilock_flags |= XFS_ILOCK_EXCL; + xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + return 0; +} + +/* Install this scrub-by-handle inode and prepare it for scrubbing. */ +static inline int +xchk_install_handle_iscrub( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + int error; + + error = xchk_install_handle_inode(sc, ip); + if (error) + return error; + + return xchk_prepare_iscrub(sc); +} /* - * Grab total control of the inode metadata. It doesn't matter here if - * the file data is still changing; exclusive access to the metadata is - * the goal. + * Grab total control of the inode metadata. In the best case, we grab the + * incore inode and take all locks on it. If the incore inode cannot be + * constructed due to corruption problems, lock the AGI so that we can single + * step the loading process to fix everything that can go wrong. */ int xchk_setup_inode( struct xfs_scrub *sc) { + struct xfs_imap imap; + struct xfs_inode *ip; + struct xfs_mount *mp = sc->mp; + struct xfs_inode *ip_in = XFS_I(file_inode(sc->file)); + struct xfs_buf *agi_bp; + struct xfs_perag *pag; + xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino); int error; if (xchk_need_intent_drain(sc)) xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + /* We want to scan the opened inode, so lock it and exit. */ + if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) { + sc->ip = ip_in; + return xchk_prepare_iscrub(sc); + } + + /* Reject internal metadata files and obviously bad inode numbers. */ + if (xfs_internal_inum(mp, sc->sm->sm_ino)) + return -ENOENT; + if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) + return -ENOENT; + + /* Try a regular untrusted iget. */ + error = xchk_iget(sc, sc->sm->sm_ino, &ip); + if (!error) + return xchk_install_handle_iscrub(sc, ip); + if (error == -ENOENT) + return error; + if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL) + goto out_error; + /* - * Try to get the inode. If the verifiers fail, we try again - * in raw mode. + * EINVAL with IGET_UNTRUSTED probably means one of several things: + * userspace gave us an inode number that doesn't correspond to fs + * space; the inode btree lacks a record for this inode; or there is + * a record, and it says this inode is free. + * + * EFSCORRUPTED/EFSBADCRC could mean that the inode was mappable, but + * some other metadata corruption (e.g. inode forks) prevented + * instantiation of the incore inode. Or it could mean the inobt is + * corrupt. + * + * We want to look up this inode in the inobt directly to distinguish + * three different scenarios: (1) the inobt says the inode is free, + * in which case there's nothing to do; (2) the inobt is corrupt so we + * should flag the corruption and exit to userspace to let it fix the + * inobt; and (3) the inobt says the inode is allocated, but loading it + * failed due to corruption. + * + * Allocate a transaction and grab the AGI to prevent inobt activity in + * this AG. Retry the iget in case someone allocated a new inode after + * the first iget failed. 
*/ - error = xchk_iget_for_scrubbing(sc); - switch (error) { - case 0: - break; - case -EFSCORRUPTED: - case -EFSBADCRC: - return xchk_trans_alloc(sc, 0); - default: - return error; + error = xchk_trans_alloc(sc, 0); + if (error) + goto out_error; + + error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip); + if (error == 0) { + /* Actually got the incore inode, so install it and proceed. */ + xchk_trans_cancel(sc); + return xchk_install_handle_iscrub(sc, ip); + } + if (error == -ENOENT) + goto out_gone; + if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL) + goto out_cancel; + + /* Ensure that we have protected against inode allocation/freeing. */ + if (agi_bp == NULL) { + ASSERT(agi_bp != NULL); + error = -ECANCELED; + goto out_cancel; } - /* Got the inode, lock it and we're ready to go. */ - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; - xfs_ilock(sc->ip, sc->ilock_flags); - error = xchk_trans_alloc(sc, 0); + /* + * Untrusted iget failed a second time. Let's try an inobt lookup. + * If the inobt doesn't think this is an allocated inode then we'll + * return ENOENT to signal that the check can be skipped. + * + * If the lookup signals corruption, we'll mark this inode corrupt and + * exit to userspace. There's little chance of fixing anything until + * the inobt is straightened out, but there's nothing we can do here. + * + * If the lookup encounters a runtime error, exit to userspace. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); + if (!pag) { + error = -EFSCORRUPTED; + goto out_cancel; + } + + error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, + XFS_IGET_UNTRUSTED); + xfs_perag_put(pag); + if (error == -EINVAL || error == -ENOENT) + goto out_gone; if (error) - goto out; - sc->ilock_flags |= XFS_ILOCK_EXCL; - xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + goto out_cancel; -out: - /* scrub teardown will unlock and release the inode for us */ + /* + * The lookup succeeded. Chances are the ondisk inode is corrupt and + * preventing iget from reading it. Retain the scrub transaction and + * the AGI buffer to prevent anyone from allocating or freeing inodes. + * This ensures that we preserve the inconsistency between the inobt + * saying the inode is allocated and the icache being unable to load + * the inode until we can flag the corruption in xchk_inode. The + * scrub function has to note the corruption, since we're not really + * supposed to do that from the setup function. + */ + return 0; + +out_cancel: + xchk_trans_cancel(sc); +out_error: + trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), + error, __return_address); return error; +out_gone: + /* The file is gone, so there's nothing to check. */ + xchk_trans_cancel(sc); + return -ENOENT; } /* Inode core */ -- cgit v1.2.3 From 1fc7a0597d237c17b6501f8c33b76d3eaaae9079 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:22 -0700 Subject: xfs: don't take the MMAPLOCK when scrubbing file metadata The MMAPLOCK stabilizes mappings in a file's pagecache. Therefore, we do not need it to check directories, symlinks, extended attributes, or file-based metadata. Reduce its usage to the one case that requires it, which is when we want to scrub the data fork of a regular file. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 7 +++++-- fs/xfs/scrub/common.c | 11 ++++++++--- fs/xfs/scrub/inode.c | 2 +- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 9cf66a5c2376..e485a546a758 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -38,8 +38,8 @@ xchk_setup_inode_bmap( if (error) goto out; - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; - xfs_ilock(sc->ip, sc->ilock_flags); + sc->ilock_flags = XFS_IOLOCK_EXCL; + xfs_ilock(sc->ip, XFS_IOLOCK_EXCL); /* * We don't want any ephemeral data fork updates sitting around @@ -50,6 +50,9 @@ xchk_setup_inode_bmap( sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) { struct address_space *mapping = VFS_I(sc->ip)->i_mapping; + sc->ilock_flags |= XFS_MMAPLOCK_EXCL; + xfs_ilock(sc->ip, XFS_MMAPLOCK_EXCL); + inode_dio_wait(VFS_I(sc->ip)); /* diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index b3ba87c4bc79..9aa79665c608 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -988,7 +988,11 @@ xchk_irele( xfs_irele(ip); } -/* Set us up to scrub a file's contents. */ +/* + * Set us up to scrub metadata mapped by a file's fork. Callers must not use + * this to operate on user-accessible regular file data because the MMAPLOCK is + * not taken. + */ int xchk_setup_inode_contents( struct xfs_scrub *sc, @@ -1000,9 +1004,10 @@ xchk_setup_inode_contents( if (error) return error; - /* Got the inode, lock it and we're ready to go. */ - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + /* Lock the inode so the VFS cannot touch this file. */ + sc->ilock_flags = XFS_IOLOCK_EXCL; xfs_ilock(sc->ip, sc->ilock_flags); + error = xchk_trans_alloc(sc, resblks); if (error) goto out; diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 74ded772cb8f..3e1e02e340a6 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -32,7 +32,7 @@ xchk_prepare_iscrub( { int error; - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + sc->ilock_flags = XFS_IOLOCK_EXCL; xfs_ilock(sc->ip, sc->ilock_flags); error = xchk_trans_alloc(sc, 0); -- cgit v1.2.3 From 971ee3a6706abf1074349c124922e4e4d513fa45 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:23 -0700 Subject: xfs: change bmap scrubber to store the previous mapping Convert the inode data/attr/cow fork scrubber to remember the entire previous mapping, not just the next expected offset. No behavior changes here, but this will enable some better checking in subsequent patches. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index e485a546a758..7b4d0e2736a2 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -96,11 +96,23 @@ out: struct xchk_bmap_info { struct xfs_scrub *sc; + + /* Incore extent tree cursor */ struct xfs_iext_cursor icur; - xfs_fileoff_t lastoff; + + /* Previous fork mapping that we examined */ + struct xfs_bmbt_irec prev_rec; + + /* Is this a realtime fork? */ bool is_rt; + + /* May mappings point to shared space? */ bool is_shared; + + /* Was the incore extent tree loaded? */ bool was_loaded; + + /* Which inode fork are we checking? */ int whichfork; }; @@ -405,7 +417,8 @@ xchk_bmap_iextent( * Check for out-of-order extents. This record could have come * from the incore list, for which there is no ordering check. 
*/ - if (irec->br_startoff < info->lastoff) + if (irec->br_startoff < info->prev_rec.br_startoff + + info->prev_rec.br_blockcount) xchk_fblock_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -712,7 +725,8 @@ xchk_bmap_iextent_delalloc( * Check for out-of-order extents. This record could have come * from the incore list, for which there is no ordering check. */ - if (irec->br_startoff < info->lastoff) + if (irec->br_startoff < info->prev_rec.br_startoff + + info->prev_rec.br_blockcount) xchk_fblock_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -806,7 +820,6 @@ xchk_bmap( goto out; /* Scrub extent records. */ - info.lastoff = 0; ifp = xfs_ifork_ptr(ip, whichfork); for_each_xfs_iext(ifp, &info.icur, &irec) { if (xchk_should_terminate(sc, &error) || @@ -823,7 +836,7 @@ xchk_bmap( xchk_bmap_iextent_delalloc(ip, &info, &irec); else xchk_bmap_iextent(ip, &info, &irec); - info.lastoff = irec.br_startoff + irec.br_blockcount; + memcpy(&info.prev_rec, &irec, sizeof(struct xfs_bmbt_irec)); } error = xchk_bmap_check_rmaps(sc, whichfork); -- cgit v1.2.3 From 634d4a79e76691020ba73f50416da37a30779e9e Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:24 -0700 Subject: xfs: accumulate iextent records when checking bmap Currently, the bmap scrubber checks file fork mappings individually. In the case that the file uses multiple mappings to a single contiguous piece of space, the scrubber repeatedly locks the AG to check the existence of a reverse mapping that overlaps this file mapping. If the reverse mapping starts before or ends after the mapping we're checking, it will also crawl around in the bmbt checking correspondence for adjacent extents. This is not very time efficient because it does the crawling while holding the AGF buffer, and checks the middle mappings multiple times. Instead, create a custom iextent record iterator function that combines multiple adjacent allocated mappings into one large incore bmbt record. This is feasible because the incore bmbt record length is 64-bits wide. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/libxfs/xfs_bmap.h | 2 +- fs/xfs/scrub/bmap.c | 183 +++++++++++++++++++++++++++-------------------- 2 files changed, 107 insertions(+), 78 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index 9ff030d12981..e33470e39728 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -145,7 +145,7 @@ static inline int xfs_bmapi_whichfork(uint32_t bmapi_flags) { BMAP_COWFORK, "COW" } /* Return true if the extent is an allocated extent, written or not. 
*/ -static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec) +static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec) { return irec->br_startblock != HOLESTARTBLOCK && irec->br_startblock != DELAYSTARTBLOCK && diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 7b4d0e2736a2..4bda1b0ee122 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -165,48 +165,6 @@ xchk_bmap_get_rmap( return has_rmap; } -static inline bool -xchk_bmap_has_prev( - struct xchk_bmap_info *info, - struct xfs_bmbt_irec *irec) -{ - struct xfs_bmbt_irec got; - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); - - if (!xfs_iext_peek_prev_extent(ifp, &info->icur, &got)) - return false; - if (got.br_startoff + got.br_blockcount != irec->br_startoff) - return false; - if (got.br_startblock + got.br_blockcount != irec->br_startblock) - return false; - if (got.br_state != irec->br_state) - return false; - return true; -} - -static inline bool -xchk_bmap_has_next( - struct xchk_bmap_info *info, - struct xfs_bmbt_irec *irec) -{ - struct xfs_bmbt_irec got; - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); - - if (!xfs_iext_peek_next_extent(ifp, &info->icur, &got)) - return false; - if (irec->br_startoff + irec->br_blockcount != got.br_startoff) - return false; - if (irec->br_startblock + irec->br_blockcount != got.br_startblock) - return false; - if (got.br_state != irec->br_state) - return false; - return true; -} - /* Make sure that we have rmapbt records for this extent. */ STATIC void xchk_bmap_xref_rmap( @@ -277,31 +235,20 @@ xchk_bmap_xref_rmap( irec->br_startoff); /* - * If the rmap starts before this bmbt record, make sure there's a bmbt - * record for the previous offset that is contiguous with this mapping. - * Skip this for CoW fork extents because the refcount btree (and not - * the inode) is the ondisk owner for those extents. - */ - if (info->whichfork != XFS_COW_FORK && rmap.rm_startblock < agbno && - !xchk_bmap_has_prev(info, irec)) { - xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - return; - } - - /* - * If the rmap ends after this bmbt record, make sure there's a bmbt - * record for the next offset that is contiguous with this mapping. - * Skip this for CoW fork extents because the refcount btree (and not - * the inode) is the ondisk owner for those extents. + * The rmap must correspond exactly with this bmbt record. Skip this + * for CoW fork extents because the refcount btree (and not the inode) + * is the ondisk owner for those extents. */ - rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; - if (info->whichfork != XFS_COW_FORK && - rmap_end > agbno + irec->br_blockcount && - !xchk_bmap_has_next(info, irec)) { - xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - return; + if (info->whichfork != XFS_COW_FORK) { + if (rmap.rm_startblock != agbno) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + rmap_end = (unsigned long long)rmap.rm_startblock + + rmap.rm_blockcount; + if (rmap_end != agbno + irec->br_blockcount) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); } } @@ -428,15 +375,7 @@ xchk_bmap_iextent( xchk_bmap_dirattr_extent(ip, info, irec); - /* There should never be a "hole" extent in either extent list. 
*/ - if (irec->br_startblock == HOLESTARTBLOCK) - xchk_fblock_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - /* Make sure the extent points to a valid place. */ - if (irec->br_blockcount > XFS_MAX_BMBT_EXTLEN) - xchk_fblock_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); if (info->is_rt && !xfs_verify_rtext(mp, irec->br_startblock, irec->br_blockcount)) xchk_fblock_set_corrupt(info->sc, info->whichfork, @@ -740,6 +679,90 @@ xchk_bmap_iextent_delalloc( irec->br_startoff); } +/* Decide if this individual fork mapping is ok. */ +static bool +xchk_bmap_iext_mapping( + struct xchk_bmap_info *info, + const struct xfs_bmbt_irec *irec) +{ + /* There should never be a "hole" extent in either extent list. */ + if (irec->br_startblock == HOLESTARTBLOCK) + return false; + if (irec->br_blockcount > XFS_MAX_BMBT_EXTLEN) + return false; + return true; +} + +/* Are these two mappings contiguous with each other? */ +static inline bool +xchk_are_bmaps_contiguous( + const struct xfs_bmbt_irec *b1, + const struct xfs_bmbt_irec *b2) +{ + /* Don't try to combine unallocated mappings. */ + if (!xfs_bmap_is_real_extent(b1)) + return false; + if (!xfs_bmap_is_real_extent(b2)) + return false; + + /* Does b2 come right after b1 in the logical and physical range? */ + if (b1->br_startoff + b1->br_blockcount != b2->br_startoff) + return false; + if (b1->br_startblock + b1->br_blockcount != b2->br_startblock) + return false; + if (b1->br_state != b2->br_state) + return false; + return true; +} + +/* + * Walk the incore extent records, accumulating consecutive contiguous records + * into a single incore mapping. Returns true if @irec has been set to a + * mapping or false if there are no more mappings. Caller must ensure that + * @info.icur is zeroed before the first call. + */ +static int +xchk_bmap_iext_iter( + struct xchk_bmap_info *info, + struct xfs_bmbt_irec *irec) +{ + struct xfs_bmbt_irec got; + struct xfs_ifork *ifp; + + ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); + + /* Advance to the next iextent record and check the mapping. */ + xfs_iext_next(ifp, &info->icur); + if (!xfs_iext_get_extent(ifp, &info->icur, irec)) + return false; + + if (!xchk_bmap_iext_mapping(info, irec)) { + xchk_fblock_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + return false; + } + + /* + * Iterate subsequent iextent records and merge them with the one + * that we just read, if possible. + */ + while (xfs_iext_peek_next_extent(ifp, &info->icur, &got)) { + if (!xchk_are_bmaps_contiguous(irec, &got)) + break; + + if (!xchk_bmap_iext_mapping(info, &got)) { + xchk_fblock_set_corrupt(info->sc, info->whichfork, + got.br_startoff); + return false; + } + + irec->br_blockcount += got.br_blockcount; + xfs_iext_next(ifp, &info->icur); + } + + return true; +} + /* * Scrub an inode fork's block mappings. * @@ -819,9 +842,15 @@ xchk_bmap( if (!xchk_fblock_process_error(sc, whichfork, 0, &error)) goto out; - /* Scrub extent records. */ - ifp = xfs_ifork_ptr(ip, whichfork); - for_each_xfs_iext(ifp, &info.icur, &irec) { + /* + * Scrub extent records. We use a special iterator function here that + * combines adjacent mappings if they are logically and physically + * contiguous. For large allocations that require multiple bmbt + * records, this reduces the number of cross-referencing calls, which + * reduces runtime. Cross referencing with the rmap is simpler because + * the rmap must match the combined mapping exactly. 
+ */ + while (xchk_bmap_iext_iter(&info, &irec)) { if (xchk_should_terminate(sc, &error) || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) goto out; -- cgit v1.2.3 From c0d5a92f7aaf41b1ab70869358d534757b569a1f Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:24 -0700 Subject: xfs: split xchk_bmap_xref_rmap into two functions There's more special-cased functionality than not in this function. Split it into two so that each can be far more cohesive. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 116 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 76 insertions(+), 40 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 4bda1b0ee122..912b67d2321b 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -165,7 +165,7 @@ xchk_bmap_get_rmap( return has_rmap; } -/* Make sure that we have rmapbt records for this extent. */ +/* Make sure that we have rmapbt records for this data/attr fork extent. */ STATIC void xchk_bmap_xref_rmap( struct xchk_bmap_info *info, @@ -174,41 +174,39 @@ xchk_bmap_xref_rmap( { struct xfs_rmap_irec rmap; unsigned long long rmap_end; - uint64_t owner; + uint64_t owner = info->sc->ip->i_ino; if (!info->sc->sa.rmap_cur || xchk_skip_xref(info->sc->sm)) return; - if (info->whichfork == XFS_COW_FORK) - owner = XFS_RMAP_OWN_COW; - else - owner = info->sc->ip->i_ino; - /* Find the rmap record for this irec. */ if (!xchk_bmap_get_rmap(info, irec, agbno, owner, &rmap)) return; - /* Check the rmap. */ + /* + * The rmap must be an exact match for this incore file mapping record, + * which may have arisen from multiple ondisk records. + */ + if (rmap.rm_startblock != agbno) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; - if (rmap.rm_startblock > agbno || - agbno + irec->br_blockcount > rmap_end) + if (rmap_end != agbno + irec->br_blockcount) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); - /* - * Check the logical offsets if applicable. CoW staging extents - * don't track logical offsets since the mappings only exist in - * memory. - */ - if (info->whichfork != XFS_COW_FORK) { - rmap_end = (unsigned long long)rmap.rm_offset + - rmap.rm_blockcount; - if (rmap.rm_offset > irec->br_startoff || - irec->br_startoff + irec->br_blockcount > rmap_end) - xchk_fblock_xref_set_corrupt(info->sc, - info->whichfork, irec->br_startoff); - } + /* Check the logical offsets. */ + if (rmap.rm_offset != irec->br_startoff) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + rmap_end = (unsigned long long)rmap.rm_offset + rmap.rm_blockcount; + if (rmap_end != irec->br_startoff + irec->br_blockcount) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + /* Check the owner */ if (rmap.rm_owner != owner) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -220,8 +218,7 @@ xchk_bmap_xref_rmap( * records because the blocks are owned (on-disk) by the refcountbt, * which doesn't track unwritten state. 
*/ - if (owner != XFS_RMAP_OWN_COW && - !!(irec->br_state == XFS_EXT_UNWRITTEN) != + if (!!(irec->br_state == XFS_EXT_UNWRITTEN) != !!(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -233,23 +230,60 @@ xchk_bmap_xref_rmap( if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); +} + +/* Make sure that we have rmapbt records for this COW fork extent. */ +STATIC void +xchk_bmap_xref_rmap_cow( + struct xchk_bmap_info *info, + struct xfs_bmbt_irec *irec, + xfs_agblock_t agbno) +{ + struct xfs_rmap_irec rmap; + unsigned long long rmap_end; + uint64_t owner = XFS_RMAP_OWN_COW; + + if (!info->sc->sa.rmap_cur || xchk_skip_xref(info->sc->sm)) + return; + + /* Find the rmap record for this irec. */ + if (!xchk_bmap_get_rmap(info, irec, agbno, owner, &rmap)) + return; /* - * The rmap must correspond exactly with this bmbt record. Skip this - * for CoW fork extents because the refcount btree (and not the inode) - * is the ondisk owner for those extents. + * CoW staging extents are owned by the refcount btree, so the rmap + * can start before and end after the physical space allocated to this + * mapping. There are no offsets to check. */ - if (info->whichfork != XFS_COW_FORK) { - if (rmap.rm_startblock != agbno) - xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - - rmap_end = (unsigned long long)rmap.rm_startblock + - rmap.rm_blockcount; - if (rmap_end != agbno + irec->br_blockcount) - xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - } + if (rmap.rm_startblock > agbno) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; + if (rmap_end < agbno + irec->br_blockcount) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + /* Check the owner */ + if (rmap.rm_owner != owner) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + /* + * No flags allowed. Note that the (in-memory) CoW fork distinguishes + * between unwritten and written extents, but we don't track that in + * the rmap records because the blocks are owned (on-disk) by the + * refcountbt, which doesn't track unwritten state. + */ + if (rmap.rm_flags & XFS_RMAP_ATTR_FORK) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + if (rmap.rm_flags & XFS_RMAP_UNWRITTEN) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); } /* Cross-reference a single rtdev extent record. 
*/ @@ -288,9 +322,9 @@ xchk_bmap_iextent_xref( xchk_xref_is_used_space(info->sc, agbno, len); xchk_xref_is_not_inode_chunk(info->sc, agbno, len); - xchk_bmap_xref_rmap(info, irec, agbno); switch (info->whichfork) { case XFS_DATA_FORK: + xchk_bmap_xref_rmap(info, irec, agbno); if (!xfs_is_reflink_inode(info->sc->ip)) { xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, info->whichfork, irec->br_startoff); @@ -303,6 +337,7 @@ xchk_bmap_iextent_xref( irec->br_blockcount); break; case XFS_ATTR_FORK: + xchk_bmap_xref_rmap(info, irec, agbno); xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, info->whichfork, irec->br_startoff); xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, @@ -313,6 +348,7 @@ xchk_bmap_iextent_xref( irec->br_blockcount); break; case XFS_COW_FORK: + xchk_bmap_xref_rmap_cow(info, irec, agbno); xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, &XFS_RMAP_OINFO_COW); xchk_xref_is_cow_staging(info->sc, agbno, -- cgit v1.2.3 From 336642f79283715e4535bfaa05f5593dd91da6e8 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:25 -0700 Subject: xfs: alert the user about data/attr fork mappings that could be merged If the data or attr forks have mappings that could be merged, let the user know that the structure could be optimized. This isn't a filesystem corruption since the regular filesystem does not try to be smart about merging bmbt records. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 912b67d2321b..a2edcf8cc779 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -764,6 +764,7 @@ xchk_bmap_iext_iter( { struct xfs_bmbt_irec got; struct xfs_ifork *ifp; + xfs_filblks_t prev_len; ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); @@ -782,6 +783,7 @@ xchk_bmap_iext_iter( * Iterate subsequent iextent records and merge them with the one * that we just read, if possible. */ + prev_len = irec->br_blockcount; while (xfs_iext_peek_next_extent(ifp, &info->icur, &got)) { if (!xchk_are_bmaps_contiguous(irec, &got)) break; @@ -792,7 +794,16 @@ xchk_bmap_iext_iter( return false; } + /* + * Notify the user of mergeable records in the data or attr + * forks. CoW forks only exist in memory so we ignore them. + */ + if (info->whichfork != XFS_COW_FORK && + prev_len + got.br_blockcount > BMBT_BLOCKCOUNT_MASK) + xchk_ino_set_preen(info->sc, info->sc->ip->i_ino); + irec->br_blockcount += got.br_blockcount; + prev_len = got.br_blockcount; xfs_iext_next(ifp, &info->icur); } -- cgit v1.2.3 From e8882f69b941b20704ea509ebfca2d8a123ad6e3 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:25 -0700 Subject: xfs: split the xchk_bmap_check_rmaps into a predicate This function has two parts: the second part scans every reverse mapping record for this file fork to make sure that there's a corresponding mapping in the fork, and the first part decides if we even want to do that. Split the first part into a separate predicate so that we can make more changes to it in the next patch. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 60 +++++++++++++++++++++++++++++++++-------------------- 1 file changed, 38 insertions(+), 22 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index a2edcf8cc779..64ed5b6585d5 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -635,28 +635,28 @@ xchk_bmap_check_ag_rmaps( return error; } -/* Make sure each rmap has a corresponding bmbt entry. */ -STATIC int -xchk_bmap_check_rmaps( - struct xfs_scrub *sc, - int whichfork) +/* + * Decide if we want to walk every rmap btree in the fs to make sure that each + * rmap for this file fork has corresponding bmbt entries. + */ +static bool +xchk_bmap_want_check_rmaps( + struct xchk_bmap_info *info) { - struct xfs_ifork *ifp = xfs_ifork_ptr(sc->ip, whichfork); - struct xfs_perag *pag; - xfs_agnumber_t agno; + struct xfs_scrub *sc = info->sc; + struct xfs_ifork *ifp; bool zero_size; - int error; - if (!xfs_has_rmapbt(sc->mp) || - whichfork == XFS_COW_FORK || - (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) - return 0; + if (!xfs_has_rmapbt(sc->mp)) + return false; + if (info->whichfork == XFS_COW_FORK) + return false; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return false; /* Don't support realtime rmap checks yet. */ - if (XFS_IS_REALTIME_INODE(sc->ip) && whichfork == XFS_DATA_FORK) - return 0; - - ASSERT(xfs_ifork_ptr(sc->ip, whichfork) != NULL); + if (info->is_rt) + return false; /* * Only do this for complex maps that are in btree format, or for @@ -666,14 +666,28 @@ xchk_bmap_check_rmaps( * reattached. */ - if (whichfork == XFS_DATA_FORK) + if (info->whichfork == XFS_DATA_FORK) zero_size = i_size_read(VFS_I(sc->ip)) == 0; else zero_size = false; + ifp = xfs_ifork_ptr(sc->ip, info->whichfork); if (ifp->if_format != XFS_DINODE_FMT_BTREE && (zero_size || ifp->if_nextents > 0)) - return 0; + return false; + + return true; +} + +/* Make sure each rmap has a corresponding bmbt entry. */ +STATIC int +xchk_bmap_check_rmaps( + struct xfs_scrub *sc, + int whichfork) +{ + struct xfs_perag *pag; + xfs_agnumber_t agno; + int error; for_each_perag(sc->mp, agno, pag) { error = xchk_bmap_check_ag_rmaps(sc, whichfork, pag); @@ -915,9 +929,11 @@ xchk_bmap( memcpy(&info.prev_rec, &irec, sizeof(struct xfs_bmbt_irec)); } - error = xchk_bmap_check_rmaps(sc, whichfork); - if (!xchk_fblock_xref_process_error(sc, whichfork, 0, &error)) - goto out; + if (xchk_bmap_want_check_rmaps(&info)) { + error = xchk_bmap_check_rmaps(sc, whichfork); + if (!xchk_fblock_xref_process_error(sc, whichfork, 0, &error)) + goto out; + } out: return error; } -- cgit v1.2.3 From d5784ae82778d94a18aba25ccbddc16f8ae13001 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:26 -0700 Subject: xfs: flag free space btree records that could be merged Complain if we encounter free space btree records that could be merged. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/alloc.c | 29 ++++++++++++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 12dd55ac2a4f..279af72b1671 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -31,6 +31,12 @@ xchk_setup_ag_allocbt( } /* Free space btree scrubber. */ + +struct xchk_alloc { + /* Previous free space extent. */ + struct xfs_alloc_rec_incore prev; +}; + /* * Ensure there's a corresponding cntbt/bnobt record matching this * bnobt/cntbt record, respectively. 
@@ -93,6 +99,24 @@ xchk_allocbt_xref( xchk_xref_is_not_cow_staging(sc, agbno, len); } +/* Flag failures for records that could be merged. */ +STATIC void +xchk_allocbt_mergeable( + struct xchk_btree *bs, + struct xchk_alloc *ca, + const struct xfs_alloc_rec_incore *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (ca->prev.ar_blockcount > 0 && + ca->prev.ar_startblock + ca->prev.ar_blockcount == irec->ar_startblock && + ca->prev.ar_blockcount + irec->ar_blockcount < (uint32_t)~0U) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&ca->prev, irec, sizeof(*irec)); +} + /* Scrub a bnobt/cntbt record. */ STATIC int xchk_allocbt_rec( @@ -100,6 +124,7 @@ xchk_allocbt_rec( const union xfs_btree_rec *rec) { struct xfs_alloc_rec_incore irec; + struct xchk_alloc *ca = bs->private; xfs_alloc_btrec_to_irec(rec, &irec); if (xfs_alloc_check_irec(bs->cur, &irec) != NULL) { @@ -107,6 +132,7 @@ xchk_allocbt_rec( return 0; } + xchk_allocbt_mergeable(bs, ca, &irec); xchk_allocbt_xref(bs->sc, &irec); return 0; @@ -118,10 +144,11 @@ xchk_allocbt( struct xfs_scrub *sc, xfs_btnum_t which) { + struct xchk_alloc ca = { }; struct xfs_btree_cur *cur; cur = which == XFS_BTNUM_BNO ? sc->sa.bno_cur : sc->sa.cnt_cur; - return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, NULL); + return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, &ca); } int -- cgit v1.2.3 From 1e59fdb7d6157ff685a250e0873a015a2b16a4f2 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:26 -0700 Subject: xfs: don't call xchk_bmap_check_rmaps for btree-format file forks The logic at the end of xchk_bmap_want_check_rmaps tries to detect a file fork that has been zapped by what will become the online inode repair code. Zapped forks are in FMT_EXTENTS with zero extents, and some sort of hint that there's supposed to be data somewhere in the filesystem. Unfortunately, the inverted logic here is confusing and has the effect that we always call xchk_bmap_check_rmaps for FMT_BTREE forks. This is horribly inefficient and unnecessary, so invert the logic to get rid of this performance problem. This has caused 8h delays in generic/333 and generic/334. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bmap.c | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index 64ed5b6585d5..87ab9f95a487 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -645,7 +645,6 @@ xchk_bmap_want_check_rmaps( { struct xfs_scrub *sc = info->sc; struct xfs_ifork *ifp; - bool zero_size; if (!xfs_has_rmapbt(sc->mp)) return false; @@ -659,24 +658,23 @@ xchk_bmap_want_check_rmaps( return false; /* - * Only do this for complex maps that are in btree format, or for - * situations where we would seem to have a size but zero extents. - * The inode repair code can zap broken iforks, which means we have - * to flag this bmap as corrupt if there are rmaps that need to be - * reattached. + * The inode repair code zaps broken inode forks by resetting them back + * to EXTENTS format and zero extent records. If we encounter a fork + * in this state along with evidence that the fork isn't supposed to be + * empty, we need to scan the reverse mappings to decide if we're going + * to rebuild the fork. Data forks with nonzero file size are scanned. + * xattr forks are never empty of content, so they are always scanned. 
*/ - - if (info->whichfork == XFS_DATA_FORK) - zero_size = i_size_read(VFS_I(sc->ip)) == 0; - else - zero_size = false; - ifp = xfs_ifork_ptr(sc->ip, info->whichfork); - if (ifp->if_format != XFS_DINODE_FMT_BTREE && - (zero_size || ifp->if_nextents > 0)) - return false; + ifp = xfs_ifork_ptr(sc->ip, info->whichfork); + if (ifp->if_format == XFS_DINODE_FMT_EXTENTS && ifp->if_nextents == 0) { + if (info->whichfork == XFS_DATA_FORK && + i_size_read(VFS_I(sc->ip)) == 0) + return false; - return true; + return true; + } + + return false; } /* Make sure each rmap has a corresponding bmbt entry. */ -- cgit v1.2.3 From db0502b39c21d1cab6b6778a416a5b407170be90 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:27 -0700 Subject: xfs: flag refcount btree records that could be merged Complain if we encounter refcount btree records that could be merged. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/refcount.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index db9e46a4f8d4..ed5eb367ce49 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -333,6 +333,9 @@ xchk_refcountbt_xref( } struct xchk_refcbt_records { + /* Previous refcount record. */ + struct xfs_refcount_irec prev_rec; + /* The next AG block where we aren't expecting shared extents. */ xfs_agblock_t next_unshared_agbno; @@ -390,6 +393,46 @@ xchk_refcountbt_xref_gaps( xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur); } +static inline bool +xchk_refcount_mergeable( + struct xchk_refcbt_records *rrc, + const struct xfs_refcount_irec *r2) +{ + const struct xfs_refcount_irec *r1 = &rrc->prev_rec; + + /* Ignore if prev_rec is not yet initialized. */ + if (r1->rc_blockcount == 0) + return false; + + if (r1->rc_domain != r2->rc_domain) + return false; + if (r1->rc_startblock + r1->rc_blockcount != r2->rc_startblock) + return false; + if (r1->rc_refcount != r2->rc_refcount) + return false; + if ((unsigned long long)r1->rc_blockcount + r2->rc_blockcount > + MAXREFCEXTLEN) + return false; + + return true; +} + +/* Flag failures for records that could be merged. */ +STATIC void +xchk_refcountbt_check_mergeable( + struct xchk_btree *bs, + struct xchk_refcbt_records *rrc, + const struct xfs_refcount_irec *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (xchk_refcount_mergeable(rrc, irec)) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&rrc->prev_rec, irec, sizeof(struct xfs_refcount_irec)); +} + /* Scrub a refcountbt record. */ STATIC int xchk_refcountbt_rec( @@ -414,6 +457,7 @@ xchk_refcountbt_rec( xchk_btree_set_corrupt(bs->sc, bs->cur, 0); rrc->prev_domain = irec.rc_domain; + xchk_refcountbt_check_mergeable(bs, rrc, &irec); xchk_refcountbt_xref(bs->sc, &irec); /* -- cgit v1.2.3 From 29ab991b4fe9df3cb6f943bea9e256fbdfa93589 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:27 -0700 Subject: xfs: check overlapping rmap btree records The rmap btree scrubber doesn't contain sufficient checking for records that cannot overlap but do anyway. For the other btrees, this is enforced by the inorder checks in xchk_btree_rec, but the rmap btree is special because it allows overlapping records to handle shared data extents. Therefore, enhance the rmap btree record check function to compare each record against the previous one so that we can detect overlapping rmap records for space allocations that do not allow sharing. Signed-off-by: Darrick J.
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/rmap.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 72 insertions(+), 2 deletions(-) diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 18b64287158e..f7e0384a0e69 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -32,6 +32,15 @@ xchk_setup_ag_rmapbt( /* Reverse-mapping scrubber. */ +struct xchk_rmap { + /* + * The furthest-reaching of the rmapbt records that we've already + * processed. This enables us to detect overlapping records for space + * allocations that cannot be shared. + */ + struct xfs_rmap_irec overlap_rec; +}; + /* Cross-reference a rmap against the refcount btree. */ STATIC void xchk_rmapbt_xref_refc( @@ -139,12 +148,63 @@ xchk_rmapbt_check_unwritten_in_keyflags( } } +static inline bool +xchk_rmapbt_is_shareable( + struct xfs_scrub *sc, + const struct xfs_rmap_irec *irec) +{ + if (!xfs_has_reflink(sc->mp)) + return false; + if (XFS_RMAP_NON_INODE_OWNER(irec->rm_owner)) + return false; + if (irec->rm_flags & (XFS_RMAP_BMBT_BLOCK | XFS_RMAP_ATTR_FORK | + XFS_RMAP_UNWRITTEN)) + return false; + return true; +} + +/* Flag failures for records that overlap but cannot. */ +STATIC void +xchk_rmapbt_check_overlapping( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + xfs_agblock_t pnext, inext; + + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + /* No previous record? */ + if (cr->overlap_rec.rm_blockcount == 0) + goto set_prev; + + /* Do overlap_rec and irec overlap? */ + pnext = cr->overlap_rec.rm_startblock + cr->overlap_rec.rm_blockcount; + if (pnext <= irec->rm_startblock) + goto set_prev; + + /* Overlap is only allowed if both records are data fork mappings. */ + if (!xchk_rmapbt_is_shareable(bs->sc, &cr->overlap_rec) || + !xchk_rmapbt_is_shareable(bs->sc, irec)) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + /* Save whichever rmap record extends furthest. */ + inext = irec->rm_startblock + irec->rm_blockcount; + if (pnext > inext) + return; + +set_prev: + memcpy(&cr->overlap_rec, irec, sizeof(struct xfs_rmap_irec)); +} + /* Scrub an rmapbt record. */ STATIC int xchk_rmapbt_rec( struct xchk_btree *bs, const union xfs_btree_rec *rec) { + struct xchk_rmap *cr = bs->private; struct xfs_rmap_irec irec; if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || @@ -154,6 +214,7 @@ xchk_rmapbt_rec( } xchk_rmapbt_check_unwritten_in_keyflags(bs); + xchk_rmapbt_check_overlapping(bs, cr, &irec); xchk_rmapbt_xref(bs->sc, &irec); return 0; } @@ -163,8 +224,17 @@ int xchk_rmapbt( struct xfs_scrub *sc) { - return xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec, - &XFS_RMAP_OINFO_AG, NULL); + struct xchk_rmap *cr; + int error; + + cr = kzalloc(sizeof(struct xchk_rmap), XCHK_GFP_FLAGS); + if (!cr) + return -ENOMEM; + + error = xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec, + &XFS_RMAP_OINFO_AG, cr); + kfree(cr); + return error; } /* xref check that the extent is owned only by a given owner */ -- cgit v1.2.3 From 1c1646afc96783702f92356846d6e47e0bbd6b11 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:28 -0700 Subject: xfs: check for reverse mapping records that could be merged Enhance the rmap scrubber to flag adjacent records that could be merged. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/rmap.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index f7e0384a0e69..6d7e294110a2 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -39,6 +39,12 @@ struct xchk_rmap { * allocations that cannot be shared. */ struct xfs_rmap_irec overlap_rec; + + /* + * The previous rmapbt record, so that we can check for two records + * that could be one. + */ + struct xfs_rmap_irec prev_rec; }; /* Cross-reference a rmap against the refcount btree. */ @@ -198,6 +204,51 @@ set_prev: memcpy(&cr->overlap_rec, irec, sizeof(struct xfs_rmap_irec)); } +/* Decide if two reverse-mapping records can be merged. */ +static inline bool +xchk_rmap_mergeable( + struct xchk_rmap *cr, + const struct xfs_rmap_irec *r2) +{ + const struct xfs_rmap_irec *r1 = &cr->prev_rec; + + /* Ignore if prev_rec is not yet initialized. */ + if (cr->prev_rec.rm_blockcount == 0) + return false; + + if (r1->rm_owner != r2->rm_owner) + return false; + if (r1->rm_startblock + r1->rm_blockcount != r2->rm_startblock) + return false; + if ((unsigned long long)r1->rm_blockcount + r2->rm_blockcount > + XFS_RMAP_LEN_MAX) + return false; + if (XFS_RMAP_NON_INODE_OWNER(r2->rm_owner)) + return true; + /* must be an inode owner below here */ + if (r1->rm_flags != r2->rm_flags) + return false; + if (r1->rm_flags & XFS_RMAP_BMBT_BLOCK) + return true; + return r1->rm_offset + r1->rm_blockcount == r2->rm_offset; +} + +/* Flag failures for records that could be merged. */ +STATIC void +xchk_rmapbt_check_mergeable( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (xchk_rmap_mergeable(cr, irec)) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&cr->prev_rec, irec, sizeof(struct xfs_rmap_irec)); +} + /* Scrub an rmapbt record. */ STATIC int xchk_rmapbt_rec( @@ -214,6 +265,7 @@ xchk_rmapbt_rec( } xchk_rmapbt_check_unwritten_in_keyflags(bs); + xchk_rmapbt_check_mergeable(bs, cr, &irec); xchk_rmapbt_check_overlapping(bs, cr, &irec); xchk_rmapbt_xref(bs->sc, &irec); return 0; -- cgit v1.2.3 From c12ad41468a5f4112b98566dcb1ee9096579106a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:29 -0700 Subject: xfs: xattr scrub should ensure one namespace bit per name Check that each extended attribute exists in only one namespace. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 5573be3a3dfe..f6def9c864ab 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -128,10 +128,16 @@ xchk_xattr_listent( return; } + /* Only one namespace bit allowed. */ + if (hweight32(flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) { + xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); + goto fail_xref; + } + /* Does this name make sense? */ if (!xfs_attr_namecheck(name, namelen)) { xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); - return; + goto fail_xref; } /* -- cgit v1.2.3 From ee366fe4f519f0739a2c62cf998f58932e77c6a9 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:29 -0700 Subject: xfs: don't shadow @leaf in xchk_xattr_block Don't shadow the leaf variable here, because it's misleading to have one place in the codebase where two variables with different types have the same name. 
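To illustrate the hazard, here is a condensed sketch of the pre-patch shape of this function (an approximation, not the verbatim code): the inner declaration silently changes both the type and the object that "leaf" refers to until the closing brace. struct xfs_attr_leafblock *leaf = bp->b_addr; if (xfs_has_crc(ds->sc->mp)) { /* Shadows the outer "leaf" with a different ondisk layout. */ struct xfs_attr3_leafblock *leaf = bp->b_addr; /* Here "leaf" means the v5 layout... */ } /* ...while out here it quietly means the v4 layout again. */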
Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index f6def9c864ab..6bb0256d0689 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -342,10 +342,10 @@ xchk_xattr_block( /* Check all the padding. */ if (xfs_has_crc(ds->sc->mp)) { - struct xfs_attr3_leafblock *leaf = bp->b_addr; + struct xfs_attr3_leafblock *leaf3 = bp->b_addr; - if (leaf->hdr.pad1 != 0 || leaf->hdr.pad2 != 0 || - leaf->hdr.info.hdr.pad != 0) + if (leaf3->hdr.pad1 != 0 || leaf3->hdr.pad2 != 0 || + leaf3->hdr.info.hdr.pad != 0) xchk_da_set_corrupt(ds, level); } else { if (leaf->hdr.pad1 != 0 || leaf->hdr.info.pad != 0) -- cgit v1.2.3 From 4cb76025208925f697de66ac4d2d821cafabd367 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:30 -0700 Subject: xfs: remove unnecessary dstmap in xattr scrubber Replace bitmap_and with bitmap_intersects in the xattr leaf block scrubber, since we only care if there's overlap between the used space bitmap and the free space bitmap. This means we don't need dstmap any more, and can thus reduce the memory requirements. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 7 +++---- fs/xfs/scrub/attr.h | 12 +----------- 2 files changed, 4 insertions(+), 15 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 6bb0256d0689..45fa8a5967de 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -36,10 +36,10 @@ xchk_setup_xattr_buf( /* * We need enough space to read an xattr value from the file or enough - * space to hold three copies of the xattr free space bitmap. We don't + * space to hold two copies of the xattr free space bitmap. We don't * need the buffer space for both purposes at the same time. */ - sz = 3 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); + sz = 2 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); sz = max_t(size_t, sz, value_size); /* @@ -223,7 +223,6 @@ xchk_xattr_check_freemap( struct xfs_attr3_icleaf_hdr *leafhdr) { unsigned long *freemap = xchk_xattr_freemap(sc); - unsigned long *dstmap = xchk_xattr_dstmap(sc); unsigned int mapsize = sc->mp->m_attr_geo->blksize; int i; @@ -237,7 +236,7 @@ xchk_xattr_check_freemap( } /* Look for bits that are set in freemap and are marked in use. */ - return bitmap_and(dstmap, freemap, map, mapsize) == 0; + return !bitmap_intersects(freemap, map, mapsize); } /* diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index bc6321552251..daf354a581bb 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -21,8 +21,7 @@ struct xchk_xattr_buf { * Each bitmap contains enough bits to track every byte in an attr * block (rounded up to the size of an unsigned long). The attr block * used space bitmap starts at the beginning of the buffer; the free - * space bitmap follows immediately after; and we have a third buffer - * for storing intermediate bitmap results. + * space bitmap follows immediately after. */ uint8_t buf[]; }; @@ -56,13 +55,4 @@ xchk_xattr_freemap( BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); } -/* A bitmap used to hold temporary results. */ -static inline unsigned long * -xchk_xattr_dstmap( - struct xfs_scrub *sc) -{ - return xchk_xattr_freemap(sc) + - BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); -} - #endif /* __XFS_SCRUB_ATTR_H__ */ -- cgit v1.2.3 From 91781ff549379a867d8fbe588a4c95f6598b1fa9 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 19:00:30 -0700 Subject: xfs: split freemap from xchk_xattr_buf.buf Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. This is the start of removing the complex overloaded memory buffer that is the source of weird memory misuse bugs. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 40 ++++++++++++++++++++++++++++++++-------- fs/xfs/scrub/attr.h | 15 ++++----------- fs/xfs/scrub/scrub.c | 3 +++ fs/xfs/scrub/scrub.h | 10 ++++++++++ 4 files changed, 49 insertions(+), 19 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 45fa8a5967de..bc529b53a84f 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -20,6 +20,17 @@ #include "scrub/dabtree.h" #include "scrub/attr.h" +/* Free the buffers linked from the xattr buffer. */ +static void +xchk_xattr_buf_cleanup( + void *priv) +{ + struct xchk_xattr_buf *ab = priv; + + kvfree(ab->freemap); + ab->freemap = NULL; +} + /* * Allocate enough memory to hold an attr value and attr block bitmaps, * reallocating the buffer if necessary. Buffer contents are not preserved @@ -32,15 +43,18 @@ xchk_setup_xattr_buf( gfp_t flags) { size_t sz; + size_t bmp_sz; struct xchk_xattr_buf *ab = sc->buf; + unsigned long *old_freemap = NULL; + + bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); /* * We need enough space to read an xattr value from the file or enough - * space to hold two copies of the xattr free space bitmap. We don't + * space to hold one copy of the xattr free space bitmap. We don't * need the buffer space for both purposes at the same time. */ - sz = 2 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); - sz = max_t(size_t, sz, value_size); + sz = max_t(size_t, bmp_sz, value_size); /* * If there's already a buffer, figure out if we need to reallocate it @@ -49,6 +63,7 @@ xchk_setup_xattr_buf( if (ab) { if (sz <= ab->sz) return 0; + old_freemap = ab->freemap; kvfree(ab); sc->buf = NULL; } @@ -60,9 +75,18 @@ xchk_setup_xattr_buf( ab = kvmalloc(sizeof(*ab) + sz, flags); if (!ab) return -ENOMEM; - ab->sz = sz; sc->buf = ab; + sc->buf_cleanup = xchk_xattr_buf_cleanup; + + if (old_freemap) { + ab->freemap = old_freemap; + } else { + ab->freemap = kvmalloc(bmp_sz, flags); + if (!ab->freemap) + return -ENOMEM; + } + return 0; } @@ -222,21 +246,21 @@ xchk_xattr_check_freemap( unsigned long *map, struct xfs_attr3_icleaf_hdr *leafhdr) { - unsigned long *freemap = xchk_xattr_freemap(sc); + struct xchk_xattr_buf *ab = sc->buf; unsigned int mapsize = sc->mp->m_attr_geo->blksize; int i; /* Construct bitmap of freemap contents. */ - bitmap_zero(freemap, mapsize); + bitmap_zero(ab->freemap, mapsize); for (i = 0; i < XFS_ATTR_LEAF_MAPSIZE; i++) { - if (!xchk_xattr_set_map(sc, freemap, + if (!xchk_xattr_set_map(sc, ab->freemap, leafhdr->freemap[i].base, leafhdr->freemap[i].size)) return false; } /* Look for bits that are set in freemap and are marked in use. */ - return !bitmap_intersects(freemap, map, mapsize); + return !bitmap_intersects(ab->freemap, map, mapsize); } /* diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index daf354a581bb..341855b3201b 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -10,6 +10,9 @@ * Temporary storage for online scrub and repair of extended attributes. */ struct xchk_xattr_buf { + /* Bitmap of free space in xattr leaf blocks. */ + unsigned long *freemap; + /* Size of @buf, in bytes. 
*/ size_t sz; @@ -20,8 +23,7 @@ struct xchk_xattr_buf { * * Each bitmap contains enough bits to track every byte in an attr * block (rounded up to the size of an unsigned long). The attr block - * used space bitmap starts at the beginning of the buffer; the free - * space bitmap follows immediately after. + * used space bitmap starts at the beginning of the buffer. */ uint8_t buf[]; }; @@ -46,13 +48,4 @@ xchk_xattr_usedmap( return (unsigned long *)ab->buf; } -/* A bitmap of free space computed by walking attr leaf block free info. */ -static inline unsigned long * -xchk_xattr_freemap( - struct xfs_scrub *sc) -{ - return xchk_xattr_usedmap(sc) + - BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); -} - #endif /* __XFS_SCRUB_ATTR_H__ */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 03ec455318f4..02819bedc5b1 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -189,7 +189,10 @@ xchk_teardown( if (sc->flags & XCHK_REAPING_DISABLED) xchk_start_reaping(sc); if (sc->buf) { + if (sc->buf_cleanup) + sc->buf_cleanup(sc->buf); kvfree(sc->buf); + sc->buf_cleanup = NULL; sc->buf = NULL; } diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index c519927355fe..e71903474cd7 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -77,7 +77,17 @@ struct xfs_scrub { */ struct xfs_inode *ip; + /* Kernel memory buffer used by scrubbers; freed at teardown. */ void *buf; + + /* + * Clean up resources owned by whatever is in the buffer. Cleanup can + * be deferred with this hook as a means for scrub functions to pass + * data to repair functions. This function must not free the buffer + * itself. + */ + void (*buf_cleanup)(void *buf); + uint ilock_flags; /* See the XCHK/XREP state flags below. */ -- cgit v1.2.3 From 80069284b5eb63e48bc7cb9d4bd179711ae6e77a Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:31 -0700 Subject: xfs: split usedmap from xchk_xattr_buf.buf Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 39 +++++++++++++++++++++------------------ fs/xfs/scrub/attr.h | 22 +++++----------------- 2 files changed, 26 insertions(+), 35 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index bc529b53a84f..65ec503c8a38 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -29,6 +29,8 @@ xchk_xattr_buf_cleanup( kvfree(ab->freemap); ab->freemap = NULL; + kvfree(ab->usedmap); + ab->usedmap = NULL; } /* @@ -42,20 +44,14 @@ xchk_setup_xattr_buf( size_t value_size, gfp_t flags) { - size_t sz; + size_t sz = value_size; size_t bmp_sz; struct xchk_xattr_buf *ab = sc->buf; + unsigned long *old_usedmap = NULL; unsigned long *old_freemap = NULL; bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); - /* - * We need enough space to read an xattr value from the file or enough - * space to hold one copy of the xattr free space bitmap. We don't - * need the buffer space for both purposes at the same time. - */ - sz = max_t(size_t, bmp_sz, value_size); - /* * If there's already a buffer, figure out if we need to reallocate it * to accommodate a larger size. 
@@ -64,6 +60,7 @@ xchk_setup_xattr_buf( if (sz <= ab->sz) return 0; old_freemap = ab->freemap; + old_usedmap = ab->usedmap; kvfree(ab); sc->buf = NULL; } @@ -79,6 +76,14 @@ xchk_setup_xattr_buf( sc->buf = ab; sc->buf_cleanup = xchk_xattr_buf_cleanup; + if (old_usedmap) { + ab->usedmap = old_usedmap; + } else { + ab->usedmap = kvmalloc(bmp_sz, flags); + if (!ab->usedmap) + return -ENOMEM; + } + if (old_freemap) { ab->freemap = old_freemap; } else { @@ -243,7 +248,6 @@ xchk_xattr_set_map( STATIC bool xchk_xattr_check_freemap( struct xfs_scrub *sc, - unsigned long *map, struct xfs_attr3_icleaf_hdr *leafhdr) { struct xchk_xattr_buf *ab = sc->buf; @@ -260,7 +264,7 @@ xchk_xattr_check_freemap( } /* Look for bits that are set in freemap and are marked in use. */ - return !bitmap_intersects(ab->freemap, map, mapsize); + return !bitmap_intersects(ab->freemap, ab->usedmap, mapsize); } /* @@ -280,7 +284,7 @@ xchk_xattr_entry( __u32 *last_hashval) { struct xfs_mount *mp = ds->state->mp; - unsigned long *usedmap = xchk_xattr_usedmap(ds->sc); + struct xchk_xattr_buf *ab = ds->sc->buf; char *name_end; struct xfs_attr_leaf_name_local *lentry; struct xfs_attr_leaf_name_remote *rentry; @@ -320,7 +324,7 @@ xchk_xattr_entry( if (name_end > buf_end) xchk_da_set_corrupt(ds, level); - if (!xchk_xattr_set_map(ds->sc, usedmap, nameidx, namesize)) + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, nameidx, namesize)) xchk_da_set_corrupt(ds, level); if (!(ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) *usedbytes += namesize; @@ -340,7 +344,7 @@ xchk_xattr_block( struct xfs_attr_leafblock *leaf = bp->b_addr; struct xfs_attr_leaf_entry *ent; struct xfs_attr_leaf_entry *entries; - unsigned long *usedmap; + struct xchk_xattr_buf *ab = ds->sc->buf; char *buf_end; size_t off; __u32 last_hashval = 0; @@ -358,10 +362,9 @@ xchk_xattr_block( return -EDEADLOCK; if (error) return error; - usedmap = xchk_xattr_usedmap(ds->sc); *last_checked = blk->blkno; - bitmap_zero(usedmap, mp->m_attr_geo->blksize); + bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize); /* Check all the padding. */ if (xfs_has_crc(ds->sc->mp)) { @@ -385,7 +388,7 @@ xchk_xattr_block( xchk_da_set_corrupt(ds, level); if (leafhdr.firstused < hdrsize) xchk_da_set_corrupt(ds, level); - if (!xchk_xattr_set_map(ds->sc, usedmap, 0, hdrsize)) + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, 0, hdrsize)) xchk_da_set_corrupt(ds, level); if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) @@ -399,7 +402,7 @@ xchk_xattr_block( for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) { /* Mark the leaf entry itself. */ off = (char *)ent - (char *)leaf; - if (!xchk_xattr_set_map(ds->sc, usedmap, off, + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, off, sizeof(xfs_attr_leaf_entry_t))) { xchk_da_set_corrupt(ds, level); goto out; @@ -413,7 +416,7 @@ xchk_xattr_block( goto out; } - if (!xchk_xattr_check_freemap(ds->sc, usedmap, &leafhdr)) + if (!xchk_xattr_check_freemap(ds->sc, &leafhdr)) xchk_da_set_corrupt(ds, level); if (leafhdr.usedbytes != usedbytes) diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 341855b3201b..525f45815526 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -10,6 +10,9 @@ * Temporary storage for online scrub and repair of extended attributes. */ struct xchk_xattr_buf { + /* Bitmap of used space in xattr leaf blocks. */ + unsigned long *usedmap; + /* Bitmap of free space in xattr leaf blocks. 
*/ unsigned long *freemap; @@ -17,13 +20,8 @@ struct xchk_xattr_buf { size_t sz; /* - * Memory buffer -- either used for extracting attr values while - * walking the attributes; or for computing attr block bitmaps when - * checking the attribute tree. - * - * Each bitmap contains enough bits to track every byte in an attr - * block (rounded up to the size of an unsigned long). The attr block - * used space bitmap starts at the beginning of the buffer. + * Memory buffer -- used for extracting attr values while walking the + * attributes. */ uint8_t buf[]; }; @@ -38,14 +36,4 @@ xchk_xattr_valuebuf( return ab->buf; } -/* A bitmap of space usage computed by walking an attr leaf block. */ -static inline unsigned long * -xchk_xattr_usedmap( - struct xfs_scrub *sc) -{ - struct xchk_xattr_buf *ab = sc->buf; - - return (unsigned long *)ab->buf; -} - #endif /* __XFS_SCRUB_ATTR_H__ */ -- cgit v1.2.3 From b996c9a80664b970b73961c83bede243b999341e Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:31 -0700 Subject: xfs: split valuebuf from xchk_xattr_buf.buf Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 89 ++++++++++++++++++++++++++--------------------------- fs/xfs/scrub/attr.h | 21 ++----------- 2 files changed, 46 insertions(+), 64 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 65ec503c8a38..7f723f206146 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -31,6 +31,9 @@ xchk_xattr_buf_cleanup( ab->freemap = NULL; kvfree(ab->usedmap); ab->usedmap = NULL; + kvfree(ab->value); + ab->value = NULL; + ab->value_sz = 0; } /* @@ -44,54 +47,45 @@ xchk_setup_xattr_buf( size_t value_size, gfp_t flags) { - size_t sz = value_size; size_t bmp_sz; struct xchk_xattr_buf *ab = sc->buf; - unsigned long *old_usedmap = NULL; - unsigned long *old_freemap = NULL; + void *new_val; bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); - /* - * If there's already a buffer, figure out if we need to reallocate it - * to accommodate a larger size. - */ - if (ab) { - if (sz <= ab->sz) - return 0; - old_freemap = ab->freemap; - old_usedmap = ab->usedmap; - kvfree(ab); - sc->buf = NULL; - } + if (ab) + goto resize_value; - /* - * Don't zero the buffer upon allocation to avoid runtime overhead. - * All users must be careful never to read uninitialized contents. 
- */ - ab = kvmalloc(sizeof(*ab) + sz, flags); + ab = kvzalloc(sizeof(struct xchk_xattr_buf), flags); if (!ab) return -ENOMEM; - ab->sz = sz; sc->buf = ab; sc->buf_cleanup = xchk_xattr_buf_cleanup; - if (old_usedmap) { - ab->usedmap = old_usedmap; - } else { - ab->usedmap = kvmalloc(bmp_sz, flags); - if (!ab->usedmap) - return -ENOMEM; - } + ab->usedmap = kvmalloc(bmp_sz, flags); + if (!ab->usedmap) + return -ENOMEM; - if (old_freemap) { - ab->freemap = old_freemap; - } else { - ab->freemap = kvmalloc(bmp_sz, flags); - if (!ab->freemap) - return -ENOMEM; + ab->freemap = kvmalloc(bmp_sz, flags); + if (!ab->freemap) + return -ENOMEM; + +resize_value: + if (ab->value_sz >= value_size) + return 0; + + if (ab->value) { + kvfree(ab->value); + ab->value = NULL; + ab->value_sz = 0; } + new_val = kvmalloc(value_size, flags); + if (!new_val) + return -ENOMEM; + + ab->value = new_val; + ab->value_sz = value_size; return 0; } @@ -140,11 +134,24 @@ xchk_xattr_listent( int namelen, int valuelen) { + struct xfs_da_args args = { + .op_flags = XFS_DA_OP_NOTIME, + .attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK, + .geo = context->dp->i_mount->m_attr_geo, + .whichfork = XFS_ATTR_FORK, + .dp = context->dp, + .name = name, + .namelen = namelen, + .hashval = xfs_da_hashname(name, namelen), + .trans = context->tp, + .valuelen = valuelen, + }; + struct xchk_xattr_buf *ab; struct xchk_xattr *sx; - struct xfs_da_args args = { NULL }; int error = 0; sx = container_of(context, struct xchk_xattr, context); + ab = sx->sc->buf; if (xchk_should_terminate(sx->sc, &error)) { context->seen_enough = error; @@ -182,17 +189,7 @@ xchk_xattr_listent( return; } - args.op_flags = XFS_DA_OP_NOTIME; - args.attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK; - args.geo = context->dp->i_mount->m_attr_geo; - args.whichfork = XFS_ATTR_FORK; - args.dp = context->dp; - args.name = name; - args.namelen = namelen; - args.hashval = xfs_da_hashname(args.name, args.namelen); - args.trans = context->tp; - args.value = xchk_xattr_valuebuf(sx->sc); - args.valuelen = valuelen; + args.value = ab->value; error = xfs_attr_get_ilocked(&args); /* ENODATA means the hash lookup failed and the attr is bad */ diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 525f45815526..3c764722da61 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -16,24 +16,9 @@ struct xchk_xattr_buf { /* Bitmap of free space in xattr leaf blocks. */ unsigned long *freemap; - /* Size of @buf, in bytes. */ - size_t sz; - - /* - * Memory buffer -- used for extracting attr values while walking the - * attributes. - */ - uint8_t buf[]; + /* Memory buffer used to extract xattr values. */ + void *value; + size_t value_sz; }; -/* A place to store attribute values. */ -static inline uint8_t * -xchk_xattr_valuebuf( - struct xfs_scrub *sc) -{ - struct xchk_xattr_buf *ab = sc->buf; - - return ab->buf; -} - #endif /* __XFS_SCRUB_ATTR_H__ */ -- cgit v1.2.3 From f58977edc0b50bf6aee5a63bff34581b22b2ce63 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:32 -0700 Subject: xfs: remove flags argument from xchk_setup_xattr_buf All callers pass XCHK_GFP_FLAGS as the flags argument to xchk_setup_xattr_buf, so get rid of the argument. Signed-off-by: Darrick J. 
Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 7f723f206146..420166336f56 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -44,8 +44,7 @@ xchk_xattr_buf_cleanup( static int xchk_setup_xattr_buf( struct xfs_scrub *sc, - size_t value_size, - gfp_t flags) + size_t value_size) { size_t bmp_sz; struct xchk_xattr_buf *ab = sc->buf; @@ -56,17 +55,17 @@ xchk_setup_xattr_buf( if (ab) goto resize_value; - ab = kvzalloc(sizeof(struct xchk_xattr_buf), flags); + ab = kvzalloc(sizeof(struct xchk_xattr_buf), XCHK_GFP_FLAGS); if (!ab) return -ENOMEM; sc->buf = ab; sc->buf_cleanup = xchk_xattr_buf_cleanup; - ab->usedmap = kvmalloc(bmp_sz, flags); + ab->usedmap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); if (!ab->usedmap) return -ENOMEM; - ab->freemap = kvmalloc(bmp_sz, flags); + ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); if (!ab->freemap) return -ENOMEM; @@ -80,7 +79,7 @@ resize_value: ab->value_sz = 0; } - new_val = kvmalloc(value_size, flags); + new_val = kvmalloc(value_size, XCHK_GFP_FLAGS); if (!new_val) return -ENOMEM; @@ -102,8 +101,7 @@ xchk_setup_xattr( * without the inode lock held, which means we can sleep. */ if (sc->flags & XCHK_TRY_HARDER) { - error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX, - XCHK_GFP_FLAGS); + error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX); if (error) return error; } @@ -181,7 +179,7 @@ xchk_xattr_listent( * doesn't work, we overload the seen_enough variable to convey * the error message back to the main scrub function. */ - error = xchk_setup_xattr_buf(sx->sc, valuelen, XCHK_GFP_FLAGS); + error = xchk_setup_xattr_buf(sx->sc, valuelen); if (error == -ENOMEM) error = -EDEADLOCK; if (error) { @@ -354,7 +352,7 @@ xchk_xattr_block( return 0; /* Allocate memory for block usage checking. */ - error = xchk_setup_xattr_buf(ds->sc, 0, XCHK_GFP_FLAGS); + error = xchk_setup_xattr_buf(ds->sc, 0); if (error == -ENOMEM) return -EDEADLOCK; if (error) -- cgit v1.2.3 From 5b02a3e8391c703638c5a97513c353aa0c3fd5b0 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:32 -0700 Subject: xfs: move xattr scrub buffer allocation to top level function Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr, since we only need to set up the leaf block bitmaps once. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 420166336f56..b37de9459dc0 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -346,18 +346,10 @@ xchk_xattr_block( unsigned int usedbytes = 0; unsigned int hdrsize; int i; - int error; if (*last_checked == blk->blkno) return 0; - /* Allocate memory for block usage checking. */ - error = xchk_setup_xattr_buf(ds->sc, 0); - if (error == -ENOMEM) - return -EDEADLOCK; - if (error) - return error; - *last_checked = blk->blkno; bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize); @@ -507,6 +499,13 @@ xchk_xattr( if (!xfs_inode_hasattr(sc->ip)) return -ENOENT; + /* Allocate memory for xattr checking. */ + error = xchk_setup_xattr_buf(sc, 0); + if (error == -ENOMEM) + return -EDEADLOCK; + if (error) + return error; + memset(&sx, 0, sizeof(sx)); /* Check attribute tree structure */ error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, -- cgit v1.2.3 From ae0506eba78fd1d6236b46ca5aa089c8fc6050cf Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Tue, 11 Apr 2023 19:00:33 -0700 Subject: xfs: check used space of shortform xattr structures Make sure that the records used inside a shortform xattr structure do not overlap. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++--- fs/xfs/scrub/attr.h | 2 +- 2 files changed, 76 insertions(+), 5 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index b37de9459dc0..a49048f2a3db 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -15,6 +15,7 @@ #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_attr_leaf.h" +#include "xfs_attr_sf.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/dabtree.h" @@ -487,6 +488,73 @@ out: return error; } +/* Check space usage of shortform attrs. */ +STATIC int +xchk_xattr_check_sf( + struct xfs_scrub *sc) +{ + struct xchk_xattr_buf *ab = sc->buf; + struct xfs_attr_shortform *sf; + struct xfs_attr_sf_entry *sfe; + struct xfs_attr_sf_entry *next; + struct xfs_ifork *ifp; + unsigned char *end; + int i; + int error = 0; + + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + + bitmap_zero(ab->usedmap, ifp->if_bytes); + sf = (struct xfs_attr_shortform *)sc->ip->i_af.if_u1.if_data; + end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes; + xchk_xattr_set_map(sc, ab->usedmap, 0, sizeof(sf->hdr)); + + sfe = &sf->list[0]; + if ((unsigned char *)sfe > end) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + return 0; + } + + for (i = 0; i < sf->hdr.count; i++) { + unsigned char *name = sfe->nameval; + unsigned char *value = &sfe->nameval[sfe->namelen]; + + if (xchk_should_terminate(sc, &error)) + return error; + + next = xfs_attr_sf_nextentry(sfe); + if ((unsigned char *)next > end) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)sfe - (char *)sf, + sizeof(struct xfs_attr_sf_entry))) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)name - (char *)sf, + sfe->namelen)) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)value - (char *)sf, + sfe->valuelen)) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + sfe = next; + } + + return 0; +} + /* Scrub the extended attribute metadata. */ int xchk_xattr( @@ -506,10 +574,12 @@ xchk_xattr( if (error) return error; - memset(&sx, 0, sizeof(sx)); - /* Check attribute tree structure */ - error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, - &last_checked); + /* Check the physical structure of the xattr. */ + if (sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL) + error = xchk_xattr_check_sf(sc); + else + error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, + &last_checked); if (error) goto out; @@ -517,6 +587,7 @@ xchk_xattr( goto out; /* Check that every attr key can also be looked up by hash. */ + memset(&sx, 0, sizeof(sx)); sx.context.dp = sc->ip; sx.context.resynch = 1; sx.context.put_listent = xchk_xattr_listent; diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 3c764722da61..48fd9402c432 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -10,7 +10,7 @@ * Temporary storage for online scrub and repair of extended attributes. */ struct xchk_xattr_buf { - /* Bitmap of used space in xattr leaf blocks. */ + /* Bitmap of used space in xattr leaf blocks and shortform forks. 
*/ unsigned long *usedmap; /* Bitmap of free space in xattr leaf blocks. */ -- cgit v1.2.3 From 6cee51e6d02bac7ee72969aa23e32c9bdcd7bb6e Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:34 -0700 Subject: xfs: clean up xattr scrub initialization Clean up local variable initialization and error returns in xchk_xattr. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index a49048f2a3db..d2e1856beeb6 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -560,7 +560,16 @@ int xchk_xattr( struct xfs_scrub *sc) { - struct xchk_xattr sx; + struct xchk_xattr sx = { + .sc = sc, + .context = { + .dp = sc->ip, + .tp = sc->tp, + .resynch = 1, + .put_listent = xchk_xattr_listent, + .allow_incomplete = true, + }, + }; xfs_dablk_t last_checked = -1U; int error = 0; @@ -581,22 +590,13 @@ xchk_xattr( error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, &last_checked); if (error) - goto out; + return error; if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - goto out; - - /* Check that every attr key can also be looked up by hash. */ - memset(&sx, 0, sizeof(sx)); - sx.context.dp = sc->ip; - sx.context.resynch = 1; - sx.context.put_listent = xchk_xattr_listent; - sx.context.tp = sc->tp; - sx.context.allow_incomplete = true; - sx.sc = sc; + return 0; /* - * Look up every xattr in this file by name. + * Look up every xattr in this file by name and hash. * * Use the backend implementation of xfs_attr_list to call * xchk_xattr_listent on every attribute key in this inode. @@ -613,11 +613,11 @@ xchk_xattr( */ error = xfs_attr_list_ilocked(&sx.context); if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, 0, &error)) - goto out; + return error; /* Did our listent function try to return any errors? */ if (sx.context.seen_enough < 0) - error = sx.context.seen_enough; -out: - return error; + return sx.context.seen_enough; + + return 0; } -- cgit v1.2.3 From 674f0d0dc6b5b2228c4e9d597a62d5aa6b54a9c5 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:34 -0700 Subject: xfs: only allocate free space bitmap for xattr scrub if needed The free space bitmap is only required if we're going to check the bestfree space at the end of an xattr leaf block. Therefore, we can reduce the memory requirements of this scrubber if we can determine that the xattr is in short format. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index d2e1856beeb6..2445fe2860ff 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -37,6 +37,29 @@ xchk_xattr_buf_cleanup( ab->value_sz = 0; } +/* + * Allocate the free space bitmap if we're trying harder; there are leaf blocks + * in the attr fork; or we can't tell if there are leaf blocks. + */ +static inline bool +xchk_xattr_want_freemap( + struct xfs_scrub *sc) +{ + struct xfs_ifork *ifp; + + if (sc->flags & XCHK_TRY_HARDER) + return true; + + if (!sc->ip) + return true; + + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + if (!ifp) + return false; + + return xfs_ifork_has_extents(ifp); +} + /* * Allocate enough memory to hold an attr value and attr block bitmaps, * reallocating the buffer if necessary. 
Buffer contents are not preserved @@ -66,9 +89,11 @@ xchk_setup_xattr_buf( if (!ab->usedmap) return -ENOMEM; - ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); - if (!ab->freemap) - return -ENOMEM; + if (xchk_xattr_want_freemap(sc)) { + ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); + if (!ab->freemap) + return -ENOMEM; + } resize_value: if (ab->value_sz >= value_size) -- cgit v1.2.3 From 178b48d588ea5424a54423dc9c406416de0547c8 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:35 -0700 Subject: xfs: remove the for_each_xbitmap_ helpers Remove the for_each_xbitmap_ macros in favor of proper iterator functions. We'll soon be switching this data structure over to an interval tree implementation, which means that we can't allow callers to modify the bitmap during iteration without telling us. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/agheader_repair.c | 89 +++++++++++++++++++---------------- fs/xfs/scrub/bitmap.c | 59 +++++++++++++++++++++++ fs/xfs/scrub/bitmap.h | 22 ++++++--- fs/xfs/scrub/repair.c | 104 ++++++++++++++++++++++------------------- 4 files changed, 180 insertions(+), 94 deletions(-) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index edfb1dfb80a9..997ddcd1f124 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -487,10 +487,11 @@ xrep_agfl_walk_rmap( /* Strike out the blocks that are cross-linked according to the rmapbt. */ STATIC int xrep_agfl_check_extent( - struct xrep_agfl *ra, uint64_t start, - uint64_t len) + uint64_t len, + void *priv) { + struct xrep_agfl *ra = priv; xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(ra->sc->mp, start); xfs_agblock_t last_agbno = agbno + len - 1; int error; @@ -538,7 +539,6 @@ xrep_agfl_collect_blocks( struct xrep_agfl ra; struct xfs_mount *mp = sc->mp; struct xfs_btree_cur *cur; - struct xbitmap_range *br, *n; int error; ra.sc = sc; @@ -579,11 +579,7 @@ xrep_agfl_collect_blocks( /* Strike out the blocks that are cross-linked. */ ra.rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.pag); - for_each_xbitmap_extent(br, n, agfl_extents) { - error = xrep_agfl_check_extent(&ra, br->start, br->len); - if (error) - break; - } + error = xbitmap_walk(agfl_extents, xrep_agfl_check_extent, &ra); xfs_btree_del_cursor(ra.rmap_cur, error); if (error) goto out_bmp; @@ -629,6 +625,43 @@ xrep_agfl_update_agf( XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); } +struct xrep_agfl_fill { + struct xbitmap used_extents; + struct xfs_scrub *sc; + __be32 *agfl_bno; + xfs_agblock_t flcount; + unsigned int fl_off; +}; + +/* Fill the AGFL with whatever blocks are in this extent. */ +static int +xrep_agfl_fill( + uint64_t start, + uint64_t len, + void *priv) +{ + struct xrep_agfl_fill *af = priv; + struct xfs_scrub *sc = af->sc; + xfs_fsblock_t fsbno = start; + int error; + + while (fsbno < start + len && af->fl_off < af->flcount) + af->agfl_bno[af->fl_off++] = + cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, fsbno++)); + + trace_xrep_agfl_insert(sc->mp, sc->sa.pag->pag_agno, + XFS_FSB_TO_AGBNO(sc->mp, start), len); + + error = xbitmap_set(&af->used_extents, start, fsbno - 1); + if (error) + return error; + + if (af->fl_off == af->flcount) + return -ECANCELED; + + return 0; +} + /* Write out a totally new AGFL. 
*/ STATIC void xrep_agfl_init_header( @@ -637,13 +670,12 @@ xrep_agfl_init_header( struct xbitmap *agfl_extents, xfs_agblock_t flcount) { + struct xrep_agfl_fill af = { + .sc = sc, + .flcount = flcount, + }; struct xfs_mount *mp = sc->mp; - __be32 *agfl_bno; - struct xbitmap_range *br; - struct xbitmap_range *n; struct xfs_agfl *agfl; - xfs_agblock_t agbno; - unsigned int fl_off; ASSERT(flcount <= xfs_agfl_size(mp)); @@ -662,36 +694,15 @@ xrep_agfl_init_header( * blocks than fit in the AGFL, they will be freed in a subsequent * step. */ - fl_off = 0; - agfl_bno = xfs_buf_to_agfl_bno(agfl_bp); - for_each_xbitmap_extent(br, n, agfl_extents) { - agbno = XFS_FSB_TO_AGBNO(mp, br->start); - - trace_xrep_agfl_insert(mp, sc->sa.pag->pag_agno, agbno, - br->len); - - while (br->len > 0 && fl_off < flcount) { - agfl_bno[fl_off] = cpu_to_be32(agbno); - fl_off++; - agbno++; - - /* - * We've now used br->start by putting it in the AGFL, - * so bump br so that we don't reap the block later. - */ - br->start++; - br->len--; - } - - if (br->len) - break; - list_del(&br->list); - kfree(br); - } + xbitmap_init(&af.used_extents); + af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp), + xbitmap_walk(agfl_extents, xrep_agfl_fill, &af); + xbitmap_disunion(agfl_extents, &af.used_extents); /* Write new AGFL to disk. */ xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF); xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1); + xbitmap_destroy(&af.used_extents); } /* Repair the AGFL. */ diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index ce8b17d76c0b..315b7c5844a3 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -13,6 +13,9 @@ #include "scrub/scrub.h" #include "scrub/bitmap.h" +#define for_each_xbitmap_extent(bex, n, bitmap) \ + list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) + /* * Set a range of this bitmap. Caller must ensure the range is not set. * @@ -313,3 +316,59 @@ xbitmap_hweight( return ret; } + +/* Call a function for every run of set bits in this bitmap. */ +int +xbitmap_walk( + struct xbitmap *bitmap, + xbitmap_walk_fn fn, + void *priv) +{ + struct xbitmap_range *bex, *n; + int error = 0; + + for_each_xbitmap_extent(bex, n, bitmap) { + error = fn(bex->start, bex->len, priv); + if (error) + break; + } + + return error; +} + +struct xbitmap_walk_bits { + xbitmap_walk_bits_fn fn; + void *priv; +}; + +/* Walk all the bits in a run. */ +static int +xbitmap_walk_bits_in_run( + uint64_t start, + uint64_t len, + void *priv) +{ + struct xbitmap_walk_bits *wb = priv; + uint64_t i; + int error = 0; + + for (i = start; i < start + len; i++) { + error = wb->fn(i, wb->priv); + if (error) + break; + } + + return error; +} + +/* Call a function for every set bit in this bitmap. 
*/ +int +xbitmap_walk_bits( + struct xbitmap *bitmap, + xbitmap_walk_bits_fn fn, + void *priv) +{ + struct xbitmap_walk_bits wb = {.fn = fn, .priv = priv}; + + return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb); +} diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 85ec0e2792c5..01e37173dc34 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -19,13 +19,6 @@ struct xbitmap { void xbitmap_init(struct xbitmap *bitmap); void xbitmap_destroy(struct xbitmap *bitmap); -#define for_each_xbitmap_extent(bex, n, bitmap) \ - list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) - -#define for_each_xbitmap_block(b, bex, n, bitmap) \ - list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) \ - for ((b) = (bex)->start; (b) < (bex)->start + (bex)->len; (b)++) - int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len); int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub); int xbitmap_set_btcur_path(struct xbitmap *bitmap, @@ -34,4 +27,19 @@ int xbitmap_set_btblocks(struct xbitmap *bitmap, struct xfs_btree_cur *cur); uint64_t xbitmap_hweight(struct xbitmap *bitmap); +/* + * Return codes for the bitmap iterator functions are 0 to continue iterating, + * and non-zero to stop iterating. Any non-zero value will be passed up to the + * iteration caller. The special value -ECANCELED can be used to stop + * iteration, because neither bitmap iterator ever generates that error code on + * its own. Callers must not modify the bitmap while walking it. + */ +typedef int (*xbitmap_walk_fn)(uint64_t start, uint64_t len, void *priv); +int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn, + void *priv); + +typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv); +int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn, + void *priv); + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index ab0758308f57..ac6d8803e660 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -445,6 +445,30 @@ xrep_init_btblock( * buffers associated with @bitmap. */ +static int +xrep_invalidate_block( + uint64_t fsbno, + void *priv) +{ + struct xfs_scrub *sc = priv; + struct xfs_buf *bp; + int error; + + /* Skip AG headers and post-EOFS blocks */ + if (!xfs_verify_fsbno(sc->mp, fsbno)) + return 0; + + error = xfs_buf_incore(sc->mp->m_ddev_targp, + XFS_FSB_TO_DADDR(sc->mp, fsbno), + XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp); + if (error) + return 0; + + xfs_trans_bjoin(sc->tp, bp); + xfs_trans_binval(sc->tp, bp); + return 0; +} + /* * Invalidate buffers for per-AG btree blocks we're dumping. This function * is not intended for use with file data repairs; we have bunmapi for that. @@ -454,11 +478,6 @@ xrep_invalidate_blocks( struct xfs_scrub *sc, struct xbitmap *bitmap) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; - struct xfs_buf *bp; - xfs_fsblock_t fsbno; - /* * For each block in each extent, see if there's an incore buffer for * exactly that block; if so, invalidate it. The buffer cache only @@ -467,23 +486,7 @@ xrep_invalidate_blocks( * because we never own those; and if we can't TRYLOCK the buffer we * assume it's owned by someone else. 
*/ - for_each_xbitmap_block(fsbno, bmr, n, bitmap) { - int error; - - /* Skip AG headers and post-EOFS blocks */ - if (!xfs_verify_fsbno(sc->mp, fsbno)) - continue; - error = xfs_buf_incore(sc->mp->m_ddev_targp, - XFS_FSB_TO_DADDR(sc->mp, fsbno), - XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp); - if (error) - continue; - - xfs_trans_bjoin(sc->tp, bp); - xfs_trans_binval(sc->tp, bp); - } - - return 0; + return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc); } /* Ensure the freelist is the correct size. */ @@ -504,6 +507,15 @@ xrep_fix_freelist( can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK); } +/* Information about reaping extents after a repair. */ +struct xrep_reap_state { + struct xfs_scrub *sc; + + /* Reverse mapping owner and metadata reservation type. */ + const struct xfs_owner_info *oinfo; + enum xfs_ag_resv_type resv; +}; + /* * Put a block back on the AGFL. */ @@ -548,17 +560,23 @@ xrep_put_freelist( /* Dispose of a single block. */ STATIC int xrep_reap_block( - struct xfs_scrub *sc, - xfs_fsblock_t fsbno, - const struct xfs_owner_info *oinfo, - enum xfs_ag_resv_type resv) + uint64_t fsbno, + void *priv) { + struct xrep_reap_state *rs = priv; + struct xfs_scrub *sc = rs->sc; struct xfs_btree_cur *cur; struct xfs_buf *agf_bp = NULL; xfs_agblock_t agbno; bool has_other_rmap; int error; + ASSERT(sc->ip != NULL || + XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); + trace_xrep_dispose_btree_extent(sc->mp, + XFS_FSB_TO_AGNO(sc->mp, fsbno), + XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1); + agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno); ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); @@ -577,7 +595,8 @@ xrep_reap_block( cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag); /* Can we find any other rmappings? */ - error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap); + error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo, + &has_other_rmap); xfs_btree_del_cursor(cur, error); if (error) goto out_free; @@ -597,12 +616,12 @@ xrep_reap_block( */ if (has_other_rmap) error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno, - 1, oinfo); - else if (resv == XFS_AG_RESV_AGFL) + 1, rs->oinfo); + else if (rs->resv == XFS_AG_RESV_AGFL) error = xrep_put_freelist(sc, agbno); else - error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, oinfo, - resv); + error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo, + rs->resv); if (agf_bp != sc->sa.agf_bp) xfs_trans_brelse(sc->tp, agf_bp); if (error) @@ -626,26 +645,15 @@ xrep_reap_extents( const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; - xfs_fsblock_t fsbno; - int error = 0; + struct xrep_reap_state rs = { + .sc = sc, + .oinfo = oinfo, + .resv = type, + }; ASSERT(xfs_has_rmapbt(sc->mp)); - for_each_xbitmap_block(fsbno, bmr, n, bitmap) { - ASSERT(sc->ip != NULL || - XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); - trace_xrep_dispose_btree_extent(sc->mp, - XFS_FSB_TO_AGNO(sc->mp, fsbno), - XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1); - - error = xrep_reap_block(sc, fsbno, oinfo, type); - if (error) - break; - } - - return error; + return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs); } /* -- cgit v1.2.3 From 44af6c7e59b12d740809cf25a60c9f90f03e6d20 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:35 -0700 Subject: xfs: don't load local xattr values during scrub Local extended attributes store their values within the same leaf block. 
There's no header for the values themselves, nor are they separately checksummed. Hence we can save a bit of time in the attr scrubber by not wasting time retrieving the values. Regrettably, shortform attributes do not set XFS_ATTR_LOCAL so this offers us no advantage there, but at least there are very few attrs in that case. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/attr.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 2445fe2860ff..6c16d9530cca 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -200,6 +200,14 @@ xchk_xattr_listent( goto fail_xref; } + /* + * Local xattr values are stored in the attr leaf block, so we don't + * need to retrieve the value from a remote block to detect corruption + * problems. + */ + if (flags & XFS_ATTR_LOCAL) + goto fail_xref; + /* * Try to allocate enough memory to extrat the attr value. If that * doesn't work, we overload the seen_enough variable to convey -- cgit v1.2.3 From 7296a6d6fb8fef515352dac5d8af2ffe7a78e5cf Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:36 -0700 Subject: xfs: drop the _safe behavior from the xbitmap foreach macro It's not safe to edit bitmap intervals while we're iterating them with for_each_xbitmap_extent. None of the existing callers actually need that ability anyway, so drop the safe variable. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bitmap.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index 315b7c5844a3..9927634a2c54 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -13,8 +13,9 @@ #include "scrub/scrub.h" #include "scrub/bitmap.h" -#define for_each_xbitmap_extent(bex, n, bitmap) \ - list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) +/* Iterate each interval of a bitmap. Do not change the bitmap. */ +#define for_each_xbitmap_extent(bex, bitmap) \ + list_for_each_entry((bex), &(bitmap)->list, list) /* * Set a range of this bitmap. Caller must ensure the range is not set. @@ -46,10 +47,9 @@ void xbitmap_destroy( struct xbitmap *bitmap) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; + struct xbitmap_range *bmr, *n; - for_each_xbitmap_extent(bmr, n, bitmap) { + list_for_each_entry_safe(bmr, n, &bitmap->list, list) { list_del(&bmr->list); kfree(bmr); } @@ -308,10 +308,9 @@ xbitmap_hweight( struct xbitmap *bitmap) { struct xbitmap_range *bmr; - struct xbitmap_range *n; uint64_t ret = 0; - for_each_xbitmap_extent(bmr, n, bitmap) + for_each_xbitmap_extent(bmr, bitmap) ret += bmr->len; return ret; @@ -324,10 +323,10 @@ xbitmap_walk( xbitmap_walk_fn fn, void *priv) { - struct xbitmap_range *bex, *n; + struct xbitmap_range *bex; int error = 0; - for_each_xbitmap_extent(bex, n, bitmap) { + for_each_xbitmap_extent(bex, bitmap) { error = fn(bex->start, bex->len, priv); if (error) break; -- cgit v1.2.3 From 6772fcc8890ae34595253fcfb8196c1aea65e111 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:36 -0700 Subject: xfs: convert xbitmap to interval tree Convert the xbitmap code to use interval trees instead of linked lists. This reduces the amount of coding required to handle the disunion operation and in the future will make it easier to set bits in arbitrary order yet later be able to extract maximally sized extents, which we'll need for rebuilding certain structures. 
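As an illustrative usage sketch (a hypothetical caller, not part of
this patch; error returns elided for brevity), adjacent ranges coalesce
into one maximally sized extent no matter what order they are set in:

    struct xbitmap bitmap;

    xbitmap_init(&bitmap);

    xbitmap_set(&bitmap, 30, 10);       /* bits 30-39 */
    xbitmap_set(&bitmap, 10, 20);       /* bits 10-29 */

    /*
     * The two ranges merge into a single interval [10, 39]; the
     * weight is 30 and xbitmap_walk() invokes its callback once.
     */
    ASSERT(xbitmap_hweight(&bitmap) == 30);

    xbitmap_destroy(&bitmap);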
We define our own interval tree type so that it can deal with 64-bit indices even on 32-bit machines. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/agheader_repair.c | 12 +- fs/xfs/scrub/bitmap.c | 319 ++++++++++++++++++++++------------------- fs/xfs/scrub/bitmap.h | 11 +- 3 files changed, 185 insertions(+), 157 deletions(-) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 997ddcd1f124..bbaa65422c4f 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -663,7 +663,7 @@ xrep_agfl_fill( } /* Write out a totally new AGFL. */ -STATIC void +STATIC int xrep_agfl_init_header( struct xfs_scrub *sc, struct xfs_buf *agfl_bp, @@ -676,6 +676,7 @@ xrep_agfl_init_header( }; struct xfs_mount *mp = sc->mp; struct xfs_agfl *agfl; + int error; ASSERT(flcount <= xfs_agfl_size(mp)); @@ -697,12 +698,15 @@ xrep_agfl_init_header( xbitmap_init(&af.used_extents); af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp), xbitmap_walk(agfl_extents, xrep_agfl_fill, &af); - xbitmap_disunion(agfl_extents, &af.used_extents); + error = xbitmap_disunion(agfl_extents, &af.used_extents); + if (error) + return error; /* Write new AGFL to disk. */ xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF); xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1); xbitmap_destroy(&af.used_extents); + return 0; } /* Repair the AGFL. */ @@ -755,7 +759,9 @@ xrep_agfl( * buffers until we know that part works. */ xrep_agfl_update_agf(sc, agf_bp, flcount); - xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount); + error = xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount); + if (error) + goto err; /* * Ok, the AGFL should be ready to go now. Roll the transaction to diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index 9927634a2c54..dc139f0031dc 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -13,31 +13,160 @@ #include "scrub/scrub.h" #include "scrub/bitmap.h" -/* Iterate each interval of a bitmap. Do not change the bitmap. */ -#define for_each_xbitmap_extent(bex, bitmap) \ - list_for_each_entry((bex), &(bitmap)->list, list) +#include + +struct xbitmap_node { + struct rb_node bn_rbnode; + + /* First set bit of this interval and subtree. */ + uint64_t bn_start; + + /* Last set bit of this interval. */ + uint64_t bn_last; + + /* Last set bit of this subtree. Do not touch this. */ + uint64_t __bn_subtree_last; +}; + +/* Define our own interval tree type with uint64_t parameters. */ + +#define START(node) ((node)->bn_start) +#define LAST(node) ((node)->bn_last) /* - * Set a range of this bitmap. Caller must ensure the range is not set. - * - * This is the logical equivalent of bitmap |= mask(start, len). + * These functions are defined by the INTERVAL_TREE_DEFINE macro, but we'll + * forward-declare them anyway for clarity. */ +static inline void +xbitmap_tree_insert(struct xbitmap_node *node, struct rb_root_cached *root); + +static inline void +xbitmap_tree_remove(struct xbitmap_node *node, struct rb_root_cached *root); + +static inline struct xbitmap_node * +xbitmap_tree_iter_first(struct rb_root_cached *root, uint64_t start, + uint64_t last); + +static inline struct xbitmap_node * +xbitmap_tree_iter_next(struct xbitmap_node *node, uint64_t start, + uint64_t last); + +INTERVAL_TREE_DEFINE(struct xbitmap_node, bn_rbnode, uint64_t, + __bn_subtree_last, START, LAST, static inline, xbitmap_tree) + +/* Iterate each interval of a bitmap. Do not change the bitmap. 
*/ +#define for_each_xbitmap_extent(bn, bitmap) \ + for ((bn) = rb_entry_safe(rb_first(&(bitmap)->xb_root.rb_root), \ + struct xbitmap_node, bn_rbnode); \ + (bn) != NULL; \ + (bn) = rb_entry_safe(rb_next(&(bn)->bn_rbnode), \ + struct xbitmap_node, bn_rbnode)) + +/* Clear a range of this bitmap. */ +int +xbitmap_clear( + struct xbitmap *bitmap, + uint64_t start, + uint64_t len) +{ + struct xbitmap_node *bn; + struct xbitmap_node *new_bn; + uint64_t last = start + len - 1; + + while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last))) { + if (bn->bn_start < start && bn->bn_last > last) { + uint64_t old_last = bn->bn_last; + + /* overlaps with the entire clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + + /* add an extent */ + new_bn = kmalloc(sizeof(struct xbitmap_node), + XCHK_GFP_FLAGS); + if (!new_bn) + return -ENOMEM; + new_bn->bn_start = last + 1; + new_bn->bn_last = old_last; + xbitmap_tree_insert(new_bn, &bitmap->xb_root); + } else if (bn->bn_start < start) { + /* overlaps with the left side of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + } else if (bn->bn_last > last) { + /* overlaps with the right side of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_start = last + 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + break; + } else { + /* in the middle of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + kfree(bn); + } + } + + return 0; +} + +/* Set a range of this bitmap. */ int xbitmap_set( struct xbitmap *bitmap, uint64_t start, uint64_t len) { - struct xbitmap_range *bmr; + struct xbitmap_node *left; + struct xbitmap_node *right; + uint64_t last = start + len - 1; + int error; - bmr = kmalloc(sizeof(struct xbitmap_range), XCHK_GFP_FLAGS); - if (!bmr) - return -ENOMEM; + /* Is this whole range already set? */ + left = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); + if (left && left->bn_start <= start && left->bn_last >= last) + return 0; - INIT_LIST_HEAD(&bmr->list); - bmr->start = start; - bmr->len = len; - list_add_tail(&bmr->list, &bitmap->list); + /* Clear out everything in the range we want to set. */ + error = xbitmap_clear(bitmap, start, len); + if (error) + return error; + + /* Do we have a left-adjacent extent? */ + left = xbitmap_tree_iter_first(&bitmap->xb_root, start - 1, start - 1); + ASSERT(!left || left->bn_last + 1 == start); + + /* Do we have a right-adjacent extent? 
*/ + right = xbitmap_tree_iter_first(&bitmap->xb_root, last + 1, last + 1); + ASSERT(!right || right->bn_start == last + 1); + + if (left && right) { + /* combine left and right adjacent extent */ + xbitmap_tree_remove(left, &bitmap->xb_root); + xbitmap_tree_remove(right, &bitmap->xb_root); + left->bn_last = right->bn_last; + xbitmap_tree_insert(left, &bitmap->xb_root); + kfree(right); + } else if (left) { + /* combine with left extent */ + xbitmap_tree_remove(left, &bitmap->xb_root); + left->bn_last = last; + xbitmap_tree_insert(left, &bitmap->xb_root); + } else if (right) { + /* combine with right extent */ + xbitmap_tree_remove(right, &bitmap->xb_root); + right->bn_start = start; + xbitmap_tree_insert(right, &bitmap->xb_root); + } else { + /* add an extent */ + left = kmalloc(sizeof(struct xbitmap_node), XCHK_GFP_FLAGS); + if (!left) + return -ENOMEM; + left->bn_start = start; + left->bn_last = last; + xbitmap_tree_insert(left, &bitmap->xb_root); + } return 0; } @@ -47,11 +176,11 @@ void xbitmap_destroy( struct xbitmap *bitmap) { - struct xbitmap_range *bmr, *n; + struct xbitmap_node *bn; - list_for_each_entry_safe(bmr, n, &bitmap->list, list) { - list_del(&bmr->list); - kfree(bmr); + while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, 0, -1ULL))) { + xbitmap_tree_remove(bn, &bitmap->xb_root); + kfree(bn); } } @@ -60,27 +189,7 @@ void xbitmap_init( struct xbitmap *bitmap) { - INIT_LIST_HEAD(&bitmap->list); -} - -/* Compare two btree extents. */ -static int -xbitmap_range_cmp( - void *priv, - const struct list_head *a, - const struct list_head *b) -{ - struct xbitmap_range *ap; - struct xbitmap_range *bp; - - ap = container_of(a, struct xbitmap_range, list); - bp = container_of(b, struct xbitmap_range, list); - - if (ap->start > bp->start) - return 1; - if (ap->start < bp->start) - return -1; - return 0; + bitmap->xb_root = RB_ROOT_CACHED; } /* @@ -97,118 +206,26 @@ xbitmap_range_cmp( * * This is the logical equivalent of bitmap &= ~sub. */ -#define LEFT_ALIGNED (1 << 0) -#define RIGHT_ALIGNED (1 << 1) int xbitmap_disunion( struct xbitmap *bitmap, struct xbitmap *sub) { - struct list_head *lp; - struct xbitmap_range *br; - struct xbitmap_range *new_br; - struct xbitmap_range *sub_br; - uint64_t sub_start; - uint64_t sub_len; - int state; - int error = 0; + struct xbitmap_node *bn; + int error; - if (list_empty(&bitmap->list) || list_empty(&sub->list)) + if (xbitmap_empty(bitmap) || xbitmap_empty(sub)) return 0; - ASSERT(!list_empty(&sub->list)); - - list_sort(NULL, &bitmap->list, xbitmap_range_cmp); - list_sort(NULL, &sub->list, xbitmap_range_cmp); - - /* - * Now that we've sorted both lists, we iterate bitmap once, rolling - * forward through sub and/or bitmap as necessary until we find an - * overlap or reach the end of either list. We do not reset lp to the - * head of bitmap nor do we reset sub_br to the head of sub. The - * list traversal is similar to merge sort, but we're deleting - * instead. In this manner we avoid O(n^2) operations. - */ - sub_br = list_first_entry(&sub->list, struct xbitmap_range, - list); - lp = bitmap->list.next; - while (lp != &bitmap->list) { - br = list_entry(lp, struct xbitmap_range, list); - - /* - * Advance sub_br and/or br until we find a pair that - * intersect or we run out of extents. 
- */ - while (sub_br->start + sub_br->len <= br->start) { - if (list_is_last(&sub_br->list, &sub->list)) - goto out; - sub_br = list_next_entry(sub_br, list); - } - if (sub_br->start >= br->start + br->len) { - lp = lp->next; - continue; - } - /* trim sub_br to fit the extent we have */ - sub_start = sub_br->start; - sub_len = sub_br->len; - if (sub_br->start < br->start) { - sub_len -= br->start - sub_br->start; - sub_start = br->start; - } - if (sub_len > br->len) - sub_len = br->len; - - state = 0; - if (sub_start == br->start) - state |= LEFT_ALIGNED; - if (sub_start + sub_len == br->start + br->len) - state |= RIGHT_ALIGNED; - switch (state) { - case LEFT_ALIGNED: - /* Coincides with only the left. */ - br->start += sub_len; - br->len -= sub_len; - break; - case RIGHT_ALIGNED: - /* Coincides with only the right. */ - br->len -= sub_len; - lp = lp->next; - break; - case LEFT_ALIGNED | RIGHT_ALIGNED: - /* Total overlap, just delete ex. */ - lp = lp->next; - list_del(&br->list); - kfree(br); - break; - case 0: - /* - * Deleting from the middle: add the new right extent - * and then shrink the left extent. - */ - new_br = kmalloc(sizeof(struct xbitmap_range), - XCHK_GFP_FLAGS); - if (!new_br) { - error = -ENOMEM; - goto out; - } - INIT_LIST_HEAD(&new_br->list); - new_br->start = sub_start + sub_len; - new_br->len = br->start + br->len - new_br->start; - list_add(&new_br->list, &br->list); - br->len = sub_start - br->start; - lp = lp->next; - break; - default: - ASSERT(0); - break; - } + for_each_xbitmap_extent(bn, sub) { + error = xbitmap_clear(bitmap, bn->bn_start, + bn->bn_last - bn->bn_start + 1); + if (error) + return error; } -out: - return error; + return 0; } -#undef LEFT_ALIGNED -#undef RIGHT_ALIGNED /* * Record all btree blocks seen while iterating all records of a btree. @@ -307,11 +324,11 @@ uint64_t xbitmap_hweight( struct xbitmap *bitmap) { - struct xbitmap_range *bmr; + struct xbitmap_node *bn; uint64_t ret = 0; - for_each_xbitmap_extent(bmr, bitmap) - ret += bmr->len; + for_each_xbitmap_extent(bn, bitmap) + ret += bn->bn_last - bn->bn_start + 1; return ret; } @@ -320,14 +337,14 @@ xbitmap_hweight( int xbitmap_walk( struct xbitmap *bitmap, - xbitmap_walk_fn fn, + xbitmap_walk_fn fn, void *priv) { - struct xbitmap_range *bex; + struct xbitmap_node *bn; int error = 0; - for_each_xbitmap_extent(bex, bitmap) { - error = fn(bex->start, bex->len, priv); + for_each_xbitmap_extent(bn, bitmap) { + error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv); if (error) break; } @@ -371,3 +388,11 @@ xbitmap_walk_bits( return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb); } + +/* Does this bitmap have no bits set at all? 
*/ +bool +xbitmap_empty( + struct xbitmap *bitmap) +{ + return bitmap->xb_root.rb_root.rb_node == NULL; +} diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 01e37173dc34..2ec4e1f3f24c 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -6,19 +6,14 @@ #ifndef __XFS_SCRUB_BITMAP_H__ #define __XFS_SCRUB_BITMAP_H__ -struct xbitmap_range { - struct list_head list; - uint64_t start; - uint64_t len; -}; - struct xbitmap { - struct list_head list; + struct rb_root_cached xb_root; }; void xbitmap_init(struct xbitmap *bitmap); void xbitmap_destroy(struct xbitmap *bitmap); +int xbitmap_clear(struct xbitmap *bitmap, uint64_t start, uint64_t len); int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len); int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub); int xbitmap_set_btcur_path(struct xbitmap *bitmap, @@ -42,4 +37,6 @@ typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv); int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn, void *priv); +bool xbitmap_empty(struct xbitmap *bitmap); + #endif /* __XFS_SCRUB_BITMAP_H__ */ -- cgit v1.2.3 From a47bd1e0e690d0296c3e48fc3d6f2d359c222d6c Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:37 -0700 Subject: xfs: introduce bitmap type for AG blocks Create a typechecked bitmap for extents within an AG. Online repair uses bitmaps to store various different types of numbers, so let's make it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t) versus anything else. In subsequent patches, we're going to use agblock bitmaps to enhance the rmapbt checker to look for discrepancies between the rmapbt records and AG metadata block usage. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bitmap.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/repair.h | 1 + 2 files changed, 49 insertions(+) diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 2ec4e1f3f24c..972d5445cdb6 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -39,4 +39,52 @@ int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn, bool xbitmap_empty(struct xbitmap *bitmap); +/* Bitmaps, but for type-checked for xfs_agblock_t */ + +struct xagb_bitmap { + struct xbitmap agbitmap; +}; + +static inline void xagb_bitmap_init(struct xagb_bitmap *bitmap) +{ + xbitmap_init(&bitmap->agbitmap); +} + +static inline void xagb_bitmap_destroy(struct xagb_bitmap *bitmap) +{ + xbitmap_destroy(&bitmap->agbitmap); +} + +static inline int xagb_bitmap_clear(struct xagb_bitmap *bitmap, + xfs_agblock_t start, xfs_extlen_t len) +{ + return xbitmap_clear(&bitmap->agbitmap, start, len); +} +static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap, + xfs_agblock_t start, xfs_extlen_t len) +{ + return xbitmap_set(&bitmap->agbitmap, start, len); +} + +static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap, + struct xagb_bitmap *sub) +{ + return xbitmap_disunion(&bitmap->agbitmap, &sub->agbitmap); +} + +static inline uint32_t xagb_bitmap_hweight(struct xagb_bitmap *bitmap) +{ + return xbitmap_hweight(&bitmap->agbitmap); +} +static inline bool xagb_bitmap_empty(struct xagb_bitmap *bitmap) +{ + return xbitmap_empty(&bitmap->agbitmap); +} + +static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap, + xbitmap_walk_fn fn, void *priv) +{ + return xbitmap_walk(&bitmap->agbitmap, fn, priv); +} + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 4fbb52228c48..dce791c679ee 
100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -31,6 +31,7 @@ int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb, const struct xfs_buf_ops *ops); struct xbitmap; +struct xagb_bitmap; int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink); int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xbitmap *btlist); -- cgit v1.2.3 From fed050f3452da070fa90fc1b02c2bc2219d687a7 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:38 -0700 Subject: xfs: cross-reference rmap records with ag btrees Strengthen the rmap btree record checker a little more by comparing OWN_FS and OWN_LOG reverse mappings against the AG headers and internal logs, respectively. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/Makefile | 2 +- fs/xfs/scrub/bitmap.c | 22 +++++++ fs/xfs/scrub/bitmap.h | 21 +++++++ fs/xfs/scrub/rmap.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 202 insertions(+), 2 deletions(-) diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index ac9d03cd2623..16e4eb431230 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -148,6 +148,7 @@ xfs-y += $(addprefix scrub/, \ agheader.o \ alloc.o \ attr.o \ + bitmap.o \ bmap.o \ btree.o \ common.o \ @@ -172,7 +173,6 @@ xfs-$(CONFIG_XFS_QUOTA) += scrub/quota.o ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ - bitmap.o \ repair.o \ ) endif diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index dc139f0031dc..85e5beda186f 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -396,3 +396,25 @@ xbitmap_empty( { return bitmap->xb_root.rb_root.rb_node == NULL; } + +/* Is the start of the range set or clear? And for how long? */ +bool +xbitmap_test( + struct xbitmap *bitmap, + uint64_t start, + uint64_t *len) +{ + struct xbitmap_node *bn; + uint64_t last = start + *len - 1; + + bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); + if (!bn) + return false; + if (bn->bn_start <= start) { + if (bn->bn_last < last) + *len = bn->bn_last - start + 1; + return true; + } + *len = bn->bn_start - start; + return false; +} diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 972d5445cdb6..55441feb039f 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -38,6 +38,7 @@ int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn, void *priv); bool xbitmap_empty(struct xbitmap *bitmap); +bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len); /* Bitmaps, but for type-checked for xfs_agblock_t */ @@ -66,6 +67,26 @@ static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap, return xbitmap_set(&bitmap->agbitmap, start, len); } +static inline bool +xagb_bitmap_test( + struct xagb_bitmap *bitmap, + xfs_agblock_t start, + xfs_extlen_t *len) +{ + uint64_t biglen = *len; + bool ret; + + ret = xbitmap_test(&bitmap->agbitmap, start, &biglen); + + if (start + biglen >= UINT_MAX) { + ASSERT(0); + biglen = UINT_MAX - start; + } + + *len = biglen; + return ret; +} + static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap, struct xagb_bitmap *sub) { diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 6d7e294110a2..759349ccca26 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -12,10 +12,12 @@ #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" +#include "xfs_ag.h" +#include "xfs_bit.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" -#include "xfs_ag.h" +#include "scrub/bitmap.h" /* * Set us 
up to scrub reverse mapping btrees. @@ -45,6 +47,13 @@ struct xchk_rmap { * that could be one. */ struct xfs_rmap_irec prev_rec; + + /* Bitmaps containing all blocks for each type of AG metadata. */ + struct xagb_bitmap fs_owned; + struct xagb_bitmap log_owned; + + /* Did we complete the AG space metadata bitmaps? */ + bool bitmaps_complete; }; /* Cross-reference a rmap against the refcount btree. */ @@ -249,6 +258,68 @@ xchk_rmapbt_check_mergeable( memcpy(&cr->prev_rec, irec, sizeof(struct xfs_rmap_irec)); } +/* Compare an rmap for AG metadata against the metadata walk. */ +STATIC int +xchk_rmapbt_mark_bitmap( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + struct xfs_scrub *sc = bs->sc; + struct xagb_bitmap *bmp = NULL; + xfs_extlen_t fsbcount = irec->rm_blockcount; + + /* + * Skip corrupt records. It is essential that we detect records in the + * btree that cannot overlap but do, flag those as CORRUPT, and skip + * the bitmap comparison to avoid generating false XCORRUPT reports. + */ + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return 0; + + /* + * If the AG metadata walk didn't complete, there's no point in + * comparing against partial results. + */ + if (!cr->bitmaps_complete) + return 0; + + switch (irec->rm_owner) { + case XFS_RMAP_OWN_FS: + bmp = &cr->fs_owned; + break; + case XFS_RMAP_OWN_LOG: + bmp = &cr->log_owned; + break; + } + + if (!bmp) + return 0; + + if (xagb_bitmap_test(bmp, irec->rm_startblock, &fsbcount)) { + /* + * The start of this reverse mapping corresponds to a set + * region in the bitmap. If the mapping covers more area than + * the set region, then it covers space that wasn't found by + * the AG metadata walk. + */ + if (fsbcount < irec->rm_blockcount) + xchk_btree_xref_set_corrupt(bs->sc, + bs->sc->sa.rmap_cur, 0); + } else { + /* + * The start of this reverse mapping does not correspond to a + * completely set region in the bitmap. The region wasn't + * fully set by walking the AG metadata, so this is a + * cross-referencing corruption. + */ + xchk_btree_xref_set_corrupt(bs->sc, bs->sc->sa.rmap_cur, 0); + } + + /* Unset the region so that we can detect missing rmap records. */ + return xagb_bitmap_clear(bmp, irec->rm_startblock, irec->rm_blockcount); +} + /* Scrub an rmapbt record. */ STATIC int xchk_rmapbt_rec( @@ -268,9 +339,80 @@ xchk_rmapbt_rec( xchk_rmapbt_check_mergeable(bs, cr, &irec); xchk_rmapbt_check_overlapping(bs, cr, &irec); xchk_rmapbt_xref(bs->sc, &irec); + + return xchk_rmapbt_mark_bitmap(bs, cr, &irec); +} + +/* + * Set up bitmaps mapping all the AG metadata to compare with the rmapbt + * records. + */ +STATIC int +xchk_rmapbt_walk_ag_metadata( + struct xfs_scrub *sc, + struct xchk_rmap *cr) +{ + struct xfs_mount *mp = sc->mp; + int error; + + /* OWN_FS: AG headers */ + error = xagb_bitmap_set(&cr->fs_owned, XFS_SB_BLOCK(mp), + XFS_AGFL_BLOCK(mp) - XFS_SB_BLOCK(mp) + 1); + if (error) + goto out; + + /* OWN_LOG: Internal log */ + if (xfs_ag_contains_log(mp, sc->sa.pag->pag_agno)) { + error = xagb_bitmap_set(&cr->log_owned, + XFS_FSB_TO_AGBNO(mp, mp->m_sb.sb_logstart), + mp->m_sb.sb_logblocks); + if (error) + goto out; + } + +out: + /* + * If there's an error, set XFAIL and disable the bitmap + * cross-referencing checks, but proceed with the scrub anyway. 
+ */ + if (error) + xchk_btree_xref_process_error(sc, sc->sa.rmap_cur, + sc->sa.rmap_cur->bc_nlevels - 1, &error); + else + cr->bitmaps_complete = true; return 0; } +/* + * Check for set regions in the bitmaps; if there are any, the rmap records do + * not describe all the AG metadata. + */ +STATIC void +xchk_rmapbt_check_bitmaps( + struct xfs_scrub *sc, + struct xchk_rmap *cr) +{ + struct xfs_btree_cur *cur = sc->sa.rmap_cur; + unsigned int level; + + if (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | + XFS_SCRUB_OFLAG_XFAIL)) + return; + if (!cur) + return; + level = cur->bc_nlevels - 1; + + /* + * Any bitmap with bits still set indicates that the reverse mapping + * doesn't cover the entire primary structure. + */ + if (xagb_bitmap_hweight(&cr->fs_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->log_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); +} + /* Scrub the rmap btree for some AG. */ int xchk_rmapbt( @@ -283,8 +425,23 @@ xchk_rmapbt( if (!cr) return -ENOMEM; + xagb_bitmap_init(&cr->fs_owned); + xagb_bitmap_init(&cr->log_owned); + + error = xchk_rmapbt_walk_ag_metadata(sc, cr); + if (error) + goto out; + error = xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec, &XFS_RMAP_OINFO_AG, cr); + if (error) + goto out; + + xchk_rmapbt_check_bitmaps(sc, cr); + +out: + xagb_bitmap_destroy(&cr->log_owned); + xagb_bitmap_destroy(&cr->fs_owned); kfree(cr); return error; } -- cgit v1.2.3 From 3a3108ea8c1d4f33ca49fa9fc80e6a6e450654cf Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:38 -0700 Subject: xfs: cross-reference rmap records with free space btrees Strengthen the rmap btree record checker a little more by comparing OWN_AG reverse mappings against the free space btrees, the rmap btree, and the AGFL. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/bitmap.c | 33 ++++++++++++++++++++++++++ fs/xfs/scrub/bitmap.h | 3 +++ fs/xfs/scrub/rmap.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 102 insertions(+) diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index 85e5beda186f..0c959be396ea 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -6,6 +6,7 @@ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" +#include "xfs_bit.h" #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" @@ -262,6 +263,38 @@ xbitmap_disunion( * For the 300th record we just exit, with the list being [1, 4, 2, 3]. */ +/* Mark a btree block to the agblock bitmap. */ +STATIC int +xagb_bitmap_visit_btblock( + struct xfs_btree_cur *cur, + int level, + void *priv) +{ + struct xagb_bitmap *bitmap = priv; + struct xfs_buf *bp; + xfs_fsblock_t fsbno; + xfs_agblock_t agbno; + + xfs_btree_get_block(cur, level, &bp); + if (!bp) + return 0; + + fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); + agbno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno); + + return xagb_bitmap_set(bitmap, agbno, 1); +} + +/* Mark all (per-AG) btree blocks in the agblock bitmap. */ +int +xagb_bitmap_set_btblocks( + struct xagb_bitmap *bitmap, + struct xfs_btree_cur *cur) +{ + return xfs_btree_visit_blocks(cur, xagb_bitmap_visit_btblock, + XFS_BTREE_VISIT_ALL, bitmap); +} + /* * Record all the buffers pointed to by the btree cursor. 
Callers already * engaged in a btree walk should call this function to capture the list of diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 55441feb039f..84981724ecaf 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -108,4 +108,7 @@ static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap, return xbitmap_walk(&bitmap->agbitmap, fn, priv); } +int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap, + struct xfs_btree_cur *cur); + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 759349ccca26..1febadb269c5 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -7,13 +7,17 @@ #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" +#include "xfs_trans.h" #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" #include "xfs_ag.h" #include "xfs_bit.h" +#include "xfs_alloc.h" +#include "xfs_alloc_btree.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" @@ -51,6 +55,7 @@ struct xchk_rmap { /* Bitmaps containing all blocks for each type of AG metadata. */ struct xagb_bitmap fs_owned; struct xagb_bitmap log_owned; + struct xagb_bitmap ag_owned; /* Did we complete the AG space metadata bitmaps? */ bool bitmaps_complete; @@ -291,6 +296,9 @@ xchk_rmapbt_mark_bitmap( case XFS_RMAP_OWN_LOG: bmp = &cr->log_owned; break; + case XFS_RMAP_OWN_AG: + bmp = &cr->ag_owned; + break; } if (!bmp) @@ -343,9 +351,26 @@ xchk_rmapbt_rec( return xchk_rmapbt_mark_bitmap(bs, cr, &irec); } +/* Add an AGFL block to the rmap list. */ +STATIC int +xchk_rmapbt_walk_agfl( + struct xfs_mount *mp, + xfs_agblock_t agbno, + void *priv) +{ + struct xagb_bitmap *bitmap = priv; + + return xagb_bitmap_set(bitmap, agbno, 1); +} + /* * Set up bitmaps mapping all the AG metadata to compare with the rmapbt * records. + * + * Grab our own btree cursors here if the scrub setup function didn't give us a + * btree cursor due to reports of poor health. We need to find out if the + * rmapbt disagrees with primary metadata btrees to tag the rmapbt as being + * XCORRUPT. 
*/ STATIC int xchk_rmapbt_walk_ag_metadata( @@ -353,6 +378,9 @@ xchk_rmapbt_walk_ag_metadata( struct xchk_rmap *cr) { struct xfs_mount *mp = sc->mp; + struct xfs_buf *agfl_bp; + struct xfs_agf *agf = sc->sa.agf_bp->b_addr; + struct xfs_btree_cur *cur; int error; /* OWN_FS: AG headers */ @@ -370,6 +398,39 @@ xchk_rmapbt_walk_ag_metadata( goto out; } + /* OWN_AG: bnobt, cntbt, rmapbt, and AGFL */ + cur = sc->sa.bno_cur; + if (!cur) + cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.pag, XFS_BTNUM_BNO); + error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur); + if (cur != sc->sa.bno_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + cur = sc->sa.cnt_cur; + if (!cur) + cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.pag, XFS_BTNUM_CNT); + error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur); + if (cur != sc->sa.cnt_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + error = xagb_bitmap_set_btblocks(&cr->ag_owned, sc->sa.rmap_cur); + if (error) + goto out; + + error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp); + if (error) + goto out; + + error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xchk_rmapbt_walk_agfl, + &cr->ag_owned); + xfs_trans_brelse(sc->tp, agfl_bp); + out: /* * If there's an error, set XFAIL and disable the bitmap @@ -411,6 +472,9 @@ xchk_rmapbt_check_bitmaps( if (xagb_bitmap_hweight(&cr->log_owned) != 0) xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->ag_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); } /* Scrub the rmap btree for some AG. */ @@ -427,6 +491,7 @@ xchk_rmapbt( xagb_bitmap_init(&cr->fs_owned); xagb_bitmap_init(&cr->log_owned); + xagb_bitmap_init(&cr->ag_owned); error = xchk_rmapbt_walk_ag_metadata(sc, cr); if (error) @@ -440,6 +505,7 @@ xchk_rmapbt( xchk_rmapbt_check_bitmaps(sc, cr); out: + xagb_bitmap_destroy(&cr->ag_owned); xagb_bitmap_destroy(&cr->log_owned); xagb_bitmap_destroy(&cr->fs_owned); kfree(cr); -- cgit v1.2.3 From 0abe6fc53bdb10a5b0b5235043ca1acc749fe069 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:39 -0700 Subject: xfs: cross-reference rmap records with inode btrees Strengthen the rmap btree record checker a little more by comparing OWN_INOBT reverse mappings against the inode btrees. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/rmap.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 1febadb269c5..834b32063301 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -18,6 +18,7 @@ #include "xfs_bit.h" #include "xfs_alloc.h" #include "xfs_alloc_btree.h" +#include "xfs_ialloc_btree.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" @@ -56,6 +57,7 @@ struct xchk_rmap { struct xagb_bitmap fs_owned; struct xagb_bitmap log_owned; struct xagb_bitmap ag_owned; + struct xagb_bitmap inobt_owned; /* Did we complete the AG space metadata bitmaps? 
*/ bool bitmaps_complete; @@ -299,6 +301,9 @@ xchk_rmapbt_mark_bitmap( case XFS_RMAP_OWN_AG: bmp = &cr->ag_owned; break; + case XFS_RMAP_OWN_INOBT: + bmp = &cr->inobt_owned; + break; } if (!bmp) @@ -430,6 +435,31 @@ xchk_rmapbt_walk_ag_metadata( error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xchk_rmapbt_walk_agfl, &cr->ag_owned); xfs_trans_brelse(sc->tp, agfl_bp); + if (error) + goto out; + + /* OWN_INOBT: inobt, finobt */ + cur = sc->sa.ino_cur; + if (!cur) + cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, sc->sa.agi_bp, + XFS_BTNUM_INO); + error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur); + if (cur != sc->sa.ino_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + if (xfs_has_finobt(sc->mp)) { + cur = sc->sa.fino_cur; + if (!cur) + cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, + sc->sa.agi_bp, XFS_BTNUM_FINO); + error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur); + if (cur != sc->sa.fino_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + } out: /* @@ -475,6 +505,9 @@ xchk_rmapbt_check_bitmaps( if (xagb_bitmap_hweight(&cr->ag_owned) != 0) xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->inobt_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); } /* Scrub the rmap btree for some AG. */ @@ -492,6 +525,7 @@ xchk_rmapbt( xagb_bitmap_init(&cr->fs_owned); xagb_bitmap_init(&cr->log_owned); xagb_bitmap_init(&cr->ag_owned); + xagb_bitmap_init(&cr->inobt_owned); error = xchk_rmapbt_walk_ag_metadata(sc, cr); if (error) @@ -505,6 +539,7 @@ xchk_rmapbt( xchk_rmapbt_check_bitmaps(sc, cr); out: + xagb_bitmap_destroy(&cr->inobt_owned); xagb_bitmap_destroy(&cr->ag_owned); xagb_bitmap_destroy(&cr->log_owned); xagb_bitmap_destroy(&cr->fs_owned); -- cgit v1.2.3 From 4f5e304248ab4939e9aef58244041c194f01f0b5 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:00:39 -0700 Subject: xfs: cross-reference rmap records with refcount btrees Strengthen the rmap btree record checker a little more by comparing OWN_REFCBT reverse mappings against the refcount btrees. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner --- fs/xfs/scrub/rmap.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 834b32063301..d29a26ecddd6 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -19,6 +19,7 @@ #include "xfs_alloc.h" #include "xfs_alloc_btree.h" #include "xfs_ialloc_btree.h" +#include "xfs_refcount_btree.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" @@ -58,6 +59,7 @@ struct xchk_rmap { struct xagb_bitmap log_owned; struct xagb_bitmap ag_owned; struct xagb_bitmap inobt_owned; + struct xagb_bitmap refcbt_owned; /* Did we complete the AG space metadata bitmaps? 
*/ bool bitmaps_complete; @@ -304,6 +306,9 @@ xchk_rmapbt_mark_bitmap( case XFS_RMAP_OWN_INOBT: bmp = &cr->inobt_owned; break; + case XFS_RMAP_OWN_REFC: + bmp = &cr->refcbt_owned; + break; } if (!bmp) @@ -461,6 +466,19 @@ xchk_rmapbt_walk_ag_metadata( goto out; } + /* OWN_REFC: refcountbt */ + if (xfs_has_reflink(sc->mp)) { + cur = sc->sa.refc_cur; + if (!cur) + cur = xfs_refcountbt_init_cursor(sc->mp, sc->tp, + sc->sa.agf_bp, sc->sa.pag); + error = xagb_bitmap_set_btblocks(&cr->refcbt_owned, cur); + if (cur != sc->sa.refc_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + } + out: /* * If there's an error, set XFAIL and disable the bitmap @@ -508,6 +526,9 @@ xchk_rmapbt_check_bitmaps( if (xagb_bitmap_hweight(&cr->inobt_owned) != 0) xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->refcbt_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); } /* Scrub the rmap btree for some AG. */ @@ -526,6 +547,7 @@ xchk_rmapbt( xagb_bitmap_init(&cr->log_owned); xagb_bitmap_init(&cr->ag_owned); xagb_bitmap_init(&cr->inobt_owned); + xagb_bitmap_init(&cr->refcbt_owned); error = xchk_rmapbt_walk_ag_metadata(sc, cr); if (error) @@ -539,6 +561,7 @@ xchk_rmapbt( xchk_rmapbt_check_bitmaps(sc, cr); out: + xagb_bitmap_destroy(&cr->refcbt_owned); xagb_bitmap_destroy(&cr->inobt_owned); xagb_bitmap_destroy(&cr->ag_owned); xagb_bitmap_destroy(&cr->log_owned); -- cgit v1.2.3 From a9248538facc3d9e769489e50a544509c2f9cebe Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:05:04 -0700 Subject: xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Back in the old days, the "ascii-ci" feature was created to implement case-insensitive directory entry lookups for latin1-encoded names and remove the large overhead of Samba's case-insensitive lookup code. UTF8 names were not allowed, but nobody explicitly wrote in the documentation that this was only expected to work if the system used latin1 names. The kernel tolower function was selected to prepare names for hashed lookups. There's a major discrepancy in the function that computes directory entry hashes for filesystems that have ASCII case-insensitive lookups enabled. The root of this is that the kernel and glibc's tolower implementations have differing behavior for extended ASCII accented characters. I wrote a program to spit out characters for which the tolower() return value is different from the input: glibc tolower: 65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z kernel tolower: 65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z 192:À 193:Á 194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í 206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù 218:Ú 219:Û 220:Ü 221:Ý 222:Þ Which means that the kernel and userspace do not agree on the hash value for a directory filename that contains those higher values. The hash values are written into the leaf index block of directories that are larger than two blocks in size, which means that xfs_repair will flag these directories as having corrupted hash indexes and rewrite the index with hash values that the kernel now will not recognize. 
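[Editor's sketch, not part of the patch: the divergence is easy to reproduce in user space. kernel_xfrm() below hand-encodes the kernel ctype behavior and matches the xfs_ascii_ci_xfrm() helper this patch adds; glibc_xfrm() is plain tolower() in the default C locale; and the hash loop mirrors xfs_ascii_ci_hashname().]

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* glibc tolower() in the default C locale folds only A-Z. */
static unsigned char glibc_xfrm(unsigned char c)
{
	return tolower(c);
}

/* The kernel also folds the latin1 accented capitals (0xc0-0xde, except 0xd7). */
static unsigned char kernel_xfrm(unsigned char c)
{
	if ((c >= 'A' && c <= 'Z') ||
	    (c >= 0xc0 && c <= 0xd6) ||
	    (c >= 0xd8 && c <= 0xde))
		c += 0x20;
	return c;
}

/* Same loop as xfs_ascii_ci_hashname(), parameterized by the transform. */
static uint32_t ci_hash(const unsigned char *name, unsigned int len,
			unsigned char (*xfrm)(unsigned char))
{
	uint32_t hash = 0;
	unsigned int i;

	for (i = 0; i < len; i++)
		hash = xfrm(name[i]) ^ rol32(hash, 7);
	return hash;
}

int main(void)
{
	const unsigned char name[] = "\xc4QUAVIT";	/* "ÄQUAVIT" in latin1 */

	printf("glibc-style hash:  0x%08x\n", ci_hash(name, 7, glibc_xfrm));
	printf("kernel-style hash: 0x%08x\n", ci_hash(name, 7, kernel_xfrm));
	return 0;
}

Any latin1 name containing bytes in the 0xc0-0xde range (other than 0xd7) hashes differently under the two transforms, which is exactly the kernel/userspace disagreement described above.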
Because the ascii-ci feature is not frequently enabled and the kernel touches filesystems far more frequently than xfs_repair does, fix this by encoding the kernel's toupper predicate and tolower functions into libxfs. Give the new functions less provocative names to make it really obvious that this is a pre-hash name preparation function, and nothing else. This change makes userspace's behavior consistent with the kernel. Found by auditing obfuscate_name in xfs_metadump as part of working on parent pointers, wondering how it could possibly work correctly with ci filesystems, writing a test tool to create a directory with hash-colliding names, and watching xfs_repair flag it. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_dir2.c | 5 +++-- fs/xfs/libxfs/xfs_dir2.h | 31 +++++++++++++++++++++++++++++++ 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 92bac3373f1f..f5462fd582d5 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -64,7 +64,7 @@ xfs_ascii_ci_hashname( int i; for (i = 0, hash = 0; i < name->len; i++) - hash = tolower(name->name[i]) ^ rol32(hash, 7); + hash = xfs_ascii_ci_xfrm(name->name[i]) ^ rol32(hash, 7); return hash; } @@ -85,7 +85,8 @@ xfs_ascii_ci_compname( for (i = 0; i < len; i++) { if (args->name[i] == name[i]) continue; - if (tolower(args->name[i]) != tolower(name[i])) + if (xfs_ascii_ci_xfrm(args->name[i]) != + xfs_ascii_ci_xfrm(name[i])) return XFS_CMP_DIFFERENT; result = XFS_CMP_CASE; } diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index dd39f17dd9a9..19af22a16c41 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -248,4 +248,35 @@ unsigned int xfs_dir3_data_end_offset(struct xfs_da_geometry *geo, struct xfs_dir2_data_hdr *hdr); bool xfs_dir2_namecheck(const void *name, size_t length); +/* + * The "ascii-ci" feature was created to speed up case-insensitive lookups for + * a Samba product. Because of the inherent problems with CI and UTF-8 + * encoding, etc, it was decided that Samba would be configured to export + * latin1/iso 8859-1 encodings as that covered >90% of the target markets for + * the product. Hence the "ascii-ci" casefolding code could be encoded into + * the XFS directory operations and remove all the overhead of casefolding from + * Samba. + * + * To provide consistent hashing behavior between the userspace and kernel, + * these functions prepare names for hashing by transforming specific bytes + * to other bytes. Robustness with other encodings is not guaranteed. + */ +static inline bool xfs_ascii_ci_need_xfrm(unsigned char c) +{ + if (c >= 0x41 && c <= 0x5a) /* A-Z */ + return true; + if (c >= 0xc0 && c <= 0xd6) /* latin A-O with accents */ + return true; + if (c >= 0xd8 && c <= 0xde) /* latin O-Y with accents */ + return true; + return false; +} + +static inline unsigned char xfs_ascii_ci_xfrm(unsigned char c) +{ + if (xfs_ascii_ci_need_xfrm(c)) + c -= 'A' - 'a'; + return c; +} + #endif /* __XFS_DIR2_H__ */ -- cgit v1.2.3 From 6db09a8d0377866b99882f8d44e4ba1a1bf6c6a1 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:05:05 -0700 Subject: xfs: test the ascii case-insensitive hash Now that we've made kernel and userspace use the same tolower code for computing directory index hashes, add that to the selftest code. Signed-off-by: Darrick J. 
Wong Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_dahash_test.c | 211 +++++++++++++++++++++++++---------------------- 1 file changed, 111 insertions(+), 100 deletions(-) diff --git a/fs/xfs/xfs_dahash_test.c b/fs/xfs/xfs_dahash_test.c index 230651ab5ce4..0dab5941e080 100644 --- a/fs/xfs/xfs_dahash_test.c +++ b/fs/xfs/xfs_dahash_test.c @@ -9,6 +9,9 @@ #include "xfs_format.h" #include "xfs_da_format.h" #include "xfs_da_btree.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_dir2_priv.h" #include "xfs_dahash_test.h" /* 4096 random bytes */ @@ -533,108 +536,109 @@ static struct dahash_test { uint16_t start; /* random 12 bit offset in buf */ uint16_t length; /* random 8 bit length of test */ xfs_dahash_t dahash; /* expected dahash result */ + xfs_dahash_t ascii_ci_dahash; /* expected ascii-ci dahash result */ } test[] __initdata = { - {0x0567, 0x0097, 0x96951389}, - {0x0869, 0x0055, 0x6455ab4f}, - {0x0c51, 0x00be, 0x8663afde}, - {0x044a, 0x00fc, 0x98fbe432}, - {0x0f29, 0x0079, 0x42371997}, - {0x08ba, 0x0052, 0x942be4f7}, - {0x01f2, 0x0013, 0x5262687e}, - {0x09e3, 0x00e2, 0x8ffb0908}, - {0x007c, 0x0051, 0xb3158491}, - {0x0854, 0x001f, 0x83bb20d9}, - {0x031b, 0x0008, 0x98970bdf}, - {0x0de7, 0x0027, 0xbfbf6f6c}, - {0x0f76, 0x0005, 0x906a7105}, - {0x092e, 0x00d0, 0x86631850}, - {0x0233, 0x0082, 0xdbdd914e}, - {0x04c9, 0x0075, 0x5a400a9e}, - {0x0b66, 0x0099, 0xae128b45}, - {0x000d, 0x00ed, 0xe61c216a}, - {0x0a31, 0x003d, 0xf69663b9}, - {0x00a3, 0x0052, 0x643c39ae}, - {0x0125, 0x00d5, 0x7c310b0d}, - {0x0105, 0x004a, 0x06a77e74}, - {0x0858, 0x008e, 0x265bc739}, - {0x045e, 0x0095, 0x13d6b192}, - {0x0dab, 0x003c, 0xc4498704}, - {0x00cd, 0x00b5, 0x802a4e2d}, - {0x069b, 0x008c, 0x5df60f71}, - {0x0454, 0x006c, 0x5f03d8bb}, - {0x040e, 0x0032, 0x0ce513b5}, - {0x0874, 0x00e2, 0x6a811fb3}, - {0x0521, 0x00b4, 0x93296833}, - {0x0ddc, 0x00cf, 0xf9305338}, - {0x0a70, 0x0023, 0x239549ea}, - {0x083e, 0x0027, 0x2d88ba97}, - {0x0241, 0x00a7, 0xfe0b32e1}, - {0x0dfc, 0x0096, 0x1a11e815}, - {0x023e, 0x001e, 0xebc9a1f3}, - {0x067e, 0x0066, 0xb1067f81}, - {0x09ea, 0x000e, 0x46fd7247}, - {0x036b, 0x008c, 0x1a39acdf}, - {0x078f, 0x0030, 0x964042ab}, - {0x085c, 0x008f, 0x1829edab}, - {0x02ec, 0x009f, 0x6aefa72d}, - {0x043b, 0x00ce, 0x65642ff5}, - {0x0a32, 0x00b8, 0xbd82759e}, - {0x0d3c, 0x0087, 0xf4d66d54}, - {0x09ec, 0x008a, 0x06bfa1ff}, - {0x0902, 0x0015, 0x755025d2}, - {0x08fe, 0x000e, 0xf690ce2d}, - {0x00fb, 0x00dc, 0xe55f1528}, - {0x0eaa, 0x003a, 0x0fe0a8d7}, - {0x05fb, 0x0006, 0x86281cfb}, - {0x0dd1, 0x00a7, 0x60ab51b4}, - {0x0005, 0x001b, 0xf51d969b}, - {0x077c, 0x00dd, 0xc2fed268}, - {0x0575, 0x00f5, 0x432c0b1a}, - {0x05be, 0x0088, 0x78baa04b}, - {0x0c89, 0x0068, 0xeda9e428}, - {0x0f5c, 0x0068, 0xec143c76}, - {0x06a8, 0x0009, 0xd72651ce}, - {0x060f, 0x008e, 0x765426cd}, - {0x07b1, 0x0047, 0x2cfcfa0c}, - {0x04f1, 0x0041, 0x55b172f9}, - {0x0e05, 0x00ac, 0x61efde93}, - {0x0bf7, 0x0097, 0x05b83eee}, - {0x04e9, 0x00f3, 0x9928223a}, - {0x023a, 0x0005, 0xdfada9bc}, - {0x0acb, 0x000e, 0x2217cecd}, - {0x0148, 0x0060, 0xbc3f7405}, - {0x0764, 0x0059, 0xcbc201b1}, - {0x021f, 0x0059, 0x5d6b2256}, - {0x0f1e, 0x006c, 0xdefeeb45}, - {0x071c, 0x00b9, 0xb9b59309}, - {0x0564, 0x0063, 0xae064271}, - {0x0b14, 0x0044, 0xdb867d9b}, - {0x0e5a, 0x0055, 0xff06b685}, - {0x015e, 0x00ba, 0x1115ccbc}, - {0x0379, 0x00e6, 0x5f4e58dd}, - {0x013b, 0x0067, 0x4897427e}, - {0x0e64, 0x0071, 0x7af2b7a4}, - {0x0a11, 0x0050, 0x92105726}, - {0x0109, 0x0055, 0xd0d000f9}, - {0x00aa, 0x0022, 0x815d229d}, - {0x09ac, 0x004f, 0x02f9d985}, - 
{0x0e1b, 0x00ce, 0x5cf92ab4}, - {0x08af, 0x00d8, 0x17ca72d1}, - {0x0e33, 0x000a, 0xda2dba6b}, - {0x0ee3, 0x006a, 0xb00048e5}, - {0x0648, 0x001a, 0x2364b8cb}, - {0x0315, 0x0085, 0x0596fd0d}, - {0x0fbb, 0x003e, 0x298230ca}, - {0x0422, 0x006a, 0x78ada4ab}, - {0x04ba, 0x0073, 0xced1fbc2}, - {0x007d, 0x0061, 0x4b7ff236}, - {0x070b, 0x00d0, 0x261cf0ae}, - {0x0c1a, 0x0035, 0x8be92ee2}, - {0x0af8, 0x0063, 0x824dcf03}, - {0x08f8, 0x006d, 0xd289710c}, - {0x021b, 0x00ee, 0x6ac1c41d}, - {0x05b5, 0x00da, 0x8e52f0e2}, + {0x0567, 0x0097, 0x96951389, 0xc153aa0d}, + {0x0869, 0x0055, 0x6455ab4f, 0xd07f69bf}, + {0x0c51, 0x00be, 0x8663afde, 0xf9add90c}, + {0x044a, 0x00fc, 0x98fbe432, 0xbf2abb76}, + {0x0f29, 0x0079, 0x42371997, 0x282588b3}, + {0x08ba, 0x0052, 0x942be4f7, 0x2e023547}, + {0x01f2, 0x0013, 0x5262687e, 0x5266287e}, + {0x09e3, 0x00e2, 0x8ffb0908, 0x1da892f3}, + {0x007c, 0x0051, 0xb3158491, 0xb67f9e63}, + {0x0854, 0x001f, 0x83bb20d9, 0x22bb21db}, + {0x031b, 0x0008, 0x98970bdf, 0x9cd70adf}, + {0x0de7, 0x0027, 0xbfbf6f6c, 0xae3f296c}, + {0x0f76, 0x0005, 0x906a7105, 0x906a7105}, + {0x092e, 0x00d0, 0x86631850, 0xa3f6ac04}, + {0x0233, 0x0082, 0xdbdd914e, 0x5d8c7aac}, + {0x04c9, 0x0075, 0x5a400a9e, 0x12f60711}, + {0x0b66, 0x0099, 0xae128b45, 0x7551310d}, + {0x000d, 0x00ed, 0xe61c216a, 0xc22d3c4c}, + {0x0a31, 0x003d, 0xf69663b9, 0x51960bf8}, + {0x00a3, 0x0052, 0x643c39ae, 0xa93c73a8}, + {0x0125, 0x00d5, 0x7c310b0d, 0xf221cbb3}, + {0x0105, 0x004a, 0x06a77e74, 0xa4ef4561}, + {0x0858, 0x008e, 0x265bc739, 0xd6c36d9b}, + {0x045e, 0x0095, 0x13d6b192, 0x5f5c1d62}, + {0x0dab, 0x003c, 0xc4498704, 0x10414654}, + {0x00cd, 0x00b5, 0x802a4e2d, 0xfbd17c9d}, + {0x069b, 0x008c, 0x5df60f71, 0x91ddca5f}, + {0x0454, 0x006c, 0x5f03d8bb, 0x5c59fce0}, + {0x040e, 0x0032, 0x0ce513b5, 0xa8cd99b1}, + {0x0874, 0x00e2, 0x6a811fb3, 0xca028316}, + {0x0521, 0x00b4, 0x93296833, 0x2c4d4880}, + {0x0ddc, 0x00cf, 0xf9305338, 0x2c94210d}, + {0x0a70, 0x0023, 0x239549ea, 0x22b561aa}, + {0x083e, 0x0027, 0x2d88ba97, 0x5cd8bb9d}, + {0x0241, 0x00a7, 0xfe0b32e1, 0x17b506b8}, + {0x0dfc, 0x0096, 0x1a11e815, 0xee4141bd}, + {0x023e, 0x001e, 0xebc9a1f3, 0x5689a1f3}, + {0x067e, 0x0066, 0xb1067f81, 0xd9952571}, + {0x09ea, 0x000e, 0x46fd7247, 0x42b57245}, + {0x036b, 0x008c, 0x1a39acdf, 0x58bf1586}, + {0x078f, 0x0030, 0x964042ab, 0xb04218b9}, + {0x085c, 0x008f, 0x1829edab, 0x9ceca89c}, + {0x02ec, 0x009f, 0x6aefa72d, 0x634cc2a7}, + {0x043b, 0x00ce, 0x65642ff5, 0x6c8a584e}, + {0x0a32, 0x00b8, 0xbd82759e, 0x0f96a34f}, + {0x0d3c, 0x0087, 0xf4d66d54, 0xb71ba5f4}, + {0x09ec, 0x008a, 0x06bfa1ff, 0x576ca80f}, + {0x0902, 0x0015, 0x755025d2, 0x517225c2}, + {0x08fe, 0x000e, 0xf690ce2d, 0xf690cf3d}, + {0x00fb, 0x00dc, 0xe55f1528, 0x707d7d92}, + {0x0eaa, 0x003a, 0x0fe0a8d7, 0x87638cc5}, + {0x05fb, 0x0006, 0x86281cfb, 0x86281cf9}, + {0x0dd1, 0x00a7, 0x60ab51b4, 0xe28ef00c}, + {0x0005, 0x001b, 0xf51d969b, 0xe71dd6d3}, + {0x077c, 0x00dd, 0xc2fed268, 0xdc30c555}, + {0x0575, 0x00f5, 0x432c0b1a, 0x81dd7d16}, + {0x05be, 0x0088, 0x78baa04b, 0xd69b433e}, + {0x0c89, 0x0068, 0xeda9e428, 0xe9b4fa0a}, + {0x0f5c, 0x0068, 0xec143c76, 0x9947067a}, + {0x06a8, 0x0009, 0xd72651ce, 0xd72651ee}, + {0x060f, 0x008e, 0x765426cd, 0x2099626f}, + {0x07b1, 0x0047, 0x2cfcfa0c, 0x1a4baa07}, + {0x04f1, 0x0041, 0x55b172f9, 0x15331a79}, + {0x0e05, 0x00ac, 0x61efde93, 0x320568cc}, + {0x0bf7, 0x0097, 0x05b83eee, 0xc72fb7a3}, + {0x04e9, 0x00f3, 0x9928223a, 0xe8c77de2}, + {0x023a, 0x0005, 0xdfada9bc, 0xdfadb9be}, + {0x0acb, 0x000e, 0x2217cecd, 0x0017d6cd}, + {0x0148, 0x0060, 0xbc3f7405, 0xf5fd6615}, + 
{0x0764, 0x0059, 0xcbc201b1, 0xbb089bf4}, + {0x021f, 0x0059, 0x5d6b2256, 0xa16a0a59}, + {0x0f1e, 0x006c, 0xdefeeb45, 0xfc34f9d6}, + {0x071c, 0x00b9, 0xb9b59309, 0xb645eae2}, + {0x0564, 0x0063, 0xae064271, 0x954dc6d1}, + {0x0b14, 0x0044, 0xdb867d9b, 0xdf432309}, + {0x0e5a, 0x0055, 0xff06b685, 0xa65ff257}, + {0x015e, 0x00ba, 0x1115ccbc, 0x11c365f4}, + {0x0379, 0x00e6, 0x5f4e58dd, 0x2d176d31}, + {0x013b, 0x0067, 0x4897427e, 0xc40532fe}, + {0x0e64, 0x0071, 0x7af2b7a4, 0x1fb7bf43}, + {0x0a11, 0x0050, 0x92105726, 0xb1185e51}, + {0x0109, 0x0055, 0xd0d000f9, 0x60a60bfd}, + {0x00aa, 0x0022, 0x815d229d, 0x215d379c}, + {0x09ac, 0x004f, 0x02f9d985, 0x10b90b20}, + {0x0e1b, 0x00ce, 0x5cf92ab4, 0x6a477573}, + {0x08af, 0x00d8, 0x17ca72d1, 0x385af156}, + {0x0e33, 0x000a, 0xda2dba6b, 0xda2dbb69}, + {0x0ee3, 0x006a, 0xb00048e5, 0xa9a2decc}, + {0x0648, 0x001a, 0x2364b8cb, 0x3364b1cb}, + {0x0315, 0x0085, 0x0596fd0d, 0xa651740f}, + {0x0fbb, 0x003e, 0x298230ca, 0x7fc617c7}, + {0x0422, 0x006a, 0x78ada4ab, 0xc576ae2a}, + {0x04ba, 0x0073, 0xced1fbc2, 0xaac8455b}, + {0x007d, 0x0061, 0x4b7ff236, 0x347d5739}, + {0x070b, 0x00d0, 0x261cf0ae, 0xc7fb1c10}, + {0x0c1a, 0x0035, 0x8be92ee2, 0x8be9b4e1}, + {0x0af8, 0x0063, 0x824dcf03, 0x53010388}, + {0x08f8, 0x006d, 0xd289710c, 0x30418edd}, + {0x021b, 0x00ee, 0x6ac1c41d, 0x2557e9a3}, + {0x05b5, 0x00da, 0x8e52f0e2, 0x98531012}, }; int __init @@ -644,12 +648,19 @@ xfs_dahash_test(void) unsigned int errors = 0; for (i = 0; i < ARRAY_SIZE(test); i++) { + struct xfs_name xname = { }; xfs_dahash_t hash; hash = xfs_da_hashname(test_buf + test[i].start, test[i].length); if (hash != test[i].dahash) errors++; + + xname.name = test_buf + test[i].start; + xname.len = test[i].length; + hash = xfs_ascii_ci_hashname(&xname); + if (hash != test[i].ascii_ci_dahash) + errors++; } if (errors) { -- cgit v1.2.3 From 7ba83850ca2691865713b307ed001bde5fddb084 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 11 Apr 2023 19:05:19 -0700 Subject: xfs: deprecate the ascii-ci feature This feature is a mess -- the hash function has been broken for the entire 15 years of its existence if you create names with extended ascii bytes; metadump name obfuscation has silently failed for just as long; and the feature clashes horribly with the UTF8 encodings that most systems use today. There is exactly one fstest for this feature. In other words, this feature is crap. Let's deprecate it now so we can remove it from the codebase in 2030. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig --- Documentation/admin-guide/xfs.rst | 1 + fs/xfs/Kconfig | 27 +++++++++++++++++++++++++++ fs/xfs/xfs_super.c | 13 +++++++++++++ 3 files changed, 41 insertions(+) diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index e2561416391c..e85a9404d5c0 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -240,6 +240,7 @@ Deprecated Mount Options Name Removal Schedule =========================== ================ Mounting with V4 filesystem September 2030 +Mounting ascii-ci filesystem September 2030 ikeep/noikeep September 2025 attr2/noattr2 September 2025 =========================== ================ diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 05bc865142b8..52e1823241fb 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -47,6 +47,33 @@ config XFS_SUPPORT_V4 To continue supporting the old V4 format (crc=0), say Y. To close off an attack surface, say N. 
+config XFS_SUPPORT_ASCII_CI + bool "Support deprecated case-insensitive ascii (ascii-ci=1) format" + depends on XFS_FS + default y + help + The ASCII case insensitivity filesystem feature only works correctly + on systems that have been coerced into using ISO 8859-1, and it does + not work on extended attributes. The kernel has no visibility into + the locale settings in userspace, so it corrupts UTF-8 names. + Enabling this feature makes XFS vulnerable to mixed case sensitivity + attacks. Because of this, the feature is deprecated. All users + should upgrade by backing up their files, reformatting, and restoring + from the backup. + + Administrators and users can detect such a filesystem by running + xfs_info against a filesystem mountpoint and checking for a string + beginning with "ascii-ci=". If the string "ascii-ci=1" is found, the + filesystem is a case-insensitive filesystem. If no such string is + found, please upgrade xfsprogs to the latest version and try again. + + This option will become default N in September 2025. Support for the + feature will be removed entirely in September 2030. Distributors + can say N here to withdraw support earlier. + + To continue supporting case-insensitivity (ascii-ci=1), say Y. + To close off an attack surface, say N. + config XFS_QUOTA bool "XFS Quota support" depends on XFS_FS diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 4f814f9e12ab..4d2e87462ac4 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1548,6 +1548,19 @@ xfs_fs_fill_super( #endif } + /* ASCII case insensitivity is undergoing deprecation. */ + if (xfs_has_asciici(mp)) { +#ifdef CONFIG_XFS_SUPPORT_ASCII_CI + xfs_warn_once(mp, + "Deprecated ASCII case-insensitivity feature (ascii-ci=1) will not be supported after September 2030."); +#else + xfs_warn(mp, + "Deprecated ASCII case-insensitivity feature (ascii-ci=1) not supported by kernel."); + error = -EINVAL; + goto out_free_sb; +#endif + } + /* Filesystem claims it needs repair, so refuse the mount. */ if (xfs_has_needsrepair(mp)) { xfs_warn(mp, "Filesystem needs repair. Please run xfs_repair."); -- cgit v1.2.3 From aa88019851a85df80cb77f143758b13aee09e3d9 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 12 Apr 2023 15:48:50 +1000 Subject: xfs: don't consider future format versions valid In commit fe08cc504448 we reworked the valid superblock version checks. If it is a V5 filesystem, it is always valid; then we checked if the version was less than V4 (reject) and then checked feature fields in the V4 flags to determine if it was valid. What we missed was that if the version is not V4 at this point, we should reject the fs, i.e. the check currently treats V6+ filesystems as if they were v4 filesystems. Fix this. cc: stable@vger.kernel.org Fixes: fe08cc504448 ("xfs: open code sb verifier feature checks") Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Dave Chinner --- fs/xfs/libxfs/xfs_sb.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index 99cc03a298e2..ba0f17bc1dc0 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -72,7 +72,8 @@ xfs_sb_validate_v5_features( } /* - * We support all XFS versions newer than a v4 superblock with V2 directories. + * We currently support XFS v5 formats with known features and v4 superblocks with + * at least V2 directories.
*/ bool xfs_sb_good_version( @@ -86,16 +87,16 @@ xfs_sb_good_version( if (xfs_sb_is_v5(sbp)) return xfs_sb_validate_v5_features(sbp); + /* versions prior to v4 are not supported */ + if (XFS_SB_VERSION_NUM(sbp) != XFS_SB_VERSION_4) + return false; + /* We must not have any unknown v4 feature bits set */ if ((sbp->sb_versionnum & ~XFS_SB_VERSION_OKBITS) || ((sbp->sb_versionnum & XFS_SB_VERSION_MOREBITSBIT) && (sbp->sb_features2 & ~XFS_SB_VERSION2_OKBITS))) return false; - /* versions prior to v4 are not supported */ - if (XFS_SB_VERSION_NUM(sbp) < XFS_SB_VERSION_4) - return false; - /* V4 filesystems need v2 directories and unwritten extents */ if (!(sbp->sb_versionnum & XFS_SB_VERSION_DIRV2BIT)) return false; -- cgit v1.2.3 From 4b827b3f305d1fcf837265f1e12acc22ee84327c Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 12 Apr 2023 15:48:59 +1000 Subject: xfs: remove WARN when dquot cache insertion fails It just creates unnecessary bot noise these days. Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Dave Chinner --- fs/xfs/xfs_dquot.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 8fb90da89787..7f071757f278 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -798,7 +798,6 @@ xfs_qm_dqget_cache_insert( error = radix_tree_insert(tree, id, dqp); if (unlikely(error)) { /* Duplicate found! Caller must try again. */ - WARN_ON(error != -EEXIST); mutex_unlock(&qi->qi_tree_lock); trace_xfs_dqget_dup(dqp); return error; -- cgit v1.2.3 From c95356ca884885db702670e24933ee7f2b9f1754 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Wed, 12 Apr 2023 15:49:10 +1000 Subject: xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done While fuzzing the data fork extent count on a btree-format directory with xfs/375, I observed the following (excerpted) splat: XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] Call Trace: xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The cause of this is a race condition in xfs_ilock_data_map_shared, which performs an unlocked access to the data fork to guess which lock mode it needs: Thread 0 Thread 1 xfs_need_iread_extents xfs_ilock(..., ILOCK_EXCL) xfs_iread_extents xfs_need_iread_extents xfs_ilock(..., ILOCK_SHARED) xfs_iunlock(..., ILOCK_EXCL) xfs_iread_extents *BOOM* Fix this race by adding a flag to the xfs_ifork structure to indicate that we have not yet read in the extent records and changing the predicate to look at the flag state, not if_height. 
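[Editor's sketch, not part of the patch: a user-space analogue of the publish/consume ordering used here, with C11 atomics standing in for the kernel's smp_store_release()/smp_load_acquire(); the structure and names are illustrative, not the real xfs_ifork layout.]

#include <stdatomic.h>
#include <stdbool.h>

struct demo_fork {
	int mapping_tree;		/* stands in for the iext mapping tree */
	atomic_uchar needextents;	/* stands in for if_needextents */
};

/* Writer: fully build the mapping tree, then publish with release semantics. */
static void demo_iread_extents(struct demo_fork *ifp)
{
	ifp->mapping_tree = 42;
	atomic_store_explicit(&ifp->needextents, 0, memory_order_release);
}

/*
 * Reader: an acquire load that observes needextents == 0 is also guaranteed
 * to observe the fully built mapping tree, which closes the race shown in
 * the diagram above.
 */
static bool demo_need_iread_extents(struct demo_fork *ifp)
{
	return atomic_load_explicit(&ifp->needextents,
				    memory_order_acquire) != 0;
}

int main(void)
{
	struct demo_fork ifp = { .mapping_tree = 0, .needextents = 1 };

	demo_iread_extents(&ifp);
	return demo_need_iread_extents(&ifp);	/* 0: extents are in memory */
}

The same guarantee is what the kernel primitives provide in the patch below.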
The memory barrier ensures that the flag will not be set until the very end of the function. Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Signed-off-by: Dave Chinner --- fs/xfs/libxfs/xfs_bmap.c | 6 ++++++ fs/xfs/libxfs/xfs_inode_fork.c | 16 +++++++++++++++- fs/xfs/libxfs/xfs_inode_fork.h | 6 ++++-- 3 files changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 34de6e6898c4..f11ef331e5a4 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1171,6 +1171,12 @@ xfs_iread_extents( goto out; } ASSERT(ir.loaded == xfs_iext_count(ifp)); + /* + * Use release semantics so that we can use acquire semantics in + * xfs_need_iread_extents and be guaranteed to see a valid mapping tree + * after that load. + */ + smp_store_release(&ifp->if_needextents, 0); return 0; out: xfs_iext_destroy(ifp); diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index 6b21760184d9..1bbe5ea3f00b 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -226,10 +226,15 @@ xfs_iformat_data_fork( /* * Initialize the extent count early, as the per-format routines may - * depend on it. + * depend on it. Use release semantics to set needextents /after/ we + * set the format. This ensures that we can use acquire semantics on + * needextents in xfs_need_iread_extents() and be guaranteed to see a + * valid format value after that load. */ ip->i_df.if_format = dip->di_format; ip->i_df.if_nextents = xfs_dfork_data_extents(dip); + smp_store_release(&ip->i_df.if_needextents, + ip->i_df.if_format == XFS_DINODE_FMT_BTREE ? 1 : 0); switch (inode->i_mode & S_IFMT) { case S_IFIFO: @@ -282,8 +287,17 @@ xfs_ifork_init_attr( enum xfs_dinode_fmt format, xfs_extnum_t nextents) { + /* + * Initialize the extent count early, as the per-format routines may + * depend on it. Use release semantics to set needextents /after/ we + * set the format. This ensures that we can use acquire semantics on + * needextents in xfs_need_iread_extents() and be guaranteed to see a + * valid format value after that load. + */ ip->i_af.if_format = format; ip->i_af.if_nextents = nextents; + smp_store_release(&ip->i_af.if_needextents, + ip->i_af.if_format == XFS_DINODE_FMT_BTREE ? 1 : 0); } void diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index d3943d6ad0b9..96d307784c85 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -24,6 +24,7 @@ struct xfs_ifork { xfs_extnum_t if_nextents; /* # of extents in this fork */ short if_broot_bytes; /* bytes allocated for root */ int8_t if_format; /* format of this fork */ + uint8_t if_needextents; /* extents have not been read */ }; /* @@ -260,9 +261,10 @@ int xfs_iext_count_upgrade(struct xfs_trans *tp, struct xfs_inode *ip, uint nr_to_add); /* returns true if the fork has extents but they are not read in yet. */ -static inline bool xfs_need_iread_extents(struct xfs_ifork *ifp) +static inline bool xfs_need_iread_extents(const struct xfs_ifork *ifp) { - return ifp->if_format == XFS_DINODE_FMT_BTREE && ifp->if_height == 0; + /* see xfs_iformat_{data,attr}_fork() for needextents semantics */ + return smp_load_acquire(&ifp->if_needextents) != 0; } #endif /* __XFS_INODE_FORK_H__ */ -- cgit v1.2.3 From 22ed903eee23a5b174e240f1cdfa9acf393a5210 Mon Sep 17 00:00:00 2001 From: "Darrick J. 
Wong" Date: Wed, 12 Apr 2023 15:49:23 +1000 Subject: xfs: verify buffer contents when we skip log replay syzbot detected a crash during log recovery: XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791 XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200. XFS (loop0): Starting recovery (logdev: internal) ================================================================== BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813 Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074 CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:306 print_report+0x107/0x1f0 mm/kasan/report.c:417 kasan_report+0xcd/0x100 mm/kasan/report.c:517 xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813 xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913 xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713 xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953 xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946 xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930 xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493 xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829 xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933 xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666 get_tree_bdev+0x400/0x620 fs/super.c:1282 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f89fa3f4aca Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10 RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004 R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50 The fuzzed image contains an AGF with an obviously garbage agf_refcount_level value of 32, and a dirty log with a buffer log item for that AGF. The ondisk AGF has a higher LSN than the recovered log item. xlog_recover_buf_commit_pass2 reads the buffer, compares the LSNs, and decides to skip replay because the ondisk buffer appears to be newer. Unfortunately, the ondisk buffer is corrupt, but recovery just read the buffer with no buffer ops specified: error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno, buf_f->blf_len, buf_flags, &bp, NULL); Skipping the buffer leaves its contents in memory unverified. This sets us up for a kernel crash because xfs_refcount_recover_cow_leftovers reads the buffer (which is still around in XBF_DONE state, so no read verification) and creates a refcountbt cursor of height 32. 
This is impossible, so we run off the end of the cursor object and crash. Fix this by invoking the verifier on all skipped buffers and aborting log recovery if the ondisk buffer is corrupt. It might be smarter to force replay the log item atop the buffer and then see if it'll pass the write verifier (like ext4 does), but for now let's go with the conservative option where we stop immediately. Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e Signed-off-by: Darrick J. Wong Reviewed-by: Dave Chinner Signed-off-by: Dave Chinner --- fs/xfs/xfs_buf_item_recover.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index ffa94102094d..43167f543afc 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -943,6 +943,16 @@ xlog_recover_buf_commit_pass2( if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) { trace_xfs_log_recover_buf_skip(log, buf_f); xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN); + + /* + * We're skipping replay of this buffer log item due to the log + * item LSN being behind the ondisk buffer. Verify the buffer + * contents since we aren't going to run the write verifier. + */ + if (bp->b_ops) { + bp->b_ops->verify_read(bp); + error = bp->b_error; + } goto out_release; } -- cgit v1.2.3 From 8ee81ed581ff35882b006a5205100db0b57bf070 Mon Sep 17 00:00:00 2001 From: Ye Bin Date: Wed, 12 Apr 2023 15:49:44 +1000 Subject: xfs: fix BUG_ON in xfs_getbmap() There's an issue, as follows: XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:102! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422 RIP: 0010:assfail+0x96/0xa0 RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000 RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002 RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000 R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0 R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c FS: 00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: xfs_getbmap+0x1a5b/0x1e40 xfs_ioc_getbmap+0x1fd/0x5b0 xfs_file_ioctl+0x2cb/0x1d50 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The above issue may happen as follows: ThreadA ThreadB do_shared_fault __do_fault xfs_filemap_fault __xfs_filemap_fault filemap_fault xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag xfs_getbmap xfs_ilock(ip, XFS_IOLOCK_SHARED); filemap_write_and_wait do_page_mkwrite xfs_filemap_page_mkwrite __xfs_filemap_fault xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); iomap_page_mkwrite ... xfs_buffered_write_iomap_begin xfs_bmapi_reserve_delalloc -> Allocate delay extent xfs_ilock_data_map_shared(ip) xfs_getbmap_report_one ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0) -> trigger BUG_ON As xfs_filemap_page_mkwrite() only holds the XFS_MMAPLOCK_SHARED lock, there's a small window in which mkwrite can produce a delalloc extent after the filemap_write_and_wait() call in xfs_getbmap().
To solve the above issue, just skip delalloc extents. Signed-off-by: Ye Bin Reviewed-by: Darrick J. Wong Reviewed-by: Dave Chinner Signed-off-by: Dave Chinner --- fs/xfs/xfs_bmap_util.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index a09dd2606479..f032d3a4b727 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -314,15 +314,13 @@ xfs_getbmap_report_one( if (isnullstartblock(got->br_startblock) || got->br_startblock == DELAYSTARTBLOCK) { /* - * Delalloc extents that start beyond EOF can occur due to - * speculative EOF allocation when the delalloc extent is larger - * than the largest freespace extent at conversion time. These - * extents cannot be converted by data writeback, so can exist - * here even if we are not supposed to be finding delalloc - * extents. + * Take the flush completion as being a point-in-time snapshot + * where there are no delalloc extents, and if any new ones + * have been created racily, just skip them as being 'after' + * the flush and so don't get reported. */ - if (got->br_startoff < XFS_B_TO_FSB(ip->i_mount, XFS_ISIZE(ip))) - ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0); + if (!(bmv->bmv_iflags & BMV_IF_DELALLOC)) + return 0; p->bmv_oflags |= BMV_OF_DELALLOC; p->bmv_block = -2; -- cgit v1.2.3 From 422d56536fd2ebac254e469e196c64cf53cd8a07 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Thu, 20 Apr 2023 08:18:34 +1000 Subject: xfs: fix duplicate includes Header files were already included, just not in the normal order. Remove the duplicates, preserving normal order. Also move the xfs_ag.h include to before the scrub internal includes, which are normally last in the include list. Fixes: d5c88131dbf0 ("xfs: allow queued AG intents to drain before scrubbing") Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Dave Chinner --- fs/xfs/scrub/refcount.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index ed5eb367ce49..304ea1e1bfb0 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -9,6 +9,7 @@ #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" +#include "xfs_ag.h" #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" @@ -16,9 +17,6 @@ #include "scrub/common.h" #include "scrub/btree.h" #include "scrub/trace.h" -#include "xfs_trans_resv.h" -#include "xfs_mount.h" -#include "xfs_ag.h" /* * Set us up to scrub reference count btrees. -- cgit v1.2.3 From 71deb8a5658c592ccad5ededb2ceffef6fcbba5f Mon Sep 17 00:00:00 2001 From: Bagas Sanjaya Date: Thu, 20 Apr 2023 08:18:36 +1000 Subject: xfs: Extend table marker on deprecated mount options table Sphinx reports htmldocs warning on deprecated mount options table: /home/bagas/repo/linux-kernel/Documentation/admin-guide/xfs.rst:243: WARNING: Malformed table. Text in column margin in table line 5. =========================== ================ Name Removal Schedule =========================== ================ Mounting with V4 filesystem September 2030 Mounting ascii-ci filesystem September 2030 ikeep/noikeep September 2025 attr2/noattr2 September 2025 =========================== ================ Extend the table markers to take account of the second name entry ("Mounting ascii-ci filesystem"), which is now the widest, and so fix the above warning. Fixes: 7ba83850ca2691 ("xfs: deprecate the ascii-ci feature") Signed-off-by: Bagas Sanjaya Reviewed-by: Darrick J. Wong
Signed-off-by: Dave Chinner --- Documentation/admin-guide/xfs.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index e85a9404d5c0..3a9c041d7f6c 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -236,14 +236,14 @@ the dates listed above. Deprecated Mount Options ======================== -=========================== ================ +============================ ================ Name Removal Schedule -=========================== ================ +============================ ================ Mounting with V4 filesystem September 2030 Mounting ascii-ci filesystem September 2030 ikeep/noikeep September 2025 attr2/noattr2 September 2025 -=========================== ================ +============================ ================ Removed Mount Options -- cgit v1.2.3 From 9419092fb2630c30e4ffeb9ef61007ef0c61827a Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Thu, 27 Apr 2023 09:02:11 +1000 Subject: xfs: fix livelock in delayed allocation at ENOSPC On a filesystem with a non-zero stripe unit and a large sequential write, delayed allocation will set a minimum allocation length of the stripe unit. If allocation fails because there are no extents long enough for an aligned minlen allocation, it is supposed to fall back to unaligned allocation, which allows single block extents to be allocated. When the allocator code was rewritten in the 6.3 cycle, this fallback was broken - the old code used args->fsbno as both the allocation target and the allocation result; the new code passes the target as a separate parameter. The conversion didn't handle the aligned->unaligned fallback path correctly - it reset args->fsbno to the target fsbno on failure, which broke allocation failure detection in the high level code and so it never fell back to unaligned allocations. This resulted in a loop in writeback trying to allocate an aligned block, getting a false positive success, and then trying to insert the result in the BMBT. This did nothing because the extent was already in the BMBT (the merge results in an unchanged extent) and so it returned the prior extent to the conversion code as the current iomap. Because the iomap returned didn't cover the offset we tried to map, xfs_convert_blocks() then retries the allocation, which fails in the same way and now we have a livelock. Reported-and-tested-by: Brian Foster Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_bmap.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 1a4e446194dd..b512de0540d5 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( * original non-aligned state so the caller can proceed on allocation * failure as if this function was never called. */ - args->fsbno = ap->blkno; args->alignment = 1; return 0; } -- cgit v1.2.3
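[Editor's sketch of the failure-detection breakage described above: a minimal user-space model with illustrative names, not the real xfs_alloc_arg structure or the exact xfs_bmap_btalloc_at_eof() logic. Callers detect allocation failure by seeing NULLFSBLOCK in args->fsbno, so resetting fsbno to the target on the aligned-allocation failure path made the failure look like a success.]

#include <stdint.h>
#include <stdio.h>

#define NULLFSBLOCK	((uint64_t)-1)	/* "no block allocated" marker */

struct demo_args {
	uint64_t fsbno;		/* out: allocated block, or NULLFSBLOCK */
	uint32_t alignment;
};

/* An aligned minlen allocation at ENOSPC: no extent is long enough. */
static void aligned_alloc_fails(struct demo_args *args)
{
	args->fsbno = NULLFSBLOCK;
}

int main(void)
{
	struct demo_args args = { .fsbno = NULLFSBLOCK, .alignment = 8 };
	uint64_t target = 12345;	/* the requested allocation target */

	aligned_alloc_fails(&args);

	/* The buggy fallback preparation, before the one-line fix above: */
	args.fsbno = target;	/* overwrites the NULLFSBLOCK failure marker */
	args.alignment = 1;

	if (args.fsbno != NULLFSBLOCK)
		printf("caller sees 'success' at block %llu and never retries unaligned: livelock\n",
		       (unsigned long long)args.fsbno);
	return 0;
}

With the reset removed, args.fsbno stays at NULLFSBLOCK after the failed aligned attempt, the high level code detects the failure, and the allocation is retried with alignment 1 as intended.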