path: root/fs
2017-07-24  Merge remote-tracking branch 'aio/master'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'userns/for-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'char-misc/char-misc-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'gfs2/for-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'dlm/next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'vfs/for-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'file-locks/linux-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'jfs/jfs-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'fuse/for-next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'ecryptfs/next'  (Stephen Rothwell)
2017-07-24  Merge remote-tracking branch 'btrfs-kdave/for-next'  (Stephen Rothwell)
2017-07-22  fs: convert sync_file_range to use errseq_t based error-tracking  (Jeff Layton)
sync_file_range doesn't call down into the filesystem directly at all. It only kicks off writeback of pagecache pages and optionally waits on the result. Convert sync_file_range to use errseq_t based error tracking, under the assumption that most users will prefer this behavior when errors occur.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
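For context, a minimal sketch of the errseq_t reporting pattern callers are converted to (it mirrors what file_check_and_advance_wb_err() does; the wrapper name below is made up for illustration):

    #include <linux/errseq.h>
    #include <linux/fs.h>

    /* Report a writeback error only if one was recorded since this file
     * last checked, then advance the file's cursor so the same error is
     * not reported twice through the same struct file. */
    static int report_new_wb_error(struct file *file)
    {
            struct address_space *mapping = file->f_mapping;

            return errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);
    }

Because each struct file keeps its own f_wb_err cursor, two openers of the same file both see an error that happened after they opened it, which is the behavior the conversion is after.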
2017-07-21  Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs  (Linus Torvalds)
Pull NFS client bugfixes from Anna Schumaker:

 "Stable bugfix:
   - Fix error reporting regression

  Bugfixes:
   - Fix setting filelayout ds address race
   - Fix subtle access bug when using ACLs
   - Fix setting mnt3_counts array size
   - Fix a couple of pNFS commit races"

* tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
  NFS: Be more careful about mapping file permissions
  NFS: Store the raw NFS access mask in the inode's access cache
  NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
  NFS: Refactor NFS access to kernel access mask calculation
  net/sunrpc/xprt_sock: fix regression in connection error reporting.
  nfs: count correct array for mnt3_counts array size
  Revert commit 722f0b891198 ("pNFS: Don't send COMMITs to the DSes if...")
  pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
  NFS: Fix another COMMIT race in pNFS
  NFS: Fix a COMMIT race in pNFS
  mount: copy the port field into the cloned nfs_server structure.
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
2017-07-21  Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs  (Linus Torvalds)
Pull overlayfs fixes from Miklos Szeredi:

 "This fixes a crash with SELinux and several other old and new bugs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: check for bad and whiteout index on lookup
  ovl: do not cleanup directory and whiteout index entries
  ovl: fix xattr get and set with selinux
  ovl: remove unneeded check for IS_ERR()
  ovl: fix origin verification of index dir
  ovl: mark parent impure on ovl_link()
  ovl: fix random return value on mount
2017-07-21  NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()  (Trond Myklebust)
We must set fl->dsaddr once, and once only, even if there are multiple processes calling filelayout_check_deviceid() for the same layout segment.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
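The general shape of a set-once fix for a race like this can be sketched with cmpxchg(); this illustrates the idea rather than the literal patch, and the helper name plus the use of nfs4_fl_put_deviceid() to drop the loser's reference are assumptions:

    /* Publish dsaddr exactly once; a caller that loses the race drops
     * its own copy and keeps using whatever the winner installed. */
    static void fl_set_dsaddr_once(struct nfs4_filelayout_segment *fl,
                                   struct nfs4_file_layout_dsaddr *dsaddr)
    {
            if (cmpxchg(&fl->dsaddr, NULL, dsaddr) != NULL)
                    nfs4_fl_put_deviceid(dsaddr);   /* lost the race */
    }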
2017-07-21  locks: restore a warn for leaked locks on close  (Benjamin Coddington)
When locks.c moved to using file_lock_context, the check for any locks that were not released was moved from the __fput() to the destroy_inode() path in commit 8634b51f6ca2 ("locks: convert lease handling to file_lock_context"). This warning has been quite useful for catching bugs, particularly in NFS where lock handling still sees some churn.

Let's bring back the warning for leaked locks on __fput, as this warning is much more likely to be seen and reported by users.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-21  Merge branch 'for-next-next-v4.14-20170721' into for-next-20170721  (David Sterba)
2017-07-21  Merge branch 'compression' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'fix/wait-io' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'cleanup/dev-helpers' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'cleanup/arg-reduction' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'dev/gfp-alloc' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/hans/ssd-no-spread' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/jeffm/lockup-find-free-ro-bg' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/omar/shrink-delalloc-no-overcommit' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/jeffm/pinned-bytes-should-alloc-chunk' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/ed/reftree' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'ext/timofey/heuristic' into compression  (David Sterba)
# Conflicts:
#	fs/btrfs/inode.c
2017-07-21  Merge branch 'misc-next' into for-next-next-v4.14-20170721  (David Sterba)
2017-07-21  Merge branch 'dev/compr-type-prop-defrag' into compression  (David Sterba)
2017-07-21  Merge branch 'dev/comp-level-prep' into compression  (David Sterba)
2017-07-21  Merge branch 'misc-4.13' into for-next-current-v4.13-20170721  (David Sterba)
2017-07-21  Merge branch 'for-4.13-part2' into for-next-current-v4.13-20170721  (David Sterba)
2017-07-21  Merge branch 'for-4.13-part1' into for-next-current-v4.13-20170721  (David Sterba)
2017-07-21  Btrfs: Do not use data_alloc_cluster in ssd mode  (Hans van Kranenburg)
In the first year of btrfs development, around early 2008, btrfs gained a mount option which enables specific functionality for filesystems on solid state devices. The first occurrence of this functionality is in commit e18e4809, labeled "Add mount -o ssd, which includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to 'cluster' writes together, writing them out in contiguous space, as opposed to a 'tetris' way of putting all separate writes into any free space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when, for example, the 'cluster' size is set to 2MiB and we do some writes, the data allocator will search for a free space block that is 2MiB big, and put the writes in there. The ssd mode itself might allow a 2MiB cluster to be composed of multiple free space extents with some existing data in between, while the additional ssd_spread mount option kills off this option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal rwm cycle."; ssd block meaning nand erase block. So, effectively this means applying a "locality based algorithm" and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the basic idea is still present, and gets activated whenever the ssd mount option is active. This also happens by default, when the rotational flag as seen at /sys/block/<device>/queue/rotational is set to 0.

However, there are a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by assuming there is a relation between the physical address space of the block device as seen by btrfs and the actual physical storage of the ssd, and then adjusting data placement. However, since the introduction of the Flash Translation Layer (FTL), which is a part of the internal controller of an ssd, these attempts are futile. The use of good quality FTL in consumer ssd products might have been limited in 2008, but this situation has changed drastically soon after that time. Today, even the flash memory in your automatic cat feeding machine or your grandma's wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being filled up with badly fragmented free space extents because of relatively small pieces of space that are freed up by deletes, but not selected again as part of a 'cluster'. Since the algorithm prefers allocating a new chunk over going back to tetris mode, the end result is a filesystem in which all raw space is allocated, but which is composed of underutilized chunks with a 'shotgun blast' pattern of fragmented free space. Usually, the next problematic thing that happens is the filesystem wanting to allocate new space for metadata, which causes the filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled, 'discard' not enabled), in combination with spreading out writes over the full address space and ignoring freed up space, lead to worst case behaviour in providing information to the ssd itself, since it will never learn that all the free space left behind is actually free. There are two ways to let an ssd know previously written data does not have to be preserved, which are sending explicit signals using discard or fstrim, or by simply overwriting the space with new data. The worst case behaviour is the btrfs ssd_spread mount option in combination with not having discard enabled. It has a side effect of minimizing the reuse of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the device is a locally attached ssd. For example, iSCSI or NBD displays as non-rotational, while a loop device on an ssd shows up as rotational.

The combination of the second and third problem effectively means that despite all the good intentions, the btrfs ssd mode reliably causes the ssd hardware and the filesystem structures and performance to be choked to death. The clickbait version of the title of this story would have been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a pattern of overwriting much more previously used space, causing many more implicit discards to happen because of the overwrite information the ssd gets. The actual location in the physical address space, as seen from the point of view of btrfs, is irrelevant, because the actual writes to the low level flash are reordered anyway thanks to the FTL.

So what now...? The changes in here do the following:
1. Throw out the current ssd_spread behaviour.
2. Move the current ssd behaviour to the ssd_spread option.
3. Make ssd mode data allocation identical to tetris mode, like nossd.
4. Adjust and clean up filesystem mount messages so that we can easily identify if a kernel has this patch applied or not, when providing support to end users.

Instead of directly cutting out all code related to the data cluster, it makes sense to take a gradual approach and allow users who are still able to find a valid reason to prefer the current ssd mode the means to do so by specifying the additional ssd_spread option.

Since there are other uses of the ssd mode, we keep the difference between nossd and ssd mode. However, the usage of the rotational attribute warrants some reconsideration in the future.

Notes for whoever wants to backport this patch on their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So, for example, instead of using fs_info, it's root->fs_info in extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
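A rough sketch of the resulting policy in allocator terms (the macro and option names are from btrfs, but the actual hunks are simplified here, not quoted): the data cluster is only used when ssd_spread was requested explicitly, so plain -o ssd places data the same way nossd does.

    /* Only use the free space cluster for data allocation when the user
     * explicitly asked for ssd_spread; -o ssd alone now behaves like
     * nossd for data placement. */
    static bool use_data_cluster(struct btrfs_fs_info *fs_info)
    {
            return btrfs_test_opt(fs_info, SSD_SPREAD);
    }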
2017-07-21  btrfs: fix lockup in find_free_extent with read-only block groups  (Jeff Mahoney)
If we have a block group that is all of the following:
1) uncached in memory
2) is read-only
3) has a disk cache state that indicates we need to recreate the cache

AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored;

AND we have a single CPU core;

AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU;

we can end up with a lockup.

The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This *should* present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check.

During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups.

Then we move to LOOP_CACHING_WAIT state. During this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever.

There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again.

Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
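The shape of the two fixes, heavily condensed (the struct and list names are from the btrfs allocator, but this is a sketch rather than the verbatim diff):

    /* Condensed view of the block group search loop in find_free_extent(). */
    static int search_block_groups_sketch(struct btrfs_space_info *space_info,
                                          int index)
    {
            struct btrfs_block_group_cache *block_group;

            list_for_each_entry(block_group, &space_info->block_groups[index],
                                list) {
                    /* Fix 2: a read-only group can never satisfy the
                     * allocation, so skip it before any caching work. */
                    if (block_group->ro)
                            continue;

                    /* ... start/poll caching, search its free space ... */

                    /* Fix 1: let the caching kthread make progress even on
                     * a single CPU, instead of busy-looping here. */
                    cond_resched();
            }
            return -ENOSPC;
    }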
2017-07-21  Btrfs: fix early ENOSPC due to delalloc  (Omar Sandoval)
If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
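In essence the change looks like the sketch below (flushing internals are elided and flush_some_delalloc() is a hypothetical stand-in): reclaim keeps going until the requested amount is covered, instead of bailing out as soon as overcommit looks possible.

    /* Keep flushing delalloc until the tickets' reservation is fully
     * covered; the old "stop early if we could overcommit" shortcut is
     * gone, because tickets need every byte accounted for. */
    static void shrink_delalloc_sketch(u64 to_reclaim)
    {
            u64 reclaimed = 0;
            int loops = 0;

            while (reclaimed < to_reclaim && loops++ < 16)
                    reclaimed += flush_some_delalloc();  /* hypothetical helper */
    }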
2017-07-21  btrfs: Remove never reached error handling code in __add_reloc_root  (Nikolay Borisov)
One of the error handling paths in __add_reloc_root contains btrfs_panic() followed by some other code. As the name implies, what it does is print an error message and call BUG(); naturally, what follows afterwards is never invoked. So remove this extra code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
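For illustration only (the error string is abbreviated and the surrounding context is a sketch, not the exact diff): after the cleanup, nothing may follow the panic call inside the error branch.

    /* __add_reloc_root(), after the fix (sketch): btrfs_panic() prints the
     * error and calls BUG(), so no statement is placed after it. */
    if (rb_node)
            btrfs_panic(fs_info, -EEXIST,
                        "Duplicate root found for start=%llu",
                        node->bytenr);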
2017-07-21  btrfs: Remove unused parameters from volume.c functions  (Nikolay Borisov)
This also adjusts the respective callers in other files. Those were found with -Wunused-parameter.

btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3abeb ("Btrfs: RAID5 and RAID6") but it was never really used even in that commit

btrfs_is_parity_mirror's mirror_num - same as above

chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c4b ("Btrfs: devid subset filter") and never used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-21  btrfs: Remove unused variables  (Nikolay Borisov)
clear_super - usage was removed in commit cea67ab92d3d ("btrfs: clean the old superblocks before freeing the device") but that change forgot to remove the actual variable.

max_key - commit 6174d3cb43aa ("Btrfs: remove unused max_key arg from btrfs_search_forward") removed the max_key parameter but it forgot to remove references from callers.

stripe_len - this one was added by e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") but even then it wasn't used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-21  NFS: Be more careful about mapping file permissions  (Trond Myklebust)
When mapping a directory, we want the MAY_WRITE permissions to reflect whether or not we have permission to modify, add and delete the directory entries. MAY_EXEC must map to lookup permissions.

On the other hand, for files, we want MAY_WRITE to reflect a permission to modify and extend the file.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
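A sketch of the resulting mapping (the NFS_ACCESS_* constant names and the helper below are illustrative, not quoted from the patch): directories turn MAY_WRITE into the modify/extend/delete entry permissions and MAY_EXEC into lookup, while regular files turn MAY_WRITE into modify/extend and MAY_EXEC into execute.

    static u32 nfs_access_mask_for(umode_t mode, int may_flags)
    {
            u32 mask = NFS_ACCESS_READ;

            if (S_ISDIR(mode)) {
                    if (may_flags & MAY_WRITE)
                            mask |= NFS_ACCESS_MODIFY | NFS_ACCESS_EXTEND |
                                    NFS_ACCESS_DELETE;
                    if (may_flags & MAY_EXEC)
                            mask |= NFS_ACCESS_LOOKUP;
            } else {
                    if (may_flags & MAY_WRITE)
                            mask |= NFS_ACCESS_MODIFY | NFS_ACCESS_EXTEND;
                    if (may_flags & MAY_EXEC)
                            mask |= NFS_ACCESS_EXECUTE;
            }
            return mask;
    }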
2017-07-21  NFS: Store the raw NFS access mask in the inode's access cache  (Trond Myklebust)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21  NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()  (Trond Myklebust)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21  NFS: Refactor NFS access to kernel access mask calculation  (Trond Myklebust)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21  GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode  (Bob Peterson)
This patch adds some calls to clear gl_object in function gfs2_delete_inode. Since we are deleting the inode, and the glock typically outlives the inode in core, we must clear gl_object so subsequent use of the glock (e.g. for a new inode in its place) will not have the old pointer sitting there.

In error cases we need to tidy up after ourselves. In non-error cases, we need to clear gl_object before we set the block free in the bitmap so residuals aren't left for potential inode creators.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21  GFS2: Clear gl_object if gfs2_create_inode fails  (Bob Peterson)
If function gfs2_create_inode fails after the inode has been created (for example, if the inode_refresh fails for some reason) the function was setting gl_object but never clearing it again. The glocks are left pointing to a freed inode. This patch adds the calls to clear gl_object in the appropriate error paths.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21  GFS2: Set gl_object in inode lookup only after block type check  (Bob Peterson)
Before this patch, the inode glock's gl_object was set after a reference was acquired, but before the block type was verified. In cases where the block was unlinked, then freed and reused on another node, a residual delete callback (delete_work) would try to look up the inode, eventually failing the block check, but only after it overwrites gl_object with a pointer to the wrong inode. This patch moves the assignment of gl_object after the block check so it won't be improperly overwritten.

Likewise, at the end of the function, gfs2_inode_lookup was clearing gl_object after it unlocked the glock, which meant another process might free the glock in the meantime. This patch guards against that case.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21  GFS2: Introduce helper for clearing gl_object  (Bob Peterson)
This patch introduces a new helper function in glock.h that clears gl_object, with an added integrity check. An additional integrity check has been added to glock_set_object, plus comments. This is step 1 in a series to ensure gl_object integrity.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
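A simplified sketch of the helper pair in glock.h (the integrity checks are reduced to a bare WARN here): setting gl_object complains if something else is already attached, and clearing only resets the pointer if the glock still refers to the object being torn down.

    static inline void glock_set_object(struct gfs2_glock *gl, void *object)
    {
            spin_lock(&gl->gl_lockref.lock);
            WARN_ON_ONCE(gl->gl_object != NULL); /* integrity check, simplified */
            gl->gl_object = object;
            spin_unlock(&gl->gl_lockref.lock);
    }

    static inline void glock_clear_object(struct gfs2_glock *gl, void *object)
    {
            spin_lock(&gl->gl_lockref.lock);
            if (gl->gl_object == object)
                    gl->gl_object = NULL;
            spin_unlock(&gl->gl_lockref.lock);
    }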
2017-07-21  nfs: count correct array for mnt3_counts array size  (Eryu Guan)
Array size of mnt3_counts should be the size of array mnt3_procedures, not mnt_procedures, though they're the same size right now. Found this by code inspection.
Fixes: 1c5876ddbdb4 ("sunrpc: move p_count out of struct rpc_procinfo")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
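In spirit the fix is a one-liner: size the v3 counter array from the v3 procedure table it actually counts, roughly as below.

    /* Counters for the MOUNTv3 procedures, one slot per entry in
     * mnt3_procedures (previously sized from mnt_procedures). */
    static unsigned int mnt3_counts[ARRAY_SIZE(mnt3_procedures)];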
2017-07-21  gfs2: add flag REQ_PRIO for metadata I/O  (Coly Li)
When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of the bio. But the REQ_META flag is just a hint for block trace, not something that makes block layer code handle a bio as a metadata request. For some of gfs2's metadata I/Os, a REQ_PRIO flag on the metadata bio would be very informative to block layer code.

For example, if bcache is used as an I/O cache for gfs2, it will be possible for bcache code to get the hint and cache the pre-fetched metadata blocks on the cache device. This behavior may help improve metadata I/O performance if the following requests hit the cache.

Here are the locations in gfs2 code where a REQ_PRIO flag should be added:
- All places where REQ_READAHEAD is used; gfs2 code uses this flag for metadata read ahead.
- In gfs2_meta_rq() where the first metadata block is read in.
- In gfs2_write_buf_to_page(), read in quota metadata blocks to have them up to date. These metadata blocks are probably to be accessed again in future; adding a REQ_PRIO flag may help bcache keep such metadata in the fast cache device.

For a system without a cache layer, REQ_PRIO can still provide a hint to the block layer to handle metadata requests more properly.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
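For illustration, the flag combination on a metadata read-ahead bio would look roughly like this; the submission helper below is a simplified stand-in, not a function that exists in gfs2 under this name.

    /* Mark a metadata read-ahead bio so that both blktrace and any caching
     * layer underneath (e.g. bcache) can recognize and prioritize it. */
    static void submit_meta_readahead(struct bio *bio)
    {
            bio->bi_opf = REQ_OP_READ | REQ_RAHEAD | REQ_META | REQ_PRIO;
            submit_bio(bio);
    }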